Designing LLM agent tools for due diligence in financial instruments

Mihail Dungarov
CFA, Product Management Lead, Text Analytics
Iulian Giusca
Senior Analyst, Product Development
Bogdan-Mihai Elefteriu
Data Scientist, Product


At London Stock Exchange Group (LSEG), our mission is to optimise our clients’ efficiency throughout the entire trade lifecycle. The complexity of securitisation documents, with their intricate legal details and term specifications, can often make them seem overwhelming. Investors, traders and salespeople must meticulously analyse various aspects of a security, including its overall structure, individual loan mechanics and seniority structures, as part of their due diligence. Similarly, equity structured notes require a precise understanding of nuances in term definitions, which vary across issuers and in their use of lexicon. While these documents are shorter, customers need to identify, quickly and at scale, the mechanics of guarantees/protection, pay-out formulas, governing laws, etc. The primary tool at the investor’s disposal is PDF keyword search, which can be time-consuming and inefficient at locating precise answers and all the relevant context.

Large Language Models

LLMs are well suited to tackle this challenge, offering a natural language interface capable of delivering contextually relevant responses. However, the obstacle is that LLMs alone cannot reliably “learn” specific deal documentation through fine-tuning, and the resulting answers can easily be “hallucinated”. A prevalent solution to this problem is a Retrieval Augmented Generation (RAG) system. Such a system combines efficient document storage and retrieval, using vector databases to select relevant text snippets. An LLM, guided by prompt engineering, then generates an accurate answer to the user query from the retrieved snippets.
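The retrieve-then-generate flow above can be sketched in a few lines. This is a minimal illustration only: it substitutes a toy bag-of-words similarity for a learned embedding model and a vector database, and it stops at prompt construction rather than calling a real LLM. The example snippets and the prompt wording are invented for the sketch.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a production system would use a
    # learned embedding model and a vector database for storage/search.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the top-k document chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Prompt-engineering step: instruct the model to answer only from
    # the retrieved snippets, reducing the risk of hallucination.
    joined = "\n---\n".join(context)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )

# Invented example snippets standing in for chunks of a prospectus.
chunks = [
    "The notes are governed by English law.",
    "The aggregate notional amount of the issue is EUR 500,000,000.",
    "Interest is payable quarterly in arrear.",
]
context = retrieve("What is the notional amount?", chunks, k=2)
prompt = build_prompt("What is the notional amount?", context)
```

The prompt would then be sent to the LLM; only the top-k snippets travel with it, which is what keeps cost and latency down.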

To ensure scalability, it is crucial to maintain both repeatability and precision within these experiments. While the RAG method has been extensively researched for a variety of general use cases, it merits further investigation in deep, domain-specific contexts, particularly in finance. Consequently, the objective of this paper is to identify the optimal setup of ML systems for such use cases. We approach this in the following ways:

  • Identifying the right metrics by evaluating ourselves against the right questions.
  • Considering the trade-offs between long-context LLMs and a RAG solution for our use case (i.e. by analysing the recently released 128k-context GPT-4 from OpenAI).
  • Finding the optimal setup of such a system by individually analysing the following components: vector database similarity search, LLM context comprehension and the quality of the LLM-generated answers.
  • Identifying further components needed for an optimal system setup, such as UI & UX components, LLM approaches, etc.
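The long-context versus RAG trade-off in the second bullet can be made concrete with a back-of-the-envelope token count. All figures below are illustrative assumptions for the sketch, not measurements from the paper:

```python
# Illustrative comparison of prompt sizes: sending a whole document to a
# long-context model vs sending only the top-k retrieved snippets (RAG).
# All numbers are assumptions for this sketch, not measured values.

DOC_TOKENS = 100_000        # assumed size of a large securitisation prospectus
SNIPPET_TOKENS = 400        # assumed size of one retrieved chunk
TOP_K = 5                   # snippets per query (see Experiment 1 below)
QUESTION_TOKENS = 50        # assumed size of the user query

long_context_prompt = DOC_TOKENS + QUESTION_TOKENS
rag_prompt = TOP_K * SNIPPET_TOKENS + QUESTION_TOKENS

reduction = 1 - rag_prompt / long_context_prompt
print(f"long-context prompt: {long_context_prompt:,} tokens")
print(f"RAG prompt:          {rag_prompt:,} tokens")
print(f"input reduction:     {reduction:.0%}")
```

Under these assumptions the RAG prompt is roughly 98% smaller, which is the efficiency argument behind restricting the context to a handful of retrieved snippets.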

Model evaluation and results

To evaluate the model’s capabilities, subject-matter experts (SMEs) selected a set of high-value questions for the investment due diligence process. These questions target key features of the security, such as the offered assets and their principal allocation/nominal value, the identity of the relevant entities, geographical spread and more. In addition to focusing on the main details from the provided documentation, these questions were designed to test a range of language comprehension challenges for the LLMs, including understanding names, dates, locations, lists and tables. This diverse questioning aims to reveal the model’s strengths and limitations.
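A question set like this is easiest to evaluate when each item records its expected answer and answer type up front. The structure and the example questions below are hypothetical, written in the spirit of the SME set rather than reproduced from it; the split into "value" and "textual" answers anticipates Experiment 3:

```python
from dataclasses import dataclass

@dataclass
class EvalQuestion:
    question: str
    answer_type: str        # "value" or "textual"
    expected: str           # SME-provided gold answer

# Hypothetical examples; the actual SME questions and documents
# are not reproduced here.
QUESTIONS = [
    EvalQuestion("What is the aggregate notional amount?",
                 "value", "EUR 500,000,000"),
    EvalQuestion("Which law governs the notes?",
                 "value", "English law"),
    EvalQuestion("How is principal allocated across the note classes?",
                 "textual", "Principal is allocated sequentially, senior classes first."),
]

value_share = sum(q.answer_type == "value" for q in QUESTIONS) / len(QUESTIONS)
```

Keeping gold answers alongside the questions makes the later experiments repeatable: the same set can be re-run against any retriever or model configuration.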

We have divided our experimentation into the three primary components of a functional RAG tool:

  • Experiment 1: similarity search – we aim to identify sections of the document containing relevant information for answering our query. We discovered that typically up to five search results are sufficient to build a representative context for the model. This approach has an efficiency component as it reduces the volume of information sent to the LLM, thus reducing operational costs and system latency.
  • Experiment 2: context comprehension – we evaluate the LLM’s ability to correctly identify supporting evidence within the text snippets returned from the similarity search. In some cases, we may find it useful to return a direct quotation from the source document or reinforce an LLM-generated answer with the original text. For these cases, it will be sufficient for the model simply to identify the correct supporting text. On average, the model accurately identifies the text snippet containing the answer 76% of the time and effectively disregards paragraphs lacking relevant information to the user’s query 91% of the time.
  • Experiment 3: answer quality – we analyse responses for queries with two distinct purposes: value extraction (where the answer is a specific value, e.g., notional amount, date, issue size, etc.) and textual answers (where the answer is contained within a sentence or a paragraph). For both tasks we compare the performance of the GPT-3.5 and GPT-4 models, with the latter consistently demonstrating superior results. For value extraction tasks, GPT-4’s accuracy ranges between 75% and 100%, whereas for textual information extraction, the quality of the generated answers ranges between 89% and 96%, depending on the complexity of the task. The 128k context window tends to perform here largely on par with, or slightly worse than, the traditional shorter window.
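The two rates reported in Experiment 2 (76% of relevant snippets identified, 91% of irrelevant ones disregarded) can be computed with a simple scorer over gold relevance labels. The scorer below is a sketch; the labelled data in the example is invented:

```python
def comprehension_rates(gold: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (hit_rate, rejection_rate): the share of truly relevant
    snippets the model flags as containing the answer, and the share of
    irrelevant snippets it correctly disregards."""
    hits = sum(g and p for g, p in zip(gold, predicted))
    relevant = sum(gold)
    rejections = sum((not g) and (not p) for g, p in zip(gold, predicted))
    irrelevant = len(gold) - relevant
    hit_rate = hits / relevant if relevant else 0.0
    rejection_rate = rejections / irrelevant if irrelevant else 0.0
    return hit_rate, rejection_rate

# Invented labels: gold says snippets 0 and 2 contain the answer;
# the model flags snippets 0 and 4.
gold = [True, False, True, False, False]
pred = [True, False, False, False, True]
hit, rej = comprehension_rates(gold, pred)
```

Aggregating these two rates over the full SME question set yields figures directly comparable to the 76%/91% reported above.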


In this research, we have analysed the impact of different designs and setups on a Retrieval Augmented Generation (RAG) system for performing investment due diligence on documentation relating to different financial instruments. Such a system will likely be an integral reasoning component of LLM agents’ design and the overall AI-powered experience for our customers. Current experimentation shows promising results both in identifying the right context and in extracting the relevant information. This in turn suggests that the RAG system is a viable tool for an LLM conversational agent to access when the user needs to extract specific deal definitions from extensive financial documentation. In conclusion, the results of these investigations give us a solid foundation to inform the future design of LLM question-answering tools. However, we recognise that effective retrieval and generation are just one part of the design of a fully integrated conversational flow. LLM agents will likely use a set of such tools to understand and contextualise a range of customer needs, and the right UX methods will play a crucial part in producing a timely and informative financial due diligence experience for our customers.

For a detailed view of the analysis, architecture, implementation and results of this research, please download the full PDF.
