1. Introduction
The emergence of large language models (LLMs), such as GPT-3 [1], has sparked a wide range of innovations, powering intelligent chatbots, personal assistants, and other natural language processing (NLP) applications such as ChatGPT and Copilot. LLMs have also gained increasing popularity as a tool for information seeking and question answering (QA). LangChain [2], a framework for building applications with LLMs, has been instrumental in leveraging LLMs efficiently across a variety of scenarios.
While these QA systems have shown remarkable general abilities [3,4], their outputs are prone to hallucination [5]. They continue to suffer from serious challenges, particularly factual inaccuracies [5] and outdated knowledge [6,7]. This makes it harder for users to trust and verify LLM-generated answers. Therefore, validating the factual accuracy of generative LLM-based question-answering systems poses a contemporary research challenge.
Given their inherently generative nature, ensuring that the output aligns with the sources of information proves to be a challenging task. Incorporating external knowledge via information retrieval, i.e., retrieval-augmented generation (RAG) [8], has been regarded as a promising way to resolve the above challenges. RAG optimizes the output of a large language model by referencing an external knowledge base outside of the LLM's training data before generating a response. RAG enhances the responses of the LLM and reduces the occurrence of hallucinations, thereby increasing the model's credibility [9,10].
Nevertheless, it is crucial to recognize that in practical scenarios, the text retrieved by these models frequently contains a certain level of noise. This poses a problem, as the RAG-enabled responses of a language model can vary significantly with the quality and accuracy of the retrieved content. Therefore, objective evaluations of RAG-enabled LLM performance are just as vital as benchmarking their non-RAG analogs.
The contributions of this paper are as follows:
We propose QA-RAG for constructing question-answering systems empowered by LLMs to handle the occurrence of hallucination. The proposed architecture is versatile, does not require fine-tuning, and can be applied to both black-box and open-source LLMs.
We investigate the behavior of QA-RAG from three perspectives: (1) Noise Robustness: assessing how noise in retrieved documents affects answer accuracy. (2) Knowledge Gap Detection: evaluating the system’s ability to recognize and handle missing information. (3) External Truth Integration: examining how well the system integrates external data that contradict its pre-existing knowledge.
We modified the TriviaQA dataset: 1100 answers were reformulated as long-form answers to increase the complexity of the evaluation. Two testbeds comprising 600 rows were designed to assess the aforementioned behaviors.
We examine and highlight various practices and implementations applicable to the development of a RAG system.
Our code and dataset are publicly released to facilitate future research on interpretable RAG.
2. Literature Review
The predominant paradigm for utilizing large language models in the question-answering task involves adapting them through fine-tuning. Although LLMs have general knowledge, they may lack effectiveness in specific domains. Fine-tuning addresses this by retraining a pre-trained language model on a particular dataset or task to enhance its performance.
It has been observed that LLMs can acquire a kind of implicit “knowledge base” after being pre-trained on unstructured text. The work in [11] examined the potential benefit of this behavior for an open-domain question-answering task. The authors fine-tuned the T5 model to answer questions without additional information or context, based solely on the knowledge stored in its parameters. Thus, the model must be queried in natural language and “find the information” stored in its weights. However, this approach has drawbacks. First, it gives competitive results only with language models that have more than 10 billion parameters. Second, the model extracts knowledge from its parameters in an inexplicable way and hallucinates realistic answers when it is unsure. Third, the maximum-likelihood objective used to train the model provides no guarantees as to whether a model will learn a fact or not.
In the work [12], Bloom 7B was fine-tuned with domain data sourced from the Wikidata knowledge base. The authors implemented knowledge injection to improve the model's ability to store knowledge in its parameters. To reduce hallucination, a teacher–student approach was proposed: a more powerful LLM (GPT-4) was used to provide guidance to the weaker LLM (Bloom 7B). However, the practical use of the proposed approach is questioned [13,14] because of the importance and complexity of creating a reliable knowledge graph through entity and edge extraction with LLMs. There are also significant costs associated with the teacher model's intervention in the student model's explanations.
Consequently, fine-tuning is a strong starting point for building QA systems, but it has several limitations. Fine-tuning can be very efficient if the goal is to optimize the performance of an LLM on a single task, and it helps adapt a pre-trained model to unique requirements. However, fine-tuning cannot eliminate hallucination, as knowledge is limited to what the model has encountered in its training data. Moreover, the process of fine-tuning can be extremely time-consuming and expensive. Storage capacity is limited by the size of the neural network: to capture more world knowledge, one must train ever-larger networks. Furthermore, even if this is achieved, it is challenging to determine what knowledge is stored, and the stored knowledge may inevitably be incomplete, out of date, or incorrect.
The pioneering work of Lewis et al. marked a significant milestone in the field of language generation [8]. They introduced the retrieval-augmented generation (RAG) model, which ingeniously combines pre-trained parametric and non-parametric memory for language generation. This model, with its unique ability to access a dense vector index of Wikipedia through a pre-trained neural retriever, has set a new standard in the realm of free-form, abstractive text generation.
The term “RAG” was introduced in [8], and it has since become synonymous with the method of integrating external knowledge bases into language generation models. However, it is important to clarify that the current reference to “RAG” pertains to the broader concept of retrieval-augmented generation, not the specific methodology outlined in that work. This approach, as subsequent research [10,15,16] has shown, allows models to generate more accurate and reliable responses by using external knowledge as guidance. Impressive performance has been demonstrated in a range of tasks, including open-domain question answering [17,18] and dialog [19].
LangChain [2], an open-source Python framework, simplifies the complexities associated with interacting with LLMs. Offering numerous tools such as chains, web search capabilities, vector databases, and embedding models for creating and storing vector embeddings, it accelerates the development of RAG-based AI applications. In our previous work [20], a QA chatbot that responds to fintech-domain-specific queries was developed with two potential scenarios following a user's query. In the first scenario, the chatbot generated an answer based on information found in its external knowledge base. The second scenario occurred when the chatbot could not find information to formulate an answer. In this case, the chatbot initiated a search operation through the Google Search API, ensuring that users always received a response.
The current study builds upon the work in [20], incorporating several adjustments to simplify system evaluation. One notable modification is in the system's response behavior. If the system does not contain an answer to the question within its vector database, it will explicitly state so, thereby enhancing the transparency of the system's responses. Additionally, the scope of this research is limited to the vector database, and as such, the web search capability will not be assessed.
The research in [21] presents a method for rapidly creating a QA application over a single PDF file using LangChain. The paper in [22] serves as a practical guide, highlighting the framework's potential for creating versatile and powerful virtual assistants that can be deployed across various industries. Others [23] explored the use of the framework to develop an automated customer service chatbot that offers responsive and context-aware interactions. The chatbot, integrated into the customer service platform of Birla Vishvakarma Mahavidyalaya, utilizes web scraping and embeddings to provide real-time support and query resolution. The paper in [24] presents MindGuide, a sophisticated mental health support tool designed to assist individuals with mental health challenges. The mentioned works outline the framework's features that facilitate the swift development and deployment of chatbots.
Nevertheless, these studies fall short of providing a comprehensive evaluation of the system, which limits their suitability for deployment or practical application. In contrast, our work fills this gap by rigorously assessing system performance across key dimensions such as noise robustness, knowledge gap detection, and external truth integration. Unlike prior research, we prioritize evaluation over technical implementation, offering insights into the reliability and behavior of LLMs in real-world scenarios.
3. Materials and Methods
3.1. Open-Book Question-Answering System
Open-book question answering refers to the task of developing systems capable of providing accurate and contextually relevant answers to a specific set of questions posed in natural language with access to external knowledge. In this research, the task is solved using a two-step architecture consisting of a retriever and a generator (or reasoner). The system is constructed in such a manner that it can only answer based on the external knowledge provided, without relying on its parametric memory. The overall system architecture is depicted in Figure 1, detailing the sequential steps as follows (a minimal code sketch of this flow is given after the list):
The query is forwarded to the embedding model for encoding into an embedded query vector.
The embedded query vector is then transmitted to a vector database.
The retriever algorithm dictates the retrieval of the top-k pertinent segments from the database.
Subsequently, both the query text and the retrieved segments are forwarded to the generator.
The LLM generates an output that must be relevant and contextually connected to the original query and the information retrieved from the database.
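To make these steps concrete, the following is a minimal sketch of the online query path, assuming a sentence-transformers-style encoder and a ChromaDB-style collection like those described in Section 3.2; the function and variable names (answer_query, encoder, collection, llm) are illustrative rather than the exact implementation used in this work.

```python
def answer_query(query, encoder, collection, llm, top_k=1):
    """Sketch of the five pipeline steps; component interfaces are assumed."""
    # Steps 1-2: encode the query and send the vector to the vector database.
    query_vec = encoder.encode([query]).tolist()
    # Step 3: retrieve the top-k most relevant segments.
    hits = collection.query(query_embeddings=query_vec, n_results=top_k)
    segments = hits["documents"][0]
    # Step 4: forward the query text and the retrieved segments to the generator.
    context = "\n".join(segments)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    # Step 5: the LLM generates an answer grounded in the retrieved context.
    return llm(prompt)
```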
3.2. Retriever
The objective of the retrieval component is to identify a subset of documents, $D_q \subseteq D$, from a corpus of documents $D = \{d_1, d_2, \ldots, d_N\}$, that would assist the generator in correctly responding to the query. The use of a dense retriever has gained popularity among various retrieval methods due to its ability to handle queries that are both complex and varied. The fundamental principle of dense retrieval is the conversion of textual data into high-dimensional vector representations, typically accomplished with a neural network, often a transformer-based encoder like BERT [25].
Both the query $q$ and potential source documents $d \in D$ are processed by the dense retriever to generate corresponding embeddings: $\mathbf{e}_q$ for the query and $\mathbf{e}_d$ for each document $d$. This embedding process can be formulated as follows:

$$\mathbf{e}_q = E_Q(q), \qquad \mathbf{e}_d = E_D(d),$$

where $E_Q$ and $E_D$ are encoders based on neural networks, which share architecture or weights, and are engineered to convert textual data into vector space representations.
Upon creation, embeddings are saved in the vector database ChromaDB. This method offers the distinct advantage of permitting an “offline” initialization, where document embeddings are precomputed. Consequently, only the embedding of the query needs to be computed during the search, thereby reducing latency.
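A minimal sketch of this offline/online split is shown below, assuming the chromadb and sentence-transformers packages; the collection name, storage path, and placeholder chunks are hypothetical.

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Offline phase: document chunks are embedded and stored once.
encoder = SentenceTransformer("all-MiniLM-L12-v2")
client = chromadb.PersistentClient(path="./chroma_db")          # hypothetical path
collection = client.get_or_create_collection(name="kb_chunks")  # hypothetical name

chunks = ["First document chunk ...", "Second document chunk ..."]  # placeholders
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=encoder.encode(chunks).tolist(),
)

# Online phase: only the query embedding is computed at search time.
query = "Example question?"
hits = collection.query(query_embeddings=encoder.encode([query]).tolist(), n_results=1)
print(hits["documents"][0])
```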
Cosine similarity was initially chosen over the dot product for comparing text vectors due to its emphasis on contextual similarities rather than purely dimensional or frequency-based similarities. However, since each vector in our study has the same norm (as per the specifications of the embedding model), the distinction between cosine similarity and the dot product becomes negligible. Consequently, the computation of similarities between query and document embeddings was conducted using the dot product, defined as follows:

$$\mathrm{sim}(q, d) = \mathbf{e}_q \cdot \mathbf{e}_d = \mathbf{e}_q^{\top}\mathbf{e}_d.$$
This metric assesses the relevance of each document to the query by examining their similarity in the embedded vector space; higher scores indicate increased relevance. Documents are then ranked according to these scores, and the top-ranked (top-k) documents are forwarded for further processing by the generator component.
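A small self-contained illustration of this scoring and ranking step is given below; the toy embedding vectors are made up purely for demonstration.

```python
import numpy as np

def top_k_by_dot_product(query_emb, doc_embs, k=1):
    """Rank documents by e_q . e_d and return the indices and scores of the k best."""
    scores = doc_embs @ query_emb            # one similarity score per document
    order = np.argsort(scores)[::-1][:k]     # highest scores first
    return order, scores[order]

doc_embs = np.array([[0.1, 0.9, 0.0, 0.4],
                     [0.8, 0.1, 0.1, 0.0],
                     [0.2, 0.7, 0.1, 0.3]])
query_emb = np.array([0.1, 0.8, 0.0, 0.5])
print(top_k_by_dot_product(query_emb, doc_embs, k=2))
```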
In the realm of abstractive question answering, it has been established that dense methods outperform sparse methods. Sparse retrievers, such as BM25, utilize statistical weighting to determine the relevance between search terms and documents. The BM25 scoring function considers term frequency and document length, providing an efficient means of retrieving relevant information. However, its exact-match basis can be limiting when a query and a document are relevant but share no common words.
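For reference, a BM25 sparse retriever can be sketched with the rank_bm25 package; this is an illustration and not necessarily the implementation used for the hybrid retriever in our experiments.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "The Eiffel Tower is located in Paris.",
    "BM25 is a sparse, exact-match retrieval function.",
    "Dense retrievers embed text with a neural encoder.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "where is the eiffel tower".split()
print(bm25.get_scores(query))  # one relevance score per document
```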
Contrary to this, there is a growing body of research suggesting that hybrid methods may offer superior results [26]. These methods, which combine dense and sparse searches, aim to harness the strengths of both approaches. Upon receiving a question, both searches are executed in parallel, producing two lists of potential documents to provide the answer.
Hybrid search requires integrating the results from multiple retrieval methods. A widely adopted algorithm employed to tackle this challenge is Reciprocal Rank Fusion (RRF) [27], an uncomplicated technique for combining document rankings from various information retrieval systems.
For a given set of documents $D$ and the search results returned for the question $q$ by the various retrieval methods $r \in R$, the $\mathrm{RRF}$ score can be computed for each document $d \in D$ as outlined in [28]:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},$$

where $\frac{1}{k + \mathrm{rank}_r(d)}$ is a reciprocal rank, with $\mathrm{rank}_r(d)$ denoting the rank at which the document $d$ is retrieved by the search mechanism $r$. The variable $k$ is a ranking constant employed to mitigate the impact of outlier systems.
Figure 2 illustrates the calculation of the RRF score with $k$ set to 1. In the provided example, three chunks are retrieved in varying sequences by two search methodologies (sparse and dense searches). The reciprocal rank score for each chunk is computed. These scores are subsequently aggregated to form a new cumulative score. The resulting hybrid list ranks the chunks based on this composite score.
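The fusion step itself reduces to a few lines of code. The sketch below implements the RRF formula above for two ranked lists; the chunk identifiers are illustrative and do not reproduce the exact example of Figure 2.

```python
def reciprocal_rank_fusion(rankings, k=1):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    rankings: one ranked list per retrieval method (e.g., sparse and dense).
    k: the ranking constant (1 here, as in Figure 2; the original RRF paper uses 60).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sparse_ranking = ["chunk_B", "chunk_A", "chunk_C"]
dense_ranking = ["chunk_A", "chunk_C", "chunk_B"]
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking], k=1))
```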
3.3. Generator
The subsequent phase employs an answer synthesis component—a generative large language model within a RAG framework. The purpose of these models is to generate text that is coherent, contextually appropriate, and semantically correct in response to a specific query, often known as a prompt. Generative language models function by estimating the probability distribution of the subsequent token based on the preceding tokens. For a given word sequence $w = (w_1, w_2, \ldots, w_n)$, the goal of a generative language model is to optimize the likelihood of this sequence, which is computed using the chain rule of probability:

$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}).$$
Here, $P(w_i \mid w_1, \ldots, w_{i-1})$ represents the conditional probability of the word $w_i$, given the prior sequence of words. The generative language model accepts the query $q$ and the fetched documents $D_q$ as input, and it formulates a response $y = (y_1, \ldots, y_T)$ by sequentially forecasting the subsequent token in the sequence:

$$P(y \mid q) \approx \sum_{d \in D_q} p_{\eta}(d \mid q) \prod_{i=1}^{T} p_{\theta}(y_i \mid q, d, y_{1:i-1}).$$

To put it more formally, $p_{\eta}(d \mid q)$ is the retrieval component that offers a truncated probability distribution for the highest-ranking documents, while $p_{\theta}(y_i \mid q, d, y_{1:i-1})$ is a probability distribution parameterized by $\theta$ that generates the current token based on the query, the retrieved document, and the previously generated tokens. This is performed by the LLM.

In the context of dense retrieval, the probability distribution for the highest-ranking documents may take a functional form like $p_{\eta}(d \mid q) \propto \exp\!\left(\mathbf{e}_d^{\top}\mathbf{e}_q\right)$. Such a formalization of the RAG process reveals how the generative component depends on the query and the retrieved documents.
Generators Tested
In our experiments, we evaluate two variants of the open-source model Llama 2 [29], with 7B and 13B parameters. For all tested models, we consistently used a greedy generation strategy, limiting the response length to a single sentence with a maximum of 50 tokens. Recognizing the limitations of memory and computational resources, we quantized all models to a 4-bit representation using the bitsandbytes library from Hugging Face.
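A sketch of this setup using the transformers and bitsandbytes libraries is given below; the chat checkpoint identifier and the NF4 quantization settings are assumptions rather than the exact configuration used in the experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint; requires access approval
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weight representation
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "Context: ...\n\nQuestion: ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)  # greedy, <= 50 tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```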
3.4. Required Abilities of Generator
Previous studies have predominantly focused on evaluating the end-to-end performance of systems, particularly in their capacity to handle relevant information. However, a notable challenge arises from the presence of irrelevant or misleading information within external knowledge bases. LLMs frequently encounter difficulties in generating reliable output and are susceptible to being misled by inaccuracies in documents. Consequently, this study shifts the evaluation focus to include the LLMs’ ability to manage irrelevant documents and effectively identify and reject misleading information. We have identified three critical abilities essential for LLMs when employing retrieval-augmented generation for question answering: noise robustness, knowledge gap detection, and external truth integration.
Noise robustness is the capability of an LLM to extract useful information from contexts filled with noise. This ability is critical as it ensures that LLMs can still function effectively even when the retrieved documents contain irrelevant or misleading data.
Knowledge gap detection refers to the LLM’s ability to recognize when the necessary knowledge is absent from any retrieved documents and appropriately choose not to answer the question. This capability is vital for preventing the generation of incorrect or misleading responses based on insufficient information.
External truth integration refers to the LLM's capability to provide accurate answers based solely on the retrieved (non-parametric) knowledge, even when that external knowledge contradicts generally accepted facts stored in the model's parametric memory.
3.5. Dataset
To evaluate our proposed approach, we utilize the TriviaQA [30] open-domain dataset. We aligned the TriviaQA open-domain dataset with the SQuAD [31] style of question–answer pairs with corresponding excerpts from evidence documents. TriviaQA is a comprehensive dataset designed to challenge and evaluate short-form question-answering systems. The dataset was sourced from a combination of Wikipedia articles and general web content, providing a rich variety of lexical and syntactic structures. The dataset necessitates reasoning across multiple sentences, making it a robust resource for evaluating the performance of open-domain QA systems.
The initial dataset comprised short-form answers, averaging between 1 and 5 words. To facilitate the study of abstractive question answering, we modified 1100 answers into one-sentence-long answers utilizing the state-of-the-art language model GPT-4.
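A possible reformulation routine is sketched below with the OpenAI Python client; the prompt wording is hypothetical and is not the exact instruction given to GPT-4 during our data preparation.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def to_long_form(question: str, short_answer: str) -> str:
    """Rewrite a short TriviaQA answer as one complete sentence."""
    prompt = (
        f"Question: {question}\n"
        f"Short answer: {short_answer}\n"
        "Rewrite the short answer as a single, self-contained sentence."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(to_long_form("Who wrote 'War and Peace'?", "Leo Tolstoy"))
```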
Figure 3 demonstrates the process of creating the test dataset and knowledge base.
The unfiltered subset of the TriviaQA dataset contains questions that do not have corresponding answer strings. For our system, which relies solely on the knowledge that comes from the vector database rather than its prior knowledge, those questions must be unanswerable. Such a dataset simulates real-world conditions and subsequently serves as a testbed for evaluating the generator's capabilities to manage irrelevant information. The creation process of this 500-question testbed is illustrated in Figure 4.
To assess how accurately answers can be provided based on external truths, despite contradictions from the model's internal knowledge, the Contradictions testbed was constructed by modifying 100 question–answer–context pairs from the test dataset. Specifically, the answer texts were altered as shown in Figure 5.
3.6. Evaluation
3.6.1. Establish the Baseline
To compare the performance of the QA system, we evaluate the following established baselines: closed-book T5 [11] and Llama 2 with 7B and 13B parameters [29], as well as open-book Atlas [32] and RAG [8]. In closed-book settings, only the question was utilized, without retrieving any additional context. Additionally, no prompt engineering technique was used.
All baseline works use the exact match score for evaluation. However, it is not suitable for long-form answer evaluation. Previous studies mostly rely on human evaluation [33,34], which is expensive and difficult to reproduce. Traditional long-form answer evaluation metrics, which rely on n-gram-based methods like BLEU, have demonstrated limitations outside of Machine Translation [35]. In such a scenario, the LLMs-as-judges evaluation method emerges as a promising alternative to human evaluation [36], exhibiting the highest similarity to human evaluation compared to other methods [37,38]. Among the LLMs-as-judges metrics, the Retrieval-Augmented Generation Assessment (Ragas) automated framework [39] was chosen.
Ragas provides robust evaluation metrics, with scores ranging from 0 to 1, for both context retrieval and answer generation tasks. For context retrieval, the primary metric is Context Recall, which assesses how well the retrieved context aligns with the ground truth. It measures the proportion of sentences in the ground truth answer that can be found in the retrieved context. For answer generation, Ragas offers two key metrics: Answer Semantic Similarity and Answer Correctness. Answer Semantic Similarity evaluates how closely the generated answer matches the semantics of the ground truth answer. Answer Correctness, on the other hand, considers both semantic and factual aspects, providing a comprehensive evaluation of answer generation performance.
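Programmatically, the evaluation reduces to assembling question/answer/context/ground-truth records and passing them to Ragas. The sketch below assumes a Ragas release around version 0.1.x; column names and metric imports may differ in other releases, and the record shown is a made-up example.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_similarity, context_recall

records = {
    "question": ["Who wrote 'War and Peace'?"],
    "answer": ["The novel 'War and Peace' was written by Leo Tolstoy."],
    "contexts": [["War and Peace is a novel by the Russian author Leo Tolstoy."]],
    "ground_truth": ["Leo Tolstoy wrote 'War and Peace'."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[context_recall, answer_similarity, answer_correctness],
)
print(result)  # each metric is reported on a 0-1 scale
```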
3.6.2. Experimental Settings
Our basic configuration for the QA system includes the following settings: a chunk size of 50, no chunk overlap, the embedding model 'all-MiniLM-L12-v2', and a top-k value of 1. During the ablation study, we modify one element at a time to assess its impact across various tasks, comparing different settings for each component (a configuration sketch follows the list):
LLM versions: Llama 2 13b, Llama 2 7b.
Retriever: dense, hybrid.
Top-k values: 1, 2, 3, 5, and 8.
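The configuration space can be expressed compactly as follows; the dictionary layout and run loop are illustrative, not the actual experiment driver.

```python
BASE_CONFIG = {
    "chunk_size": 50,
    "chunk_overlap": 0,
    "embedding_model": "all-MiniLM-L12-v2",
    "top_k": 1,
    "llm": "Llama-2-13b",
    "retriever": "dense",
}

ABLATIONS = {
    "llm": ["Llama-2-13b", "Llama-2-7b"],
    "retriever": ["dense", "hybrid"],
    "top_k": [1, 2, 3, 5, 8],
}

def ablation_runs(base, ablations):
    """Yield configurations that change exactly one component of the base setup."""
    for key, values in ablations.items():
        for value in values:
            if value == base[key]:
                continue
            cfg = dict(base)
            cfg[key] = value
            yield cfg

for cfg in ablation_runs(BASE_CONFIG, ABLATIONS):
    print(cfg)  # each configuration would be run through the QA pipeline and scored
```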
4. Experiments and Results
In this section, we conduct a series of experiments. We validate the performance of the QA system by comparing against baselines on the TriviaQA dataset. Subsequently, we employ ablation studies to evaluate the effectiveness of each component. Finally, we aim to analyze the behaviors of the generators tested.
4.1. The Main Results
In this subsection, we present the main results from our evaluation based on information obtained from the sources. The performance of various approaches on the TriviaQA dataset is summarized in Table 1 below.
In closed-book scenarios, Llama 2 13b achieved the highest accuracy of 73.1%, while T5-11B performed least effectively at 50.1%. Among the open-book methods, Atlas-XL + CIT achieved an accuracy of 77.4%, while RAG-Token and RAG-Seq achieved accuracies of 66.1% and 68.0%, respectively. Our approach demonstrated the highest accuracy of 83.3%, showcasing its effectiveness across both closed-book and open-book settings.
Based on the ablation study, the optimal configuration for achieving the best results involved using Llama 2 13b as the generator, coupled with a dense retriever and a top-k value of 1. This setup consistently delivered superior performance compared to other configurations tested, highlighting its effectiveness in the context of this study’s experiments.
4.2. Ablation Study on Retriever
The experiments were performed to compare various retrieval methods using the Context Recall metric. The results, detailed in Table 2, reveal that the dense retriever achieved the highest performance with a top-k value of 1, attaining a context recall of 95% and demonstrating its strong capability to accurately retrieve the most relevant context passages.
The high performance of the dense retriever may be due to the fact that the questions in the dataset do not always align word for word with the answers and necessitate the analysis of multiple sentences to derive an answer. The nature of the task is abstractive rather than extractive, thereby exposing the limitations of the sparse retriever.
4.3. Knowledge Gap Detection
The LLMs were tasked with answering the 500 questions in this testbed. They were instructed to respond “No” when the required information was absent from the retrieved documents. This experimental design aimed to simulate scenarios where LLMs must make informed decisions about whether to answer based on the completeness of available knowledge. The exact match score in Table 3 indicates the percentage of correctly identified unanswerable questions.
Specifically, the Llama 2 7b model attempted to answer 170 questions and rejected 330 questions due to the lack of necessary information, amounting to 66%. The Llama 2 13b model attempted to answer 110 questions and rejected 390 questions, accounting for 78%.
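An instruction of the kind used in this testbed could look like the template below; the wording is a hypothetical reconstruction, not the exact prompt used in the experiments.

```python
# Hypothetical instruction template for the knowledge gap detection testbed.
KNOWLEDGE_GAP_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the information needed to answer,
reply with exactly: No

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    return KNOWLEDGE_GAP_PROMPT.format(context=context, question=question)
```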
4.4. Noise Robustness Results
This experiment evaluated the noise robustness of the generators, measuring their ability to extract useful information from contexts containing varying degrees of noise. Noise refers to retrieved irrelevant or misleading documents. The experiment was specifically constructed to vary the number of retrieval documents (top-k from 1 to 3, 5, and 8) provided to the LLM, thereby introducing varying levels of noise. A noise value of 0 ensured that only the exact relevant context was provided. This experimental setup aimed to simulate real-world scenarios where LLMs must operate effectively amidst noise and irrelevant data.
Figure 6 presents the results.
The results indicate that creating a retriever that extracts the relevant context with minimal noise is a crucial factor that improves the quality of answers. However, even if the retriever accurately provides the necessary information, reliable answers are not guaranteed.
4.5. External Truth Integration
To investigate the discrepancy between the quality of the retrieved context and the reliability of the answers generated by the system based on that context, we conducted an experiment exploring external truth integration.
Hypothetically, the system should rely solely on non-parametric memory, where knowledge can be easily modified (updated, deleted), unlike the knowledge stored in the model's parametric memory. This is achieved through prompts that instruct the model's behavior accordingly. We employed prompts of three styles—strict, standard, and weak (Figure 7)—to evaluate the responses of LLMs in the Contradictions testbed. The results are shown in Figure 8.
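For intuition, the three styles differ mainly in how forcefully they bind the model to the retrieved context; the templates below are hypothetical illustrations, while the exact prompts used in the experiments are given in Figure 7.

```python
# Hypothetical illustrations of strict, standard, and weak prompt styles.
PROMPT_STYLES = {
    "strict": (
        "Answer using ONLY the context below. Ignore anything you may already "
        "know, even if the context contradicts it.\n\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    ),
    "standard": (
        "Use the context below to answer the question.\n\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    ),
    "weak": (
        "Here is some context that may help: {context}\n"
        "Question: {question}\nAnswer:"
    ),
}
```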
From the results, we observe that while the type of prompt significantly influences answer alignment (with stricter prompts aligning more closely with external knowledge), the prompt alone is insufficient to fully adjust the system’s behavior. Even when the retriever accurately provides necessary information, reliable answers are not assured as the generator may prefer the knowledge embedded in its training data.
The lack of clear guidelines on how models combine knowledge from answer evidence documents with their prior knowledge can lead to problems. With LLMs set to be widely used in various fields, users and developers must be aware of potential unintended consequences, especially if they assume RAG-enabled systems are always truthful.
There are two main limitations in our study: (1) QA-RAG was only tested on factoid questions without involving complex reasoning; (2) we did not evaluate the system on GPT-3.5 and GPT-4 due to cost constraints, which prevented us from conducting comprehensive evaluations on other state-of-the-art models.
5. Conclusions
In this study, we rigorously evaluated QA systems using the TriviaQA dataset, comparing closed-book and open-book approaches. Based on the comprehensive evaluation, several key insights and findings emerged. Firstly, all Llama 2 models perform better in the open-book setting than in the closed-book setting. This highlights the advantage of enabling models to access external information sources during answer generation. Our proposed QA-RAG approach with Llama 2 13b excelled with the highest score of 83.3%. Secondly, the ablation study highlighted the efficacy of a dense retriever, achieving a context recall of 95%, which is crucial for enhancing system performance through precise context retrieval. Additionally, the investigation of knowledge gap detection ability revealed that Llama 2 13b correctly identified unanswerable questions with 78% accuracy, compared to 66% for Llama 2 7b. The results underscore the critical importance of developing retrievers capable of extracting relevant context with minimal noise. However, despite effective retrieval, the reliability of answers remains somewhat unpredictable. Finally, our exploration of integrating information that contradicts generally accepted facts into the knowledge base emphasized the complex underlying tension between a model's prior knowledge and the information presented in reference documents, highlighting room for future research and improvements.