1. Introduction
The emergence of large language models (LLMs), such as GPT-3 [1], has sparked a wide range of innovations, powering intelligent chatbots, personal assistants, and other natural language processing (NLP) applications such as ChatGPT and Copilot. LLMs have also gained increasing popularity as a tool for information seeking and question answering (QA). LangChain [2], a framework for building applications with LLMs, has been instrumental in leveraging LLMs efficiently across a variety of scenarios.
While these QA systems have shown remarkable general abilities [3,4], their outputs are prone to hallucination [5]. They continue to suffer from serious challenges, particularly factual inaccuracies [5] and outdated knowledge [6,7]. This makes it harder for users to trust and verify LLM-generated answers. Therefore, validating the factual accuracy of generative LLM-based question-answering systems poses a contemporary research challenge.
Given their inherently generative nature, ensuring that the output aligns with the sources of information proves to be a challenging task. Incorporating external knowledge via information retrieval, i.e., retrieval-augmented generation (RAG) [8], has been regarded as a promising way to resolve the above challenges. RAG optimizes the output of a large language model by referencing an external knowledge base outside of the LLM's training data before generating a response. RAG enhances the responses of the LLM and reduces the occurrence of hallucinations, thereby increasing the model's credibility [9,10].
Nevertheless, it is crucial to recognize that in practical scenarios, the text retrieved by these models frequently contains a certain level of noise. This poses a problem, as the RAG-enabled responses of a language model can vary significantly with the quality and accuracy of the retrieved content. Therefore, objective evaluations of RAG-enabled LLM performance are just as vital as benchmarking their non-RAG analogs.
The contributions of this paper are as follows:
We propose QA-RAG for constructing question-answering systems empowered by LLMs to handle the occurrence of hallucination. The proposed architecture is versatile, does not require fine-tuning, and can be applied to both black-box and open-source LLMs.
We investigate the behavior of QA-RAG from three perspectives: (1) Noise Robustness: assessing how noise in retrieved documents affects answer accuracy. (2) Knowledge Gap Detection: evaluating the system’s ability to recognize and handle missing information. (3) External Truth Integration: examining how well the system integrates external data that contradict its pre-existing knowledge.
We modified the TriviaQA dataset: 1100 answers were reformulated as long-form answers to increase the complexity of the evaluation. Two testbeds comprising 600 rows were designed to assess the aforementioned behaviors.
We examine and highlight various practices and implementations applicable to the development of a RAG system.
Our code and dataset are publicly released to facilitate future research on interpretable RAG.
2. Literature Review
The predominant paradigm for utilizing large language models in the question-answering task involves adapting them through fine-tuning. Although LLMs have general knowledge, they may lack effectiveness in specific domains. Fine-tuning addresses this by retraining a pre-trained language model on a particular dataset or task to enhance its performance.
It has been observed that LLMs can acquire a kind of implicit “knowledge base” after being pre-trained on unstructured text. The work in [11] examined the potential benefit of this behavior for an open-domain question-answering task. The authors fine-tuned the T5 model to answer questions without additional information or context, based solely on the knowledge stored in its parameters. Thus, the model must be queried in natural language and “find the information” stored in its weights. However, this approach has drawbacks. First, it gives competitive results only with language models that have more than 10 billion parameters. Second, the model extracts knowledge from its parameters in an inexplicable way and hallucinates realistic answers when it is unsure. Third, the maximum-likelihood objective used to train the model provides no guarantees as to whether a model will learn a fact or not.
In the work [12], Bloom 7B was fine-tuned with domain data sourced from the Wikidata knowledge base. The authors implemented knowledge injection to improve the model's ability to store knowledge in its parameters. To reduce hallucination, a teacher–student approach was proposed: a more powerful LLM (GPT-4) was used to provide guidance to the weaker LLM (Bloom 7B). However, the practical use of the proposed approach is questioned [13,14] because of the importance and complexity of creating a reliable knowledge graph through entity and edge extraction with LLMs. There are also significant costs associated with the teacher model's intervention in the student model's explanations.
Consequently, fine-tuning is a strong starting point for building QA systems, but it has several limitations. Fine-tuning can be very efficient if the goal is to optimize the performance of an LLM on a single task, and it helps adapt a pre-trained model to unique requirements. However, fine-tuning cannot eliminate hallucination, as knowledge is limited to what the model has encountered in its training data. Moreover, the process of fine-tuning can be extremely time-consuming and expensive. Storage capacity is limited by the size of the neural network: to capture more world knowledge, one must train ever-larger networks. Furthermore, even if this is achieved, it is challenging to determine what knowledge is stored, and the stored knowledge may inevitably be incomplete, out of date, or incorrect.
The pioneering work of Lewis et al. marked a significant milestone in the field of language generation [8]. They introduced the retrieval-augmented generation (RAG) model, which ingeniously combines pre-trained parametric and non-parametric memory for language generation. This model, with its unique ability to access a dense vector index of Wikipedia through a pre-trained neural retriever, has set a new standard in the realm of free-form, abstractive text generation.
The term “RAG” was introduced in [8], and it has since become synonymous with the method of integrating external knowledge bases into language generation models. However, it is important to clarify that the current reference to “RAG” pertains to the broader concept of retrieval-augmented generation, not the specific methodology outlined in that work. This approach, as subsequent research [10,15,16] has shown, allows models to generate more accurate and reliable responses by using external knowledge as guidance. Impressive performance has been demonstrated in a range of tasks, including open-domain question answering [17,18] and dialog [19].
LangChain [2], an open-source Python framework, simplifies the complexities associated with interacting with LLMs. Offering numerous tools such as chains, web search capabilities, vector databases, and embedding models for creating and storing vector embeddings, it accelerates the development of RAG-based AI applications. In our previous work [20], a QA chatbot that responds to fintech-domain-specific queries was developed with two potential scenarios following a user's query. In the first scenario, the chatbot generated an answer based on information found in its external knowledge base. The second scenario occurred when the chatbot could not find information to formulate an answer. In this case, the chatbot initiated a search operation through the Google Search API, ensuring that users always received a response.
The current study builds upon the work in [20], incorporating several adjustments to simplify system evaluation. One notable modification is in the system's response behavior. If the system does not contain an answer to the question within its vector database, it will explicitly state so, thereby enhancing the transparency of the system's responses. Additionally, the scope of this research is limited to the vector database, and as such, the web search capability will not be assessed.
The research in [21] presents a method for rapidly creating a QA application over a single PDF file using LangChain. The paper in [22] serves as a practical guide, highlighting the framework's potential for creating versatile and powerful virtual assistants that can be deployed across various industries. Others [23] explored the use of the framework to develop an automated customer service chatbot that offers responsive and context-aware interactions. The chatbot, integrated into the customer service platform of Birla Vishvakarma Mahavidyalaya, utilizes web scraping and embeddings to provide real-time support and query resolution. The paper in [24] presents MindGuide, a sophisticated mental health support tool designed to assist individuals with mental health challenges. The mentioned works outline the framework's features that facilitate the swift development and deployment of chatbots.
Nevertheless, these studies fall short of providing a comprehensive evaluation of the system, which limits their suitability for deployment or practical application. In contrast, our work fills this gap by rigorously assessing system performance across key dimensions such as noise robustness, knowledge gap detection, and external truth integration. Unlike prior research, we prioritize evaluation over technical implementation, offering insights into the reliability and behavior of LLMs in real-world scenarios.
3. Materials and Methods
3.1. Open-Book Question-Answering System
Open-book question answering refers to the task of developing systems capable of providing accurate and contextually relevant answers to a specific set of questions posed in natural language with access to external knowledge. In this research, the task is solved using a two-step architecture consisting of a retriever and a generator (or reasoner). The system is constructed in such a manner that it can only answer based on the external knowledge provided, without relying on its parametric memory. The overall system architecture is depicted in Figure 1, detailing the sequential steps as follows (a minimal code sketch of this flow is given after the list):
The query is forwarded to the embedding model for encoding into an embedded query vector.
The embedded query vector is then transmitted to a vector database.
The retriever algorithm dictates the retrieval of the top-k pertinent segments from the database.
Subsequently, both the query text and the retrieved segments are forwarded to the generator.
The LLM generates an output that must be relevant and contextually connected to the original query and the information retrieved from the database.
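To make these steps concrete, the following is a minimal sketch of the online query path, assuming a sentence-transformers-style encoder and a ChromaDB-style collection like those described in Section 3.2; the function and variable names (answer_query, encoder, collection, llm) are illustrative rather than the exact implementation used in this work.

```python
def answer_query(query, encoder, collection, llm, top_k=1):
    """Sketch of the five pipeline steps; component interfaces are assumed."""
    # Steps 1-2: encode the query and send the vector to the vector database.
    query_vec = encoder.encode([query]).tolist()
    # Step 3: retrieve the top-k most relevant segments.
    hits = collection.query(query_embeddings=query_vec, n_results=top_k)
    segments = hits["documents"][0]
    # Step 4: forward the query text and the retrieved segments to the generator.
    context = "\n".join(segments)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    # Step 5: the LLM generates an answer grounded in the retrieved context.
    return llm(prompt)
```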
3.2. Retriever
The objective of the retrieval component is to identify a subset of documents, $D_q \subseteq D$, from a corpus of documents $D = \{d_1, d_2, \ldots, d_N\}$, that would assist the generator in correctly responding to the query. The use of a dense retriever has gained popularity among various retrieval methods due to its ability to handle queries that are both complex and varied. The fundamental principle of dense retrieval is the conversion of textual data into high-dimensional vector representations, typically accomplished with a neural network, often a transformer-based encoder like BERT [25].
Both the query $q$ and potential source documents $d \in D$ are processed by the dense retriever to generate corresponding embeddings: $\mathbf{e}_q$ for the query and $\mathbf{e}_d$ for each document $d$. This embedding process can be formulated as follows:

$$\mathbf{e}_q = E_Q(q), \qquad \mathbf{e}_d = E_D(d),$$

where $E_Q$ and $E_D$ are encoders based on neural networks, which share architecture or weights, and are engineered to convert textual data into vector space representations.
Upon creation, embeddings are saved in the vector database ChromaDB. This method offers the distinct advantage of permitting an “offline” initialization, where document embeddings are precomputed. Consequently, only the embedding of the query needs to be computed during the search, thereby reducing latency.
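A minimal sketch of this offline/online split is shown below, assuming the chromadb and sentence-transformers packages; the collection name, storage path, and placeholder chunks are hypothetical.

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Offline phase: document chunks are embedded and stored once.
encoder = SentenceTransformer("all-MiniLM-L12-v2")
client = chromadb.PersistentClient(path="./chroma_db")          # hypothetical path
collection = client.get_or_create_collection(name="kb_chunks")  # hypothetical name

chunks = ["First document chunk ...", "Second document chunk ..."]  # placeholders
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=encoder.encode(chunks).tolist(),
)

# Online phase: only the query embedding is computed at search time.
query = "Example question?"
hits = collection.query(query_embeddings=encoder.encode([query]).tolist(), n_results=1)
print(hits["documents"][0])
```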
Cosine similarity was initially chosen over the dot product for comparing text vectors due to its emphasis on contextual similarities rather than purely dimensional or frequency-based similarities. However, since each vector in our study has the same norm (as per the specifications of the embedding model), the distinction between cosine similarity and the dot product becomes negligible. Consequently, the computation of similarities between query and document embeddings was conducted using the dot product, defined as follows:

$$\mathrm{sim}(q, d) = \mathbf{e}_q \cdot \mathbf{e}_d = \mathbf{e}_q^{\top}\mathbf{e}_d.$$
This metric assesses the relevance of each document to the query by examining their similarity in the embedded vector space; higher scores indicate increased relevance. Documents are then ranked according to these scores, and the top-ranked (top-k) documents are forwarded for further processing by the generator component.
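A small self-contained illustration of this scoring and ranking step is given below; the toy embedding vectors are made up purely for demonstration.

```python
import numpy as np

def top_k_by_dot_product(query_emb, doc_embs, k=1):
    """Rank documents by e_q . e_d and return the indices and scores of the k best."""
    scores = doc_embs @ query_emb            # one similarity score per document
    order = np.argsort(scores)[::-1][:k]     # highest scores first
    return order, scores[order]

doc_embs = np.array([[0.1, 0.9, 0.0, 0.4],
                     [0.8, 0.1, 0.1, 0.0],
                     [0.2, 0.7, 0.1, 0.3]])
query_emb = np.array([0.1, 0.8, 0.0, 0.5])
print(top_k_by_dot_product(query_emb, doc_embs, k=2))
```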
In the realm of abstractive question answering, it has been established that dense methods outperform sparse methods. Sparse retrievers, such as BM25, utilize statistical weighting to determine the relevance between search terms and documents. The BM25 scoring function considers term frequency and document length, providing an efficient means of retrieving relevant information. However, its exact-match basis can be limiting when a query and a document are relevant but share no common words.
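For reference, a BM25 sparse retriever can be sketched with the rank_bm25 package; this is an illustration and not necessarily the implementation used for the hybrid retriever in our experiments.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "The Eiffel Tower is located in Paris.",
    "BM25 is a sparse, exact-match retrieval function.",
    "Dense retrievers embed text with a neural encoder.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "where is the eiffel tower".split()
print(bm25.get_scores(query))  # one relevance score per document
```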
Contrary to this, there is a growing body of research suggesting that hybrid methods may offer superior results [26]. These methods, which combine dense and sparse searches, aim to harness the strengths of both approaches. Upon receiving a question, both searches are executed in parallel, producing two lists of potential documents to provide the answer.
Hybrid search requires integrating the results from multiple retrieval methods. A widely adopted algorithm employed to tackle this challenge is Reciprocal Rank Fusion (RRF) [27], an uncomplicated technique for combining document rankings from various information retrieval systems.
For a given set of documents $D$ and the search results returned for the question $q$ by the various retrieval methods $r \in R$, the $\mathrm{RRF}$ score can be computed for each document $d \in D$ as outlined in [28]:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)},$$

where $\frac{1}{k + \mathrm{rank}_r(d)}$ is a reciprocal rank, with $\mathrm{rank}_r(d)$ denoting the rank at which the document $d$ is retrieved by the search mechanism $r$. The variable $k$ is a ranking constant employed to mitigate the impact of outlier systems.
Figure 2 illustrates the calculation of the RRF score with $k$ set to 1. In the provided example, three chunks are retrieved in varying sequences by two search methodologies (sparse and dense searches). The reciprocal rank score for each chunk is computed. These scores are subsequently aggregated to form a new cumulative score. The resulting hybrid list ranks the chunks based on this composite score.
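The fusion step itself reduces to a few lines of code. The sketch below implements the RRF formula above for two ranked lists; the chunk identifiers are illustrative and do not reproduce the exact example of Figure 2.

```python
def reciprocal_rank_fusion(rankings, k=1):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    rankings: one ranked list per retrieval method (e.g., sparse and dense).
    k: the ranking constant (1 here, as in Figure 2; the original RRF paper uses 60).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sparse_ranking = ["chunk_B", "chunk_A", "chunk_C"]
dense_ranking = ["chunk_A", "chunk_C", "chunk_B"]
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking], k=1))
```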
3.3. Generator
The subsequent phase employs an answer synthesis component—a generative large language model within a RAG framework. The purpose of these models is to generate text that is coherent, contextually appropriate, and semantically correct in response to a specific query, often known as a prompt. Generative language models function by estimating the probability distribution of the subsequent token based on the preceding tokens. For a given word sequence $w = (w_1, w_2, \ldots, w_n)$, the goal of a generative language model is to optimize the likelihood of this sequence, which is computed using the chain rule of probability:

$$P(w) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}).$$
Here, $P(w_i \mid w_1, \ldots, w_{i-1})$ represents the conditional probability of the word $w_i$, given the prior sequence of words. The generative language model accepts the query $q$ and the fetched documents $D_q$ as input, and it formulates a response $y = (y_1, \ldots, y_T)$ by sequentially forecasting the subsequent token in the sequence:

$$P(y \mid q) \approx \sum_{d \in D_q} p_{\eta}(d \mid q) \prod_{i=1}^{T} p_{\theta}(y_i \mid q, d, y_{1:i-1}).$$

To put it more formally, $p_{\eta}(d \mid q)$ is the retrieval component that offers a truncated probability distribution for the highest-ranking documents, while $p_{\theta}(y_i \mid q, d, y_{1:i-1})$ is a probability distribution parameterized by $\theta$ that generates the current token based on the query, the retrieved document, and the previously generated tokens. This is performed by the LLM.

In the context of dense retrieval, the probability distribution for the highest-ranking documents may take a functional form like $p_{\eta}(d \mid q) \propto \exp\!\left(\mathbf{e}_d^{\top}\mathbf{e}_q\right)$. Such a formalization of the RAG process reveals how the generative component depends on the query and the retrieved documents.
Generators Tested
In our experiments, we evaluate two variants of the open-source model Llama 2 [29], with 7B and 13B parameters. For all tested models, we consistently used a greedy generation strategy, limiting the response length to a single sentence with a maximum of 50 tokens. Recognizing the limitations of memory and computational resources, we quantized all models to a 4-bit representation using the bitsandbytes library from Hugging Face.
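A sketch of this setup using the transformers and bitsandbytes libraries is given below; the chat checkpoint identifier and the NF4 quantization settings are assumptions rather than the exact configuration used in the experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint; requires access approval
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit weight representation
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "Context: ...\n\nQuestion: ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50, do_sample=False)  # greedy, <= 50 tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```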
3.4. Required Abilities of Generator
Previous studies have predominantly focused on evaluating the end-to-end performance of systems, particularly in their capacity to handle relevant information. However, a notable challenge arises from the presence of irrelevant or misleading information within external knowledge bases. LLMs frequently encounter difficulties in generating reliable output and are susceptible to being misled by inaccuracies in documents. Consequently, this study shifts the evaluation focus to include the LLMs’ ability to manage irrelevant documents and effectively identify and reject misleading information. We have identified three critical abilities essential for LLMs when employing retrieval-augmented generation for question answering: noise robustness, knowledge gap detection, and external truth integration.
Noise robustness is the capability of an LLM to extract useful information from contexts filled with noise. This ability is critical as it ensures that LLMs can still function effectively even when the retrieved documents contain irrelevant or misleading data.
Knowledge gap detection refers to the LLM’s ability to recognize when the necessary knowledge is absent from any retrieved documents and appropriately choose not to answer the question. This capability is vital for preventing the generation of incorrect or misleading responses based on insufficient information.
External truth integration refers to the LLM's capability to provide accurate answers based solely on the retrieved (non-parametric) knowledge, even when that external knowledge contradicts generally accepted facts stored in the model's parametric memory.
3.5. Dataset
To evaluate our proposed approach, we utilize the TriviaQA [30] open-domain dataset. We aligned the TriviaQA open-domain dataset with the SQuAD [31] style of question–answer pairs with corresponding excerpts from evidence documents. TriviaQA is a comprehensive dataset designed to challenge and evaluate short-form question-answering systems. The dataset was sourced from a combination of Wikipedia articles and general web content, providing a rich variety of lexical and syntactic structures. The dataset necessitates reasoning across multiple sentences, making it a robust resource for evaluating the performance of open-domain QA systems.
The initial dataset comprised short-form answers, averaging between 1 and 5 words. To facilitate the study of abstractive question answering, we modified 1100 answers into one-sentence-long answers utilizing the state-of-the-art language model GPT-4.
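A possible reformulation routine is sketched below with the OpenAI Python client; the prompt wording is hypothetical and is not the exact instruction given to GPT-4 during our data preparation.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def to_long_form(question: str, short_answer: str) -> str:
    """Rewrite a short TriviaQA answer as one complete sentence."""
    prompt = (
        f"Question: {question}\n"
        f"Short answer: {short_answer}\n"
        "Rewrite the short answer as a single, self-contained sentence."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(to_long_form("Who wrote 'War and Peace'?", "Leo Tolstoy"))
```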
Figure 3 demonstrates the process of creating the test dataset and knowledge base.
The unfiltered subset of the TriviaQA dataset contains questions that do not have corresponding answer strings. For our system, which relies solely on the knowledge that comes from the vector database rather than its prior knowledge, those questions must be unanswerable. Such a dataset simulates real-world conditions and subsequently serves as a testbed for evaluating the generator's capabilities to manage irrelevant information. The creation process of this 500-question testbed is illustrated in Figure 4.
To assess how accurately answers can be provided based on external truths, despite contradictions from the model's internal knowledge, the Contradictions testbed was constructed by modifying 100 question–answer–context pairs from the test dataset. Specifically, the answer texts were altered as shown in Figure 5.
3.6. Evaluation
3.6.1. Establish the Baseline
To compare the performance of the QA system, we evaluate the following established baselines: closed-book T5 [11] and Llama 2 with 7B and 13B parameters [29], as well as open-book Atlas [32] and RAG [8]. In closed-book settings, only the question was utilized, without retrieving any additional context. Additionally, no prompt engineering technique was used.
All baseline works use the exact match score for evaluation. However, it is not suitable for long-form answer evaluation. Previous studies mostly rely on human evaluation [33,34], which is expensive and difficult to reproduce. Traditional long-form answer evaluation metrics, which rely on n-gram-based methods like BLEU, have demonstrated limitations outside of Machine Translation [35]. In such a scenario, the LLMs-as-judges evaluation method emerges as a promising alternative to human evaluation [36], exhibiting the highest similarity to human evaluation compared to other methods [37,38]. Among the LLMs-as-judges metrics, the Retrieval-Augmented Generation Assessment (Ragas) automated framework [39] was chosen.
Ragas provides robust evaluation metrics, with scores ranging from 0 to 1, for both context retrieval and answer generation tasks. For context retrieval, the primary metric is Context Recall, which assesses how well the retrieved context aligns with the ground truth. It measures the proportion of sentences in the ground truth answer that can be found in the retrieved context. For answer generation, Ragas offers two key metrics: Answer Semantic Similarity and Answer Correctness. Answer Semantic Similarity evaluates how closely the generated answer matches the semantics of the ground truth answer. Answer Correctness, on the other hand, considers both semantic and factual aspects, providing a comprehensive evaluation of answer generation performance.
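Programmatically, the evaluation reduces to assembling question/answer/context/ground-truth records and passing them to Ragas. The sketch below assumes a Ragas release around version 0.1.x; column names and metric imports may differ in other releases, and the record shown is a made-up example.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, answer_similarity, context_recall

records = {
    "question": ["Who wrote 'War and Peace'?"],
    "answer": ["The novel 'War and Peace' was written by Leo Tolstoy."],
    "contexts": [["War and Peace is a novel by the Russian author Leo Tolstoy."]],
    "ground_truth": ["Leo Tolstoy wrote 'War and Peace'."],
}

result = evaluate(
    Dataset.from_dict(records),
    metrics=[context_recall, answer_similarity, answer_correctness],
)
print(result)  # each metric is reported on a 0-1 scale
```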
3.6.2. Experimental Settings
Our basic configuration for the QA system includes the following settings: a chunk size of 50, no chunk overlap, the embedding model 'all-MiniLM-L12-v2', and a top-k value of 1. During the ablation study, we modify one element at a time to assess its impact across various tasks, comparing different settings for each component (a configuration sketch follows the list):
LLM versions: Llama 2 13b, Llama 2 7b.
Retriever: dense, hybrid.
Top-k values: 1, 2, 3, 5, and 8.
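The configuration space can be expressed compactly as follows; the dictionary layout and run loop are illustrative, not the actual experiment driver.

```python
BASE_CONFIG = {
    "chunk_size": 50,
    "chunk_overlap": 0,
    "embedding_model": "all-MiniLM-L12-v2",
    "top_k": 1,
    "llm": "Llama-2-13b",
    "retriever": "dense",
}

ABLATIONS = {
    "llm": ["Llama-2-13b", "Llama-2-7b"],
    "retriever": ["dense", "hybrid"],
    "top_k": [1, 2, 3, 5, 8],
}

def ablation_runs(base, ablations):
    """Yield configurations that change exactly one component of the base setup."""
    for key, values in ablations.items():
        for value in values:
            if value == base[key]:
                continue
            cfg = dict(base)
            cfg[key] = value
            yield cfg

for cfg in ablation_runs(BASE_CONFIG, ABLATIONS):
    print(cfg)  # each configuration would be run through the QA pipeline and scored
```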
4. Experiments and Results
In this section, we conduct a series of experiments. We validate the performance of the QA system by comparing against baselines on the TriviaQA dataset. Subsequently, we employ ablation studies to evaluate the effectiveness of each component. Finally, we aim to analyze the behaviors of the generators tested.
4.1. The Main Results
In this subsection, we present the main results from our evaluation based on information obtained from the sources. The performance of various approaches on the TriviaQA dataset is summarized in Table 1 below.
In closed-book scenarios, Llama 2 13b achieved the highest accuracy of 73.1%, while T5-11B performed least effectively at 50.1%. Among the open-book methods, Atlas-XL + CIT achieved an accuracy of 77.4%, while RAG-Token and RAG-Seq achieved accuracies of 66.1% and 68.0%, respectively. Our approach demonstrated the highest accuracy of 83.3%, showcasing its effectiveness across both closed-book and open-book settings.
Based on the ablation study, the optimal configuration for achieving the best results involved using Llama 2 13b as the generator, coupled with a dense retriever and a top-k value of 1. This setup consistently delivered superior performance compared to other configurations tested, highlighting its effectiveness in the context of this study’s experiments.
4.2. Ablation Study on Retriever
The experiments were performed to compare various retrieval methods using the Context Recall metric. The results, detailed in Table 2, reveal that the dense retriever achieved the highest performance with a top-k value of 1, attaining a context recall of 95% and demonstrating its strong capability to accurately retrieve the most relevant context passages.
The high performance of the dense retriever may be due to the fact that the questions in the dataset do not always align word for word with the answers and necessitate the analysis of multiple sentences to derive an answer. The nature of the task is abstractive rather than extractive, thereby exposing the limitations of the sparse retriever.
4.3. Knowledge Gap Detection
The LLMs were tasked with answering the 500 questions in this testbed. They were instructed to respond “No” when the required information was absent from the retrieved documents. This experimental design aimed to simulate scenarios where LLMs must make informed decisions about whether to answer based on the completeness of available knowledge. The exact match score in Table 3 indicates the percentage of correctly identified unanswerable questions.
Specifically, the Llama 2 7b model attempted to answer 170 questions and rejected 330 questions due to the lack of necessary information, amounting to 66%. The Llama 2 13b model attempted to answer 110 questions and rejected 390 questions, accounting for 78%.
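An instruction of the kind used in this testbed could look like the template below; the wording is a hypothetical reconstruction, not the exact prompt used in the experiments.

```python
# Hypothetical instruction template for the knowledge gap detection testbed.
KNOWLEDGE_GAP_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the information needed to answer,
reply with exactly: No

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    return KNOWLEDGE_GAP_PROMPT.format(context=context, question=question)
```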
4.4. Noise Robustness Results
This experiment evaluated the noise robustness of the generators, measuring their ability to extract useful information from contexts containing varying degrees of noise. Noise refers to retrieved irrelevant or misleading documents. The experiment was specifically constructed to vary the number of retrieval documents (top-k from 1 to 3, 5, and 8) provided to the LLM, thereby introducing varying levels of noise. A noise value of 0 ensured that only the exact relevant context was provided. This experimental setup aimed to simulate real-world scenarios where LLMs must operate effectively amidst noise and irrelevant data.
Figure 6 presents the results.
The results indicate that creating a retriever that extracts the relevant context with minimal noise is a crucial factor that improves the quality of answers. However, even if the retriever accurately provides the necessary information, reliable answers are not guaranteed.
4.5. External Truth Integration
To investigate the discrepancy between the quality of the retrieved context and the reliability of the answers generated by the system based on that context, we conducted an experiment exploring external truth integration.
Hypothetically, the system should rely solely on non-parametric memory, where knowledge can be easily modified (updated, deleted), unlike the knowledge stored in the model's parametric memory. This is achieved through prompts that instruct the model's behavior accordingly. We employed prompts of three styles—strict, standard, and weak (Figure 7)—to evaluate the responses of LLMs in the Contradictions testbed. The results are shown in Figure 8.
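For intuition, the three styles differ mainly in how forcefully they bind the model to the retrieved context; the templates below are hypothetical illustrations, while the exact prompts used in the experiments are given in Figure 7.

```python
# Hypothetical illustrations of strict, standard, and weak prompt styles.
PROMPT_STYLES = {
    "strict": (
        "Answer using ONLY the context below. Ignore anything you may already "
        "know, even if the context contradicts it.\n\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    ),
    "standard": (
        "Use the context below to answer the question.\n\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    ),
    "weak": (
        "Here is some context that may help: {context}\n"
        "Question: {question}\nAnswer:"
    ),
}
```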
From the results, we observe that while the type of prompt significantly influences answer alignment (with stricter prompts aligning more closely with external knowledge), the prompt alone is insufficient to fully adjust the system’s behavior. Even when the retriever accurately provides necessary information, reliable answers are not assured as the generator may prefer the knowledge embedded in its training data.
The lack of clear guidelines on how models combine knowledge from answer evidence documents with their prior knowledge can lead to problems. With LLMs set to be widely used in various fields, users and developers must be aware of potential unintended consequences, especially if they assume RAG-enabled systems are always truthful.
There are two main limitations in our study: (1) QA-RAG was only tested on factoid questions without involving complex reasoning; (2) we did not evaluate the system on GPT-3.5 and GPT-4 due to cost constraints, which prevented us from conducting comprehensive evaluations on other state-of-the-art models.
5. Conclusions
In this study, we rigorously evaluated QA systems using the TriviaQA dataset, comparing closed-book and open-book approaches. Based on the comprehensive evaluation, several key insights and findings emerged. Firstly, all Llama 2 models perform better in the open-book setting than in the closed-book setting. This highlights the advantage of enabling models to access external information sources during answer generation. Our proposed QA-RAG approach with Llama 2 13b excelled with the highest score of 83.3%. Secondly, the ablation study highlighted the efficacy of a dense retriever, achieving a context recall of 95%, which is crucial for enhancing system performance through precise context retrieval. Additionally, the investigation of knowledge gap detection ability revealed that Llama 2 13b correctly identified unanswerable questions with 78% accuracy, compared to 66% for Llama 2 7b. The results underscore the critical importance of developing retrievers capable of extracting relevant context with minimal noise. However, despite effective retrieval, the reliability of answers remains somewhat unpredictable. Finally, our exploration of integrating information that contradicts generally accepted facts into the knowledge base emphasized the complex underlying tension between a model's prior knowledge and the information presented in reference documents, highlighting room for future research and improvements.