1. Introduction
The growing dependence on artificial intelligence (AI) to enhance human functions, especially in customer service settings like contact centers, has attracted considerable interest in recent years. Call center workers frequently need to adjust swiftly to new projects and intricate client inquiries, highlighting the urgent demand for efficient, scalable technologies that facilitate rapid learning and problem solving in these dynamic settings. Virtual assistants utilizing large language models (LLMs) offer an innovative solution to this problem by delivering instant access to pertinent information, thereby minimizing the time needed to address client inquiries. AI technologies in these environments enhance agent efficiency while aligning with broader industry trends toward automating customer service and reducing human error.
This research presents a virtual assistant (VA) tailored for call center agents, utilizing the Retrieval-Augmented Generation (RAG) methodology and the VERTEX AI Palm-2 model. The main objective is to evaluate the effectiveness of the VA in enhancing problem solving by using contextual data drawn from relevant literature and educational methodologies. Additionally, the study examines the possible detrimental impacts of erroneous VA responses on end users. The research aims are situated within the larger framework of AI’s contribution to enhancing operational efficiency and reducing the risks linked to dependence on machine-generated information. This research addresses the particular requirement for contextual accuracy and reliability in the high-stakes customer service domain, in contrast to the many studies concentrating on enhancing LLMs for generalized tasks. This work is significant for its potential to connect human expertise with AI assistance, improving the speed and accuracy of responses in areas where human agents may lack domain knowledge.
Recent studies on AI in customer service indicate substantial advancements in utilizing LLMs, primarily via one-shot and few-shot learning techniques, enabling models to perform well with limited training data. Nevertheless, obstacles persist, particularly in guaranteeing that LLMs deliver precise, contextually pertinent responses while avoiding prevalent issues such as hallucinations and inaccuracies in handling structured data, such as tables. Current methods frequently fail to resolve these problems adequately, especially in multilingual contexts where comprehension is essential. This study extends previous research by employing an enhanced RAG approach within the context of a student office, illustrating a practical application of LLMs in specialized settings.
This paper makes four contributions:
Utilization of the RAG method in conjunction with VERTEX AI for specific contexts: The research introduces a novel use of the RAG approach combined with the VERTEX AI Palm-2 model to develop a VA tailored for university contact centers. This approach is utilized in a specialized setting (university student office), demonstrating the adaptability of AI models to domain-specific tasks where context and precision are essential. This expands the use of RAG and LLMs beyond broad uses, illustrating their capability to address highly specialized issues.
Mitigation of hallucinations in AI responses through contextual learning: A notable challenge with large language models, such as Palm-2, is the production of hallucinations—erroneous or misleading information that lacks foundation in the given context. This paper illustrates how the RAG approach, in conjunction with contextual learning, can reduce the incidence of hallucinations, particularly when handling structured data such as tables. This is especially significant for the deployment of AI within professional settings, where precision is essential.
Evaluation of the adverse effects of AI errors on end users: The research investigates the possible damage inflicted by erroneous or deceptive responses from the VA. It offers a framework for comprehending the effects of AI-generated errors on users by assessing user testing findings and performing a thorough evaluation of the VA’s correctness. This contribution is essential for developing secure and dependable AI systems, especially in educational and customer service environments where inaccurate information may result in considerable adverse outcomes.
Implementation of a functional VA system for educational institutions: The paper illustrates the practical execution of a VA tailored for a university’s student office, emphasizing its real-world relevance. The research offers empirical evidence regarding the efficacy of AI in automating administrative tasks, alleviating workload, and enhancing response times by evaluating the system in a controlled setting with real users. This contribution connects theoretical AI models with their actual implementations in academic settings, providing a scalable option for institutions aiming to adopt AI-driven systems.
The primary objective of this research is to assess the VA’s capacity to improve problem-solving skills for call center agents, especially in educational contexts where precise information retrieval is critical. The study validates the VA’s efficacy in producing dependable replies and highlights opportunities for additional enhancement. The research enhances the discourse on optimizing AI to aid human decision-making in real-time, high-pressure contexts by examining its potential advantages and constraints.
This paper is organized as follows: in the next section, we discuss the current state of the research field, followed by an explanation of the business model and hypotheses of this research. In the following sections, we present the solution architecture, data preparation, the methodology used to integrate the generative language model, and the results. In the last two sections, we outline potential future research directions and present our conclusions.
2. Related Works
Incorporating AI-driven virtual assistants in educational environments has garnered significant interest, as they enhance the student experience by providing prompt assistance and information regarding academic and administrative inquiries. The “CollegeBot” system offers automated replies for student inquiries regarding class schedules and placements, utilizing Google API and Dialogflow for instantaneous responses [
1]. Comparable implementations encompass an intelligent virtual assistant created for students at Minnesota universities, designed to offer ongoing academic assistance [
2]. An additional AI-driven system emphasizes personalized learning, utilizing natural language processing to tailor content delivery to distinct learning styles, thereby improving student engagement and satisfaction [
3]. AI chatbots have been employed in online learning contexts, such as Malaysia’s MERLIN project, which enhanced students’ comprehension of course materials during the pandemic [
4]. These applications illustrate the versatility of AI virtual assistants in various educational settings.
AI virtual assistants have demonstrated significant efficacy in enhancing the student learning experience by responding to inquiries beyond conventional office hours [
5]. In higher education, advanced AI platforms, such as GPT-3, have automatically addressed curriculum-related inquiries [
6]. AI-based virtual assistants in education have demonstrated their ability to support students academically while alleviating the workload of educators by managing routine inquiries [
7].
The RAG method improves the efficacy of AI virtual assistants by integrating a pre-trained generative language model with a retrieval system. This hybrid methodology enables the system to retrieve pertinent information from external knowledge sources and produce contextually suitable responses. The RAG architecture has been acknowledged for enhancing the accuracy of virtual assistants in responding to specific, domain-oriented inquiries. A prevalent application of RAG entails its integration into educational assistants, facilitating the retrieval and provision of real-time academic and institutional data in response to student inquiries [
8]. Research indicates that integrating retrieval mechanisms markedly improves the precision and pertinence of AI-generated reactions, allowing the AI to access current information rather than relying solely on pre-trained knowledge.
RAG has been pivotal in resolving various academic inquiries in the educational sector. Retrieval mechanisms enable virtual assistants to address intricate and dynamic inquiries regarding curriculum, academic policies, and administrative services by accessing real-time institutional databases [
9]. Integrating retrieval mechanisms with AI assistants augments information accuracy and markedly enriches the learning experience by delivering precise answers on demand.
The VERTEX AI Palm-2 model is a sophisticated AI solution engineered to enhance the deployment and scalability of machine learning models, rendering it an optimal platform for educational applications. Palm-2 utilizes Google Cloud’s Vertex AI to optimize the administration and implementation of AI systems, facilitating the integration of AI into educational institutions’ existing infrastructures. The model efficiently processes large-scale, complex queries, which is crucial for addressing the substantial volume of student inquiries on university websites.
Research on the implementation of virtual assistants utilizing Palm-2 has shown its efficacy in addressing a variety of academic and administrative inquiries, allowing institutions to provide a cohesive support system for students and staff [
10]. Integrating Palm-2 with AI platforms like Google Cloud provides a scalable and customizable solution, making it suitable for large institutions that handle thousands of student interactions daily [
11]. Additionally, Palm-2 augments the system’s capacity to deliver personalized responses by assimilating user interactions, enhancing the quality of the information provided [
12].
Research on RAG methodologies has advanced considerably in recent years. RAG offers a method to improve the efficacy of generative models by integrating retrieval systems with large language models. Numerous studies have investigated the capabilities and challenges of RAG in various contexts. Shahul et al. (2023) presented the RAGA framework to assess RAG pipelines without ground truth annotations, facilitating a more rapid and efficient evaluation of these systems [
13]. You (2022) illustrated how RAG can enhance models’ scalability and knowledge retrieval when addressing heterogeneous knowledge sources, offering potential solutions to challenges in single-source knowledge retrieval [
14].
Nakhod (2023) examined the application of RAG in low-code development environments, demonstrating that incorporating domain-specific knowledge into LLMs can enhance developer efficiency by delivering more precise and pertinent information retrieval [
15]. Furthermore, Chen et al. (2023) introduced the RGB benchmark to methodically assess the efficacy of RAG across various languages and models, pinpointing substantial opportunities for enhancement in information integration and mitigating hallucinations in LLM outputs [
16].
Other researchers have implemented RAG in particular domains. Manathunga and Illangasekara (2023) investigated its application in medical education, creating a summarization framework capable of managing extensive, unstructured datasets while ensuring precision and minimizing hallucinations [
17]. Arora et al. (2023) presented the GAR-meets-RAG paradigm, which exhibited improved retrieval accuracy and generation in zero-shot information retrieval tasks, a vital factor in augmenting the precision of AI responses in knowledge-deficient contexts [
18]. Lewis et al. (2020) offered essential insights into integrating the RAG model with Wikipedia to improve its knowledge-intensive tasks, including open-domain question answering [
19].
Palm-2, an advanced language model, is optimized for extensive and multilingual tasks, demonstrating notable enhancements in performance, efficiency, and reasoning abilities. Palm-2 surpasses its predecessor PaLM in numerous reasoning tasks, notably improved multilingual capabilities and computational efficiency [
20]. Chowdhery et al. (2022) introduce PaLM, a language model created via Pathways to attain high efficiency and exceptional performance on benchmarks, particularly in few-shot learning tasks [
21]. The Med-PaLM 2 model, an iteration of Palm-2, exhibited remarkable efficacy in medical question-answering tasks, achieving a significant performance milestone by exceeding human-level performance across various medical datasets [
22]. PaLM-E is an embodied multimodal language model that integrates real-world sensor modalities with language comprehension, essential for addressing intricate robotic and reasoning challenges [
23]. Additional applications of Palm-2 are apparent in biological domains, where it has been employed to analyze gene-phenotype associations for genetic discovery, utilizing AI to formulate innovative hypotheses [
24].
3. Hypothesis and Functional Specification
This paper uses the RAG methodology with the VERTEX AI Palm-2 model to create a virtual assistant that supports call center agents. The virtual assistant discussed in this paper seeks to expedite query responses by utilizing context from pertinent literature to improve response precision. The hypotheses tested in this paper are evaluated using empirical data from the student office at Algebra University in Zagreb, Croatia.
The virtual assistant functions as a resource for call center agents by addressing customer inquiries and validating the accuracy of generated responses against retrieved literature. The application aims to enhance query resolution efficiency and ensure information accuracy by utilizing an LLM that retrieves semantically analogous documents from a knowledge base to inform responses. The system accommodates queries with and without pertinent context. Without relevant context, the system advises the user on acquiring the requisite information, thereby reducing the likelihood of erroneous or fabricated responses.
The application includes adjustable model parameters, including temperature, topK, topP, and token limits, to regulate the variability and length of responses. These parameters are essential for adapting the LLM’s behavior to various tasks. The system employs two models, one for response generation and one for retrieving semantically similar literature, with established thresholds for acceptable semantic similarity. The system is designed to limit the context utilized in a response to a maximum of 8000 characters.
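The following minimal sketch illustrates how such parameters could be passed to the text-bison model through the Vertex AI Python SDK; the project identifier, parameter values, and prompt template are illustrative assumptions rather than the study’s actual configuration.

```python
# Hedged sketch: calling the PaLM-2 text-bison model on Vertex AI with the
# tunable parameters described above. Values shown are illustrative only.
import vertexai
from vertexai.language_models import TextGenerationModel

MAX_CONTEXT_CHARS = 8000  # context limit stated in the specification


def generate_answer(question: str, context: str = "") -> str:
    vertexai.init(project="my-gcp-project", location="us-central1")  # hypothetical project/region
    model = TextGenerationModel.from_pretrained("text-bison")

    # Truncate the retrieved context so the prompt respects the 8000-character limit.
    prompt = f"Context:\n{context[:MAX_CONTEXT_CHARS]}\n\nQuestion: {question}\nAnswer:"

    response = model.predict(
        prompt,
        temperature=0.2,        # low randomness for factual answers
        top_k=40,               # sample from the 40 most likely tokens
        top_p=0.8,              # nucleus sampling threshold
        max_output_tokens=512,  # cap on response length
    )
    return response.text
```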
3.1. Hypothesis
This paper is based on two principal hypotheses. The first hypothesis investigates the efficacy of learning methodologies without samples (zero-shot learning) versus those utilizing one or more samples (one-shot and few-shot learning) in addressing inquiries pertinent to a student office, particularly when augmented with relevant context. The model’s capacity to comprehend a query and produce a precise response is central to this hypothesis.
The second hypothesis evaluates the Palm-2 LLM’s capacity to produce responses that consistently do not substantially harm users. It assesses whether erroneous responses could prompt users to undertake misguided actions, such as overlooking critical deadlines or failing to comprehend their rights to contest specific decisions.
3.2. Functional Specification
The virtual assistant researched in this paper is purposefully created to aid call center personnel by swiftly producing precise and contextually appropriate responses. The assistant employs the RAG approach to extract pertinent literature from our database and integrate it into the responses made by the Palm-2 model. The established architecture prioritizes speed and accuracy in processing user queries, especially when pertinent contextual data is accessible.
The virtual assistant operates in two modes, producing responses with or without pertinent material. In the absence of suitable literature in the database, the assistant nevertheless delivers a response while offering direction on acquiring the requisite material, mitigating the danger of hallucinations (the generation of erroneous or irrelevant data). Our system is extensively configurable, enabling precise adjustments to the model’s performance through parameters such as randomness (temperature, top-K, and top-P settings), the maximum token count in responses, and the semantic similarity threshold, guaranteeing the utilization of only highly pertinent documents for context.
Additionally, a helper to extract literature through a cosine similarity search from a vector database was designed as a part of the architecture. The system guarantees the selection of just the most semantically analogous sources for a specific query by implementing stringent similarity criteria. The design has three fundamental phases: indexing, retrieval, and response creation. In the indexing phase, input texts from diverse sources—such as PDFs and websites—are decomposed into smaller segments, each stored in the vector database with associated embeddings. During the retrieval phase, the system queries the database for the most semantically pertinent samples to serve as a context in response creation. Ultimately, the model produces a response, utilizing the contextual information to offer relevant answers.
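As a minimal sketch of the similarity check described above, the helper below scores stored chunks against a query embedding and keeps only those above a threshold; the 0.7 value and the in-memory chunk list are assumptions for illustration.

```python
# Minimal sketch of the cosine-similarity selection step; threshold is assumed.
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # assumed value; the paper only states that a threshold is used


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_context(query_vec: np.ndarray, chunks: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    """Return up to k chunk texts whose similarity to the query exceeds the threshold."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(reverse=True, key=lambda pair: pair[0])
    return [text for score, text in scored[:k] if score >= SIMILARITY_THRESHOLD]
```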
This design enhances the assistant’s efficiency. Every query is addressed autonomously, indicating that the assistant has no recollection of prior exchanges. This guarantees that responses concentrate exclusively on the present inquiry and the most pertinent context, minimizing the likelihood of redundancy or extraneous material. The RAG technique enhances the model’s accuracy, especially in multilingual contexts, by grounding replies in specific, validated facts.
There are also challenges with this architecture. A primary concern is the possibility of hallucinations in the absence of pertinent context. Furthermore, the model’s reliance on strict contextual grounding may constrain the inventiveness of its responses. Future iterations are expected to resolve these limitations by integrating advanced training and optimization methodologies, such as augmented adversarial training, iterative retrieval generation, and domain-agnostic robustness techniques. As Wu et al. demonstrated, augmented adversarial training enables models to align more effectively with the intended retrieval domains by exposing them to semantically varied data during training. This technique allows the model to distinguish relevant from irrelevant content, improving retrieval accuracy in diverse contexts [
24]. Additionally, Hoang et al. highlight the importance of training with shuffling and domain-agnostic samples to mitigate domain mismatches in retrieval-augmented systems. By doing so, models can maintain performance even when exposed to unexpected data variations, enhancing their robustness without needing constant retraining on specific datasets [
25].
Another promising direction involves iterative retrieval-generation processes, where the model dynamically retrieves and integrates new information as needed to refine responses continually. Jiang et al. propose a method where retrieval occurs throughout the generation process, allowing for contextually appropriate updates that improve output accuracy and reduce irrelevant information [
26]. Complementing this, Shao et al. present an iterative retrieval-generation synergy model, which enhances model outputs by integrating newly retrieved data at each generation step, making it well-suited for dynamic or evolving datasets [
27]. Together with retrieval-enhanced adversarial training, as discussed by Yang et al., which leverages adversarial frameworks to further refine the generation quality by exposing models to N-best response candidates, these techniques lay the groundwork for future iterations that are more adaptable, contextually aware, and accurate in retrieval and response generation [
28].
Our functional specification delineates a virtual assistant that markedly improves the efficiency of call center operations by providing accurate, context-specific responses. This solution efficiently minimizes the time agents need to handle issues while maintaining the accuracy and relevancy of the information supplied.
3.3. Solution Architecture
The architecture proposed in this paper is based on the RAG method combined with the Google Vertex AI Palm-2 model. The system comprises three primary phases: indexing, retrieval, and response creation.
During indexing, documents from diverse sources, including PDFs and webpages, are deconstructed into smaller text segments. Subsequently, the segments are saved in a vector database alongside their corresponding embeddings, enabling the system to perform semantic searches. The vector database employs cosine similarity to match incoming queries with the most pertinent literature in the database.
Data preparation is conducted in advance. Data from university sources is processed, cleaned, and stored in a vector database, with each text chunk assigned a fixed vector representation. The preprocessed and indexed data is utilized to respond to inquiries, indicating that the preparation was not conducted on the fly but was completed before query handling.
Upon submitting a query, the system searches the vector database for semantically analogous text according to the established cosine similarity level during retrieval. The model utilizes BERT (Bidirectional Encoder Representations from Transformers) as the core retrieval mechanism to search the vector database for semantically analogous text. BERT is pre-trained on a large corpus and fine-tuned for contextual understanding, allowing it to create dense embeddings for user queries and database entries. This architecture supports semantic search by encoding contextual information, helping the model capture nuanced meanings and find semantically relevant results even when exact keyword matches are absent. The model applies a cosine similarity metric to measure the relevance of the query and the text embeddings, leveraging BERT’s capabilities for contextual matching, which enhances retrieval accuracy and relevance [
29]. BERT’s use in dense retrieval has shown strong performance in handling synonym and polysemy issues by representing words and sentences in a rich, multi-dimensional space, effectively bridging the gap between user intent and document content. Research has demonstrated that BERT-based dense retrieval models, such as those described by Zhan et al. and Wang et al., are particularly effective for tasks requiring deep semantic alignment, offering significant improvements over traditional lexical retrieval models like BM25. These advancements make BERT an ideal model for enhancing retrieval quality in complex, context-dependent searches [
30,
31]. Without pertinent documents, the model produces a context-free reply containing guidance for acquiring the required information.
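A hedged sketch of this retrieval step is shown below, assuming a BERT-based encoder from the sentence-transformers library as a stand-in for the paper’s embedding model and a hypothetical document_chunks table in PostgreSQL with pgvector; an empty result corresponds to the context-free fallback described above.

```python
# Hedged sketch of the retrieval phase; table name, model name, connection
# string, and threshold are assumptions for illustration.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("bert-base-nli-mean-tokens")  # illustrative BERT-based encoder
SIMILARITY_THRESHOLD = 0.7  # assumed value


def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    query_vec = np.asarray(encoder.encode(question))
    conn = psycopg2.connect("dbname=smac")  # hypothetical connection string
    register_vector(conn)                   # lets psycopg2 pass numpy arrays as pgvector values
    with conn, conn.cursor() as cur:
        # pgvector's <=> operator returns cosine distance (1 - cosine similarity).
        cur.execute(
            "SELECT content, 1 - (embedding <=> %s) AS similarity "
            "FROM document_chunks ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, query_vec, top_k),
        )
        rows = cur.fetchall()
    # An empty list signals the assistant to answer without context and instead
    # point the user to where the required information can be obtained.
    return [content for content, similarity in rows if similarity >= SIMILARITY_THRESHOLD]
```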
During the concluding phase of answer generation, the obtained context is transmitted to the Palm-2 model, which constructs a response informed by the input query and the pertinent context. The architecture is engineered for scalability and accuracy, guaranteeing that each query is processed autonomously, avoiding superfluous repetition or extraneous information. The architecture’s focus on context retrieval markedly improves model accuracy while maintaining quick reaction times. The architecture proposed in this paper can be seen in
Figure 1:
The proposed architecture can be examined through its building blocks, which consist of three main components: the database, the server application, and the client application.
The integration of pgvector extends the PostgreSQL relational database with vector-based data storage and querying, enabling the system to execute complex searches based on semantic similarity, which is essential for the assistant’s operation. Vector-based storage supports applications such as natural language processing and recommendation systems, enabling the assistant to extract the most contextually pertinent information from a collection of documents.
The server application, created in Python using the Flask framework, is the foundation of the assistant. Two primary API methods are incorporated: one for generating predictions without context and another employing the Retrieval-Augmented Generation technique to enhance answer accuracy by integrating pertinent documents from the database. The server manages user query preparation, document retrieval, and communication with the VERTEX AI model, including the most relevant literature when necessary.
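A simplified sketch of these two endpoints is given below; the route names, payload fields, and the assistant module providing generate_answer and retrieve_context (the helpers sketched earlier) are assumptions, not the actual server code.

```python
# Hedged sketch of the two Flask API methods described above.
from flask import Flask, jsonify, request

# Hypothetical module exposing the helpers sketched earlier in this section.
from assistant import generate_answer, retrieve_context

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict_without_context():
    """Prediction without context: forward the question directly to the model."""
    question = request.get_json()["question"]
    return jsonify({"answer": generate_answer(question)})


@app.route("/predict-rag", methods=["POST"])
def predict_with_rag():
    """RAG prediction: retrieve semantically similar literature, then generate."""
    question = request.get_json()["question"]
    chunks = retrieve_context(question)
    answer = generate_answer(question, context="\n\n".join(chunks))
    return jsonify({"answer": answer, "sources": chunks})


if __name__ == "__main__":
    app.run(port=5000)  # port is illustrative
```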
The SMAC (Subject Matter Assistant Client) application on the client side has been developed using the Next.js framework, which provides a user interface for query input and response presentation. Every question is handled autonomously, ensuring that no record of prior exchanges is retained. The client application aims to enhance the user experience through visual components, including a question submission form, response display, and context visualization, which allow users to validate the precision of the model’s responses.
The main interface components and their functions within the SMAC virtual assistant application are presented in
Figure 2. The interface is initiated with the (1) Title and Logo linked to the university’s official website, enhancing the user experience by providing a recognizable branding element. A (2) GitHub Link is provided to direct users to the project’s GitHub repository, where the source code and technical documentation can be accessed for transparency and further exploration. The question entered by the user is displayed in the (3) User Query field, which initiates interaction with the virtual assistant, while the response generated by the assistant is shown in the (4) Answer to Query section. The combined input, including context, model instructions, and the user’s question, is contained within the (5) Full Query box, which the VERTEX AI API subsequently processes. The literature snippets retrieved using the RAG method are provided in the (6) Source field, ensuring that responses are grounded in contextually relevant information. Instructions regarding the assistant’s role and the user’s question are incorporated into the (7) Model Query, guiding the model in response generation. Inquiries may be entered by users in the (8) “Question” Form, a dedicated text input field, and submitted via the (9) “Send” Button, which transmits the query to the server for processing. Finally, the (10) “Clear” Button enables the deletion of the displayed query history from the interface, ensuring that the workspace remains organized and ready for new interactions. This comprehensive layout supports a user-friendly workflow, enhancing engagement with the virtual assistant while ensuring that responses are both relevant and well documented.
This architecture was selected for its efficiency. PostgreSQL is integrated with pgvector for semantic similarity search, facilitating accurate information retrieval. The RAG technique improves precision by integrating pertinent materials into model queries, diminishing hallucinations. The separation of the server (Flask) and client (Next.js) facilitates rapid processing and an intuitive interface for query submission and response verification.
The virtual assistant determines the absence of suitable literature by conducting a cosine similarity search within a vector database, where each query is evaluated against a threshold for semantic similarity. If the similarity score falls below this threshold, indicating that no sufficiently relevant documents exist, the model flags the absence of context and generates a response without specific retrieved information, instead guiding the user on how to acquire the required information. This approach minimizes the risk of generating irrelevant or fabricated content, aligning with methods that ensure retrieval relevance and manage context scarcity through similarity-based thresholds [
14,
26].
3.4. RAG Method
The solution comprises three essential phases utilized in the RAG method: indexing, retrieval, and response generation. Indexing entails processing and structuring documents to facilitate rapid and efficient access to information, while retrieval focuses on identifying semantically relevant text that aligns closely with the user’s query. The responses are grounded in an indexed content pool, resulting in answers derived from carefully acquired contextual data. This process significantly enhances the reliability of responses, as the model does not merely generate text but grounds it in pertinent information.
Mitigating hallucinations—a phenomenon characterized by the fabrication of inaccurate or misleading information by models—is one of the primary objectives of the RAG method. Providing the model with relevant context reduces the likelihood of hallucinations; however, the approach still faces several challenges. These include the risk of retrieving irrelevant literature, which may dilute the quality of the response, the introduction of redundancy within the context, and sporadic hallucinations in specific instances where the retrieved content is insufficiently aligned with the query.
These challenges are counteracted by incorporating a robust query formatting mechanism within the architecture. The mechanism ensures that only the most pertinent context is employed during response generation, with redundant data filtered out. Additionally, the architecture refines the retrieved documents through ranking based on relevance, which enhances the model’s ability to generate accurate and concise responses. The role of query formation in shaping output is critical, as user queries are structured to align better with the content of the document repository.
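The sketch below illustrates one way such query formatting could work: ranked chunks are de-duplicated and concatenated into the prompt until the 8000-character budget described earlier is reached; the prompt wording is an assumption.

```python
# Minimal sketch of assembling the prompt context from ranked, de-duplicated chunks.
MAX_CONTEXT_CHARS = 8000  # context limit stated in the specification


def build_prompt(question: str, ranked_chunks: list[tuple[float, str]]) -> str:
    """ranked_chunks: (similarity, text) pairs sorted from most to least relevant."""
    seen, parts, used = set(), [], 0
    for _score, text in ranked_chunks:
        snippet = text.strip()
        if snippet in seen:                          # drop redundant, repeated passages
            continue
        if used + len(snippet) > MAX_CONTEXT_CHARS:  # respect the context budget
            break
        seen.add(snippet)
        parts.append(snippet)
        used += len(snippet)
    context = "\n\n".join(parts)
    return (
        "You are a student-office assistant. Answer using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```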
Moreover, the system leverages semantic search capabilities, allowing it to detect nuanced relationships between terms and concepts and making retrieval more aligned with user intentions. Accuracy is improved and retrieval latency is reduced, allowing the system to respond more promptly.
Figure 3 illustrates the implementation of the RAG method, outlining the workflow for query formation and contextual response generation.
The enhancements made to the system’s architecture specifically aim to overcome the inherent challenges within the RAG framework. This design decision is a testament to the balance between delivering precise responses and maintaining an efficient retrieval process.
3.5. Data Preparation
As the study outlines, the data preparation process is characterized by a systematic and rigorous approach to data acquisition from various sources, particularly web pages and PDF documents. This method ensures the reliability and quality of the data. The data is initially stored in plain text format to ensure uniform processing. Once acquired, the raw data undergoes a comprehensive refinement process to eliminate extraneous and potentially distracting content. This cleaning stage removes non-essential elements such as headers, footers, hyperlinks, personal names, and redundant information that may be repeated across multiple pages and does not contribute informational value. Text alignment and spacing are normalized for readability. Furthermore, any personal information is removed to ensure privacy, and complex or redundant tables are simplified by reorganizing their data into smaller, semantically meaningful chunks. A vector representation is then assigned to each chunk for efficient retrieval based on semantic similarity, ensuring that essential information remains accessible in a more user-friendly format.
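The following sketch illustrates the kind of cleaning pass described above using simple regular-expression heuristics; the concrete patterns for headers, footers, and personal data are placeholders, since the paper does not specify the exact rules.

```python
# Illustrative cleaning pass for raw page text; all patterns are placeholder heuristics.
import re


def clean_page_text(raw: str) -> str:
    text = raw
    text = re.sub(r"https?://\S+", "", text)           # strip hyperlinks
    text = re.sub(r"(?m)^Page \d+ of \d+$", "", text)   # placeholder header/footer pattern
    text = re.sub(r"\S+@\S+\.\S+", "", text)            # remove e-mail addresses (personal data)
    text = re.sub(r"[ \t]+", " ", text)                 # collapse repeated spaces
    text = re.sub(r"\n{3,}", "\n\n", text)              # normalize blank lines into paragraphs
    return text.strip()
```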
Following the initial cleanup, the text is organized into meaningful paragraphs, which enhances coherence and readability. Unnecessary spaces within the text are removed, ensuring a streamlined and concise format. Complex elements such as tables, which may not be directly readable in text format, are converted into a more legible structure that retains the essential information. As illustrated in
Figure 4, the preparation phase ensures that the data is optimal for further processing and analysis.
After the data’s sanitization, segmentation into smaller, manageable portions is performed. The segmentation is designed to enable finer-grained analysis and to assist in the alignment of the data with specific user queries during the retrieval phase. Each segment is transformed into a vector representation, which encodes the semantic content of the text in a numerical format. The importance of this vectorization is underscored by its role in enabling the system to perform semantic similarity retrieval, whereby user queries are matched with relevant information based on meaning rather than mere keyword similarity.
Subsequent storage of these vectorized segments occurs in a specialized database designed to support efficient similarity searches. Data is indexed in this format, allowing for the rapid retrieval of contextually pertinent information, which makes it suitable for applications such as natural language processing and recommendation systems. This approach improves retrieval accuracy, and response times are optimized, allowing the system to handle large datasets effectively.
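A hedged sketch of this indexing step is shown below: cleaned text is split into overlapping chunks, embedded with a BERT-based encoder (assumed here to be a sentence-transformers model), and inserted into a hypothetical document_chunks table in the pgvector-enabled database; chunk sizes are illustrative.

```python
# Hedged sketch of the indexing step; table name, model name, and chunk sizes are assumptions.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("bert-base-nli-mean-tokens")  # illustrative BERT-based encoder


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows (sizes are illustrative)."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]


def index_document(text: str) -> None:
    conn = psycopg2.connect("dbname=smac")  # hypothetical connection string
    register_vector(conn)
    with conn, conn.cursor() as cur:
        for chunk in chunk_text(text):
            embedding = np.asarray(encoder.encode(chunk))
            cur.execute(
                "INSERT INTO document_chunks (content, embedding) VALUES (%s, %s)",
                (chunk, embedding),
            )
```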
This systematic preparation enhances the model’s ability to retrieve relevant information accurately and quickly, improving the semantic similarity retrieval process’s overall performance.
3.6. VERTEX AI in the Proposed Solution Architecture
The proposed solution architecture uniquely applies VERTEX AI in three distinct ways, leveraging its language model capabilities for handling various tasks. The paper explicitly utilizes the Bison (text-bison) model within the PaLM-2 family of models on the VERTEX AI platform. This model, designed for text-based tasks, is employed in the paper for predictions and response generation, playing a crucial role in the architecture.
Prediction without context: In this method, the model is asked to provide answers without any supporting context. The user poses a question, and the model responds based on its internal knowledge and instructions on the role it should assume. However, this method often leads to hallucinations—instances where the model generates information that is not grounded in factual or contextual data. These issues are discussed in the testing results section, where it was observed that the model frequently provided inaccurate responses.
Contextual prediction (using the RAG method): The second approach integrates the RAG method, wherein the model is supplied with relevant context before generating responses. This context typically comes from a set of documents or literature provided to the model. In this method, the model uses the supplied information to refine its answers, significantly improving accuracy. The RAG method plays a crucial role in solving queries that rely on external data and is central to the hypotheses tested in the paper.
The third approach, learning from one or more patterns, involves providing the model with examples of correct responses (one-shot or few-shot learning) to guide its predictions. This method, akin to how humans learn from examples, is particularly beneficial when limited contextual information is available, as it helps the model understand how to answer future questions, thereby enhancing its predictive capabilities.
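The sketch below shows how such a few-shot prompt could be assembled; the example question-answer pairs are invented placeholders rather than items from the study’s curated set.

```python
# Minimal sketch of few-shot prompt construction; example pairs are invented placeholders.
FEW_SHOT_EXAMPLES = [
    ("How do I request an enrollment certificate?",
     "Submit the request through the student office portal; processing takes up to three working days."),
    ("Where can I see my exam registration deadlines?",
     "Deadlines are listed in the academic calendar available in the student information system."),
]


def build_few_shot_prompt(question: str) -> str:
    parts = ["Answer student-office questions in the same style as the examples."]
    for example_q, example_a in FEW_SHOT_EXAMPLES:
        parts.append(f"Question: {example_q}\nAnswer: {example_a}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```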
The usage of VERTEX AI in these three ways demonstrates how different generative techniques can be applied to improve language model performance in specific scenarios, balancing hallucination risks against the accuracy gains from additional context.
4. User Testing Methodology
The functionality and reliability of a VA specifically developed to assist call center agents within an academic setting were evaluated through user testing in this study. The design of user testing was aimed at addressing two main aspects: first, the examination of the VA’s ability to generate accurate and contextually relevant responses, and second, the identification of potential adverse effects of incorrect or misleading responses on end users. Given the importance of precise and timely information in educational institutions, the VA must provide dependable assistance to students and administrative staff without introducing confusion or inaccuracies.
The validation of AI-driven applications, especially in contexts where the information provided may influence user actions and decisions, is critically supported by user testing. In this study, the effectiveness of the VA is assessed not only in terms of response accuracy but also in its ability to understand and incorporate context into responses, thereby minimizing errors. The significance of this requirement is underscored by the potential repercussions associated with incorrect responses in an educational setting. Consequently, user testing aims to identify areas in which the VA performs well and areas in which refinement may be required to meet the high standards of reliability expected within a university’s administrative support framework.
Two hypotheses were formulated to structure the evaluation. Hypothesis 1 posits that the VA can effectively understand and respond to user queries when contextual data is provided. The core expectation that the VA can utilize contextual information to enhance response accuracy and relevance is reflected in this hypothesis. The VA leveraged the RAG methodology to access and apply semantically relevant literature from its database, allowing for the generation of contextually appropriate responses that address user queries accurately. It is emphasized that the VA can process contextual data effectively and align responses with user intent. Hypothesis 2 proposes that the VA limits the potential for harm to end users by minimizing incorrect or misleading information. The risks associated with hallucinations or erroneous outputs, which are inherent in many large language models, are addressed by this hypothesis. Given the reliance on generated responses in a university’s operational setting, assessing whether the VA can provide accurate responses without introducing information that could negatively affect user decisions is crucial. The relevance of this hypothesis is particularly noted in scenarios where inaccurate responses may misinform students or staff regarding critical processes, such as course registration deadlines or appeals procedures.
The user testing methodology was carefully designed to align with these hypotheses, ensuring a structured and systematic evaluation of the VA’s performance. A representative group of participants, consisting of students and administrative staff, was included in the testing process, and they interacted with the VA in simulated real-world scenarios. Feedback was gathered from a diverse participant base, allowing for reflection on the different types of inquiries that the VA might encounter in daily operations.
To assess Hypothesis 1, various query types were incorporated into the testing framework, each designed to examine the VA’s ability to retrieve and apply relevant contextual data effectively. Questions were submitted to the VA by participants in two modes: one mode allowed access to contextual data, while the other did not provide such data. The extent to which contextual information influenced the VA’s performance was determined by comparing responses generated in these two modes. Each response was evaluated based on accuracy, relevance, and coherence, with particular attention given to the VA’s ability to interpret and apply context effectively.
The testing approach for Hypothesis 2 was centered around identifying instances of incorrect or misleading responses, with an assessment of their potential impact on end users being conducted. Participants rated the VA’s responses based on clarity, accuracy, and any perceived risk of harm. Responses deemed potentially harmful, whether due to inaccuracies or the inclusion of misleading information, were flagged for further analysis. The frequency of erroneous responses was measured in this component of the user testing, along with the severity of these errors, considering the implications of such errors in a real-world academic context.
Each participant’s interaction with the VA was scored utilizing a structured evaluation framework, incorporating binary ratings (such as correct vs. incorrect) and Likert scale ratings for relevance, coherence, and overall user satisfaction. The application of a mixed-methods approach facilitated the acquisition of quantitative and qualitative insights regarding the performance of the VA through the user testing methodology. The percentage of accurate responses and the frequency of hallucinations were utilized to provide a high-level overview of the reliability of the VA. Qualitative feedback was captured to reflect user perceptions, particularly concerning the VA’s ability to handle complex or ambiguous queries and its overall usefulness as an academic support tool.
The outcomes of this testing methodology provide a comprehensive understanding of the VA’s strengths and limitations. The results offer empirical evidence to support or refute the hypotheses, guiding subsequent phases of refinement and optimization for the VA. The user testing approach is a robust framework for evaluating the VA’s readiness for deployment within an educational institution by systematically addressing accuracy and potential harm.
4.1. Participants
During the user testing phase, 187 participants were engaged to assess the effectiveness and reliability of the virtual assistant system. A diverse group was included, comprising students from various academic levels and one professor, representing a realistic cross-section of potential users who might interact with the VA in an educational setting. This participant profile aimed to capture a comprehensive understanding of how well the VA could respond to the diverse needs of an academic environment by simulating a wide array of real-life scenarios. The larger sample size of 187 participants facilitated a robust evaluation across different types of user inquiries, helping to identify consistent patterns in performance and user satisfaction.
4.2. Testing Environment
A simulated operational setting was utilized to test the VA, designed to mimic its intended deployment within a university’s student office. Participants used a client application developed with the Next.js framework in this environment, providing an intuitive interface for submitting queries and viewing responses. Each query was handled by the client application as an independent interaction, ensuring that no residual data from previous queries influenced subsequent responses. The setup was designed to enable the VA to autonomously address each question, focusing on the current input and maintaining a “stateless” approach to maximize response clarity.
The backend of the VA system was powered by a server application based on Flask, which was responsible for managing query processing and facilitating communication with the VERTEX AI model. This server handled the RAG method, with relevant contextual data being retrieved from a PostgreSQL database that was enhanced with pgvector for vector-based querying. The VA leveraged the RAG methodology to integrate contextual information when necessary, resulting in more accurate and relevant responses. The testing environment was designed to assess the performance of the VA under realistic operational conditions, examining its ability to manage user queries and deliver coherent responses seamlessly.
4.3. Test Design
A structured test design was employed to evaluate the VA’s capabilities effectively. This design focused on three primary aspects: the number of queries, scoring criteria, and critical performance metrics. This approach enabled a detailed analysis of the VA’s performance across various dimensions.
The total number of queries submitted during the user testing process was 561, which resulted from each of the 187 participants being asked to submit three independent queries. This approach provided a substantial data set for analyzing the VA’s response consistency and accuracy across different inquiries. The design of each query was intended to reflect common questions that might be posed by students or staff in a university setting, encompassing a range of topics from academic policies to administrative procedures.
The VA’s responses were evaluated using a multi-faceted scoring system incorporating binary (Yes/No) and scale-based ratings (
Table 1). The binary scoring focused on the accuracy of the response, categorized as either accurate (Yes) or containing errors (No). Furthermore, participants rated the VA’s responses on a Likert scale ranging from 1 to 5 across various dimensions, including relevance, coherence, and overall quality. A rating of 1 indicated a highly unsatisfactory response, while a rating of 5 signified an excellent response fully aligned with user expectations. This nuanced scoring system provided insights into specific areas where the VA excelled or required improvement, enabling a comprehensive analysis of its strengths and weaknesses.
The scoring system provided a well-rounded framework for assessing the VA’s technical accuracy and overall user experience. This multifaceted approach enabled a more nuanced evaluation, which helped to pinpoint areas for further refinement while affirming the VA’s effectiveness across diverse query types.
4.4. Key Metrics Evaluated
The metrics used to evaluate the VA’s performance were meticulously selected to ensure that technical accuracy and user satisfaction were adequately assessed. This careful selection process ensured the validity of the evaluation.
Accuracy and Hallucination Rate: This metric focused on measuring the correctness of the VA’s responses, particularly the presence or absence of hallucinations—instances where the model generates inaccurate or misleading information that does not align with the given context. The frequency of hallucinations was tracked to identify patterns in the VA’s response generation that could lead to inaccurate outputs. Providing accurate responses was vital to establishing the VA as a reliable tool in an academic setting.
Relevance and Quality: The relevance metric evaluated the extent to which the responses provided by the VA aligned with the specific queries posed by participants. Responses were scored based on the degree to which they addressed the question and provided practical, contextually appropriate information. The quality dimension assessed each response’s grammatical and syntactical correctness, ensuring that the output was not only relevant but also clear and professional. Together, these metrics offered a comprehensive view of the VA’s ability to deliver polished and pertinent responses.
Comprehension: This metric measured the VA’s ability to interpret user queries accurately, with each response analyzed for logical coherence and the absence of fallacies or misunderstandings. Comprehension is critical in a VA setting, as the system needs to provide clear and logically consistent answers that align with the intent of the question posed by users. High comprehension scores indicated that the VA could effectively parse and respond to complex queries, whereas lower scores reflected areas where the model struggled to understand user intent.
Overall User Experience: Beyond technical performance, the overall user experience metric provided a broader assessment of user satisfaction. Participants rated the experience based on the VA’s ease of use, response time, and general helpfulness in answering queries. This metric captured the user’s perspective on the VA’s functionality and identified any friction points that could impact user adoption and satisfaction. A high overall user experience score indicated that the VA effectively provided accurate information while remaining easy to use and engaging.
The metrics provided a robust and holistic evaluation framework, allowing for an in-depth understanding of the VA’s technical capabilities and reception among users. Areas of success were highlighted, and specific aspects needing refinement to enhance the VA’s functionality and user satisfaction were identified.
4.5. Evaluation Process
The evaluation process was conducted through quantitative and qualitative assessments, ensuring a comprehensive VA performance analysis. The responses were quantitatively scored based on the above criteria, resulting in a data set that provided insights into accuracy rates, hallucination frequency, and average scores across various metrics. The data was analyzed to determine trends and identify areas where the VA exhibited consistent performance or weaknesses.
Participants also provided qualitative, open-ended feedback on their interactions with the VA. This feedback offered valuable insights into user perceptions, highlighting specific aspects of the VA’s performance that might not be captured through quantitative scores alone. For instance, users noted areas where the VA provided contextually relevant responses particularly adeptly, as well as instances where the system struggled with ambiguous or complex questions.
Each response was scored quantitatively based on the criteria outlined in
Table 1, which enabled structured scoring across different response dimensions. The scoring system comprised both binary and Likert scale evaluations. Binary scores were used to assess response accuracy, detect hallucinations, and identify logical errors. The binary ratings provided a straightforward assessment of whether the output of the VA met essential accuracy and coherence standards, with each response categorized as either correct or containing an error.
A Likert scale ranging from 1 to 5 was employed in addition to binary scoring to rate responses across various dimensions, including relevance, quality, and overall user experience. The Likert scale facilitated a more nuanced evaluation, capturing the degree of alignment between the response and the query (relevance), the grammatical and syntactic quality (quality), and the participant’s satisfaction with the interaction (overall user experience). This multi-dimensional scoring approach provided a detailed view of the VA’s performance, identifying not only binary correctness but also the subtler factors influencing user satisfaction.
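The small sketch below illustrates how the binary and Likert scores described above could be aggregated into summary metrics such as accuracy and hallucination rates; the field names are assumptions, not the study’s actual data schema.

```python
# Illustrative aggregation of scored responses; field names are assumptions.
from statistics import mean


def summarise_scores(responses: list[dict]) -> dict:
    n = len(responses)
    return {
        "accuracy_rate": sum(r["accurate"] for r in responses) / n,           # share of "Yes" answers
        "hallucination_rate": sum(r["hallucination"] for r in responses) / n,
        "logical_error_rate": sum(r["logical_error"] for r in responses) / n,
        "mean_relevance": mean(r["relevance"] for r in responses),            # Likert 1-5
        "mean_quality": mean(r["quality"] for r in responses),
        "mean_user_experience": mean(r["user_experience"] for r in responses),
    }


# Example: one scored interaction.
print(summarise_scores([{
    "accurate": True, "hallucination": False, "logical_error": False,
    "relevance": 5, "quality": 4, "user_experience": 5,
}]))
```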
4.6. Detailed Scoring
The response patterns of the VA were further illustrated through specific scoring methods applied to critical aspects of the responses. The foundational accuracy metric evaluated each response with a binary score that determined whether the VA provided factually correct information. Responses were designated “Yes” for accuracy when they contained correct data relevant to the query or “No” when errors were identified. This binary approach offered a straightforward yet essential measure of the VA’s reliability in delivering accurate information, regarded as a core requirement for an educational support system.
Furthermore, Hallucination Detection was implemented as a binary score, with responses that included content deviating from the context or intent of the question being flagged. Instances of hallucinations, characterized by the VA producing information without a basis in the available data, were identified. This scoring metric addressed an essential concern in AI-driven applications. The detection and mitigation of hallucinations are considered vital to ensuring that the VA is maintained as a dependable source of information and is not inadvertently used to mislead users.
Another crucial component, Logical Errors, was assessed through a binary check to identify coherence issues within the responses. The scoring aimed to detect cases in which the VA’s responses exhibited logical fallacies or inconsistencies with the posed questions, thereby undermining the overall coherence and reliability of the response. Logical consistency in responses is recognized as essential for maintaining user trust, especially in the context of complex or ambiguous queries.
Relevance and Quality were assessed on a Likert scale from 1 to 5 to complement these binary scores. This approach permitted a more nuanced evaluation of the alignment between the VA’s responses and the user’s query intent. This scale considered the contextual appropriateness and syntactic quality of each response, ensuring that the VA addressed the content of the question. At the same time, clarity and professionalism were maintained in its presentation. High scores in this area indicated that the VA’s answers were well suited to the inquiry and professionally constructed, contributing positively to user experience.
Finally, the overall user experience was evaluated using a Likert scale, capturing participants’ subjective impressions of their interaction with the VA. The scoring considered ease of use, response clarity, and overall satisfaction, providing insights into user-centered aspects of the VA’s performance. This focus on the holistic user experience identified areas in which the VA excelled at user engagement, along with any friction points that could potentially impact user adoption.
In addition to the structured scoring system, the analysis of examples of both positive and negative responses allowed for more profound insights into the VA’s strengths and areas needing improvement. Illustrative samples of these responses were provided in
Table 2 and
Table 3, with specific user queries paired with the VA’s generated answers. These tables documented cases where the VA performed effectively and instances in which it struggled with contextual relevance or accuracy.
A positive response, as illustrated in
Table 2, concerned a commonly asked administrative question. The VA successfully retrieved accurate, contextually relevant information and presented it with proper syntax and grammar, resulting in high scores in relevance and quality. This sample highlighted the VA’s potential for handling straightforward queries reliably.
Table 3 illustrates a negative response example in which a hallucination occurred, and the VA fabricated details that were inconsistent with the question’s context. In such cases, a scoring rationale was applied to explain specific scores, particularly for logical errors and irrelevance, ensuring clarity in the evaluation process.
The scoring framework was enhanced by these examples, which provided qualitative insights into the response patterns of the VA, clarifying the likely occurrences of issues such as logical fallacies or hallucinations. The evaluation process was enhanced by the combination of structured scoring and detailed response samples, resulting in a comprehensive view of VA’s performance that provided quantitative data and contextual analysis. Identifying areas for targeted improvement was effectively achieved through this holistic approach, which ultimately guided refinements that could enhance the efficacy of the VA in real-world academic applications.
5. Performance Analysis and Results
The effectiveness of the generative model is examined in this chapter, with particular emphasis placed on the ability of the VA to minimize errors and hallucinations while maximizing response accuracy. The testing explored multiple response generation methods: predictions without context, contextual predictions utilizing the RAG method, and learning from one or more examples (one-shot or few-shot learning). Each approach presents unique strengths and limitations. Responses deemed correct for one-shot and few-shot learning are derived from a curated set of accurate answers and contextually relevant literature, selected to give the model reliable patterns for responding to similar inquiries. Context-free predictions revealed the model’s tendency to produce hallucinations, mainly when complex or nuanced queries were interpreted without supporting data. The addition of contextual information through the RAG method significantly reduced these errors, affirming the importance of context for generating relevant, accurate responses. Furthermore, the example-based learning method showed additional promise, particularly in enhancing response accuracy where relevant context was sparse.
5.1. Impact of Context on Prediction Accuracy
Integrating context within query predictions is crucial for improving the virtual assistant’s accuracy and reducing the rate of hallucinations. Contextual data grounds the model’s responses in relevant information, minimizing the potential for misleading answers. This approach relies on the RAG method, which retrieves contextually relevant documents from a pre-defined database. Testing revealed that relevant context dramatically enhances response accuracy: the VA correctly answered 65 of 74 questions (87.73%) when pertinent context was provided through RAG, as shown in Figure 5.
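For reference, the sketch below shows the kind of similarity-based selection on which a RAG pipeline of this type typically relies: candidate documents are ranked by cosine similarity to the query embedding, and only sufficiently similar passages are passed to the model. The embedding source, the top-k value, and the 0.7 threshold are illustrative assumptions rather than the configuration used in this study.

```python
import numpy as np

# Illustrative similarity-based retrieval: rank pre-embedded documents by
# cosine similarity to the query embedding and keep only passages above a
# threshold. The top_k and min_sim values are assumptions, not the study's
# configuration.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, doc_vecs, docs, top_k=3, min_sim=0.7):
    scored = sorted(
        ((cosine(query_vec, v), d) for v, d in zip(doc_vecs, docs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Discard weakly related passages so they are not injected into the prompt.
    return [doc for sim, doc in scored[:top_k] if sim >= min_sim]
```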
Without context, the VA often fabricates responses or introduces inaccuracies. A sharp increase in hallucinations was observed when the model generated responses without any supporting information, producing incorrect or misleading answers in 75% of cases. This discrepancy underscores the importance of structured contextual data, particularly for complex queries that demand a high degree of specificity, such as questions about administrative procedures or eligibility criteria. When queries were presented without contextual information, the model frequently failed to recognize their nuances, leading to logical errors and deviations from expected responses.
Context also affects the consistency of responses. With contextual support, the VA’s consistency improved across similar queries, particularly in tests involving enumerative or multi-step answers. Without context, the VA was prone to presenting fragmented information or entirely irrelevant details. When the RAG method supplied context, the VA could pull together cohesive, correct information, even for complex procedural questions. This structured approach refined the VA’s ability to provide logical and relevant responses while reinforcing its capability to filter out unnecessary or incorrect information.
Overall, it has been observed that the inclusion of context through the RAG method yields two primary benefits: the enhancement of factual accuracy in responses and the reduction of hallucinations. This result validates the hypothesis that context-based predictions substantially improve VA performance. Implementing context retrieval protocols in the VA’s architecture is essential to maintaining high accuracy and ensuring user trust, particularly in environments where reliable information is paramount.
5.2. Reduction of Hallucinations in Complex Responses
The RAG method significantly reduced hallucinations, particularly in complex and multi-step queries. The model’s success in avoiding hallucinations—fabricated details with no basis in the query or context—was notably higher when relevant contextual information was provided. Analysis showed hallucinations in 12.72% of responses generated with the RAG method, significantly lower than the rate observed for context-free predictions. By supplying pertinent data, the RAG method grounds the model’s output and improves its ability to handle complex queries accurately.
Another critical factor was the influence of relevant versus irrelevant context. When relevant context was provided, hallucinations appeared in only 2 out of 74 queries; when the context was irrelevant, hallucinations occurred in 33% of cases. This stark contrast indicates that while the retrieval function of the RAG method generally mitigates hallucinations, its effectiveness depends on the relevance of the retrieved context. Where precise semantic alignment was absent, the model proved less able to distinguish relevant from irrelevant context, leading to inaccuracies, as shown in Figure 6.
Challenges were observed for queries requiring enumeration, such as steps in a procedure or lists of documents. Hallucinations appeared in approximately 30.3% of responses in which the model listed items, despite the RAG-based context. This rate worsened when the retrieved context was irrelevant, highlighting the need to refine the handling of enumerative queries. Enumerative responses lacking adequate context not only heightened the likelihood of hallucinations but also introduced errors that could mislead users in procedural or administrative tasks.
Further statistical validation indicated a significant improvement in hallucination reduction when relevant context was utilized. The p-value calculations from the thesis support this outcome: the probability of a hallucination-free response was considerably higher with RAG context (p < 0.001). Consequently, the null hypothesis, which posits no significant difference between context-free and context-augmented predictions, is rejected.
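For illustration, the following sketch performs a one-sided two-proportion z-test of the kind this comparison implies, contrasting hallucination-free response rates with and without context. The with-context count (72 of 74) follows the figure reported above; the without-context count is a hypothetical placeholder, and the exact test and counts used in the thesis may differ.

```python
from math import sqrt
from scipy.stats import norm

# Illustrative one-sided two-proportion z-test: are hallucination-free
# responses more likely with RAG context than without? The with-context
# count (72 of 74) follows the text; the without-context count is a
# hypothetical placeholder.

def two_proportion_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, norm.sf(z)  # one-sided p-value (H1: context helps)

z, p = two_proportion_z(x1=72, n1=74, x2=25, n2=74)  # x2/n2 is hypothetical
print(f"z = {z:.2f}, p = {p:.2g}")  # H0 rejected when p < 0.05; here p << 0.001
```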
The effectiveness of the RAG method in limiting hallucinations in complex responses has been demonstrated; however, challenges persist with queries that require structured lists or involve ambiguous context. Improving the relevance of context retrieval and refining the handling of enumerative data would further enhance the VA’s response accuracy, reducing the risks associated with complex academic queries.
5.3. Quantitative Assessment of User Satisfaction
User satisfaction served as a critical metric for evaluating the effectiveness and usability of the virtual assistant. User feedback centered primarily on the VA’s understanding of queries, the relevance and quality of its responses, and the overall user experience. A Likert scale ranging from 1 to 5 quantified these aspects, while additional binary assessments captured specific response attributes, including accuracy and hallucinations. The results of user testing indicate a generally favorable impression, with an average satisfaction rating of 3.575 across all responses, underscoring the VA’s effectiveness in meeting most users’ expectations, as shown in Figure 7.
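A minimal sketch of how such per-response records can be aggregated is shown below; the field names and sample values are hypothetical and serve only to illustrate how the Likert averages and the binary hallucination rate reported here can be computed.

```python
from statistics import mean

# Illustrative aggregation of per-response evaluation records combining
# 1-5 Likert ratings with binary flags. Field names and values are
# hypothetical.

responses = [
    {"understanding": 5, "relevance": 4, "grammar": 5, "satisfaction": 4, "hallucination": False},
    {"understanding": 4, "relevance": 3, "grammar": 4, "satisfaction": 3, "hallucination": True},
    # ... one record per evaluated response
]

for metric in ("understanding", "relevance", "grammar", "satisfaction"):
    print(metric, round(mean(r[metric] for r in responses), 3))

hallucination_rate = mean(1 if r["hallucination"] else 0 for r in responses)
print(f"hallucination rate: {hallucination_rate:.1%}")
```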
Syntactic and grammatical quality received the highest scores, averaging 4.425 in testing. This score indicates that responses were perceived as clear and professional, which is crucial for maintaining trust in automated systems. Many users noted that the responses were stylistically correct and well structured, contributing positively to the VA’s perceived reliability. Even so, some users occasionally rated responses highly on grammatical quality alone, potentially overlooking minor factual inaccuracies or hallucinations.
The VA’s ability to comprehend and interpret queries was rated at 4.225, indicating that users generally felt the system understood their questions adequately. This metric is important because it reflects the system’s performance in parsing user intent, a key component of any language model-based assistant. Response relevance averaged 3.825, indicating occasional mismatches between user queries and the answers the VA provided. These mismatches were often caused by minor misinterpretations of the context, resulting in slight deviations from the intended answer, particularly for questions involving specific institutional policies or requirements.
The overall user experience, while positive, left room for improvement. The final satisfaction metric of 3.575 indicated that users perceived the VA as a helpful tool, but they also encountered limitations in response accuracy. Incorrect or hallucinated responses, while infrequent, affected user confidence in the VA’s reliability, and users sometimes had to cross-reference answers with additional sources to verify correctness. Feedback suggested that, where hallucinations occurred, users would appreciate a clearer indication of confidence levels or disclaimers to help mitigate potential misinformation risks.
This quantitative assessment provided valuable insights into the VA’s strengths and limitations. Although linguistic and syntactic quality received high ratings, improvements in context retrieval and response accuracy are needed for broader user satisfaction. By refining these aspects, the VA could become a more dependable tool in an academic setting, increasing user confidence in its responses and enhancing the overall user experience.
5.4. Limitations and Areas for Improvement
Implementing the virtual assistant with the RAG method on the VERTEX AI platform demonstrated significant strengths, but it also highlighted limitations, particularly in handling context-sensitive data and maintaining response accuracy across varied queries. Generalization was identified as one of the main challenges: although effective at grounding responses, the RAG method occasionally struggled to generate accurate answers when the input context was either highly specific or sparse. This tendency to over-generalize is attributed to the language model’s reliance on the immediate context, which can narrow the scope of relevant responses.
Overfitting and bias in specialized topics presented further obstacles, as the model’s responses sometimes reflected inherent biases from the training data. This was particularly evident for topics where the retrieved context lacked neutral perspectives or diverse viewpoints, underscoring the need for continuous model tuning and enhanced training data to mitigate these biases. The model’s reliance on context also limited the creativity and flexibility of its responses, often producing highly structured answers that did not always accommodate more nuanced or interpretative queries.
The interpretation of tabular and structured data is another area for improvement. While the VA performs adequately with text-based documents, it exhibited limitations in parsing and responding to questions that required an understanding of structured data, such as schedules or course curricula. Occasional inaccuracies stemmed from the inability to integrate tabular data seamlessly, particularly in enumerative responses or when the VA needed to list items from structured information. Future enhancements may include model training designed specifically for interpreting structured data, potentially using datasets with annotated tables to improve the model’s adaptability.
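One common mitigation for this class of problem, shown in the sketch below, is to linearize each table row into a self-contained text passage before indexing it for retrieval; the caption, column names, and rows are illustrative, and this is not necessarily the approach the study would adopt.

```python
# Illustrative linearization of a table into retrievable text chunks: each
# row becomes a self-contained passage that can be embedded and indexed like
# any other document. The caption, columns, and rows are hypothetical.

def linearize_table(caption, header, rows):
    chunks = []
    for row in rows:
        cells = "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        chunks.append(f"{caption} - {cells}")
    return chunks

chunks = linearize_table(
    caption="Autumn exam schedule",
    header=["Course", "Date", "Room"],
    rows=[
        ["Linear Algebra", "2024-09-12", "A101"],
        ["Databases", "2024-09-15", "B204"],
    ],
)
print(chunks[0])  # "Autumn exam schedule - Course: Linear Algebra; Date: 2024-09-12; Room: A101"
```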
Hallucinations in enumerative responses posed a further challenge, with the model generating additional or inaccurate information when asked to list items or follow a sequence. This tendency underscores the need for stricter filtering mechanisms that verify context relevance and accuracy, particularly for queries requiring multi-step answers. Enhanced prompt engineering and query design are suggested to guide the model more effectively, reducing fabricated information and ensuring that responses remain accurate and grounded in the retrieved content.
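A simple form of such a filtering mechanism is sketched below: each listed item in an enumerative answer is checked for lexical overlap with the retrieved context and flagged when unsupported. The overlap heuristic and threshold are illustrative assumptions; a production system would likely rely on a stronger semantic or entailment check.

```python
# Illustrative post-hoc grounding check for enumerative answers: flag any
# listed item with little lexical overlap with the retrieved context. The
# overlap heuristic and 0.5 threshold are assumptions; stronger semantic
# checks would be preferable in practice.

def unsupported_items(answer_items, context, min_overlap=0.5):
    context_tokens = set(context.lower().split())
    flagged = []
    for item in answer_items:
        tokens = set(item.lower().split())
        overlap = len(tokens & context_tokens) / max(len(tokens), 1)
        if overlap < min_overlap:
            flagged.append(item)
    return flagged

items = ["Submit the enrolment form",
         "Pay the administrative fee",
         "Attend an in-person interview"]  # the last item is fabricated
context = "To enrol, submit the enrolment form and pay the administrative fee."
print(unsupported_items(items, context))  # -> ['Attend an in-person interview']
```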
Lastly, the current setup of the VA lacks a feedback mechanism for continuous learning and adaptation, which could allow real-time adjustments based on user interactions. Incorporating a system that captures and processes user feedback on response accuracy would refine the model’s responses over time, supporting the development of an adaptive, evolving virtual assistant better aligned with user expectations. Such a feedback loop would help the model learn from user corrections, ultimately enhancing its reliability in a dynamic, real-world environment.
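A minimal sketch of such a feedback capture step is given below; the record schema, file format, and example values are hypothetical and intended only to show how per-response feedback could be logged for later review and model refinement.

```python
import json
import time

# Illustrative feedback capture: log each answered query together with the
# user's judgement so flagged responses can be reviewed and folded back into
# the curated example set. The schema and values are hypothetical.

def log_feedback(path, question, answer, context_ids, helpful, hallucination):
    record = {
        "timestamp": time.time(),
        "question": question,
        "answer": answer,
        "context_ids": context_ids,      # retrieved passages used for the answer
        "helpful": helpful,              # user's thumbs-up / thumbs-down
        "hallucination": hallucination,  # user-reported fabrication
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_feedback("feedback.jsonl",
             question="When does exam registration close?",
             answer="Registration closes one week before the session.",
             context_ids=["regulations_2024_p12"],
             helpful=False,
             hallucination=True)
```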
Refining the retrieval algorithms, incorporating feedback mechanisms, and developing specialized models for structured data processing would address these limitations, resulting in a more adaptable, accurate, and user-centered VA for academic settings.
6. User Testing Results and Impact Analysis
The testing results provided insights into the VA’s strengths and the areas needing refinement. Correct responses were generated by the VA for 85 out of 1210 questions, resulting in a percentage of 77.27%. User satisfaction was notably high for grammatical and syntactic quality, with an average score of 4.425, indicating that the clarity and professionalism of the VA’s responses enhanced users’ trust in its functionality. The average score for query comprehension was 4.225, indicating that users generally perceived the VA as understanding their questions and addressing them appropriately, as shown in Figure 8.
Despite these strengths, the VA faced challenges with relevance and contextual accuracy. Hallucinations were observed in 30% of responses, with the model generating plausible yet inaccurate information. This finding underscores the need for refined context integration, as hallucinations affected the VA’s perceived reliability and could mislead users if left unchecked, as shown in Figure 9.
Statistical validation through z-score and p-value calculations supported the hypothesis that contextual information significantly enhances response accuracy. For instance, the probability of a hallucination-free response was statistically higher with context (p < 0.001), confirming the benefits of the RAG method for mitigating hallucinations. Furthermore, the VA maintained a 75% accuracy rate in samples accompanied by relevant literature, indicating improved performance when the model was grounded in specific, applicable data.
The study further examined the potential adverse effects of incorrect responses on user decisions. Of the 199 incorrect responses identified, 21 were determined to have the potential to significantly affect users’ actions, including misinforming students about critical academic procedures, as shown in Figure 10.
This finding highlights the importance of reliable information in educational contexts, where inaccurate responses could affect a user’s academic standing, for example by causing students to miss attendance requirements or exam deadlines due to incorrect guidance.
7. Future Works
This research identifies several avenues for enhancing the virtual assistant, focusing on refining accuracy, expanding functionality, and ensuring user-centered application improvements. These recommendations aim to address the limitations encountered and to adapt the VA for broader, cross-domain usability.
Mechanisms for Enhanced Context Retrieval: Refining the retrieval algorithms used by the Retrieval-Augmented Generation method offers a clear avenue for improvement. The VA could implement adaptive retrieval thresholds that adjust based on query complexity and semantic similarity, yielding more precise, contextually relevant responses. This enhancement is likely to reduce the inclusion of irrelevant context, which would in turn minimize hallucinations and increase response accuracy across varied and complex queries.
Interpretation of Structured Data: Expanding the VA’s capability to interpret structured data, including tables, course curricula, and schedules, would enhance its ability to answer detailed procedural queries. Future work may involve training the VA on datasets focused on structured information, enabling the model to handle tabular and multi-step procedural data effectively. This adaptation would allow the VA to respond accurately to queries that rely heavily on detailed, structured information.
Mitigation of Hallucinations in Enumerative Responses: Hallucinations in enumerative responses remain a critical challenge. Advanced prompt engineering techniques or response-filtering mechanisms designed specifically for enumerative tasks could be employed to mitigate them, helping to ensure that the VA consistently provides contextually grounded, reliable information, particularly in procedural or multi-step queries that are prone to inaccuracies.
User Feedback Mechanism for Continuous Learning: Incorporating a real-time feedback loop within the VA’s operational framework would support ongoing improvement based on user interactions. Capturing and integrating user feedback, particularly for inaccurate or hallucinated responses, would allow response strategies to adapt over time, enhancing the accuracy and reliability of answers. Such a user-centric feedback mechanism is expected to be a valuable tool for dynamic model optimization.
Domain Adaptability and Cross-Industry Applications: Although primarily designed for educational environments, the VA shows potential for adaptation to other domains, including healthcare, legal services, and customer support. Each field would require tailored data inputs and domain-specific query-handling mechanisms. Future research should explore adapting the VA’s capabilities to meet the unique needs of these fields, establishing it as a versatile tool for diverse professional environments.
Error Detection and Ethical Safeguards: In environments where accuracy is critical, deploying advanced error-detection algorithms and warning systems would enhance the VA’s reliability. Such mechanisms could identify and mitigate the risk of propagating incorrect information, safeguarding against potential misinterpretation by users. Ethical safeguards are essential for applications in sensitive sectors where erroneous responses could have significant impacts.
Multimodal Interaction Capabilities: Developing the VA to support multimodal interactions, including text, voice, and potentially image or video inputs, would make it a more versatile and user-friendly tool. This enhancement aligns with current trends in artificial intelligence, positioning the VA as an accessible, interactive solution that can meet users’ needs across various communication preferences.
These enhancements address the VA’s current limitations and broaden its applicability, resulting in a more adaptable, user-centered tool for use across various contexts. The proposed future work aims to advance technical capabilities and enhance ethical considerations and user safety within the VA, thereby contributing to a more robust and trustworthy AI application for educational and professional environments.
8. Conclusions
This study presents the development and evaluation of a virtual assistant designed to support users within an academic setting, built on a generative language model enhanced by the Retrieval-Augmented Generation method. The VA was tested across several dimensions, including response accuracy, contextual relevance, user satisfaction, and the ability to minimize hallucinations—cases in which the model generates information without a basis in the query context. The results indicate that context-based responses significantly improve accuracy and reduce hallucinations, affirming the RAG method’s importance in providing grounded, relevant answers.
Although the VA demonstrated positive performance on structured and context-rich queries, challenges remain in handling complex, enumerative, and structured data queries, along with occasional hallucinations. The limitations identified indicate areas for future refinement, including improving the model’s context retrieval accuracy and integrating capabilities to interpret structured data, such as schedules or tabular information. Furthermore, implementing feedback mechanisms to facilitate continuous model improvement based on real-world interactions is expected to enhance user trust and reliability.
The VA’s adaptability across various contexts indicates potential applications beyond academic environments, including healthcare, legal advisory, and customer support settings, where accurate, contextually relevant responses are essential. Future work is expected to address the current limitations, ensure ethical safeguards, and expand multimodal interaction capabilities. Through these advancements, the VA is poised to become a robust, user-centered tool, adaptable to diverse professional environments and capable of supporting users with reliable, contextually relevant information. This research demonstrates the potential of AI-driven virtual assistants to enhance user support and information retrieval across domains, paving the way for more reliable and context-aware applications.