1. Introduction
Large Language Models (LLMs) have emerged as powerful tools in the realm of natural language processing, possessing the unique capability of understanding and generating human language. This capability positions LLMs as transformative assets across a wide array of applications. In recent years, the deployment of LLMs has expanded rapidly, with diverse use cases emerging across multiple sectors. The introduction of generative AI models such as ChatGPT [
1] has brought LLMs into the public spotlight, significantly elevating the recognition and adoption of AI technologies. This surge in interest has opened new avenues for developers and users alike, leading to the integration of generative AI into various aspects of daily life.
One of the key strengths of LLMs is their ability to process and analyze vast amounts of data in a relatively short time frame. As the world becomes increasingly interconnected, enormous volumes of related data are transmitted across the internet, necessitating structured approaches for processing and utilization. To manage these data, specialized database management systems have been developed, particularly for handling graph data. Among these, Neo4j stands out as the most popular graph database management system [
2].
In this context, data models like NoSQL databases, when integrated with LLMs, offer a flexible and efficient way to process and represent diverse data types and complex relationships. Traditional relational database models, which require join operations for every edge usage, are inefficient in comparison to NoSQL models within the context of knowledge graphs [
3]. Knowledge graphs are sophisticated data structures that represent entities as nodes and their relationships as edges, with both nodes and edges containing various attributes [
4]. This structure provides a comprehensive model of a given domain, making it an ideal foundation for training LLMs. Neo4j, with its inherent compatibility with knowledge graph structures, is well-suited for managing such data. The semantic richness of knowledge graphs aligns seamlessly with the capabilities of LLMs, enabling a deeper understanding of data and their interconnections [
5].
This paper aims to explore the potential synergies between NoSQL databases, knowledge graphs, and LLMs, specifically focusing on data processing and contextual analysis. To achieve this objective, we have developed a chatbot that leverages multiple databases, including knowledge graphs, to answer user queries about the data contained within these graphs. The application is designed with a modular architecture, allowing researchers to customize the chatbot to meet their specific needs. For this work, we utilized the OpenAI GPT-4 Turbo API [
6]. Furthermore, we tested the application to assess its real-world applicability and to evaluate the accuracy of the system. Additionally, we address potential limitations and challenges associated with generating Cypher Query Language (CQL) statements.
To this end, this paper presents four key contributions to advance research in complex database query generation: (1) an automated approach for CQL generation tailored to meet the demands of complex query requirements, (2) a fully automated process that removes the dependence on template-based insertion methods prevalent in prior research, (3) integrated error correction mechanisms designed to enhance query accuracy, and (4) a robust database selection framework that optimally aligns generated queries with suitable databases. Together, these contributions aim to establish a more reliable and versatile framework for CQL generation in database applications.
The remainder of this paper is structured as follows: In
Section 2, the core principles that are necessary to understand the topic are explained. First, Large Language Models in general are examined, followed by Prompt Engineering and Sampling. After that, NoSQL is described, together with Knowledge Graphs and CQL. In
Section 3 we investigate existing literature that focuses on creating query statements from natural language. In
Section 4, the research approach of this paper is examined in detail. Following this, the technical realization and implementation of the developed chatbot’s architecture is shown in
Section 5, and a brief demonstration of the application is given. Then, an evaluation of the developed prototype is carried out in
Section 6, based on several criteria, and the results are presented. Afterwards, the findings, challenges, and limitations of the study will be discussed in
Section 7. The paper concludes with an outlook on opportunities for improvement and further development in
Section 8.
4. Concept
As the existing literature attests, text-to-SQL systems have been thoroughly researched in the context of Natural Language Interfaces (NLIs) for databases. Shifting the focus to NoSQL databases, though, reveals more significant challenges due to their complexity. Query generation in graph databases is particularly challenging. Most of the literature focuses on SPARQL as a query language. Conversely, the Cypher Query Language (CQL), which is used to query graph databases in Neo4j, has been relatively less explored, both in terms of breadth and depth. To fill this research gap and develop an appropriate artifact for this topic, the Design Science Research Methodology (DSRM) as outlined by March and Smith [
61] and further developed by Peffers et al. [
62], was applied. This approach is particularly appropriate, as it allows a specific organizational problem to be addressed through the development of a meaningful IT artifact [
63]. By applying DSRM, the research focuses on creating a solution that not only addresses the technical aspects of querying graph databases using CQL, but also considers the usability and practicality of the solution in organizational settings. The development of such an artifact, adapted to the complexities of NoSQL databases and Neo4j in particular, aims to make a significant contribution to the field by improving the capabilities and ease of use for practitioners and researchers alike.
4.1. Problem Identification and Motivation
In accordance with this methodology, the first step was to identify the problems and motivation for this research contribution. These were primarily outlined in the Introduction and Related Work sections. It is apparent that there is a substantial need for research in the area of the Cypher Query Language for querying graph databases in Neo4j, with the most significant research gap being the generation of these CQL queries with LLMs. This need is critical not only for driving research in this area, but also for making the query capabilities of different databases more accessible.
4.2. Objectives of a Solution
In line with this idea, the goal is to develop an application that takes a modular approach to integrating and querying different databases in natural language, with a particular focus on querying using CQL in Neo4j. On a more granular level, the following objectives of the solution are of particular importance. These are derived from the existing literature and our own assessment. The primary objective of this research is to enhance the natural language interface. The aim is to develop a system that can seamlessly translate natural language queries into CQL without any intermediate substructures. This advancement will allow users to interact with Neo4j graph databases using conversational language, making the process of accessing data both intuitive and user-friendly. In later stages of this approach, this will be key to making graph database technology more accessible to a wider audience, including those without technical expertise in database querying. Another important goal is to improve the accuracy of these CQL queries. The queries generated from natural language must be correct; if this is not the case on the first attempt, the solution includes mechanisms to correct and optimize the Cypher queries. In addition, there is a focus on facilitating complex data retrieval. The system will be designed to handle more complex queries, allowing users to access and retrieve data relationships stored in graph databases. Ensuring scalability and performance is also a key consideration. The system should be able to manage different types of datasets and graph schemas while maintaining high performance in data processing and query execution. Furthermore, the research aims to promote integration and compatibility. The proposed solution will be designed to easily integrate with existing systems and be compatible with various databases and large language models, ensuring its adaptability and long-term utility. Lastly, a user-friendly interface is another crucial aspect of our project. The success of the model depends on its ease of use, underlining the need for an interface that simplifies interactions and enhances the overall user experience.
4.3. Design Principles
Each of the aforementioned goals has been developed to bridge the gap between advanced database technology and user-friendly interfaces and to enhance the field of natural language processing for database queries. From these goals, the following design principles can now be derived, which were adopted in the development of the technical artifact.
Consistent approach: The design of our artifact is driven by a consistent approach, ensuring uniformity in functionality across different modules and aspects of the application. This consistency is essential for providing a seamless user experience, making the transition between different types of queries and databases intuitive and straightforward. By maintaining a consistent approach, users can expect predictable outcomes and interactions, which is crucial for building user trust and proficiency with the application.
Accurate query generation: Another fundamental aspect of our solution is its capability to generate correct queries from natural language inputs. This involves advanced processing models that can accurately interpret the user’s intent and translate it into valid CQL queries. The system is designed to understand different verbal constructs and convert them into corresponding database operations, ensuring that the results match the user’s expectations.
Robustness: This marks another core principle guiding our design. The artifact is constructed to handle a wide range of queries reliably, maintaining performance and accuracy even under varying conditions. This robustness includes the ability to manage complex queries, interpret nuances in natural language, and provide results consistently. Additionally, the system is designed to be error tolerant, offering clear feedback to users to correct issues or refine their queries.
Appropriate database selection and correct reference to chat history: The artifact intelligently selects the appropriate database based on the query context and user requirements. It includes a mechanism to understand the context within which a query is made, including references to chat history. This feature allows the system to provide more accurate and contextually relevant responses by understanding the user’s interactions and the nature of their queries. The ability to reference and utilize chat history improves the system’s ability to handle repetitive, multi-step interactions and thus improves performance.
4.4. Workflow Design
Derived from the objectives and design principles developed above,
Figure 1 was created. The workflow explains the sequential operations from user input to final response delivery. This section examines each step of the Business Process Model and Notation (BPMN) model, providing a detailed understanding of the decision-making processes and the interaction with the databases. The process is initiated when a user enters a question. This is the primary interaction between the human and the system, and triggers the chatbot’s response mechanism. The chatbot internally generates a prompt that encapsulates the user’s query, preparing the system for subsequent analysis and response. The system evaluates the chat history to determine if a similar query has been previously addressed, which optimizes the response time by avoiding redundant database queries. If the system finds a relevant instance in the chat history, the chatbot constructs an answer in natural text form, ready to be presented to the user. If the chat history does not provide a satisfactory response, the system moves on to assess the feasibility of a database query with the available databases. This decision is crucial, since it involves choosing the correct database for executing queries. If the underlying databases are not sufficient to answer the question, an “Unsuccessful Response” message is generated. However, if the question can be answered with an underlying database, the respective database schema for constructing an accurate and efficient query is retrieved. Using the user’s prompt and the retrieved database schema, the chatbot proceeds to generate a structured query. This query is tailored to retrieve the relevant information from the database in response to the user’s initial question. If the query execution fails, the chatbot generates an error prompt and produces improved queries, retrying with optimized Cypher queries. Upon a successful query execution, the chatbot generates an answer from the query result. This answer is then formatted into a natural text response that can be easily understood by the user. In case the chatbot cannot generate a successful response from either the chat history or the database query, it generates an unsuccessful response. The final step in the chatbot’s operational process is presenting the response to the user. Whether the response is a direct answer, an error message, or a notification of an unsuccessful query, the system communicates the outcome clearly and effectively to the user, maintaining transparency in the interaction.
4.5. Technical Realization and Implementation
The following section marks a shift from the conceptual framework to the technical realization and implementation of the chatbot. It details the architecture, programming languages, and database connections that are integral to the construction of the chatbot, focusing on the design choices made to solve the identified problems.
4.6. Evaluation
Following the technical exposition, an evaluation phase will assess the chatbot’s performance against pre-defined objectives. Metrics such as execution accuracy, response time, and syntax errors will be central to this analysis.
4.7. Discussion and Limitations
The discussion will then place these findings in the wider context of existing research, highlighting the implications of the study. This examination will serve to review the new insights and limitations of the solution created.
4.8. Conclusion and Future Work
In the concluding section, the research will present a synthesis of the findings and suggest areas for future research. The conclusion will summarize the significance of the findings while acknowledging the scope and limitations of the study. The proposed future research directions will build on the reflective findings and suggest modifications and improvements for the next possible iterations of the chatbot.
5. Implementation
The NLI built as part of this project was designed on the basis of a modular architecture. This approach allows for easier testability and maintainability of the individual components and, above all, expandability to additional databases or language models.
Figure 2 shows the schematic architecture of the chatbot. The entire chatbot was implemented in the Python programming language. The whole system is hosted in Docker [
64] containers, and consists of three main pipelines, namely the
Agent,
Chat_from_History, and
QA, which are responsible for processing the user’s request and generating a suitable response. In the following, the implementation of the individual modules and the processing of user questions in them will be described.
When the chatbot starts, the user interface, database descriptors, and language models are initialized first. For the creation of the user interface (UI), the open-source Python library Gradio [
65] was used, which provides a variety of components to build a UI for a chatbot quickly and easily.
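As an illustration, a minimal chat UI of this kind could be sketched with Gradio roughly as follows; the concrete components, table values, and the placeholder callback are assumptions, not the actual implementation:

```python
import gradio as gr

# Minimal illustrative sketch of a Gradio chat UI; component choices and the
# placeholder callback are assumptions, not the actual implementation.
with gr.Blocks() as demo:
    gr.Dataframe(
        headers=["Database", "Description"],
        value=[["MovieDatabase", "Movies, actors, and directors"],
               ["CLEVR", "Synthetic subway network"]],
        interactive=False,
    )
    chat = gr.Chatbot()
    question = gr.Textbox(placeholder="Ask a question about the data ...")

    def respond(message, history):
        # Placeholder: the real system routes the message through the
        # Agent, Chat_from_History, and QA pipelines.
        return "", history + [(message, "...")]

    question.submit(respond, [question, chat], [question, chat])

demo.launch()
```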
The database descriptors (Listing 1) contain the names of the available databases, as well as a brief description of the information they contain. This is necessary in order to select the appropriate database for the user’s question later in the Agent pipeline. The database adapter contains the connection information and a client for the respective database. The required large language models are also initialized at the same time. Three independent LLMs, i.e.,
Json LLM,
Query LLM and
Chat LLM, are required for the chatbot. Due to hardware limitations, it was decided to use the latest version of OpenAI’s GPT-4 Turbo for each of the models.
Table 1 shows an overview of the models and their parameters. Regarding the temperature parameter, it is particularly important that it is set to 0.0 for the first two models so that the output is as focused and deterministic as possible.
Listing 1. Database Descriptors.
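A minimal, illustrative Python sketch of such a database descriptor is given below; the class and field names are assumptions and do not reproduce the original Listing 1:

```python
from dataclasses import dataclass

@dataclass
class DatabaseDescriptor:
    """Illustrative descriptor; class and field names are assumptions."""
    name: str               # e.g., "MovieDatabase"
    description: str        # short summary used for database selection
    client: object          # database adapter with connection details
    query_prompt: str = ""  # database-specific query generation prompt
    error_prompt: str = ""  # database-specific error correction prompt

descriptors = [
    DatabaseDescriptor(
        name="MovieDatabase",
        description="Movies and people related to them, such as actors and directors.",
        client=None,  # in the real system: a Neo4j client/adapter
    ),
    DatabaseDescriptor(
        name="CLEVR",
        description="A synthetic subway network with stations and lines.",
        client=None,
    ),
]
```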
The reason behind setting the temperature to 0.0 is that when making decisions or generating CQL statements, it is fundamentally important that the model only uses the nodes and parameters that are given to it in the prompt, and does not invent its own values here, as these would then lead to an incorrect decision/Cypher query. For the
Chat LLM, this parameter was set only slightly higher in order to prevent the model from including its internal knowledge in the answer. In the latest GPT-4 Turbo models, a so-called response_format can also be defined. This allows the JSON mode [
66] to be activated for a model, which is used in the
Json LLM. Its functionality will be explained in more detail in the Agent Pipeline section. For the other two models, the default value of the response_format parameter was applied.
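Assuming the openai Python client (v1 interface) and the gpt-4-turbo model name, the model configuration described above could be sketched as follows; the helper function and its defaults are illustrative, not the actual implementation:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Illustrative helper reflecting the parameters described in the text:
# temperature 0.0 for the Json and Query LLMs, JSON mode for the Json LLM,
# and a slightly higher temperature for the Chat LLM.
def ask(messages, temperature=0.0, json_mode=False):
    kwargs = {"model": "gpt-4-turbo", "messages": messages, "temperature": temperature}
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}  # activates JSON mode
    response = client.chat.completions.create(**kwargs)
    return response.choices[0].message.content
```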
5.1. Agent Pipeline
When the user enters a question, it is sent to the Agent pipeline together with the database descriptors, LLM instances and the current chat history. The task of this pipeline is first to decide whether the question posed by the user can be answered with the available resources. For this purpose, a special decision prompt (Listing 2) is created for the user question. This contains the available databases with their description from the database descriptors, the user question, the current chat history and an answer schema. In this, the response options for the expected answer can be defined in advance. In the first field, “database”, the names of the available databases and the option “None” are given to the model as selection options. It is important to note here that the model can only make one selection in the current implementation.
Listing 2. Decision Prompt.
In the second field, the model should use a boolean value to indicate whether the question can only be answered with information from the specified chat history. Using the JSON mode of OpenAI, the language model follows exactly this schema in its response and does not add any additional explanations to the answer, which makes it possible to parse the response as a JSON object. Depending on the model’s answer, three different paths can be initiated for further processing of the user question: First, if the Json LLM has decided that the question can be answered neither from the previous chat history nor from an available database, the user is shown an apology, which is streamed into the UI and ends the process (1). Secondly, if the question can be answered from the previous chat history, it is sent together with the current history and the Chat LLM instance to the Chat_from_History pipeline (2). The third and last possible case is when the question cannot be answered from the chat history but by a query to an available database (3). Here, an information message about the selected database is streamed back to the UI, and subsequently, the question is sent together with the Query LLM instance and the selected database descriptor to the QA pipeline, where it is further processed. The last two options are explained in more detail below.
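A simplified sketch of this decision step is shown below; the prompt wording, field names, and routing labels are assumptions and do not reproduce the original Listing 2:

```python
import json

# Illustrative sketch of the decision step; prompt wording and field names
# are assumptions, not the original Listing 2.
def decide(question, chat_history, descriptors, json_llm):
    db_options = ", ".join(d.name for d in descriptors) + ", None"
    db_overview = "\n".join(f"- {d.name}: {d.description}" for d in descriptors)
    answer_schema = (
        '{"database": "<one of: ' + db_options + '>", '
        '"answerable_from_history": <true or false>}'
    )
    prompt = (
        "Available databases:\n" + db_overview + "\n\n"
        "Chat history:\n" + chat_history + "\n\n"
        "Question: " + question + "\n\n"
        "Respond as a JSON object following exactly this schema:\n" + answer_schema
    )
    decision = json.loads(json_llm(prompt))  # JSON mode keeps the output parseable

    if decision["answerable_from_history"]:
        return "chat_from_history"      # path (2): answer from the chat history
    if decision["database"] == "None":
        return "unsuccessful_response"  # path (1): apology streamed to the UI
    return decision["database"]         # path (3): forward to the QA pipeline
```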
5.2. Chat from History Pipeline
The purpose of this pipeline is to generate an answer to the question based only on the current chat history. A chat history contains three different types of messages: SYSTEM, USER and ASSISTANT. The system message is added at the beginning of each conversation and assigns a role to the language model. All questions created by the user are flagged as USER messages. Full-text responses from the Chat LLM, database query results and also auxiliary information, such as which database was selected, are all categorized as ASSISTANT messages. To prevent irrelevant information such as the auxiliary information from being taken into account when answering on the basis of the history and thus possibly distorting the answer, a process parameter is added to each message. This is a boolean value which specifies whether a message should be included in such tasks or not. In order to be able to generate a full-text answer with the Chat LLM, another customized prompt (Listing 3) is used, which contains the current formatted chat history, as well as the user question.
Particularly important in this prompt is the request to the model that the information provided is authoritative. This ensures that it neither adds internal knowledge to its response that does not originate from the chat history, nor attempts to correct potentially incorrect statements. When the Chat LLM has finished generating the answer, it is streamed back to the UI and presented to the user and the process is complete.
Listing 3. Chat from History Prompt.
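The message filtering and prompt construction described above could be sketched as follows; the message structure and prompt wording are assumptions and do not reproduce the original Listing 3:

```python
# Illustrative sketch of the history filtering and prompt construction;
# message structure and wording are assumptions, not the original Listing 3.
history = [
    {"role": "SYSTEM", "content": "You are a helpful database assistant.", "process": False},
    {"role": "USER", "content": "Which actors played in The Matrix?", "process": True},
    {"role": "ASSISTANT", "content": "Selected database: MovieDatabase", "process": False},
    {"role": "ASSISTANT", "content": "Keanu Reeves, Carrie-Anne Moss, ...", "process": True},
]

def chat_from_history_prompt(question, history):
    # Only messages flagged with process=True are considered relevant.
    relevant = "\n".join(
        f'{m["role"]}: {m["content"]}' for m in history if m["process"]
    )
    return (
        "Answer the question using ONLY the following chat history. "
        "Treat the information in it as authoritative; do not add external "
        "knowledge and do not correct it.\n\n"
        f"Chat history:\n{relevant}\n\nQuestion: {question}"
    )
```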
5.3. Query Answering (QA) Pipeline
The task of this pipeline is to answer the user question from the corresponding database. The first step in this process is to generate a suitable CQL statement to query the underlying Neo4j database. For this purpose, an additional customized prompt (Listing 4) was created, which is used for the Query LLM. In addition to the user question, this contains the schema of the selected graph, which lists all nodes and the relationships between them, as well as the properties of both. This schema is dynamically retrieved from the graph database for prompt creation, using the connection details from the selected DB descriptor. Furthermore, example queries can also be added for few-shot prompting, which the model can use as a guide for query generation. In addition to the temperature setting of the Query LLM, the prompt also includes instructions to strictly adhere to the provided schema. This is to prevent the model from creating new nodes/relationships or properties that do not exist in the graph and thus causing errors in the database during execution. The model is also instructed not to explain the generated Cypher statement to ensure that the query can be parsed for the database without problems. After the CQL statement has been generated by the Query LLM, it is first streamed back to the UI and presented to the user before being executed via the database client of the database descriptor. Depending on the result of the request, two different processing paths are now taken. If the query could be executed on the database without errors, the raw query result is first streamed back to the UI in JSON format. This allows the user to check it and recognize possible errors in the plain text response. The query result is then transferred to the Chat LLM. A customized prompt, similar to the chat from history prompt, was also created for this purpose, which contains the raw JSON result and the user question.
Listing 4. Generate Query Prompt.
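A simplified sketch of how such a query generation prompt could be assembled is given below; the wording and the example question–query pair (against the Neo4j movie sample graph) are assumptions and do not reproduce the original Listing 4:

```python
# Illustrative sketch of the query generation prompt; wording and the example
# pair are assumptions, not the original Listing 4.
def generate_query_prompt(question, graph_schema, examples=()):
    shots = "\n".join(f"Question: {q}\nCypher: {c}" for q, c in examples)
    return (
        "Generate a Cypher query for the question below.\n"
        "Strictly use ONLY the node labels, relationship types, and properties "
        "from the following schema:\n"
        f"{graph_schema}\n\n"
        f"{shots}\n\n"
        "Return only the Cypher statement, without any explanation.\n"
        f"Question: {question}"
    )

# Hypothetical few-shot example against the Neo4j movie sample graph:
examples = [(
    "Which actors played in The Matrix?",
    "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'The Matrix'}) RETURN p.name",
)]
```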
The model is also asked to consider the result as authoritative and absolute and not to try to correct or explain it. This is to ensure that the generated full-text answers are concise, brief, and accurate to the initial question, and only contain information which can be found in the database. When the model has finished generating the answer text, it is streamed to the UI and presented to the user. This concludes the processing procedure. However, if the generated query has caused an error such as a syntax or semantic error during execution on the database, the processing path through the error correction module is taken. Because the Cypher language of Neo4j has an extensive syntax and can therefore be very error-prone, smooth query generation can never be guaranteed; this module was created to automate the error correction of CQL queries as far as possible and thus improve the user experience. The task of the module is to improve the initially generated query depending on the user question and the error message. In the event of an error, an error correction prompt (Listing 5) is first created. Analogous to the query generation prompt, it contains the user query, the schema of the graph, and few-shot examples.
In addition, the incorrect CQL statement and the error that the Neo4j database has returned are also included. This is then sent to the Query LLM, which generates an improved query, which is then executed a second time on the database. If the fixed CQL statement again causes an error, a new error correction prompt is created and the process starts again. As can be seen, the correction of the faulty query happens in a loop. To prevent this from potentially running for an infinitely long time, the maximum number of attempts was set to three in the implementation via a hardcoded parameter. When the maximum number of retries is reached, the module aborts the cycle and streams an error message back to the UI, asking the user to rephrase and resubmit their question. However, if the error could be resolved, or the revised Cypher query no longer caused an exception, an information message is streamed back to the user in the UI. Following this, identical to error-free query generation, the database result is streamed back into the UI and passed to the Chat LLM to generate the full-text response. After this has also been streamed back to the UI, the processing of the request is finished. A major advantage of this approach is that the error correction module is error-agnostic. This means that it is not necessary to define the possible error types and the corresponding reaction to them beforehand; instead, the Query LLM decides independently how to deal with the error. Furthermore, this also ensures that each correction is customized to the query instead of using generic approaches.

At the beginning of this section, it was emphasized that the chatbot was built in such a way that potentially several similar, but also different, databases can be used. In the current implementation, this is achieved by adapting the chatbot to a specific database type using only prompts, more precisely the query generation and error correction prompts. It must therefore be possible to exchange these two prompts flexibly depending on the database selected by the Json LLM. This is achieved by storing these prompts in the database client from the database descriptor. If, for example, a new SQL database is to be stored in the chatbot, its client, and thus also the two prompts, must first be created programmatically. Then, depending on the selection of the database, the respective prompts and client connector are passed on to the QA pipeline, which can then be used to create suitable queries in the correct language. In this way, it is possible to use only one pipeline for the query generation of many different languages, which significantly improves maintainability and minimizes code overhead. In addition, the modular design of the pipelines makes it possible to easily add new pipelines or change the processing sequence.
Listing 5. Error Correction Prompt.
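The correction loop described above could be sketched roughly as follows; the function names, prompt wording, and structure are assumptions, with only the limit of three attempts taken from the text:

```python
MAX_CORRECTIONS = 3  # hardcoded retry limit, as described in the text

# Illustrative sketch of the error-agnostic correction loop; names and prompt
# wording are assumptions, not the actual implementation.
def execute_with_correction(question, schema, query, db_client, query_llm):
    for attempt in range(MAX_CORRECTIONS + 1):
        try:
            return query, db_client.execute(query)  # raw result on success
        except Exception as error:
            if attempt == MAX_CORRECTIONS:
                break
            correction_prompt = (
                "The following Cypher query failed on the database.\n"
                f"Graph schema:\n{schema}\n\n"
                f"Question: {question}\n"
                f"Failed query: {query}\n"
                f"Error message: {error}\n"
                "Return only a corrected Cypher statement, without explanation."
            )
            query = query_llm(correction_prompt)
    raise RuntimeError("Query could not be corrected; please rephrase the question.")
```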
5.4. Demonstration
This section provides a brief overview of the graphical user interface and application of the chatbot.
Figure 3 shows the Gradio UI of the bot and the four possible communication scenarios. As the interface in (1) shows, the UI consists of three main components. In the center of the screen is the chat box for all interaction with the bot. All user questions and any answers or help information from the bot are streamed into this. Above this, the available database descriptors are listed in a table with the database abbreviation and a short description. This should help the user to gain a quick overview of the available databases and their content. Below the chat is the input box, in which the user can ask and send questions to the bot. The first example shows the scenario in which the user asks a question that the chatbot cannot answer either from the chat history or with the help of a database. In the second exemplary chat process, an answer to the user’s question is provided by a query to the database. Here, it can be observed how the chatbot first tells the user which database it uses to answer the question. It then presents the generated query and, after it has been executed, the number of results it has found. In this chat message, a dropdown is also added, which contains the raw JSON result of the database query. The last message of the bot displays the full-text answer to the question. The third scenario shows the answering of a user request with the help of the chat history. For better visualization, in this simplified example, the same question was asked twice in a row, but in reality it would also be possible for the chatbot to refer to messages further back in the past. In this conversation, it can clearly be seen how the bot tells the user that it has decided to answer the question using the history. It can also be seen here that, due to the rather low value of the temperature parameter of the
Chat LLM, and the requirement in the prompt to adhere strictly to the history, the bot tries to deviate its answers as little as possible from the previous answers. The last scenario shows the error correction module in action. The user question asked in this example was “How many stations are between Snoiarty St and Groitz Lane?” Analogous to regular CQL generation, the user is again presented with the initially created Cypher statement. However, as a timeout error occurs on the database when this query is executed, it is fixed within the error correction module, and the user is informed of the error and presented with the revised version. In the event that the error in the query could not be resolved on the first attempt, this message is presented to the user for each new version, as long as the loop is running. When the error has been fixed, the raw database results in the dropdown and the full-text answer will be displayed again, identical to the second scenario.
6. Evaluation
For the evaluation of the chatbot, a suitable test data set for the text to CQL task was first required. In
Section 3 of this work, it was already shown that, with Spider [
40] and WikiSQL [
36], established and famous data sets for text to SQL exist. In addition, there are also widely used data sets for the text to SPARQL task in the area of graph databases with LC-QuAD [
53] and QALD [
54] series. Considering this, a suitable test data set for the generation of Cypher queries was also searched for in order to evaluate the translation capabilities of the chatbot. However, after an extensive search, it was discovered that only a few exist or have been made publicly available. A total of four data sets were found: Guo et al. [
67] created a Chinese dataset with 10,000 native-language–Cypher query pairs as part of their work. While these were published on GitHub [
68], the underlying knowledge graph needed for query generation is only available for download on the Chinese website Baidu, which can only be accessed from selected countries, currently not including Germany. Chatterjee et al. [
59] and Kobeissi et al. [
57] created test data sets for their respective use cases in the area of maintenance information on wind turbines [
69] and process execution data [
70], respectively. Although the question–CQL pairs and the available graph database are publicly available, they could not be used for evaluation. The reason for this is that the schemas of the graphs are rather complex and, therefore, very extensive, which would result in large query prompts. Since each of these contains the schema of the graph and the properties of the nodes/connections, the resulting costs for the OpenAI API would significantly exceed the cost constraints of this work. The fourth test data set found was CLEVR [
71]. The graph underlying CLEVR simulates an artificial subway network inspired by the London tube and train network. The included nodes and relationships have been expanded to include a variety of properties such as cleanliness or music played, which can be queried. The repository contains scripts for generating a random graph and the corresponding test dataset. Due to the relatively lightweight nature of the graph, which nonetheless allows complex queries, it was decided to use this dataset to evaluate the text to CQL capabilities of the chatbot. In addition to the generation of Cypher statements, other capabilities of the chatbot, such as database selection and response from history, were also evaluated. Since there are no test data sets for such tasks, these were constructed by hand on the basis of the available graphs. In total, three different Knowledge Graphs were used across all parts of the evaluation, the characteristics of which are briefly presented in
Table 2. The Movie Graph contains information about movies and people who were in certain relationships to them, for example actors or directors. Northwind represents a traditional retail system with products, orders, customers, suppliers and employees.
Both of these graphs are part of the sample datasets provided directly by Neo4j [
72]. In the following, the evaluation methods used are explained and the results presented.
For the evaluation of all experiments, the Exact Set Match Accuracy (EM) metric was utilized to ensure a consistent measurement of performance across different evaluation targets [
73].
The EM metric is calculated by comparing each predicted answer, denoted as $\hat{Y}$, with the ground truth answer $Y$. This comparison is performed for all $N$ instances within the respective dataset. If the predicted answer exactly matches the ground truth, it is considered correct. The final accuracy is then derived by taking the ratio of correctly matched instances to the total instances in the dataset. Given that each experiment in this study targeted different aspects of the chatbot’s capabilities, tailored question schemas were necessary to align with specific evaluation goals. These schemas were designed to accommodate the unique requirements of each experiment, ensuring that the evaluations accurately captured the performance of the chatbot in generating CQL, selecting the appropriate database, and providing responses based on conversational context. Descriptions of these question schemas, as well as further analysis of the EM metric results, are presented in the subsequent sections.
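Restating this definition in formula form, with $\mathbb{1}[\cdot]$ denoting the indicator function over the $N$ instances:
$$\mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{Y}_i = Y_i\right].$$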
6.1. Database Decision Evaluation
The aim of this evaluation was to find out how accurate the chatbot is at selecting the right database to answer the user’s question. For this, 282 pairs of questions and correct database selections were created by hand, covering all databases from
Table 2. In order to be able to evaluate whether the bot is also able to correctly recognize that it cannot answer a question with the available databases, the test data set also contains questions that are not related to one of the databases. Consequently, the correct choice for these questions was expected to be “None”. Only the
Json LLM was used for the evaluation, as only this part of the chatbot is responsible for the database selection.
The schema for database selection was structured as follows: “Given the [Database Descriptions] and [question], determine which database is best suited for the task.” In the evaluation dataset, the ground truth answers were the names of the respective databases, serving as the benchmark for accuracy in database selection.
The options available for selection were “MovieDatabase, Northwind, CLEVR, None”. For each of the databases, a short and concise description of its content was given in the corresponding descriptor. The selection accuracy is defined by the number of correctly selected databases divided by the total number of questions. As can be seen in
Table 3, the chatbot, or rather the
Json LLM, is very good at selecting the correct database for the user’s question, reaching an overall selection accuracy of 96.45%. Only the results for the CLEVR data set are a little less accurate in comparison. After closer examination, the reason behind this is that some of the questions in this test data set do not contain any keywords from the CLEVR database description, which prevents the model from finding an assignment to a descriptor. In addition, due to the fact that CLEVR was generated synthetically, none of the station names or subway lines have an equivalent in the real world that the LLM could reference from its internal knowledge. This is different in the movie dataset. Here, the model can conclude from its training data that, for example, the person “Keanu Reeves” is an actor and that a question containing this person can be answered with the help of the movie graph.
6.2. Chat from History Evaluation
Further evaluation aimed to assess the ability of the
Json LLM to effectively reuse information from previous interactions in the chat history. The chatbot was tested on a series of questions within the Movie and CLEVR databases, categorized into zero-step, one-step, two-step, and three-step reasoning questions, as shown in
Table 4. Throughout the process, the ability to recognize the question in the chat history is evaluated, not necessarily the correctness of the answer, meaning the results can be either true or false. For evaluating the chatbot’s capability to reference prior interactions, the following schema was used: “Given the current [chat history] and [question], determine if the question can be answered solely using the chat history.” Here, the ground truth answers were boolean values, indicating whether the required information was indeed present in the conversation history.
Zero-step reasoning describes the posing of an initial question without any historical context for that question, meaning the model should correctly output “false” in such cases. In one-step reasoning, the initial question is asked once again, and it is checked whether the question is recognized in the existing history and whether the history is used to answer it. In this case, the “true” result would be appropriate. Two-step reasoning now merges the first question with the second question in the chat history and asks a composite question. It then checks whether the chatbot recognizes this combined question from the chat history and handles it correctly. Again, “true” would be the proper return value. With increasing complexity, an analogous approach was applied to three-step reasoning. For each reasoning step, 214 questions were asked, resulting in a total of 856 questions.
The evaluation revealed varying patterns in response accuracy. The results revealed the model’s ability to perform zero-step reasoning questions with 100% accuracy, demonstrating flawless decisions when there is no prior input. In contrast, one-step reasoning questions showed a decline in accuracy, with the model correctly using chat history 82.24% of the time. This trend continued with two-step reasoning questions, where the model’s accuracy decreased to 79.44%. However, an interesting pattern appeared in the three-step reasoning category, where accuracy increased to 87.38%. This indicates that although the model struggles with intermediate complexity, it is better at handling more complex referencing tasks, which likely involve more robust integration of contextual information.
6.3. CQL Query Generation Evaluation
The evaluation further investigated the effectiveness of CQL statement generation of the model using the CLEVR dataset, as shown in
Table 5. The schema for CQL generation was framed as: “Given the selected [graph schema] and [question], generate a Cypher query”. This schema required the chatbot to formulate accurate CQL queries based on the graph’s structure and the specified query intent. Each entry in the test data set contains an English question, a so-called “gold query” that generates the correct query result, and the raw query result. Two distinct setups were compared: zero-shot prompting, where the system generates queries without prior context, and few-shot prompting, where the system utilizes a small number of examples (four in this case) to support its query generation. In order to evaluate the ability of the
Query LLM to generate database queries from natural language input, both the raw query results of the generated query and the full-text responses were considered. In this approach, we checked both whether the raw database results matched and whether the generated full-text response was correct. The reason for this is that some test questions from CLEVR are aimed at yes/no answers, so the result set of the generated statement may be empty. In this case, the answer would be counted as wrong if our chatbot answered that it does not know the answer. The so-called execution accuracy, presented by Guo et al. [
67], was used as a suitable evaluation metric. This describes how many of the generated CQL statements produce a correct query result in relation to all generated queries. A conscious decision was made not to use metrics such as logical accuracy, which check whether the generated CQL query is identical to the gold query, because the evaluation was about how well the system is able to generate statements that produce correct answers, not identical CQL statements. The evaluation was performed using 500 questions per prompt type; syntax and semantic errors, as well as timeout errors, were also recorded as indicators of the system’s proficiency.
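Expressed as a formula, this metric corresponds to
$$\mathrm{EX} = \frac{\#\{\text{generated queries whose execution result matches the gold result}\}}{\#\{\text{generated queries in total}\}}.$$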
The findings revealed a notable discrepancy in the agent’s performance between the zero-shot and few-shot prompting. In the zero-shot scenario, where the chatbot was required to generate queries without prior examples, the execution accuracy stood at 61%. However, the model’s accuracy improved significantly in the few-shot context, reaching 92.8%. This indicates the chatbot’s ability to learn and adapt from examples, improving its query generation capabilities. In particular, the few-shot approach eliminated syntax and semantic errors, with zero instances detected, while the zero-shot approach produced 48 syntax errors and 20 semantic errors.
6.4. Performance Evaluation
To ensure practical application in real-world scenarios and to identify any potential bottlenecks in the system, the performance times during the execution of all evaluation tasks have been measured (see
Table 6). To evaluate the performance of our approach, we use the average duration (in seconds) of all observations as the primary performance measure. Mathematically, the performance measure is defined as
$$\bar{t} = \frac{1}{n} \sum_{i=1}^{n} t_i,$$
where $t_i$ represents the duration (in seconds) for the $i$-th observation, and $n$ is the total number of observations. These measurements were gathered simultaneously during the evaluations described in the previous sections, starting with the database selection times for each database, including Movie, CLEVR, Northwind, and None if no applicable database is the correct reference for the model. Then, the chat history evaluation times were measured with increasing complexity from zero-step reasoning to three-step reasoning. Lastly, the times of zero-shot and few-shot prompting were compared, differentiating between query generation time, query execution time, and answer generation time. Overall, the evaluation provided differentiated insights into the performance of our system. As for the database selection task, the CLEVR dataset revealed slightly higher durations compared to the other datasets, which is consistent with the previous findings of slightly lower accuracy. When examining the reasoning tasks, we observed a trend where increased context led to slightly longer durations.
However, the model showed stable performance in scenarios with extended context lengths. The three-step reasoning trials showed that the model could maintain efficient processing times despite the added complexity. During zero-shot prompting, query generation was the most time-consuming process, confirming the general assumption that the initial context setup requires a high amount of resources. In contrast, query execution was performed in near real-time, demonstrating the model’s efficiency in translating queries into database actions. Notably, performance time improvements were observed, as expected, in the few-shot query scenarios. The model used previous examples to optimize query generation, resulting in a reduced duration for subsequent tasks.
7. Discussion and Limitations
In this paper, an innovative approach for the creation and execution of Cypher queries for Neo4j by a chatbot was presented, which is characterized by its ability to select, from several predefined databases, the appropriate one to answer a question, and to recognize when a question can only be answered from the current chat history. However, this approach involved both technical and conceptual challenges, which are discussed below. One of the primary challenges was the problem of hallucination, where the chatbot generates queries with incorrect syntax or properties. The results show that 24.62% of all incorrectly generated queries in a zero-shot setting are attributed to syntax errors. This underlines the need for continuous refinement of the model’s understanding and generation capabilities, especially in the context of a specialized and extensive query language such as CQL. Furthermore, this also emphasizes the importance of developing error-checking mechanisms, as implemented in this work, to identify and correct syntax errors prior to query execution. Considering this, the evaluation results also show the compelling ability of GPT-4 Turbo to learn from example queries. With few-shot prompting, we were able to increase the execution accuracy from 61% to 92.8%, with a complete elimination of syntax and semantic errors. This improvement highlights the effectiveness of few-shot learning in increasing the precision and reliability of the model in generating Cypher queries. However, the current implementation of the chatbot also raises important privacy and security considerations. Given the potential of the chatbot to access all information contained in the graph, which may include sensitive information, the current implementation should not be used to handle private data that cannot be disclosed to OpenAI or to external parties in general. For this purpose, alternative Large Language Models, such as Mixtral 8x7B [
74] or Code Llama [
75], should be considered, as they are publicly available and can, therefore, be executed on proprietary hardware for which suitable data protection measures can be taken. Additionally, the chatbot’s ability to generate not only selection, but also deletion and creation queries requires strict countermeasures to prevent unauthorized or unwanted database modifications. For this, different approaches can be taken depending on the database technology used. Neo4j, for example, allows the access rights of a database client to be set to read-only, which prevents any modifying transactions and generates a corresponding error. This solution was chosen in our implementation. If such setting options are not available, another possibility would be the definition of keywords against which the generated queries are checked and, if present, rejected. At the beginning of this project, it was also considered to use the Python framework LangChain for the implementation. With its Neo4j DB QA Chain [
76], this already offers pre-built functionalities for generating and executing CQL on graph databases. Ultimately, however, the decision was made not to utilize this library, as there would have been problems particularly with the dynamic generation of prompts, on which our system is primarily based. The reason for this is that it would have been necessary to extend LangChain’s pre-implemented prompts, for which they are not intended. Consequently, this would have led to a continuous attempt to adapt the program logic to LangChain, which would have required more effort than programming it ourselves. Finally, it is also necessary to note the limitations of the test dataset used for CQL generation. The dataset, based on the CLEVR framework, currently supports 33 different question types, which are extended by permutation and substitution techniques. While this approach demonstrates the capabilities of the chatbot within a limited set of queries, it does not reach the complexity and diversity found in state-of-the-art datasets such as Spider [
40], WikiSQL [
36], or the LC-QuAD/QUALD series [
53,
54]. Due to the lack of state-of-the-art datasets in the text to Cypher domain, the development of a new, more comprehensive evaluation dataset that reflects the diversity and complexity of real-world database query scenarios is essential in order to fully validate and benchmark the performance of our system. At the time of writing, initiatives from the Neo4j community are already underway to fill this gap [
77].
8. Conclusions and Future Work
In this paper, we propose a way to integrate NoSQL databases and Knowledge Graphs with Large Language Models. The paper enriches the field of natural language interfaces by applying the design science research methodology to develop a chatbot that can answer user queries by generating CQL queries. It also provides an extension that allows the bot to select the appropriate database or the chat history to answer the question. Through a comprehensive evaluation, we were able to show that the chatbot is reliable and accurate in generating Cypher statements, as well as in making the right decision regarding the database and chat history. We also provided an overview of the necessary theoretical foundations and of the related literature on the tasks “Text to SQL” and “Text to NoSQL”. The further development of the chatbot was already considered during implementation. Although only graph databases were used as a knowledge base in this paper, the modular architecture of the system was developed in such a way that it is potentially possible to support both multiple and different storage technologies simultaneously, such as relational or document-oriented databases. Furthermore, it was also ensured that it is possible to exchange the Large Language Models in the back-end. Since the processing logic of the system was designed independently of the underlying database technology or LLM, it is possible to extend the chatbot by defining the appropriate adapters that are fed into the pipelines, without having to change the logic in the rest of the system. In this way, the chatbot should form a basis for meeting diverse user requirements in terms of database technology and language models, with the aim of achieving a polyglot persistent system.
Future work could focus on several promising directions to extend the capabilities and flexibility of the chatbot. In this study, only closed-source foundational models were used to support the workflow. However, a valuable area for further research lies in fine-tuning open-source models on text-to-Cypher datasets, allowing a comparison of these models’ performance with the closed-source results achieved in this paper. Adaptive prompt tuning could also significantly enhance the chatbot’s performance. By storing corrected Cypher queries or specific user questions in a vector database, the chatbot could dynamically reference these stored examples when faced with similar queries in the future, reducing the likelihood of repeated mistakes. The use of DSPy, an open-source framework that facilitates prompt optimization through programmatically defined modules, could further streamline this adaptive tuning process. By treating language model interactions as structured modules, DSPy enables a more systematic prompt optimization framework and reduces reliance on manually crafted prompts [
78]. Another promising research direction involves expanding the chatbot’s functionality to clarify ambiguous user questions. When a query lacks specificity, an LLM could engage users with clarifying questions to refine their input, ensuring responses that are accurate and contextually relevant. Frameworks like LangGraph could be used to structure these interactions, enabling iterative dialogues to better capture user intent. This capability would reduce misunderstandings and improve answer quality, creating a more responsive, user-centered experience and advancing the field of natural language interfaces.