1. Introduction
Because of the digital storage of data, enormous databases have recently come to contain substantial knowledge about organizations. These vast data repositories can support research in data analysis and in finding trends and patterns, according to any particular research goal. Increasingly, data repositories concerning medical health, movies and employee data require direct access by users according to their questions and queries. Traditionally, users have needed to learn structured query languages such as SQL to get precise results from relational databases. Not all experts of a particular domain, for example, medicine, know structured query languages, which restricts organizational knowledge to a limited number of users. Therefore, existing data storage is not being utilized to its maximum potential.
Compulsory knowledge of a structured query language for comprehensive access to all aspects of data has become a hurdle. Translating natural language questions into structured query language would provide maximum user access to relational databases. Asking questions in English-language text provides constraint-free access to databases, liberating the user from careful selection and click-based interfaces. Solving the text-to-SQL task can expose the whole relational database for users to utilize and analyze according to their needs and choices. Seeking a solution for this issue leads us toward the field of natural language processing. However, natural language processing techniques alone cannot solve this problem, as the database itself is a critical part of the stated problem in this context, leading us to data-science methods as well.
Combining natural language processing and data science methods may yield the solution to building a natural language interface for databases. Thus, translating natural language queries into structured-language queries is a challenging area that merges natural language processing and data science. The goal is to translate natural language queries into SQL that can be executed on a system so that all kinds of users can access its data. Translating natural language questions into a programming language has been a long-standing problem [
1,
2]. In this work, we have focused on a natural language interface to databases by the generation of executable SQL queries.
Figure 1 briefly illustrates the text-to-SQL task.
Deep learning has been an emerging tool in various research areas such as communication, machine translation, and networking. In a short time, deep learning methods have reached performance competitive with traditional techniques. They have been shown to be a potential tool for growth, even in mature fields with a high bar of entry for new approaches, such as communication [
3,
Deep learning has also exhibited its problem-solving potential and growth in mobile traffic classification. Though port information is not critical when working with deep learning traffic classifiers, mobile classifiers using deep learning methods can identify the applications from which the traffic originates. Feature representations learned directly from the input data during training can be a promising path for improving traffic classifiers [
5].
Deep learning has also shown promising results in the area of encrypted mobile traffic classification. However, deep learning has some limitations here because it is less mature in this area than traditional methods; moreover, its black-box nature allows little human intervention, although it contributes to resolving some of the complex matters involved in improving performance [
6,
7]. Deep learning has recently been adopted for text-to-SQL tasks; keeping in mind its success stories in neighbouring research areas, deep learning shows potential for rapid improvement in this task as well.
The initial work in adapting deep learning to text-to-SQL tasks was primarily based on neural network sequence-to-sequence learning [
8], adopting copying and attention mechanisms and the sequence-to-sequence RNN model to improve translation accuracy. These approaches improved upon the basic framework, and their contributions pioneered NLIDB systems built on deep learning concepts.
Recent work in the text-to-SQL research area mainly focuses on improving two significant aspects: syntactic accuracy and semantic accuracy. Semantic accuracy refers to interpreting the user’s intention correctly and mapping it to a given database schema. The semantic gap involved in mapping the user’s preferences from a natural language query onto schema nomenclature is also known as a mismatch [
9,
10] or lexical problem [
11,
12]. Often, the exact column names are not mentioned in the text query. Instead, a synonym of a column name is mentioned, or the intended column’s value is mentioned. This scenario makes mapping the intended column to the actual column name in the database ambiguous.
Figure 2 illustrates the issue more precisely with an example. The natural language question in
Figure 2 does not contain the exact column name intended in the SQL query, which makes the column prediction in the given example ambiguous.
Bridging the gap between user intentions and data can improve column predictions and, ultimately, impact overall accuracy [
13]. Primarily, this issue of semantic gap occurs during the keyword mapping component of the whole process. Keyword mapping is part of the text-to-SQL translation process in which words from a natural language query are mapped to column names and values from the database schema. Formulating a query with the user’s intended columns and values from the database is the fundamental semantic issue of this task.
This semantic issue has been an intricate part of the text-to-SQL task because NLIDB systems are intended for everyday users who do not necessarily have explicit knowledge of database schemas. Therefore, they cannot precisely specify the schema items in their natural language questions. Finding patterns within previous user queries and extracting co-occurrences between columns, and between columns and tables, can help map the intended schema elements as closely as possible to the user’s choice in the structured query. The pattern calculated from previous query histories provides a picture of typical users’ preferences regarding the database and the columns in question. Therefore, in this work, we have focused on filling the semantic gap between database schemas and users’ intentions with the help of patterns established from previous query data. We have captured such patterns in a co-occurrences graph of the various columns. The graph is explained in detail in
Section 4.1. Co-occurrences graph scores are utilized in two main steps: (1) capturing the occurrences and co-occurrences of columns and tables in previous query data; and (2) integrating those data, as a feature input alongside the other input vectors, in a deep learning NLIDB model.
A similarity measure has been described to capture scores from the column co-occurrences graph effectively. We have performed experiments with a text-to-SQL task on the Spider dataset. The Spider dataset and its difficulty categorization and criteria are explained in
Section 5. Our experiments show that our approach improves exact match accuracy. We have compared our results with a base model of SyntaxSQLNet and two other contributions that have implemented preprocessing on input feature vectors and integrated with SyntaxSQLNet [
14,
15,
16]. We show that our work improved accuracy by up to 10% in an experiment with BERT embeddings. The rest of the paper is organized as shown in
Figure 3 below.
Table 1 below lists all the acronyms used in this article.
3. Related Work
An NLIDB (natural language interface to database) system aims to remove the language hurdle between users and database systems. With NLIDB systems, non-technical users can also interact with the system without knowledge of structured query languages. Therefore, NLIDB systems provide friendlier and broader access to relational databases for a wider range of users. Users query an NLIDB system in plain, natural-language text. The natural language question is then translated into a structured query language (SQL) query to be executed by the database engine and return the user-intended results. Work by [
1] is one of the early contributions to NLIDB, in which the authors presented the NLIDB task with brief examples describing problem statements and emphasized the importance of separating the linguistic task from the database information. To this end, syntax tree-based and semantic-based systems held the leading position, alongside intermediate representations of text and SQL languages. Until this point, the format of the text query had been fixed, such as with fixed-syntax queries and menu-based systems, to limit the associated semantic problems.
Subsequently, Ref. [
2] proposed a method based on knowledge representation to separate the exact-match tokens of natural language questions in the context of the database elements. Moving further along the NLIDB research timeline, Ref. [
23] introduced the tree kernels concept in NLIDB systems, integrating tree kernels to rank candidate queries. Further work in this area belongs to two categories: pipeline-method NLIDBs [
24] and neural-network NLIDBs. Deep learning is used most in research on neural-network NLIDBs. Deep learning NLIDBs can be categorized further into sequence-to-sequence learning [
8] and sequence-to-set learning. For our work, we have adopted sequence-to-set learning.
Sequence-to-set learning for text-to-SQL tasks was initially proposed to solve the ‘order matters’ issue [
25]. In sequence-to-sequence structures, the reward is calculated over the whole generated sequence compared against the gold-standard query. Sequence-to-set learning has been a fundamental approach for much work in NLIDB systems, in parallel with sequence-to-sequence learning, as the order of the columns does not matter with respect to the execution results [
8]. Later in the timeline of NLIDB systems, however, semantic accuracy became one of the significant issues in text-to-SQL processing, as it contributes substantially to overall accuracy. Researchers have attempted to resolve this issue in various ways, such as with slot-filling approaches [
26,
27,
28], which separate the syntactic and semantic issues by dealing with the former via building the syntax from a grammar before predicting the schema elements that populate the slots. This allowed models to focus more on the semantic issue while predicting columns and tables. Then, Ref. [
12] proposed global reasoning by a gating GCN to encode the DB schema for improving the prediction of its elements. Schema linking is another way to tackle mismatch issues, and was recently used in IRNet [
10,
29].
IRNet was extended to a sequence-based approach by [
8]. They incorporated schema linking and intermediate representation into the baseline method. Schema linking, in IRNet, is done by simple string matching to identify the columns and tables mentioned in the question, labelling them with the column’s information type. Although IRNet improved upon the baseline method, it is still not clear how schema linking impacts performance [
10]. They separated the schema-linking task from the overall text-to-SQL task to analyze its impact, showing that separating the schema entities from the questions and using placeholders instead allows the model to focus more on the syntactic component and improves overall query accuracy. Therefore, schema linking impacts overall accuracy, but it is not a perfect solution for semantic issues in text-to-SQL tasks. Additionally, using table content in the process has also improved prediction accuracy [
30]. Despite this accuracy improvement, table contents may not be available in all cases due to confidentiality. “Human in the loop” is another method of improving the output, in the context of capturing the user’s intention, by taking feedback from the user and revising the query accordingly. DialSQL performs post-processing over the output of a prediction model. It takes user feedback in the form of multiple-choice questions: various options related to defined error categories are provided for the user to select from. Then, taking the user’s input into account, the predicted query is revised according to the information the user provides. This process requires the user to be trained and the defined error categories to be explained beforehand. Ref. [
31] advanced the “human in the loop” method by collecting user feedback in natural language and improving the interface’s usability, as compared with DialSQL. With a “human in the loop”, recruiting experts who know the database schema well enough to point to semantic errors is an additional effort compared with the other approaches. Both of these methods perform post-processing over generated output queries.
Various research contributions have proposed the preprocessing of input features to map DB schema elements to the user’s question. Such work, by [
14], implemented data anonymization over the DB schema and text queries before encoding the input features. Data anonymization consists of identifying the DB elements in the question via probability scores. Placeholders then replace the identified DB elements for the training phase. After anonymizing the text question from the DB elements, many training examples become similar and can thus be shared in the training process. Column/cell binding is the final step in generating probability distributions for the DB elements. The authors integrated this preprocessed anonymized data with an existing model, SyntaxSQLNet, for their experiments.
Another example of preprocessing feature vectors to impact a learning model is shown by [
16]. They used column value polarities, generated beforehand, and integrated them with SyntaxSQLNet to better predict data elements. In the present work, we have followed a similar pattern of extending the baseline model by adding a preprocessed input feature, i.e., column occurrences scores. Our contribution is unique in that we have utilized the data from previous queries to find patterns of users’ intentions regarding the columns required by their questions. After a co-occurrences score is calculated, it is integrated with SyntaxSQLNet as an additional feature vector. We have made the required changes to the encoder to accommodate our additional features within the model. These variations and further details of the model are explained in the Methods section.
4. Methods
4.1. Column Occurrences Graph
First, we introduce the column occurrences graph. Following the idea of [
32], the column occurrences graph is built from the query log or, in the case of the Spider dataset, from the previous SQL queries of a particular database in the training data. An example of such a set of queries is shown in
Figure 4.
Figure 5 shows the column occurrence frequencies in the set of queries.
Figure 6 shows the column occurrences graph, where nodes represent the frequency of individual column occurrences and edges represent the co-occurrences of the columns. The intuition behind this graph is to capture the user’s intention for clearer column prediction. Primarily in cases where the exact column name is not mentioned in the text question, the prediction of one column can assist in predicting the other columns as well. Our method is an extended version of [
32], in that we compute occurrences and co-occurrences of columns specifically, explicitly excluding the “from” part of queries to avoid noise and repetition. Instead, tables are concatenated with their column names, column types and relations, following the encoding method of [
27]. To capture this intuition, graph G contains vertices, v, representing the column occurrences and edges, e, representing the co-occurrences of the involved schema elements in a particular database’s query set. Following the idea of [
32], the Dice similarity coefficient is used to reflect column co-occurrences as follows:

$$\mathrm{Dice}(c_{i}, c_{j}) = \frac{2 \cdot \mathrm{cooc}(c_{i}, c_{j})}{\mathrm{occ}(c_{i}) + \mathrm{occ}(c_{j})},$$

where $c_{i}$ and $c_{j}$ are a pair of columns at a given time, $\mathrm{cooc}(c_{i}, c_{j})$ is their co-occurrence count, and $\mathrm{occ}(c_{i})$ and $\mathrm{occ}(c_{j})$ are the occurrence counts of the respective columns. Finally, an accumulated Dice coefficient is calculated over all column pairs.
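To make the construction concrete, the following Python sketch (not the authors’ released code; the column-extraction step is deliberately simplified to membership checks over a known schema vocabulary) builds the occurrence and co-occurrence counts from a small query log and scores a column pair with the Dice coefficient defined above.

```python
import itertools
import re
from collections import Counter

def extract_columns(sql, schema_columns):
    """Return the schema columns mentioned in a query, ignoring the FROM clause."""
    body = re.sub(r"\bfrom\b.*?(?=\bwhere\b|\bgroup\b|\border\b|$)", " ",
                  sql, flags=re.IGNORECASE | re.DOTALL)
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", body.lower()))
    return {c for c in schema_columns if c in tokens}

def build_cog(query_log, schema_columns):
    occ, cooc = Counter(), Counter()   # node weights (occurrences) / edge weights (co-occurrences)
    for sql in query_log:
        cols = extract_columns(sql, schema_columns)
        occ.update(cols)
        cooc.update(frozenset(p) for p in itertools.combinations(sorted(cols), 2))
    return occ, cooc

def dice(occ, cooc, c1, c2):
    """Dice(c1, c2) = 2 * cooc(c1, c2) / (occ(c1) + occ(c2))."""
    denom = occ[c1] + occ[c2]
    return 2.0 * cooc[frozenset((c1, c2))] / denom if denom else 0.0

# toy usage on a two-query log
log = ["SELECT name, age FROM student WHERE age > 20",
       "SELECT name FROM student ORDER BY age"]
occ, cooc = build_cog(log, {"name", "age"})
print(dice(occ, cooc, "name", "age"))   # -> 1.0 on this toy log
```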
This section elaborates on our work and its implications for minimizing the ambiguities in the column prediction phase of the text-to-SQL task. First, we describe a problem statement for the semantic component of the text-to-SQL task. After that, the workings and components of the base model are explained as background for our extended work. The rest of the section addresses how the column occurrences score is applied in this work.
4.2. Problem Statement
Given a natural language text query Q and a database schema, primarily consisting of column names concatenated with table names and type information and labelled COL, our goal is to improve column prediction accuracy and ultimately enhance the model’s overall accuracy. Utilizing previously issued queries on a particular database can minimize the gap between the predicted columns and the user’s intended output columns. Given the precalculated column-occurrences graph score, COG, along with Q and COL, the final goal is to generate the column probability distribution $P^{\mathrm{val}}_{\mathrm{COL}}$ and, ultimately, the corresponding SQL query. Text queries or natural language questions are treated as a series of tokens to feed into the encoder–decoder model.
4.3. Base Model
Our base model, SyntaxSQLNet [
27], is an encoder–decoder, grammar-based, slot-filling approach. The encoder encodes the columns, the natural language question and the history tokens as inputs, applying an attention mechanism to embed the most relevant question tokens in the context of the columns. A weighted sum is calculated to attend the hidden states of the question tokens to the columns. The decoders of the model predict the SQL syntax via a grammar that calls modules based on the history tokens. For semantic slot filling, the model has nine separate modules, each with its own independent biLSTM decoder. The column-predicting module is one of these nine modules and is trained separately.
The column module’s intuition is formulated in two parts: finding the total number of columns in the query, followed by the prediction of the columns’ values. The first part computes the number of columns in the query as

$$P^{\mathrm{num}}_{\mathrm{COL}} = \operatorname{softmax}\left(W^{\mathrm{num}}_{1} H^{\mathrm{COL}}_{Q/\mathrm{COL}} + W^{\mathrm{num}}_{2} H^{\mathrm{COL}}_{HS/\mathrm{COL}}\right),$$

where $H^{\mathrm{COL}}_{Q/\mathrm{COL}}$ is the hidden state of the question-to-column word embedding, $H^{\mathrm{COL}}_{HS/\mathrm{COL}}$ represents the hidden state of the last decoded history element attended to the columns, and $W^{\mathrm{num}}_{1}$ and $W^{\mathrm{num}}_{2}$ are trainable parameters. Softmax was used for the probability distribution.
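As an illustration only (the tensor shapes, names and the cap on the number of columns are our assumptions, not the released SyntaxSQLNet implementation), the first part of the column module can be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class ColumnNumPredictor(nn.Module):
    """Predict how many columns the query needs from two attention states."""
    def __init__(self, hidden: int, max_cols: int = 6):
        super().__init__()
        self.W1 = nn.Linear(hidden, max_cols)  # weight on the question-to-column state
        self.W2 = nn.Linear(hidden, max_cols)  # weight on the history-to-column state

    def forward(self, h_q_col: torch.Tensor, h_hs_col: torch.Tensor) -> torch.Tensor:
        # h_q_col, h_hs_col: (batch, hidden); returns (batch, max_cols) probabilities
        return torch.softmax(self.W1(h_q_col) + self.W2(h_hs_col), dim=-1)

# toy usage
predictor = ColumnNumPredictor(hidden=128)
probs = predictor(torch.randn(2, 128), torch.randn(2, 128))   # shape (2, 6)
```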
4.4. Encoder
The word embedding of question tokens, the schema, and history tokens is obtained from a pretrained GloVe [
18] embedding. The current decoding history is further encoded with a biLSTM, denoted by HS [
27]. We adopted the idea from [
33] for schema encoding to capture the self-attention of the schema elements. Schema encoding starts by obtaining the embedded column names, whereafter table names are embedded, as are column types. These initial embeddings are concatenated together. Self-attention is used between columns to capture the internal structure of the schema more effectively, with table names also integrated. On top of the self-attention layer, another biLSTM layer is applied to connect the elements into the final column encoding. The table schema encoding process is portrayed in
Figure 7.
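A minimal PyTorch sketch of this schema encoder is given below; the embedding sizes, the use of nn.MultiheadAttention for the column self-attention, and the input layout are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class SchemaEncoder(nn.Module):
    """Concatenate column/table/type embeddings, apply column self-attention, then a biLSTM."""
    def __init__(self, emb: int, hidden: int, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(3 * emb, hidden)                      # [column ; table ; type]
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.bilstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)

    def forward(self, col_emb, table_emb, type_emb):
        # each input: (batch, n_columns, emb), e.g. averaged GloVe vectors per name
        x = self.proj(torch.cat([col_emb, table_emb, type_emb], dim=-1))
        x, _ = self.self_attn(x, x, x)        # self-attention between the schema columns
        h_col, _ = self.bilstm(x)             # (batch, n_columns, hidden) column encodings
        return h_col

encoder = SchemaEncoder(emb=300, hidden=128)
h_col = encoder(torch.randn(1, 12, 300), torch.randn(1, 12, 300), torch.randn(1, 12, 300))
```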
Following [
27], after tokenization and GloVe embedding, the natural language question is encoded with a biLSTM. To fully capture the meaning and context of the natural language question with respect to the available columns and tables in the database schema, an attention mechanism layer is applied to the hidden-state outputs of the question tokens’ biLSTM and the columns’ biLSTM, generating the question-to-column attention encoding.
Figure 8 shows the NL question tokens encoding process.
Column occurrences scores have a graphical structure, with nodes as columns and edges as their co-occurrences. The embedded column names and occurrence scores are integrated. Along with the other input features, the column occurrence scores are also embedded via GloVe embedding, and then a biLSTM is applied. The hidden state of this biLSTM is further processed with the columns’ hidden states to implement an attention mechanism between the question embedding and the column occurrences embedding, finding the relevant pairs of co-occurrences and their scores.
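The following sketch shows one plausible way to wire this in (the tensor names, shapes and the exact attention form are our assumptions): the Dice scores are attached to the embedded column pairs, encoded with a biLSTM, and attended against the question encoding.

```python
import torch
import torch.nn as nn

class COGEncoder(nn.Module):
    """Encode column co-occurrence pairs and their scores, attended by the question."""
    def __init__(self, emb: int, hidden: int):
        super().__init__()
        self.bilstm = nn.LSTM(emb + 1, hidden // 2, bidirectional=True, batch_first=True)
        self.W = nn.Linear(hidden, hidden, bias=False)   # trainable attention weight

    def forward(self, pair_emb, pair_score, h_q):
        # pair_emb:   (batch, n_pairs, emb)   embedded column-pair names (e.g. GloVe)
        # pair_score: (batch, n_pairs, 1)     precomputed Dice co-occurrence scores
        # h_q:        (batch, q_len, hidden)  question-token hidden states
        h_cog, _ = self.bilstm(torch.cat([pair_emb, pair_score], dim=-1))
        attn = torch.softmax(h_q @ self.W(h_cog).transpose(1, 2), dim=-1)  # (b, q_len, n_pairs)
        return attn @ h_cog   # question tokens enriched with the relevant co-occurrence scores
```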
4.5. BERT Embedding of Input Features
Another option is to encode the input features using a pretrained BERT model. The question tokens and schema elements are concatenated into one sequence with a separator token. That sequence of input tokens is then provided to the BERT model, and the final hidden states of BERT are used as the embedded input for the decoder. BERT has previously been shown to improve the overall accuracy of many models. We conducted one experiment with BERT as well, to exhibit its compatibility.
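A small sketch of this alternative using the Hugging Face transformers library (our choice of tooling; the model name and the schema serialization are illustrative assumptions) is shown below.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

question = "How many singers are older than 30?"
schema = "singer.name singer.age singer.country concert.year"   # serialized schema elements

# the tokenizer builds: [CLS] question tokens [SEP] schema tokens [SEP]
inputs = tokenizer(question, schema, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
embedded = outputs.last_hidden_state   # (1, seq_len, 768), used as the embedded decoder input
```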
4.6. Decoder
The decoder used in SyntaxSQLNet [
27] is a recursively activating, sketch-based, sequence-to-set approach. Nine separate decoder modules, one for each task, are trained independently from each other. A grammar is used to structure the syntax of a query and the calling of modules for each prediction. The decoder history path is encoded as part of the input to decide the next module with which to predict the next token. The sequence-to-set approach is used to avoid the ‘order matters’ problem caused by sequence-to-sequence approaches [
8], as identified by [
27]. For example, “select id, name” is equivalent to “select name, id”; however, the traditional sequence-to-sequence approach penalizes even correct queries when the order of the set elements differs from the ground truth. The attention mechanism in [
27] is generalized for all embeddings, as follows:

$$H_{1/2} = \operatorname{softmax}\left(H_{1} W H_{2}^{\top}\right)^{\top} H_{1},$$

where $W$ is a trainable parameter and $H_{1}$ and $H_{2}$ are the last hidden states of the two biLSTMs being attended. Softmax was used for the calculation of the probability distribution. The output of each module is computed in a sketch mode, independently of the others and whenever required, according to the syntax grammar [
27].
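In code, this generalized attention can be sketched as a single function (an assumed reading of the equation above, not the original implementation):

```python
import torch
import torch.nn as nn

def conditional_attention(h1: torch.Tensor, h2: torch.Tensor, W: nn.Linear) -> torch.Tensor:
    """Attend the h1 states with respect to each h2 element through a trainable W."""
    # h1: (batch, n1, hidden), h2: (batch, n2, hidden), W: nn.Linear(hidden, hidden, bias=False)
    scores = torch.softmax(W(h1) @ h2.transpose(1, 2), dim=1)   # (batch, n1, n2), softmax over h1
    return scores.transpose(1, 2) @ h1                          # (batch, n2, hidden)

# toy usage: question states conditioned on each column
W = nn.Linear(128, 128, bias=False)
h_q_col = conditional_attention(torch.randn(1, 15, 128), torch.randn(1, 10, 128), W)  # (1, 10, 128)
```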
The column module first computes the number of columns and then the values of the columns of the whole query altogether. We extended the column module by adding column occurrences scores as an input feature and using self-attention between columns instead of simple column embeddings, as follows:
$$P^{\mathrm{val}}_{\mathrm{COL}} = \operatorname{softmax}\left(W^{\mathrm{val}}_{1} H^{\mathrm{COL}}_{Q/\mathrm{COL}} + W^{\mathrm{val}}_{2} H^{\mathrm{COL}}_{HS/\mathrm{COL}} + W^{\mathrm{val}}_{3} H^{\mathrm{COL}}_{COG/\mathrm{COL}} + W^{\mathrm{val}}_{4} H_{\mathrm{COL}}\right)$$

Here, $W^{\mathrm{val}}_{1}$, $W^{\mathrm{val}}_{2}$, $W^{\mathrm{val}}_{3}$ and $W^{\mathrm{val}}_{4}$ are trainable weights. Similarly, the other modules are called according to the decoding history path and the syntax grammar. For further details of the whole decoder and the other modules, we refer the reader to [25].
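A corresponding sketch of the extended column-value scoring (again with assumed names and shapes; the four weights mirror the four encodings combined in the equation above) could look like this:

```python
import torch
import torch.nn as nn

class ColumnValuePredictor(nn.Module):
    """Score each schema column from four attention/encoding states."""
    def __init__(self, hidden: int):
        super().__init__()
        self.W1 = nn.Linear(hidden, 1)   # question-to-column attention state
        self.W2 = nn.Linear(hidden, 1)   # history-to-column attention state
        self.W3 = nn.Linear(hidden, 1)   # column-occurrences (COG) attention state
        self.W4 = nn.Linear(hidden, 1)   # self-attended column encoding

    def forward(self, h_q_col, h_hs_col, h_cog_col, h_col):
        # each input: (batch, n_columns, hidden); returns a distribution over the columns
        scores = self.W1(h_q_col) + self.W2(h_hs_col) + self.W3(h_cog_col) + self.W4(h_col)
        return torch.softmax(scores.squeeze(-1), dim=-1)
```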
6. Results and Analysis
Although our decoder is similar to that of the base model, SyntaxSQLNet, the column occurrences score in our encoder allows the model to incorporate user intentions regarding column prediction from the database. In addition, overall accuracy increased with our model’s greater focus on column prediction. Our model achieved a prominent increase in exact match accuracy.
Table 2 shows the experimental results in the form of exact match accuracy compared with the base model and two other methods from [
14,
16]. Our model’s performance was higher than that of the base model in terms of exact match accuracy. Efforts by [
14,
16] are similar to our work in preprocessing the input features to impact overall model performance. Our integration of column occurrences scores outperformed the other two models by contributing to an accurate understanding of the intended columns from the natural language questions. Column prediction guided by the column occurrences score enhanced the overall accuracy and outperformed the most similar previous work [
16] by 9.7% at our model’s highest accuracy. Work by [
16] extended the SyntaxSQLNet model by integrating column value polarities as a feature vector, while the work of [
14] anonymized the input utterances to conceal lexical problems, minimizing semantic issues before data encoding. The column occurrences score and the self-attention mechanism between columns also contributed to the improved results.
We also experimented with a pretrained language model, BERT, on both our model and the base model. It further improved the exact match accuracy of both models.
Table 3 shows the partial matching accuracy, in terms of F1 scores, for the “select”, “where”, “group by”, “order by” and “keywords” components separately. As shown in the table, the accuracy improvement was smaller for the “group by”, “order by” and “keywords” components than for the “select” and “where” clauses. The reason is that group-by and order-by constructs are less represented in the training data; therefore, there was less margin for improvement. Beyond this observation, our overall exact match improvement shows our approach’s potential for further improvements in this area.