A Cardinality Estimator in Complex Database Systems Based on TreeLSTM
Abstract
1. Introduction
- We propose a cardinality estimation model based on a tree-structured long short-term memory (TreeLSTM) neural network. We added a sampling method to the model to mitigate the space overhead of sampling-based methods and the 0-tuple problem [11]. Meanwhile, we adopted ideas from natural language processing to encode SQL queries so that the model can capture the semantics implied in them. We also constructed the “WHERE” clause as a tree structure and generated tree vectors based on the operators between predicates, allowing the model to capture multi-table join relationships and the sequential dependencies between predicates in a complex query;
- To address the problem mentioned above that the Q-error cannot distinguish between large and small cardinalities, we designed a hybrid loss function that combines two different losses, namely the Q-error loss and the absolute value loss. This loss function builds on the ideas of the Q-error and the absolute error and uses an adjustable parameter to balance their contributions: the Q-error part optimizes the relative error, while the absolute error part distinguishes between large and small cardinalities;
- In addition, we evaluated our model on two large datasets, IMDB [3] and DMV [12], to demonstrate its generalization ability. The experimental results show that our method outperforms traditional and other learning-based cardinality estimation methods, demonstrating the feasibility and superiority of the proposed model.
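The hybrid loss described in the second contribution can be sketched as follows. The weighting parameter `alpha`, the log applied to the Q-error term, and the function name are illustrative assumptions, not the paper's exact formulation:

```python
import math

def hybrid_loss(preds, trues, alpha=0.5):
    """Hybrid of a Q-error-style relative term and an absolute-error term.

    preds, trues: positive estimated and true cardinalities.
    alpha: assumed weighting parameter in [0, 1] balancing the two terms.
    """
    n = len(preds)
    # Q-error = max(pred/true, true/pred) >= 1; its log optimizes relative error
    q_term = sum(math.log(max(p / t, t / p)) for p, t in zip(preds, trues)) / n
    # Absolute error distinguishes large cardinalities from small ones
    abs_term = sum(abs(p - t) for p, t in zip(preds, trues)) / n
    return alpha * q_term + (1 - alpha) * abs_term
```

In actual training, `preds` and `trues` would be tensors and the same expression would be written with differentiable operations; the log keeps the relative term numerically stable.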
2. Related Work
2.1. Traditional Cardinality Estimation Methods
- Summary-based methods. These methods collect statistical information about the database in advance and make independence assumptions to estimate cardinality quickly. For example, histogram-based methods [13,14,15] target real databases in which the value distribution of each column is not uniform: each column in a data table is summarized as an equal-width or equal-depth histogram that records the count in each data range, approximating the real distribution of the data. This approach is relatively simple to implement and widely used, but it is ineffective at capturing correlations between columns. Data sketching [16,17,18,19] is another summary-based cardinality estimation method; its core idea is to use a hash function to map tuple values to a set of positions on a bitmap and to increment count values at the corresponding positions according to the number of occurrences. The cardinality of the tuple values can then be inferred from the number of consecutive zeros or hits in the bitmap. Sketching focuses on saving memory and improving estimation robustness, but it does not apply to range queries.
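As a minimal illustration of the histogram idea (a sketch, not any of the cited implementations; function names are ours), an equal-width histogram estimates the cardinality of a range predicate by assuming uniformity within each bucket:

```python
def build_equal_width_histogram(values, n_buckets):
    """Summarize a column as (lower_bound, bucket_width, per-bucket counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width on constant columns
    counts = [0] * n_buckets
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return lo, width, counts

def estimate_range_cardinality(hist, a, b):
    """Estimate |{v : a <= v <= b}| assuming uniformity inside each bucket."""
    lo, width, counts = hist
    total = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * (overlap / width)  # fraction of the bucket covered
    return total
```

The uniformity assumption inside each bucket is exactly what makes this method cheap, and what makes it blind to cross-column correlation.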
- Sampling-based methods [20,21]. The core idea of these methods is to randomly select a certain proportion or number of tuples from the original data table, execute the query on this sample, and scale the size of the result by the inverse of the sampling ratio to estimate the cardinality in the original database. The effectiveness of this approach depends on the sampling strategy; it works better when the sample reflects most of the data distribution. However, due to the random nature of sampling, a query may match no tuples in the sample even though its true cardinality is nonzero, resulting in large estimation errors.
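The scale-up idea, and how the 0-tuple problem arises, can be sketched as follows (the helper name and the uniform-sampling strategy are assumptions; the cited work uses more refined strategies such as correlated and index-based sampling):

```python
import random

def sample_estimate(table, predicate, fraction, seed=0):
    """Estimate cardinality by evaluating the predicate on a uniform sample.

    Returns 0 when no sampled tuple matches (the 0-tuple problem),
    even if the true cardinality is nonzero.
    """
    rng = random.Random(seed)
    k = max(1, int(len(table) * fraction))
    sample = rng.sample(table, k)
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sample result by the inverse of the sampling ratio
    return hits * (len(table) / k)
```

A selective predicate matching only a handful of rows is likely to hit zero sampled tuples at small `fraction`, which is the failure mode the sampling component of our model is designed to mitigate.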
2.2. Learning-Based Cardinality Estimation Methods
- Data-driven cardinality estimation: This is a class of unsupervised methods that aim to approximate the joint data distribution [22,23]. In [24], the authors used kernel density estimation to estimate cardinality. The core idea is to randomly sample tuples from the data tables; the closer a query is to the sampled tuples, the more likely it is to have a result. The authors used a Gaussian kernel with learnable parameters to control the dispersion of the probability density. A later study [25] extended the kernel density estimation approach to cardinality estimation over multi-table joins. In [26], the authors used the DeepDB model to capture the joint probability distribution of the data and its essential features, such as correlations across attributes and the distributions of individual attributes. An important difference between DeepDB and other approaches is that it supports direct updates; i.e., insertions, updates, and deletions in the underlying database can be absorbed by the model without retraining it. The neural optimizer (Neo), proposed in [27], relies on deep neural networks to generate query execution plans; it bootstraps its query optimization model from an existing optimizer and continues to learn from incoming queries, building on its successes and learning from its failures. The estimators Naru [28], NeuroCard [29], and MADE [30] approximate the data distribution of a table or joined tables with a deep autoregressive model, which factorizes the joint distribution into a product of per-column conditional distributions, each column being conditioned on the preceding columns.
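The autoregressive factorization behind Naru and NeuroCard writes the joint distribution as P(x1, …, xn) = Π_i P(x_i | x_1, …, x_{i-1}). A toy sketch using exact empirical frequencies illustrates the chain rule; the cited estimators instead approximate each conditional with a neural network:

```python
def chain_rule_probability(rows, query_row):
    """Estimate P(x1, ..., xn) = prod_i P(x_i | x_1..x_{i-1}) from counts.

    A toy stand-in for deep autoregressive estimators: each conditional
    here is an exact empirical frequency over the rows, whereas the
    learned models approximate the conditionals with neural networks.
    """
    p = 1.0
    for i in range(len(query_row)):
        prefix = tuple(query_row[:i])
        matching_prefix = [r for r in rows if tuple(r[:i]) == prefix]
        if not matching_prefix:
            return 0.0  # prefix never observed
        matching_full = [r for r in matching_prefix if r[i] == query_row[i]]
        p *= len(matching_full) / len(matching_prefix)
    return p
```

Multiplying the estimated probability by the table size turns this into a cardinality estimate for a point query.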
- Query-driven cardinality estimation: Our approach belongs to this category. Unlike the data-driven approach, the query-driven approach can be viewed as a regression problem that maps a query to a cardinality value, where the labels are the true cardinalities. The main difficulties of this approach lie in acquiring training data and representing the query statements. In [31], a “black-box” method was proposed to estimate query cardinalities: it groups queries into syntax families and learns the cardinality distribution of each group directly from points in a high-dimensional input space, without knowledge of the query execution plan or the data distribution. Kipf et al. [7] proposed the multi-set convolutional neural network (MSCN), which decomposes a query into three parts (table names, joins, and predicates) and processes these inputs with separate neural networks. The MSCN can model joins and predicates over multiple tables and capture data correlations; however, it cannot handle complex query statements or string types in predicates. In [9], the authors proposed an end-to-end learned model that processes complex SQL queries and estimates both their cardinality and cost. That model relies on the DBMS optimizer to generate a physical execution plan for each query; however, obtaining the physical plan is difficult, and the model is comparatively complex.
- Hybrid cardinality estimation: In addition, some current research has attempted to capitalize on the strengths of both types of methods by proposing hybrid approaches. The authors of [32] proposed a deep autoregressive model (UAE) to bridge the gap between query-driven and data-driven approaches, using the data as unsupervised information and the query workload as supervised information for cardinality estimation. Dutt et al. [33] combined a learning-based approach with histogram-based cardinality estimation by feeding the histogram estimates to the model as additional input features, and they proposed tree-based ensemble models for cardinality estimation. However, neither hybrid estimator can be trained directly from the data, and both increase the cost of the model.
3. Cardinality Estimator
3.1. Cold Start Problem
3.2. Representation of SQL Query
3.2.1. Table Information Encoding
3.2.2. Predicate Information Encoding
3.2.3. Construction of SQL Tree
Algorithm 1 The process of pruning tree construction
Input: the postfix form of the logical expression pfix; an empty stack stack; the value of a node value; the child nodes children; the number of nodes in the subtree num_children; the parent node parent
Output: the pruning tree
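Algorithm 1's stack-based construction can be sketched as follows. The node fields mirror the algorithm's inputs (value, children, num_children, parent), while the operator set and the omission of the pruning step are assumptions:

```python
class Node:
    def __init__(self, value):
        self.value = value       # operator or predicate string
        self.children = []
        self.parent = None

    @property
    def num_children(self):
        """Number of nodes in the subtree rooted here, excluding this node."""
        return len(self.children) + sum(c.num_children for c in self.children)

def build_tree_from_postfix(pfix, operators=("AND", "OR")):
    """Build an expression tree from the postfix form of a logical expression.

    Operands (predicates) are pushed onto the stack; each operator pops
    its two children and pushes the combined subtree back.
    """
    stack = []
    for token in pfix:
        node = Node(token)
        if token in operators:
            right, left = stack.pop(), stack.pop()
            node.children = [left, right]
            left.parent = right.parent = node
        stack.append(node)
    return stack.pop()  # the root of the (pre-pruning) tree
```

For example, the "WHERE" clause `(a>1 AND b<2) OR c=3` in postfix form becomes `["a>1", "b<2", "AND", "c=3", "OR"]`, yielding a tree rooted at `OR`.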
3.3. Model Design
3.3.1. Learning Table Information
3.3.2. Learning Predicate Information
3.3.3. TreeLSTM Design
3.4. Loss Function
4. Evaluation
4.1. Datasets
- The IMDB dataset [3]. This dataset comes from the Internet Movie Database (IMDB), one of the largest movie and television databases. It contains 21 tables, and joining all of the tables in a certain order yields a data volume of 2.8 × . Many researchers therefore use this dataset, or several of its tables, to study multi-table workloads; we chose six tables for our experiments.
- The DMV dataset [12]. The DMV dataset is a large single-table dataset storing car, snowmobile, and boat registration information. It includes the personal information of vehicle and boat owners, basic information about the vehicles and boats, and registration times. The dataset has a data volume of 1.25 × , and we selected some of its columns for our experiments.
4.2. Evaluation Metrics
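For reference, the three metrics reported in the result tables can be computed as below; the exact SMAPE normalization used in the paper is not reproduced here, so the form shown (bounded by 1) is an assumption:

```python
def q_error(pred, true):
    """Q-error: max(pred/true, true/pred); equals 1 for a perfect estimate."""
    return max(pred / true, true / pred)

def mae(preds, trues):
    """Mean absolute error over a workload of queries."""
    return sum(abs(p - t) for p, t in zip(preds, trues)) / len(preds)

def smape(preds, trues):
    """Symmetric mean absolute percentage error.

    The normalization (|p| + |t| vs. their mean) varies across papers;
    this form is bounded by 1 and is an assumption.
    """
    return sum(abs(p - t) / (abs(p) + abs(t))
               for p, t in zip(preds, trues)) / len(preds)
```

The Q-error captures relative error (a 2x over- or underestimate both give 2), while MAE and SMAPE separate errors on large cardinalities from errors on small ones.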
4.3. Comparison Methods and Experimental Environment
4.4. Hyperparameter Tuning
4.5. Model Complexity
4.6. Estimation Quality
4.7. Prediction Time Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Negi, P.; Marcus, R.; Mao, H.; Tatbul, N.; Kraska, T.; Alizadeh, M. Cost-Guided Cardinality Estimation: Focus Where it Matters. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering Workshops (ICDEW), Dallas, TX, USA, 20–24 April 2020; pp. 154–157.
2. Harmouch, H.; Naumann, F. Cardinality Estimation: An Experimental Survey. Proc. VLDB Endow. 2017, 11, 499–512.
3. Leis, V.; Gubichev, A.; Mirchev, A.; Boncz, P.A.; Kemper, A.; Neumann, T. How Good Are Query Optimizers, Really? Proc. VLDB Endow. 2015, 9, 204–215.
4. Leis, V.; Radke, B.; Gubichev, A.; Mirchev, A.; Boncz, P.A.; Kemper, A.; Neumann, T. Query optimization through the looking glass, and what we found running the Join Order Benchmark. VLDB J. 2018, 27, 643–668.
5. Perron, M.; Shang, Z.; Kraska, T.; Stonebraker, M. How I Learned to Stop Worrying and Love Re-optimization. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1758–1761.
6. Liu, H.; Xu, M.; Yu, Z.; Corvinelli, V.; Zuzarte, C. Cardinality estimation using neural networks. In Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, ON, Canada, 1–4 November 2015.
7. Kipf, A.; Kipf, T.; Radke, B.; Leis, V.; Boncz, P.A.; Kemper, A. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. arXiv 2018, arXiv:1809.00677.
8. Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Houston, TX, USA, 10 June 2018.
9. Sun, J.; Li, G. An End-to-End Learning-based Cost Estimator. arXiv 2019, arXiv:1906.02560.
10. He, Z.; Yu, J.; Gu, T.; Li, Z.; Du, X.; Li, P. Query cost estimation in graph databases via emphasizing query dependencies by using a neural reasoning network. Concurr. Comput. Pract. Exp. 2023, e7817.
11. Wang, T.; Chan, C.Y. Improved Correlated Sampling for Join Size Estimation. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 325–336.
12. State of New York. Available online: https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations (accessed on 26 March 2022).
13. To, H.; Chiang, K.; Shahabi, C. Entropy-based histograms for selectivity estimation. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013.
14. Zeng, X.; Xudong, L.; Caiyan, P.; Cao, J. Cardinality Estimation Based on Cluster Analysis. J. Inf. Sci. Eng. 2017, 35, 201–222.
15. Li, H.; Cui, J.; Meng, X.; Ma, J. IHP: Improving the utility in differential private histogram publication. Distrib. Parallel Databases 2019, 37, 721–750.
16. Reviriego, P.; Martínez, J.A.; Rottenstreich, O.; Liu, S.; Lombardi, F. Remove Minimum (RM): An Error-Tolerant Scheme for Cardinality Estimate by HyperLogLog. IEEE Trans. Dependable Secur. Comput. 2022, 19, 966–977.
17. Kipf, A.; Vorona, D.; Müller, J.; Kipf, T.; Radke, B.; Leis, V.; Boncz, P.A.; Neumann, T.; Kemper, A. Estimating Cardinalities with Deep Sketches. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019.
18. Tschorsch, F.; Scheuermann, B. An algorithm for privacy-preserving distributed user statistics. Comput. Netw. 2013, 57, 2775–2787.
19. Izenov, Y.; Datta, A.; Rusu, F.; Shin, J.H. Online Sketch-based Query Optimization. arXiv 2021, arXiv:2102.02440.
20. Wu, W.; Naughton, J.F.; Singh, H. Sampling-Based Query Re-Optimization. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016.
21. Leis, V.; Radke, B.; Gubichev, A.; Kemper, A.; Neumann, T. Cardinality Estimation Done Right: Index-Based Join Sampling. In Proceedings of the Conference on Innovative Data Systems Research, Chaminade, CA, USA, 8–11 January 2017.
22. Lan, H.; Bao, Z.; Peng, Y. A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration. Data Sci. Eng. 2021, 6, 86–101.
23. Sun, J.; Zhang, J.; Sun, Z.; Li, G.; Tang, N. Learned Cardinality Estimation: A Design Space Exploration and A Comparative Evaluation. Proc. VLDB Endow. 2021, 15, 85–97.
24. Heimel, M.; Kiefer, M.; Markl, V. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, VIC, Australia, 31 May–4 June 2015.
25. Kiefer, M.; Heimel, M.; Breß, S.; Markl, V. Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models. Proc. VLDB Endow. 2017, 10, 2085–2096.
26. Hilprecht, B.; Schmidt, A.; Kulessa, M.; Molina, A.; Kersting, K.; Binnig, C. DeepDB: Learn from Data, not from Queries! arXiv 2019, arXiv:1909.00607.
27. Marcus, R.; Negi, P.; Mao, H.; Zhang, C.; Alizadeh, M.; Kraska, T.; Papaemmanouil, O.; Tatbul, N. Neo: A Learned Query Optimizer. arXiv 2019, arXiv:1904.03711.
28. Yang, Z.; Liang, E.; Kamsetty, A.; Wu, C.; Duan, Y.; Chen, P.; Abbeel, P.; Hellerstein, J.M.; Krishnan, S.; Stoica, I. Deep Unsupervised Cardinality Estimation. arXiv 2019, arXiv:1905.04278.
29. Yang, Z.; Kamsetty, A.; Luan, S.; Liang, E.; Duan, Y.; Chen, X.; Stoica, I. NeuroCard: One Cardinality Estimator for All Tables. Proc. VLDB Endow. 2020, 14, 61–73.
30. Hasan, S.; Thirumuruganathan, S.; Augustine, J.; Koudas, N.; Das, G. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020.
31. Malik, T.; Burns, R.C.; Chawla, N. A Black-Box Approach to Query Cardinality Estimation. In Proceedings of the Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, 12–15 January 2007.
32. Wu, P.; Cong, G. A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. In Proceedings of the 2021 International Conference on Management of Data, Online, 20–25 June 2021.
33. Dutt, A.; Wang, C.; Nazi, A.; Kandula, S.; Narasayya, V.R.; Chaudhuri, S. Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 2019, 12, 1044–1057.
34. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. arXiv 2013, arXiv:1310.4546.
35. Karamyan, D.S. Cardinality Estimation of an SQL Query Using Recursive Neural Networks. Math. Probl. Comput. Sci. 2020, 54, 41–52.
36. Woltmann, L.; Hartmann, C.; Thiele, M.; Habich, D.; Lehner, W. Cardinality estimation with local deep learning models. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Amsterdam, The Netherlands, 5 July 2019.
37. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Keysers, D.; Uszkoreit, J.; Lucic, M.; et al. MLP-Mixer: An all-MLP Architecture for Vision. In Proceedings of the Neural Information Processing Systems, Online, 6–14 December 2021.
38. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. arXiv 2015, arXiv:1503.00075.
39. Moerkotte, G.; Neumann, T.; Steidl, G. Preventing Bad Plans by Bounding the Impact of Cardinality Estimation Errors. Proc. VLDB Endow. 2009, 2, 982–993.
40. Zhao, K.; Yu, J.X.; He, Z.; Zhang, H. Uncertainty-aware Cardinality Estimation by Neural Network Gaussian Process. arXiv 2021, arXiv:2107.08706.
41. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
| Category | Method | Advantage | Disadvantage |
|---|---|---|---|
| Traditional cardinality estimation methods | Summary-based | Simple to implement and widely used | Based on independence assumptions, with large errors |
| | Sampling-based | More accurate estimation; connection-oriented estimation for join queries | Requires extra space to store sampled data; has the 0-tuple problem |
| Learning-based cardinality estimation methods | Query-driven | The most accurate and fastest cardinality estimation; excellent performance with multiple tables | Requires extensive training; insufficient generalization ability |
| | Data-driven | Accurate and relatively stable estimates on a single table | Requires extensive training; insufficient generalization ability |
| | Hybrid | Performs well on a single table and estimates accurately | Cannot be trained directly on the data, increasing the cost of the model |
| Epochs | Batch Size | Hidden Units | Learning Rate | |
|---|---|---|---|---|
| 100 | 32 | 32 | 0.0001 | 0.96 |
| 200 | 64 | 64 | / | 0.97 |
| / | 128 | 128 | / | 0.98 |
| / | 256 | 256 | / | 0.99 |
| / | / | / | / | 1.00 |
| Method | Memory | Time | Parameters |
|---|---|---|---|
| MSCN | 2.6 MB | 0.48 ms | 683,521 |
| VSCNN | 5.2 MB | 0.52 ms | 1,379,841 |
| TreeLSTM (ours) | 1.6 MB | 0.63 ms | 409,106 |
| Method | Q-Error | MAE | SMAPE |
|---|---|---|---|
| PostgreSQL | 2 × | / | / |
| MySQL | 2 × | / | / |
| VSCNN | 3.12 | 2291 | 0.59 |
| MSCN (string) | 1.94 | 2176 | 0.31 |
| NoSamplingTreeLSTM | 10.37 | 5387 | 0.89 |
| NoSamplingTreeLSTM | 2.38 | 1959 | 0.33 |
| TreeLSTM (ours) | 1.63 | 1109 | 0.19 |
| Method | Q-Error | MAE | SMAPE |
|---|---|---|---|
| PostgreSQL | 1 × | / | / |
| MySQL | 1 × | / | / |
| TreeRNN | 1.71 | 1526 | 0.40 |
| VSCNN | 3.01 | 2593 | 0.55 |
| MSCN (string) | 2.96 | 2418 | 0.43 |
| NoSamplingTreeLSTM | 8.90 | 4374 | 0.86 |
| NoSamplingTreeLSTM | 2.36 | 2106 | 0.33 |
| TreeLSTM (ours) | 1.49 | 923 | 0.36 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Qi, K.; Yu, J.; He, Z. A Cardinality Estimator in Complex Database Systems Based on TreeLSTM. Sensors 2023, 23, 7364. https://doi.org/10.3390/s23177364