Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery
Abstract
1. Introduction
- (1) It investigates and defines a method for representing joinability by combining static and dynamic similarity.
- (2) It uses changes recorded in database logs to study joinability between tables and shows how database operation logs can support dynamic similarity analysis.
- (3) It proposes a data-partitioning method based on time slices. Stacking the time slices yields a co-occurrence matrix of the tables for specific operations. Using this co-occurrence matrix as a parameter, together with the static and dynamic similarity, we can extract collections of joinable tables.
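The time-slice idea in the last contribution can be illustrated with a small sketch. This is not the authors' implementation: the log format (a timestamp plus a table name), the slice width, the normalization by the number of slices, and the weighted-sum combination with the weight `alpha` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def slice_logs(log_entries, slice_seconds):
    """Group (timestamp, table) log entries into the set of tables touched per time slice."""
    slices = defaultdict(set)
    for timestamp, table in log_entries:
        slices[int(timestamp // slice_seconds)].add(table)
    return slices


def cooccurrence(slices):
    """Stack the slices: for each table pair, count the slices in which both appear."""
    counts = defaultdict(int)
    for tables in slices.values():
        for a, b in combinations(sorted(tables), 2):
            counts[(a, b)] += 1
    return counts


def joinability(static_sim, counts, n_slices, alpha=0.5):
    """Hypothetical combination: weighted sum of static similarity and co-occurrence rate."""
    scores = {}
    for pair, sim in static_sim.items():
        dynamic = counts.get(pair, 0) / n_slices if n_slices else 0.0
        scores[pair] = alpha * sim + (1 - alpha) * dynamic
    return scores


# Hypothetical change-log entries: (seconds since start, table written to).
logs = [(1, "orders"), (2, "order_items"), (14, "orders"),
        (15, "order_items"), (16, "users"), (31, "users")]
slices = slice_logs(logs, slice_seconds=10)
counts = cooccurrence(slices)

# Assumed static similarities, keyed by alphabetically sorted table pairs.
static_sim = {("order_items", "orders"): 0.8, ("orders", "users"): 0.8}
scores = joinability(static_sim, counts, n_slices=len(slices))
```

With equal static similarity, the pair that is modified together in more slices (`order_items` and `orders` here) receives the higher combined score, which is the intuition behind adding a dynamic signal.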
2. Methodology
2.1. Preliminaries
2.2. Dynamic Data-Driven Time-Slicing LSH Method
2.2.1. Constructing the Static Similarity
- (1) Defining the Minwise Function
- (2) Defining the LSH Function
- (3) Calculating LSH Tables
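The three steps listed above match the usual MinHash-plus-banding recipe for LSH. As a rough sketch only (not the paper's code; the seeded BLAKE2 hash, the `num_perm` and `bands` parameters, and all function and column names are assumptions for illustration):

```python
import hashlib
from collections import defaultdict


def _hash(value, seed):
    """Deterministic 64-bit hash of a value under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(values, num_perm=64):
    """Step 1: one minwise hash per seed; the signature is the vector of minima."""
    return [min(_hash(v, seed) for v in values) for seed in range(num_perm)]


def lsh_keys(signature, bands=16):
    """Step 2: split the signature into bands; each band is one LSH bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def build_lsh_tables(columns, num_perm=64, bands=16):
    """Step 3: insert every column into the bucket of each of its bands."""
    tables = defaultdict(set)
    for name, values in columns.items():
        for key in lsh_keys(minhash_signature(values, num_perm), bands):
            tables[key].add(name)
    return tables


def candidates(tables, query_values, num_perm=64, bands=16):
    """Return the columns that share at least one bucket with the query column."""
    found = set()
    for key in lsh_keys(minhash_signature(query_values, num_perm), bands):
        found |= tables.get(key, set())
    return found


# Hypothetical columns: an integer key column and an unrelated string column.
columns = {
    "users.id": set(range(100)),
    "products.sku": {f"sku-{i}" for i in range(100)},
}
lsh_tables = build_lsh_tables(columns)
matches = candidates(lsh_tables, set(range(100)))
```

A query column with the same value set as `users.id` produces the same signature and therefore lands in the same buckets, while a disjoint column such as `products.sku` would only collide through an accidental hash collision. More bands with fewer rows each make the index more permissive; fewer bands with more rows make it stricter.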
2.2.2. Dynamic Data Acquisition
2.2.3. Time Slice and Dynamic Similarity
2.2.4. Joinability Construction
3. Experiments
3.1. Data Sets
3.2. Evaluation Metrics
3.3. Performance Comparison
3.4. Investigation of Parameter Sensitivity
3.4.1. Threshold Experiment
3.4.2. Experiment
3.4.3. Time Slice Statistics Experiment
3.4.4. Time Slice Quantity Experiment
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Symbol | Description |
---|---|
Data collection tables in the database. | |
; collection of data in column j of table B. | |
. | |
A distance function used to determine similarity. | |
The similarity value. | |
A query column. | |
A target column in the repository. | |
The similarity threshold. | |
The relevance calculation function. | |
The minwise hash (MinHash) function. | |
The hash function. | |
A collection of logs in time period t. | |
A collection of field names in the log within time period t; attribute f in table r. | |
A data collection of all fields. | |
The group of hash functions. | |
The dynamic similarity between columns i and j. | |
The number of co-occurrences within the time slice between tables A and B; probability of co-occurrence. |
Join Key | Type | Length | Description |
---|---|---|---|
Primary Key | INT | Default, 11 digits | An auto-increment scheme is typically used, usually starting at 0. This results in higher similarity between columns. |
Primary Key | BIGINT | 20 digits and above | A snowflake algorithm generates a unique value; there are no duplicates. |
Business Code | BIGINT | 20 digits and above | Custom codes are unique within a business, such as prefix + date + serial number. |
Business Code | VARCHAR | 10 characters and above | Custom codes are unique within a business, such as prefix + date + serial number. |
Business Code | CHAR | 10 characters and above | Custom codes are unique within a business, such as prefix + date + serial number. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, W.; Zhu, C.; Yan, H. Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery. Electronics 2024, 13, 3920. https://doi.org/10.3390/electronics13193920