Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery

Wang, Weiwei; Zhu, Chunxiang; Yan, Han

doi:10.3390/electronics13193920

Open AccessArticle

Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery

by

Weiwei Wang

^*

,

Chunxiang Zhu

and

Han Yan

College of Engineering Training Centre, China Jiliang University, Hangzhou 310018, China

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(19), 3920; https://doi.org/10.3390/electronics13193920

Submission received: 5 September 2024 / Revised: 26 September 2024 / Accepted: 2 October 2024 / Published: 3 October 2024

Download

Browse Figures

Versions Notes

Abstract

:

In legacy industrial systems, discovering joinable information between database tables is important for applications such as data integration and data analysis. Locality-Sensitive Hashing-based methods have been proven to be capable of handling chaotic and diverse table relationships, but these methods often rely on an incorrect assumption—that the similarity of table columns in the database directly reflects their joinability, causing problems related to the accuracy of their results. To solve this problem, this study proposes a dynamic data-driven time-slicing Locality-Sensitive Hashing method for joinable table discovery. This method introduces database log information and within different time slices, uses the co-occurrence matrix of data tables to determine their joinability. Specifically, it first performs a MinHash dimensionality reduction on database columns and then uses Locality-Sensitive Hashing to calculate the static similarity. Next, it identifies business modular time slices through database logs, calculates the dynamic similarity of the slice time data, and builds a co-occurrence matrix between tables. Finally, the joinability between data tables is calculated using the static similarity, dynamic similarity, and co-occurrence matrix. The experimental results demonstrate that this method effectively excludes tables that only have similarity but no business relationship for data cleaning, and its accuracy exceeds that of methods that only depend on similarity.

Keywords:

table discovery; data integration; data cleaning; time slice; database log

1. Introduction

The construction of factories has become the main driving force triggering growth in the digital economy, opening a new era of data-driven production and enterprise transformation. As a key element of production, the value of data is becoming increasingly apparent in this era. To fully utilize the core role of data, we need to organize and clean the information contained in the databases of various industrial systems, especially when considering the manufacturing sector, due to the complexity of manufacturing processes and technology. The relationships between data tables are very complex, with the number of tables far exceeding the number of ordinary systems, and it is necessary to join dozens of table fields to query and sort the required results. In the field of data element analysis, joining two or more database tables is a fundamental and necessary task. Database table joining is not only widely used in the fields of database management and business intelligence but is also commonly applied in the field of machine learning [1,2]. Finding tables that can be joined is essentially an inverse analysis process of the current database design; that is, a type of data schema analysis. At present, data schema analysis approaches can be divided into two types: schema-level and instance-level methods. Schema-level methods focus on the structure and design of the database; however, for small- and medium-sized manufacturing enterprises, information about the design of many legacy system databases is often difficult to obtain. Meanwhile, instance-level analysis methods focus on the actual data in databases. Existing instance-level association rule mining algorithms include Apriori [3] and FP-Growth [4], among others. Most research has focused on mining association rules; however, in the cleaning of data elements, more attention is paid to whether database columns can be subjected to joinable query operations to obtain detailed information about a particular business.

Existing column relationship schema-matching techniques mainly use simple statistics- and string-based comparisons, such as prefix/suffix detection, edit distance, value range, minimum/maximum similarity, and mutual information measurements based on the q-gram distribution [5,6,7,8]. Set similarity search is another version of this problem [9], which has been widely used in various applications such as document searches, keyword querying, entity recognition, and data cleaning [10]. As examples of relevant research, Zhu et al. [11] proposed the Intersection Estimation algorithm to find joinable tables; Zhang et al. [12] clustered data columns to discover joinable tables; Esmailoghli et al. [13] aggregated row values into hash values to construct a super key, facilitating the search for joinable tables across multiple columns; Dong et al. [14] used a pre-trained model to convert text attributes into high-dimensional vectors, thereby semantically connecting records through similarity predicates to obtain more accurate results; Grace et al. [15] proposed the Starmie framework, which uses a pre-training strategy to capture contextual semantic information in a table and uses the cosine similarity between column vectors as the join degree score; Santos [16] added binary relations to tables to improve the accuracy of joinable table retrieval using a knowledge base approach; and Ibraheem [17] outlined various popular methods for finding joinable tables in data lakes and their associated performance.

Regrettably, in the existing literature, similarity and joinability are often regarded as the same concept. Similar columns can be joined, and joined columns carry a certain degree of similarity, but similarity does not necessarily denote business relevance; this is because similarity tends to be analyzed from a data perspective, whereas joinability tends to be analyzed from a business perspective. Therefore, joinability can sometimes be meaningless. For instance, while the key attributes of a sales order table and an attendance information table may meet certain similarity requirements, they will not be joined during data cleaning. In this study, we found that the tables representing a user’s business activities are modular and correlated in a given time period. Therefore, we propose a dynamic data-driven time-slicing Locality-Sensitive Hashing (LSH) method (DTL) for joinable table discovery. This method considers both static and dynamic similarity, combined with a co-occurrence matrix (i.e., the number of times that the tables co-appear in the time segment) in order to determine joinability. When calculating the static similarity, all the data in the database are used, MinHash [18] is applied to reduce the dimensionality, and LSH [19] is used for calculation. Dynamic similarity is achieved through analyzing database logs and constructing time slices with business modularity; then, in these modularized data, the similarity between tables is calculated, allowing for a co-occurrence matrix to be constructed on this basis. This method can exclude tables that have high similarity but are irrelevant to the business. The proposed method was tested on a real enterprise data set, and the experimental results demonstrate that its accuracy in obtaining joined tables surpasses that of methods that rely solely on similarity.

The main contributions of this study include the following:

(1): It investigates and defines a method for representing joinability through combining static and dynamic similarity.
(2): It innovatively uses changes in database logs to study the joinability information between tables and reveals how to use database operation logs for dynamic similarity analysis.
(3): It proposes a data-partitioning method based on time slices. By stacking different time slices, we can obtain the co-occurrence matrix of each table regarding specific operations. Using this co-occurrence matrix as a parameter in combination with the static and dynamic similarity, we can extract collections of joined tables.

2. Methodology

2.1. Preliminaries

In this subsection, the basic methods adopted in this study are introduced, including MinHash, LSH, and the use of database logs. Table 1 summarizes the notation frequently used in this study.

In this study, attributes and columns are considered equivalent, representing the collection of data in a table. Suppose we have a given table

A

. The i^th column of

A

is labeled

A_{i}

, and the value in the i^th row and j^th column of

A

is denoted by

a_{i j}

. Then, table

A

can be denoted as

A = [A_{1}, A_{2} \dots A_{n}] = [\begin{matrix} a_{11} & a_{12} & \dots & a_{1 n} \\ a_{21} & a_{22} & \dots & a_{2 n} \\ \dots & \dots & \dots & \dots \\ a_{m 1} & a_{m 2} & \dots & a_{m n} \end{matrix}]

(1)

The similarity between the attribute domains

A_{i}

and

B_{j}

is denoted as

ρ

, which is calculated as follows:

ρ = d (A_{i}, B_{j}) = \frac{A_{i} \cap B_{j}}{A_{i} \cup B_{j}}

(2)

Given a similarity calculation method and a threshold

τ

, the attributes

A_{i}

and

B_{j}

are considered a match if (and only if)

ρ \geq τ

; otherwise, they are considered a mismatch. The matching degree

ρ

measures the similarity between the two columns of data.

Similarity is generally considered as the basis of joinability. In a simple method for calculating column similarity, for the data in attribute

Q

, the similarity distance with respect to all elements in

R

is calculated in turn. When the database contains

n

columns of attributes, the time complexity of this method is

O (n^{2})

. However, when the number of attributes reaches a certain order of magnitude (e.g., over 10,000), and the number of rows in each column set exceeds 1 million, the time consumption of this brute force calculation method will be very large, making it unfeasible.

In the literature [18], a method for calculating the Jaccard similarity between sets of any size was proposed. In this technique, a set of attributes is converted into a MinHash signature using a set of Minwise hash functions. Suppose

h_{m i n}

is one such hash function; then, the minimum hash values for domains

A_{i}

and

B_{j}

are

h_{m i n} (A_{i})

and

h_{m i n} (B_{j})

. The probability

ρ

of two minimum hash values being equal is considered as the similarity between

A_{i}

and

B_{j}

[18]:

ρ = p (h_{\min} (A_{i}) = h_{\min} (B_{j})) = d (A_{i}, B_{j})

(3)

LSH is a solution for the nearest neighbor search problem in high-dimensional space [19], which can also be applied to approximate R-nearest neighbor searches [20]. In LSH, if two objects are similar, their distances after projection using the distance-sensitive hash function will be very close; meanwhile, if they are dissimilar, their distances will be relatively far. William et al. [21] proposed the Multi-Probe LSH method; its core concept is to check all hash buckets that may contain neighboring points of the query point according to a specific probe sequence; that is, if an object is close to the query object but has not been projected into the same hash bucket, it might be projected into a bucket nearby. This method can effectively reduce memory and time overheads by reducing the use of hash tables and expanding the query range, and it can also accurately obtain the candidate set.

When a program runs normally, the database continuously generates logs, recording all operations in detail. Prior studies [22,23,24] have outlined the development trends and attention in this field of log analysis. For example, system administrators can obtain a better understanding of the health and operating status of the program [25] by regularly checking the log data. When a user performs a business operation on the system page, it may trigger changes in the data of multiple tables. For example, adding a new sales order involves inter-related tables such as the order information table, the detail table, the inventory table, and the shipping table. If the operation is successful, the data changes in these tables will be sequentially recorded in the database log; therefore, once a table changes, joinable tables are likely to appear in its context.

As the log record data are closely related to the business information, combining this information with respect to the similarity of existing data allows us to more effectively determine the joinability of tables.

2.2. Dynamic Data-Driven Time-Slicing LSH Method

This subsection describes a novel method for calculating the joinability of tables based on dynamic data-driven time-slicing LSH (denoted as DTL). This method processes the table data in a database using MinHash and LSH, obtaining the static similarity; then, it analyzes the database logs, constructs a time slice with modularity (with respect to the business information) to calculate the dynamic similarity, and forms a co-occurrence matrix. The co-occurrence matrix is then combined with the static and dynamic similarity data to obtain effective joinability information. The framework of the proposed method is depicted in Figure 1.

Figure 1 indicates the relevant modules of this study. Initially, LSH hash tables are generated by traversing the column data to establish static similarity. Next, the database logs are analyzed to extract time slices, and the dynamic similarity information between fields and the co-occurrence matrix between tables is constructed. Finally, using the LSH hash table and the dynamic similarity information, attribute elements that satisfy the similarity conditions are extracted and added to the candidate set, and the joinable table of the candidate set with the element to be queried is computed.

2.2.1. Constructing the Static Similarity

Defining the Minwise Function

When defining several hash functions

h_{i}

, these functions can randomly change the order of the elements. A hash function can be defined as

h_{i} (x) = (a_{i} \cdot x + b_{i}) \mod p

(4)

where

a_{i}

and

b_{i}

are randomly chosen constants,

p

is a large prime number, and

x

is a data set of attributes. For each column

Q

in the database, the MinHash signature is the minimum value calculated by multiple hash functions. Using

k

hash functions, the signature can be expressed as

M i n H a s h_{Q} = [\min (h_{1} (Q)), \min (h_{2} (Q)), \dots, \min (h_{k} (Q))]

(5)

where

m i n (h_{i} (Q))

represents the minimum value calculated by the column through the hash function

h_{i}

; thus, the high-dimensional attribute set can be reduced to k dimensions.

2.: Defining the LSH Function

In this method,

L

hash functions

h_{1}, h_{2}, \dots, h_{L}

are chosen to form a hash function group

G (x) = (h_{1} (x), h_{2} (x), \dots, h_{L} (x))

. For each hash function group

G_{i} (x)

, a hash table is constructed; this process is then repeated

S

times to obtain

S

hash tables.

3.: Calculating LSH Tables

Due to the large number of tables, fields, and data in the database, loading them all into memory will put pressure on the server and the database. To solve this problem, this method dynamically concatenates SQL to read column data based on the database configuration information (including table name, field name, field type, and field length), reducing the relative load on the server and database. After obtaining the necessary information, we can use the above function to calculate the MinHash signature and create an LSH hash table. The similarity set of the object being queried can be obtained through the LSH hash table, and the resulting similarity value is the static similarity.

2.2.2. Dynamic Data Acquisition

In small- and medium-sized manufacturing industries, database management systems such as MSSQL, Oracle, MySQL, PostgreSQL, and MongoDB are widely used, which all provide features for logging changes in data. To improve the efficiency of capturing changes in database logs, the Change Data Capture (CDC) strategy [26,27,28] can be employed. The core idea of CDC is to track and capture changes in the database, including insertions, updates, and deletions of data or tables, and then record these changes in the order they occurred in a complete manner. For subsequent analysis needs, the captured log data are processed into a specific format and stored in a high-throughput and durable Kafka [29] message queue, making it convenient for applications to consume and analyze data at different time intervals, as shown in Figure 2.

In Kafka, log information is unitized by row. Each row of the record corresponds to a data change in the table, and its contents include the change in table name, the field name, the field value, and the type of operation. It is read in time slices as the calculation unit, where a time slice is

t

seconds. If

R_{t}

represents the log set within time

t

, then

R_{t} = {R_{t 1}, R_{t 2}, \dots, R_{t m}}

, where

R_{t i}

represents the i^th log record in time slice

t

. Each

R_{t i}

can be represented by four-tuple

(o_{t i}, r_{t i}, f_{t i}, v_{t i})

, where

o_{t i}

represents the operation type, including addition, modification, and deletion;

r_{t i}

represents the table name;

f_{t i}

is the field information in the table; and

v_{t i}

is the list of field values. The latter is expressed as

v = {o l d, n e w}

; for the addition and deletion operations, the value of

o l d

is empty; meanwhile, for modification operations,

o l d

is the value before the modification and

n e w

is the value after the update.

Let

F i e l d = \cup {(r_{t i}, f_{t i}) | (o_{t i}, r_{t i}, f_{t i}, v_{t i}) \in R_{t}}

denote the union of combinations of table names and field names within time slice

t

.

D a t a (F i e l d)

represents the data set of all fields, and

D a t a ({F i e l d}_{r f})

represents the data set of field

f

in table

r

. The initial condition is

D a t a (F i e l d) = \emptyset

.

For the addition operation, the data in the corresponding fields are added to the data set. In the deletion operation, before they are deleted, these data are similar to some of the columns. Therefore, for the deleted data, we also perform the following addition operation:

D a t a (F i e l d_{r f}) = D a t a (F i e l d_{r f}) \cup {v_{t i} | (o_{t i}, r_{t i}, f_{t i}, v_{t i}) \in R_{t} \land r = r_{t i} \land f = f_{t i})}

(6)

For the editing operation, we record the data before and after the edit. Regarding joinability, if the data were joinable before, then the data after the edit are also joinable. Therefore, we add both the data before and after the edit to the collection, as follows:

\begin{array}{l} D a t a (F i e l d_{r f}) = D a t a (F i e l d_{r f}) \cup {v_{t i} (o l d) \cup v_{t i} (n e w) | \\ (o_{t i}, r_{t i}, f_{t i}, v_{t i}) \in R_{t} \land r = r_{t i} \land f = f_{t i})} \end{array}

(7)

In this way, we can obtain all dynamic attribute data within time slice

t

.

2.2.3. Time Slice and Dynamic Similarity

Choosing the appropriate time slice is key to ensuring the accuracy of the obtained dynamic similarity value. First, the optimal splitting point must be found, which is the maximum value for the time interval between two log records in the Kafka message queue after the end of the last read. However, this interval needs to be within a certain pre-defined threshold range of the last read time. Therefore, the minimum time length

I_{m i n}

and the maximum time length

I_{m a x}

of the read slice are taken as constraints for timestamp filtering, in order to prevent the time slice from being too large or small. Let

K

be the timestamp of the last log read,

t h r e s h o l d

be the minimum value of the time interval, and

T

be the sequence of log timestamps; then,

T

can be expressed as

T = [t_{1}, t_{2}, \dots, t_{n}]

, where

t_{i}

is the timestamp of the i^th log. First, we need to filter out the data information, denoted as

T^{*}

, with timestamps after

K

and in the range between

I_{m i n}

and

I_{m a x}

, as follows:

T^{*} = [t_{i} | t_{i} > K \land t_{i} - K \geq I_{\min} \land t_{i} - K \leq I_{\max}]

(8)

Then, calculate the time interval between adjacent records in the data set

T^{*}

and obtain the maximum value:

Δ T = [Δ t_{i} = t_{i + 1} - t_{i} | t_{i}, t_{i + 1} \in T^{*}]

(9)

Δ t_{\max} = \max (Δ T)

(10)

Δ t = \{\begin{cases} Δ t_{\max}, Δ t_{\max} \geq t h r e s h o l d \\ I_{\max}, Δ t_{\max} < t h r e s h o l d \end{cases}

(11)

To determine the optimal module, the value of the

t h r e s h o l d

parameter must be set, which has a significant impact on the correlation results. In extreme cases, if the

∆ t_{m a x}

interval is very small and the current user is constantly operating, the log time may not distinguish the business modules as they are operating, so

I_{m a x}

will be directly used as the slice time. Conversely, if the

t h r e s h o l d

value is set too large, it may lead to slices that have a missed modularity, thus affecting the quality of time slicing and the accuracy of the results.

Through this method, we obtain the optimal timing point, time slice, and finally, the dynamic data information

R_{Δ t} = {R_{t 1}, R_{t 2}, \dots, R_{t m}}

within this time slice. According to the above method, we can calculate the dynamic data set

D a t a (F i e l d)

. We then use the MinHash signature and build an LSH hash table, using the Multi-Probe LSH method for similarity retrieval. A four-tuple

({f i e l d}_{i}, {f i e l d}_{j}, s, c o u n t)

for the similarity between two candidate sets is defined, where

{f i e l d}_{i}

represents the column name,

s

represents the similarity value between columns i and j, and

c o u n t

represents the cumulative number of time slices—that is, the number of times the table where the i^th column is located co-occurs, which is initially 0. Every time it occurs, an operation is performed to add 1 to

c o u n t

; as such, for the L^th time slice, it can be expressed as follows:

({f i e l d}_{i}, {f i e l d}_{j}, s_{L}, 1)

. For the existing time slices, the similarities

s

between the same columns can be superimposed, as follows:

s = \frac{s \cdot c o u n t + s_{L}}{c o u n t + 1}

(12)

count = c o u n t + 1

(13)

Using this method, after processing multiple time slices, we can obtain attribute matrix information that represents the dynamic similarity values.

2.2.4. Joinability Construction

When the data in a certain table change, joined tables often synchronize in the context of the log. This study uses a co-occurrence matrix to represent the number of times that tables appear together in a single time slice. If two tables frequently appear together in a time slice, their co-occurrence value and the corresponding probability of joinability are both larger. Assume that there are two data tables,

A

and

B

, which appear at the same time in time slice

T_{i}

; as the number of time slices increases, the number of times that they appear together can be expressed as

C_{A B} (T_{i}) = C_{A B} (T_{i - 1}) + 1

(14)

The value of co-occurrences

C_{A B}

is initially 0 and will be increased by 1 when two tables appear at the same time in the same time slice. The larger

C_{A B}

is, the more likely they are to belong to the same business module, and the stronger the join. If two columns can be joined with each other, they must have a certain similarity, so the attribute

A_{i}

is queried through the LSH hash table and dynamic similarity information to obtain the candidate set that meets the similarity conditions; then, an element

B_{j}

from it is selected, in turn, to obtain the static similarity between the attributes

A_{i}

and

B_{j}

for

d (A_{i}, B_{j})

and dynamic similarity

s (A_{i}, B_{j})

. In the data cleaning, the tables that can be joined have a business module, and there is similarity in the data. The joinable degree of attributes

A_{i}

and

B_{j}

can be expressed as:

l (A_{i}, B_{j}) = σ \cdot d (A_{i}, B_{j}) + (1 - σ) \cdot s (A_{i}, B_{j})

(15)

σ = \frac{\sum_{i} C_{A B}}{i}

(16)

where

σ

represents the probability of tables

A

and

B

appearing together in all time slices. As an example, when a user generates a sales order, it will not be related to the meeting information table; however, this does not rule out the similarity between the data in the order module and the data in the meeting record table. Because they do not appear together, at this time,

σ = 0

, and the dynamic similarity

s (A_{i}, B_{j})

is 0; thus,

l (A_{i}, B_{j})

is also 0. Although they are similar, they are not related. Through this method, we can exclude tables with no business relationships. If tables

A

and

B

have business relevance, when the time slice reaches a certain number,

σ \neq 0

. At this time,

s (A_{i}, B_{j})

tends to be similar to

d (A_{i}, B_{j})

, and their similarity can represent their joinability. The joinability calculation process is shown in Figure 3.

The joinability between tables

A

and

B

is the maximum joinability value between the two table columns:

l (A, B) = \max (l (A_{i}, B_{j}))

(17)

In this way, the elements in the candidate set are computed to obtain the set of correlatability tables.

3. Experiments

3.1. Data Sets

In this study, we used publicly available data sets that are representative of the similarity processing field. However, this research required complete data log information; thus, to more closely approach the actual situation, we chose to use the most commonly used, most complex enterprise resource planning (ERP) database in the telecom cable industry, which has been used in the production environment for many years. It contains 5574 tables and 95,370 columns, with a total size of about 3 GB; the column type details are shown in Figure 4.

Figure 4 shows that most fields use char, float, varchar, and datetime; during the cleaning process, the joinability between data tables is mainly determined according to primary keys or business codes. We directly filtered out non-joinability fields such as datetime, binary, and short char, resulting in 6081 fields that satisfied the joinability requirements. Consequently, 713 tables containing records remained, with an average of 68,000 items per table. Therefore, we only needed to calculate the joinability of the filtered table fields; the common types of joinable fields are shown in Table 2.

The experiment was carried out on a computer equipped with an AMD 7840HS CPU (8 cores) with 64 GB of DDR5 memory. The operating system for this machine is Windows 11 Professional Edition, and the method was implemented on Python 3.8, using libraries such as numpy, pandas, and datasketch for data processing and analysis.

3.2. Evaluation Metrics

This study focused on improving the accuracy of joined table discovery by eliminating tables that only show similarity but are irrelevant to the business during the data cleaning process; thus, precision and recall, which are widely used, were adopted as the measurement standards. The evaluation criteria in this study were determined during manual data cleaning. For a query column

Q

,

T P

denotes the set of tables with businesses joined with

Q

identified through tagging, and

F N

is the set of joinable tables calculated with the present method. Precision represents the proportion of correctly identified joinable tables out of all the predicted joinable tables, as follows:

p = \frac{T P \cap F N}{TP \cup F N}

(18)

Recall represents the proportion of correctly identified joinable tables out of the total set of annotated joinable tables and is defined as follows:

r = \frac{T P \cap F N}{T P}

(19)

3.3. Performance Comparison

To highlight the effectiveness of the proposed approach, we chose to compare it with the previously proposed techniques LSH Ensemble [30], Aurum [31], Josie [11], and LSH. LSH Ensemble (L32, with the number of shards set to 32) uses a transformation that allows MinHash LSH to be employed with a set containment similarity; then, the algorithm reads and computes the exact set intersection size for candidates to obtain the ranked final results. Aurum utilizes schema- and instance-level information to identify different relationships between the attributes of a data lake. Though performing a two-step dissection and indexing the data, we can establish a graph structure for keyword searches and/or determination of joinability. Josie is a powerful database associative table discovery framework that uses an intersection-estimated connection search to evaluate the joinability of tables and return the top k joinable tables as results. The original algorithm only supports a top k search, so we modified the algorithm to set the intersection-estimated connection value as a condition for ending the search, and the intersection-estimated value is the same as

τ

. At the same time, in order to better illustrate the perspective of this article, LSH is also selected as a comparison method.

For the experiment,

I_{m i n}

was set to 20 s,

I_{m a x}

to 5 min, and

t h r e s h o l d

to 10 s. A total of 16 tables were selected, for which we obtained the average precision of the compared methods under different values of the similarity threshold

τ

, as shown in Figure 5.

Figure 5 shows that when applied to data obtained from small- and medium-sized manufacturing industries, DTL was superior to Aurum, L32, and Josie in terms of joinability accuracy, with an average advantage of about 7% compared to the Josie framework. The methods of L32, Aurum, and DTL are based on the LSH index type. When the LSH method is used alone, its accuracy is very low because it can only represent static similarity, indicating that column similarity cannot not intuitively reflect the joinability between tables. The Aurum method achieves a relatively higher result than LSH alone because the relationship between the extracted data sets is used as a knowledge graph and is combined with database schema information. Although L32 divides the data set into different sections based on the distribution of column values, it does not incorporate time and cannot capture the shared contextual information of all tables within a specific timestamp, so its effectiveness is limited. Although Josie, as a powerful framework for discovering related tables, is insensitive to data distribution, its effectiveness still has some gaps compared with DTL. This also proves the feasibility and advantages of using static and dynamic similarity in finding joinable tables in small- and medium-sized manufacturing enterprises; the use of log data to construct business modules to remove similar but non-business joinable table information is the reason why the DTL method works well in small- and medium-sized manufacturing industries. Under the same conditions, the recall rate is shown in Figure 6.

Figure 6 indicates that the recall rate of the DTL was somewhat lower than those of Aurum, L32, and Josie; this lower recall rate is because these comparison methods solely rely on the similarity analysis of data dimensions, resulting in them obtaining higher recall than our method. In particular, the LSH method has a recall rate of over 90% because it completely regards similarity as joinable and covers many tables that have no business relationship, so its precision is very low. Additionally, this study employs a method that combines dynamic and static similarity in a modular manner. As a result, tables that are seldom manipulated do not appear in the log, and when

σ = 0

, their degree of joinability is perceived as zero, implying that there is no business joinability. This table information will then be filtered out.

This experiment showed that, although the recall of the proposed method was slightly lower than the values for the compared methods, as the table set

F N

covers truly business-related tables and filters out some irrelevant table sets,

T P

tends to be similar to

F N

; therefore, the higher precision of the proposed DTL method was more meaningful.

The time complexity of the processing phase includes the creation of the hash table and the analysis of the logs, whereas the time complexity of the query phase remains the same as Multi-Probe LSH.

3.4. Investigation of Parameter Sensitivity

This subsection describes the effects of different parameter values on the experimental results.

3.4.1. Threshold Experiment

The

t h r e s h o l d

value directly affects the size of the time slice. If it is too large, a situation where

Δ t_{m a x} < t h r e s h o l d

will occur, and the time slice will be adjusted to

I_{m a x}

. Experiments were conducted with different

t h r e s h o l d

values under the same parameters, and the results are shown in Figure 7.

The best results were achieved when

t h r e s h o l d

was in the range of 3 to 10 s. Considering the complexity of page operations in the industrial sector, the impact of

t h r e s h o l d

is relatively limited when the value is small; users need to perform page switching or button-clicking operations when an operation is completed and then trigger another operation, which takes a certain amount of time to complete. However, when the time interval exceeded 15 s, the detection effect significantly decreased. This decrease is due to an inability to effectively divide the time slices, resulting in the filtering out of reasonable business modules, thus affecting the frequency of the co-occurrence matrix and leading to decreased accuracy.

3.4.2. $I_{m a x}$ Experiment

I_{m a x}

can directly affect the structure of the business modules. We used minutes as the unit of measurement and analyzed the impact of this parameter on the results, as shown in Figure 8.

Figure 8 indicates that when

I_{m a x}

exceeded 1 min, the impact on the results was not significant. This is partly because, when acquiring time slices, there will be several slices that meet the required time within

I_{m a x}

. Only a few slices are taken as

I_{m a x}

, due to not finding an appropriate interval. When it was set to more than 10 min, a minor impact on the experimental results was observed, as all the data tables in the entire

I_{m a x}

tend to be considered as one module. At this point, as long as the columns are similar, they will be judged as being joinable.

3.4.3. Time Slice Statistics Experiment

Next, the proportion of time slices of different lengths was calculated based on the obtained slice information, as shown in Figure 9.

The area marked by two green lines represents the time slices from 8 to 44 s, which occupy 82.7% of the total time slices. The most common slice time was concentrated around 22 s, representing the time it takes for a user to perform a certain business operation. For operations that are logically complex, involve multiple tables, or have a large amount of data, more time is needed for business logic processing and database operations, which can lead to an increase in the length of the time slices.

3.4.4. Time Slice Quantity Experiment

Based on the characteristics of user behaviors, 22 s was selected as the average length of a time slice. A statistical unit was set to contain 1000 time slices, and by analyzing the user’s usage, this statistical unit can generally cover the number of logs for about one day. Nearly 120 statistical units were used for the analysis, as shown in Figure 10.

Figure 10 shows that the joinability results significantly depended on the number of time slices. Specifically, when the number of time slices was less than 30 units, it was difficult to obtain evaluation results; however, the accuracy only significantly improved when the number of time slices reaches a certain level. When reaching 80 units, the influence of the number of time slices on the results tended to become stable, as this period basically covers the common operations of users.

4. Conclusions

This paper addresses the problem of discovering joinable tables in the context of data cleaning in the manufacturing and related industries. A time-slicing LSH (DTL) method based on database logs was proposed, designed to improve the accuracy of the discovery of joinable tables. The proposed method is particularly effective in small and medium-sized manufacturing enterprises, in which it accurately identifies joinable tables and reduces the time spent finding joinable tables, thereby providing a more efficient strategy for data cleaning.

The improvement in accuracy achieved using this method is primarily due to the integration of dynamic and static similarity measures. Traditional approaches typically rely solely on static similarity, which overlooks the temporal dynamics of data. In contrast, our method captures the business modularity between data tables by extracting effective time slices from database logs. This approach represents the relationships between tables using a co-occurrence matrix, while also calculating dynamic similarity within each slice. By incorporating the co-occurrence matrix as a parameter and combining both dynamic and static similarity to determine joinability, our method facilitates a more effective discovery of joinable tables. The experimental results demonstrate that when dealing with complex ERP tables, our method improves the accuracy of identifying joinable tables by almost 7% compared to traditional frameworks, particularly in small and medium-sized enterprises in which table structures are complex and design practices are inconsistent.

However, this study has certain limitations. The proposed method relies heavily on the availability of a sufficient number of time slices. While all databases generate log files, these logs are often periodically purged due to their large size in practical applications, which may result in an insufficient number of time slices, thereby affecting the accuracy of the results. Future research could consider exploring the use of the time attribute in static data in conjunction with contextual information from other attributes to mine time shards with business modularity, further improving the efficiency and convenience of discoverable joinable tables.

Overall, this research provides a new idea and method for the discovery of joinable tables. The method significantly improves the accuracy of the results and reduces the time spent in finding joinable tables by combining business modularity and similarity as joinability, which has significance for improving the efficiency of data cleaning.

Author Contributions

Conceptualization, W.W. and C.Z.; methodology, W.W.; software, W.W. and H.Y.; validation, W.W., C.Z. and H.Y.; formal analysis, W.W.; investigation, C.Z.; resources, W.W.; data curation, H.Y.; writing—original draft preparation, W.W.; writing—review and editing, C.Z.; visualization, H.Y.; supervision, C.Z.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Department of Education Research Project (No. Y202351243).

Data Availability Statement

The source codes will be made publicly accessible via our website: https://github.com/wangwwwork/dtl (accessed on 1 October 2024). The data presented in this study are available upon request from the corresponding author due to containing real production and financial data information.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chepurko, N.; Marcus, R.; Zgraggen, E.; Fernandez, R.C.; Kraska, T.; Karger, D. ARDA: Automatic Relational Data Augmentation for Machine Learning. Proc. VLDB Endow. 2020, 13, 1373–1387. [Google Scholar] [CrossRef]
Dong, Y.; Oyamada, M. Table Enrichment System for Machine Learning. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 6 July 2022; pp. 3267–3271. [Google Scholar]
Zhao, Z.; Jian, Z.; Gaba, G.S.; Alroobaea, R.; Masud, M.; Rubaiee, S. An Improved Association Rule Mining Algorithm for Large Data. J. Intell. Syst. 2021, 30, 750–762. [Google Scholar] [CrossRef]
Jang, H.-J.; Yang, Y.; Park, J.S.; Kim, B. FP-Growth Algorithm for Discovering Region-Based Association Rule in the IoT Environment. Electronics 2021, 10, 3091. [Google Scholar] [CrossRef]
Gomes Dos Reis, D.; Ladeira, M.; Holanda, M.; De Carvalho Victorino, M. Large Database Schema Matching Using Data Mining Techniques. In Proceedings of the 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, 17–20 November 2018; pp. 523–530. [Google Scholar]
Pan, Z.; Pan, G.; Monti, A. Semantic-Similarity-Based Schema Matching for Management of Building Energy Data. Energies 2022, 15, 8894. [Google Scholar] [CrossRef]
Hättasch, B.; Truong-Ngoc, M.; Schmidt, A.; Binnig, C. It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. In Proceedings of the 2nd International Workshop on Applied AI for Database Systems and Applications, Tokyo, Japan, 31 August 2020. [Google Scholar]
Wang, J.; Lin, C.; Zaniolo, C. MF-Join: Efficient Fuzzy String Similarity Join with Multi-Level Filtering. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 386–397. [Google Scholar]
Choi, D.; Wee, J.; Song, S.; Lee, H.; Lim, J.; Bok, K.; Yoo, J. K-NN Query Optimization for High-Dimensional Index Using Machine Learning. Electronics 2023, 12, 2375. [Google Scholar] [CrossRef]
Zhang, H.; Zhang, Q. MinSearch: An Efficient Algorithm for Similarity Search under Edit Distance. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 23 August 2020; pp. 566–576. [Google Scholar]
Zhu, E.; Deng, D.; Nargesian, F.; Miller, R.J. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 25 June 2019; pp. 847–864. [Google Scholar]
Zhang, M.; Hadjieleftheriou, M.; Ooi, B.C.; Procopiuc, C.M.; Srivastava, D. Automatic Discovery of Attributes in Relational Databases. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12 June 2011; pp. 109–120. [Google Scholar]
Esmailoghli, M.; Quiané-Ruiz, J.-A.; Abedjan, Z. MATE: Multi-Attribute Table Extraction. Proc. VLDB Endow. 2022, 15, 1684–1696. [Google Scholar] [CrossRef]
Dong, Y.; Takeoka, K.; Xiao, C.; Oyamada, M. Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 456–467. [Google Scholar]
Fan, G.; Wang, J.; Li, Y.; Zhang, D.; Miller, R.J. Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning. Proc. VLDB Endow. 2023, 16, 1726–1739. [Google Scholar] [CrossRef]
Khatiwada, A.; Fan, G.; Shraga, R.; Chen, Z.; Gatterbauer, W.; Miller, R.J.; Riedewald, M. SANTOS: Relationship-Based Semantic Table Union Search. Proc. ACM Manag. Data 2023, 1, 1–25. [Google Scholar] [CrossRef]
Taha, I.; Lissandrini, M.; Simitsis, A.; Ioannidis, Y. A Study on Efficient Indexing for Table Search in Data Lakes. In Proceedings of the 2024 IEEE 18th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 5 February 2024; pp. 245–252. [Google Scholar]
Broder, A.Z. On the Resemblance and Containment of Documents. In Proceedings of the Proceedings. Compression and Complexity of SEQUENCES 1997, Salerno, Italy, 11–13 June 1997; pp. 21–29. [Google Scholar]
Indyk, P.; Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing—STOC ’98, Dallas, TX, USA, 23–26 May 1998; pp. 604–613. [Google Scholar]
Ukey, N.; Yang, Z.; Li, B.; Zhang, G.; Hu, Y.; Zhang, W. Survey on Exact kNN Queries over High-Dimensional Data Space. Sensors 2023, 23, 629. [Google Scholar] [CrossRef] [PubMed]
Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23 September 2007; pp. 950–961. [Google Scholar]
Korzeniowski, L.; Goczyla, K. Landscape of Automated Log Analysis: A Systematic Literature Review and Mapping Study. IEEE Access 2022, 10, 21892–21913. [Google Scholar] [CrossRef]
Ma, J.; Liu, Y.; Wan, H.; Sun, G. Automatic Parsing and Utilization of System Log Features in Log Analysis: A Survey. Appl. Sci. 2023, 13, 4930. [Google Scholar] [CrossRef]
Zhang, T.; Qiu, H.; Castellano, G.; Rifai, M.; Chen, C.S.; Pianese, F. System Log Parsing: A Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 8596–8614. [Google Scholar] [CrossRef]
Naseer, H.; Desouza, K.; Maynard, S.B.; Ahmad, A. Enabling Cybersecurity Incident Response Agility through Dynamic Capabilities: The Role of Real-Time Analytics. Eur. J. Inf. Syst. 2024, 33, 200–220. [Google Scholar] [CrossRef]
Imani, F.M.; Widyasari, Y.D.L.; Arifin, S.P. Optimizing Extract, Transform, and Load Process Using Change Data Capture. In Proceedings of the 2023 6th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Batam, Indonesia, 11 December 2023; pp. 266–269. [Google Scholar]
Qaiser, A.; Farooq, M.U.; Nabeel Mustafa, S.M.; Abrar, N. Comparative Analysis of ETL Tools in Big Data Analytics. Pak. J. Eng. Technol. 2023, 6, 7–12. [Google Scholar] [CrossRef]
Khan, B.; Jan, S.; Khan, W.; Chughtai, M.I. An Overview of ETL Techniques, Tools, Processes and Evaluations in Data Warehousing. J. Big Data 2024, 6, 1–20. [Google Scholar] [CrossRef]
Chy, M.S.H.; Arju, M.A.R.; Tella, S.M.; Cerny, T. Comparative Evaluation of Java Virtual Machine-Based Message Queue Services: A Study on Kafka, Artemis, Pulsar, and RocketMQ. Electronics 2023, 12, 4792. [Google Scholar] [CrossRef]
Zhu, E.; Nargesian, F.; Pu, K.Q.; Miller, R.J. LSH Ensemble: Internet-Scale Domain Search. Proc. VLDB Endow. 2016, 9, 1185–1196. [Google Scholar] [CrossRef]
Castro Fernandez, R.; Abedjan, Z.; Koko, F.; Yuan, G.; Madden, S.; Stonebraker, M. Aurum: A Data Discovery System. In Proceedings of the 2018 IEEE 34th International Conference on Data Engineering (ICDE), Paris, France, 16–19 April 2018; pp. 1001–1012. [Google Scholar]

Figure 1. Framework for computing joinability.

Figure 2. Dynamic data generation architecture.

Figure 3. Joinability calculations and lookups.

Figure 4. Numbers of columns of each type in the ERP database.

Figure 5. Average precision of the compared algorithms on the ERP database.

Figure 6. Average recall of compared algorithms on the ERP database.

Figure 7. Impact of time threshold on results obtained with the proposed method.

Figure 8. Impact of

I_{m a x}

on results.

Figure 8. Impact of

I_{m a x}

on results.

Figure 9. Statistics on the proportion of time slice lengths.

Figure 10. Impact of time slice quantity on the results.

Table 1. Notation used in this study.

Symbol	Description
$A, B$	Data collection tables in the database.
$A_{i}, B_{j}$	$A collection of data in column i of table A$ ; collection of data in column j of table B.
$a_{i j}$	$The data value in row i and column j of table A$ .
$d (\dots)$	A distance function used to determine similarity.
$ρ$	The similarity value.
$Q$	A query column.
$R$	A target column in the repository.
$τ$	The similarity threshold.
$l$	The relevance calculation function.
$h_{m i n}$	The minimum hash function.
$h$	The hash function.
$R_{t}$	A collection of logs in time period t.
$F i e l d, {F i e l d}_{r f}$	A collection of field names in the log within time period t; attribute f in table r.
$D A T A (F i e l d)$	A data collection of all fields.
$G (x)$	The hash function group
$s_{i j}$	The dynamic similarity between columns i and j.
$C_{A B}, σ$	The number of co-occurrences within the time slice between tables A and B; probability of co-occurrence.

Table 2. Commonly used joined field information.

Join Key	Type	Length	Description
Primary Key	INT	Default, 11 bits	A self-incrementing approach is typically used, usually starting at 0. This approach results in higher similarity between columns.
Primary Key	BIGINT	20 bits and above	A snowflake algorithm generates a unique value; there will be no duplicates.
Business Code	BIGINT	20 bits and above	Custom codes are unique in businesses, such as prefix + date + serial number.
	VARCHAR	10 bits and above
	CHAR	10 bits and above

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, W.; Zhu, C.; Yan, H. Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery. Electronics 2024, 13, 3920. https://doi.org/10.3390/electronics13193920

AMA Style

Wang W, Zhu C, Yan H. Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery. Electronics. 2024; 13(19):3920. https://doi.org/10.3390/electronics13193920

Chicago/Turabian Style

Wang, Weiwei, Chunxiang Zhu, and Han Yan. 2024. "Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery" Electronics 13, no. 19: 3920. https://doi.org/10.3390/electronics13193920

APA Style

Wang, W., Zhu, C., & Yan, H. (2024). Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery. Electronics, 13(19), 3920. https://doi.org/10.3390/electronics13193920

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery

Abstract

1. Introduction

2. Methodology

2.1. Preliminaries

2.2. Dynamic Data-Driven Time-Slicing LSH Method

2.2.1. Constructing the Static Similarity

2.2.2. Dynamic Data Acquisition

2.2.3. Time Slice and Dynamic Similarity

2.2.4. Joinability Construction

3. Experiments

3.1. Data Sets

3.2. Evaluation Metrics

3.3. Performance Comparison

3.4. Investigation of Parameter Sensitivity

3.4.1. Threshold Experiment

3.4.2. $I_{m a x}$ Experiment

3.4.3. Time Slice Statistics Experiment

3.4.4. Time Slice Quantity Experiment

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery

Abstract

1. Introduction

2. Methodology

2.1. Preliminaries

2.2. Dynamic Data-Driven Time-Slicing LSH Method

2.2.1. Constructing the Static Similarity

2.2.2. Dynamic Data Acquisition

2.2.3. Time Slice and Dynamic Similarity

2.2.4. Joinability Construction

3. Experiments

3.1. Data Sets

3.2. Evaluation Metrics

3.3. Performance Comparison

3.4. Investigation of Parameter Sensitivity

3.4.1. Threshold Experiment

3.4.2. I m a x Experiment

3.4.3. Time Slice Statistics Experiment

3.4.4. Time Slice Quantity Experiment

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.4.2. $I_{m a x}$ Experiment