1. Introduction
There is a long stream of research on set similarity query (SSQ). Given a collection of sets and a query, SSQ aims to find all sets in a dataset that are similar to the given query. SSQ plays an important role in many applications. For example, SSQ can help identify multiple profiles of the same customer in data integration. In data cleaning, similar objects can be found using SSQ to eliminate inconsistencies and redundancies [
1]. In bioinformatics, SSQ can be used to find similar genome sequences to help find treatments for diseases.
Comparing sets one by one is inefficient and index is key to efficient SSQ. Most existing SSQ and set similarity join (SSJ) algorithms are inverted index based and adopt a filter -validation framework [
2,
3,
4,
5,
6]. In the filtering stage, a variety of filtering techniques, such as prefix filtering, length filtering, and position filtering, are used to generate candidates. In the validation phase, the candidates are validated to obtain the final results. However, these algorithms usually filter unqualified sets one by one and do not have sufficient support for duplicated sets, thus leading to a low efficiency.
Trie can effectively compress the prefixes between sets into a common path, so it can better support duplicated or a near duplicated set. ETI, a trie based index, is implemented in [
7], which performs better than other inverted index based competitors. It is mainly designed for T-overlap query, which aims to find all sets having at least
T common elements with a given query. Although it can be extended to support Jaccard and many other similarity predicates, the efficiency is not high due to a large index and insufficient filtering in ETI.
SSQ and SSJ are significantly different in the execution process, index construction, threshold support, and many other aspects [
8,
9]. While a lot of work is focused on SSJ [
10,
11,
12], there are few research works on SSQ. This paper is a good supplement to the existing studies.
As the algorithms in ETI, our algorithm is also trie based. However, considering that a query with low threshold is of little significance in real applications, we design T-starTrie, a trie index created on the t-prefixs (the prefixes computed by threshold ) of all sets. As T-starTrie only indexes the prefix of each set, its space overhead is much smaller than ETI and prefix filtering can be effectively utilized. In addition, it can support any queries with query threshold t no less than . We found that SSQ on T-starTrie can be transformed into matching nodes of the first layer (FLMNodes) detecting problem on T-starTrie. Therefore, an efficient FLMNode detection algorithm is designed. Based on T-starTrie, TT-SSQ, an efficient SSQ algorithm, is proposed by developing a variety of filtering techniques. Experimental results show that TT-SSQ significantly outperforms the existing algorithms.
The main contributions of this paper are as follows:
We designed T-starTrie, a trie based index created on t-prefixs, which can effectively support any queries with a threshold no less than .
We transformed SSQ into FMNodes finding problem, and then proposed an efficient FMNode detection algorithm.
We design efficient node depth filtering (NDFiltering) and NIL length filtering (NLFiltering) and then the proposed efficient SSQ algorithm, TT-SSQ. Extensive experimental results show that TT-SSQ outperforms the existing algorithms significantly.
The rest of this paper is organized as follows:
Section 2 gives a problem definition, data preprocessing and related works.
Section 3 presents T-starTrie. The TT-SSQ algorithm is given in
Section 4.
Section 5 discusses the experiments and analyses. Finally, we conclude the full text.
2. Preliminary
2.1. Problem Definition
Given a universe U consisting of a collection of elements, which are comparable to each other and with the same type, a database D contains a collection of sets with each set S consisting of a number of elements in U. is denoted as the length of set S. Each set S has a unique set identifier (SID), and we refer to an SID as a set when there is no ambiguity. We assume S is not a multi-set, that is, there are no duplicated elements in S. Based on the above discussions, the definition of SSQ is given as follows:
Definition 1. SSQ: Given a dataset D, a query Q and a threshold t, a SSQ returns allthat satisfies.
The in Definition 1 represents a set similarity predicate. There are a variety of similarity predicates available, such as Jaccard, Dice, Cosine, T-overlap, etc. This paper uses Jaccard as the default similarity predicate. Given two sets S and Q, the Jaccard similarity between them is defined as Jaccard . The value range of Jaccard is .
Example 1. Table 1 shows an example dataset. Given a query and a threshold , the SSQ returns , while the Jaccard similarity between Q and all the other sets are smaller than t. 2.2. Data Preprocessing
Existing SSQ algorithms usually need some specific data preprocessing, mainly including sorting sets or elements in specific orders. Similar to the most previous algorithms, this paper adopts length ascending ordering as the default set ordering, as it can effectively exploit length filtering to generate candidates. For the element ordering in each set, there are three common representatives: the value ordering, the frequency ordering and the frequency-value ordering [
8]. In this paper, we use the frequency-value ordering as our default element ordering, and it can be further divided into two kinds: the frequency value ascending ordering (FVA) and the frequency value descending ordering (FVD). Take FVA as an example, and the steps to sort
D in FVA are as follows:
sorts the elements of U in frequency ascending order in D;
for each element e in each set S of D, replace e with its subscript in U;
sort the elements in each set S by their value in ascending order.
Example 2. Given the example dataset in Table 1, the frequencies of elements from 1 to 9 are 4,5,6,2,5,3,6,5,4, respectively. After sorting these elements in frequency ascending order, we obtain the following mappings between each element in U and its subscript using FVA: . Following these mappings, the mapped FVA dataset is given in Table 2. 2.3. Related Works
SSQ and SSJ are the two common set similarity processing methods. Most previous work has focused on SSJ. Given two datasets, SSJ identifies all similar set pairs, in which the first set comes from the first dataset and the second set comes from the other dataset [
13]. AllPairs [
3] preprocessed the datasets in length ordering and then combined prefix filtering and length filtering together to generate candidates. PPJoin [
4], viewed as a combination of prefix filtering and positional filtering, further imposed the frequency ascending ordering and position filtering to improve the efficiency of AllPairs. Considering the limited filtering ability of fixed prefixes, an adapted method was proposed in [
5], which improves the filtering ability by increasing the prefix length and well supports SSQ and SSJ. Groupjoin proposed in [
14] divided sets with the same length and prefix into one group, which can filter sets in a group simultaneously. Deng et al. [
6] proposed a partition-based framework to enhance filtering capabilities. For more information, we refer reader the recent surveys [
8,
15] on SSQ and SSJ.
Currently, there are only a few works supporting SSQ, e.g., ETI [
7] and Adapt [
5]. In spite of this, it was also observed that the existing SSJ algorithms can also be considered as SSQ algorithms when the offline indexes are built in advance and RS joins are performed, so they can be compared with our algorithm as well.
In addition to the above works, there are some other studies on SSQ in specific scenarios. MinHash based algorithms were studied in [
16] to obtain the approximate results of SSQ, which is different to our work aiming to obtain exact results. McCauley et al. [
17] studied SSQ under skew distributed data. Zhang et al. [
18] studied the knn SSQ algorithm, which returns the first
k results with the maximum Jaccard similarity to the query. Zhang et al. [
19] proposed a learning based method to divide a dataset into groups to improve the efficiency of SSQ. Vernica et al. [
20] researched parallel SSJ algorithms on Mapreduce.
3. T-starTrie
3.1. Inverted Index
As an index extensively used in SSQ and SSJ, the inverted index is composed of two parts: a directory and inverted lists. The directory consists of all the elements in
U. Each element corresponds to an inverted list containing all SIDs whose set contains that element. The inverted index created for the dataset in
Table 1 is shown in
Figure 1.
As shown in
Figure 1, the SID of each set is inserted into the inverted lists corresponding to all elements in that set, resulting in redundant storage and query time overhead.
For example, and are exactly the same. However, they are inserted into all the inverted lists of elements 1, 3, 7 and 8 and need to be processed dependently in these four inverted lists, deteriorating the performance.
3.2. T-starTrie
In this paper, we use trie to tackle this issue. Trie, also known as prefix tree, is an ordered tree structure. The common prefixes between two sets are mapped to a common path in trie. Each node in trie represents the collection of sets whose prefixes can map to this node. For these sets, we can process them in a group manner. Therefore, trie can support duplicated or nearly duplicated sets better than inverted indexing. The sample trie created for the dataset in
Table 1 is shown in
Figure 2.
Considering that the simple trie in
Figure 2 is difficult to support SSQ, we expand the trie node by adding more properties. The structure of the expanded node is shown in
Figure 3.
In
Figure 3, for a Node
N, the corresponding descriptions of each attribute are as follows:
1. N.Elem: the element corresponding to N;
2. N.Link: the pointer pointing to the next node with element N.Elem;
3. N.Children: the pointer pointing to the first child of N;
4. N.Parent: the pointer pointing to to the parent node of N;
5. N.NIL: the node inverted list containing all the SIDs whose prefix maps to N;
6. N.Depth: the depth of N in trie. The depth of the root is 0.
For a large dataset, creating a complete trie will lead to a large space overhead as shown in
Figure 2. Considering that a very small similarity threshold has little significance in the real world scenarios, to reduce the space overhead, in this paper, we construct a novel trie based index: T-StarTrie.
T-StarTrie is made up of two parts: a directory and a trie. The directory is the same as that of an inverted index. Each entry in the directory contains an element e and a link pointing to the first node N with N.Elem = e, hereafter we called the link for element e as e.Link for ease of description. The trie in T-StarTrie is created for a specific threshold (e.g., 0.6). That is, only the first elements of each set S are indexed in T-StarTrie. For convenience, we denote the prefix of S corresponding to threshold t as S.t-prefix.
As T-starTrie indexes
-prefix of each set, if
Q.t-prefix and
S.-prefix have no common elements,
Q and
S cannot be similar. Thus, T-Startrie can support any query with threshold greater than
. Given the above description, the T-starTrie built on the dataset in
Table 1 is shown in
Figure 4. For ease of description, a node ID (colored in red) is virtually added to the right of each node, and we call the node with ID
i as node
i hereafter.
Example 3. In Figure 4, for and t = 0.6,-prefix is {1,2}, so only elements 1 and 2 in are indexed in T-StarTrie, corresponding to node 1 and node 4 in T-StarTrie, respectively. The NIL of node 1 is {} as these four sets have the same prefix {1}. In addition, node 4 has the same Elem with node 2, so these two nodes are linked through the Link pointer. According to the above description, the corresponding index construction algorithm is shown in Algorithm 1. Updating T-starTrie is relatively straightforward. However, it is not easy to update T-starTrie while keeping FVA, as inserting new sets into T-starTrie may change the underlying frequency distribution. Fortunately, we do not need to keep FVA all the time, and a T-starTrie with approximate FVA has little impact on the query performance. Thus, we can regulate FVA periodically when the server is not busy.
Algorithm 1 CreateT-starTrie |
- Input:
, a database with each set sorted according to a certain element ordering , a threshold to create trie - Output:
, the root of the created T-starTrie - 1:
R←∅ - 2:
create a directory using all distinct elements in the -prefix for all sets in D - 3:
for each set S∈D do - 4:
P ← R - 5:
for each element e in the -prefix of S do - 6:
find the node N whose Elem is e in P.Children - 7:
if N is null then - 8:
N←create a new node with Elem e - 9:
N.NIL ← {S.SID} - 10:
insert N into P.Children - 11:
insert N into e.link - 12:
else - 13:
N.NIL ←N.NIL ∪ {S.SID} - 14:
end if - 15:
end for - 16:
P ← N - 17:
end for
|
4. Similarity Query Algorithm
4.1. Filtering Techniques
Filtering techniques are key to efficient SSQ algorithms. In this paper, we develop two following efficient filtering techniques for TT-SSQ algorithm: node depth filtering (NDFiltering) and NIL length filtering (NLFiltering). Firstly, we have NDFiltering lemma as shown in Lemma 1.
Lemma 1. Given a query Q, a threshold t and a node N, if N.Depth , then all sets in N.NIL cannot be similar to Q.
Proof. According to the length filtering, the maximum length of a set S that meets Jaccard similarity with Q is , and the maximum length of -prefix is . If N.Depth > -prefix, the length of all sets in N.NIL must be greater than . Therefore, all sets in N.NIL cannot be similar to Q. Thus, Lemma 1 holds.
According to NDFiltering, any nodes with depth greater than can be safely pruned. For the nodes passing the NDFiltering, NLFiltering can be further applied. That is, given a query Q, a threshold t and a node N, if a set .NIL is similar with Q, the length of S must be within ,.
NLFiltering is the application of length filtering on NILs. As we keep all sets in length ascending order, the SIDs in a NIL also keep the same order. As a result, we can locate the qualified SID range in an NIL by binary search, thus filtering sets whose length cannot meet the requirements. □
Example 4. In Figure 4, given a query and t = 0.6, the Q.t-prefix is {2, 5}. When querying in T-starTrie, although the ELEM of node 12 (the red node in Figure 4) is 5, this node can be filtered as its depth is . For another example, the depth of node 2 is 1, which satisfies NDFiltering. However, according to NLFiltering, the length bound of for Q is [3, 6], so and in the NIL of this node can be filtered out. Based on NDFiltering and NLFiltering, we present our TT-SSQ algorithm.
4.2. TT-SSQ
Before introducing TT-SQL, the following definitions are given:
Definition 2. Matching Node (MNode): a node N is called a MNode if N.Elem appears in Q.t-prefix.
Definition 3. Matching node of the first layer(FMNode): an MNode with no other MNodes in the path from this node to the root.
Based on the definition of FMNode, we have Lemma 2 as follows:
Lemma 2. The union of the NILs of all FMNodes is the superset of SSQ.
Proof. Firstly, we know from prefix filtering that if two sets are similar, they must have a common element in their prefixes. Thus, the union of the NILs of all MNodes contains all sets similar to Q. Secondly, the NIL of any non-FMNode is a subset of the NIL of FMNodes, so Lemma 2 holds.
According to Lemma 2, our problem can be transformed into obtaining FLMNodes on T-starTrie. We observed that a node N should meet one of the following three conditions to be a FMNode:
N.Elem equals the first element in Q;
N.Depth is 1;
N does not satisfy the 1st and the 2nd condition, but satisfies Definition 3.
□
Identifying an FMNode
N satisfying the 1st and the 2nd condition is rather easy. For the 3rd condition, we can iteratively check each node
N* from the parent of N to the root in a bottom-up manner to see whether
N*.Elem is in
Q.t-prefix. In most cases, we do not need to check all nodes from
N to the root; instead, we can make an early termination when
N*.Elem is in
Q.t-prefix or
N*.Elem overpasses the first element of
Q, here “overpasses” means “is smaller than” for FVA, and “is greater than” for FVD. The algorithm for checking whether
N is a FMNode, IsFMNode, is shown in Algorithm 2.
Algorithm 2 IsFMNode |
- Input:
Q={, , …, }, a query N, a node - Output:
true if N is a FMNode, else otherwise - 1:
ifN.Elem=then - 2:
return true - 3:
end if - 4:
ifN.Depth=1 then - 5:
return true - 6:
end if - 7:
for each node from the parent of N to the root do - 8:
if .Elem in Q.t-prefix then - 9:
return false - 10:
else .Elem overpasses - 11:
return false - 12:
end if - 13:
end for - 14:
return true
|
Based on the above discussion, the main steps of TT-SSQ are described as follows:
For each node N in the Link of each element in Q.t-prefix;
If N does not pass NDFiltering, skip this node;
If N is a FMNode, perform NLFiltering on its NIL to obtain the candidates;
The complete TT-SSQ algorithm is shown in Algorithm 3:
Algorithm 3 TT-SSQ |
- Input:
Q={,, …,}, a query t, a query threshold - Output:
R, all sets having a similarity at least t with Q - 1:
R←∅ - 2:
for each element e in Q.t-prefix do - 3:
for each node N in e.Link do - 4:
if N.Depth ≤ then - 5:
if IsFMNode(Q,N) then - 6:
C ← the SIDs in N.NIL whose set length between [,] - 7:
for each c in C do - 8:
if verify(Q, c, t) then - 9:
R ← R ∪ {c} - 10:
end if - 11:
end for - 12:
end if - 13:
end if - 14:
end for - 15:
end for
|
Example 5. Given a query and t = 0.6, the Q.t-prefix is . Since 2 is the first element in Q.t-prefix, the nodes with green bold circle in Figure 4 (node 2 and node 4) are FMNodes. For nodes with Elem 5, only node 3 is a FMNode that passed NDFiltering. Therefore, nodes 2, 3 and 4 are FMNodes and are further filtered by NLFiltering to obtain the final candidates {,,,,,}. 6. Conclusions
In this paper, T-starTrie, a novel trie-based index is designed, and TT-SSQ is proposed based on T-starTrie. T-starTrie is time and space efficient by only indexing the -prefix of each set. We transform the SSQ problem into the problem of finding FMNodes in T-starTrie and propose two effective filtering methods: NDFiltering and NLFiltering. By exploiting NDFiltering and NLFiltering, TT-SSQ can filter a large number of sets in a group manner, which has a high efficiency. Experimental results show that TT-SSQ significantly outperforms the other algorithms, especially for high query thresholds. Next, we will further optimize our index and algorithm. In addition, applying this work to some real world applications is also an interesting direction.