3.1. TLP
Two strings may lead to collisions when they are inserted into a DA, so it is not practical to insert them into a DA simultaneously. To construct a DA in parallel, we need to divide the dataset into partitions and build multiple independent small DAs in parallel. As mentioned above, BP and BP_PLA are unsuitable or less efficient in parallel environments because of the difficulty of preserving prefix integrity and load balancing at the same time. To this end, this paper proposes a bottom-up two-level partitioning (TLP) framework to achieve a better balance between these two characteristics.
First, TLP divides a dataset $D$ into $m$ discrete lower-level partitions (LPs), $L = \{LP_1, LP_2, \ldots, LP_m\}$, based on the first character of the strings in $D$. Each LP is a first-character partition (all strings with the same first character in $D$ constitute one first-character partition), and $|LP_i|$ is used to represent the number of strings in $LP_i$. This ensures prefix integrity within each LP. Second, TLP uniquely assigns each LP to an upper-level partition (UP), eventually merging the $m$ LPs into $n$ UPs ($n \le m$). Similarly, $|UP_j|$ is used to represent the number of strings in $UP_j$. A brief structure of TLP is shown in Figure 4.
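As a small illustration of the lower-level partitioning step, the following C++ sketch counts the size of each first-character partition. It is only a sketch under our own assumptions; the function name `CountLPSizes` and the byte-indexed array are illustrative and not part of the paper.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of the lower-level partitioning: count how many strings of
// the dataset fall into each first-character partition (LP). With byte-sized
// characters there are at most 256 LPs.
std::array<std::size_t, 256> CountLPSizes(const std::vector<std::string>& dataset) {
    std::array<std::size_t, 256> lpSize{};  // lpSize[c] = size of the LP of character c
    for (const std::string& s : dataset) {
        if (!s.empty()) {
            lpSize[static_cast<unsigned char>(s[0])]++;
        }
    }
    return lpSize;
}
```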
Merging $m$ LPs into $n$ UPs is non-trivial. Intuitively, there are $n^m$ different merging strategies, and different merging strategies lead to different load-balancing performance. Thus, choosing the merging strategy that yields the best load balance is a key to TLP. For ease of explanation, we give the definition of Partition Range (PR) as follows:
Definition 1. Given $U = \{UP_1, UP_2, \ldots, UP_n\}$, the PR of $U$ is defined as $PR(U) = \max_{1 \le j \le n}|UP_j| - \min_{1 \le j \le n}|UP_j|$, where $\max_{1 \le j \le n}|UP_j|$ and $\min_{1 \le j \le n}|UP_j|$ represent the maximum and minimum number of strings of the UPs in $U$, respectively.
PR can be used to measure the load balance of a partitioning strategy. To obtain the optimal load-balancing performance, the optimal LP merging is defined as follows:
Definition 2. Optimal LP Merging (OLM): Given $L = \{LP_1, LP_2, \ldots, LP_m\}$, each LP in $L$ is uniquely assigned to a UP, thus merging $L$ into $n$ disjoint UPs $U = \{UP_1, UP_2, \ldots, UP_n\}$ while keeping $PR(U)$ minimum.
OLM is essentially a multi-way number partitioning problem [29,30], which has been proved to be NP-complete, so solving it within limited time is key to TLP. Based on the analysis above, this paper proposes a Min-Heap based Greedy Merging algorithm (MH_GMerge) to quickly find an approximately optimal merging strategy. The basic idea of MH_GMerge is as follows:
- (1)
sort the m LPs to be merged in descending order of their numbers of strings;
- (2)
given the partition parameter n, build a min-heap from the first n LPs in L, and set the corresponding UP partition numbers (UPNs) of these n LPs to 1, 2, 3, …, n, accordingly;
- (3)
process the subsequent m − n LPs in order until all LPs have been processed. For each remaining $LP_j$ ($n < j \le m$), assign its UPN to the UPN of the heap top $H_t$ and execute $|H_t| \leftarrow |H_t| + |LP_j|$, where $|H_t|$ denotes the total number of strings in all LPs assigned to $H_t$. Finally, readjust the heap to restore the min-heap property.
The number of elements in the created min-heap is always n, and there is no need to actually load the subsequent m − n LPs into the heap, which makes MH_GMerge very fast. The complete MH_GMerge algorithm is shown in Algorithm 1.
Algorithm 1. Min-Heap based Greedy Merging (MH_GMerge) algorithm.
Algorithm 1 MH_GMerge
//Input: L, a set of m LPs
//           n, number of partitions
//Output: PQT, partition query table
1. L ← sort L by number of strings in descending order
2. PQT ← empty map from LP to UP
3. for j from 1 to n
4.    PQT[j] ← j
5. H ← min-heap constructed from the first n elements of L
6. for j from n + 1 to m
7.    Ht ← top element of H
8.    |Ht| ← |Ht| + |LPj|
9.    PQT[j] ← Ht.upNo
10.   adjust H to restore the min-heap property
Complexity Analysis: The time complexity of sorting the m LPs is $O(m \log m)$, and the time complexity of constructing and adjusting the heap is $O(m \log n)$. As $n \le m$, the overall time complexity of the MH_GMerge algorithm is $O(m \log m)$. The major space overheads of MH_GMerge are those of the partition query table (PQT) and the heap, which are $O(m \cdot c)$ and $O(n \cdot c)$, respectively, where $c$ represents the overhead of an integer. As a result, the total space overhead of MH_GMerge is $O((m + n) \cdot c)$.
After MH_GMerge finishes, we obtain a UPN for each LP, which means a mapping between LPs and UPs has been established. This mapping is expressed as a partition query table (PQT) in this paper. As all strings in an LP share the same first character c, we can use an array to implement PQT and store the UPN of the UP that the LP is merged into directly in PQT[c]. In the subsequent index construction and string retrieval processes, the UPN can be retrieved directly by the first character of a string, thus avoiding the expensive overhead of string comparison.
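As a concrete illustration of Algorithm 1 and the array-based PQT, the following C++ sketch shows one possible realization. It is a sketch under our own assumptions (LPs identified by their first character, a 256-entry PQT, a `std::priority_queue` acting as the min-heap, and the names `LPInfo`, `UPLoad`, and `MHGMerge`), not the authors' implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <queue>
#include <vector>

struct LPInfo {
    unsigned char firstChar;  // the character identifying this LP
    std::size_t   size;       // number of strings in this LP
};

struct UPLoad {
    int         upNo;  // UP partition number (UPN)
    std::size_t load;  // total number of strings merged into this UP so far
};

// Illustrative sketch of MH_GMerge: greedily assign each LP (largest first)
// to the currently lightest UP, tracked with a min-heap.
// Returns PQT: PQT[c] = UPN for all strings whose first character is c;
// characters that do not occur in the dataset keep the value 0.
std::array<int, 256> MHGMerge(std::vector<LPInfo> lps, int n) {
    std::array<int, 256> pqt{};  // partition query table, indexed by first character

    // 1. Sort the LPs by size in descending order.
    std::sort(lps.begin(), lps.end(),
              [](const LPInfo& a, const LPInfo& b) { return a.size > b.size; });

    // 2. The first n LPs seed UPs 1..n and initialize the min-heap.
    auto lighter = [](const UPLoad& a, const UPLoad& b) { return a.load > b.load; };
    std::priority_queue<UPLoad, std::vector<UPLoad>, decltype(lighter)> heap(lighter);
    for (int j = 0; j < n && j < static_cast<int>(lps.size()); ++j) {
        pqt[lps[j].firstChar] = j + 1;
        heap.push({j + 1, lps[j].size});
    }

    // 3. Each remaining LP joins the UP at the heap top (the lightest one).
    for (std::size_t j = static_cast<std::size_t>(n); j < lps.size(); ++j) {
        UPLoad top = heap.top();
        heap.pop();
        top.load += lps[j].size;
        pqt[lps[j].firstChar] = top.upNo;
        heap.push(top);  // re-heapify with the updated load
    }
    return pqt;
}
```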
Example 1: Given $L = \{LP_1, LP_2, \ldots, LP_7\}$ with $m = 7$ and $n = 3$, the numbers of strings of the LPs are 100, 80, 65, 60, 55, 20, and 10, respectively. The initial min-heap constructed on the first 3 LPs is shown in Figure 5; the number on the left part of a heap node is the UP partition number, and the number on the right is the total number of strings in that UP. For $LP_4$, we set its UPN to the UPN of the heap top, namely 3, and then execute $|H_t| \leftarrow 65 + 60 = 125$. At last, we adjust the heap back to a min-heap. Proceeding in this way, after all LPs have been processed, we obtain the final result shown in Figure 6. The final PR of this example is $135 - 125 = 10$.
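Assuming the illustrative `MHGMerge` sketch given earlier, Example 1 could be reproduced roughly as follows; the first characters 'a' to 'g' are placeholders of our own choosing for the seven LPs.

```cpp
// Hypothetical driver for Example 1: seven LPs of sizes
// 100, 80, 65, 60, 55, 20, 10 and n = 3 (uses the MHGMerge sketch above).
std::vector<LPInfo> lps = {
    {'a', 100}, {'b', 80}, {'c', 65}, {'d', 60}, {'e', 55}, {'f', 20}, {'g', 10}};
std::array<int, 256> pqt = MHGMerge(lps, 3);
// Greedy merging yields UP loads of 130, 135, and 125, so PR = 135 - 125 = 10.
```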
It is worth noting that TLP is independent of any specific DA; it can be seamlessly integrated with other DAs and further improve their construction efficiency.
From the analysis above, it is easy to see that TLP combined with MH_GMerge achieves better load-balancing performance while preserving prefix integrity. The final TLP algorithm is shown in Algorithm 2.
Algorithm 2. TLP algorithm.
Algorithm 2 TLP
//Input: D, a string dataset
//           n, number of partitions
//Output: PQT, partition query table
1. L ← divide D into partitions by initial characters
2. PQT ← MH_GMerge(L, n)
3.2. TLP-Based Partitioned DA
Based on TLP, we can easily implement a TLP-based partitioned DA (TPDA) both serially and in parallel. For ease of description, we only discuss constructing TPDA in parallel. To do so, this paper employs a popular parallel library, OpenMP, to construct the final TPDA in parallel. Using OpenMP, n threads are started, and each thread is responsible for the DAs corresponding to one UP. Based on the level at which the DAs are created, two kinds of TPDAs are designed, as Figure 7 shows.
- (1)
DAs are created on UPs (UP-DA), one DA for each UP, n DAs in total;
- (2)
DAs are created on LPs (LP-DA), one DA for each LP, m DAs in total. In this case, a UP corresponds to multiple independent DAs.
Both UP-DA and LP-DA consist of a PQT and DAs. The PQT is responsible for determining the UPN of a string, while the DAs are responsible for indexing and retrieving strings. Despite the different numbers of DAs created for the two structures, the loads of the threads remain relatively balanced because the threads are UP-oriented. Generally speaking, LP-DA is expected to have much better performance because it creates multiple smaller DAs, thus reducing collisions and collision-handling costs.
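For illustration, the following C++ sketch (our own assumption, not the paper's code) shows the retrieval path shared by UP-DA and LP-DA in the UP-DA layout: PQT resolves the UPN from the first character, and the corresponding DA is then queried. The template parameter `DAIndex` and its `Search` method are hypothetical placeholders for any concrete double-array implementation.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical retrieval path (UP-DA layout shown): PQT maps the first
// character of the key to a UPN, and the key is then searched in the DA
// owned by that UP. DAIndex is a placeholder, not a real API.
template <typename DAIndex>
bool Lookup(const std::string& key,
            const std::array<int, 256>& pqt,
            const std::vector<DAIndex>& upDAs) {
    if (key.empty()) return false;
    int upNo = pqt[static_cast<unsigned char>(key[0])];  // no string comparison needed
    if (upNo == 0) return false;                         // first character not in the dataset
    return upDAs[upNo - 1].Search(key);                  // UPNs are 1-based in this sketch
}
```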
Taking UP-DA as an example, the algorithm for constructing UP-DA is shown in Algorithm 3.
Algorithm 3. Constructing algorithm for UP-DA.
Algorithm 3 CreateUPDA
//Input: D, a string dataset
//           n, number of partitions
//Output: PQT, partition query table
1. L ← divide D into partitions by initial characters
2. PQT ← MH_GMerge(L, n)
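To complement the pseudocode above, the following C++/OpenMP sketch outlines how the parallel per-UP construction described above could be realized. The bucketing logic, the function name `CreateUPDASketch`, and the template parameter `DAIndex` with its `Insert` method are assumptions for illustration only, not the authors' implementation.

```cpp
#include <array>
#include <string>
#include <vector>
#include <omp.h>

// Sketch of parallel UP-DA construction: after TLP has produced PQT,
// strings are grouped by their UPN, and each UP's DA is then built by its
// own OpenMP thread. DAIndex stands in for any double-array type exposing
// an Insert(key) method; it is a placeholder, not a real API.
template <typename DAIndex>
std::vector<DAIndex> CreateUPDASketch(const std::vector<std::string>& dataset,
                                      const std::array<int, 256>& pqt, int n) {
    // Group strings by UPN (1..n), resolved through the first character.
    std::vector<std::vector<std::string>> buckets(n);
    for (const std::string& s : dataset) {
        if (!s.empty()) {
            buckets[pqt[static_cast<unsigned char>(s[0])] - 1].push_back(s);
        }
    }

    // One thread per UP: each thread builds the DA of its own UP independently,
    // so no synchronization is needed between threads.
    std::vector<DAIndex> das(n);
    #pragma omp parallel for num_threads(n)
    for (int i = 0; i < n; ++i) {
        for (const std::string& s : buckets[i]) {
            das[i].Insert(s);
        }
    }
    return das;
}
```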