Next Article in Journal
Mapping Landslide Hazard Risk Using Random Forest Algorithm in Guixi, Jiangxi, China
Next Article in Special Issue
A Unified Methodology for the Generalisation of the Geometry of Features
Previous Article in Journal
Generalization of Soundings across Scales: From DTM to Harbour and Approach Nautical Charts
Previous Article in Special Issue
Station-Free Bike Rebalancing Analysis: Scale, Modeling, and Computational Challenges
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Continuous k Nearest Neighbor Queries over Large-Scale Spatial–Textual Data Streams

College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
*
Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2020, 9(11), 694; https://doi.org/10.3390/ijgi9110694
Submission received: 19 October 2020 / Revised: 11 November 2020 / Accepted: 18 November 2020 / Published: 20 November 2020
(This article belongs to the Special Issue Spatial Optimization and GIS)

Abstract

:
Continuous k nearest neighbor queries over spatial–textual data streams (abbreviated as CkQST) are the core operations of numerous location-based publish/subscribe systems. Such a system is usually subscribed with millions of CkQST and evaluated simultaneously whenever new objects arrive and old objects expire. To efficiently evaluate CkQST, we extend a quadtree with an ordered, inverted index as the spatial–textual index for subscribed queries to match the incoming objects, and exploit it with three key techniques. (1) A memory-based cost model is proposed to find the optimal quadtree nodes covering the spatial search range of CkQST, which minimize the cost for searching and updating the index. (2) An adaptive block-based ordered, inverted index is proposed to organize the keywords of CkQST, which adaptively arranges queries in spatial nodes and allows the objects containing common keywords to be processed in a batch with a shared scan, and hence a significant performance gain. (3) A cost-based k-skyband technique is proposed to judiciously determine an optimal search range for CkQST according to the workload of objects, to reduce the re-evaluation cost due to the expiration of objects. The experiments on real-world and synthetic datasets demonstrate that our proposed techniques can efficiently evaluate CkQST.

1. Introduction

The continuous k nearest neighbor queries over spatial–textual data streams (abbreviated as CkQST) retrieve to and continuously monitor at most k nearest neighbor (abbreviated as kNN) objects at the user-specified location containing all the user-specified keywords, which have been widely used in a variety of location-based applications, such as location-aware targeting of advertisements, analysis of micro-blogs, and mobile navigation-services.
In an e-coupon recommendation system or a Weibo publish/subscribe system, users register his/her interests (e.g., favorite food or clothing brand for the former, and news or persons for the latter) as a query. A stream of spatial–textual objects (e.g., e-coupons or Weibos) generated are fed to the relevant users. Continuous queries over spatial–textual data streams studied by existing work [1,2,3,4,5,6,7,8,9,10,11,12] are primarily in terms of Boolean matching or approximate matching, which return an unpredictable number of objects or approximate results. The number of qualified objects containing all keywords specified by a user can be far larger than k , because the objects (e.g., tweets, news) usually contain much more keywords than queries do. This motivates us to study CkQST, which return at most k nearest neighbor objects containing all the query keywords.
Example 1. Figure 1 depicts a running example used throughout this paper. At timestamp t 0 , there are five subscribed 2-NN (i.e., k = 2 ) queries q 1 , q 2 , , q 5 with a small circle representing their geo-location, and five objects o 1 , o 2 , , o 5 with a small square representing their geo-location in Figure 1a, while corresponding keywords and expiration times are shown in Figure 1e. The spatial region is organized by a three-layer quadtree, where the spatial nodes are numbered successively, and the root node is n 0 . Taking the evaluation of q 1 as an example, { o 4 , o 1 } is returned. For q 1 , the spatial search range, thereafter “search range”, is defined as a minimal circle centered at the geo-location of q 1 and covering { o 4 , o 1 } , i.e., C 1 . At timestamp t 1 , an object o 6 arrives, as shown in Figure 1b, with keywords { w 1 , w 2 , w 3 } and expiration time t 2 . o 6 contains all the keywords of q 1 and q 4 , but only C 1 is hit by o 6 . The result and search range of q 1 are updated to { o 6 , o 4 } and the circle C 1 , respectively, while the result and search range of q 4 are not affected. At timestamp t 2 , o 6 expires. For q 1 , the number of qualified objects in C 1 is less than 2, so the result should be re-evaluated. The result and search range of q 1 are updated to { o 4 , o 1 } and the circle C 1 , respectively. Therefore, for CkQST, the spatial search range covering kNN objects changes dynamically with the arrival and expiration of qualified objects.
Challenges. The solution framework for evaluating generic continuous queries over spatial–textual data streams consists of selecting an appropriate spatial index and a textual index to form a hybrid spatial–textual index, and exploiting it with appropriate spatial and/or textual filtering strategies to process the incoming objects according to the features of queries [1,2,3,4,5,6,7,8,9,10,11,12]. There are three key challenges in constructing such an index for CkQST.
First, regarding the spatial filtering, evaluating CkQST is essentially identifying queries whose search range is hit by the incoming objects. It is very important to efficiently organize the search ranges of CkQST; therefore, how to map the search range of CkQST to the spatial nodes is the focus. The search range of CkQST covering kNN objects changes frequently with the arrival and expiration of qualified objects, which requires the index to have both strong filtering ability and low update cost. For most spatial indexes, having strong filtering ability and low update cost are contradictory. There are two approaches to mapping the search range of queries to spatial nodes to improve the filtering ability and reduce the update cost of the index. (1) Queries are mapped to the leaf nodes in the spatial index, which minimizes the spatial region of the nodes covering the search range of queries to reduce the number of objects to be verified [2,5,7,8,9,10,13,14,15]. (2) Queries are mapped to the spatial nodes according to the spatial distribution [10,11,12], the keyword distribution [6], or the corresponding cost model [1,3,4]. These approaches are appropriate in the scenarios where the search range of the queries rarely changes, but inappropriate to CkQST, where frequent update of the search range of the queries results in high costs.
Second, regarding the textual filtering, evaluating CkQST is essentially identifying queries whose keywords are fully contained in a given object. An inverted index is usually used to organize continuous queries [1,2,6,10]. A large number of queries make the posting lists very long, and the fast-arriving objects are verified against the corresponding posting lists in multiple rounds in a short time, which becomes the bottleneck of textual filtering. There are three ways to improve textual filtering capabilities. (1) Insert queries into the shortest posting list according to the frequency of query keywords to reduce the number of queries in posting lists, such as in the ranked-key inverted index [1,6]. The posting lists may still be long. (2) Increase the depth of textual partition, such as the ordered keyword trie [3,4,6]. It takes much time to construct the index, and nodes must be reconstructed if queries are updated, which is not appropriate for the scenarios like CkQST where the queries are frequently updated. (3) Organize queries in posting lists in the ascending order according to the ranking score [10]. However, there is no corresponding concept in CkQST. None of the above approaches can efficiently support CkQST textual filtering.
Third, the kNN re-evaluation is frequently triggered by object expiration. When an object expires, several CkQST have to be re-evaluated from scratch, which is expensive. Several techniques have been proposed to solve the similar problems in approximate top-k query (e.g., [10,13,16,17]). They all favor maintaining more than k results to reduce the chances for re-evaluation. However, they either maintain all the skyline objects in the entire region [13,16], or maintain a k-skyband containing skyline objects whose scores were larger than a threshold [10,17], which are not designed for the CkQST returning exact results.
In view of the challenges, we extend a quadtree with an ordered, inverted index to organize CkQST. Three key techniques are proposed to exploit the spatial–textual index and address the above three challenges. The contributions of this paper follow.
(1) To support the frequent change of search ranges of CkQST, a memory-based cost model is proposed to map the search ranges of CkQST to the quadtree nodes, which minimizes the verification cost and index update cost.
(2) To reduce the number of queries verified and process objects in batches, an adaptive block-based ordered, inverted index is proposed to organize the query keywords at quadtree nodes, which allow multiple objects containing common texts to be verified concurrently. For this index, an insertion strategy is proposed to adaptively insert queries in views of the skewed distributions of CkQST and objects.
(3) To reduce the re-evaluation cost, a cost-based k-skyband technique is proposed to judiciously determine the search range for CkQST according to the workload of objects, which minimize the verification cost, update cost, and the re-evaluation cost.
The experiments on real-world and synthetic datasets demonstrate that the proposed techniques can efficiently evaluate CkQST. Compared with the state-of-the-art techniques, when the number of CkQST reaches 20 M, the average index updating time caused by incoming objects decreases by 61%, and the average incoming object processing time decreases by 36%. Compared with the re-evaluation from scratch, the average processing time for expired objects decreases by 99.99%. The rest of this paper is organized as follows. Section 2 formally defines CkQST and presents a framework for evaluating CkQST. Section 3 presents three key techniques for evaluating CkQST. Section 4 reports the experimental studies. Finally, Section 5 concludes this paper.

2. The Framework for Evaluating CkQST

In this section, we formally define CkQST in Section 2.1 and present a framework to evaluate CkQST in Section 2.2.

2.1. Problem Definition

A spatial–textual object is defined as o = l o c , ψ , t e , where o . l o c is the geo-location, o . ψ is a set of keywords (terms) from a vocabulary set V , and o . t e is a timestamp indicating the expiration time of o . All the spatial–textual objects over the data streams are denoted as O . A CkQST is defined as q = l o c , ψ , k , t e , where q . l o c , q . ψ , and q . t e follow the similar meaning to o , q . k is the number of returned objects, i.e., at most q . k (abbreviated as k ) results are maintained for q . The result list of q , denoted as q ( O ) k , contains a set of k objects, each of which covers all the keywords in q . ψ . q ( O ) k is organized by a linked list, in which objects are arranged in the ascending order according to the distances to q . Formally, o q ( O ) k ( ( o O \ q ( O ) k ) ( o . ψ q . ψ d i s t ( o , q ) d i s t ( o , q ) ) ) , where d i s t ( o , q ) is the Euclidean distance between o and q . Let q . k d i s t be the distance between q and its k t h nearest neighbor result. The search range for q , denoted as q . S R k , is defined as a circle centered at q . l o c with radius q . k d i s t .
Spatial–textual objects are usually advertisements published by merchants or the latest breaking news, and CkQST are users’ search requests. Hereafter spatial–textual object and CkQST are abbreviated as object and query, respectively, if there is no ambiguity. To simplify the calculation, the terms in the vocabulary set V are mapped to integers between 1 and V according to the alphabetical order, where V is the number of terms in V . We assume that the terms in V , and the terms contained in queries and objects are sorted in increasing order. Specifically, for q Q , we use q . ψ [ i ] to denote the i t h keyword of q , q . ψ [ i : j ] to denote a subset of q . ψ , i.e., i l j q . ψ [ l ] , q . ψ [ : i ] to denote 1 l i q . ψ [ l ] , q . ψ [ i : ] to denote i l q . ψ q . ψ [ l ] , and q . ψ to denote the number of keywords in q . ψ . Objects follow the similar notations. Table 1 summarizes the notations used throughout this paper.
Problem Statement. Given a set of CkQST Q and spatial–textual data streams O , for each CkQST, find the kNN objects containing all the query keywords over O whenever objects arrive or expire.

2.2. The Framework for Evaluating CkQST

The framework for evaluating CkQST shown in Figure 2 consists of two indexes and four key techniques. The object index organizes the objects and can be implemented with any existing spatial–textual index, and we adopt the inverted linear quadtree (IL-quadtree) [18] as an example. The query index organizes queries, which is essentially a quadtree integrated with an ordered, inverted index described in Section 3.
The arrival and expiration of objects. When multiple objects arrive in a batch, they are inserted into the object index, and processed by the object-batch processing algorithm with the help of the query index to find all the affected queries and update the corresponding queries’ results and search ranges. When objects expire, the result list of affected queries is checked. Those queries that cannot be refilled through their result list are re-evaluated from scratch against the object index. To save computational cost, the expired objects are removed lazily from the object index until they are accessed again.
The arrival and deletion of queries. When a new query is submitted, it is initially evaluated using the object index with several strategies. A cost-based k-skyband technique is used to find an optimal search range for the query to reduce the cost for updating the index by sacrificing a little bit of filtering performance. A memory-based cost model is used to get the corresponding mapped spatial nodes. An adaptive insertion strategy is used to get the posting list and the corresponding block to be inserted. These strategies can further improve the filtering performance of the index and reduce the cost for updating the index. When a query is deleted or its search range shrinks, a flag is set in the corresponding nodes, where a query table is maintained, and it is not removed from the query index until accessed again, which is called delayed deletion and is necessary in an update-friendly system. A query insertion request might cancel the marked items, which avoid the deletion of objects changing frequently. If a query is deleted, its result list is also removed.

3. The Query Index

According to the above discussions, the query index is essentially a quadtree extended with an ordered, inverted index. Three techniques are proposed to enhance the filtering ability and reduce the update cost of the index. Section 3.1 introduces the motivations. Section 3.2 describes the ordered, inverted index, followed by a detailed adaptive query inserting algorithm in Section 3.3. Section 3.4 proposes the memory-based cost model to quantitatively analyze how to find optimal associated nodes for CkQST. The algorithm for processing objects in batches is presented to improve the throughput in Section 3.5. The re-evaluation technique is introduced in Section 3.6.

3.1. Motivations

Organizing the search range of CkQST. The first issue of using a quadtree to organize the search range of CkQST is how to map the search range to the quadtree nodes. Given that q Q , q . S R k can be mapped to any set of quadtree nodes N S = { n 1 , n 2 , , n i } , only if the union of the spatial region corresponding to these nodes in N S covers q . S R k , which is also called that q is associated with N S . Associating a query with the quadtree nodes is challenging because it affects two computation costs: (1) Verification cost, i.e., the cost of verifying the query with the objects falling in the associated nodes. (2) Update cost, i.e., the cost of inserting or deleting the query in or from the associated nodes. If the search range is organized by nodes with large regions, the index update cost is small, and the verification cost is large; otherwise, if multiple nodes with small regions are used, the situation is reversed. Therefore, a cost model is required to trade off the verification cost and update cost, and find the optimal associated nodes for CkQST.
Organizing the keywords of CkQST. When new objects arrive, the cost of verifying these objects with queries in spatial nodes is expensive. How to reduce the verification cost is the key to improve the filtering ability of the index. We discuss three aspects of constructing an inverted index. (1) For an inverted index, queries in posting lists are usually unordered. For the five queries in Figure 1, we attached them to the posting list of a single keyword. Figure 3a is the inverted index in which queries are attached to the posting list corresponding to the first query keyword, and Figure 3b is the ranked-key inverted index in which queries are attached to the posting list corresponding to the least frequent keyword. If the incoming objects contain the corresponding term, all queries in posting lists are verified [1,2], which is inefficient. In this work, we use an ordered, inverted index to solve the above problem. Figure 3c is the ordered, inverted index if queries are attached to the posting list corresponding to the first keyword, i.e., queries in posting lists are organized in the ascending order according to the keywords. When o 6 with keywords { w 1 , w 2 , w 3 } arrives, the posting list corresponding to w 1 is verified. When o 6 is verified with q 2 , its keywords are smaller than q 2 , so we can terminate the verification early and speed up processing objects.
(2) Compared with the ordered, inverted index constructed by single keyword, the ordered, inverted index constructed by multiple keywords has more advantages. The length is shorter and the verification probability is smaller. As Figure 3d shows, o 6 is verified with the first two posting lists and contains all the keywords of the queries in these posting lists. However, the number of posting lists might grow sharply. If the number of terms contained in vocabulary set V is 1 M, and the number of query keywords are not more than 5, total 106×5 posting lists are required, which is difficult to implement by hash table due to the need for large continuous memory. Like other works in [1,2,3,4,5,6,7,8,9,10,11,12], this paper uses the Map class in Microsoft Visual Studio [19] to build the ordered, inverted index. Lemma 1 describes the verification efficiency with the number of keywords for constructing the ordered, inverted index. Section 4.2 verifies this lemma through experiments. Based on the discussions, we select two keywords to construct the ordered, inverted index.
(3) Usually, there are many queries in posting lists, but only a small number match the incoming objects. Therefore, quickly locating the queries to be verified in posting lists is another way to improve the efficiency of evaluating CkQST. The queries in posting list are partitioned into multiple blocks such that objects are verified with the queries in a few blocks rather than the whole posting list. The only problem is how to partition these queries in posting lists. It is inefficient to have too many or too few queries in a block. An adaptive insertion strategy is proposed in Section 3.3.

3.2. Ordered, Inverted Index

The formal definition of an ordered, inverted index constructed using two keywords follows. Queries are attached to the posting list of their first two keywords, and arranged in ascending order according to their keywords.
Definition 1 (Ordered Posting List/Ordered, Inverted Index).
Given a set of queries  q 1 , q 2 , , q i  to be inserted into a quadtree node, if  q 1 . ψ [ 1 : 2 ] = q 2 . ψ [ 1 : 2 ] = q i . ψ [ 1 : 2 ] = { w i 1 , w i 2 }  , and  q 1 . ψ [ 3 : ] q 2 . ψ [ 3 : ] q i . ψ [ 3 : ] , the posting list determined by the two terms  w i 1 , w i 2  at the node is denoted as  P L w i 1 w i 2 , in which these queries are successively inserted.  P L w i 1 w i 2 is called an ordered posting list. Specifically, if these queries only contain one keyword, the corresponding posting list is denoted as P L w i 1 w i 1 . All the ordered posting lists constitute the ordered, inverted index.
Hereafter the ordered posting list is abbreviated as posting list, if there is no ambiguity. To quickly locate the queries to be verified, posting lists are divided into multiple blocks.
Definition 2 (Block).
Given any ordered posting list  P L w i 1 w i 2 , w i 1 w i 2 , b r  is the  r t h  block of  P L w i 1 w i 2 . For any query  q in b r , q . ψ [ 3 ] [ b r . m i n w , b r . m a x w ] , where  b r . m i n w = min q i b r q i . ψ [ 3 ] , b r . m a x w = max q i b r q i . ψ [ 3 ] . b r . ψ denotes all the keywords satisfying  b r . ψ = { w | w [ b r . m i n w , b r . m a x w ] } . Specially, if  q  only contains one or two keywords, it is inserted into the block  b 0  of the corresponding posting list.
Lemma 1.
If the number of keywords for constructing an ordered, inverted index is  m m 1 , there are at most V m posting lists at a node. For any object  o containing more than two keywords, the verification cost can be estimated by Equation (1). Where B is the number of blocks in a posting list, b is the number of queries in b , Q contains queries whose keywords are contained in o , and the number of keywords is less than m . The proof is shown in Appendix A.
O ( o . ψ m log V m + o . ψ m + 1 ( log B + b ( B b . ψ ) m 1 ) + Q )
For w j b . ψ , the verification probability, denoted as p V w ( w j ) , is maintained, i.e., the probability of verifying these queries subjected to q i . ψ [ 3 ] = { w j } in b r . For b r , the verifying probability, denoted as p V B ( b r ) , is maintained, i.e., the probability that the block b r is verified, which can be estimated by Equation (2).
min { max w b r . ψ p V w ( w ) + 1 b r . ψ 1 ( w b r . ψ p V w ( w ) max w b r . ψ p V w ( w ) ) , 1 }
The following theorems claim that the incoming object is verified with few queries in posting lists. For any incoming object, the blocks being verified can be located according to block keyword interval [ b r . m i n w , b r . m a x w ] (see Theorem 1). The object verification with a block or a posting list can be terminated if the keywords are smaller than that of some query being verified. (see Theorem 2).
Theorem 1.
Given o O and a posting list P L w i 1 w i 2 being verified with o , for b r in P L w i 1 w i 2 , if q in b r , o contains all keywords of q , only if w j o . ψ ,   w j [ b r . m i n w , b r . m a x w ] .
Proof. 
Suppose that o contains all the keywords of q , but there is no term in o . ψ satisfying w j [ b r . m i n w , b r . m a x w ] . Based on Definition 2,   q . ψ [ 3 ] [ b r . m i n w , b r . m a x w ] , o does not contain the third keyword of q , so o does not contain all the keywords of q . It is contradicted by the hypothesis, so the theorem is proved. □
Theorem 2.
Given o O and a posting list P L w i 1 w i 2 being verified with o , for b r in P L w i 1 w i 2 , we have the following conclusions. (1) If b r . m i n w > max w o . ψ , o is not the result of queries in the blocks starting from b r , and P L w i 1 w i 2 can be terminated verifying. Specifically, for q i in b r , if q i . ψ [ 3 ] > max w o . ψ , P L w i 1 w i 2 can be terminated verifying. (2) If b r . m a x w < min w o . ψ | w > w i 2 , o is not the result of queries in b r .
Proof. 
(1) If b r . m i n w = min q i b r q i . ψ [ 3 ] > max { w o . ψ } , i.e., for any q i in b r , o does not contain its third keyword, o is not the result of q i . So does the block b r + j , since b r + j . m i n w > b r . m i n w > max { w o . ψ } . Similarly, for any q i in b r , q i + j . ψ [ 3 ] > q i . ψ [ 3 ] > max { w o . ψ } , o is not the result of queries after q i .
(2) If b r . m a x w = max q i b r q i . ψ [ 3 ] < min w o . ψ | w > w i 2 , o is not the result of the queries in b r . □

3.3. Adaptive Query Insertion Algorithm

Given any posting list at a node, we consider two extreme situations: (1) the posting list only contains a block which contains all queries; (2) the posting list contains many blocks, each of which only contain one query. The former has poor filtering ability and the latter has high update cost. Neither is what we expect. In the real world, people are concerned with different interests and often pay high attention to the breaking news or topical issues, so the keywords of the queries and objects vary over time. For each query, we adaptively insert it into the posting lists according to the historical queries and objects. We expect that the increase of the verification cost and update cost of the posting list is minimal after the query being inserted.
Given a posting list P L w i 1 w i 2 , the update cost is denoted as C U P L ( P L w i 1 w i 2 ) , and the verification cost is denoted as C V P L ( P L w i 1 w i 2 ) , which can be estimated by Equation (3), where C V B ( b r ) = p V B ( b r ) ( l o g B + b r ) represents the verification cost of the block b r in P L w i 1 w i 2 .
C V P L ( P L w i 1 w i 2 ) = b r C V B b r
Theorem 3.
Let q be the query to be inserted into P L w i 1 w i 2 , q . ψ [ 1 : 3 ] = { w i 1 , w i 2 , w j } , P L w i 1 w i 2 has B ( B 1 ) blocks, Δ C V P L is the increase of verification cost of P L w i 1 w i 2 after q being inserted. We have the following conclusions.
Case 1: If   b r P L w i 1 w i 2 satisfies w j b r . m i n w , b r . m a x w , q is inserted into b r . Δ C V P L = p V B ( b r ) .
Case 2: If  b P L w i 1 w i 2  satisfies  w j [ b . m i n w , b . m a x w ] , but b r P L w i 1 w i 2  satisfies  b r . m a x w < w j , q is inserted into the tail of b r . The updated block is denoted as b r . Δ C V P L = ( p V B ( b r ) p V B ( b r ) ) ( l o g B + b r ) + p V B ( b r ) .
Case 3: Similar to case 2, if  b P L w i 1 w i 2  satisfies  w j [ b . m i n w , b . m a x w ] , but b r P L w i 1 w i 2  satisfies  w j < b r . m i n w , q is inserted into the head of the b r . Δ C V P L = ( p V B ( b r ) p V B ( b r ) ) ( l o g B + b r ) + p V B ( b r ) .
Case 4: If  b P L w i 1 w i 2  satisfies  w j [ b . m i n w , b . m a x w ] , a new block b is constructed in P L w i 1 w i 2 , and q is inserted into b . Δ C V P L = log ( ( B + 1 ) / B ) b r p V B ( b r ) + p V w ( w j ) log ( B + 1 ) .
Proof. 
(1) Case 1: If q is inserted into b r , the verifying probability p B ( b r ) does not change since w j b r . m i n w , b r . m a x w , but the number of queries in b r increases by 1.
Δ C V P L = p V B b r l o g B + b r + 1 p V B b r l o g B + b r = p V B b r
(2) Case 2:
Δ C V P L = C V B b r C V B b r = p V B b r l o g B + b r p V B b r l o g B + b r = p V B b r p V B b r l o g B + b r + p V B b r
(3) Similar to case 2.
(4) Case 4: For any b i in P L w i 1 w i 2 , C V B b i = p V B b i log B + b i , C V B b i = p V B b i log B + 1 + b i
Δ C V B = C V B b r C V B b r = p V B b r log B + 1 + b r log B + b r = p V B b r log B + 1 / B
Δ C V P L = b r ( C V B b r C V B b r ) + C V B b = b r p V B b r log B + 1 / B + C V B b = log B + 1 / B b r p V B b r + p V w w j log B + 1
Theorem 3 shows all the cases where the verification costs increase if a query is inserted into the posting list. The increase of update costs Δ C U P L corresponding to the above four cases are O ( b r + 1 ) , O ( 1 ) , O ( 1 ) , and O ( 1 + l o g B ) respectively. To compare the verification costs and update costs, we introduce a normalization parameter θ U ( 0 < θ U 1 ) to represent the ratio of the update operation to the verification operation, i.e., if a query is inserted into a node, there will be 1 / θ U objects being verified with it. A query is adaptively inserted into the posting list according to the following theorem. □
Theorem 4.
Let q  be the query to be inserted into P L w i 1 w i 2 , q . ψ [ 1 : 3 ] = { w i 1 , w i 2 , w j } . If b r in P L w i 1 w i 2  satisfies  w j [ b r . m i n w , b r . m a x w ] , q is inserted into b r . Otherwise, the minimum Δ C V P L + θ U Δ C U P L of the cases 2–4 is taken.
 Algorithm 1: I n s e r t Q u e r y P L ( q , P L q . ψ [ 1 ] q . ψ [ 2 ] )
1 if P L q . ψ [ 1 ] q . ψ [ 2 ] then
2   construct new block b in P L q . ψ [ 1 ] q . ψ [ 2 ] insert q into b; return;
b r min { b | b . m i n w q . ψ [ 3 ] } ;
4  if q . ψ [ 3 ] = = b r . m i n w then q is inserted into br; return;
5  if r > 1 and q . ψ [ 3 ] b r 1 . maxw  then
6   q is inserted into br−1; return;
7  if b r then compute min { Δ C V P L + θ U Δ C U P L } c a s e 2 , c a s e 4 on b B ;
8  if r == 1 then compute min { Δ C V P L + θ U Δ C U P L } case 3 , c a s e 4 ;
9  if r > 1 then compute min { Δ C V P L + θ U Δ C U P L } case2,case3,case 4 ;
10  if case 2 then q is inserted into the tail of br−1;
11  if case 3 then q is inserted into the head of br;
12  if case 4 inserted into a new block.
Given q Q , if q contains no more than two keywords, it is directly inserted into the b 0 of the corresponding posting list. Algorithm 1 shows how a query q containing more than two keywords is adaptively inserted into a posting list. If P L q . ψ [ 1 ] q . ψ [ 2 ] does not exist, a new block b is constructed, and q is inserted into b (lines 1–2). Otherwise, a block in P L q . ψ [ 1 ] q . ψ [ 2 ] is found for q to minimize Δ C V P L + θ U Δ C U P L (lines 3–12). First, we find the block, denoted as b r , whose b r . m i n w is the smallest—no smaller than q . ψ [ 3 ] (line 3). If q . ψ [ 3 ] = = b r . m i n w , q is inserted into b r (line 4). If q . ψ [ 3 ] [ b r 1 . m i n w , b r 1 . m a x w ] ( r > 1 ), q is inserted into block b r 1 (lines 5–6). Otherwise, we compute Δ C V P L + θ U Δ C U P L according to cases 2–4 in Theorem 3 and select the minimum case (lines 7–12). It is worth noting that when compared with the first block of the list, there are only cases 3–4, and if q . ψ [ 3 ] is larger than b r . m i n w of all the blocks, there are only cases 2 and 4.
Computation complexity. In the worst case, the computation cost of Algorithm 1 is shown as Lemma 1. That is, in posting lists constructed by two keywords, the complexity of inserting a query at a node is O ( o . ψ 2 log V 2 + o . ψ 3 ( log B + b / B b . ψ ) ) . The algorithm can adaptively adjust B and b .

3.4. The Memory-Based Cost Model

A memory-based cost model associates queries with the optimal quadtree nodes. Given the search range of CkQST, the model traversals the quadtree from the root node, compares the sum of the verification cost and index update cost if the query is associated with the current node and its child nodes, and selects the smaller one. The verification cost is the product of the number of verified objects and the expectation of the verification cost, and the update cost is the expectation of the update cost if the query is inserted into the corresponding block of the posting list.
Definition 3 (Minimum Bounding Node).
Given  q Q and search range  q . S R k , if  N , q . S R k N . R , and for any child node  n i  of  N ,   q . S R k n i . R N  is the minimum bounding node of  q  , where N . R , n . R  are the region where  N  and  n  locate respectively. The minimum bounding node of  q is the node, which covers its search range, but any of its child nodes cannot completely cover the search range.
Verification cost. Given q Q and its minimum bounding node N , q . ψ [ 1 : 3 ] = { w i 1 , w i 2 , w i 3 } , if q is associated with N , the verification cost within unit time interval, denoted as C V q q , N , can be estimated by Equation (4). We assume that the query and object contain more than two keywords, and the average verification cost is unit time.
C V q ( q , N ) = N u m o N ( N ) p V q ( q | N ) E V ( q | N )
Specifically, if the query or the object contains one or two keywords, the verification cost is estimated by Equation (5). This case is simple, so we omit the details.
C V q ( q , N ) = N u m o N ( N ) p V B ( b 0 )
where N u m o N ( N ) is the number of objects falling in N within the unit time interval. p V q q | N is the probability that q is verified if it is inserted into b r in N , i.e., the probability that the objects contain the terms w i 1 ,   w i 2 , and w j [ w i 3 , b r . m a x w ] , and can be estimated by Equation (6).
p V q q | N = p V B b r b r . m a x w w i 3 + 1 b r . ψ
where b r . ψ is the number of keywords contained in b r . ψ . E V q | N is the verification cost if q is inserted into b r in N , and can be estimated with the expectation of verification cost of the queries in b r , i.e., Equation (7).
E V q | N = w j p V w w j N u m q w j
where ( N u m q ) w j is the number of queries subjected to q i . ψ [ 3 ] w j in b r . Similarly, if the query q is associated with a set of non-overlapping nodes, denoted as N S , the verification cost is denoted as C V q q , N S and can be estimated by Equation (8).
C V q q , N S = n i N S C V q q , n i
For q Q , we find the optimal associated nodes starting from its minimum bounding node, and check whether the query is associated with the current node or associated with its child nodes. The difference of two verification costs is estimated by Equation (9).
Δ C V q = n i I N S n C V q q , n i n i I N S n . c h i l d C V q q , n i
where I N S keeps the intermediate result, n . c h i l d contains the child nodes of n that intersect with the search range of the query. It is worth noting that if Δ C V q 0 , we terminate the iteration.
Update cost. When inserting or deleting queries in nodes, it will incur an index update cost. We delay deleting queries until these queries are accessed again, so the deletion cost is ignored. If a query q is associated with its minimum bounding node N , and is inserted into a block b r of posting list P L q . ψ [ 1 ] q . ψ [ 2 ] in N , the insertion cost consists of two parts, the time to find the corresponding block and the time to find the insertion position. The update cost, denoted as C U q q , N , can be estimated by Equation (10).
C U q q , N = l o g B + 1 b r i = 1 b r i = l o g B + b r + 1 2
If q is associated with a set of non-overlapping nodes, denoted as N S , the update cost is denoted as C U q q , N S and can be estimated by Equation (11).
C U q q , N S = n i N S C U q q , n i
Similarly to Δ C V q , the difference of two update costs between the query being associated with the node and associated with the child nodes is estimated by Equation (12).
Δ C U q = n j I N S n . c h i l d C U q q , n j n j I N S n C U q q , n j
Given q Q and search range q . S R k , we start from the minimum bounding node, and computes Δ C V q and Δ C U q between the query being associated with the node and associated with the child nodes. If Δ C V q θ U Δ C U q , the child nodes are the optimal. Otherwise the node is optimal. The computation cost consists of two parts, finding the minimum bounding node of q , and finding an optimal association in the descendant nodes of minimum bounding node. The computation cost of the first part is O ( θ h ) , and the second part is O ( ( 4 θ h 1 ) / 3 ) , i.e., in the worst case, the node will be partitioned until the leaf node, where θ h is the height of the quadtree.

3.5. Processing Objects in Batches

For these objects being verified with the same posting list, an object processing algorithm, which is a group matching technique that follows the filtering and verification strategy, is proposed to process objects in batches.
A data structure b i d , w , o s e t is defined to group the objects being verified with the same posting list, where b i d ( b i d > 0 ) is the block id, w ( w [ b b i d . m i n w , b b i d . m a x w ] ) is a term, o s e t is a set of objects being verified with block b b i d and containing w . For the convenience of the description, w s e t b i d is a set of terms satisfying w [ b b i d . m i n w , b b i d . m a x w ] , o s e t b i d , w is a set of objects which are verified with the queries in block b b i d and contain w .
Algorithm 2 describes how to process a set of objects represented by b i d , w , o s e t , which are verified with the queries in the posting list P L w i 1 w i 2 . If b b i d is b 0 , the queries in b 0 is verified with the objects in o s e t (lines 1–5). Otherwise, for any term w j in w s e t b i d , according to Theorem 2, if b b i d . m i n w > w j , we check the next term (line 7); if b b i d . m a x w < w j , we check the next block (line 8); otherwise, for each query q in b b i d , if q . ψ [ 3 ] > w j , we check the next term (line 10); if q . ψ [ q . ψ ] < w j , we check the next query (line 11); otherwise, we verify whether the object is the results of q . If yes, q , o is added to Q O S (lines 12–13). Moreover, the result list and search range of q are updated.
Computation complexity. Algorithm 2 describes how to process objects in batches in an ordered posting list. In the worst case, objects are processed individually. As Lemma 1 shows, in posting lists constructed by two keywords, for an object, the time complexity of finding the qualified queries at a node is O ( o . ψ 2 log V 2 + o . ψ 3 ( log B + b / B b . ψ ) ) .
 Algorithm 2:   ObjectProcessing ( P L w i 1 w i 2 , { b i d , w , o s e t } ) .
Ijgi 09 00694 i001

3.6. Cost-Based k-Skyband Technique

To reduce the re-evaluation cost, a cost-based k-skyband technique is proposed to judiciously determine an optimal search range for CkQST such that the overall cost defined in the cost model can be minimized. Specifically, for q Q , three parameters are defined: an extended search range is denoted as q . S R , where q . S R q . S R k ; a k-skyband, i.e., an extended result list, denoted as q ( O ) , where q ( O ) q ( O ) k ; the number of objects containing all query keywords within q . S R in the initial timestamp is denoted as q . θ k , where q . θ k q . k .
Definition 4 (Loose Matching).
Given  o O  and  q Q , o  loosely matches  q  only if  q . ψ o . ψ  and  o . l o c q . S R . All the objects that loosely match  q  aredenoted as  q ( O ) sup = { o O | q . ψ o . ψ o . l o c q . S R } .
Definition 5 (Dominance).
Given  q Q and two objects  o 1 , o 2 , which loosely match  q , o 1  dominates  o 2  only if  d i s t ( q , o 1 ) < d i s t ( q , o 2 )  and o 1 . t e o 2 . t e  or  d i s t ( q , o 1 ) d i s t ( q , o 2 )  and  o 1 . t e > o 2 . t e .
Definition 6 (k-skyband/Extended Result list).
Given  q Q , for any incoming object  o , we insert it into the k-skyband  q ( O ) only if: (1)  o loosely matches  q ; (2)  o  is dominated by less than  k  other objects. For q Q , ifan object  o  loosely matches  q , and there are  k  objects in  q ( O )  dominating o , o would not be a result at any timestamp. Therefore, we would not insert these objects into the result list.
Theorem 5.
Given q Q and an extended search range q . S R , we always have the following conclusions. (1) q ( O ) k q ( O ) ; (2) q ( O ) q ( O ) sup ; (3) q ( O ) < k iff q ( O ) sup < k .
Proof. 
(1) At any timestamp, for o q ( O ) k , we have q . ψ o . ψ and o . l o c q . S R k q . S R , i.e., o loosely matches q , and less than k objects dominate o . So o q ( O ) . (2) According to Definition 4, for o q ( O ) , o loosely matches q , so o q ( O ) sup , q ( O ) q ( O ) sup . (3) Since q ( O ) q ( O ) sup , if q ( O ) sup < k , we have q ( O ) q ( O ) sup < k . On the other hand, if q ( O ) < k , q ( O ) sup k , i.e., o q ( O ) sup and o q ( O ) , which means o loosely matches q , but o is dominated by more than k other objects, which is contradicted by q ( O ) < k . The theorem is proved. □
According to Theorem 5, the extended result list is the super set of the exact result list, from which we can extract the kNN objects, and the number of objects in extended result list is less than k only if the number of objects in q ( O ) sup is less than k .
Given q Q , an extended search range q . S R , and the corresponding extended result list q ( O ) , three costs are defined in the cost-based k-skyband technique: the verification cost of q within q . S R , the update cost of q ( O ) , and the re-evaluation cost.
Verification cost. The verification cost of q within q . S R , within the unit time interval, denoted as C V R q | q . S R , is estimated by Equation (13), i.e., the verification cost if q is inserted into all the leaf nodes that intersect with q . S R .
C V R q | q . S R = n . R q . S R C V q q , n
Update cost. Similarly to q ( O ) k , the extended result list q ( O ) is organized by a linked list. For o q ( O ) , a dominance counter is defined to count the number of objects dominating o . If an incoming object o is inserted into q ( O ) , the dominance counters of all the objects in q ( O ) with d i s t ( q , o ) < d i s t ( q , o ) and o . t e o . t e , or d i s t ( q , o ) d i s t ( q , o ) and o . t e > o . t e will increase by 1, and the objects with dominance counter equal to k will be evicted, which can be processed in O ( q ( O ) ) time. When an object in q ( O ) expires, it is deleted from q ( O ) until it is accessed again. The cost is negligible. The update cost of q ( O ) within unit time interval, denoted as C U L ( q | q ( O ) ) , is estimated by Equation (14). Where f r e q U o is the number of object updates within unit time interval, p M q ( q . S R ) is the probability that the objects loosely match q within the search range q . S R , and is estimated by p M q ( q . S R ) = q . θ k / N u m o N ( n 0 ) , 1 / 2 is the probability that a qualified object arrival, q ( O ) can be estimated by q ( O ) = max { k ln ( θ k / k ) , θ k } .
C U L ( q | q ( O ) ) = f r e q U o 1 / 2 p M q ( q . S R ) q ( O )
Re-evaluation cost. The re-evaluation cost within the unit time interval is denoted as 1 / θ t C I e q . Where θ t is the re-evaluation period, i.e., the shortest time required between two consecutive independent evaluations, and 1 / θ t is the frequency of re-evaluation. C I e q is the re-evaluation cost, and is approximated to the verification cost in q . S R k , i.e., C I e q = C V R q | q . S R k = n . R q . S R k C V q q , n .
The overall cost in the cost-based k-skyband technique, denoted as C R e q , is shown in Equation (15). When C R e q is minimal, the search range is optimal.
C R e q = C V R q | q . S R + C U L q | q O + 1 / θ t C I e q
In the following, we discuss how to get θ t . θ t is the re-evaluation period, i.e., the shortest time that q ( O ) is reduced to k 1 since the last re-evaluation. For q Q , the update process of number of objects in q ( O ) can be modeled as a simple random walk, which is a stochastic sequence S l , with S 0 being the original status, defined by S l = i = 1 l X i , where X i is the object update, which is an independent and identically distributed random variable. In q ( O ) , if an object is inserted, X i = 1 ; if an object expires or is dominated, X i = 1 ; otherwise X i = 0 . It’s difficult to estimate X i due to the eviction of objects by the dominance relationship in q ( O ) . For example, an object is inserted, but the number of objects decreases due to the eviction of objects with dominance counters reaching k . According to Theorem 5, the number of objects in q ( O ) is less than k only if the number of objects in q ( O ) sup is less than k , and the objects in q ( O ) sup don’t dominate each other. Therefore we estimate the shortest time that q ( O ) sup is reduced to k 1 , denoted as θ t , where θ t = θ t . The object update in q ( O ) sup at any timestamp can be estimated as Equation (16).
p r o b ( X i ) = 1 / 2 p M q ( q . S R ) i f   X i = 1 1 / 2 p M q ( q . S R ) i f   X i = 1 1 p M q ( q . S R ) i f   X i = 0
The number of object updates required to reduce the number of objects from q . θ k to k 1 in q ( O ) , denoted as ( q ) , is estimated by Equation (17). θ t is estimated by θ t = ( q ) / f r e q U o .
( q ) = 2 ( q . θ k k + 1 ) q . θ k p M q ( q . S R ) + ( q . θ k k + 1 ) ( q . θ k k + 2 ) p M q ( q . S R )
For q Q , the variables in Equation (15) are q . S R and q . θ k . To minimize Equation (15), we employ the incremental estimation algorithm to compute the optimal q . θ k and the corresponding q . S R .
To accommodate our extended search range with the objects processing algorithm and index construction and maintenance algorithm, we replace q ( O ) k with q ( O ) and replace q . S R k with q . S R .

4. Experiments

In this section, we conduct a set of comprehensive experiments to evaluate the efficiency and scalability of the key techniques. Section 4.1 introduces the experimental environment. Section 4.2 evaluates the effect of three tuning parameters and the re-evaluation technique. Section 4.3 evaluates the efficiency and scalability of our index techniques.

4.1. Experimental Settings

All experiments are implemented in VC++, and run on a Win10 machine with an Intel I7-8700K 3.7 GHz CPU and 32 GB memory. In accordance with previous works (e.g., [2,3,4,5,6,7,8,9,10,11,12]), we load the query indexes into main memory to support real-time response.
Datasets. Three datasets are collected for experimental evaluations. The statistics are shown in Table 2. TWEETS contains twitters collected from Twitter [8]. TWEETS is the default dataset. GN is obtained from the US Board on Geographic Names, in which each record contains a geo-location and some terms (http://geonames.usgs.gov/). GOWALLA is a synthetic dataset, in which each record contains a geo-location collected from the Gowalla (https://snap.stanford.edu/data/loc-gowalla.html), and less than 50 terms randomly assigned from 20 Newsgroups (http://people.csail.mit.edu/jrennie/20Newsgroups). Based on the datasets, we generate queries and objects.
Query Workload. For each sample dataset, we take the geo-location as the geo-location of the query and randomly select j terms of the sample data as the query keywords, where 1 j 5 . The number of returned kNN results k is set to a default value. At any timestamp, the expired queries are randomly selected.
Object Workload. For each sample dataset, we take all terms as the object keywords, and take the geo-locations deviating from the original geo-location by 0.01% to 1% of the maximum distance in the region. At any timestamp, the expired objects are randomly selected.
Set of Queries and Objects. For each dataset, unless otherwise specified, we select 5 M objects and queries to construct the query index and object index initially, and generate three test sets, each of which contains 2 M objects and 2 M queries. The evaluation criteria take the average performance of three test sets.
Baseline. We compare our index techniques with IQ-tree [1], Ap-tree [3] and FAST [6]. By default, for Ap-tree, the fanout, partition threshold, and KL-Divergence threshold are set to 200, 40, and 0.001. We use the number of verifications to replace the number of I/O in the cost model of IQ-tree. In the following sections, we use AOIQ-tree to represent the index integrated the quadtree with the ordered, inverted index, and three key techniques. We compare the cost-based k-skyband technique with the Kmax [16] when they are integrated in the AOIQ-tree.
Evaluation criteria. We report four criteria: (1) the index construction time (i.e., ICT), i.e., the time of inserting queries into index after finding their search range; (2) the average incoming object processing time (i.e., AOPT), i.e., the time of finding the affected queries and modifying their corresponding parameters when an object arrives; (3) the average index updating time caused by objects (i.e., AIUT), i.e., the time of updating query index after processing objects; (4) index size, i.e., the memory used for constructing the query index. By default, the number of keywords for constructing ordered, inverted index m , the number of kNN results returned for CkQST k , the height of the quadtree θ h , the ratio of the update operation to the verification operation θ U , and the number of object updates within unit time interval f r e q U o are set to 2, 20, 10, 0.001, and 20,000.

4.2. Experimental Tuning

In this section, a series of experiments are conducted to evaluate the effect of parameters in techniques on the AOIQ-tree.
Effect of m . Figure 4 shows the evaluation criteria of the AOIQ-tree when m takes 1, 2, 3, 4, and 5. According to Lemma 1, if m is small, the number of queries in posting lists is large, so it takes a long time to verify queries in posting lists; contrarily, if m is large, it takes a long time to find the posting lists to be verified. Therefore, the optimal m is neither too small nor too large. As shown in Figure 4, when m takes 2, the performance of the index is the best. The larger m is, the larger the index size. When m > 3 , the verification cost and update cost in a single posting list decrease, so the cost model maps queries to nodes with large regions, so ICT, AIUT and index size decrease, while AOPT increases.
Effect of θ U . Figure 5 shows the evaluation criteria of the AOIQ-tree when θ U takes 0.0001, 0.001, 0.01, and 0.1. When θ U takes 0.0001, the verification cost plays a major role in finding the associated nodes, therefore, queries are associated with many small nodes, so ICT is long, AOPT and AIUT are short, and index size is large. When θ U increases, the index update cost is more important, so queries are associated with fewer larger nodes, so ICT and index size decrease, while AOPT and AIUT increase.
Effect of k . Figure 6 shows the evaluation criteria of the AOIQ-tree when k takes 10, 20, 30, 40, and 50. As k is small, the number of returned objects for queries is few, the search range of queries is small, and queries are associated with many small nodes, so the ICT is long, AOPT and AIUT are short, and index size is large. On the contrary, when k is large, the search range of the queries becomes larger, and queries are associated with few larger nodes, the ICT and index size decrease, and the AOPT and AIUT increase.
Effect of re-evaluation techniques. We evaluate the re-evaluation performance in three cases, denoted as AOIQ-tree, AOIQ-tree_Kmax, and AOIQ-tree_Skyband. AOIQ-tree only keeps k objects in the result list, AOIQ-tree_Kmax keeps k max = 2 k objects in the result list, and the number of objects in the result list for AOIQ-tree_Skyband is calculated according to the cost-based k-skyband technique. Figure 7 shows the ICT, AOPT, and EOPT in three cases with varied k , where EOPT is the average processing time for expired objects, i.e., the average time of modifying the parameters of the affected queries or re-evaluating the queries if the number of objects in their result list is less than k when an object expires. The number of objects maintained in AOIQ-tree_Kmax is more than AOIQ-tree_Skyband, which is more than AOIQ-tree. The ICT of AOIQ-tree_Kmax is shortest, and that of AOIQ-tree is longest. The AOPT of AOIQ-tree and AOIQ-tree_Skyband are shorter than that of AOIQ-tree_Kmax. The EOPT of AOIQ-tree_Kmax and AOIQ-tree_Skyband are shorter than that of AOIQ-tree. This phenomenon is related to the numbers of objects maintained in their result list. Compared with AOIQ-tree, the EOPT of the other two techniques are much less, and if k takes 10, 20, the average update time is close to 0.

4.3. Performance Evaluation

Evaluation on different datasets. We evaluate the efficiency of the key techniques against three datasets in Figure 8. The number of queries is 5 M. As shown in Figure 8, besides the index size, AOIQ-tree is always the best one. IQ-tree has a good spatial filtering performance, but its textual filtering ability is weak; AP-tree comprehensively considers the spatial and textual distribution of queries, but the index construction and update cost are expensive; FAST has a good textual filtering performance, but its spatial filtering ability is weak. The memory-based cost model in AOIQ-tree can minimize the verification cost and update cost, which makes the number of queries in spatial nodes neither too many nor too few; the ordered, inverted index in AOIQ-tree is constructed by two keywords, which makes the verification cost close to the ordered keyword trie, and the update cost close to the ranked-key inverted index. AOIQ-tree takes up the most memory, which is determined by the structure of its posting lists. The evaluation criteria of GN are the smallest, and Gowalla are the largest, which is because the data in Gowalla contain far more keywords than the other two datasets. For Gowalla, the AOPT of AP-tree is shorter than that of FAST. This is because a large number of queries in Gowalla only contain frequent keywords, compared with FAST, AP-tree comprehensively considers the spatial and textual distribution of queries, i.e., index filtering is more powerful, so AOPT is shorter.
Effect of number of queries. To evaluate the scalability of the key techniques on the number of queries and objects, we increase the number of queries and objects from 1 M to 20 M to construct the object index and query index. As Figure 9 shows, the ICT, AOPT, and AIUT of all indexes increase as the number of queries increases, and AOIQ-tree is much more scalable. For instance, it only takes 0.23 ms on average to process the incoming objects when the number of queries reaches 20 M, which is 54% faster than Ap-tree and 36% faster than FAST. This shows that our techniques have good scalability. The AIUT of AOIQ-tree is the shortest, and Ap-tree is the longest. That’s because AOIQ-tree associates queries with optimal nodes to adapt to the objects on data streams, and some of the Ap-tree nodes are re-constructed if many queries updates in these nodes. Compared with AP-tree, only some queries update in FAST nodes, so AOIQ-tree’s AIUT is shorter.
Effect of number of query keywords q . ψ . To evaluate the scalability of key techniques on q . ψ , we increase q . ψ from 1 to 5. As shown in Figure 10, the evaluation criteria of IQ-tree are insensitive to the number of keywords since it focuses on the spatial distribution of the queries. Ap-tree, FAST, and our index consider the keyword distribution of the queries, so the evaluation criteria vary with q . ψ . As q . ψ increases, the ICT and index size increase, and the AOPT decreases. That is because Ap-tree continuously calculates how to partition queries into nodes according to query keywords, and increases the number of textual nodes and height of the tree. For FAST, as q . ψ increases, more keywords being attached to posting lists become frequent, and queries are more likely to be inserted into the multiple higher-level nodes. For AOIQ-tree, the time of sorting the queries increases when the ordered, inverted index is constructed.

5. Conclusions and Future Research Perspectives

The challenging for evaluating CkQST is how to strike the balance between the filtering ability and the update cost of the spatial–textual index. To address the challenging, we use quadtree and inverted index to organize millions of CkQST with three techniques. A memory-based cost model maps the search range of CkQST to the quadtree nodes to balance the spatial filtering ability of the indexes and the cost for updating the indexes. The balance can be further tuned by the cost-based k-skyband technique, which judiciously determines the search range for CkQST according to the workload of objects. An adaptive block-based ordered, inverted index enhances the textual filtering ability. The experimental results on the real-world and synthetic datasets show that the proposed techniques are effective and scalable, and can significantly improve the evaluation efficiency of CkQST. The future work for evaluating the continuous query over spatial–textual data streams includes solving the challenges of continuous queries in mobile and other relevant scenarios, and exploring efficient evaluation techniques using hardware technologies such as Graphics Processing Unit and distributed clusters and trade-off strategy for precision and evaluation efficiency.

Author Contributions

Rong Yang proposed the methods, implemented the algorithms for the experiments, and wrote the manuscript; Baoning Niu provided suggestions for the methods and experiments, reviewed and modified the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62072326), National Key Research and Development Plan of Shanxi Provence (Grant No. 201903D421007).

Acknowledgments

The authors would like to thank three anonymous reviewers for their valuable suggestions and comments.

Conflicts of Interest

The authors declare that they have no conflict of interest.

Appendix A

In the proof of Lemma 1, without loss of generality, we suppose that all the terms in o . ψ are contained in different blocks and divide the object processing at quadtree nodes into three steps: finding the posting lists to be verified, finding the blocks to be verified in all posting lists, and finding the queries to be verified in all blocks.
Proof of Lemma 1.
If o . ψ = 1 , let o . ψ = { w i 1 } . The object is verified with the queries in the posting list determined by { w i 1 } , the verification cost is O ( l o g V m + Q ) , where Q contains these queries whose query keyword are { w i 1 } . If o . ψ = 2 , let o . ψ = { w i 1 , w i 2 } , the object is verified with the queries in the posting list determined by { w i 1 } , { w i 2 } , and { w i 1 , w i 2 } , the verification cost is O ( 3 l o g V m + Q ) , where Q contains these queries whose query keywords are { w i 1 } , { w i 2 } , or { w i 1 , w i 2 } . Specifically, if m = 1 , the object is verified with the posting lists determined by { w i 1 } and { w i 2 } . For the posting list determined by { w i 1 } , the object is verified with two blocks. One block contains queries whose keywords are { w i 1 } , and the other contains queries whose keywords may contain { w i 1 , w i 2 } . The verification cost is O ( 2 l o g V + log B + b + Q ) . If o . ψ 3 , the posting lists that the object is verified with can be divided into three categories. First, the posting lists are determined by less than m terms. The verification cost is i = 1 m 1 i o . ψ l o g V m + Q ; second, the posting lists are determined by m terms and one of the term is o . ψ . The verification cost is m 1 o . ψ 1 log V m + Q . Third, the posting lists are determined by m terms and the terms do not contain o . ψ . The verification cost is O o . ψ m log V m + o . ψ log B + b B m 1 b . ψ m 1 + Q , where Q contains these queries that contain less than or equal to m keywords and these keywords are contained in o . The Lemma is proved. □

References

  1. Chen, L.S.; Cong, G.; Cao, X. An efficient query indexing mechanism for filtering geo-textual data. In Proceedings of the 32nd ACM SIGMOD International Conference on Management of Data (SIGMOD’13), New York, NY, USA, 22–27 June 2013; ACM Press: New York, NY, USA, 2013; pp. 749–760. [Google Scholar]
  2. Li, G.L.; Wang, Y.; Wang, T.; Feng, J.H. Location-aware publish/subscribe. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (SIGKDD’13), Chicago, IL, USA, 11–14 August 2013; pp. 802–810. [Google Scholar]
  3. Wang, X.; Zhang, Y.; Zhang, W.J.; Lin, X.M.; Wang, W. AP-Tree: Efficiently support location-aware publish/subscribe. VLDB J. 2015, 24, 823–848. [Google Scholar] [CrossRef]
  4. Deng, Z.; Wang, M.; Wang, L.Z.; Huang, X.H.; Han, W.; Chu, J.D.; Zomaya, A.Y. An efficient indexing approach for continuous spatial approximate keyword queries over geo-textual streaming data. Int. J. Geo-Inf. 2019, 8, 57. [Google Scholar] [CrossRef] [Green Version]
  5. Guo, L.; Zhang, D.X.; Li, G.L.; Tan, K.-L.; Bao, Z.F. Location-aware pub/sub system: When continuous moving queries meet dynamic event streams. In Proceedings of the 34th ACM SIGMOD International Conference on Management of Data (SIGMOD’15), Melbourne, Australia, 31 May–4 June 2015; ACM Press: New York, NY, USA, 2015; pp. 843–857. [Google Scholar]
  6. Mahmood, A.R.; Aly, A.M.; Aref, W.G. FAST: Frequency-Aware Indexing for Spatio-Textual Data Streams. In Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE’18), Paris, France, 16–19 April 2018; IEEE Press: Piscataway, NJ, USA, 2018; pp. 305–316. [Google Scholar]
  7. Hu, H.; Liu, Y.; Li, G.; Feng, J.; Tan, K.L. A location-aware publish/subscribe framework for parameterized spatio-textual subscriptions. In Proceedings of the 31st IEEE International Conference on Data Engineering (ICDE’15), Seoul, Korea, 13–17 April 2015; IEEE Press: Piscataway, NJ, USA, 2015; pp. 711–722. [Google Scholar]
  8. Chen, L.; Cong, G.; Cao, X.; Tan, K.L. Temporal spatial-keyword top-k publish/subscribe. In Proceedings of the 31st IEEE International Conference on Data Engineering (ICDE’15), Seoul, Korea, 13–17 April 2015; IEEE Press: Piscataway, NJ, USA, 2015; pp. 255–266. [Google Scholar]
  9. Chen, L.S.; Shang, S. Approximate spatio-temporal top-k publish/subscribe. World Wide Web 2019, 22, 2153–2175. [Google Scholar] [CrossRef] [Green Version]
  10. Wang, X.; Zhang, W.J.; Zhang, Y.; Lin, X.M.; Huang, Z.F. Top-k spatial-keyword publish/subscribe over sliding window. VLDB J. 2017, 26, 301–326. [Google Scholar] [CrossRef] [Green Version]
  11. Chen, Z.D.; Cong, G.; Zhang, Z.J.; Fu, T.Z.J.; Chen, L.S. Distributed Publish/Subscribe Query Processing on the Spatio-Textual Data Stream. In Proceedings of the 33rd IEEE International Conference on Data Engineering (ICDE’17), San Diego, CA, USA, 19–22 April 2017; IEEE Press: Piscataway, NJ, USA, 2017; pp. 1095–1106. [Google Scholar]
  12. Mahmood, A.; Daghistani, A.; Aly, A.M.; Tang, M.J. Adaptive processing of spatial-keyword data over a distributed streaming cluster. In Proceedings of the 21st ACM International Conference on Advances in Geographic Information Systems (SIGSPATIAL’18), Seattle, WA, USA, 6–9 November 2018; ACM Press: New York, NY, USA, 2018; pp. 219–228. [Google Scholar]
  13. Böhm, C.; Ooi, B.C.; Plant, C.; Yan, Y. Efficiently processing continuous k-NN queries on data streams. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE’07), Istanbul, Turkey, 15–20 April 2007; IEEE Press: Piscataway, NJ, USA, 2007; pp. 156–165. [Google Scholar]
  14. Xiong, X.P.; Mokbel, M.F.; Aref, W.G. SEA-CNN: Scalable processing of continuous k-nn Queries in spatio-temporal databases. In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; IEEE Press: Piscataway, NJ, USA, 2005; pp. 643–654. [Google Scholar]
  15. Yu, X.H.; Pu, K.Q.; Koudas, N. Monitoring k-nearest neighbor queries over moving objects. In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE’05), Tokyo, Japan, 5–8 April 2005; IEEE Press: Piscataway, NJ, USA, 2005; pp. 631–642. [Google Scholar]
  16. Yi, K.; Yu, H.; Yang, J.; Xia, G.; Chen, Y. Efficient maintenance of materialized top-k views. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE’03), Bangalore, India, 5–8 March 2003; IEEE Press: Piscataway, NJ, USA, 2003; pp. 189–200. [Google Scholar]
  17. Mouratidis, K.; Bakiras, S.; Papadias, D. Continuous monitoring of top-k queries over sliding windows. In Proceedings of the 25th ACM SIGMOD International Conference on Management of Data (SIGMOD’06), Portland, OR, USA, 27–29 June 2006; ACM Press: New York, NY, USA, 2006; pp. 635–646. [Google Scholar]
  18. Zhang, C.Y.; Zhang, Y.; Zhang, W.J.; Lin, X.M. Inverted linear Quadtree: Efficient top k spatial keyword search. IEEE Trans. Knowl. Data Eng. 2016, 28, 1706–1721. [Google Scholar] [CrossRef]
  19. Microsoft Ignite. Available online: https://docs.microsoft.com/zh-cn/cpp/standard-library/map-class?view=vs-2019 (accessed on 10 September 2020).
Figure 1. Running example. (a) Spatial description of queries and objects at timestamp t 0 ; (b) spatial description of queries and objects at t 1 ; (c) spatial description of queries and objects at t 2 ; (d) quadtree; (e) textual and time description of queries and objects.
Figure 1. Running example. (a) Spatial description of queries and objects at timestamp t 0 ; (b) spatial description of queries and objects at t 1 ; (c) spatial description of queries and objects at t 2 ; (d) quadtree; (e) textual and time description of queries and objects.
Ijgi 09 00694 g001
Figure 2. A framework for evaluating the continuous k nearest neighbor queries over spatial–textual data streams (CkQST).
Figure 2. A framework for evaluating the continuous k nearest neighbor queries over spatial–textual data streams (CkQST).
Ijgi 09 00694 g002
Figure 3. Inverted index. (a) Inverted index; (b) ranked-key inverted index; (c) ordered, inverted index; (d) ordered, inverted index constructed by three keywords. As q 4 contains less than three keywords, we expand its keywords by duplicating the last keyword to construct the ordered index.
Figure 3. Inverted index. (a) Inverted index; (b) ranked-key inverted index; (c) ordered, inverted index; (d) ordered, inverted index constructed by three keywords. As q 4 contains less than three keywords, we expand its keywords by duplicating the last keyword to construct the ordered index.
Ijgi 09 00694 g003aIjgi 09 00694 g003b
Figure 4. Effect of the number of keywords on constructing the ordered, inverted index: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 4. Effect of the number of keywords on constructing the ordered, inverted index: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g004
Figure 5. Effect of the ratio of the update operation to the verification operation: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 5. Effect of the ratio of the update operation to the verification operation: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g005
Figure 6. Effect of the number of kNN results returned for CkQST: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 6. Effect of the number of kNN results returned for CkQST: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g006
Figure 7. Effect of the number of kNN results returned for CkQST: (a) index construction time; (b) average object processing time; (c) average expired object processing time.
Figure 7. Effect of the number of kNN results returned for CkQST: (a) index construction time; (b) average object processing time; (c) average expired object processing time.
Ijgi 09 00694 g007
Figure 8. Evaluation of different datasets: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 8. Evaluation of different datasets: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g008
Figure 9. Effect of number of queries: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 9. Effect of number of queries: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g009
Figure 10. Effect of number of query keywords: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Figure 10. Effect of number of query keywords: (a) index construction time; (b) average object processing time; (c) average update time; (d) index size.
Ijgi 09 00694 g010
Table 1. Summary of notations.
Table 1. Summary of notations.
NotationDescription
q q . l o c , q . ψ , q . k , q . t e A CkQST (the geo-location, keywords, expiration time and the number of returned objects of q )
o o . l o c , o . ψ , o . t e An object (the geo-location, keywords, and expiration time of o )
q ( O ) k , q ( O ) The result list and extended result list of q
q . S R k , q . S R The search range and extended search range of q
q . ψ [ i : j ] , q . ψ [ i : ] , q . ψ [ : i ] The subset of q . ψ
q . ψ , o . ψ The number of keywords in q . ψ and o . ψ
V , V A vocabulary set and the number of terms in V
N , n The quadtree node
P L w i 1 w i 2 The posting list of the ordered, inverted index
b , b r The block of a posting list
b r . m i n w , b r . m a x w The minimum and maximum q i . ψ [ 3 ] for any query q i in b r
b r . ψ The terms contained in [ b r . m i n w , b r . m a x w ]
b The number of queries in b
B The number of blocks in a posting list
C V P L ( P L w i 1 w i 2 ) , C U P L ( P L w i 1 w i 2 ) The verification cost and update cost of P L w i 1 w i 2
C V q q , N , C U q q , N The verification cost and update cost within unit time interval if q is associated with N
p V B ( b r ) The probability that the block b r is verified
p V q q | N The probability that q is verified if it is inserted into N
p V w ( w j ) The probability that these queries subjected to q i . ψ [ 3 ] = { w j } are verified in b r
Table 2. Datasets statistics.
Table 2. Datasets statistics.
DatasetsTWEETSGNGOWALLA
Size of dataset20 M2.29 M644.3 K
Vocabulary size1.80 M202.4 K61.2 K
Average number of keywords in objects9426
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Yang, R.; Niu, B. Continuous k Nearest Neighbor Queries over Large-Scale Spatial–Textual Data Streams. ISPRS Int. J. Geo-Inf. 2020, 9, 694. https://doi.org/10.3390/ijgi9110694

AMA Style

Yang R, Niu B. Continuous k Nearest Neighbor Queries over Large-Scale Spatial–Textual Data Streams. ISPRS International Journal of Geo-Information. 2020; 9(11):694. https://doi.org/10.3390/ijgi9110694

Chicago/Turabian Style

Yang, Rong, and Baoning Niu. 2020. "Continuous k Nearest Neighbor Queries over Large-Scale Spatial–Textual Data Streams" ISPRS International Journal of Geo-Information 9, no. 11: 694. https://doi.org/10.3390/ijgi9110694

APA Style

Yang, R., & Niu, B. (2020). Continuous k Nearest Neighbor Queries over Large-Scale Spatial–Textual Data Streams. ISPRS International Journal of Geo-Information, 9(11), 694. https://doi.org/10.3390/ijgi9110694

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop