1. Introduction
With the rapid proliferation of emerging technologies such as the Mobile Internet and the Internet of Things, numerous location-based big data services are becoming increasingly popular in various fields, including, but not limited to, population distribution statistics [1], urban planning and management [2], intelligent traffic scheduling [3], and disease epidemic control [4]. In particular, during the current critical period of the COVID-19 pandemic, a series of location-based big data applications (e.g., health codes, communication itinerary cards, and close contact self-examination procedures) have provided great convenience for epidemic prevention and control in terms of epidemiological investigations, epidemic distribution and statistical analysis, personnel and material scheduling, etc.
Location information is highly correlated with personal privacy. Improper release or reverse reasoning analysis [5,6,7] of location-based big data statistics can easily lead to the disclosure of a user's specific location, movement trajectory, living habits, health status, hobbies, economic conditions, etc., and may even endanger a user's property and life [8]. Therefore, addressing these privacy protection issues during the publishing and utilization of location-based big data statistics remains an indispensable task and a considerable bottleneck in the development of the big data industry.
The statistical partition and publishing method is widely used in location-based big data services. It can provide users with accurate and timely traffic statistics, help people plan reasonable travel times and routes, and deliver high-precision Location-Based Services (LBS). When combined with the differential privacy model, once noise disturbance has been incorporated into the statistics of location-based big data, it can further reduce the risk of users' privacy leakage while maintaining the statistical characteristics of the published data. During the COVID-19 pandemic, spatiotemporal location big data provided case statistics, distribution analysis, and trajectory estimation for pandemic prevention and control, which subsequently assisted in the social resumption of work and production.
Figure 1 is a typical example of the statistical publishing of location-based big data, which depicts the spatial distribution of COVID-19 infections in Hubei province on 31 March 2020 [9]. Figure 2 is the result of RAPPOR, an anonymous crowdsourced statistical data privacy-preserving technique based on differential privacy, wherein the true sample distribution is shown in black, and the light green depicts the estimated distribution based on the decoded Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) reports [10]. As can be observed from Figure 2, when the number of reports from users reaches one million, the statistical results reported according to RAPPOR closely trace the statistical distribution of the real data, which does not affect the availability of the published big data statistical results.
The big data statistical publishing method based on the differential privacy model protects the privacy of each user, under the premise of ensuring data availability, by adding a certain amount of random noise to the published statistical results. The partition structure of spatial location information and the introduction of differential privacy noise lead to errors between the published big data statistical results and the real statistical results, which may reduce the availability of the released data. For example, suppose a user queries the number of taxis available within 1 km of their own location and the noisy statistical result returned by the LBS system is 11, whereas in reality there are only two taxis within the query range. Such a published result is likely to mislead the user into waiting for a long time, resulting in an unsatisfactory LBS experience. Similarly, suppose that, according to the location information of confirmed COVID-19 patients in a certain area, the number of close contacts is about 12,000, but the released statistical result after incorporating differential privacy noise turns out to be 8700. This can disrupt the supply of medical aid equipment and other pandemic prevention materials, as they would not be able to cater to the needs of this region, thereby delaying the timely treatment of patients. Therefore, improving the accuracy of location-based big data statistical results, on the premise of ensuring location privacy, is an important issue for big data services.
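To make the noise error concrete, the following minimal Python sketch (not from the paper; the query range, count, and budget are hypothetical) shows how a Laplace-perturbed count can deviate noticeably from the true count when the count itself is small:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a count perturbed with Laplace noise of scale sensitivity/epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

true_taxis = 2        # hypothetical: only two taxis actually in the query range
epsilon = 0.1         # a small privacy budget gives strong privacy but large noise
published = noisy_count(true_taxis, epsilon)
print(f"true = {true_taxis}, published = {published:.1f}")
# With noise scale 1/0.1 = 10, deviations of around 10 are common, so a published
# value near 11 for a true count of 2 is entirely plausible.
```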
In order to reduce the impact of the spatial partition structure and differential privacy noise on the availability of published data, this paper proposes a bottom-up grid clustering algorithm and a corresponding differential privacy preserving data publishing algorithm, which reduce the range counting query errors and improve the availability of the released statistical results.
The rest of the paper is organized as follows: Section 2 reviews the state of the art pertinent to privacy spatial decomposition methods. Section 3 defines the differential privacy model used for the publishing of location-based big data statistical information. Section 4 analyzes the problems of traditional privacy spatial decomposition and introduces the proposed grid clustering algorithm. Section 5 details the proposed statistical publishing and differential privacy protection algorithms. Finally, Section 6 reports a set of empirical studies, whereas Section 7 concludes the paper, laying out the limitations and future work of the research.
2. Related Work
Privacy spatial decomposition is one of the important means of realizing location-based big data statistical publishing applications. The statistical release and differential perturbation of location-based information within the partitioned and indexed areas help reduce the risk of users' location privacy leakage. Traditional privacy spatial decomposition methods partition the 2D space from top to bottom and index the partitioned areas according to either location-independent grid or tree structures or some location-based hierarchical structures.
The grid-based partition and indexing structure is the most common privacy spatial decomposition method, wherein the uniform grid (UG) method and the adaptive grid (AG) method are the typical representatives [11]. The former separates the 2D space into equal-sized grids and calculates the statistical result in units of grids. The latter performs a further grid partition on the dense areas based on the UG method. In [12], the authors employ contour maps to describe the distribution of location points in spatial crowdsourcing services and partition the spatial area into disjoint units to protect workers' location privacy. The partition method proposed in [13] first obtains a preliminary density-adaptive grid partition by judging the distribution of location-based information along the longitude and latitude directions, and, subsequently, adaptive grid partition is performed on the dense grids to save unnecessary partition steps and reduce publishing errors. In [14], the authors propose a three-layer adaptive grid partition method based on Bernoulli random sampling technology, which employs the exponential mechanism and high-pass filtering technology to filter out grids smaller than the predefined threshold on the second layer of the partition structure but continues to partition the grids larger than the predefined threshold. The location privacy protection scheme proposed in [15] uses a three-layer adaptive grid structure and a full pyramid grid algorithm based on differential privacy. In [16], the authors use the statistical information of particle distribution to extract the optimal spatial decomposition so that each unit contains particles with a uniform spatial distribution. A common problem with grid-based partition methods is that it is difficult to balance the noise error with the uniform assumption error, which affects the accuracy of the published statistical results.
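As a reference point for the grid-based methods discussed above, the following sketch illustrates the general UG idea of counting points into an equal-sized grid; it is not the implementation from [11], and the grid size m and bounding box are hypothetical parameters:

```python
import numpy as np

def uniform_grid_counts(points: np.ndarray, m: int,
                        bounds=(0.0, 1.0, 0.0, 1.0)) -> np.ndarray:
    """Count 2D points into an m x m uniform grid over the given bounding box."""
    x_min, x_max, y_min, y_max = bounds
    # Map each point to a cell index in [0, m-1] along each axis.
    xi = np.clip(((points[:, 0] - x_min) / (x_max - x_min) * m).astype(int), 0, m - 1)
    yi = np.clip(((points[:, 1] - y_min) / (y_max - y_min) * m).astype(int), 0, m - 1)
    counts = np.zeros((m, m), dtype=int)
    np.add.at(counts, (xi, yi), 1)  # accumulate the count of each cell
    return counts

# Example: 10,000 random points counted into a 16 x 16 uniform grid.
pts = np.random.rand(10_000, 2)
grid = uniform_grid_counts(pts, m=16)
```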
The tree structure has better hierarchical characteristics and can provide more convenient spatial range querying services. The Quad-opt algorithm [17] separates the 2D space using a complete quadtree structure and designs a post-adjustment method, which improves the range counting query accuracy to a certain extent. However, the complete quadtree structure does not consider the distribution of the data, thereby resulting in a large uniform assumption error. In [18], the complete quadtree partition results are adjusted and merged bottom-up in accordance with uniformity conditions in order to reduce the uniform assumption error. In [19], the authors separate the entire area into four sub-units of different sizes and recursively invoke the same partition process until a reasonable spatial structure is obtained. In [20], the authors propose an unbalanced quadtree partition method based on regional uniformity, which traverses the subtree according to a depth-first strategy and adaptively carries out partitioning according to the uniformity. However, tree-based partition methods need more recursive operations, and, therefore, the execution efficiency of the partition and publishing algorithm is dragged down. Some privacy spatial decomposition methods organically combine grid-based and tree-based structures to form hybrid partition structures [21,22], thereby improving the accuracy of range counting query services.
Clustering analysis has long played an important role in the data mining process. Clustering data records that are spatially close to each other into the same area not only helps improve the accuracy of the statistically released data but also protects the location information corresponding to a single record. Grid-based or density-based clustering algorithms [23,24,25,26,27] can be used to classify data of any shape. Since they only require distance calculations between a small number of grid nodes, such clustering algorithms are highly efficient. In [28], the authors propose a location-based big data clustering algorithm based on density and grid partition. The proposed cell distance analysis model greatly reduces the distance calculations and the traversal of the location points. In [29], the authors propose an online data stream clustering method based on density grids. The grid-based method is employed to reduce the number of calls to the distance function, thereby improving the clustering quality. In [30], the authors propose a strongly connected grid clustering algorithm for the parallel processing of clustering based on MapReduce. Grid clusters with strong connectivity are merged by calculating the connectivity weight matrix so as to realize efficient clustering of location data. With the improvement of machine learning methods, high-performance grid clustering and density clustering algorithms are gradually increasing in number. Combining the above clustering algorithms with the publishing of location-based statistical information can improve the availability of published data while making full use of the distribution characteristics of location-based data. Research in this area has very broad application prospects.
5. Statistical Publishing and Differential Privacy Protection Algorithm
In order to realize the privacy protection of the published location-based big data, it is necessary to allocate the differential privacy budget according to the spatial indexing structure and incorporate disturbance noise into the statistical results of each spatial region. Algorithm 2 depicts the differential privacy preserving publishing process of location-based big data. Firstly, according to Algorithm 1, the partition structure of the whole space is obtained. Then, the statistical results within each cluster are calculated, and Laplace noise with a privacy budget of ϵ is incorporated into the statistical results of each cluster. Finally, the noisy statistical results are evenly allocated to each grid region according to the number of grids within the cluster, in a bid to balance the noise error and the uniform assumption error and improve the usability of the published statistical results.
Figure 9 portrays the schematic of the grid partition structure and the clustering process, wherein Figure 9a indicates the type of each underlying grid with red numbers (i.e., flag = 0 corresponds to the empty grids, flag = 1 represents the uniform grids, and flag = 2 stands for the non-uniform grids). Only adjacent empty grids, or adjacent uniform grids possessing the same density level, can be merged to form grid clusters. Hence, neither the non-uniform grids nor the uniform grids with different density levels are merged, which keeps the uniform assumption error low.
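A minimal sketch of the merge-eligibility rule described above; the flag values follow Figure 9, while the function name and parameter layout are assumptions for illustration:

```python
EMPTY, UNIFORM, NON_UNIFORM = 0, 1, 2

def can_merge(flag_a: int, level_a: int, flag_b: int, level_b: int) -> bool:
    """Two adjacent grids may join the same cluster only if both are empty,
    or both are uniform with the same density level; non-uniform grids never merge."""
    if flag_a == EMPTY and flag_b == EMPTY:
        return True
    if flag_a == UNIFORM and flag_b == UNIFORM and level_a == level_b:
        return True
    return False
```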
Algorithm 2 Differential privacy preserving publishing algorithm.
Require: Cluster structure matrix C; grid density matrix D; privacy budget ϵ; sensitivity S; Ensure: Published statistical result matrix D̃ based on the grid structure;
- 1: D̃ = 0
- 2: calculate the number of clusters K according to C
- 3: for each cluster do
- 4: calculate the number u of grids within the current cluster
- 5: sum the density s of the u grids according to D
- 6: s̃ = s + Lap(S/ϵ)
- 7: assign the noisy density s̃/u to each of the u grids of the current cluster in D̃
- 8: end for
- 9: return D̃
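A minimal Python sketch of the publishing step described by Algorithm 2, assuming the cluster structure produced by Algorithm 1 is given as a label matrix with one integer label per grid cell; the data layout and function names are illustrative, not the paper's implementation:

```python
import numpy as np

def publish_with_dp(density: np.ndarray, cluster_labels: np.ndarray,
                    epsilon: float, sensitivity: float = 1.0) -> np.ndarray:
    """Add Laplace noise to each cluster's total count, then spread the noisy
    total evenly over the grids of that cluster (uniform assumption)."""
    published = np.zeros_like(density, dtype=float)
    for label in np.unique(cluster_labels):
        mask = (cluster_labels == label)          # grids belonging to this cluster
        u = int(mask.sum())                       # number of grids in the cluster
        total = density[mask].sum()               # true count within the cluster
        noisy_total = total + np.random.laplace(0.0, sensitivity / epsilon)
        published[mask] = noisy_total / u         # even allocation to each grid
    return published

# Example with a hypothetical 4 x 4 grid and two clusters.
density = np.random.randint(0, 50, size=(4, 4))
labels = np.array([[0, 0, 1, 1]] * 4)
result = publish_with_dp(density, labels, epsilon=1.0)
```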
Corollary 1. Algorithm 2 can provide ϵ-differential privacy protection for the published location-based big data statistical results.
Proof. For any querying range Q submitted by a user, the range counting query has the following use cases:
(1) The querying range Q is located within the area covered by a certain cluster, as depicted in Figure 10a. According to the grid clustering Algorithm 1 proposed in this paper, the querying range Q may be included in a single underlying grid or in a region merged from multiple underlying grids. In the former situation, it can be concluded that this single underlying grid is a non-uniform grid and forms a cluster alone. Therefore, Algorithm 2 incorporates Laplace noise into this grid with a differential privacy budget of ϵ. In the latter situation, Algorithm 2 incorporates Laplace noise into the merged region with a differential privacy budget of ϵ. According to the parallel combination characteristics of the differential privacy model described in Theorem 2, each grid that falls in the merged region satisfies ϵ-differential privacy protection.
(2) The querying range Q consists of the area covered by A complete clusters, as depicted in Figure 10b. In this case, each cluster conforms to the first use case described above. Therefore, according to the parallel combination characteristics of the differential privacy model described in Theorem 2, all the A complete clusters included in the querying range Q can provide ϵ-differential privacy protection.
(3) The querying range Q consists of the area covered by B intersecting clusters, as depicted in Figure 10c. In this case, each of the intersecting areas conforms to case (1) and can also provide differential privacy protection with a strength of ϵ.
(4) The querying range Q consists of the area covered by A complete clusters and B intersecting clusters, as depicted in Figure 10d. For the complete clusters falling in the querying range Q, each cluster can provide ϵ-differential privacy protection, similar to use case (2). For the clusters intersecting with the querying range Q, the intersected region conforms to use case (3) and provides ϵ-differential privacy protection. Therefore, the whole querying range can also provide ϵ-differential privacy protection according to the parallel combination characteristics of the differential privacy model.
Based on the above use cases, we can draw the conclusion that Algorithm 2 can provide ϵ-differential privacy protection for the published location-based big data statistical results. □
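The parallel composition property invoked throughout the proof (Theorem 2 in the paper) can be summarized as follows; the notation is a generic restatement rather than the paper's exact formulation:

```latex
% Parallel composition (generic restatement): if mechanisms M_1, ..., M_m each
% provide eps_i-differential privacy and operate on pairwise disjoint subsets
% D_1, ..., D_m of the data, then their joint release provides
% (max_i eps_i)-differential privacy, because any single record belongs to at
% most one subset and therefore influences at most one mechanism's output.
\[
  \mathcal{M}(D) = \bigl(\mathcal{M}_1(D \cap D_1),\,\dots,\,\mathcal{M}_m(D \cap D_m)\bigr)
  \quad\text{satisfies}\quad \Bigl(\max_{1 \le i \le m}\epsilon_i\Bigr)\text{-differential privacy.}
\]
% In Algorithm 2 every cluster receives noise calibrated to the same budget eps,
% so any query range composed of disjoint clusters remains eps-differentially private.
```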
6. Experiments and Analysis
In order to comprehensively evaluate the proposed grid clustering and differential privacy protection publishing algorithm (GCDPP) for location-based big data statistical information, we compare and analyze the proposed algorithm against a number of classical privacy spatial decomposition algorithms in terms of the availability of the published location-based big data statistical results, the efficiency of the privacy protection publishing algorithm, and the influence of partition granularity. The baseline methods include the uniform grid partition method (UG) [11], the adaptive grid partition method (AG) [11], the standard deviation circle radius adaptive grid decomposition algorithm (SDCAG) [14], the quadtree partition method (Quad-opt) [17], the density-based quadtree partition method (DBP) [19], and the unbalanced quadtree partition method (Unbalanced quadtree) [20].
The experimental location-based datasets include the facility location information datasets Storage (http://www.infochimps.com/datasets/storage-facilities-by-neighborhood-2, accessed on 1 January 2022) and Landmark (http://www.infochimps.com/datasets/storage-facilities-by-landmarks, accessed on 1 January 2022), provided by Infochimps; Checkin (http://snap.stanford.edu/data/loc-gowalla.html, accessed on 1 January 2022), which provides location information from the social network site Gowalla; and the taxi record dataset Yellow_trip (2009–2016) (https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, accessed on 1 January 2022), provided by the New York City Taxi Management Committee. The experimental platform is an Alibaba Cloud Server ECS (ecs.r6.2xlarge: 8-core CPU, 64 GB memory, 100 GB ESSD cloud disk, Windows Server 2019 Data Center Edition 64-bit), and all the algorithms are programmed in MATLAB.
6.1. Availability Analysis of the Published Location-Based Big Data Statistical Results
According to the application scenarios of location-based big data, the accuracy of range counting queries is employed to ascertain the availability of the location-based big data statistical publishing results. Range counting queries of different sizes are issued against the published results, and the outcomes are compared with those of the baseline methods.
For the parameter settings of the baseline algorithms, we refer to their corresponding literature. Specifically, for the smallest experimental dataset, the UG algorithm sets the constant parameter c = 10, the AG and SDCAG algorithms employ their recommended privacy allocation ratio, the Quad-opt algorithm adopts a partition depth of h = 6, the DBP algorithm sets its maximum density difference, and the Unbalanced quadtree algorithm uses its uniformity judgment threshold as recommended in the corresponding literature. The sensitivity of the range counting query caused by incorporating differential privacy noise is S = 1. For the other three big datasets, the constant parameter was set to c = 1000 for the UG algorithm and the partition depth to h = 8 for the Quad-opt algorithm, while the other parameters remained unchanged. The information pertinent to range counting queries of different sizes is depicted in Table 1. The differential privacy model incorporates Laplace noise under several different privacy budgets ϵ. Each type of query was randomly generated 1000 times to ascertain the average relative error between the original and published statistical results. The definition of relative error is presented in Formula (9), wherein Q represents the querying range, and the relative error compares the statistical result obtained within the querying range on the original location-based dataset with the noisy statistical result obtained on the published dataset within the same range Q. In order to prevent the denominator from being zero, the denominator is bounded below by a value proportional to the size of the location-based dataset, denoted |D|, following the UG algorithm [11].
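For concreteness, a sketch of how such a relative error could be computed; the lower bound rho on the denominator (here a small fraction of |D|) and the function names are assumptions rather than the paper's exact Formula (9):

```python
def relative_error(true_count: float, noisy_count: float, dataset_size: int,
                   rho_fraction: float = 0.001) -> float:
    """Relative error of a range counting query, with the denominator bounded
    below to avoid division by zero when the true count is very small."""
    rho = rho_fraction * dataset_size   # assumed bound proportional to |D|
    return abs(noisy_count - true_count) / max(true_count, rho)

# Example: a true count of 200 published as 215 on a dataset of 1,000,000 records.
err = relative_error(200, 215, 1_000_000)
```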
Figure 11, Figure 12, Figure 13 and Figure 14 portray the comparison of querying accuracy, i.e., in terms of relative error, on different datasets and privacy budgets. On the same experimental dataset, the relative error of all the algorithms gradually decreases with an increase in the privacy budget ϵ. The primary reason is that an increase in the privacy budget ϵ reduces the incorporated Laplace noise, thereby shrinking the error between the published result and the real statistical value. For all the datasets, as the querying range increases from small to large, the relative error first increases and then decreases. This is mainly because, when the querying range is small (e.g., the smaller query sizes in Table 1), the included areas only contain a few underlying grids or merged cluster areas, and, therefore, the accumulated noise interference is small, thereby making the querying error relatively low. When the querying range is large (e.g., the larger query sizes in Table 1), the included areas contain more regions belonging to different clusters, and, therefore, the uniform assumption error is small, thereby making the overall querying error relatively low.
When we compare the relative errors of the various privacy preserving publishing algorithms on the same dataset, it can be observed that the UG and Quad-opt algorithms have higher relative errors than the others under the various privacy budgets. The reason is that the spatial partition process of these two algorithms is independent of the spatial distribution of the location-based information, which not only produces more empty nodes and noise errors but also introduces more uniform assumption error owing to the non-uniform distribution of the nodes. The AG algorithm performs an additional fine-grained partition of the dense grids on the basis of the UG algorithm. However, it still cannot overcome the large noise errors caused by the empty nodes in areas with sparse location distributions. Therefore, the relative error of the AG algorithm is low when the querying range is small and deteriorates when the querying range is large. The SDCAG algorithm uses filtering and bucketing to reduce the noise error and therefore provides better range counting query accuracy than the AG algorithm. The DBP and Unbalanced quadtree algorithms implement a heuristic quadtree partition according to the uniformity of the location distribution, thereby compensating for the uniform assumption error and the noise error caused by the empty nodes to a certain extent. The proposed GCDPP algorithm adopts the strategy of bottom-up neighborhood clustering. The initial uniform grid partition takes into account the need for small-scale counting query services and retains a good accuracy for location services. The bottom-up neighborhood grid clustering reduces the excessive partitioning of sparse regions and suppresses excessive noise errors. In addition, it further facilitates merging the uniform regions of the same density level and apportions the differential privacy noise among them. Therefore, the proposed algorithm achieves better querying accuracy than the other algorithms under different datasets and different privacy budgets.
6.2. Efficiency Analysis of the Privacy Protection Publishing Algorithm
Table 2 compares the time complexity of the proposed GCDPP algorithm with that of the baseline methods. For a dataset encompassing n records, whilst the UG, AG, and SDCAG algorithms have the same time complexity of O(n), the partition process of the UG algorithm only scans the input data once, whereas the AG and SDCAG algorithms scan the input data at least twice to form a two-layer adaptive grid partition. The Quad-opt algorithm performs a recursive quadtree partition on the input dataset, and its overall time complexity is about O(nh) for a partition depth of h. The partition process of the DBP and Unbalanced quadtree algorithms is related to the specific distribution of the location-based dataset and the partition depth of the quadtree. In the worst scenario, for a location-based dataset with n records and a partition depth of h, the time complexity of these two algorithms is also O(nh). The proposed GCDPP algorithm first performs the same partition as the UG algorithm and then merges the neighborhood grids of the same type and density grade to form the clusters. Its time complexity is about O(n).
Figure 15 depicts the comparison of the execution time of all the privacy protection algorithms on the real location-based datasets. The parameter settings of these algorithms are the same as those in Section 6.1. Since the strength of the differential privacy budget has no significant effect on the execution time of the statistical publishing algorithms, we take a fixed privacy budget ϵ as an example to conduct the experimental comparisons on real location-based datasets of different scales. From a macro perspective, the overall execution time of all the algorithms increases with the size of the dataset. Specifically, the UG algorithm has the lowest execution time and is hardly affected by the size of the dataset. The AG and SDCAG algorithms perform an additional round of grid partition on the basis of the UG algorithm, and, therefore, they take a much longer time. In order to reduce the noise error, the SDCAG algorithm filters the grids with a raw count of 0 and allocates similar grids into the same bucket. Thus, its overall execution time is slightly higher than that of the AG algorithm. Although the partition process of the Quad-opt algorithm does not consider the specific distribution of the location information, the tree-based iterative process using depth-first traversal is primarily affected by the size of the dataset. Therefore, the overall execution time of the Quad-opt algorithm is significantly higher than that of the other algorithms. The partition process of the DBP and Unbalanced quadtree algorithms needs to be combined with the specific distribution of the location information. In areas wherein the location distribution is particularly sparse or dense, they can avoid unnecessary partitioning. Therefore, they achieve better performance on the sparse dataset Storage and the dense dataset Yellow_trip. The proposed GCDPP algorithm first implements the uniform grid partition stage and then performs bottom-up neighborhood grid clustering. The grid-based clustering accelerates the merging of the spatial regions and saves much time compared with the process of adaptive grid partition and filtering. Thus, the execution time of the GCDPP algorithm is second only to that of the UG algorithm. It can be expected that, with an increase in the size of the dataset, the advantage of the proposed GCDPP algorithm will be more obvious.
6.3. The Influences of Partition Granularity
As discussed in Section 4.1, traditional privacy spatial decomposition methods are easily affected by the partition granularity, resulting in over-partitioning or under-partitioning problems that not only impair the availability of the published data but may also drag down the efficiency of the entire data publishing algorithm. This section analyzes the relationships between the partition granularity setting, the accuracy of the published results, and the efficiency of the publishing algorithm. Since the privacy budget strength has no obvious effect on the setting of the partition granularity, we take a fixed privacy budget ϵ as an example to analyze the average error of the published statistical results under different partition granularities. For the sake of fairness, we keep the initial partition granularity of all the comparison algorithms basically the same. The average error of the published statistical results can be expressed as follows:

$$\mathrm{AvgErr} = \frac{1}{N \times N} \sum_{i=1}^{N \times N} \frac{\left| \tilde{d}_i - d_i \right|}{|D|},$$

wherein N is the initial partition granularity of the location-based dataset, and |D| represents the size of the dataset. d_i stands for the original density of grid i, whereas d̃_i is the published density of the corresponding area.
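A short sketch of this metric as written above; the exact normalization (here by the dataset size |D|) follows the expression given and should be treated as an assumption:

```python
import numpy as np

def average_error(original: np.ndarray, published: np.ndarray, dataset_size: int) -> float:
    """Mean absolute difference between original and published grid densities,
    averaged over the N x N grid cells and normalized by the dataset size."""
    return float(np.mean(np.abs(published - original)) / dataset_size)
```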
Figure 16 portrays the average error of all the algorithms under different partition granularities. It can be observed that, with an increase in the partition granularity, the average error of the tree-based partition algorithms grows gradually, whereas the average error of the grid-based partition algorithms does not change noticeably. The primary reason is that, when the partition granularity is small, the coverage area of a single grid is large, and the published statistical result may contain more uniform assumption error due to the inhomogeneous distribution of data within the region. Nevertheless, the introduced noise error is small because of the smaller number of empty grids. With an increase in the partition granularity, the coverage area of each grid is reduced. The uniform assumption error for each grid area decreases, whereas the total noise error increases because the number of empty grids may increase. The proposed algorithm achieves the lowest average error on almost all the experimental datasets, demonstrating the advantage of the proposed algorithm in terms of the availability of the published statistical results.
Intuitively, the execution time of the partition algorithms increases with the size of the dataset and the partition granularity. Figure 17 depicts the execution time of all the algorithms under different partition granularities. The partition process of the UG algorithm is independent of the distribution of the dataset, so it always maintains the lowest execution time and hardly changes with the partition granularity. For the other algorithms, the execution time increases significantly with the partition granularity. The execution time of the AG and SDCAG algorithms is slightly longer than that of the UG algorithm. With an increase in a dataset's scale and distribution complexity, the efficiency of these two algorithms gradually deteriorates, especially for the largest and most complex dataset. For the tree-based partition algorithms, the increase in partition granularity gradually causes more recursive operations and uniformity judgments; therefore, the execution time of these algorithms suffers significantly. While the proposed GCDPP algorithm maintains the same distribution-independent grid partition as the UG algorithm, its bottom-up grid clustering process avoids excessive recursive operations. Therefore, with an increase in the partition granularity, the proposed algorithm still outperforms the other algorithms. It is foreseeable that the advantages of the proposed algorithm will be more obvious for location-based datasets with larger scales and more complex distributions.
7. Conclusions
The rapid development of big data technology and the widespread popularization of mobile smart terminals have made various location-based services highly relevant to the work and life of users. Accordingly, the issue of location privacy leakage caused by the release of location-based big data statistics has started to attract widespread attention. Achieving the release of location-based statistical information via spatial decomposition methods and the differential privacy protection model can effectively avoid the various risks caused by location privacy leakage on the premise of ensuring data availability. In order to reduce the loss of data availability caused by over-partitioning and under-partitioning in traditional top-down spatial decomposition methods, this paper proposed a bottom-up grid clustering and differential privacy protection publishing method. The accuracy requirement of location-based services is met through uniform grid partition and density statistics. Privacy protection of the published data is achieved by incorporating Laplace noise into the location-based big data statistics according to the differential privacy model. The merging of the blank areas and the uniformly distributed areas is realized by the neighborhood grid clustering in order to reduce the noise error and the uniform assumption error. Experiments and analysis on real location-based datasets prove that the proposed grid clustering and differential privacy publishing method has obvious advantages in improving the range querying accuracy and operating efficiency.
However, the research still has some limitations. Firstly, most of the existing spatial partition methods for location big data are based on grid or tree structures. However, the areas where humans actually work and live are closely related to the distribution of infrastructure and cannot be well represented by grids or tree structures. Geographic segmentation methods such as road network structures still lack effective index methods and cannot achieve efficient querying services for location-based big data. Therefore, designing a more detailed and flexible partition method combined with an efficient index structure, which would facilitate improving the availability of the published results, is one of our future works. Secondly, the paper only focuses on the statistical partitioning and publishing of static location-based big data. However, a user's location continuously and randomly changes over time. The dynamic changes in the locations of a large number of users make the spatial partition structure of a previous statistical release time inapplicable to the next release time. Therefore, how to design a statistical partition and publishing method that accommodates the dynamic changes of location big data, and enhances the timeliness and practicability of the published statistical results under the premise of ensuring users' privacy, will be a further research direction. Finally, the paper discusses the statistical publishing of location-based big data based on the centralized differential privacy model, whose premise is that users' locations are collected and statistically released via a trusted third party. However, there is no completely reliable platform in practical applications. Problems such as technical failures, superuser leaks, and hacker attacks put users' original locations in danger. Therefore, transferring the privacy protection process of users' locations to their terminals and realizing local differential privacy location protection has become a research hotspot in recent years.