Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering

Yan, Yan; Sun, Zichao; Mahmood, Adnan; Xu, Fei; Dong, Zhuoyue; Sheng, Quan Z.

doi:10.3390/ijgi11070404

Open AccessArticle

Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering

by

Yan Yan

^1,*

,

Zichao Sun

¹,

Adnan Mahmood

²

,

Fei Xu

¹,

Zhuoyue Dong

¹ and

Quan Z. Sheng

²

¹

School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China

²

Faculty of Science and Engineering, School of Computing, Macquarie University, Sydney, NSW 2109, Australia

^*

Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2022, 11(7), 404; https://doi.org/10.3390/ijgi11070404

Submission received: 28 March 2022 / Revised: 28 June 2022 / Accepted: 12 July 2022 / Published: 15 July 2022

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Statistical partitioning and publishing is commonly used in location-based big data services to address queries such as the number of points of interest, available vehicles, traffic flows, infected patients, etc., within a certain range. Adding noise perturbation to the location-based statistical data according to the differential privacy model can reduce various risks caused by location privacy leakage while keeping the statistical characteristics of the published data. The traditional statistical partitioning and publishing methods realize the decomposition and indexing of 2D space from top to bottom. However, they can easily cause the over-partitioning or under-partitioning phenomenon, and therefore need multiple times of data scan. This paper proposes a grid clustering and differential privacy protection method for location-based statistical big data publishing scenarios. We implement location-based big data statistics in units of equal-sized grids and perform density classification on uniformly distributed grids by discrete wavelet transform. A bottom-up grid clustering algorithm is designed to perform on the blank and the uniform grids of the same density level based on neighborhood similarity. The Laplacian noise is incorporated into the clustering results according to the differential privacy model to form the published statistics. Experimental comparison of the real-world datasets manifests that the grid clustering and differential privacy publishing method proposed in this paper is superior to other existing partition publishing methods in terms of range querying accuracy and algorithm operating efficiency.

Keywords:

statistical release of big data; location privacy; differential privacy; privacy spatial decomposition; grid clustering

1. Introduction

With the rapid proliferation of emerging technologies such as Mobile Internet and the Internet of Things, numerous location-based big data services are becoming increasingly popular in various fields, including, but not limited to, population distribution statistics [1], urban planning and management [2], intelligent traffic scheduling [3], and disease epidemic control [4]. Particularly, in the current critical period of COVID-19 pandemic, a series of location-based big data applications (i.e., health codes, communication itinerary cards, and close contact self-examination procedures) provide great convenience for epidemic prevention and control in terms of epidemiological investigations, epidemic distribution and statistical analysis, personnel and material scheduling, etc.

Location information is highly correlated with personal privacy. Improper release or reverse reasoning analysis [5,6,7] of location-based big data statistics can easily lead to the disclosure of personal privacy of a user’s specific location, movement trajectory, living habits, health status, hobbies, economic conditions, etc., and may even endanger a user’s property and life [8]. Therefore, addressing the above-mentioned privacy protection issues during the publishing and utilization of location-based big data statistics remains an indispensable task and acts as a considerable bottleneck in the development of the big data industry.

The statistical partition and publishing method is widely used in location-based big data services. It can provide users with accurate and timely traffic statistics, facilitate people to plan reasonable travel times and routes, and obtain high-precision Location Based Services (LBS). When combined with the differential privacy model and once the noise disturbance has been incorporated in the statistics of location-based big data, it can further reduce the risk of users’ privacy leakage while maintaining the statistical characteristics of the published data. During the COVID-19 pandemic, spatiotemporal location big data provided case statistics, distribution analysis, and trajectory estimation for pandemic prevention and control and which subsequently assisted in social resumption of work and production. Figure 1 is a typical example of statistical publishing of location-based big data, which depicts the spatial distribution of COVID-19 infectors in Hubei province on 31 March 2020 [9]. Figure 2 is the result of RAPPOR, an anonymous crowdsourced statistical data privacy-preserving technique based on differential privacy, wherein the true sample distribution is shown in black, and the light green depicts the estimated distribution based on the decoded Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) reports [10]. As can be observed from Figure 2, when the number of report from users reaches one million, the statistical results reported according to RAPPOR closely traced the statistical distribution of the real data, which does not affect the availability of the published big data statistical results.

The big data statistical publishing method based on the differential privacy model protects the privacy of each user under the premise of ensuring data availability by adding a certain random noise to the statistical publishing results. The partition structure of spatial location information and the introduction of differential privacy noise lead to errors between the published big data statistics’ results and the real statistical results. This may reduce the availability of the released data. For example, a user queries the number of taxis available within 1 km of their own location and the noisy statistics’ results returned by the LBS system is 11. However, the real situation is that there are only two taxis within a user’s query range. Therefore, such statistically released results are likely to mislead users to wait for a long time, thereby resulting in an unsatisfactory LBS service experience. Similarly, let us suppose that, in accordance with the location information of confirmed COVID-19 patients in a certain area, the number of close contacts is about 12,000; nevertheless, the released statistical result after incorporating differential privacy noise turns out to be 8700. This can thus disrupt the supply of the medical aid equipment and other pandemic prevention materials as they would not be able to cater to the needs of this region, thereby delaying the timely treatment of patients. Therefore, improving the accuracy of location-based big data statistical results on the premise of ensuring location privacy is an important issue for big data services.

In order to reduce the impact of spatial partition structure and differential privacy noise on the availability of published data, this paper hereby proposes a bottom-up grid clustering algorithm and a corresponding differential privacy preserving data publishing algorithm. The range counting querying errors are reduced, and the availability of released statistical results is improved.

The rest of the paper is organized as follows: Section 2 reviews the state-of-the-art pertinent to private spatial decomposition methods. Section 3 defines the differential privacy model used for the publishing of location-based big data statistical information. Section 4 analyzes the problems of traditional privacy spatial decomposition and introduces the proposed grid clustering algorithm. Section 5 details the proposed statistical publishing and differential privacy protection algorithms. Finally, Section 6 reports a set of empirical studies, whereas Section 7 concludes the paper, laying out the limitations and future works of the research.

2. Related Work

Privacy spatial decomposition is one of the important means to realize the application of location-based big data statistical publishing. The statistical release and differential perturbation of location-based information within the partition and indexing area facilitate reducing the risk of users’ location privacy. The traditional privacy spatial decomposition methods partition the 2D space from top to bottom and index the partition areas either according to location independent grid or tree structure or some location-based hierarchical structures.

The grid-based partition and indexing structure is the most common privacy decomposition methods, wherein the uniform grid (UG) method and the adaptive grid (AG) method are the typical representatives [11]. The former separates the 2D space into equal sized grids and calculates the statistical result in the units of grid. The latter performs a further grid partition on the dense areas based on the UG method. In [12], the authors employ contour maps to describe the distribution of location points in spatial crowdsourcing services and partition the spatial area into disjointed units to protect workers’ location privacy. The partition method proposed in [13] firstly obtains the preliminary density-adaptive grid partition by judging the distribution of location-based information on the longitude and latitude directions, and, subsequently, the adaptive grid partition was performed on the dense grids to save unnecessary partition process and reduce publishing errors. In [14], the authors propose a three-layer adaptive grid partition method based on Bernoulli random sampling technology which employs the exponential mechanism and high-pass filtering technology to filter out grids smaller than the predefined threshold on the second layer of the partition structure but continues to partition the grid larger than the predefined threshold. The location privacy protection scheme proposed in [15] uses a three-layer adaptive grid structure and a fully pyramid grid algorithm based on differential privacy. In [16], the authors use the statistical information of particle distribution to extract the optimal spatial decomposition so that each unit contains particles with uniform spatial distribution. A common problem with the grid-based partition method method is difficult to balance the noise error with the uniform assumption error, which affects the accuracy of the published statistical results.

The tree structure has better hierarchical characteristics and can provide more convenient spatial range querying services. The Quad-opt algorithm [17] separates the 2D space using a complete quad-tree structure and designs a post-adjustment method which improves the range counting querying accuracy to a certain extent. However, the complete quadtree structure does not consider the distributions of data, thereby resulting in a large uniform assumption error. In [18], the complete quadtree partition results are adjusted and merged bottom-up in accordance with the uniformity conditions in order to reduce the uniformity assumption error. In [19], the authors separate the entire area into four sub-units of different sizes and recursively invoke the same partition process until a reasonable spatial structure is obtained. In [20], the authors propose an Unbalanced quadtree partition method based on regional uniformity and which traverses the subtree according to the depth-first strategy and adaptively carry out partition according to the uniformity. However, tree-based partition methods need more recursive operations, and, therefore, the execution efficiency of the partition and publishing algorithm is dragged down. Some privacy spatial decomposition methods organically combine grid-based and tree-based structures to form hybrid partition structures [21,22], thereby improving the accuracy of range counting querying services.

Clustering analysis plays an important role in data mining process for a long time. Cluster data records that are spatially close to each other in the same area not only facilitates improving the accuracy of statistically released data but also protects the location information corresponding to a single record. This kind of grid-based or density-based clustering algorithm [23,24,25,26,27] can be used to classify data of any shape. Since it only requires less calculations pertinent to the distance between a small number of grid nodes, the clustering algorithm has a high level of efficiency. In [28], the authors propose a location-based big data clustering algorithm based on density and grid partition. The proposed cell distance analysis model greatly reduced the distance calculation and the traversal of the location points. In [29], the authors propose an online data stream clustering method based on density grids. The grid-based method is employed to reduce the number of calls of the distance function, thereby improving the clustering quality. In [30], the authors propose a strongly connected grid clustering algorithm for the parallel processing of clustering based on MapReduce. Grid clusters with strong connectivity are merged by calculating the connectivity weight matrix so as to realize efficient clustering of location data. With the improvement of machine learning methods, high-performance grid clustering and density clustering algorithms are gradually increasing. Combining the above clustering algorithms with the publishing of location-based statistical information can improve the availability of published data on the basis of making full use of the distribution characteristics of location-based data. Research in this area has a very broad application space.

3. Differential Privacy

Definition 1

(

ϵ

-Differential Privacy [31,32]). For a pair of neighboring datasets

T_{1}

and

T_{2}

with

‖ T_{1} - T_{2} ‖_{1} \leq 1

, a randomized algorithm K with domain N is ϵ-differentially private if, for all

S \subseteq R a n g e (K)

, there is:

P_{r} [K (T_{1}) \in S] \leq e^{ϵ} \times P_{r} [K (T_{2}) \in S]

(1)

The parameter

ϵ

in Formula (1) is referred to as the privacy budget. The smaller the value of

ϵ

, the higher is the degree of privacy protection provided by the algorithm K. Even if an attacker obtains all the data except for a specific target record, the attacker still cannot determine whether records corresponding to a target exist in the original dataset, thereby realizing privacy protection.

Definition 2

(Sensitivity [33]). For a given query mapping function f, the sensitivity

▵ f

is defined as the maximum

L_{1}

-norm distance between the output of the query mapping function on neighboring datasets

T_{1}

and

T_{2}

:

▵ f = max_{T_{1}, T_{2}} {‖ f (T_{1}) - f (T_{2}) ‖}_{1}

(2)

Definition 3

(The Laplace Mechanism [33]). For numeric queries, the Laplace mechanism achieves differential privacy protection by incorporating a small amount of independent noise to the output of the query mapping function f. Let

f (T)

represent the result obtained by the query mapping function f on the original dataset T; then, the query result returned by the Laplace mechanism can be expressed as

K (T) = f (T) + η

, wherein η is a continuous random variable satisfying the Laplace distribution (centered at 0) with scale b, and its probability density function can be expressed as:

P_{r} [η = x] = \frac{1}{2 b} e^{- \frac{| x |}{b}}

(3)

Combined with the definition of sensitivity, the incorporated independent noise satisfies a zero-mean Laplace distribution of magnitude

b = \frac{▵ f}{ϵ}

.

Theorem 1

(Serial Combination Characteristics [34]). For a set of randomized algorithms

{K_{1}, K_{2}, \dots, K_{n}}

on the same dataset T, wherein

K_{i}

is an

ϵ_{i}

-differentially private algorithm

(1 \leq i \leq n)

. Then, their combination

{K_{1}, K_{2}, \dots, K_{n}}

is

\sum_{i = 1}^{n} ϵ_{i}

differentially private.

Theorem 2

(Parallel Combination Features [34]). If the dataset T is composed of a set of independent and disjoint subsets

{T_{1}, T_{2}, \dots, T_{n}}

, a set of randomized algorithms

{K_{1}, K_{2}, \dots, K_{n}}

are

ϵ_{i}

differentially private, respectively; then, their combination

{K_{1}, K_{2}, \dots, K_{n}}

is

m a x {ϵ_{i}}

differentially private.

Definition 4

(Noise Error). For any querying range Q, the error between the released statistical results and the original statistical results after privacy protection processing is referred to as the noise error:

N o i s e_E r r o r (Q) = | C o u n t (Q) - C o u n t^{*} (Q) |

(4)

wherein

C o u n t (Q)

represents the statistical result of the original location-based information within the querying range Q, and

C o u n t^{*} (Q)

manifests the statistical result of the published data within the same area.

Definition 5

(Uniform Assumption Error). Since the 2D space cannot be accurately partitioned into a single point, the partition and publishing algorithms often assume that the location-based information within the smallest partition area is uniformly distributed. The range counting querying error transpired as a result of the uniform assumption estimate is called the uniform assumption error:

U n i_E r r o r (Q) = | \sum_{i = 1}^{m} r_{i} \cdot C o u n t (P_{i}) - C o u n t (Q) |

(5)

wherein

P_{i}

(i = 1, 2, \dots, m)

represents the partition area that intersects with the querying range Q and satisfies

\underset{1 \leq i, j \leq m ⋀ i \neq j}{P_{i} \cap P_{j}} = \emptyset

.

r_{i}

(i = 1, 2, \dots, m)

is the ratio of the intersection of the querying range Q and the partition area, and

C o u n t (Q)

represents the statistical result of the original location-based information within the querying range.

4. The Grid Clustering Algorithm for Location-Based Big Data Statistical Information

4.1. The Problems of Traditional Privacy Spatial Decomposition

Unreasonable partition structure is likely to increase the publishing errors for location-based big data statistical information. The errors mainly include differential privacy noise error and uniform assumption error. If the local distribution of location-based big data has the sparse characteristic as shown in Figure 3a, the accurate statistical value in the querying range

Q_{1}

, i.e., the red dotted box is 1. The returned query result after incorporating differential privacy noise according to Definition 3 is about

(3 + 2.6) \times \frac{4}{9} \approx 2.5

and which is closer to the real statistical result. The partition and differential privacy protection result of the local distribution of location-based big data are portrayed in Figure 3b. The returned query result within the same range

Q_{1}

is about

1 + 6.3 = 7.3

(where 6.3 is the noise error). From this example, we can observe that, if a partition structure possesses a large number of empty nodes, i.e., grids with no location-based information, incorporating differential privacy noise introduces large noise error and reduces the querying accuracy of the published results.

If the local distribution of location-based big data has the uniform characteristic as depicted in Figure 4a, the accurate statistical value in the querying range

Q_{2}

, i.e., the blue dotted box, is 16. The returned query result after incorporating differential privacy noise is about

(148 - 7.6) \times \frac{1}{9} \approx 15.6

and which is also closer to the real statistical results. However, for the partition and differential privacy protection result of the corresponding distribution portrayed in Figure 4b, the returned query result within the same range

Q_{2}

is

16 - 6.9 = 9.1

(where 6.9 is the noise error). Therefore, the over-partitioning regions with relatively uniform spatial distribution also leads to a decrease in the precision of range counting queries.

If the local distribution of location-based big data has non-uniform characteristics as portrayed in Figure 5a, the accurate statistical value in querying range

Q_{3}

, i.e., the green dotted box, is 3, and the returned query result after incorporating differential privacy noise is

(101 + 11.6) \times \frac{1}{9} \approx 12.5

. The publishing error between the released and the real statistical result is relatively high. However, for the partition and differential privacy protection result of the corresponding distribution depicted in Figure 5b, the returned query result within the same range

Q_{3}

is

3 + 1.1 = 4.1

(where 1.1 is the noise error). It can be observed that, for areas with non-uniform spatial distribution, under-partitioning causes larger uniform assumption error and affect the accuracy of range counting queries.

4.2. Proposed Method

In order to overcome the above-mentioned problems during privacy spatial decomposition, we propose a grid clustering algorithm for location-based big data statistical information. Firstly, the 2D space covered by the location information is partitioned into uniform grids in accordance with the minimum size of the range counting querying service of location-based big data. The number of location points in each grid is counted and used as the density of the grid. In many practical applications, including but not limited to, population statistics, traffic statistics, and location based services, statistics are published by means of range counting [35,36]. Therefore, our preliminary grid partition and statistical method is in line with practical application scenarios. Secondly, the uniform distribution judgment condition has been designed for the grid areas in a bid to determine the distribution characteristics of the non-empty grids. The grid density is graded via discrete wavelet transform which, therefore, facilitates in achieving coarse clustering of location-based big data statistical results. Finally, a bottom-up grid clustering algorithm is proposed. The uniform grid and empty grid belonging to the same level of density are clustered and merged according to the neighborhood similarity. This, therefore, reduces the noise error and uniform assumption error in the released statistical results.

Definition 6

(Grid Uniformity). For a grid G with a density of

D e n (G)

, let

D e n (D_{1}),

D e n (D_{2})

,…,

D e n (D_{i})

be the density of the sub-regions of grid G partitioned from horizontal, vertical, diagonal, and other directions. If a row vector

V = {D e n (D_{1}), D e n (D_{2}), \dots,

D e n (D_{i})}

is constructed, then the distribution uniformity of the grid G can be represented by the variance

V a r (V)

of vector V. If the following Formula (6) is satisfied, the grid G is said to be a uniformly distributed grid:

| L F C - {log}_{10} {(\frac{D e n (G)}{i})}^{2} | \leq θ

(6)

wherein

L F C = l o g_{10} (V a r (V))

, i is the number of sub-regions after multi-directional segmentation and θ is the threshold.

Definition 7

(Neighborhood Grid). Let

C_{i}

be the center of grid

G_{i}

and the distance between any two grids with shared adjacent edge be 1. If the distance between grids

G_{i}

and

G_{j}

satisfies the following Formula (7), the grid

G_{j}

is referred to as the neighborhood grid of the grid

G_{i}

(as shown in Figure 6).

d i s (C_{i}, C_{j}) \leq \sqrt{2}

(7)

Regions with the same uniformity but having large differences in terms of their density are not suitable for direct merging. Therefore, we employ the discrete wavelet transform (DWT) so as to classify the matrix formed via the grids’ density. Subsequently, the grids of different levels are merged according to the neighborhood similarity to form the final publishing structure. DWT can decompose a signal into different sub-bands with frequency and time information in a bid to realize localized and detailed analysis of time or space. The decomposition process of DWT is depicted in Figure 7. The input signal

X (n)

is firstly computed by the low-pass (marked as

f_l (n)

) and high-pass (marked as

f_h (n)

) horizontal filters respectively, and is then down-sampled by a factor of 2. Subsequently, the outputs

L_{1}

and

H_{1}

are processed by the low-pass and high-pass vertical filters separately, and are again down-sampled. If necessary, the next-level DWT decomposition can be performed on the low-frequency coefficients (

L L_{1}

) of the previous layer DWT according to the same method. The low-frequency coefficients of the DWT transform contain most of the energy of a signal. Therefore, the low-frequency coefficients of the density matrix after DWT transform were selected to carry out the classification of grid density in this paper.

Definition 8

(Grid Density Grading). Let

L L

refer to the low-frequency coefficients of the original grid density matrix after the 2D DWT transform. Given two thresholds,

T_{1} = \frac{1}{3} \bar{L L}

and

T_{2} = \frac{2}{3} \bar{L L}

, the original grid density matrix can be classified into three levels:

C l a s s (i, j) = {\begin{matrix} 1 & i f & L L (i, j) \leq T_{1} \\ 2 & i f & T_{1} < L L (i, j) \leq T_{2} \\ 3 & i f & T_{2} \leq L L (i, j) \end{matrix}

(8)

Algorithm 1 delineates the detailed implementation mechanism of the grid clustering algorithm proposed in this paper. Lines 1–18 perform uniform grid partition and density statistics on the 2D space covered by the location dataset. The initial space is segregated into empty grids (flag = 0), uniform grids (flag = 1) and non-uniform grids (flag = 2) by uniformity judgement. The uniform grids are further graded via their density. Lines 19–34 complete various types of grid clustering: for uniform grids with flag = 1, the neighborhood grids with the same density classification are integrated to form a cluster; for empty grids with flag = 0, similar clustering and merging are done on adjacent empty grids. The above operation facilitates in reducing the incorporation of Laplace noise and avoiding excessive accumulation of noise errors. For non-uniform grids with flag = 2, as discussed in Section 4.1, a combination of this kind of area causes large uniform assumption errors. Therefore, we will not cluster these type of grids via our proposed algorithm. Lines 35–36 return the final cluster structure matrix. Figure 8 portrays the results of the proposed grid clustering algorithm by taking the location dataset, Storage, as an example (different colors are used to represent different clusters).

Algorithm 1 Grid clustering algorithm.

Require: Location-based dataset D; partition granularity m; uniformity threshold

θ

; density grading threshold

T_{1}

and

T_{2}

;
Ensure: Cluster structure matrix

C S M

; grid density matrix

D e n

;

1:: Partition the 2D space of location-based dataset D into $m \times m$ uniform grids
2:: $D e n (i, j)$ =0, $f l a g (i, j)$ =0, $C l a s s (i, j)$ =0, $u n i_G (i, j)$ =0, $n u l l_G (i, j)$ =0, $(i, j = 1, 2, \dots, m)$
3:: for each grid do
4:: $D e n (i, j) ⟵$ calculate the density of current grid
5:: if $D e n (i, j) = 0$ then
6:: $f l a g (i, j) = 0$
7:: else
8:: if uniformity of the current grid satisfies Formula (6) then
9:: $f l a g (i, j) = 1$
10:: else
11:: $f l a g (i, j) = 2$
12:: end if
13:: end if
14:: end for
15:: $c A ⟵$ complete DWT transform on $D e n$ and get the low-frequency coefficients
16:: for each grid with $f l a g$ =1 do
17:: $C l a s s (i, j) ⟵$ grid density grading according to Formula (8)
18:: end for
19:: for each uniform grid with $f l a g$ =1 do
20:: for each class within Formula (8) do
21:: if the current grid’s neighborhood grid satisfies Definition (7) then
22:: $u n i_G ⟵$ set the connected grid with the label of the current grid
23:: else
24:: $u n i_G ⟵$ mark the current grid as a new sequence number
25:: end if
26:: end for
27:: end for
28:: for each empty grid with $f l a g$ =0 do
29:: if the current grid’s neighborhood grid satisfies Definition (7) then
30:: $n u l l_G ⟵$ set the connected grid with the label of the current grid
31:: else
32:: $n u l l_G ⟵$ mark the current grid as a new sequence number
33:: end if
34:: end for
35:: $C S M ⟵$ merge the matrix $u n i_G$ and $n u l l_G$
36:: return $C S M$ , $D e n$

5. Statistical Publishing and Differential Privacy Protection Algorithm

In order to realize the privacy protection of the published location-based big data, it is necessary to allocate the differential privacy budget according to the spatial indexing structure and incorporate disturbance noise into the statistical results of each spatial region. Algorithm 2 depicts the differential privacy preserving publishing process of location-based big data. Firstly, according to Algorithm 1, the distribution structure of the whole space is obtained. Then, the statistical results within each cluster are calculated, and the Laplacian noise with a privacy budget of

ϵ

is incorporated into the statistical results of each cluster. Finally, the noisy statistical results will be evenly allocated to each grid region according to the number of grids within the cluster in a bid to balance the noise error and the uniform assumption error and improve the usability of the published statistical results. Figure 9 portrays the schematic of the grid partition structure and the clustering process, wherein Figure 9a suggests the type of the underlying grid with red numbers (i.e., flag = 0 correspond to the empty grids, flag = 1 represent the uniform grids, and flag = 2 stand for the non-uniform grids). Only adjacent empty grids or uniform grids possessing the same density level can be integrated to form the grid clusters. Hence, neither the non-uniform grids nor the uniform grids with different density levels can be merged to reduce the uniformity assumption error.

Algorithm 2 Differential privacy preserving publishing algorithm.

Require: Cluster structure matrix

C S M

; grid density matrix

D e n

; privacy budget

ϵ

; Sensitivity S;
Ensure: Publish statistical result based on grid

D e n^{*}

;

1:: $D e n^{*} (i, j)$ =0
2:: $n ⟵$ calculate the number of clusters according to $C S M$
3:: for each cluster do
4:: $u ⟵$ calculate the number of grids within the current cluster
5:: $s u m_D e n ⟵$ sum the density of the u grids according to $D e n$
6:: $n o i s y_s u m = s u m_D e n + L a p l a c e (\frac{S}{ϵ})$
7:: $D e n^{*} (i, j) = n o i s y_s u m / u$
8:: end for
9:: return $D e n^{*}$

Corollary 1.

Algorithm 2 can provide ϵ-differential privacy protection for the published location-based big data statistical results.

Proof.

For any querying range Q submitted by a user, the range counting query has the following use cases:

(1) The querying range Q located within the area covered by a certain cluster as depicted in Figure 10a. According to the grid clustering Algorithm 1 proposed in this paper, the querying range Q may be included in a single underlying grid or in a region merged by multiple underlying grids. In the former situation, it can be concluded that this single underlying grid is a non-uniform grid and forms a cluster alone.Therefore, Algorithm 2 incorporates Laplace noise in this grid with a differential privacy budget of

ϵ

. In the latter situation, Algorithm 2 incorporates Laplace noise to the merged region with a differential privacy budget of

ϵ

. According to the parallel combination characteristics of the differential privacy model described in Theorem 2, each grid that falls in the merged region satisfies the differential privacy protection of

m a x {ϵ_{i}} = ϵ

.

(2) The querying range Q consists of the area covered by A (

A \geq 1

) complete clusters as depicted in Figure 10b. In this case, each cluster conforms to the first use case described above. Therefore, according to the parallel combination characteristics of the differential privacy model described in Theorem 2, all the A complete clusters included in the querying range Q can provide differential privacy protection of

m a x {ϵ_{i}} = ϵ

.

(3) The querying range Q consists of the area covered by B (

B \geq 1

) intersecting clusters as depicted in Figure 10c. In this case, each of the intersecting area conforms to the case (1) and can also provide differential privacy protection with a strength of

ϵ

.

(4) The querying range Q consists of area covered by A (

A \geq 1

) complete clusters and B (

B \geq 1

) intersecting clusters, as depicted in Figure 10d. For the complete clusters falling in the querying range Q, each cluster can provide

ϵ

differential privacy protection similar to use case (2). For the clusters intersected with the querying range Q, the intersected region conforms to use case (3) and provide

ϵ

differential privacy protection. Therefore, it can also provide differential privacy protection of

ϵ

according to the parallel combination characteristics of the differential privacy model.

Based on the above use cases, we can draw the conclusion that Algorithm 2 can provide differential privacy protection of

ϵ

for the published location-based big data statistical results. □

6. Experiments and Analysis

In order to comprehensively evaluate the proposed grid clustering and differential privacy protection publishing algorithm (GCDPP) for location-based big data statistical information, we compare and analyze the proposed algorithm with a number of classical privacy spatial decomposition algorithms in terms of the availability of published location-based big data statistical results, efficiency of the privacy protection publishing algorithm, and the influences of partition granularity. The baseline methods include, but are not limited to, uniform grid partition method (UG) [11], adaptive grid partition method (AG) [11], standard deviation circle radius adaptive grid decomposition algorithm (SDCAG) [14], quadtree partition method (Quad-opt) [17], density-based quadtree partition method (DBP) [19], and unbalanced quadtree partition method (unbalanced quadtree) [20].

Experimental location-based datasets include the facility location information datasets, Storage http://www.infochimps.com/datasets/storage-facilities-by-neighborhood-2, accessed on 1 January 2022, and Landmark http://www.infochimps.com/datasets/storage-facilities-by-landmarks, accessed on 1 January 2022, provided by Infochimps; Checkin http://snap.stanford.edu/data/loc-gowalla.html, accessed on 1 January 2022, which provides location information from the social network site, Gowalla; and Taxi record dataset, Yellow_trip (2009–2016) https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page, accessed on 1 January 2022, provided by the New York City Taxi Management Committee. The experimental platform selects the Alibaba Cloud Server ECS (ecs.r6.2xlarge: 8-core CPU, 64 GB memory, 100 GB ESSD cloud disk, Windows Server 2019 Data Center Edition 64-bit), and all the algorithms are programmed in MATLAB.

6.1. Availability Analysis of the Published Location-Based Big Data Statistical Results

According to the application scenarios of location-based big data, the accuracy of the range counting query has been employed to ascertain the availability of location-based big data statistical publishing results. Range counting queries of different sizes are implemented on the published results and compared with the baseline methods.

For parameter settings of the baseline algorithms, we refer to their corresponding literature. Specifically, for the

S t o r a g e

dataset, the UG algorithm sets the constant parameter c = 10, the AG and SDCAG algorithm employ the privacy allocation ratio of

α = 0.5

, the Quad-opt algorithm adopts the partition depth of h = 6, the DBP algorithm set maximum density difference

β = 5

, and Unbalanced quadtree algorithm use the uniformity judgment threshold of

θ = 0.01

. The sensitivity of range counting query caused by incorporating differential privacy noise is

S = 1

. For the other three big datasets, the constant parameter was set to c = 1000 for the UG algorithm, division depth to h = 8 for the Quad-opt algorithm, and other parameters remained unchanged. The information pertinent to range counting queries of different sizes is depicted in Table 1. The differential privacy model incorporates Laplace noise with privacy budgets of

ϵ = 0.01, ϵ = 0.1

, and

ϵ = 1

. Each type of query was randomly generated for 1000 times to ascertain the average relative error between the original and published statistical results. The definition of relative error is presented in Formula (9), wherein Q represents the querying range,

C o u n t (Q)

returns the statistical result obtained within the querying range on the original location-based dataset, and

C o u n t^{*} (Q)

is the noisy statistical result obtained on the published dataset within the same range Q. In order to prevent the denominator from being zero, we set

ρ = 0.001 \times | T |

according to the UG algorithm [11], wherein

| T |

represents the size of the location-based dataset.

R E (Q) = \frac{| C o u n t^{*} (Q) - C o u n t (Q) |}{m a x {C o u n t (Q), ρ}}

(9)

Figure 11, Figure 12, Figure 13 and Figure 14 portray the comparison of querying accuracy, i.e., in terms of relative error, on different datasets and privacy budgets. On the same experimental dataset, the relative error of all the algorithms gradually decreases with an increase in the privacy budget

ϵ

. The primary reason is that the increase in the privacy budget

ϵ

reduces the incorporated Laplace noise, thereby shrinking the error between the published result and the real statistical value. For all the datasets, as the querying range increases from small to large, the relative error first increases and then decreases. This is mainly because, when the querying range is small (e.g., of size

q_{1}

and

q_{2}

), the included areas only contain some underlying grids or merged cluster areas, and, therefore, the accumulated noise interference is small, thereby making the querying error relatively low. When the querying range is large (e.g., of size

q_{5}

and

q_{6}

), the included areas contain more regions belonging to different clusters, and, therefore, the uniformity assumption error is small, thereby making the overall querying error relatively low.

When we compare the relative errors of various privacy preserving publishing algorithms on the same dataset, it can be observed that the UG and Quad-opt algorithm have higher relative errors in contrast to the others under various privacy budgets. The reason is that the spatial partition process of these two algorithms has nothing to do with the spatial distribution of location-based information, which not only produces more empty nodes and noise errors but also introduces more uniform assumption error owing to the non-uniform distribution of the nodes. The AG algorithm performs an additional fine-grained partition for the dense grids on the basis of the UG algorithm. However, it still cannot overcome the large number of noise errors caused by the empty nodes in areas with sparse location distribution. Therefore, the relative error of the AG algorithm is low when the querying range is small and the value deteriorates when the querying range is large. In the SDCAG algorithm, filtering and bucketing are used to reduce the noise error. Therefore, it provides better accuracy of range counting query than the AG algorithm. The DBP and Unbalanced quadtree algorithm implements a heuristic quadtree partition according to the uniformity of the location distribution, thereby compensating the uniformity assumption error and the noise error caused by the empty nodes to a certain extent. The proposed GCDPP algorithm adopts the strategy of bottom-up neighborhood clustering. The initial uniform grid partition takes into account the need for small-scale counting query services and retains a better accuracy of location services. The bottom-up neighborhood grid clustering reduces the excessive partition of sparse regions and suppresses excessive noise errors. In addition, it further facilitates merging the uniform regions of the same density level and apportions the differential privacy noise. Therefore, the proposed algorithm achieves better querying accuracy in contrast to the other algorithms under different datasets and different privacy budgets.

6.2. Efficiency Analysis of the Privacy Protection Publishing Algorithm

Table 2 compares the time complexity of the proposed GCDPP algorithm with the baseline methods. For a dataset encompassing n records, whilst the UG, AG, and SDCAG algorithm have the same time complexity of

O (n)

, the partition process of the UG algorithm only scans the input data for one time, whereas the AG and SDCAG algorithm scan the input data for at least two times to form a two-layer adaptive grid partition. The Quad-opt algorithm performs a recursive quadtree partition on the input dataset. The overall time complexity is about

O (n log (n))

. The partition process of the DBP and Unbalenced quadtree algorithm is related with the specific distribution of location-based dataset and partition depth of quadtree. In the worst scenario, for a location-based dataset with n records and h partition depth, the time complexity of these two algorithms will be

O (h n)

. The proposed GCDPP algorithm first performs the same partition as the UG algorithm, and then merges the neighbourhood grids of the same type and density grade to form the clusters. Its time complexity is about

O (n + n) \approx O (n)

.

Figure 15 depicts the comparison in the execution time of all the privacy protection algorithms on real location-based datasets. The parameter settings of these algorithms are the same as those in Section 6.1. Since the strength of differential privacy budget has no significant effect on the execution time of the statistical publishing algorithms, we take

ϵ = 1

as an example to conduct the experimental comparisons on the real location-based datasets of different scales. From a macro perspective, the overall execution time of all the algorithms increase with the size of the dataset. Specifically, the UG algorithm has the lowest execution time and is hardly affected by the size of the dataset. The AG and SDCAG algorithm perform an additional round of grid partition on the basis of the UG algorithm, and, therefore, they take a much longer time. In order to reduce the noise error, the SDCAG algorithm filters the grids with a raw count of 0 and allocates the similar grids into the same bucket. Thus, the overall execution time is slightly higher than the AG algorithm. Although the partition process of the Quad-opt algorithm does not consider the specific distribution state of the location information, the tree-based iterative process using depth-first traversal is primarily affected by the size of the dataset. Therefore, the overall execution time of the Quad-opt algorithm is significantly higher than other algorithms. The partition process of the DBP and Unbalanced quadtree algorithm need to be combined with the specific distribution of the location information. In the areas wherein the location distribution is particularly sparse or dense, they can avoid unnecessary partition. Therefore, they received better performance on the sparse dataset Storage and the dense dataset Yellow_trip. The proposed GCDPP algorithm firstly implements the uniform grid partition stage and then performs bottom-up neighborhood grid clustering. The grid-based clustering accelerates the merging of the spatial regions and saves much time compared to the process of adaptive grid partition and filtering. Thus, the execution time of the GCDPP algorithm is only secondary to the UG algorithm. It can be expected that, with an increasing in the size of dataset, the advantage of the proposed GCDPP algorithm will be more obvious.

6.3. The Influences of Partition Granularity

As discussed in Section 4.1, traditional privacy spatial decomposition methods are easily affected by partition granularity, resulting in over-partitioning or under-partitioning problems that not only impair the availability of published data but may also drag down the efficiency of the entire data publishing algorithm. This section analyses the relationships between partition granularity setting, accuracy of published results, and publishing algorithm efficiency. Since the privacy budget strength has no obvious effect on the setting of partition granularity, we take

ϵ = 1

as an example to analyze the average error of the published statistical results with different partition granularities. For the sake of fairness, we keep the initial partition granularity of all the comparison algorithms basically the same. The average error of the published statistical results can be expressed as follows:

A E = \frac{\sum_{i = 1}^{N} | D e n_{i}^{*} - D e n_{i} |}{| T |}

(10)

wherein N is the initial partition granularity of the location-based dataset, and

| T |

represents the size of the dataset.

D e n_{i}

stands for the original density of grid i, whereas

D e n_{i}^{*}

returns the published density of the corresponding area.

Figure 16 portrays the average error of all the algorithms under different partition granularity. It can be observed that, with an increase in the partition granularity, the average error of the tree-based partition algorithms grows gradually, whereas the average error of the grid-based partition algorithms do not change obviously. The primary reason is that, when the partition granularity is small, the coverage area of a single grid will be large and the published statistical result may contain more non-uniform assumption errors due to the inhomogeneous distribution of data within the region. Nevertheless, the introduced noise error will be small because of the less number of empty grids. With an increase in partition granularity, the coverage area of each grid will be reduced. The non-uniform assumption error for the grid area will decrease, whereas the total noise error will increase because the number of empty grids may increase. The proposed algorithm achieves the lowest average error on almost all experimental datasets demonstrating the advantage of the proposed algorithm in the availability of the published statistical result.

Intuitively, the execution time of partition algorithms will increase with the size of dataset and partition granularity. Figure 17 depicts the execution time of all the algorithms under different partition granularity. The partition progress of UG algorithm is independent of the distribution of dataset, so it always maintains the lowest execution time and hardly changes with the change of the partition granularity. For the other algorithms, the execution time increases significantly with the partition granularity. The execution time of the AG and SDCAG algorithms is slightly longer than that of the UG algorithm. With an increase in a dataset’s scale and distribution complexity, the efficiency of these two algorithms gradually deteriorates, especially for the largest and most complex dataset

Y e l l o w_t r i p

. For the tree-based partition algorithms, the increase of partition gradually causes more recursive operations and uniformity judgments; therefore, the execution time of these algorithms suffers significantly. While the proposed GCDPP algorithm maintains the independent grid partition just as the UG algorithm, its bottom-up grid clustering process overcomes the excessive recursive operations. Therefore, with an increase in partition granularity, the proposed algorithm is still better than the other algorithms. It is foreseeable that the advantages of the proposed algorithm will be more obvious in the case of location-based dataset with larger scale and more complex distribution.

7. Conclusions

The rapid development of big data technology and the widespread popularization of mobile smart terminals has made various location-based services highly relevant to the work and life of users. Accordingly, the issues of location privacy leakage caused by the release of location-based big data statistics have started attracting widespread attention. Achieving the release of statistics of location-based information via the spatial decomposition method and the differential privacy protection model can effectively avoid various risks caused by location privacy leakage on the premise of ensuring data availability. In order to reduce the loss of data availability caused by over-partitioning and under-partitioning in the traditional top-down spatial decomposition method, this paper proposed a bottom-up grid clustering and differential privacy protection publishing method. The accuracy of location-based services is met through uniform grid partition and density statistics. Privacy protection of the published data is achieved by incorporating Laplace noise to location-based big data statistics according to the differential privacy model. The merging of the blank areas and the uniform distribution areas has been realized by the neighborhood grid clustering in order to reduce the noise error and the uniform assumption error. Experiments and analysis on the real location-based datasets prove that the proposed grid clustering and differential privacy publishing method have obvious advantages in improving the range querying accuracy and operating efficiency.

However, the research still has some limitations. Firstly, most of the existing spatial partition methods of location big data are based on grid or tree structures. However, the areas where human actually work and live are primarily closely related to the distribution of infrastructure and cannot be well represented by grids or tree structures. Geographic segmentation methods such as road network structures did not have effective index methods and cannot achieve efficient querying services for location-based big data. Therefore, how to design more detailed and flexible partition method combined with efficient index structure will facilitate improving the availability of published results and is one of our future works. Secondly, the paper only focuses on the statistical partitioning and publishing of static location-based big data. However, a user’s location continuously and randomly changes over time. The dynamic changes of the locations of large number of users make the spatial partition structure of the previous statistical release time not applicable to the next release time. Therefore, how to design a statistical partition and publishing method combined with the dynamic change of location big data, and enhance the timeliness and practicability of published statistical results under the premise of ensuring users’ privacy, will be a further research direction. Finally, the paper discusses the statistical publishing method of location-based big data based on the centralized differential privacy model. The premise is that users’ locations are collected and statistically released via a trusted third-party. However, there is no completely reliable platform in practical applications. Problems such as technical failures, superuser leaks, and hacker attacks make the users’ original locations in danger. Therefore, transferring the privacy protection process of users’ locations to the their terminals and realizing local differential privacy location protection have become the research hotspot in recent years.

Author Contributions

Methodology and validation, Yan Yan; investigation, Zichao Sun, Fei Xu and Zhuoyue Dong; formal analysis, Yan Yan and Zichao Sun; resources, Quan Z. Sheng and Adnan Mahmood; data curation, Zichao Sun, Fei Xu and Zhuoyue Dong; writing—original draft preparation, Adnan Mahmood and Yan Yan; writing—review and editing, Quan Z. Sheng, Adnan Mahmood and Yan Yan; visualization, Adnan Mahmood and Zichao Sun; supervision and project administration, Quan Z. Sheng All authors read and agreed to the published version of the manuscript.

Funding

The research at hand was supported by the National Nature Science Foundation of China, No. 61762059.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: [http://www.infochimps.com/datasets/storage-facilities-by-neighborhood-2, http://www.infochimps.com/datasets/storage-facilities-by-landmarks, http://snap.stanford.edu/data/loc-gowalla.html, and https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page].

Conflicts of Interest

The authors declare no conflict of interest.

References

Liu, Z.; Qian, J.L.; Du, Y.Y.; Wang, N.; Yi, J.W.; Sun, Y.R.; Ma, T.; Pei, T.; Zhou, C.H. Multi-level spatial distribution estimation model of the inter-regional migrant population using multi-source spatio-temporal big data: A case study of migrants from Wuhan during the spread of COVID-19. J.-Geo-Inf. Sci. 2020, 22, 147–160. [Google Scholar]
Wu, H.; Gui, Z.; Yang, Z. Geospatial big data for urban planning and urban management. Geo-Spat. Inf. Sci. 2020, 23, 273–274. [Google Scholar] [CrossRef]
Mohammed, S.; Arabnia, H.R.; Qu, X.; Zhang, D.; Kim, T.H.; Zhao, J. IEEE access special section editorial: Big data technology and applications in intelligent transportation. IEEE Access 2020, 8, 201331–201344. [Google Scholar] [CrossRef]
Zhou, C.; Su, F.; Pei, T.; Zhang, A.; Du, Y.; Luo, B.; Cao, Z.D.; Wang, J.L.; Yuan, W.; Zhu, Y.Q.; et al. COVID-19: Challenges to GIS with big data. Geogr. Sustain. 2020, 1, 77–87. [Google Scholar] [CrossRef]
Gruschka, N.; Mavroeidis, V.; Vishi, K.; Jensen, M. Privacy issues and data protection in big data: A case study analysis under GDPR. In Proceedings of the 2018 IEEE International Conference on Big Data, Seattle, WA, USA, 10–13 December 2018; pp. 5027–5033. [Google Scholar]
Takbiri, N.; Houmansadr, A.; Goeckel, D.L.; Pishro-Nik, H. Privacy against statistical matching: Inter-user correlation. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–22 June 2018; pp. 1036–1040. [Google Scholar]
Primault, V.; Boutet, A.; Mokhtar, S.B.; Brunie, L. The long road to computational location privacy: A survey. IEEE Commun. Surv. Tutor. 2018, 21, 2772–2793. [Google Scholar] [CrossRef] [Green Version]
Yan, Y.; Eyeleko, A.H.; Mahmood, A.; Li, J.; Dong, Z.; Xu, F. Privacy preserving dynamic data release against synonymous linkage based on microaggregation. Sci. Rep. 2022, 12, 1–22. [Google Scholar] [CrossRef]
Li, D.; Shao, Z.; YU, W.; Zhu, X.; Zhou, S. Public epidemic prevention and control services based on big data of spatiotemporal location make cities more smart. Geomat. Inf. Sci. Wuhan Univ. 2020, 45, 475–487. [Google Scholar]
Erlingsson, Ú.; Pihur, V.; Korolova, A. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067. [Google Scholar]
Qardaji, W.; Yang, W.; Li, N. Differentially private grids for geospatial data. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering, Brisbane, QLD, Australia, 8–12 April 2013; pp. 757–768. [Google Scholar]
Xiong, P.; Zhang, L.; Zhu, T. Reward-based spatial crowdsourcing with differential privacy preservation. Enterp. Inf. Syst. 2017, 11, 1500–1517. [Google Scholar] [CrossRef]
Yan, Y.; Hao, X.H. Differential privacy partitioning algorithm based on adaptive density grids. J. Shandong Univ. (Nat. Sci.) 2018, 53, 12–22. [Google Scholar]
Zhou, G.; Tang, X.; Qin, S. Adaptive Grid Decomposition Algorithm based on Standard Deviation Circle Radius. Int. J. Perform. Eng. 2019, 15, 2145–2152. [Google Scholar]
Wei, J.; Lin, Y.; Yao, X.; Zhang, J. Differential privacy-based location protection in spatial crowdsourcing. IEEE Trans. Serv. Comput. 2022, 15, 45–58. [Google Scholar] [CrossRef]
Rodríguez, K.M.; Bossy, M.; Maftei, R.; Shekarforush, S.; Henry, C. New spatial decomposition method for accurate, mesh-independent agglomeration predictions in particle-laden flows. Appl. Math. Model. 2021, 90, 582–614. [Google Scholar] [CrossRef]
Cormode, G.; Procopiuc, C.; Srivastava, D.; Shen, E.; Yu, T. Differentially private spatial decompositions. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 20–31. [Google Scholar]
Wu, Y.; Lu, Q.; Cai, J.; Wang, X. Differential privacy two-dimensional data partitioning publication algorithm based on quad-tree. J. Huazhong Univ. Sci. Technol. (Nat. Sci. Ed.) 2016, 44, 99–104. [Google Scholar]
Yang, M.; Zhu, T.; Xiang, Y.; Zhou, W. Density-based location preservation for mobile crowdsensing with differential privacy. IEEE Access 2018, 6, 14779–14789. [Google Scholar] [CrossRef]
Yan, Y.; Gao, X.; Mahmood, A.; Feng, T.; Xie, P. Differential private spatial decomposition and location publishing based on unbalanced quadtree partition algorithm. IEEE Access 2020, 8, 104775–104787. [Google Scholar] [CrossRef]
Huang, S.; Chen, T.; Lu, Q.; Wu, Y.; Ye, S. Differentially privacy two-dimensional dataset partitioning publication algorithm based on kd-tree. J. Shandong Univ. (Eng. Sci.) 2015, 45, 24–29. [Google Scholar]
Yan, Y.; Hao, X.; Zhang, L. Hierarchical differential privacy hybrid decomposition algorithm for location big data. Clust. Comput. 2019, 22, 9269–9280. [Google Scholar] [CrossRef]
Ohadi, N.; Kamandi, A.; Shabankhah, M.; Fatemi, S.M.; Hosseini, S.M.; Mahmoudi, A. Sw-dbscan: A grid-based dbscan algorithm for large datasets. In Proceedings of the 2020 6th International Conference on Web Research, Tehran, Iran, 22–23 April 2020; pp. 139–145. [Google Scholar]
Suo, M.L.; Zhou, D.; An, R.M.; Li, S.L. Neighborhood density grid clustering and its applications. J. Tsinghua Univ. (Sci. Technol.) 2018, 58, 732–739. [Google Scholar]
Wu, B.; Wilamowski, B.M. A fast density and grid based clustering method for data with arbitrary shapes and noise. IEEE Trans. Ind. Inform. 2017, 13, 1620–1628. [Google Scholar] [CrossRef]
Xu, H.; Yao, S.; Li, Q.; Ye, Z. An improved k-means clustering algorithm. In Proceedings of the 2020 IEEE 5th International Symposium on Smart and Wireless Systems within the Conferences on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS-SWS), Dortmund, Germany, 17–18 September 2020; pp. 1–5. [Google Scholar]
Zhu, Q.; Tang, X.; Liu, Z. Revised dbscan clustering algorithm based on dual grid. In Proceedings of the 2020 Chinese Control In addition, Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 3461–3466. [Google Scholar]
Yu, Y.; Jia, Z.; Cao, L.; Zhao, J.D.; Liu, Z.W.; Liu, J.L. Fast density-based clustering algorithm for location big data. J. Softw. 2018, 29, 2470–2484. [Google Scholar]
Tareq, M.; Sundararajan, E.A.; Mohd, M.; Sani, N.S. Online clustering of evolving data streams using a density grid-based method. IEEE Access 2020, 8, 166472–166490. [Google Scholar] [CrossRef]
Hu, Y.S.; Lu, Y.H. Cell Clustering Algorithm Based on MapReduce and Strongly Connected Fusion. Comput. Sci. 2019, 46, 204–207+215. [Google Scholar]
Dwork, C. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Venice, Italy, 10–14 July 2006; pp. 1–12. [Google Scholar]
Dwork, C. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation, Xi’an, China, 25–29 April 2008; pp. 1–19. [Google Scholar]
Dwork, C. Calibrating noise to sensitivity in private data analysis. Lect. Notes Comput. Sci. 2012, 3876, 265–284. [Google Scholar]
Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
Losada-Rojas, L.L.; Gkritza, K. Individual and location-based characteristics associated with Autonomous Vehicle adoption in the Chicago metropolitan area: Implications for public health. J. Transp. Health 2021, 22, 101232. [Google Scholar] [CrossRef]
Hara, Y.; Yamaguchi, H. Japanese travel behavior trends and change under COVID-19 state-of-emergency declaration: Nationwide observation by mobile phone location data. Transp. Res. Interdiscip. Perspect. 2021, 9, 100288. [Google Scholar] [CrossRef]

Figure 1. Spatial distribution of COVID-19.

Figure 2. Normal distribution using RAPPOR.

Figure 3. The relationship of sparse data distribution and partition structure. (a) example of sparse distribution; (b) corresponding partition and range query.

Figure 4. The relationship of uniform data distribution and partition structure. (a) example of uniform distribution; (b) corresponding partition and range query.

Figure 5. The relationship of non-uniform data distribution and partition structure. (a) example of non-uniform distribution; (b) corresponding partition and range query.

Figure 6. Neighborhood grids.

Figure 7. Decomposition process of DWT.

Figure 8. Grid partition and clustering process of Storage dataset. (a) grid partition; (b) density matrix; (c) coefficient matrix after DWT; (d) grid clustering result.

Figure 9. Illustration of the grid clustering result and the spatial indexing structure. (a) supposed grid clustering result; (b) correspondent spatial indexing structure.

Figure 10. Three use cases of the querying range Q. (a) Use case (1); (b) Use case (2); (c) Use case (3); (d) Use case (4).

Figure 11. Comparison of querying accuracy on Storage dataset. (a) Storage dataset,

ϵ = 0.01

; (b) Storage dataset,

ϵ = 0.1

; (c) Storage dataset,

ϵ = 1

.

Figure 11. Comparison of querying accuracy on Storage dataset. (a) Storage dataset,

ϵ = 0.01

; (b) Storage dataset,

ϵ = 0.1

; (c) Storage dataset,

ϵ = 1

.

Figure 12. Comparison of querying accuracy on Landmark dataset. (a) Landmark dataset,

ϵ = 0.01

; (b) Landmark dataset,

ϵ = 0.1

; (c) Landmark dataset,

ϵ = 1

.

Figure 12. Comparison of querying accuracy on Landmark dataset. (a) Landmark dataset,

ϵ = 0.01

; (b) Landmark dataset,

ϵ = 0.1

; (c) Landmark dataset,

ϵ = 1

.

Figure 13. Comparison of querying accuracy on Checkin dataset. (a) Checkin dataset,

ϵ = 0.01

; (b) Checkin dataset,

ϵ = 0.1

; (c) Checkin dataset,

ϵ = 1

.

Figure 13. Comparison of querying accuracy on Checkin dataset. (a) Checkin dataset,

ϵ = 0.01

; (b) Checkin dataset,

ϵ = 0.1

; (c) Checkin dataset,

ϵ = 1

.

Figure 14. Comparison of querying accuracy on Yellow_trip dataset. (a) Yellow_trip dataset,

ϵ = 0.01

; (b) Yellow_trip dataset,

ϵ = 0.1

; (c) Yellow_trip dataset,

ϵ = 1

.

Figure 14. Comparison of querying accuracy on Yellow_trip dataset. (a) Yellow_trip dataset,

ϵ = 0.01

; (b) Yellow_trip dataset,

ϵ = 0.1

; (c) Yellow_trip dataset,

ϵ = 1

.

Figure 15. Comparison of operating efficiency of various algorithms under different datasets. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Figure 15. Comparison of operating efficiency of various algorithms under different datasets. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Figure 16. Partition granularity vs. average error. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Figure 16. Partition granularity vs. average error. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Figure 17. Partition granularity vs. execution time. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Figure 17. Partition granularity vs. execution time. (a) Storage dataset,

ϵ = 1

; (b) Landmark dataset,

ϵ = 1

; (c) Checkin dataset,

ϵ = 1

; (d) Yellow_trip dataset,

ϵ = 1

.

Table 1. Experimental datasets and querying range information.

Parameter		Storage	Landmark	Checkin	Yellow_trip
Data amount (points)		8938	870,052	1,000,000	10,996,214
Dataset coverage (longitude×latitude)		53 $^{\circ}$ × 23 $^{\circ}$	57 $^{\circ}$ × 25 $^{\circ}$	353 $^{\circ}$ × 143 $^{\circ}$	1 $^{\circ}$ × 1 $^{\circ}$
Querying range (longitude×latitude)	q $_{1}$	1.25 $^{\circ}$ × 0.625 $^{\circ}$	1.25 $^{\circ}$ × 0.625 $^{\circ}$	6 $^{\circ}$ × 3 $^{\circ}$	0.02 $^{\circ}$ × 0.02 $^{\circ}$
	q $_{2}$	2.5 $^{\circ}$ × 1.25 $^{\circ}$	2.5 $^{\circ}$ × 1.25 $^{\circ}$	12 $^{\circ}$ × 6 $^{\circ}$	0.04 $^{\circ}$ × 0.04 $^{\circ}$
	q $_{3}$	5 $^{\circ}$ × 2.5 $^{\circ}$	5 $^{\circ}$ × 2.5 $^{\circ}$	24 $^{\circ}$ × 12 $^{\circ}$	0.08 $^{\circ}$ × 0.08 $^{\circ}$
	q $_{4}$	10 $^{\circ}$ × 5 $^{\circ}$	10 $^{\circ}$ × 5 $^{\circ}$	48 $^{\circ}$ × 24 $^{\circ}$	0.16 $^{\circ}$ × 0.16 $^{\circ}$
	q $_{5}$	20 $^{\circ}$ × 10 $^{\circ}$	20 $^{\circ}$ × 10 $^{\circ}$	96 $^{\circ}$ × 48 $^{\circ}$	0.32 $^{\circ}$ × 0.32 $^{\circ}$
	q $_{6}$	40 $^{\circ}$ × 20 $^{\circ}$	40 $^{\circ}$ × 20 $^{\circ}$	192 $^{\circ}$ × 96 $^{\circ}$	0.64 $^{\circ}$ × 0.64 $^{\circ}$

Table 2. Comparison of time complexity.

Algorithm	Time Complexity
UG	$O (n)$
AG	$O (n)$
SDCAG	$O (n)$
Quad-opt	O(n $log (n)$ )
DBP	O(hn)
Unbalanced quadtree	O(hn)
GCDPP	O(n)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yan, Y.; Sun, Z.; Mahmood, A.; Xu, F.; Dong, Z.; Sheng, Q.Z. Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering. ISPRS Int. J. Geo-Inf. 2022, 11, 404. https://doi.org/10.3390/ijgi11070404

AMA Style

Yan Y, Sun Z, Mahmood A, Xu F, Dong Z, Sheng QZ. Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering. ISPRS International Journal of Geo-Information. 2022; 11(7):404. https://doi.org/10.3390/ijgi11070404

Chicago/Turabian Style

Yan, Yan, Zichao Sun, Adnan Mahmood, Fei Xu, Zhuoyue Dong, and Quan Z. Sheng. 2022. "Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering" ISPRS International Journal of Geo-Information 11, no. 7: 404. https://doi.org/10.3390/ijgi11070404

APA Style

Yan, Y., Sun, Z., Mahmood, A., Xu, F., Dong, Z., & Sheng, Q. Z. (2022). Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering. ISPRS International Journal of Geo-Information, 11(7), 404. https://doi.org/10.3390/ijgi11070404

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Achieving Differential Privacy Publishing of Location-Based Statistical Data Using Grid Clustering

Abstract

1. Introduction

2. Related Work

3. Differential Privacy

4. The Grid Clustering Algorithm for Location-Based Big Data Statistical Information

4.1. The Problems of Traditional Privacy Spatial Decomposition

4.2. Proposed Method

5. Statistical Publishing and Differential Privacy Protection Algorithm

6. Experiments and Analysis

6.1. Availability Analysis of the Published Location-Based Big Data Statistical Results

6.2. Efficiency Analysis of the Privacy Protection Publishing Algorithm

6.3. The Influences of Partition Granularity

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI