1. Introduction
With the rapid development of mobile intelligent devices and location technology, various location-based service (LBS) applications have brought convenience to people's lives. To enjoy these services, mobile users must provide specific location information, such as the nearest subway station, hospital, or bank. However, user location information is closely linked to personal living habits, health, economic conditions, and other private information [1], and it can be mined, analyzed, or used to infer users' private attributes. In recent years, several high-profile cases of location privacy leaks have had serious consequences, including (1) Stalking and physical harm: the application (app) Girls Around Me was found to collect user information from Facebook and Instagram to create a map of the locations of nearby women, without their knowledge or consent, potentially enabling stalking or even murder. (2) Identity theft: location sharing and geotagging in social media apps such as Snapchat and Instagram can reveal personal information about an individual's whereabouts, which can lead to identity theft, credit card fraud, and phishing attacks. (3) Theft: location sharing through social media platforms can also be used by thieves to determine when someone is away from home, making them a potential target for burglary. Thus, users who provide precise location data through LBS can face serious consequences [2]. To avoid these problems, there is a pressing need to protect users' location privacy.
Existing location privacy protection technologies include k-anonymity, l-diversity, and differential privacy (DP). k-anonymity and l-diversity generalize a user's real location into an area to achieve location protection. However, these techniques only protect users' privacy to a certain extent and cannot prevent homogeneity attacks [3] or background knowledge attacks [4]. Qian et al. described a privacy protection model that can resist background knowledge attacks and provides a quantitative evaluation method, namely differential privacy [5]. In recent years, DP-based LBS protection algorithms have become a focus of research, but they can fall short of effectively preventing continuous location tracking and identification [6].
To address these challenges, we aim to develop a method that makes it difficult for attackers to infer a user's exact location (location privacy) and sensitive attributes (query privacy) from query sequences, regardless of how much prior knowledge they possess. At the same time, the method must preserve the accuracy of each LBS query without additional overhead; that is, the final query results obtained by a user remain the same even after privacy protection is applied. With these goals in mind, we propose a privacy protection method based on differential privacy and L-clustering that is suitable for continuous location data. This method not only guarantees strong privacy but also maximizes data utility. The main contributions of this study are as follows:
(1) According to the distance and density between locations, an L-clustering algorithm is proposed to find the centroid of each cluster and replace all the locations within the cluster. Moreover, the continuous locations are divided into different regions of interest (ROIs) based on the user's access frequency at different locations. This method reduces the computational burden of differential privacy.
(2) A differential privacy-based location privacy protection algorithm (DPLPA) is proposed. The resident point is extracted based on whether the user’s access time, access frequency, and location contain sensitive information. In addition, a privacy budget is allocated to the resident point and cluster centroid. At the same time, Laplace noise is added to the resident point and cluster centroid to protect location privacy.
(3) Considering the user’s privacy preferences, different privacy budgets are allocated to different resident points, and the range of false location generation acceptable to users is determined to generate ROIs with higher utility. Theoretical analysis and experimental results show that DPLPA can effectively protect location privacy in LBS.
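Contribution (2) rests on the standard Laplace mechanism. As a minimal illustrative sketch (not the paper's implementation; the function names, the sensitivity value, and the example coordinates are our own assumptions), the snippet below perturbs a 2D cluster centroid with zero-mean Laplace noise whose scale is the sensitivity divided by the privacy budget ε:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from Laplace(0, scale) via inverse-transform sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def perturb_centroid(centroid, epsilon, sensitivity=1.0):
    """Add independent Laplace(sensitivity / epsilon) noise to each coordinate.
    A smaller epsilon (tighter privacy budget) means a larger noise scale."""
    scale = sensitivity / epsilon
    return tuple(c + laplace_noise(scale) for c in centroid)

random.seed(42)  # reproducibility of the illustration only
noisy = perturb_centroid((39.9042, 116.4074), epsilon=0.5)  # hypothetical centroid
```

Note that a smaller ε yields a larger noise scale and hence stronger protection, which matches the trend reported for the privacy budget in Section 5.3.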
The rest of the paper is organized as follows. Section 2 introduces related work on privacy protection in LBS and the major associated challenges. Section 3 provides the definitions of differential privacy, the system structure, and the threat model of the algorithm. Section 4 describes the proposed L-clustering algorithm and the DPLPA and theoretically analyzes the algorithms in terms of security, time complexity, degree of privacy protection, and data utility. Section 5 presents simulation experiments evaluating the clustering accuracy, degree of privacy protection, data utility, and running time of each algorithm. Finally, Section 6 concludes the paper and outlines future perspectives.
2. Related Works
Many studies have proposed methods for LBS privacy protection involving k-anonymity, l-diversity, and differential privacy [7,8,9]. Zhang et al. [10] proposed a novel location privacy protection method based on geographic semantics that ensures k-anonymity. In this method, a candidate set is constructed using a maximum and minimum distance multi-center clustering algorithm, and virtual locations are generated based on semantic similarity. Xing et al. [11] proposed a modified privacy protection scheme based on double k-anonymity that hides users' locations and request information. Tian et al. [12] constructed a semantic- and trade-off-aware location privacy protection mechanism (STA-LPPM) in which a multi-objective particle swarm optimization algorithm generates an optimal anonymous set, balancing privacy protection and quality of service. A blockchain-enabled framework for peer-to-peer (P2P) energy trading was designed in [13], along with an anonymous proof-of-location algorithm that allows clients to choose their trading partners without revealing their real locations. Zheng et al. [14] employed a dynamically adjustable k-anonymity (DAK) algorithm and a dynamic location privacy protection (DLPP) algorithm based on virtual locations, in which sequences are perturbed by adding and deleting moving points. However, the effectiveness of combining l-diversity and k-anonymity is limited by the data distribution and by background knowledge attacks; as a result, the level of privacy protection cannot be guaranteed.
In addition to the above methods, LBS privacy protection models have been built on location trees, Markov models, and clustering. The main idea behind a location tree is to construct a tree structure according to certain rules. In [15], a prefix tree and DP are used to protect the privacy of trajectory data, with the nodes of the tree storing trajectory segments. Li et al. [16] established a hierarchical tree structure based on location attributes and proposed an attribute-aware privacy-preserving scheme for LBS. A Markov model can simulate the temporal correlation between a user's real locations and predict the next possible location based on the transition probability of each location. Yuan et al. [17] proposed a new location privacy protection method for a Cloud-of-Things system in which a Markov model analyzes users' mobility behavior; the proposed location-hiding algorithm meets users' privacy requirements by expanding the sizes of areas. Partovi et al. [18] modeled a Markov decision process and introduced a new location privacy measurement method to ensure that a user's specified privacy level can be maintained over an infinite time horizon. Yang et al. [19] used k-anonymity to enhance privacy protection and clustering technology to group users by learning their trajectory data. A graph-based trajectory data representation model was proposed in [20], in which the similarity between trajectories is calculated using an edge- and node-based measure and similar trajectories are clustered and identified based on their paths. Clustering can capture users' activity patterns over a certain period and remove locations with low access frequencies, making it very flexible.
Differential privacy is a useful method owing to its strong privacy guarantees. It can efficiently prevent inference attacks by adding random noise to the original query results, so that adding or deleting individual records in the dataset does not noticeably affect the query results. It is therefore difficult for attackers to infer real data through multiple queries, which achieves privacy protection. Stephanie et al. [21] used DP to protect location data: random noise is added to obfuscate a user's location, and the centroids of the clusters are gathered on a cloud server to generate the final clusters, providing an efficient privacy-preservation solution for location-based data-stream processing. Hu et al. [22] considered the personalized security requirements of different users to achieve location protection based on users' historical global positioning system (GPS) trajectory data and the natural attributes of locations; however, this method has a heavy computational load, and the accuracy of its user sensitivity evaluation is poor. Wang et al. [23] proposed a privacy-protected social tie mining (P-STM) method that can identify social connections from users' daily trajectories and offers an indicative dense region to calibrate personal daily trajectories. In addition, a clustering analysis method for spatiotemporal sequence data was proposed in [24]; it provides a basis for privacy protection by constructing continuous time regions and includes a data publishing mechanism that can prevent inference attacks. However, this mechanism mainly distributes offline group location data and cannot update other relevant information. A new framework (PrivSem) was presented in [25], combining k-anonymity, l-semantic diversity, and DP. It guarantees location privacy, but treating a non-sensitive location as sensitive can increase the cost of privacy protection.
The literature review is summarized in Table 1.
5. Experimental Results Analysis
5.1. Experimental Setting
Our experiments were implemented in Python 3.7 and run on Windows 10, with an Intel Core i7 3.6 GHz CPU and 16 GB of RAM. The real datasets Geolife [28] and Gowalla [29] were used in our experiments. The Geolife dataset contains 17,621 GPS trajectories of 182 users collected over three years; each sample point contains information such as the latitude, longitude, altitude, and time. The dataset covers a wide range of activities, including traveling home as well as recreational and sports activities. The Gowalla dataset is a location-based social network dataset consisting of 196,591 users and 6,442,890 records of user behavior, including the user id, check-in time, latitude, longitude, and location id. Here, only the user id and location id are used.
We compared the DPLPA with the LPPA-PSRDU [22], P-STM [23], LPPM [30], and TLDP [31]. The performance of the proposed algorithm was measured in terms of clustering accuracy, level of privacy protection, data utility, and running time.
5.2. Clustering Accuracy
The clustering accuracy of the L-clustering algorithm was evaluated by comparing its recall, precision, and F-measure with those of the K-means [32] and DBSCAN algorithms, as shown in Figure 6. The precision (P), recall (R), and F-measure (F) were calculated using the following formulas:

P = TP / (TP + FP),  R = TP / (TP + FN),  F = (1 + β^2) · P · R / (β^2 · P + R),

where TP represents true positives, FP represents false positives, and FN represents false negatives. The F-measure jointly considers recall and precision, where β is a weight value that adjusts the relative weight of P and R (β = 1 yields the balanced F1 score).
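These measures can be computed directly from the confusion counts. The helper below is a straightforward sketch (the function names are ours, not from the paper); the default β = 1 gives the balanced F1 score:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted harmonic combination of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

For example, with TP = 8, FP = 2, and FN = 2, both precision and recall are 0.8, and the F1 score is also 0.8.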
As shown in Figure 6, the L-clustering algorithm outperformed the K-means and DBSCAN algorithms. K-means divides the data into k clusters so as to minimize the sum of the squared distances between the data points and their respective cluster centers, but it may perform poorly on clusters of arbitrary shape or size. DBSCAN, which groups dense data points and identifies outliers, can find clusters of arbitrary shape and size and is less sensitive to the initial parameter values than K-means; however, it may produce sub-optimal clusters when the data have widely varying densities. The L-clustering algorithm is a density-based clustering algorithm that identifies high-density core data points and then merges smaller adjacent data points into larger clusters. L-clustering can process datasets with varying cluster densities and can detect clusters of different shapes and sizes, making it better suited to the application scenario described in this paper.
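The full L-clustering algorithm is defined in Section 4. Purely for intuition about the density-based grouping it relies on, the sketch below is our own simplified stand-in (not the authors' algorithm): it chains together points whose radius-`eps` neighborhoods overlap at core points and then replaces each cluster with its centroid, mirroring the "replace all locations in a cluster with the centroid" step:

```python
from math import dist, fsum

def density_clusters(points, eps=1.0, min_pts=2):
    """Group points whose eps-neighborhoods chain together through core points,
    then report one centroid per cluster. Points in no cluster keep label -1."""
    n = len(points)
    neighbors = [[j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [-1] * n
    cid = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        labels[i] = cid            # expand a new cluster from this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid
                if len(neighbors[j]) >= min_pts:   # only core points expand further
                    queue.extend(neighbors[j])
        cid += 1
    centroids = []
    for c in range(cid):
        members = [points[k] for k in range(n) if labels[k] == c]
        centroids.append(tuple(fsum(coord) / len(members) for coord in zip(*members)))
    return labels, centroids
```

Two well-separated groups of points thus collapse into two centroids, which is the granularity at which Laplace noise is later applied.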
5.3. Privacy Protection Degree
We analyzed the effects of the privacy budget ε, the cluster density ρ, and the number of locations N on the level of privacy protection. The effect of ε on the level of privacy protection is illustrated in the bar chart in Figure 7. The effect of the cluster density ρ (the number of locations per square meter, N/m²) on the level of privacy protection is shown in Figure 8.
In Figure 7, the X-axis represents ε and the Y-axis represents the corresponding level of privacy protection. The dotted yellow line shows that the level of privacy protection decreased as ε increased, which follows from the Laplace probability density function. For the same value of ε, the DPLPA achieved the highest level of privacy protection, followed by the TLDP, with the P-STM the lowest.
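The decreasing trend in Figure 7 can be traced back to the Laplace probability density itself. In standard DP notation (with Δf the query sensitivity, a detail not restated here), the noise distribution and its variance are:

```latex
f(x \mid b) = \frac{1}{2b}\,\exp\!\left(-\frac{|x|}{b}\right),
\qquad b = \frac{\Delta f}{\varepsilon},
\qquad \operatorname{Var}(X) = 2b^{2} = \frac{2(\Delta f)^{2}}{\varepsilon^{2}}.
```

As the privacy budget grows, the scale b and the noise variance shrink, so less perturbation is applied and the measured level of privacy protection falls, which is exactly the behavior of the dotted line in Figure 7.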
As shown in Figure 8, the level of privacy protection increased as ρ increased from 0 to 10. At high densities, a single centroid was generated and all the locations were replaced with that unique centroid, enhancing the level of privacy protection. The DPLPA achieved a higher level of privacy protection than the baselines.
Figure 9 shows the levels of privacy protection corresponding to different numbers of locations N; the values of N used in the experiments were 100, 200, 300, 400, 500, and 600. As expected, the level of privacy protection increased as N decreased; that is, the higher the value of N, the lower the level of privacy protection, because a higher number of locations requires a larger total privacy budget, thereby reducing the level of privacy protection. Similarly, for the same value of N, the DPLPA demonstrated the highest level of privacy protection.
5.4. Data Utility
By evaluating both data utility and privacy, we can assess how the different methods handle the trade-off between these two aspects. Comparing the DPLPA with the baselines makes the advantages of the DPLPA in terms of data utility U evident. The effect of ε on U was analyzed, as shown in Figure 10. U increased as ε increased because a larger ε lowers the level of privacy protection and less noise needs to be added, resulting in higher data utility. The data utility of the LPPM was the worst because it considers many factors affecting the location information, resulting in a loss of data integrity. The data utility of the DPLPA was superior to that of the baselines, with minimal error in the distributed positions.
The effects of ρ and N on U were also analyzed, as shown in Figures 11 and 12, respectively. For both datasets, U increased with the increase in ρ and with the decrease in N, although the growth rate gradually slowed. The data utility of our method, the DPLPA, was superior to that of the four baselines regardless of the ρ or N values. This is because the DPLPA first eliminates positions with low access frequencies before location clustering, reducing the interference of invalid location information and improving the data utilization rate.
Generally speaking, the proposed method ensures high data utility while maintaining a high level of privacy. Since data utility also reflects service quality to some extent, considering the experimental results in Section 5.3, the DPLPA also offers good service performance. We can therefore conclude that the DPLPA provides a favorable trade-off between privacy and data utility for location-based services.
5.5. Time Complexity Analysis
In this group of experiments, each experiment was executed five times, and the average value was used as the final value. The effect of the privacy budget ε on the running time was analyzed, as shown in Figure 13. The running time of each algorithm increased with ε: the larger the ε, the longer it took to allocate the privacy budget and hence the longer the running time. Because the DPLPA only extracts the resident points and adds noise to them, its running time was the shortest, while that of the LPPA-PSRDU algorithm was the longest.
The effect of the cluster density ρ on the running time was analyzed, as shown in Figure 14. The experiments showed that, for both datasets, the running time increased with ρ. The running time of the DPLPA was the shortest and that of the TLDP was the longest, although the TLDP achieved performance similar to that of the DPLPA in terms of data utility and privacy protection.
Similarly, the effect of the number of locations N on the running time was analyzed, as shown in Figure 15. The experiments showed that, for both datasets, the running times of the five methods increased with N while remaining within seconds. The DPLPA still had the shortest running time. Although the running time of our method showed a growth trend, the growth was relatively gradual, indicating that the proposed method retains a clear advantage in the tested simulation settings.
5.6. Location Privacy Protection in Practical Scenarios Based on DPLPA Methods
Taking Google Maps as an example, the blue line in Figure 16 represents the moving trajectory of user Ming; Ming is represented by a red dot and the other users by black dots. Assume that six users use the LBS to query nearby bus stations, banks, hospitals, etc.; their shared information, including their current location coordinates (longitude, latitude), query locations, and query times, is shown in Figure 16. For the first five users, the number of query results for "hospital" is 1; when the sixth user, Ming, is added, it becomes 2. An attacker can therefore infer that Ming's query location is also "hospital".
The DPLPA method proposed in this paper processes sensitive information using a differential privacy mechanism so that, when a user shares their location information, an attacker cannot infer their exact location. Specifically, when a user shares their location to access a service, the first step is to extract the user's resident points, including long-duration resident points, highly frequented access points, and location points containing sensitive information. Next, multiple groups of the user's continuous locations are simplified and clustered to obtain ROIs. The ROIs are then replaced with their centroids, and Laplace noise satisfying differential privacy is added. As a result, the probability of obtaining a specific result through multiple queries remains consistent, and an attacker's knowledge does not change due to the appearance of Ming.
It can be concluded from this real application scenario that the differential privacy mechanism reduces the risk of an attacker obtaining sensitive information and breaks the link between identity and location, effectively protecting the privacy of users.
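The attack on the hospital count in Figure 16 is a classic differencing attack, and it is instructive to see how little code is needed to blunt it. The sketch below is our own illustration (the query lists and the ε value are invented for the example, not taken from the paper): a count query has sensitivity 1, so releasing it with Laplace(1/ε) noise makes the five-user and six-user datasets produce statistically overlapping outputs.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One draw from Laplace(0, scale) via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(queries, target, epsilon):
    """Release the number of queries equal to `target` under epsilon-DP.
    Sensitivity is 1: adding or removing one user changes the count by at most 1."""
    true_count = sum(q == target for q in queries)
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query logs: without and with user Ming.
without_ming = ["bank", "bus station", "hospital", "bank", "bus station"]
with_ming = without_ming + ["hospital"]

random.seed(7)  # reproducibility of the illustration only
released = noisy_count(with_ming, "hospital", epsilon=0.5)
```

With ε = 0.5, the noise has standard deviation √2/ε ≈ 2.8, far larger than the gap of 1 between the two true counts, so observing the released value does not let the attacker decide whether Ming queried "hospital".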
5.7. Comprehensive Analysis
Here, we compare existing works with the proposed DPLPA in terms of privacy protection, data utility, computational overhead, location continuity, and real application scenarios; the results are shown in Table 4. From the results, it can be seen that, aside from our method, none of these works address location continuity, and the proposed DPLPA exhibits good overall performance. Of course, our method is not perfect and has some limitations that need to be addressed. For example, compared with existing state-of-the-art deep learning methods, the query accuracy of the proposed method is slightly lower.
6. Conclusions
In this paper, we studied the privacy protection of continuous location data based on differential privacy and realized differentially private location protection by constructing ROIs. An L-clustering algorithm was proposed that divides continuous locations into different clusters according to distance and density and computes a centroid for each cluster. A location data privacy protection algorithm (DPLPA) was then proposed, which allocates a privacy budget to the different resident points and centroids and adds Laplace noise to achieve location privacy protection. The experimental results show that the DPLPA achieves competitive performance in terms of the level of privacy protection, data utility, and time consumption.
The main contribution of this study is an effective method for protecting users' location privacy in LBS. Compared with other works, the proposed method can effectively ensure users' location privacy without affecting the efficiency, accuracy, or availability of each LBS query. Our method is therefore valuable for the protection of user privacy in LBS and can be easily integrated into existing LBS applications, indicating that it can have a positive impact on building privacy-protected LBS applications. However, our work still leaves room for improvement. For example, due to the diversity of LBS applications, we need to further study how to connect our method to each application interface. Furthermore, our approach only considers the privacy protection of users' continuous historical locations, not their real-time locations. In future work, we will carry out further research on these problems.