1. Introduction
Knowledge retrieval and data processing are catalysts for innovation. The continuous advances in information and communication technologies (ICT) and the efficient processing of data allow the extraction of new knowledge by discovering non-obvious patterns and correlations in the data. Nevertheless, such knowledge extraction procedures may threaten individuals’ privacy if proper protection measures are not implemented [1,2,3]. For instance, an attacker may use publicly available datasets to obtain insights about individuals and extract knowledge by exploiting correlations that were not obvious from examining a single dataset [4]. Therefore, before disclosing any data, privacy protection procedures (e.g., anonymization, pseudonymization, aggregation, generalization) must be applied. A wide variety of privacy models and protection mechanisms have been proposed in the literature to guarantee anonymity (at different levels, depending on the model used) when disclosing data [5]. Since most privacy protection methods are based on modifying, perturbing, or deleting original data, their main drawback is that they negatively affect the utility of the data. Hence, there is a need to find a proper trade-off between data utility and privacy.
One of the most well-known disciplines studying methods to protect individuals’ private information is Statistical Disclosure Control (SDC) [6], which seeks to anonymize microdata sets (i.e., datasets consisting of multiple records corresponding to individual respondents) in such a way that it is not possible to re-identify the respondent corresponding to any particular record in the published microdata set. Microaggregation [7], which perturbs microdata sets by aggregating the attribute values of groups of k records so as to reduce re-identification risk by achieving k-anonymity, stands out among the most widely used families of SDC methods. It is usually applied by statistical agencies to limit the disclosure of sensitive microdata, and it has been used to protect data in a variety of fields, such as healthcare [8], smart cities [9], and collaborative filtering applications [10], to name a few.
Although the univariate microaggregation problem can be solved optimally in polynomial time, optimal multivariate microaggregation is an NP-hard problem [11]. Thus, solving the multivariate problem requires heuristic approaches that aim to minimize the amount of data distortion (often measured in terms of information loss) whilst guaranteeing a desired privacy level (typically determined by a parameter k that defines the minimum cardinality of the aggregated groups).
1.1. Contribution and Research Questions
In this article, we propose a novel solution for the multivariate microaggregation problem, inspired by heuristic solutions to the Traveling Salesman Problem (TSP) and the use of the optimal univariate microaggregation algorithm of Hansen and Mukherjee (HM) [12]. Given an ordered numerical vector, the HM algorithm creates the optimal k-partition (i.e., the optimal univariate microaggregation solution). Hence, our intuition is that, if we feed the HM algorithm with a good ordering of the records in a multivariate dataset, it will output a good (although not necessarily optimal) k-partition of the multivariate dataset.
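To make the role of the HM algorithm concrete, the following minimal sketch (in R, the language of our experimental implementation) computes the optimal k-partition of an ordered vector through the dynamic-programming equivalent of the HM shortest-path construction, exploiting the fact that optimal groups have sizes between k and 2k-1. The function name hm_partition is ours, and the sketch favors clarity over the efficiency of the original algorithm.

```r
# Minimal sketch of optimal univariate microaggregation (Hansen-Mukherjee):
# an optimal k-partition of a sorted vector uses contiguous groups of
# size k..2k-1; we find it by dynamic programming over group endpoints.
hm_partition <- function(x, k) {
  n <- length(x)
  stopifnot(n >= k)
  sse <- function(i, j) sum((x[i:j] - mean(x[i:j]))^2)  # within-group SSE of x[i..j]
  best <- c(0, rep(Inf, n))   # best[j + 1]: minimal total SSE of a partition of x[1..j]
  back <- integer(n + 1)      # back-pointers to recover the partition
  for (j in seq_len(n)) {
    if (j < k) next                        # no feasible group can end at j
    for (g in k:min(2 * k - 1, j)) {       # candidate last group x[(j-g+1)..j]
      i <- j - g + 1
      cand <- best[i] + sse(i, j)          # Inf if x[1..i-1] cannot be partitioned
      if (cand < best[j + 1]) { best[j + 1] <- cand; back[j + 1] <- i }
    }
  }
  groups <- list(); j <- n                 # walk the back-pointers
  while (j > 0) { i <- back[j + 1]; groups <- c(list(i:j), groups); j <- i - 1 }
  groups
}

# Example: optimal 3-partition (indices into the sorted vector)
hm_partition(sort(c(1, 2, 2, 8, 9, 9, 10, 30, 31, 33)), k = 3)
```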
Ordering the records of a univariate dataset is trivial. However, ordering the records of a multivariate dataset, in which every record has p attributes, is not obvious, since it is not apparent how to determine the precedence of one element over another. Thus, the primary question is:
Q1: How can this ordering be created when the records are in ℝ^p?
We suggest that a possible order for the records in ℝ^p is determined by the Hamiltonian path resulting from solving the Traveling Salesman Problem, in which the goal is to find the path that travels through all elements of a set exactly once whilst minimizing the total length of the path. Optimally solving the TSP is known to be NP-hard, but very good heuristic solutions are available. Hence, our intuition is that good heuristic solutions of the TSP (i.e., those with shorter path lengths) would provide a Hamiltonian path that could be used as an ordered vector for the HM optimal univariate microaggregation algorithm, resulting in a good multivariate microaggregation solution.
The quality of a TSP solution is measured in terms of “path length”: the shorter the length, the better the solution. However, the quality of a microaggregation is measured in terms of information loss. Given a cardinality parameter k (which sets the minimum size of the aggregation clusters), the lower the information loss, the better the microaggregation. Hence, the next questions that we aim to answer are:
Q2: Are the length of the Hamiltonian path and the information loss of the microaggregation related? That is, do shorter Hamiltonian paths lead to microaggregation solutions with lower information loss?
and
Q3: Is the length of the Hamiltonian path the only factor affecting information loss, or does the particular construction of the path (regardless of its length) affect the information loss?
Overall, the key question is:
Q4: Does this approach provide better solutions (in terms of information loss) than the best performing microaggregation methods in the literature?
In order to answer these questions, we have tested seven TSP solvers combined with the HM algorithm, virtually applied in a multivariate manner (hence, Multivariate HM (MHM)). In particular, we have tested the “Concorde” heuristic which, to the best of our knowledge, had never been used for microaggregation before. In addition, we have tested well-known classic microaggregation methods (i.e., MDAV and V-MDAV) and an advanced refinement of the former (i.e., MDAV-LK-MHM).
With the aim of testing all the aforementioned approaches on a variety of datasets, we have used three classical benchmarks (i.e., Census, Tarragona, and EIA) and three novel datasets containing trajectory data retrieved from public sources, which leads to our last research question:
Q5: Do TSP-based microaggregation methods perform better than current solutions on trajectory datasets?
1.2. Plan of the Article
The rest of the article aims to answer the research questions above, and it is organized as follows: Section 2 provides the reader with some fundamental knowledge on Statistical Disclosure Control and microaggregation. In addition, it introduces the basics of the Traveling Salesman Problem and an overview of the existing heuristics to solve it. Next, Section 3 analyzes related work and highlights the novelty of our proposal compared with the state of the art. Section 4 describes our proposal, which is later thoroughly tested and compared with well-known classical and state-of-the-art microaggregation methods in Section 5. Section 6 discusses the research questions and the main benefits of our proposal. The article concludes in Section 7 with some final remarks and comments on future research lines.
3. Related Work on Microaggregation
There is a wide variety of heuristics to solve the multivariate microaggregation problem in the literature. One of the most well-known methods is Maximum Distance to Average Vector (MDAV), proposed by Domingo-Ferrer et al. [17]. This method iteratively creates clusters of k records, starting with the records furthest from the dataset centroid. A variant of MDAV, the Centroid-Based Fixed Size method (CBFS) [18], was proposed by Laszlo and Mukherjee; it also has optimized versions based on kd-tree neighborhood search, such as KD-CBFS and KD-CBFSapp [19]. The Two Fixed Reference Points (TFRP) method, proposed by Chang et al. [20], uses the two most extreme points of the dataset at each iteration as references to create clusters. Differential-privacy-based microaggregation was explored by Yang et al. [21], who created a variant of the MDAV algorithm that uses the correlations between attributes to select the minimum noise required to achieve the desired privacy level. In addition, V-MDAV, a variable group-size heuristic based on MDAV, was introduced by Solanas et al. in Reference [13] with the aim of relaxing the cardinality constraints of fixed-size microaggregation, allowing clusters to better adapt to the data and reduce the sum of squared errors (SSE).
Laszlo and Mukherjee [18] also approached the microaggregation problem through minimum spanning trees, creating graph structures that can be pruned according to each node’s associated weight to form the groups. Lin et al. proposed a Density-Based Algorithm (DBA) [22], which first forms groups of records in descending order of density and then fine-tunes these groups in reverse order. The successive Group Selection based on sequential Minimization of SSE (GSMS) method [23], proposed by Panagiotakis et al., reduces information loss by discarding, at each step, the candidate cluster that minimizes the SSE of the remaining records. Some methods are built upon the HM algorithm; for example, Mortazavi et al. proposed the IMHM method [24], and Domingo-Ferrer et al. [17] proposed grouping heuristics that combine several methods, such as Nearest Point Next (NPN-MHM), MDAV-MHM, and CBFS-MHM.
Other approaches have focused on the efficiency of the microaggregation procedure; for example, the Fast Data-oriented Microaggregation (FDM) method proposed by Mortazavi et al. [25] efficiently anonymizes large multivariate numerical datasets for multiple successive values of k. Interested readers can find more detailed information about microaggregation in References [5,26].
The work most similar to ours is the one presented by Heaton and Mukherjee in Reference [27]. The authors use TSP tour optimization heuristics (e.g., 2-opt, 3-opt) to refine a path created with the information of a multivariate microaggregation method (e.g., MDAV, MD, CBFS). Notice that, in our proposed method (described in the next section), we use tour construction TSP heuristics instead of optimization heuristics; thus, we eliminate the need for a multivariate microaggregation method as a pre-processing step, and we decrease the computational time without hindering data utility.
4. Our Method
Our proposal is built upon two main building blocks: a TSP tour construction heuristic (H) and the optimal univariate microaggregation algorithm of Hansen and Mukherjee (HM). As explained in Section 2, the HM algorithm is applied to univariate numerical data because it requires the input elements to be ordered. However, we virtually use it with multivariate data; when we do so, we refer to it as Multivariate Hansen-Mukherjee (MHM), although in practice the algorithm is univariate. Since our proposal is based on a heuristic (H) to obtain a Hamiltonian path and on the MHM algorithm, we have come to call it HMHM-microaggregation or, for short, (HM)²-Micro.
Given a multivariate microdata set D with p columns and r rows, we model it as a complete graph G = (N, E), where each row is represented by a node (or a city, if we think in terms of the TSP), and each edge e_ij ∈ E is weighted with the Euclidean distance between nodes n_i and n_j (or the distance between cities in TSP terms). Hence, we have a set N of r nodes, each representing a row of the microdata set in the multivariate space ℝ^p.
In a nutshell, we use H over G to create a Hamiltonian path that travels across all nodes. This path is a permutation (π) of the nodes in N, and de facto it determines an order for the nodes (i.e., it provides a sense of precedence between nodes). Hence, although D is multivariate, its rows, represented as nodes in N, can be sorted in a univariate permutation π that we use as input to the MHM algorithm. As a result, the MHM algorithm returns the optimal univariate k-partition of π, that is, the set S of disjoint subsets defining the clusters of N. Hence, since each node represents a row in D, which is indeed multivariate, we have obtained a multivariate microaggregation of the rows in D and provided a solution to the multivariate microaggregation problem. Notice that, although MHM returns the optimal k-partition of π, this does not imply that the resulting microaggregation of D is optimal. A schematic of our solution is depicted in Figure 2.
Although the foundation of our proposal described above is fairly straightforward, it has the beauty of putting together complex mathematical building blocks from the multivariate and univariate worlds in a simple yet practical manner. In addition, our solution is very flexible, since it allows the use of any heuristic H to create the Hamiltonian path, and it allows for comprehensive studies, such as the one we report in the next section.
Note that most TSP heuristics output a Hamiltonian cycle. However, since we need a Hamiltonian path, we use the well-known solution of adding a dummy node to the graph (i.e., a theoretical node whose distance to all other nodes is zero), and we cut the cycle by eliminating this node so as to obtain a Hamiltonian path.
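As an illustration, and assuming the R TSP package used in our experiments, the cycle-to-path conversion can be sketched as follows (the toy data and variable names are ours): insert_dummy() adds the zero-distance node and cut_tour() removes it, leaving a Hamiltonian path over the original nodes.

```r
library(TSP)

set.seed(42)
D   <- matrix(rnorm(100 * 4), ncol = 4)       # toy stand-in for a microdata set
tsp <- TSP(dist(scale(D)))                    # complete graph on standardized rows
tsp <- insert_dummy(tsp, label = "cut")       # dummy node at distance 0 from all nodes
cycle <- solve_TSP(tsp, method = "nearest_insertion")
path  <- cut_tour(cycle, "cut")               # Hamiltonian path: a permutation of rows
```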
For the sake of completeness, we summarize our proposal step-by-step in Algorithm 1, and we comment on it next. Our solution can be seen as a meta-heuristic for the multivariate microaggregation problem, since it can accommodate any heuristic (H) able to create a Hamiltonian cycle from a complete graph (G), and it can deal with any privacy parameter (k). Thus, our algorithm receives as input a numerical multivariate microdata set D, with p columns (attributes) and r rows, that has to be microaggregated, a heuristic H, and a privacy parameter k (see Algorithm 1: line 1). In order to avoid bias towards higher-magnitude variables, the original dataset D (understood as a matrix) is standardized by subtracting from each element the average of its column and dividing the result by the standard deviation of the column. The result is a standardized dataset D_s in which each column has zero mean and unit standard deviation (see Algorithm 1: line 2). Next, the distance matrix DM is computed. Each element dm_ij contains the Euclidean distance between row i and row j in D_s; hence, DM is a square (r × r) matrix (see Algorithm 1: line 3). In order to be able to cut the Hamiltonian cycle and obtain a Hamiltonian path, we add a dummy node to the dataset by adding a zero column and a zero row to DM, generating DM′, which is a square ((r+1) × (r+1)) matrix (see Algorithm 1: line 4). DM′ is, in fact, a weighted adjacency matrix that defines a graph G with nodes N and edges E. With this matrix as an input, we compute a Hamiltonian cycle HC on G by applying a TSP heuristic H (see Algorithm 1: line 5). Notice that this heuristic H could be any one that gets a weighted graph as input and returns a Hamiltonian cycle; some examples are Concorde, Nearest Neighbor, Repetitive Nearest Neighbor, and the insertion algorithms. After obtaining HC, we cut it by removing the dummy node (see Algorithm 1: line 6), and we obtain a Hamiltonian path HP that defines a permutation (π) of the nodes in N and determines an order for the nodes that can be input to the MHM algorithm to obtain its optimal k-partition S (see Algorithm 1: line 7). S is a set of disjoint subsets defining the clusters of nodes in N. Hence, with S and D, we can create a microaggregated dataset D′ by replacing each row in D with the average vector of the k-partition subset to which it belongs (see Algorithm 1: line 8).
After applying the algorithm, we have transformed the original dataset D into a dataset D′ that has been microaggregated so as to guarantee the privacy criteria established by k.
Algorithm 1 (HM)²-Micro
1: function (HM)²-MICRO(Microdata set D, TSP-Heuristic H, Privacy parameter k)
2:   D_s = StandardizeDataset(D)
3:   DM = ComputeDistanceMatrix(D_s)
4:   DM′ = InsertDummyNode(DM)
5:   HC = CreateHamiltonianCycle(DM′, H)
6:   HP = CutDummyNode(HC)
7:   S = MHM(D_s, HP, k)
8:   D′ = BuildMicroaggregatedDataSet(D, S)
9:   return D′
10: end function
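A compact R sketch of Algorithm 1 is given below as an illustration; it assumes the TSP package for line 5, generalizes the hm_partition sketch of Section 1.1 to multivariate rows taken in path order (our reading of MHM) for line 7, and uses function names of our own throughout.

```r
library(TSP)

# MHM: HM-style dynamic programming over the rows of Ds visited in 'path' order;
# the SSE of a candidate group is computed over all p columns.
mhm_partition <- function(Ds, path, k) {
  n <- length(path)
  sse <- function(i, j) {
    g <- Ds[path[i:j], , drop = FALSE]
    sum(sweep(g, 2, colMeans(g))^2)            # multivariate within-group SSE
  }
  best <- c(0, rep(Inf, n)); back <- integer(n + 1)
  for (j in seq_len(n)) {
    if (j < k) next
    for (g in k:min(2 * k - 1, j)) {
      i <- j - g + 1
      cand <- best[i] + sse(i, j)
      if (cand < best[j + 1]) { best[j + 1] <- cand; back[j + 1] <- i }
    }
  }
  groups <- list(); j <- n
  while (j > 0) { i <- back[j + 1]; groups <- c(list(path[i:j]), groups); j <- i - 1 }
  groups                                       # clusters given as original row indices
}

hm2_micro <- function(D, k, method = "nearest_insertion") {
  Ds  <- scale(D)                              # line 2: zero mean, unit sd per column
  tsp <- TSP(dist(Ds))                         # line 3: Euclidean distance matrix
  tsp <- insert_dummy(tsp, label = "cut")      # line 4: dummy node
  cycle <- solve_TSP(tsp, method = method)     # line 5: Hamiltonian cycle via H
  path  <- cut_tour(cycle, "cut")              # line 6: Hamiltonian path
  S <- mhm_partition(Ds, path, k)              # line 7: optimal k-partition of the path
  Dp <- D                                      # line 8: replace rows by group centroids
  for (s in S)
    Dp[s, ] <- matrix(colMeans(D[s, , drop = FALSE]),
                      nrow = length(s), ncol = ncol(D), byrow = TRUE)
  Dp
}

# Example run on a built-in numeric matrix
Dp <- hm2_micro(as.matrix(iris[, 1:4]), k = 3)
```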
5. Experiments
With the aim of practically validating the usefulness of our multivariate microaggregation proposal, we have thoroughly tested it on six datasets (described in Section 5.1) that serve as benchmarks. In addition, we are interested in knowing whether, and to what extent, our method outperforms the best performing microaggregation methods in the literature. Hence, we have compared our proposal with these methods (described in Section 5.2), and the results of all these tests are summarized in Section 5.3. Overall, considering four different values of the privacy parameter k, ten microaggregation algorithms, 50 repetitions per case, and six datasets, we have run over 12,000 microaggregation tests, which allows us to provide a statistically solid set of results.
5.1. Datasets
We used six datasets as benchmarks for our experiments. We can classify those datasets into two main groups: The first group comprises three well-known SDC microdata sets that have been used for years as benchmarks in the literature, namely “Census”, “EIA”, and “Tarragona”. The second group comprises three mobility datasets containing real GPS traces from three Spanish cities, namely “Barcelona”, “Madrid”, and “Tarraco”. Notice that we use the term “Tarraco”, the old Roman name for the city of Tarragona, in order to avoid confusion with the classic benchmark dataset “Tarragona”. The features of each dataset are next summarized:
The Census dataset was obtained using the public Data Extraction System of the U.S. Census Bureau. It contains 1080 records with 13 numerical attributes. The Tarragona dataset was obtained from the Tarragona Chamber of Commerce. It contains information on 834 companies in the Tarragona area, with 13 variables per record. The EIA dataset was obtained from the U.S. Energy Information Authority, and it consists of 4092 records with 15 attributes. More details on these datasets can be found in Reference [28].
The Barcelona, Madrid, and Tarraco datasets consist of OpenStreetMap [29] GPS traces collected from those cities: Barcelona contains the GPS traces of the city of Barcelona within the area determined by the parallelogram formed by latitudes (41.3726866, 41.4078446) and longitudes (2.1268845, 2.1903992); the dataset has 969 records with 30 GPS locations each. Madrid contains the GPS traces of the city of Madrid within the area determined by the parallelogram formed by latitudes (40.387613, 40.483515) and longitudes (−3.7398145, −3.653985); the dataset has 959 records with 30 GPS locations each. Tarraco contains the GPS traces of the city of Tarragona within the area determined by the parallelogram formed by latitudes (41.0967083, 41.141174) and longitudes (1.226008, 1.2946691); the dataset has 932 records with 30 GPS locations each.
In all trajectory datasets, each record consists of 30 locations represented as (latitude, longitude) pairs; hence, each record has 60 numerical values. These locations were extracted from each corresponding parallelogram according to the number of recorded tracks and their lengths.
5.2. Compared Methods
We have selected a representative set of well-known and state-of-the-art methods to assess the value of our approach. We have selected two classic microaggregation methods (i.e., MDAV and V-MDAV) as baselines. In the case of V-MDAV, the method was run for several values of its gain parameter, and the best result is reported. Although some newer methods might achieve better results, MDAV and V-MDAV are still landmarks that deserve to be included in any microaggregation comparison.
As for newer and more sophisticated methods, we have considered the work of Heaton and Mukherjee [27], in which they study a variety of microaggregation heuristics, including methods such as CBFS and MD. Instead of comparing our proposal with all those methods, we have taken the method that Heaton and Mukherjee reported as the best performer, namely MDAV-LK-MHM. This method, which is based on MDAV, first creates a microaggregation using MDAV, next improves that result by applying the LK heuristic, and finally applies MHM to obtain the resulting microaggregation (cf. Reference [27] for further details on the algorithm).
Regarding our proposal (i.e., (HM)²-Micro), as already discussed, it can be understood as a meta-heuristic able to embody any heuristic H that returns a Hamiltonian cycle. Hence, with the aim of determining the best heuristic, we have analyzed seven alternatives, namely Nearest Neighbor, Repetitive Nearest Neighbor, Nearest Insertion, Farthest Insertion, Cheapest Insertion, Arbitrary Insertion, and (our suggestion) Concorde.
Table 1 summarizes some features of all selected methods, including the reference to the original article where the method was described. For our method, each reference points to the article describing the TSP heuristic.
The implementations of all these methods use the R package sdcMicro [28], the TSP heuristics implemented in Reference [32], and the LK heuristic implemented in Reference [33]. LK has been configured so that the algorithm runs once at each iteration (parameter RUN=1) until a local optimum is reached. The same criterion was followed for the other TSP heuristics. In this regard, the heuristics we used consider a random starting node at each run. Hence, each experiment has been repeated 50 times to guarantee statistically sound outcomes regardless of this random starting point.
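As an illustration of this setup, and assuming the method names exposed by the R TSP package, the construction heuristics can be invoked and repeated as follows (the dataset is a synthetic stand-in, not one of our benchmarks; Concorde is an external solver whose binary must be made visible to the package, e.g., via concorde_path(), so it is omitted from this toy loop):

```r
library(TSP)

heuristics <- c("nn", "repetitive_nn", "nearest_insertion", "farthest_insertion",
                "cheapest_insertion", "arbitrary_insertion")  # "concorde" needs the binary

set.seed(1)
D   <- matrix(rnorm(200 * 5), ncol = 5)        # synthetic stand-in dataset
tsp <- TSP(dist(scale(D)))

# 50 repetitions per heuristic; random starting nodes drive the variability
path_len <- sapply(heuristics, function(m)
  replicate(50, tour_length(solve_TSP(tsp, method = m))))
summary(path_len)                              # one column of path lengths per heuristic
```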
5.3. Results Overview
Using the datasets and methods described above, we have analyzed the Information Loss (expressed as a percentage) as a measure of data utility (cf. Section 2 for details). Given a privacy parameter k that guarantees that the microaggregated dataset is k-anonymous, the lower the Information Loss, the better the result and the performance of the microaggregation method. The results are reported in Table 2, Table 3, Table 4, Table 5, Table 6 and Table 7, with the best (lowest) Information Loss highlighted in green.
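For reference, the Information Loss percentage follows the usual SSE/SST formulation from the microaggregation literature; a minimal sketch, assuming (as is common for these benchmarks) that both terms are computed on the standardized data, could read as follows (the helper name is ours):

```r
# Information Loss as a percentage: IL = 100 * SSE / SST, where SSE is the
# within-group sum of squares of the k-partition S and SST is the total sum
# of squares of the (standardized) dataset.
information_loss <- function(D, S) {
  Ds  <- scale(D)
  SST <- sum(Ds^2)                             # columns have zero mean after scaling
  SSE <- sum(sapply(S, function(s) {
    g <- Ds[s, , drop = FALSE]
    sum(sweep(g, 2, colMeans(g))^2)
  }))
  100 * SSE / SST
}
```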
Overall, it can be observed that our method, (HM)²-Micro with the Concorde heuristic, is the best performer in 79% of the experiments, and it is the second best in the remaining 21% (for which MDAV-LK-MHM outperforms it by a narrow margin of less than 2%). Interestingly enough, although (HM)²-Micro with Nearest Insertion or Farthest Insertion is not the best performer in any experiment, it outperforms MDAV-LK-MHM 50% of the time. The rest of the methods obtain less consistent results and depend highly on the dataset.
When we analyze the results more closely for each particular dataset, we observe that, in the case of the “Census” dataset (cf. Table 2), our method with Concorde outperforms all methods for all values of k. In addition, despite the random nature of the TSP heuristics, the standard deviations are very stable, denoting the robustness of all methods, yet they are slightly higher on average for the methods with higher Information Loss. It is worth emphasizing, though, that in all runs our method with Concorde and the MDAV-LK-MHM method obtained better results than MDAV and V-MDAV (i.e., the max values obtained in all runs are lower than the outcomes obtained by MDAV and V-MDAV).
For the “EIA” dataset (cf. Table 3), MDAV-LK-MHM is the best performer for all values of k except one, for which our proposal with Concorde performs better. In this case, the results obtained by these two methods are very close. Similarly to the results for “Census”, the max values obtained by these two methods outperform MDAV and V-MDAV. In the case of “Tarragona” (cf. Table 4), our method with Concorde outperforms all other methods. Surprisingly, both MDAV and V-MDAV obtain better results than MDAV-LK-MHM, which performs poorly on this dataset.
It can thus be concluded that the overall winner for the classical benchmarks (i.e., Census, EIA, and Tarragona) is our method, (HM)²-Micro with the Concorde heuristic, which is only marginally outperformed by MDAV-LK-MHM on the EIA dataset.
Regarding the other three datasets containing GPS traces (i.e., Barcelona, Madrid, and Tarraco), our method, (HM)²-Micro with the Concorde heuristic, is the best performer in 83% of the cases and comes second best in the remaining 17%. For the Barcelona dataset (cf. Table 5), MDAV-LK-MHM and (HM)²-Micro with the Concorde heuristic perform very well and similarly. The methods with the worst Information Loss are MDAV and V-MDAV. Our method with the insertion heuristics has remarkable performance, obtaining values similar to those of MDAV-LK-MHM and Concorde. Nevertheless, it is worth noting that the max (worst) values obtained by MDAV-LK-MHM and Concorde are still better than the averages obtained by the other methods. In the case of the Madrid dataset (cf. Table 6), our method, (HM)²-Micro with the Concorde heuristic, achieves the minimum (best) Information Loss for all values of k. We can also observe that our method with the insertion heuristics offers higher performance than MDAV-LK-MHM. Finally, the results for the Tarraco dataset (cf. Table 7) show that the minimum (best) Information Loss is obtained by our method with the Concorde heuristic in all cases. Here, MDAV-LK-MHM performs poorly, and, for two values of k, MDAV and V-MDAV are better.
We have already discussed that all studied methods (with the exception of MDAV and V-MDAV) have a non-deterministic component emerging from the random selection of the initial node. This random selection affects the performance of the final microaggregation obtained. With the aim of analyzing the effect of this non-deterministic behavior, we have studied the standard deviation of all methods for all values of k and for all datasets. In addition, we have visually inspected the variability of the results by using box plot diagrams, as sketched below.
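Box plots such as those in Figure 3 and Appendix A can be produced along the following lines (the data below are a synthetic stand-in for the 50 runs per method at a fixed k):

```r
set.seed(3)
# Synthetic stand-in: 50 Information Loss values for three methods at one k
runs <- data.frame(
  method = rep(c("Concorde", "MDAV-LK-MHM", "NN"), each = 50),
  il     = c(rnorm(50, 10, 0.3), rnorm(50, 10.5, 0.4), rnorm(50, 14, 1.2))
)
boxplot(il ~ method, data = runs, ylab = "Information Loss (%)")
```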
Since the results are quite similar and consistent across all datasets, for the sake of clarity, we only reproduce here the box plots for the “Census” dataset (see Figure 3), and we leave the others in Appendix A for the interested reader.
In Figure 3, we can observe that the Information Loss values increase with k, but all methods behave in the same way regardless of the value of k. In addition, it is clear that the most stable methods are (HM)²-Micro with Concorde and MDAV-LK-MHM.
Overall, we observe some expected differences depending on the dataset. However, the behavior of the best performing methods is stable. In particular, the datasets with GPS traces (i.e., Barcelona, Madrid, and Tarraco) show more stable results. In summary, the best method is our (HM)²-Micro with Concorde, which also exhibits the most stable results across all datasets.
6. Discussion
Over the previous sections, we have presented our microaggregation method, (HM)²-Micro, its rationale, and its performance against other classic and state-of-the-art methods on a variety of datasets. In the previous section, we reported the main results; we discuss them next by progressively answering the research questions posed in the Introduction of the article.
Q1: How can a suitable ordering for a univariate microaggregation algorithm be created when the records are in ℝ^p?
A main takeaway of this article is that, by using a combination of TSP tour construction heuristics (e.g., Concorde) and an optimal univariate microaggregation algorithm, we are able to properly order multivariate datasets in a univariate fashion that leads to excellent multivariate microaggregation solutions. Other approaches to ordering points in ℝ^p might consider projecting them onto the principal component. However, the information loss associated with this approach makes it unsuitable. In addition, other more promising approaches, like the one used in MDAV-LK-MHM, first create a k-partition and then set an order based on maximum-distance criteria. Although this approach might work well in some cases, we have clearly seen that Hamiltonian paths created by TSP heuristics, like Concorde, outperform it. Hence, based on the experiments of Section 5, we can conclude that TSP heuristics, like Concorde, provide an order for elements in ℝ^p that allows an optimal univariate microaggregation algorithm to output a consistent multivariate microaggregation solution with low Information Loss (i.e., high data utility). Moreover, of all the analyzed heuristics, the best performer is clearly Concorde, followed by the insertion heuristics.
Q2: Are the length of the Hamiltonian path and the information loss of the microaggregation related? That is, do shorter Hamiltonian paths lead to microaggregation solutions with lower information loss?
When we started this research, our intuition was that good heuristic solutions of the TSP (i.e., those with shorter path lengths) would provide a Hamiltonian path that could be used as an ordered vector for the HM optimal univariate microaggregation algorithm, resulting in a good multivariate microaggregation solution. From this intuition, we assumed that shorter Hamiltonian paths would lead to lower Information Loss in microaggregated datasets.
In order to validate (or disprove) this intuition, we have analyzed the Pearson correlation between the Hamiltonian path length obtained by each studied heuristic (i.e., Nearest Neighbor, Repetitive Nearest Neighbor, Nearest Insertion, Farthest Insertion, Cheapest Insertion, Arbitrary Insertion, and Concorde) and the SSE of the resulting microaggregation. We have done so for all studied datasets and k values. The results are summarized in Table 8, and all plots, along with a trend line, are available in Appendix B.
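The per-dataset, per-k Pearson coefficients of Table 8 can be computed along the following lines; the data frame below is a synthetic stand-in, whereas the real inputs are the per-run path lengths and SSE values of our experiments.

```r
set.seed(7)
# Synthetic stand-in: one row per (heuristic, run), for two values of k
results <- data.frame(
  k           = rep(c(3, 5), each = 100),
  path_length = runif(200, 50, 150)
)
results$sse <- 2 * results$path_length + rnorm(200, sd = 20)  # injected positive link

# Pearson correlation between path length and SSE, for each value of k
by(results, results$k,
   function(d) cor(d$path_length, d$sse, method = "pearson"))
```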
From the correlation analysis, it can be concluded that there is a positive correlation between the Hamiltonian path length and the SSE; that is, the shorter the path length, the lower the SSE. This statement holds for all k and for all datasets (although Census exhibits a lower correlation). Hence, although this result is not a proof of causality, it can safely be said that good solutions to the TSP lead to good solutions to the multivariate microaggregation problem. In fact, the best heuristic (i.e., Concorde) always results in the lowest (best) SSE.
Interested readers can find all plots in Appendix B. However, for the sake of clarity, let us illustrate this result by discussing the case of the Madrid dataset (for a single value of k), depicted in Figure 4. In the figure, the positive correlation is apparent. In addition, it is clear that the heuristics tend to form clusters. In a nutshell, the best heuristic is Concorde, followed by the insertion family of methods (i.e., Nearest Insertion, Farthest Insertion, Cheapest Insertion, and Arbitrary Insertion), followed by Repetitive Nearest Neighbor and Nearest Neighbor.
Although Figure 4 clearly illustrates the positive correlation between path length and SSE, it also shows that the heuristics tend to cluster, which might indicate that not only the path but also the heuristic per se plays a role in the reduction of the SSE. This observation leads us to our next research question.
Q3: Is the length of the Hamiltonian path the only factor affecting information loss, or does the particular construction of the path (regardless of its length) affect the information loss?
In the previous question, we found a clear positive correlation between path length and SSE. However, we also observed apparent clusters suggesting that the heuristics themselves could be responsible for the minimization of the SSE. In other words, although path length and SSE are positively correlated when all methods are analyzed together, does this correlation hold when the heuristics are analyzed one at a time? To answer this question, we have analyzed the results of each heuristic individually, and we have observed that there is still a positive correlation between path length and SSE, but it is very weak or almost non-existent (i.e., very close to 0), as Figure 5 illustrates.
The results shown in Figure 5 are only illustrative, and a deeper analysis, which is out of the scope of this paper, would be necessary. However, our initial results indicate that, although there is a positive correlation between path length and SSE globally, this correlation weakens significantly when each heuristic is analyzed individually. This result suggests that it is not only the length of the path but also the way in which the path is constructed that affects the SSE. This would explain why similar methods (e.g., those based on insertion) behave similarly in terms of SSE although their path lengths vary.
Q4: Does (HM)²-Micro provide better solutions (in terms of information loss) than the best performing microaggregation methods in the literature?
This question has already been answered in Section 5.3. However, for the sake of completeness, we summarize the answer here: the results obtained after executing more than 12,000 tests suggest that our solution, (HM)²-Micro, obtains better results than classic microaggregation methods, such as MDAV and V-MDAV. Moreover, when (HM)²-Micro uses the Concorde heuristic to determine the Hamiltonian path, it consistently outperforms the best state-of-the-art methods. In our experiments, (HM)²-Micro with Concorde was the best performer 79% of the time and the second best in the remaining 21%.
Q5: Do TSP-based microaggregation methods perform better than current solutions on trajectory datasets?
(HM)²-Micro with Concorde is the best overall performer. Moreover, if we focus on the datasets with trajectory data (i.e., Barcelona, Madrid, and Tarraco), the results are even better: it is the best performer in 83% of the tests and the second best in the remaining 17%. This good behavior could stem from the very foundations of the TSP; however, there is still plenty of research to do in this line before reaching more solid conclusions. Location privacy is a very complex topic that encompasses many nuances beyond k-anonymity models (such as the one followed in this article). However, this result is an invigorating first step towards the analysis of novel microaggregation methods applied to trajectory analysis and protection.
7. Conclusions
Microaggregation has been studied for decades now, and, although optimal multivariate microaggregation is NP-hard and no polynomial-time algorithm for it is known, steady improvements over microaggregation heuristics have been made. Hence, after such a long research and polishing process, finding new solutions that improve upon the best methods is increasingly difficult. In this article, we have presented (HM)²-Micro, a meta-heuristic that leverages the advances in TSP solvers and combines them with optimal univariate microaggregation to create a flexible and robust multivariate microaggregation solution.
We have studied our method and thoroughly compared it with classic and state-of-the-art microaggregation algorithms over a variety of classic benchmarks and trajectory datasets. Overall, we have executed more than 12,000 tests, and we have shown that our solution, embodying the Concorde heuristic, outperforms the others. Hence, we have shown that our TSP-inspired method can be used to guarantee the k-anonymity of trajectory datasets whilst reducing the Information Loss, thus increasing data utility. Furthermore, our proposal is very stable, i.e., its performance does not change significantly regardless of the random behavior associated with the initial node selection.
In addition to proposing (HM)²-Micro, we have found clear correlations between the length of Hamiltonian paths and the SSE introduced by microaggregation processes, and we have shown the influence of the Hamiltonian cycle construction algorithm on the overall performance of the microaggregation.
Despite these relevant results, there is still much to do in the study of microaggregation and data protection. Future work will focus on scaling up (HM)²-Micro to high-dimensional and very large datasets. Considering the growing importance of Big Data and Cloud Computing, adapting our solution to distributed computation environments is paramount. Moreover, adjusting TSP heuristics to leverage lightweight microaggregation-based approaches is an interesting research path to follow. In addition, although the values of the privacy parameter k are typically low (i.e., 3, 4, 5, 6), we plan to study the effect of larger values of k on our solution. Last but not least, since microaggregation is essentially a data-oriented procedure, we will study how our solution adapts to data structures from specific domains, such as healthcare, transportation, and energy.
All in all, with (HM)²-Micro, we have laid the ground for the study of multivariate microaggregation meta-heuristics from a new perspective, one that might continue to develop in the years to come.