1. Introduction
There is no doubt that we live in the era of Big Data. Over the last decade, thanks to technological advances, information systems have favored automatic and effective data gathering, thus resulting in a considerable increase in the amount of available data. Daily, a wide range of data is produced: scientific, financial, health data, as well as from social media, are just some examples of sources. However, this data is useless without the extraction of the underlying knowledge, a major challenge for the researchers as classical machine learning methods cannot deal with the volume, value, veracity and variety that big data brings [
1]. Therefore, existing machine learning techniques, which deal with 4 Vs [
2], have been or need to be redefined for efficiently processing and managing such data, as well as to obtain valuable information that can benefit not only scientists, but also businesses and organizations. Actually, the recent advances in distributed technologies can be utilized in order to enable scientists to rapidly find out hidden or unknown patterns from the 4 Vs [
3]. Nonetheless, most of the existing methods fail to directly tackle the increased number of attributes and records of databases, due to its computational complexity. Hence, the data mining techniques should be able to encounter data scalability, dimensionality, uncertain data and/or data preprocessing. Data preprocessing constitutes an important step before the data mining process and, as a result, data mining algorithms are designed to accept specific data formats that are best suited for them.
With the advances in wireless and mobile technologies, moving objects are equipped with location positioning sensors and wireless communicating capabilities, thus, large-scale spatio-temporal data is being generated and an urgent need for efficient query processing approaches, to deal with both the spatial and temporal attributes, has been arisen [
4]. In particular, with the advent of mobile and ubiquitous computing, spatio-temporal queries processing on moving objects databases has become a necessity for many applications, such as traffic control systems, geographical information systems, and location-aware advertisement. Hence, time-dependent versions of
k nearest neighbor (
k-NN) queries need to be studied. According to [
5], the
k-NN queries can be distinguished into four categories: (i) both query and data objects are static, (ii) moving query but static data objects, (iii) static query but moving data objects and (iv) both query and data objects are moving. When a mobile object’s location data changes, a Snapshot
k-NN (
) query is issued at each location. However, in highly dynamic spatio-temporal applications, where the moving objects data varies frequently over time, a fundamental query is the so-called Continuous
k-NN (
) [
6]. A
query can be subjected to the fourth category, as it instantly retrieves the
k nearest neighbor objects of a moving query object at each time within a given time interval. Practically, the
spatio-temporal query is evaluated on a high number of consecutive timestamps and can be considered as a series of frequent
queries. To the best of our knowledge, none of the existing approaches, such as [
6,
7,
8] to name a few, address the problem of
queries processing on non-linear trajectories under the support of Hough transformation. It is worth to mention that in previous works the process of
query has been considered in road networks with linear trajectory moving objects.
In this paper, our efforts are devoted to processing a
query assuming that each object keeps track of non-linear trajectory. Such a trajectory is approximated as piece-wise linear among timestamps where objects (small) velocity changes; a storage efficient strategy adopted in our previous work [
9]. Some previous works investigate the problem of
for moving objects with fixed velocity while others with uncertain velocity in road networks [
10,
11]. Nonetheless, once an object’s velocity changes, the
query has to be re-executed. Such a case is quite frequent in highly dynamic environments, thus, the performance of these techniques would be significantly degraded, resulting in increased query cost (because of re-evaluation). The focus of this work is on
k nearest neighbor queries, which are considered under both a spatio-temporal and a Hough transformed database. The Hough method can mitigate such issues since it is piece-wise constant [
9], thus, the
query is not evaluated as frequent as in the Euclidean space with non-linear trajectory. Actually, the application of
k-NN queries in Hough space facilitates the maintenance and consistency of continuous queries result [
12], even if the moving objects location data updates. The aforementioned issue relates to the fact that the query result needs to be only updated upon specific changes, namely, whenever moving angle direction and/or velocity vary dramatically. Nevertheless, while an object (including the query object) is moving on a linear part of its trajectory (maintenance phase) the query result will not change, meaning that the query result of all the objects will not need be revised until the objects transit to the next linear part of its trajectory, which indicates that the maintenance phase expires.
Moreover, with the extensive adoption of the Cloud, research on privacy preserving has appealed to lots of researchers [
13,
14]. Let us recall that the ultimate purpose of the processing is to formulate the
k-anonymity set to protect users’ privacy in the cloud computing environment. Especially, in the context of this work, we are challenged with the necessity to apply temporal continuous spatial
k-NN queries in a distributed architecture with the ultimate goal of forming the anonymity set for a group of moving objects in a fast and efficient manner.
In summary, the main contributions of this paper are as follows:
A continuous query processing algorithm is considered that efficiently answers the spatial k-NN query with the aid of an indexing method or otherwise. From the query result, in each timestamp and for all moving objects, the desired anonymity set is formulated.
The adopted method is designed with the aid of GeoSpark spatial data operations to compute the
k-anonymity set which is used to measure the possibility of each object identity being unveiled as it moves from one location to another. To be more specific, vulnerability evaluation is conducted in two-dimensional space (both Euclidean and Hough space), considering different pairs of features, as it will be demonstrated in
Section 4.
A comprehensive set of experiments is conducted to evaluate the time performance of the proposed method in the GeoSpark environment under different data sizes.
The rest of this paper is organized as follows. In
Section 2, previous related works are presented in relation to our approach. In
Section 3, the following are described: (a) the problem definition, (b) the system model and (c) the proposed GeoSpark framework for the
k-anonymity set computation.
Section 4 presents the experiments conducted for the evaluation of the studying problem, while
Section 5 evaluates relevant discussion of the studying problem results in relation to previous works. Finally, in
Section 6, conclusions and future directions of this work are recorded.
3. Materials and Methods
This section provides the necessary background knowledge for the remainder of the paper. Initially, the k-NN algorithm, a core component of the adopted privacy preserving methodology along with its weaknesses to tackle big data problems, is presented. In following, useful definitions and notations will be recorded under the problem definition, with the most characteristic being the spatial indexing and partitioning methods. Moreover, the GeoSpark components for the queries processing with the aim to formulate the k-anonymity set, will be in detail described.
3.1. Operations on Spatial Data
Querying on spatial data is considered an operation that is usually coupled with indexing methods. Several indexing methods have been considered in literature as they are crucial for the performance of spatial data query processing algorithms, since they are used in order to reduce the query run time. The most representative are the ones based on R-tree and Quad-tree.
Further, to support efficient query processing on moving objects, grid-based space partitioning methods can be adopted [
27]. Moving objects data are indexed in the grid cells they belong to for facilitating the queries and avoiding the check of all the objects. The aim of a spatial partitioning technique is to improve the query time as well as to keep all the partitions balanced in terms of memory and computations, which is known as the term “load balancing” [
28].
The equal grids partitioning uniformly divides the whole region, providing thus a good data locality but not load balancing. Also, Quad-tree is another data structure based on the principle of “divide and rule” that recursively divides two-dimensional space into four quadrants and needs a merging operation to construct the specified number of partitions. On the other hand, the
R-tree provides an efficient data partitioning strategy to efficiently index spatial data. It is considered a balanced search tree that improves both search speed and storage utilization. Another space partitioning strategy is based on
-tree, a balanced binary tree, which has been used in load balancing of spatial databases for fast querying. It is worth mentioning that the locality of data and load balancing are important to speedup queries performance [
29].
3.2. The k-NN Classifier from Big Spatial Data Perspective
The k-NN algorithm is a popular non-parametric method that can be used for both classification and regression tasks. In following, we will discuss the k-NN classification problem from the big spatial data viewpoint. The main components of k-NN classifier are:
is a training mobile objects dataset of size N,
is a testing mobile objects dataset of size M,
is a mobile object represented as a tuple in the form of , where is the value of the p-th feature of the n-th object and is the class it belongs to, denoted as and
is only known to dataset.
In the classification process, for each test object , the k-NN algorithm searches the k closest objects in the set, computing the distances (specifically the Euclidean distance) between the test mobile object and all the mobile objects in . The distances from all training objects are ranked in ascending order and then, the k nearest objects are kept to find the dominant class . Despite its remarkable performance in real-world applications, it lacks the scalability to manage large-scale datasets. Initially, the time complexity to find the k nearest neighbors for a single test mobile object is , where an extra complexity for distances sorting process is involved; in this additional complexity, N is the size of training dataset and p is the number of object features. The classification process needs to be restated for all the test mobile objects. Additionally, the k-NN model requires the training data to be stored in memory in order to achieve a fast computation of the distances. However, large-scale and sets may not fit in the RAM memory.
3.3. Problem Definition
A spatio-temporal database, whose records are moving objects with geolocation data attributes in two-dimensional space , is assumed. In real world examples objects move arbitrarily which oppose to fixed velocity assumption that some works suppose. Hence, it is considered that the objects velocity is low with values between a minimal and a maximal, i.e., . In addition to, we consider low data sampling rate, since applying high data sampling rate it would be difficult to apply linear interpolation between sampled points, thus, the adopted dual methods should be redesigned adopting appropriate curve fitting methods. As a result, the velocity calculated from two consecutive way points could not represent the real velocity well and Euclidean distance would not work. In the following, an overview of the studying problem definitions along with notations is presented.
Definition 1. The trajectory of a moving object is assumed to be a continuous piece-wise linear function, which maps the temporal dimension to the two-dimensional Euclidean space, connecting a sequence of points for .
Such a representation entails that the n-th object position at time is for , and that during each time interval , the object moves along a straight line from to with a constant velocity (amplitude and direction).
To ensure an efficient storage and handling of the queries, the expected position of the object at any time
, where
, is obtained by a linear interpolation between
and
. In this way, a number of additional features can be extracted, such as velocity
u, angle direction
, Hough-
X and/or Hough-
Y transformation of
[
9].
Definition 2. A moving object trajectory (spatio-temporal) snapshot is defined as the location data of that object in a specific timestamp; thus, a single trajectory is stored as a collection of location snapshots denoted as .
Definition 3. Snapshot Distance: Given two objects and with location snapshots and respectively, the distance (Euclidean) (Without loss of generality, both other distance measures can be considered, such as the Manhattan distance (), as well as the maximum distance ()) between and in at timestamp is computed as The distance between two objects in our model is represented in Euclidean distance.
The spatial k-NN queries are considered as one of the most common search problems that will be employed in our study. Generally, given a spatial region, a k-NN query on spatial data identifies all the spatial points that lie inside this region. For spatio-temporal k-NN queries, a time interval is also given, where the timestamps of the resulting trajectories in that region need to also fall in that time interval. A spatial k-NN query takes as input a query center point along with a set of spatial objects location data in order to find the k nearest neighbors around the center point.
Definition 4. k-NN: Given a moving object , a dataset O and an integer k, the k nearest neighbors of from O, denoted as , is a set of k objects from O that .
Definition 5. Snapshot Trajectory Point k-NN () query: Given a query center point q at timestamp , an integer k and a dataset of trajectory (spatio-temporal) points denoted as a k-NN query, denoted as , asks for the k spatial points of whose distance from the query point q is less than that of the rest points of .
Definition 6. Continuous Trajectory Point k-NN () query: Given a query center point q at timestamp , an integer k and a dataset of trajectory (spatio-temporal) points denoted as at a time-interval for (practically large enough), a continuous k-NN query, denoted as , asks for the k spatial points of , whose distance from the query point q is less than that of the rest points of , for all consecutive .
Note that,
. Considering the previous definitions, the problem formulation for the application of robust
queries (see
Figure 1) will be considered in the following subsection. In the context of the proposed approach, all points
are selected as query objects in order to acquire the
k nearest neighbors of all mobile objects in each timestamp
. If the process is repeated for a high number of consecutive timestamps in a selected time period, then, the collection of
queries constitutes a
query, thus similar to our case.
3.4. Problem Formulation
The problem of Robust Spatio-temporal Databases is addressed in an environment suitable for Big Spatial Data Management. The
k-anonymization approach adopted in [
30] for preventing mobile objects identity reveal, is related to
k-NN algorithm as described in
Section 3.2. Specifically, the
k-anonymity set is formulated by the unique object identifier, denoted as
, of the
k nearest neighbors, exploiting as a result the spatio-temporal data of a set of mobile objects.
Through the SMaRT system, a set of mobile users’ trajectory data is recorded per timestamp, that is, the mobile user trajectory
and the values of longitude and latitude are recorded. From these location features, the four attributes
(as presented in
Table 1) along with the values of Hough-
X and Hough-
Y of
[
9],
(as presented in
Table 2), are computed.
So, by employing the
k-NN method for different pairs of features, this fact enables us to form the
k-anonymity set of each mobile object per timestamp, as depicted in
Table 3.
For each mobile user
i and per timestamp
l, the
k nearest neighbors
it is computed and in the following is kept in a vector form
for
as presented in
Table 3. For each user, the number
out of the
k nearest neighbors, which remains the same from one timestamp to another, is computed in order to estimate the vulnerability ratio
. Hence, higher
is associated with lower probability (i.e., lower vulnerability) a moving object’s identity being unveiled.
Definition 7. k-anonymous database: A database is k-anonymous, i.e., its records are k-anonymous with respect to the selected features, if discrete records in the same specific timestamp τ, have at least the same nearest neighbors so that no record of k is distinguished from its neighboring records.
3.5. System Model
A spatio-temporal database is considered with
N records, that is,
N moving objects in the
plane. Each record
represents the spatial coordinates of a mobile user
j in timestamp
, or point
i of its trajectory
j [
31]. From the location coordinates
, we can extract the corresponding velocity and angle direction features
and the dual points
,
,
,
by employing the dual methods described in [
9]. Let us assume a trajectories database
of equal length
L in which each trajectory is represented via a sequence of
L triples.
For each point i in trajectory j, we define a two-dimensional feature vector , which captures the selected features data. Hence, we can define and store the trajectory j as .
In the context of this work, the time performance of the
k-anonymity set formulation either with the aid of an indexing method or for spatial
k-NN queries on trajectory data employing Snapshot
k-NN query on Trajectory Points, as depicted in
Figure 2, is evaluated. Given a set of trajectories represented as a sequence of spatio-temporal points along with a query point, the
algorithm finds its
k nearest spatial points from a set of trajectory points in corresponding timestamps. The
query is issued over a set of moving objects, executing the classical
k-NN in a time period, and updates the
k-anonymity set from timestamp to timestamp for all objects.
However, some objects may not move with the same velocity, leading to changes in the query result, i.e., in the k-anonymity set. As a result, in Privacy Preserving, it is important for the k-anonymity set to remain the same or vary slowly, as in this case, the vulnerability measure of moving objects will be affected. Note that, for a given STPkNN query q, the k-anonymity set, denoted as , should always satisfy the following conditions:
The first condition ensures that the anonymity set contains the s of k objects.
The second condition ensures that these k objects’ are the k nearest ones to q.
Assuming a mobile object when considering a k-NN query relevant to the selected features in a specific timestamp, then, the spatial k-anonymity ensures that an attacker, as the query issuer, cannot identify a mobile object with probability larger than , where k is a user-defined anonymity parameter. Moreover, a spatio-temporal database is expected to handle a high number of moving objects location data, as well as, a large number of consecutive k-NN queries. Hence, an efficient consecutive processing algorithm is very important to be addressed. To this end, the query is investigated in the GeoSpark framework and experimental evaluations are conducted using a realistic dataset to demonstrate the time performance of the k-anonymity set computation under the query. Ultimately, the vulnerability behavior for different pairs of features is investigated.
3.6. GeoSpark System Overview
In this subsection, the necessary structures of GeoSpark-based approach is thoroughly presented. All parts were implemented using Apache Scala due to its compatibility with Apache Spark framework. The core components are the following implemented on GeoSpark Resilient Distributed Dataset (RDD) and the interaction with GeoSpark is utilized through Apache Zeppelin interface.
3.6.1. GeoSpark Architecture
GeoSpark [
32] is an extension of the Apache Spark core and provides additional tools to manage geospatial data, e.g., geospatial datatypes, indexes and operations. The architecture of GeoSpark, as introduced in
Figure 3, consists of the following three layers:
The Apache Spark Layer consists of all the components present in Spark and performs data loading and querying.
The Geospatial RDD Layer extends Spark and supports three types of RDD, i.e., Point, Rectangle and Polygon RDD. In addition, it contains a geometrical operations library for every RDD.
The Geospatial Query Processing Layer is used to perform different types of geospatial queries.
GeoSpark provides Spatial RDD that allows efficient loading, transformation, partitioning, in-memory storing as well as indexing of complex spatial data for different input data sources such as CSV, WKT, GeoJSON, Shapefile, etc., which, through the GeoSpark Spatial SQL interface, are compatible with Spark. It also provides eight types of spatial objects, namely Point, Multi-Point, Polygon, Multi-Polygon, Line String, Multi-Line String, GeometryCollection, and Circle. Hence, spatial objects in a Spatial RDD can consist of different geometry types. Furthermore, it supports three types of spatial objects, which are Point, Rectangle and Polygon, for which the corresponding spatial RDDs can be defined. These structures can be used as input to several spatial queries, such as spatial k-NN, spatial range and spatial join.
In following, taking advantage of the GeoSpark architecture, the main parts of our approach will be presented.
3.6.2. Spatial RDD Structures Preparation
Initially, the trajectory data of all mobile users is stored in a csv file. Specifically, this data consists of trajectory (spatio-temporal) points for a number of mobile objects, i.e., bike riders. For each bike trajectory (spatio-temporal) point, we have in our disposal the bike rider
, the spatial coordinates
, the polar coordinates (direction, velocity)
, the Hough-
attributes as well as the timestamp, with the aim of computing the
k nearest neighbors and selecting two features among them, at each time. Practically, all bike riders follow non-linear trajectories. However, in our case, the users’ trajectory is approximated by linear sub-trajectories, as, in our previous work [
9], thus, the values of the attributes are not randomized such that groups of bikes have similar behavior. The user cannot control attribute values, but, they may ask to simulate a specific timestamp. In GeoSpark, firstly the corresponding
is recovered and in following the temporal partitioning is implemented. Information about road network that describe the situation of the region are not taken into consideration.
In Apache Zeppelin, the input data are loaded from the csv file as a Dataframe and then SQL queries can be executed on the created Dataframe in order to recover spatial data of moving objects in specific timestamps. Then, these data are stored and transformed to PointRDD. In this way, several PointRDDs are created which, from now, will be called as BikesRDD, as they concern bike riders trajectories (spatio-temporal points). GeoSpark’s operator permits us to transform a set of raw data into a PointRDD, and selecting the columns that correspond to the desired features at a specific timestamp, as depicted in
Table 4.
3.6.3. Spatio-Temporal Partitioning
The proposed approach considers a naive temporal partitioning from which a collection of timestamps is considered. These timestamps are related to the sampling period where the system recorded the spatio-temporal data with its usage. As a result, the size of the loaded data in BikesRDD may be different for different timestamps since some mobile objects may not have a spatio-temporal footprint for all recorded timestamps. Here, in our simulations, the sampling period is supposed to be similar for all bike riders trajectory data. From the trajectory (i.e., spatio-temporal) points of a number of bike riders objects at different timestamps, a collection of BikesRDD is constructed.
Following the previous aspects, GeoSpark can easily utilize spatial partitioning over the temporal partition of the data of these bikes. In each temporal partition, a bike rider may have not necessarily moved because random delays may be employed. Hence, in a specific time partition, the spatial data of all bikes may not necessarily participate.
BikesRDD constitutes a specialized GeoSpark PointRDD, which consists of individual bike riders’ records. The spatial partitioning on the BikesRDD concerns bike riders selected features, such as the location data. In several temporal partitions, the BikesRDD are simulated one by one.
More to the point, GeoSpark offers low overhead spatial partitioning approaches that take into consideration the spatial distribution of the data as well as repartitions a loaded Spatial RDD in an automatic way. In each Spatial RDD, several spatially proximal objects are grouped into the same partition and, as a result, the partitioning layer in GeoSpark partitions the workload in a spatial way and periodically repartitions this workload in order to keep balanced partitions. Also, it supports a variety of grid-type partitioning methods, such as equal-grids,
R-tree, Quad-tree and
-tree, to name a few [
33].
3.6.4. k-Anonymity Set in GeoSpark
Algorithm
1 presents the steps of a spatial
k-NN in GeoSpark. It takes as input a non-indexed or indexed
, a query center point as well as a parameter
k, which indicates the number of nearest neighbors. The algorithm is separated in two phases, namely selection and sorting phase [
33], and will be utilized in the simulations.
Algorithm 1 Spatial k-NN Query |
- 1:
input The number of k nearest neighbors - 2:
input A query center object A - 3:
input A Spatial RDD B - 4:
output A list of k spatial objects - 5:
Step 1: Selection phase - 6:
for all partition ∈ SRDD B do - 7:
if an index exists then - 8:
Return k-neighbors of A by querying the index of this partition - 9:
else - 10:
for all object ∈ this partition do - 11:
Check all the distance between the object A and Spatial RDD B - 12:
end for - 13:
end if - 14:
end for - 15:
Maintain a priority queue that stores the top k nearest neighbors - 16:
Step 2: Sorting phase - 17:
Sort the spatial objects in the intermediate Spatial RDD based on their distances to A - 18:
Return top k objects in C
|
Our aim is to exploit the above mentioned structures and Algorithm
1 so as to formulate the
k-anonymity set for a number of moving objects based on their trajectory (spatio-temporal) data. From the trajectory points of a number of mobile objects at different timestamps, a collection of spatial BikesRDD is constructed. The goal of this paper is to provide an efficient and scalable framework for robust continuous
k-NN querying of spatial objects in GeoSpark. A Scala RDD API for creating the desired PointRDD from a csv file is introduced in following Algorithm
2.
Algorithm 2 Spatial PointRDD Creation |
- 1:
Define csv file location in pointRDDInputLocation - 2:
Define the attributes start Column equal to 0 in pointRDDOffset - 3:
Define pointRDDSplitter as FileDataSplitter.csv - 4:
Define rest attributes in carryOtherAttributes = true - 5:
Create PointRDD: PointRDD = new PointRDD(sc, pointRDDInputLocation, pointRDDOffset, pointRDDSplitter, carryOtherAttributes)
|
An iterative application of a typical spatial
k-NN query, which was presented in Algorithm
1, is introduced with use of Scala RDD API of Geospark in following Algorithm
3.
Algorithm 3 Iterative Spatial k-NN Query |
- 1:
inputk, usingIndex, number of timestamps L - 2:
for timestamp to L do - 3:
Create the Dataframe with the selected features in t - 4:
Save as csv file f - 5:
Create from csv a PointRDD - 6:
if usingIndex = true then - 7:
Build R-Tree index on PointRDD - 8:
end if - 9:
for all lines ∈ csv file f do - 10:
Read the selected features - 11:
Create querypoint - 12:
val fact = new GeometryFactory() - 13:
val querypoint = fact.createPoint(new Coordinate(,)) - 14:
Run Spatial k-NN Query to return k geometry points - 15:
val queryResultList = KNNQuery.SpatialKnnQuery(PointRDD, querypoint, k, usingIndex) - 16:
end for - 17:
end for
|
It is worth mentioning the fact that the output format of the spatial
k-NN query (indexed or not) is a list of geometries as in
Table 5, where the
of the point geometry is related to the mobile object
. In addition, the list holds the top-
k geometry objects. In the following, the above methodology and algorithms will be considered for the performance evaluation of the proposed
k-anonymity method based on spatial
k-NN.
6. Conclusions and Future Work
In conclusion, this work aimed at testing the hardware and software performance in terms of the number of operations performed and the time needed to obtain satisfactory anonymity. The real world example concerning moving bicycle riders proved satisfactory results using different additional ancillary methods of space partitioning and indexing. Also, a GeoSpark-based approach for evaluating the vulnerability of spatial k-anonymity in large-scale spatio-temporal data is provided. Experiments show that GeoSpark constitutes an appropriate Spark-based system for the evaluation of robust k-NN queries. The GeoSpark spatial RDDs are exploited to store trajectory data of the mobile objects as PointRDD in order to acquire the anonymity sets of all query mobile objects issuing iterative spatial k-NN queries for all mobile objects in consecutive timestamps. It should be noted that the experiments were conducted in one Virtual Machine (VM) in local mode, with specific capabilities as described in the experimental evaluation.
As a future work, our aim is to apply the proposed anonymity method in a fully distributed environment with many VMs so as to verify the expected results from previous findings about the execution time; in this case, the input data will be partitioned and in following distributed to different VMs. In this way, several parallel tasks will be executed and thus, the execution time for the anonymity set computation is expected to be much lower than in the current single VM case when considering large-scale databases. The locality of the data and load balancing are important when applying efficient k-NN queries. GeoSpark also provides a grid-type partitioning based on -tree that can ensure workload balancing.
In conclusion, it should be pointed out that the current experimentation gave us an insight about the design issues and possible constraints of the proposed approach so as to optimize its performance in terms of execution time cost and memory utilization for real-time scenarios, where the mobile objects identity security, when issuing a k-NN query, is of high importance.