1. Introduction
Nowadays, there is a growing number of devices connected to the internet, and an exponential rise is expected, along with the development of 5G communication standards; see [
1]. The availability of data and the increase in computation capacity have boosted the proliferation of new applications. Geolocation services have attracted great interest due to the easy collection of geopositional data through all kind of devices. This type of data can proceed from vehicle services
http://vis.cs.kent.edu/demo/list_of_trajectory_datasets.pdf (accessed date 17 April 2021), electrical micro-mobility (bikes and scooters)
https://www.kaggle.com/pronto/cycle-share-dataset/code, 17 April 2021, people-collected data
https://archive.ics.uci.edu/ml/datasets/GPS+Trajectories (accessed date 17 April 2021), and public transport
https://data-crtm.opendata.arcgis.com/ (accessed date 17 April 2021). These new services have realized the relevance of analyzing moving patterns of their customers as a key characteristic to improve their value proposition and competitivity. Obtaining knowledge from moving agents’ data has applications such as bus routes optimization [
2], prediction of collective behavior [
3,
4], customized trips, or fleet management in the mobility field [
5,
6]. Additionally, the maturity of the mentioned services is impacting mobility in cities. The study of citizens’ movements patterns may push forward more customized transportation [
7]. This is a part of the more global concept of smart transportation and smart cities [
8].
The surge in data from moving objects faces several challenges. One of them is how to extract useful insights of the data. Clustering of trajectories refers to grouping them based on a similarity measure [
9]. This is a task that serves as a exploratory step in the discovery of patterns in moving objects and enables one to synthesize the information by establishing a representative trajectory for each cluster. By doing so, it serves as a base from which to start a deeper analysis of the characteristics of the trajectories.
A second challenge is related to quality assurance and trust. Among the diversity and randomness typical of the data of moving objects, it is vital to establish the data that correspond to real behaviors and those that are due to errors in the sensors or systems. This task is referred to as outlier/anomaly detection. It is domain-specific and is also used for detection of fraudulent activities, since these activities usually fall away from expected common behavior.
Another challenge in the application of trajectory data analysis is the development of user-friendly tools that enable the development of services to extract knowledge from integrated data sources. An obstacle is the dispersion of libraries due to most of them being domain-specific. Moradi et al. [
10] mentioned more than 20 different packages in R, mostly focused on animal trajectories. In their work, they propose a set of data structures, classes and methods to deal with generic trajectory data. Another option is the MovingObjects library [
11] implemented in python, which is based in the Geopandas data structures. The mentioned libraries do not offer no-code options, and thus, they present a learning curve for non-tech users. The open-source TrajAnalytics software [
12] offers a web-based visualization interface that simplifies the analysis process for these users. In this work, we present the CROSS-CPP project
https://www.cross-cpp.eu/, 17 April 2021; its main aim is the development of a system for the integration and analytics of data streams coming from high volume (mass) products with cyber physical features. Its purpose is to offer new cross-sectorial services and focus on the commercial confidentiality, privacy and ethical issues using a context-sensitive approach. In the framework of this project, a toolbox has been developed to provide the analysis services. It has been designed as a easy-to-use tool thanks to its user interface that enables to analyze moving objects’ anonymized data. The toolbox includes algorithms for different machine learning tasks, on both stream and batch mode. In particular, we focus in this paper on the clustering of moving objects.
Quickbundles [
13] is known to be a clustering algorithm for tractography segmentation that has shown good performance in the health domain. Tractografies, like moving object trajectories, may be defined as streamlines. Thus, we propose in this paper to adapt Quickbundles for trajectory clustering and to integrate it into CROSS-CPP analytic toolbox. The adaptation is not straightforward, and some challenges have to be faced: (i) Quickbundles assumes streamlines being defined by a settled number of points, and as a consequence, in this paper, we solve the problem to define different trajectories under the same number of points; (ii) for the distance among streamlines, the measure of the distances is adjusted for moving object trajectories.
The paper presents the results of the experiments carried out with two datasets: (i) Porto Kaggle dataset
https://www.kaggle.com/crailtap/taxi-trajectory. (accessed date 17 April 2021) and (ii) CROSS-CPP dataset (Due to privacy constrains the dataset is not publicly available). We also show how the integration looks like in CROSS-CPP and how it can be accessed. Consequently, the main contributions of the paper are as follows: (i) the adaption of Quickbundles algorithm for mobile data, (ii) the analysis of its performance in real trajectories datasets and (iii) the integration into the CROSS-CPP environment.
The rest of the paper is organized as follows:
Section 2, Related Work, reviews the approaches for clustering trajectories, the used distance measures and models of the Earth surface.
Section 3, Preliminaries , include the theory related with the topic and in particular trajectory definition, distance measures and comparison of trajectories.
Section 4 contains the datasets’ description, the preprocessing of the trajectories data, the validation metrics and results. In
Section 5, it is explained how a non-tech person can perform the end-to-end process of harvesting the data and perform a clustering analysis in the CROSS-CPP platform. Thanks to a graphic user-interface, the learning-curve is lower for non-tech professionals. The last
Section 6 discusses the results, analyzes the pros and cons of the proposed approach, highlights future work and extracts conclusions.
3. Preliminaries
Prior to presenting the approach to cluster trajectories, some preliminary definitions are introduced.
A trajectory is the path followed by a moving object. Formally, a trajectory
s is defined as a chronological sequence of multi-dimensional locations, which is denoted by
, with
n being the number of points. Each point
is represented as
-tuple at time
[
9].
The concept of distance among trajectories refers to the criteria to measure the similarity of trajectories. In this paper, the similarity focuses on the geometric patterns of trajectories. Two trajectories are identical if they have the same length and are defined by the same positions independently of the time. This kind of similarity is well-suited for analysis of routes in cities and countries, such as commuting routes from homes to work places, hot-spot areas and holiday road trips.
The geodesic distance is the shortest distance between two points located on the surface of the Earth, and there are several models to compute this distance:
Harvesine, or half of the versine, which is the central angle
between two points in an sphere:
where
d is the distance between two points along a great circle of the sphere and
r is the radius of the sphere.
The Haversine (
is computed directly from the latitude and longitude of the two points:
where
are the latitude of point 1 and 2.
is the longitude of points 1 and 2.
The haversine function computes half a versine of the angle
:
Then the distance between the two points:
Great circle distance [
15]: In geometry, the great circle is defined as the intersection between a plane and a sphere that passes through the center of the sphere. This formula was manually implemented, and it is defined as follows:
where the
d is the distance,
R is the Earth radius, a constant with
km, and the same notation as before for the latitude (
) and longitude (
) of each point.
Euclidean: by neglecting the curvature of the Earth, the euclidean distance between the two points is defined:
Geodesic WS84 [
15]: in this case, the model of the Earth is an ellipsoid; in particular, the WS84 is the most globally accurate model. The used implementation is that included in the GeoPy library.
All the metrics defined above are suitable to calculate distances between two points. To compute distance among trajectories, different approaches have been proposed [
9,
19,
20]. These approaches distinguish between trajectories defined by the same number of points or dissimilar. In the case of trajectories defined by the same number of points and lineal time complexity, there are two distances that have been used in moving objects data: the euclidean and the Hausdorff [
21]. In this paper, we explored the use of Minimum Direct-Flip (MDF) [
13], see Equation (
7).
where
denotes the Euclidean distances between two points,
s is a trajectory and
t is the mean of the Euclidean distance.
Figure 1 provides an insight of the working of the metric.
Quickbundles
Quickbundles (QB) [
13] is a clustering algorithm with linear time performance with respect to the number of trajectories.
QB computing time depends lineally to the number of samples because (i) it uses MDF with linear complexity and (ii) there is no reassignment of the trajectories clustering. This means that trajectories are evaluated only once when performing the clustering and opposed to K-means, where there is amalgamating phase. Clusters are defined by a triplet
, where
I is a list of indexes of the trajectories in the cluster,
n is the number of trajectories in the cluster, and
h is the trajectories sum.
h is a
matrix that can be updated for new trajectories (see Equation (
8)).
This way, the centroid of the cluster is defined by
v:
Among the strengths of this clustering algorithm, we have its above-mentioned lineal complexity, the easiness to interpret the clusters and the versatility provided by changing the number of points that define each trajectory and the threshold value to consider trajectories being similar. All of these become very useful in the exploration of a dataset of trajectories.
However, QB was developed to overcome the complexity of segmentation of the brain fibers. In order to apply it to cluster trajectories, the following are required:
4. Experiments
4.1. DataSet Description
The experiments presented in this section were performed using the Kaggle dataset of taxi trajectorie
https://www.kaggle.com/crailtap/taxi-trajectory, (accessed date 17 April 2021). both It contains trajectory information from 1 July 2013 to 30 June 2014 for 442 taxis running in the city of Porto, in Portugal. In total, 1,704,757 trajectories are available.
Each register contains 9 features, described as follows:
TRIP_ID: (string) It contains an unique identifier for each trip.
CALL_TYPE: (char) It identifies the way used to demand the taxi service. It may contain one of three possible values:
A: if the trip was dispatched from the central.
B: If the trip was demanded on a specific stand.
C: Otherwise; i.e., a trip was demanded on a random street.
ORIGINCALL (integer)
TAXI_ID (integer):
TIMESTAMP (integer) Unix Timestamp (in seconds). It identifies the trip’s start;
DAYTYPE (char)
MISSING_DATA (Boolean)
POLYLINE (String): It contains a list of GPS coordinates (i.e., WGS84 format) mapped as a string. The beginning and the end of the string are identified with brackets (i.e., [ and ], respectively). Each pair of coordinates is also identified by the same brackets as [LONGITUDE, LATITUDE]. This list contains one pair of coordinates for each 15 s of trip. The last list item corresponds to the trip’s destination, while the first one represents its start. Some trips have missing data points, in which the MISSING_DATA column is TRUE.
For this paper, the TIMESTAMP and POLYLINE features were used.
CROSS-CPP Dataset Description
The original unprocessed data included 6094 different trajectories from 480 vehicles collected from May 2019 to the October of 2021. The only attribute of the data is the GPS streamline.
During the preprocessing step, it was found out that 742 trajectories were defined with less than 100 coordinates, and they were removed. In conclusion, 5852 trajectories were taken into account for the final experimentation.
4.2. Parameter Setting
Experiments were carried out to compare the behaviour of two clustering algorithms: Quickbundles and spectral clustering.
In order to validate results, ten different experiments with batches of 2000 trajectories were conducted for each algorithm and distance metric. A total of 80 experiments were run (10 per pair algorithm-distance metric). The mean Silhouette Coefficients of each set of the 10 experiments carried out was analyzed.
To build each experimentation batch, a different approach was set depending on the data source:
Kaggle data: since there were 1,704,757 trajectories, each experiment run for an algorithm-distance metric pair comprised 2000 different trajectories each time, guaranteeing no repetition of trajectories in each experiment.
Cross-CPP data: data from only 5852 remained for experimentation after filtering, and each experiment was run over a subset of 2000 randomly selected trajectories, unavoidably repeating trajectories between experiments.
In each experiment, the hyperparameters of QuickBundles were adjusted via a 10-fold cross-validation approach. The adjusted hyperparameters were:
Resampling factor: number of coordinates to which each trajectory will be resampled (as the algorithm requires all trajectories to have the same number of coordinates).
Neighboring threshold: distance at which two trajectory points are considered neighbors.
The spectral clustering implementation used the radial basis function (RBF) kernel for its construction. It is to note that, for the Spectral Clustering algorithm, only the number of clusters hyperparameter was checked. Results are displayed for its maximum silhouette coefficient value.
4.3. Validation Metrics
The silhouette was the validation metric used, based on the definition of Rousseeuw [
22].
i is a data point in the cluster
.
a is the mean distance between
i and all other points in the cluster:
The mean dissimilarity of point
i to some cluster
is the mean of the distance from i to all points in
.
b is the smallest distance of
i to a cluster different from
:
Then, the silhouette value for one point
i is given by:
The term silhouette coefficient is defined as the maximum value of the mean
over all data of the entire dataset:
where
represents the mean
over all data for the number of clusters
k.
4.4. Results
The following tables show the average of the silhouette coefficient for the clustering algorithms and the distance measures in each of the datasets (
Table 1 and
Table 2).
One can observe that Quickbundles clearly algorithm ourperforms Spectral clustering. Observe that the geeodesic distance clearly yields the best results even when points to calculate distances in the dataset are not much apart.
6. Discussion and Conclusions
In this paper, we have presented results of the integration Quickbundles algorithm for the moving objects’ trajectory clustering in the CROSS-CPP platform. Two main aspects must be discussed: (i) integration issues and (ii) clustering algorithm performance.
In relation to integration, the CROSS-CPP marketplace makes it possible for data providers to store data so that service providers can use these data to generate applications based on the knowledge extracted from these data. In order to extract the knowledge from the data, the analytic toolbox has been developed and integrated. One of the main advantages of the analytic toolbox is its flexibility to include new functionalities; in particular, in this paper, we have shown how clustering for moving objects has been integrated. More precisely, the Quickbundles algorithm has been adapted from the health domain to calculate the clustering of trajectories. In the paper we have shown the end-to-end process (see
Section 5) from data harvesting to data analysis. In relation to performance, the Quickbundles algorithm has been proposed in this paper for trajectory clustering. We chose this instead of center-based clustering algorithms such as K-means due to the fact that spherical clusters are not appropriate for trajectory clustering. As a consequence, we conducted experiments to compare the behavior of spectral clustering and Quickbundles using different distance measures. The obtained results show that Quickbundles outperforms spectral clustering, with this behavior being coherent in both datasets: the Porto Kaggle and the CROSS-CPP data.
The silhouette metric was used to compare results. The silhouette coefficient metrics are known to be a ratio that depends on both the cohesion, that is to say the similarity inside each cluster, and on the separation, that is, the distance of points belonging to different clusters. From analyzing the obtained results, one could conclude that QB algorithm obtain clusters with a better ratio between cohesion and separation. Spectral clustering may yield clusters with better cohesion or separation metrics, but not both at the same time. However, it would be a future work to analyze these metrics in isolation.
In relation to experimentation, not all of the possible parameters for the compared algorithms have been explored. In particular, for spectral only, the number clusters has been modified in the different executions. For the case of QB, we ran experiments adjusting both the resampling factor and the neighboring threshold. In particular, for QB the experiments raised the fact that the resampling factor does not offer great changes in performance but the neighbouring threshold was shown to be determinant. This is in accordance with what has been already mentioned in the literature.
For experiments run with spectral clustering, we chose the number of clusters yielding the best results. Future work should include a deeper study by using different methods to construct the affinity matrix and fine-tuning the C and parameters, among others.
In relation with the distance measure, despite the fact that not high differences in silhouette coefficient were observed, results yielded slightly better results when using the geodesic WS84 in both datasets. Note that this distance improved performance even in the case of Porto, where the difference of distances is not very big.
Regarding time consumption, although experiments were not conducted to measure this issue, we could observe that the spectral algorithm took up to three times less the time to execute in comparison to Quickbundles. Although this should be further explored, results obtained are in accordance to literature: in [
13], the authors establish that the time of execution of QB increases linearly with the number of trajectories .
Finally, the clustering was performed in batch mode. However, it is important to note that the CROSS-CPP analytic toolbox allows for data-streaming mode. In future works, it would be interesting to analyze the behavior of the algorithm for that setting.
In conclusion, in this paper, we have presented results of the integration of Quickbundles algorithm for the moving objects trajectories clustering in the CROSS-CPP platform. It has been successfully integrated and is available in the CROSS-CPP project. The reason to choose this algorithm was based on the results shown, in which so far the QB algorithms is shown to outperform the spectral clustering algorithm.
As was already mentioned, future work shall include experimenting with streaming clustering algorithms for real-time analysis. Furthermore, deeper studies to analyze the effect of the hyperparameters on the behavior of the algorithms as well as to measure the time consumption should be required to decide the best option depending on the amount and nature of the dataset.