1. Introduction
The rapid progress of cell phones and automobiles has led to the flourishing of research based on GPS data. A key task in GPS data analysis is to identify user stopping points and uncover hidden semantic information to enhance user analysis and improve their experience. By analyzing user movement trajectories and stopping points, their interests and behavior habits can be determined, which can aid in business recommendations, pick-up point delineation, and even crime prevention.
One potential method for identifying stopping points along trajectories is to analyze their spatial–temporal characteristics, including velocity, duration, acceleration, orientation, and other relevant factors. Agamennoni et al. [
1] and Yan et al. [
2] employed a comparison of actual velocity with a predetermined threshold velocity at specific locations to identify stopping points, as well as low-speed or high-speed areas. Alvares et al. [
3] mapped trajectories to possible activities by comparing them with a given threshold of stopping duration. While threshold-based methods can yield effective results in identifying stopping points in straightforward trajectories, the use of a single indicator or fixed threshold values is insufficient to achieve satisfactory clustering outcomes for GPS trajectories that record complex traffic situations with diverse travel modes.
Stopping points typically exhibit high-density characteristics, making density-based clustering algorithms the primary approach for their extraction [
4,
5,
6]. However, trajectory points possess temporal characteristics, posing a significant challenge in incorporating temporal information into traditional Euclidean distance-based density clustering algorithms for stopping point extraction. To qualify the spatial–temporal density of data, Yang et al. [
7] integrated the characteristics of neighborhood move ability, stay time, and evaluation factor noise tolerance. The method enabled the realistic extraction of various complex trajectories with high effectiveness. Furthermore, to evaluate the movement performance of different types of points, Yang et al. [
8] developed a new moving index that combines point density and movement characteristics. A moving index Gaussian model was constructed based on this index to extract stopping points, which overcomes limitations of clustering based on stopping point features and reduces the pseudo-merging of stopping point clusters. However, this method struggles to differentiate transient stopover points from stopping points, and the analysis of moving point features at multiple scales makes the algorithm complex and inefficient for clustering.
Known for its ability to find arbitrarily shaped groups and isolate noise from other data, the DBSCAN (density-based spatial clustering of applications with noise) algorithm has been widely used in numerous fields, including automatic identification systems, anomaly detection, remote sensing, chemistry, and social sciences [
9,
10,
11,
12]. In recent times, researchers have employed DBSCAN and its various adaptations to extract stopping points from GPS trajectories in numerous studies [
13,
14,
15,
16,
17]. Palma et al. [
13] and Tran et al. [
14] characterized the neighborhood as temporal linear and selected the core point in a cluster based on the minimum time criterion, rather than the minimum number of points, and selected along the trajectory distance to apply the improved DBSCAN algorithm. To prevent misidentifying stopping points due to increased local density caused by round trips, Gong et al. [
15] proposed a two-step clustering algorithm C-DBSCAN. The first step involves extracting all stopping points while enforcing temporal and anomaly constraints. Specifically, the algorithm requires that all points in a cluster be temporally sequential and have an even distribution of direction change coefficient. In the second step, a support vector machine is utilized to distinguish between transient and active stopping points. Three features, namely stop duration, mean distance to the cluster centroid, and the shorter of the distances to home and to the workplace, are selected for supervised training. To enhance the performance of C-DBSCAN, Gong et al. [
16] proposed an improved algorithm, DBSCAN-TE, which replaces the outlier detection process in the first clustering step with entropy constraints. The entropy index (EI), which measures the average amount of information and the degree of disorder within a system [
17], is used to gauge the level of orderliness in the data. Specifically, the entropy constraint excludes clusters with an EI below a designated threshold, effectively filtering out slow-moving points with a regular pattern. However, the lack of an automated method for selecting optimal hyperparameters results in an uncertain level of accuracy for GPS data involving intricate traffic patterns and varying travel conditions. At the same time, the influence of input features of support vector machines on the correct classification rate has not been studied. The above drawbacks and the limitations of the dataset limit the scalability of the algorithm.
As another density-based clustering method, density-peak clustering (DPC) exhibits low complexity and high execution efficiency, while being capable of identifying arbitrarily shaped clusters without necessitating the use of multiple parameters [
18,
19,
20]. Fu et al. [
21] proposed a two-step algorithm to cluster stopping points from a single GPS trajectory. They customized three types of stopping points and clustered them using an improved density peak fast search and recognition algorithm. This method yielded better performance and accuracy than the DBSCAN method. Trajectory data often contain low-density areas that deviate from commonly visited locations, such as suburban tourism hotspots. In these situations, the DPC algorithm is better suited for analysis due to its introduction of a relative distance measure, which outperforms the DBSCAN method.
The methods discussed above, which are based on spatial–temporal characteristics and density clustering, have made significant progress in extracting stopping points from GPS trajectories. However, there are several drawbacks that need to be addressed. (1) Direct thresholds based on spatial–temporal characteristics are simple but cannot handle complex trajectories with various motion modes; (2) methods such as DBSCAN and its variations are computationally intensive and not feasible for addressing large volumes of data in the stopping point extraction process; (3) the challenges of mixed points resulting from slow-moving clusters and non-stopping points resulting from the trailing phenomenon have not received thorough investigation and effective solutions. To address these challenges, this paper proposes a novel two-step clustering approach based on DPC with coherence distance, as well as temporal and entropy constraints. This paper’s core contributions can be summarized as follows:
An improved DPC method, DPCC, is proposed by introducing the coherence distance metric that considers both temporal information and the Euclidean distance during the DPCC clustering stage.
Temporal and entropy constraints are introduced to address the mixing problem after clustering caused by temporal noncontinuous and slow-moving points.
A two-step DPCC-TE clustering approach is proposed to resolve the interactions between subclusters that arises during single-step clustering, validating its efficiency compared with different methods.
The rest of this article is arranged as follows.
Section 2 formulates the stopping point extraction problem and defines four types of points on trajectories, which will facilitate algorithm design.
Section 3 introduces the improved DPC algorithm, which combines the temporal–spatial information using a coherence distance metric.
Section 4 presents the procedures of the proposed two-step clustering method for extracting stopping points.
Section 5 details the experiments conducted using Geolife GPS data, and
Section 6 discusses the results. Finally, the conclusions are illustrated in
Section 7.
2. Problem Description
In this paper, we refer to locations where individuals spend an extended duration of time engaged in activities such as work or shopping as “stopping points”. We assume the existence of a trajectory, denoted by
, where each trajectory point
includes latitude (
), longitude (
), recording time (
), and point type (
). The objective of extracting stopping points is to identify the subset
Q that consists of all stopping points in
T, as defined by Equation (
1).
This paper primarily focuses on the extraction of stopping points from GPS trajectories. To analyze the causes of misidentification in detail and facilitate algorithm design, non-stopping points are categorized into three distinct types: moving points, slow-moving points, and temporary stopping points. Using this categorization, all types of points in the trajectory can be denoted as
.
Figure 1 shows a typical distribution of the various types of points in a trajectory.
The following instructions outline how the figure should be interpreted:
Moving points: the trajectory points generated by a user during the journey from one location to another, usually expressed as a straight line;
Slow moving points: similar to the moving points, but with a lower velocity, resulting in an increase in local density;
Temporary stopping points: the trajectory points generated when the user stops at a location for a short time, such as waiting for a cab, buying a bottle of water, etc.;
Stopping points: the points where a person stays and engages in activities, such as shopping, gathering, working, etc. Unlike temporary stopping point, stopping points are purposeful and typically last for longer periods.
Based on above classification of trajectory points, reasonable speculations can be made about the characteristics of stopping points. Stopping points exhibit a relatively higher local density compared to other types of points. Additionally, the internal trajectories of clusters composed of stopping points exhibit a haphazard pattern. Given these inferences, it is possible to explore the potential interference of different point types in accurately identifying stopping points.
For moving points, passing through the same location at different times results in an increase in local point density in the region, potentially reaching the level of the stopping category. This indicates that relying solely on density as a criterion for classification is not practical.
For slow moving points, the local density increase is mainly due to a slower moving speed relative to the moving point. However, despite its higher density, the trajectory of slow moving points displays a greater degree of orderliness.
For temporary stopping points, the main difference with stopping points is that the user engages in simple activities such as waiting for a cab or purchasing water. This results in a smaller activity area and shorter dwell time compared to the stopping point.
Based on the above analysis, it can be seen that the extraction of stopping points is affected by various factors. In addition, the three types of non-stopping points in a cluster may also interact with the stopping points, leading to the interaction between sub-clusters, as shown in the example in
Section 4.3. To address these challenges, we propose a two-step clustering method using improved density peak clustering for secondary clustering to extract stopping points.
3. Improved Density Peak Clustering Algorithm Utilizing Coherence Distance Metric
GPS tracking points typically include both latitude and longitude as well as time information. Traditional density peak clustering relies on the planar distance as the sole measure in the distance matrix, which is straightforward but does not take advantage of the time information available in GPS track points. As stopping points within the same area tend to have slower speeds and be closer to each other, this paper proposes an improved density-peak clustering algorithm that incorporates a coherence distance calculated using latitude, longitude, and time information rather than the traditional Euclidean distance. By doing so, the proposed approach enables a better utilization of temporal information for GPS tracking data clustering.
3.1. Raw Density Peak Clustering Algorithm
Proposed by Alex Rodriguez and Alessandro Laio in 2014 [
18], DPC is a type of density-based clustering algorithm. In contrast to the traditional DBSCAN algorithm, the DPC algorithm relies on two primary assumptions:
Given the aforementioned assumptions, the density peak clustering algorithm requires the calculation of two metrics separately for each data point i: the local density and the relative distance .
For the dataset
, the distance between data points
and
is denoted by
. The local density
of data point
is defined in two ways, truncated kernel and Gaussian kernel, or under the definition of truncated kernel, the local density
is shown in Equation (
2):
where
if
and
otherwise, and
is a cutoff distance.
Under the definition of the Gaussian kernel, the local density
is shown in Equation (
3).
The minimum distance
from data point
to other points with higher density is defined as Equation (
4).
In the DPC algorithm, the appropriate local density is obtained by selecting the cutoff distance . Once the local density and the relative distance of all data points are calculated, a two-dimensional decision map of data points is constructed based on and . The clustering centers are then selected based on their distribution in the decision map. After identifying all the cluster centers, the remaining points are assigned to the cluster with the nearest point that has a higher density, completing the clustering process. Compared to the DBSCAN algorithm, the DPC algorithm can achieve better clustering results in remote low-density regions due to its consideration of the relative distance metric .
3.2. DPC Using Coherence Distance Based on Spatio-Temporal Consistency
A GPS trajectory T that contains n trajectory points can be expressed as the sequence . Each trajectory point records the longitude, latitude, altitude, time, and other information. For the purposes of this paper, only the latitude, longitude, and time information will be taken into account.
Density-based clustering methods, such as DBSCAN and DPC, rely on the distance between data points and clusters. However, when addressing GPS trajectories that include time stamps, direct density-based approaches run the risk of losing temporal information. To extract the user’s stopping points, which refers to a location where the user stays for a period of time and performs some activities, it is essential to combine the coordinates and timestamps. To enhance the utilization of temporal information in a trajectory and accurately extract a user’s stopping points, this paper introduces the concept of coherence distance, which is used to construct a new distance matrix referred to as the consistency matrix. The coherence distance is calculated as shown in Equation (
5).
where
is the distance between trajectory points
and
,
is the deflation index, and
is the time difference between
and
.
When the distance between trajectory points and , denoted as , is shorter but the time difference is longer, the coherence value is smaller. This indicates that the user has stayed in this area for a longer period of time, making it more likely that the two points belong to the same area. Thus, in calculating the local density , we substitute the Euclidean distance in the distance matrix with the coherence distance. However, in computing the relative distance , the Euclidean distance is used according to the second assumption of DPC. As a result, proposed DPCC (DPC with coherence distance metric) can incorporate both spatial and temporal information, instead of relying solely on physical distance.
4. Two-Step Clustering to Extract Stopping Points
Taking the characteristics of stopping points into account, this section employs density peak clustering with coherence distance based on temporal and entropy constraints and proposes a two-step clustering stopping point extraction method called DPCC-TE. First, trajectories are clustered using DPCC, each cluster is temporally segmented, and each trajectory sequence undergoes an entropy test, known as the first step DPCC-TE. Subsequently, each sequence is clustered, segmented, and entropy-tested again to eliminate some extended trajectories that are indistinguishable by the first step DPCC-TE. As a result, this whole algorithm is referred to as the two-step DPCC-TE. The specific workflow of this method is illustrated in
Figure 2.
4.1. Temporal Sequence Constraint
After applying the improved density peak clustering algorithm introduced earlier, the clusters of trajectory are extracted, which are more aggregated both in time and space. However, even the temporal and spatial regions with high local density may still not necessarily be stopping points, as density-based clustering algorithms heavily rely on the accuracy of local density. As previously discussed in the problem description section, non-stopping points may exhibit an increase in local density due to round trips to the same location in different trajectories or within a single trajectory. To address this issue, this paper takes the following two approaches.
Multiple trajectories may pass through a particular location multiple times during human activities, such as an intersection used for commuting to work, resulting in higher local densities in that area. To mitigate local density increase, the trajectories in the dataset are processed individually rather than as a whole collection of trajectories. This approach effectively avoids the accumulation of density at the location of interest. However, a similar situation may occur in a single trajectory, where non-stopping points cannot be differentiated using the aforementioned method.
As shown in
Figure 3, the cluster contains stopping points 4–11 and moving points 21–22. Some of the moving points pass through this area, resulting in their clustering into stopping points.
However, it is observed that there exists a discontinuity in time between the moving and stopping points in the cluster. If the trajectory points are marked in chronological order, the trajectory points within the cluster exhibit a jumping index, as seen in
Figure 3 at 11 and 21. Consequently, the points in the cluster can be divided into two parts, points 4–11 and points 21–22. Based on this, the temporal sequence constraint can be defined as follows.
Temporal sequence constraint: in a cluster, the trajectory sequence must demonstrate continuous growth over time without any discontinuities, with each coordinate point being indexed chronologically. Following the clustering process, the resultant cluster will be partitioned into sub-clusters that conform to a strict order of sequence increase, one by one.
4.2. Entropy Constraint
After applying clustering and temporal constraints, different trajectory sequences are obtained. Although these trajectory sequences are not connected to each other, the points within each sequence are continuous in time. However, due to the presence of slow moving points, a relatively low moving speed also leads to the increase in local density, making them be misidentified as stopping points. Nevertheless, it can be observed that although the slow moving points are dense, they move mainly in one direction, as indicated by the similarity of the direction angle between the front and rear trajectories. This reflects the orderliness of the trajectory sequence. On the contrary, the trajectories of the stopping points exhibit a disorder of directional angles due to interactions within a certain region. To quantify the disorder in the trajectory sequence, we use the entropy index, specifically, the Theil index, an economic indicator first proposed by Henri Theil to measure economic inequality [
22], to measure the orderliness of the trajectory sequences.
The entropy constraint: Trajectory points exhibit directional regularity when a person just passes by, but they exhibit directional chaos when a person stays in one place. To quantify the regularity of the trajectories, this paper uses the entropy index to describe the chaos degree of trajectories. Specifically, for a trajectory sequence, the directions of the trajectory refer to the directions formed by the orientation angles between each pair of consecutive points. When the entire space of Cartesian coordinates is divided into
D groups equally, the entropy index of trajectory sequence
S is calculated by Equation (
6),
where
M is the number of intervals for the direction and
is the count of directions falling into interval
d in the trajectory sequence
S.
4.3. Two-Step Clustering Based on DPCC-TE
By employing methods with temporal and entropy constraints, we can remove those points in the trajectory that are erroneously classified as stopping points due to the local density increases caused by multiple round trips or slow GPS carrier movements. However, in reality, the clusters obtained through clustering may contain multiple types of blended points, leading to complications. For example, consider the scenario illustrated in
Figure 4:
As shown in
Figure 4, the points inside the red circles are extracted as a cluster in this trajectory. After applying the temporal sequence constraint, this cluster is divided into sub-cluster 1 (points 5–13) and sub-cluster 2 (points 18–24). It should be noted that in sub-cluster 1, the moving point trajectory (points 12–13) extends from the stopping points cluster (points 5–11), causing a trailing phenomenon. However, due to the excess of stopping points over moving points in sub-cluster 1, its entropy index is lower than that of a sequence composed entirely of stopping points, but still exceeds the threshold, resulting in sub-cluster 1 being identified as a stopping cluster. The occurrence of this phenomenon is due to the mutual influence of different sub-clusters, resulting in the presence of different types of points in the sub-cluster. Moreover, this phenomenon can occur in other combinations, making the extraction of stopping points more challenging.
The issue is addressed by proposing a two-step DPCC-TE clustering algorithm, which is shown as Algorithm 1. As previously mentioned, trajectories are first clustered, and then the resulting clusters are subjected to temporal and entropy constraints to derive the stopping points, which is called the first step. For the extracted clusters, we then repeat this process again to obtain the final stopping points. This algorithm is referred to as the two-step clustering algorithm in this paper, and its detailed procedure is shown in the next subsection. The proposed two-step clustering algorithm effectively addresses the issue of subcluster interaction, which is a common problem in density-based algorithms when addressing different points located in the same spatial location but recorded at different times. Following the first round of DPCC-TE, the algorithm divides clusters into temporally continuous subclusters. Subsequently, the local density of these subclusters is recalculated before the second-step clustering process. For subclusters that have tailing caused by moving points, the local density is recalculated using a new threshold to remove these points. However, other types of points require an additional round of DPCC-TE to eliminate them.
Algorithm 1 Two-Step Clustering to Extract Stopping Points |
- Require:
Trajectory , where . - Ensure:
, and is stopping point. - 1:
Read trajectories T. - 2:
for to n do - 3:
for to i do - 4:
- 5:
- 6:
end for - 7:
end for - 8:
for to n do - 9:
Calculate coherence distance based local density by Equation ( 3), where - 10:
Calculate minimum distance by Equation ( 4) with Euclidean distance - 11:
end for - 12:
Obtain cluster centers by decision diagram - 13:
Get one-step clusters with DPCC. - 14:
for in do - 15:
Get by temporal constraint. - 16:
end for - 17:
for in do - 18:
if then - 19:
- 20:
end if - 21:
end for - 22:
for in do - 23:
Obtain using the method described in lines 2 to 24. - 24:
end for - 25:
Get the set of stopping points set
|
4.4. Algorithm Time Complexity Analysis
Let the number of trajectories in the dataset be N and the number of coordinate points of the ith trajectory be ; then, the computation of the ith trajectory is
Calculating the distance matrix, time complexity ;
Calculating the local density, time complexity ;
Selecting clustering centers, time complexity );
Time order constraint and entropy constraint, time complexity .
Therefore, for the ith trajectory, its time complexity is . For the whole dataset, the time complexity depends on the trajectory with the highest number of trajectory points.
In contrast to the original DPC, this paper does not increase the time complexity when addressing a single trajectory. Moreover, for the whole dataset with N points totally, the algorithm’s complexity decreases from to , and its worst case degenerates to due to the use of split trajectory operation. Compared to other clustering methods that use trajectory segmentation and have a time complexity of , this algorithm clusters twice, resulting in a constant increase in time consumption, but no increase in the computation complexity.
6. Discussion
The experimental results demonstrate the effectiveness of the proposed two-step clustering algorithm based on DPCC in identifying stopping points from trajectory data with timestamps. The second clustering step eliminates the mutual interference between different sub-clusters and improves the accuracy of stopping point extraction.
Distinct distribution characteristics are observed among various types of trajectories, with trajectories in the stopping point region typically displaying greater disorder in direction.
Figure 6 illustrates that the entropy index of stopping points, temporary stopping points, slow moving points, and moving points decreases when
. This result is consistent with both their physical meaning and the previously analyzed assumption of entropy index. For short trajectory sequences with
, there is a possibility of mixing with moving points. On the other hand, when
, the number of temporary stopping points decreases significantly. These findings confirm the effectiveness of the entropy constraint in the proposed two-step clustering algorithm.
The proposed two-step DPCC-TE algorithm is density-based. However, the traditional calculation of density relies solely on spatial information, which causes the points in the same space area but at different times to have an impact on clustering. This is also the root cause of the situations illustrated in
Figure 4 and
Figure 8. To address this challenge, this paper introduces the temporal aspect using two methods: (1) incorporating the consistency distance and (2) utilizing time-series segmentation. These techniques aim to account for the impact of time on density calculation, thereby enabling more accurate and robust stopping point identification.
As depicted in
Figure 8a, even after the initial DPCC-TE clustering step, there are still instances of misidentified stopping points due to increased local density of non-stopping points and mixed multiple-type clusters in the trajectory. To address this issue, a second DPCC-TE clustering step is performed to remove the extended trajectory pieces of the moving points in the sub-clusters, as shown in
Figure 8b. This approach effectively eliminates the extended trajectory, leading to improved accuracy in stopping point recognition.
The performance comparison results in
Table 2 indicate that the proposed two-step DPCC-TE algorithm outperforms the other two methods in terms of accuracy, precision, and F1 score, demonstrating its effectiveness in extracting stopping points from GPS trajectories. Although the processing time of the two-step clustering method is longer than the other two methods due to two rounds of clustering, its performance is significantly better, indicating that the increased processing time is worthwhile. Meanwhile, it should be noted that the algorithm has multiple parameters, including two rounds of clustering, cutting, and entropy constraint operations, making optimization challenging. To address this issue, future research can leverage prior knowledge from the first round of clustering to achieve adaptive parameter adjustment during the second round of clustering. Additionally, the output of the proposed algorithm can be utilized as input to other classifiers to further enhance stopping point recognition performance.
7. Conclusions
Stopping points extraction is a crucial task in the process of trajectories analysis. In this paper, we present a novel clustering algorithm called two-step DPCC-TE for extracting stopping points from GPS trajectories. The main focus is to develop an effective approach that combines both temporal and spatial distance information to extract stopping points accurately. First, a new consistency distance index is introduced to the design of density peak cluster, which effectively combines the temporal and spatial distance information. Next, the temporal sequence constraint is utilized to guarantee the temporal consistency of points within a cluster, while an entropy constraint ensures stopping points extraction by the chaos degree of trajectories. Finally, a second-step clustering is applied to each subcluster to solve the problem of subcluster mixing that cannot be eliminated by a single clustering step. Simulation and experimental results demonstrate the effectiveness of the method.
This experiment demonstrates that the entropy index of different types of trajectory points follows a certain order of magnitude, which does not vary with the length of the trajectory, and can be used to distinguish stopping points from non-stopping points. Meanwhile, compared with DBSCAN-TE and single-step clustering, the two-step clustering achieved best accuracy in the process of stopping point extractions from GPS trajectories. In contrast to the DPC method which operates on all data, this algorithm adopts a trajectory-based processing approach, reducing the algorithm complexity to depending on the maximum trajectory length. Compared with traditional one-step methods addressing one trajectory at a time, the algorithm may take longer to run due to the use of two clustering steps, but there is no increase in time complexity. In addition, the algorithm has demonstrated improved accuracy in stopping point extraction compared to traditional one-step methods, making it a viable option for trajectory analysis. Therefore, with each step having a clear physical meaning, the proposed two-step clustering algorithm can effectively extract stopping points based on latitude, longitude, and time information from GPS data. In the future, we will investigate the optimization of parameters and explore the combination of other classifiers to enhance the accuracy of stopping point extraction.