1. Introduction
Clustering algorithms divide objects into clusters based on the similarity between them, ensuring high similarity among objects within the same cluster and lower similarity between objects in different clusters [1,2]. Researchers have applied these algorithms for applications such as image clustering, modulation recognition, and vehicle detection [3,4,5,6,7].
As an important branch of clustering algorithms, density-based clustering algorithms group data according to the density distribution of the data points [8,9,10,11,12,13]. DBSCAN [8] and DPC [13] are two typical density-based clustering algorithms. DBSCAN does not require the number of clusters to be specified in advance; it identifies core objects by counting the neighbors contained within a given radius and then expands each cluster by repeatedly extending the reachable range of its core objects. DBSCAN can cluster data of arbitrary shape; however, when densities differ significantly within a dataset, it struggles to distinguish the clusters using a single radius. OPTICS [9] is an improved algorithm based on DBSCAN that reduces the dependence on the radius parameter. By sorting the objects in the dataset, it produces an ordered object list that supports cluster detection under different parameter settings. However, OPTICS does not explicitly generate clustering results, and extracting clusters from this ordered list depends on the user's judgment.
DPC is a clustering algorithm based on density peaks. It constructs a decision graph by analyzing the density of each point and its distance to nearby higher-density points. Based on this decision graph, it identifies cluster centers and guides the clustering process around these centers. However, when the data distribution is uneven, this center selection mechanism is prone to errors (as discussed in Section 2). Some DPC variants streamline this process by taking the number of clusters as an input parameter to determine the cluster centers, rather than relying on manual selection [14,15,16,17,18,19,20].
Many researchers have proposed improved algorithms based on DBSCAN and DPC. Zhu et al. proposed the K-DBSCAN algorithm to address the difficulty of setting appropriate parameters in DBSCAN [15]. Guan et al. constructed sub-clusters using k-nearest neighbors (knn) and then merged these sub-clusters according to the target number of clusters [14]. Long et al. measured the similarity between family trees and tree branches and then obtained the final clustering results using graph-cut ideas [16]. Abdulrahman et al. re-defined density using a fuzzy neighborhood and further constructed backbone and boundary points [17]. Some algorithms have drawn on the strengths of both DBSCAN and DPC. DCHDP is a density-connected hierarchical clustering algorithm based on these two algorithms, which improves the accuracy of identifying clusters with arbitrary shapes and different densities [18]. LGD utilizes local differential density to identify core points and boundary points and adopts the allocation mechanism from DPC to assign the remaining points to the initial clusters [19]. ConDPC draws on the idea of DBSCAN to construct connected groups and then modifies the allocation method to distinguish clusters with manifold structures [21]. All of the above algorithms, however, require the number of clusters to be pre-defined. DBSCAN-DPC extracts initial clusters based on the DBSCAN algorithm and performs cluster expansion based on DPC [22]. This algorithm identifies cluster centers better than DPC; however, it performs poorly when clustering manifold structures.
Drawing on the cluster expansion of DBSCAN and the label allocation mechanism of DPC, this study introduces local relative density and proposes a clustering algorithm that automatically identifies the cluster centers and the number of clusters. This addresses the poor clustering performance caused by the manual selection of cluster centers or preset cluster numbers, as well as by large density differences between clusters, in DPC and some of its variants.
The contributions of this study are as follows:
A method for automatically identifying cluster centers and the number of clusters is proposed. The algorithm selects the point with the highest original density from all points as the center of the first cluster. After completing the cluster expansion, the point with the highest original density from the remaining points is selected as the center of the second cluster. This process continues, and the algorithm does not require pre-specification of the number of clusters, allowing it to automatically identify all cluster centers.
A local relative density calculation method is proposed and applied for cluster expansion. Compared to the original density calculation methods in DBSCAN and DPC, the local relative density can more accurately reflect the density level of a point within its cluster. Based on this density, the algorithm identifies core points and performs cluster expansion around the core points and their k-nearest neighbors (knn). The use of local relative density is more effective for clustering in datasets with uneven density.
A secondary allocation mechanism for micro-clusters is proposed. The algorithm identifies small clusters and reallocates the points within these micro-clusters to the appropriate clusters, thereby improving the accuracy of clustering.
The remainder of this paper is organized as follows. Section 2 reviews the relevant algorithms. Section 3 presents the implementation logic of RDBSCAN and demonstrates how the algorithm works. Section 4 reports the comparative experiments. Finally, the Discussion and Conclusions are provided.
2. Related Works
This Section introduces the implementation logic of DBSCAN and DPC.
2.1. DBSCAN
In DBSCAN, for each point $p$ in a dataset $X$, its neighbors $N_{\varepsilon}(p)$ are defined as follows:
$$N_{\varepsilon}(p) = \{ q \in X \mid \mathrm{dist}(p, q) \le \varepsilon \}, \quad (1)$$
where $\varepsilon$ is a given radius. If $|N_{\varepsilon}(p)| \ge MinPts$, then point $p$ is considered a core object.
In DBSCAN, the reachability between points is divided into two categories: direct density reachability and density reachability. For a core point, all of its neighbors are directly density-reachable from it. For any two points $p$ and $q$, if there exists a chain $p_1, p_2, \ldots, p_m$ with $p_1 = p$ and $p_m = q$, such that each $p_{i+1}$ is directly density-reachable from $p_i$, then point $q$ is considered density-reachable from point $p$.
Based on the concept of density reachability, the algorithm selects any core point and its neighbors to create a new cluster C. Then, the unassigned density-reachable points of any point within C are successively added to C until no further expansion is possible. The process is repeated until all core points have been processed.
From the above, it can be concluded that the algorithm uses a fixed $(\varepsilon, MinPts)$ combination to perform global clustering. If there is a large difference in density between clusters, it may be impossible to select a single combination that suits all of them.
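For readers who want to experiment with this definition, the following minimal Python sketch implements the ε-neighborhood and core-object test of Equation (1); the function names and the toy data are illustrative only and are not taken from the original DBSCAN implementation.

```python
import numpy as np

def eps_neighborhood(X, p_idx, eps):
    """Return indices of points within distance eps of point p_idx (Eq. 1)."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)
    return np.where(dists <= eps)[0]

def is_core_object(X, p_idx, eps, min_pts):
    """A point is a core object if its eps-neighborhood holds at least min_pts points."""
    return eps_neighborhood(X, p_idx, eps).size >= min_pts

# Toy data: a dense blob plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [2.0, 2.0]])
print(is_core_object(X, 0, eps=0.2, min_pts=4))   # True: point 0 sits in the dense blob
print(is_core_object(X, 4, eps=0.2, min_pts=4))   # False: the isolated point has no neighbors
```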
2.2. DPC
The DPC algorithm assumes that cluster centers have higher densities compared to their surrounding points and that the cluster centers are far apart from each other. In this algorithm, any point i has two important attributes: its density and the distance to the nearest point j with a higher density than itself. For simplicity in this paper, point j is referred to as the “parent neighbor” of point i, and the distance between i and j is called the “neighbor distance” of point i.
In DPC, the density $\rho_i$ and neighbor distance $\delta_i$ of point $i$ are defined as shown in Equations (2) and (3) [13]:
$$\rho_i = \sum_{j \ne i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \ge 0, \end{cases} \quad (2)$$
$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}, \quad (3)$$
where $d_{ij}$ is the distance between points $i$ and $j$ and $d_c$ is the cutoff distance; for the point with the highest density, $\delta_i = \max_j d_{ij}$.
Based on the density and neighbor distance of each point, DPC can construct a decision graph where the x-axis represents density values, and the y-axis represents neighbor distance values. This decision graph helps identify points with both high density and large neighbor distance, as these points are most likely to become cluster centers.
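The following sketch shows one way to compute the two attributes behind the decision graph, assuming the cut-off kernel form of Equations (2) and (3); the function name and the choice of d_c are illustrative and this is not the reference DPC code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_rho_delta(X, d_c):
    """Compute DPC density rho (cut-off kernel, Eq. 2) and neighbor distance delta (Eq. 3)."""
    D = squareform(pdist(X))                      # pairwise distance matrix
    rho = (D < d_c).sum(axis=1) - 1               # count neighbors within d_c, excluding the point itself
    n = len(X)
    delta = np.empty(n)
    parent = np.full(n, -1)                       # "parent neighbor": nearest point of higher density
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:                      # global density peak
            delta[i] = D[i].max()
        else:
            parent[i] = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, parent[i]]
    return rho, delta, parent
```

Plotting rho against delta for every point yields the decision graph described above; points in the upper-right corner are the candidate cluster centers.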
Using the dataset 2sp2glob as an example, the DPC algorithm can plot a decision graph based on the density and neighbor distance of each point, as shown in Figure 1a.
From Figure 1a, it is evident that four points have significantly higher density and neighbor distance compared to the others. Using this decision graph, the algorithm selects these four points as cluster centers. Subsequently, the remaining points inherit the labels of their parent neighbors in descending order of density, completing the clustering process.
However, when there is a significant density difference between clusters, this cluster center selection mechanism is prone to errors. Taking the Jain dataset as an example, the optimal result of DPC on this dataset is shown in Figure 2. Figure 2a displays the corresponding decision graph, where two points in the upper-right corner clearly have an advantage in terms of density and neighbor distance. However, as shown in Figure 2b, both of these points belong to the high-density bottom cluster, while the sparse upper cluster fails to select any density peak as its center.
2.3. Relative Density
Relative density is used to measure the density level of a data point within its local region, which helps in identifying cluster centers and expanding cluster boundaries. It is especially useful for handling data with uneven density.
The density ratio is one method for calculating relative density. ReconDBSCAN is a density-ratio-based improvement of the DBSCAN algorithm [23]. This algorithm uses the ratio of the density estimates in two differently sized neighborhoods of the same point as the new density for that point. It then uses this density and a ratio threshold to identify core points and expand clusters. Dratio-DPC is a density-ratio-based improvement of DPC [24]. This algorithm does not use the density ratio directly as a density estimate but instead uses it as a parameter to adjust the original density. These two algorithms outperform DBSCAN and DPC when clustering datasets with uneven density. However, both introduce an additional parameter for the density ratio (ReconDBSCAN adds a neighborhood parameter, while Dratio-DPC adds a nearest neighbor count parameter), which increases the difficulty of parameter selection.
LGD is a clustering algorithm based on local differential density [19]. The algorithm uses the ratio of the density of point $i$ to the maximum density among its neighbors as the new density. This algorithm significantly improves clustering accuracy compared to DPC; however, it also requires the number of clusters to be pre-defined as a parameter.
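As a small illustration of this relative-density idea, the sketch below computes an LGD-style quantity: each point's density divided by the maximum density among its k nearest neighbors. The inverse-average-distance density and the function name are simplifications assumed here, not the cited method's reference implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_differential_density(X, k):
    """Ratio of each point's knn-based density to the maximum density among its k nearest neighbors."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]            # drop the query point itself
    rho = 1.0 / (dist.mean(axis=1) + 1e-12)        # simple inverse-average-distance density (assumed form)
    return rho / rho[idx].max(axis=1)              # density relative to the densest neighbor
```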
3. Proposed Algorithm
Inspired by DBSCAN and DPC, this study introduces the concept of local relative density and a new method for determining core points, and uses knn for cluster expansion. During the secondary label assignment process, RDBSCAN adapts and improves the label assignment approach of DPC.
To better identify core points in data with uneven density, this study introduces the concept of relative density and uses the ratio of a point's local density to that of its local center as the local relative density.
As described in Section 2.2, DPC does not account for uneven density between clusters when comparing density values during label assignment. This assignment mechanism can result in a point being labeled with the cluster identifier of a closer point from another cluster. To address this issue, this study considers the density differences within the k-nearest neighbor neighborhoods when searching for parent neighbors.
3.1. Definitions
Definition 1. Original local density.
This study defines a k-nearest-neighbor-based kernel density estimate (KDE). For point $i$, the closer its k-nearest neighbors are, the higher its local density becomes. The definition is as follows:
$$\rho_i = \sum_{j \in knn(i)} \exp\!\left( -\frac{d_{ij}^{2}}{2h^{2}} \right), \quad (4)$$
where $knn(i)$ represents the set of k-nearest neighbors of point $i$, $d_{ij}$ is the distance between points $i$ and $j$, and $h$ is the KDE bandwidth.
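A minimal sketch of Definition 1 is given below, assuming a Gaussian kernel restricted to the k nearest neighbors (the exact kernel in the authors' implementation may differ); MATLAB's knnsearch is replaced here by scikit-learn's NearestNeighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def original_local_density(X, k, h):
    """Original local density of each point: a Gaussian-kernel sum over its k nearest neighbors
    (kernel form assumed; see Definition 1). Also returns the knn distances and indices."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
    dist, idx = nbrs.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                  # drop the query point itself
    rho = np.exp(-(dist ** 2) / (2.0 * h ** 2)).sum(axis=1)
    return rho, dist, idx
```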
Definition 2. Processing status indicator.
Let $x_i$ be a point in the dataset. The processing status indicator is defined as follows:
$$s_i = \begin{cases} 1, & \text{if } x_i \text{ has been assigned to a cluster (processed)} \\ 0, & \text{otherwise (unprocessed)}. \end{cases} \quad (5)$$
Definition 3. Local center.
Let $j$ denote the index of the current cluster; if no clusters have been created, then $j = 0$. RDBSCAN selects the point with the highest original local density among the unprocessed points as the local center of the next cluster. The local center of cluster $j + 1$ can therefore be defined as follows:
$$c_{j+1} = \arg\max_{i:\, s_i = 0} \rho_i. \quad (6)$$
Definition 4. Local relative density.
This article uses the ratio of the original local density of a point to that of the corresponding local center as the local relative density, defined as follows:
$$\gamma_i = \frac{\rho_i}{\rho_{c_j}}, \quad (7)$$
where $c_j$ is the local center of the cluster currently being expanded. $\gamma_i$ reflects the density of a point relative to the cluster center: if the point is close to the local center, its relative density is larger; if it is far from the center, at the cluster boundary, its relative density is smaller.
Definition 5. Core point.
A point whose local relative density is larger than the given threshold $\tau$ is called a core point, defined as follows:
$$\gamma_i > \tau. \quad (8)$$
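The following sketch strings Definitions 3 to 5 together, given the density vector rho from Definition 1 and a boolean processed mask; the function and variable names are illustrative.

```python
import numpy as np

def next_local_center(rho, processed):
    """Definition 3: the unprocessed point with the highest original local density."""
    candidates = np.where(~processed)[0]
    return candidates[np.argmax(rho[candidates])]

def local_relative_density(rho, center_idx):
    """Definition 4: density of each point divided by the density of the current local center."""
    return rho / rho[center_idx]

def is_core_point(rho, center_idx, i, tau):
    """Definition 5: a point is a core point if its local relative density exceeds tau."""
    return local_relative_density(rho, center_idx)[i] > tau
```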
Definition 6. Micro-cluster.
A cluster $C$ is called a micro-cluster if the number of samples in the cluster is less than the minimum cluster size, denoted $minsize$, as described in Formula (9):
$$|C| < minsize. \quad (9)$$
$minsize$ is set to 3 in this paper. For micro-clusters, a secondary assignment is performed.
Definition 7. Neighborhood density difference.
Since the number of points within each neighborhood is fixed (all are $k$), this study uses the ratio of the average distances within the two neighborhoods to reflect the density difference. For points $i$ and $j$, with $\bar{d}_i$ and $\bar{d}_j$ denoting the average distances from $i$ and $j$ to their respective k-nearest neighbors, the neighborhood density difference can be defined as follows:
$$diff(i, j) = \frac{\max(\bar{d}_i, \bar{d}_j)}{\min(\bar{d}_i, \bar{d}_j)}. \quad (10)$$
Definition 8. Balanced neighbor distance.
Using the neighborhood density difference, the neighbor distance can be adjusted as shown in the following formula:
$$d'_{ij} = diff(i, j) \cdot d_{ij}. \quad (11)$$
From Formula (10), it can be seen that $diff(i, j)$ reflects the density difference between the neighborhoods of points $i$ and $j$. The smaller the density difference, the closer $diff(i, j)$ approaches 1, and the adjusted distance $d'_{ij}$ tends to the original $d_{ij}$. The larger the density difference, the larger $diff(i, j)$ becomes, making the adjustment more significant, with the adjusted distance greater than the original $d_{ij}$.
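A small sketch of Definitions 7 and 8 under the max/min ratio reading used above; knn_dist is assumed to be an n × k matrix of distances from each point to its k nearest neighbors (for example, the dist array returned by the Definition 1 sketch).

```python
import numpy as np

def neighborhood_density_difference(knn_dist, i, j):
    """Definition 7: ratio of the average k-nn distances of points i and j.
    The max/min form (assumed here) makes the ratio symmetric and >= 1."""
    avg_i, avg_j = knn_dist[i].mean(), knn_dist[j].mean()
    return max(avg_i, avg_j) / min(avg_i, avg_j)

def balanced_neighbor_distance(knn_dist, d_ij, i, j):
    """Definition 8: the neighbor distance d_ij scaled by the density difference (Eq. 11 as reconstructed)."""
    return neighborhood_density_difference(knn_dist, i, j) * d_ij
```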
3.2. RDBSCAN
The algorithm processes points from high to low local density. It selects the point with the highest local density as a local center and expands the cluster through the k-nearest neighbors of core points (those whose local relative density exceeds $\tau$) until no more points can be added, forming a cluster. It then selects the point with the highest local density among the remaining points and expands it in the same way, forming the next cluster. This process is repeated until all points have been visited. Points that end up in micro-clusters undergo secondary processing and allocation.
The main structure of the algorithm is shown in Algorithm 1. In this algorithm, Algorithm 2 is called, which is primarily used for cluster expansion.
Algorithm 1: Clustering based on local relative density (RDBSCAN).
Input: dataset X; number of nearest neighbors k; local relative density threshold τ
Output: clustering result
1   normalize the dataset X;
2   construct the k-nearest neighbor distance matrix using the knnsearch function;
3   calculate the original local density ρ of each point;
4   sort the points in descending order of ρ to obtain the set D;
5   mark all points as unprocessed;
6   for point p in D
7       if p is not processed
8           create a new cluster C;
9           mark p as processed and as the local center of C;
10          Expand(p, p's k-nearest neighbors, C, τ);
11      end if
12  end for
13  sort the points in micro-clusters in descending order of ρ to obtain the set S;
14  for point q in S
15      find q's parent neighbor according to Formula (11);
16      assign the parent neighbor's cluster label to q;
17  end for
18  return the clustering result
Algorithm 2: Expand.
Input: point p; p's k-nearest neighbors, denoted pneighbors; cluster C; local relative density threshold τ
Output: clustering result of C
1   mark p's cluster label as C;
2   for point q in pneighbors
3       if q is unprocessed
4           mark q as processed and q's cluster label as C;
5           calculate q's local relative density γ_q according to Formula (7);
6           if γ_q > τ
7               add q's k-nearest neighbors to pneighbors;
8           end if
9       end if
10  end for
11  return the clustering result of C
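The following compact Python sketch follows Algorithms 1 and 2 under the assumptions made in Section 3.1 (Gaussian-kernel local density, max/min neighborhood density ratio, and parent neighbors restricted to higher-density points outside micro-clusters); it is an illustrative re-implementation, not the authors' MATLAB code, and it omits normalization.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rdbscan(X, k=10, tau=0.7, h=1.0, min_size=3):
    """Sketch of RDBSCAN (Algorithms 1 and 2). Returns one cluster label per point."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                     # k-nn distances/indices, self excluded
    rho = np.exp(-(dist ** 2) / (2 * h ** 2)).sum(axis=1)   # original local density (kernel form assumed)

    labels = np.full(n, -1)
    processed = np.zeros(n, dtype=bool)
    n_clusters = 0
    for p in np.argsort(-rho):                              # process points in descending density order
        if processed[p]:
            continue
        center, cid = p, n_clusters                         # p becomes the local center of a new cluster
        n_clusters += 1
        processed[p], labels[p] = True, cid
        queue = list(idx[p])                                # Algorithm 2: expand from the center's k-nn
        while queue:
            q = queue.pop()
            if processed[q]:
                continue
            processed[q], labels[q] = True, cid
            if rho[q] / rho[center] > tau:                  # core point test (Definitions 4 and 5)
                queue.extend(idx[q])

    # Secondary assignment of micro-clusters (Definition 6, min_size = 3 by default).
    sizes = np.bincount(labels, minlength=n_clusters)
    micro = np.where(sizes < min_size)[0]
    avg_knn = dist.mean(axis=1)
    in_micro = np.isin(labels, micro)
    for q in sorted(np.where(in_micro)[0], key=lambda i: -rho[i]):
        candidates = np.where((rho > rho[q]) & ~in_micro)[0]   # assumed parent-neighbor candidates
        if candidates.size == 0:
            continue
        d = np.linalg.norm(X[candidates] - X[q], axis=1)
        ratio = np.maximum(avg_knn[q], avg_knn[candidates]) / np.minimum(avg_knn[q], avg_knn[candidates])
        labels[q] = labels[candidates[np.argmin(ratio * d)]]   # smallest balanced neighbor distance (Eq. 11)
    return labels
```

For instance, calling rdbscan(X, k=6, tau=0.7) on a two-dimensional dataset mirrors the parameter setting used in the execution demonstration of Section 3.4.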
The flowchart corresponding to the above algorithm is shown in Figure 3.
3.3. Computational Complexity Analysis
In this Section, we assume that the number of samples is n, the dimensionality of the samples is d, and the number of nearest neighbors is k. We will discuss the time complexity of the RDBSCAN algorithm (Algorithm 1).
The time complexity of normalizing the dataset in line 1 is O(nd).
The time complexity of computing the k-nearest neighbor distance matrix in line 2 is O(dn log n) when the knnsearch function uses a k-d tree.
The time complexity of calculating the original density for each point using knn in line 3 is O(nk).
The time complexity of sorting the original density in line 4 is O(n log n).
The time complexity of initializing the processing status in line 5 is O(n).
Lines 6–12: For each point, the algorithm checks whether it is a core point; if it is, the k-nearest neighbors of that point are looked up in the precomputed distance matrix. The time complexity associated with this process is O(nk).
The time complexity of sorting the points in the micro-clusters in line 13 is O(m log m), where m is the number of points in the micro-clusters (m ≪ n).
Lines 14–17: In the for-loop, for each point in a micro-cluster, we find the point with the smallest adjusted distance to serve as its parent. The worst-case time complexity of this step is O(mn).
In summary, the time complexity of this algorithm is O(n log n) (treating the dimensionality d as a constant), dominated by the k-nearest neighbor search. We further discuss the time complexity of the algorithm in Appendix A. The runtime of all algorithms on the test datasets is compared in Section 4.5.
3.4. Execution Demonstration
This Section uses the Jain dataset to explain how the proposed algorithm RDBSCAN works.
First, we find the point with the highest original density among all points, which is used as the first local center (red star in Figure 4), denoted by p1 here. As p1 has not been processed, we create a new cluster C1, add p1 to C1, and mark it as processed. The unprocessed points among the six nearest neighbors of p1 (red dots in Figure 4) are added to the set pneighbors and to C1 and are marked as processed. If the local relative density of a point q in pneighbors is larger than τ (0.7), then q is a core point, and the unprocessed points among q's six nearest neighbors are also added to pneighbors and C1. This process is repeated until cluster C1 can no longer be expanded (thus forming the red cluster at the bottom of Figure 4).
At this stage, all the points in the lower arc-shaped cluster have been marked as processed, and the point with the highest original local density among the remaining points is selected as the second local center (the blue star in Figure 4), denoted by p2 here. As p2 has not been processed, we create a new cluster C2, add p2 to C2, and mark it as processed. As in the first step, expansion is performed from p2 until cluster C2 can no longer be extended (forming the upper blue cluster in Figure 4).
4. Results
4.1. Experimental Settings
4.1.1. Comparison Algorithms and Datasets
The proposed algorithm was inspired by DBSCAN [8] and DPC [13] and introduces improvements based on these algorithms. Therefore, this study compares RDBSCAN with DBSCAN and DPC, as well as with the improved algorithms DCHDP [18], LGD [19], ConDPC [21], and DBSCAN-DPC [22], which are based on these two algorithms. The density-based clustering algorithm OPTICS [9] was also chosen for comparison. In addition, LUND [25], a clustering algorithm based on graph-based diffusion geometry, was compared with the aforementioned algorithms; it addresses shortcomings of the DPC algorithm and has demonstrated superior effectiveness over DPC.
Among the nine algorithms, all except OPTICS were executed in MATLAB R2017b. As the OPTICS algorithm does not directly provide cluster extraction results, the OPTICS implementation from the scikit-learn library was used, which utilizes the xi parameter to identify cluster boundaries and thus perform cluster extraction; this algorithm was run in Python 3.8. The experiments were conducted on a computer equipped with an Intel Core Ultra 5 125H processor (1.20 GHz) and 32.0 GB of RAM.
The range of values for each algorithm's parameters is shown in Table 1. Among them, k in the RDBSCAN, LGD, DBSCAN-DPC, and LUND algorithms represents the number of nearest neighbors; the cutoff parameter in the DPC and ConDPC algorithms represents a percentage of all samples (used to determine the cutoff distance d_c); τ in RDBSCAN represents the local relative density threshold; and the threshold in LGD represents the local difference density threshold. For LUND, σ is the diffusion scale parameter, and a second scale parameter is the KDE bandwidth. In addition to the three parameters listed in Table 1, LUND has two further parameters: β, which represents the sampling rate (with a default value of 2), and τ, which represents the diffusion stationarity threshold (with a default value of 10^−5). The T in Table 1 is calculated based on these two parameters; the specific formula can be found in the original paper [26].
The best results of each algorithm on the corresponding parameter combinations were selected for comparison.
The experiments were conducted on sixteen datasets, detailed information on which is provided in Table 2. The first eight are synthetic datasets, while the last eight are real-world datasets. All datasets were normalized before clustering. To improve the running efficiency, we preprocessed the Olivetti faces dataset using the settings from [7]: the original images were first scaled to 15 × 15, PCA was then applied to select features with a cumulative contribution rate greater than 90%, and the processed data were finally passed to the clustering algorithms.
4.1.2. Clustering Metrics
Clustering metrics are tools used to evaluate the performance of clustering algorithms and help to measure the quality of clustering results. The results were evaluated using the commonly used metrics ARI [34], F-measure [35], and VI [36].
The Adjusted Rand Index (ARI) is a commonly used clustering evaluation metric that measures the consistency, or similarity, between two clustering results. It is a chance-corrected version of the Rand Index (RI). The formula for ARI is shown in Equation (12):
$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}. \quad (12)$$
The F-measure is the harmonic mean of precision and recall; it combines these two metrics to evaluate the performance of a classification or clustering model. Assuming there are m clusters, the overall F-measure is calculated as the average of the F-measures across all clusters [18]:
$$F = \frac{1}{m} \sum_{i=1}^{m} F_i, \qquad F_i = \frac{2 \, P_i R_i}{P_i + R_i}, \quad (13)$$
where $P_i$ and $R_i$ are the precision and recall of cluster $i$.
VI (Variation of Information) combines the concepts of mutual information (MI) and conditional entropy to assess the similarity between clustering results and true labels. The corresponding formula is defined as shown in Formula (14):
$$VI(C, T) = H(C) + H(T) - 2\, I(C, T), \quad (14)$$
where $H(C)$ is the entropy of the clustering result $C$, $H(T)$ is the entropy of the true label set $T$, and $I(C, T)$ is the mutual information between the clustering result $C$ and the true labels $T$.
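The three metrics can be computed as in the following sketch. ARI uses scikit-learn's implementation; the F-measure shown here matches each predicted cluster to the true class that maximizes F, a common convention that may differ in detail from the exact computation used in this paper.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, mutual_info_score
from scipy.stats import entropy

def f_measure(true, pred):
    """Average F-measure over predicted clusters, each matched to its best-fitting true class."""
    true, pred = np.asarray(true), np.asarray(pred)
    scores = []
    for c in np.unique(pred):
        members = pred == c
        best = 0.0
        for t in np.unique(true):
            tp = np.sum(members & (true == t))
            if tp == 0:
                continue
            p, r = tp / members.sum(), tp / np.sum(true == t)
            best = max(best, 2 * p * r / (p + r))
        scores.append(best)
    return float(np.mean(scores))

def variation_of_information(true, pred):
    """VI(C, T) = H(C) + H(T) - 2 I(C, T); lower values indicate better clustering (Eq. 14)."""
    true, pred = np.asarray(true), np.asarray(pred)
    h_c = entropy(np.bincount(pred) / len(pred))
    h_t = entropy(np.bincount(true) / len(true))
    return h_c + h_t - 2 * mutual_info_score(true, pred)

true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_score(true, pred), f_measure(true, pred), variation_of_information(true, pred))
```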
4.2. Results on Synthetic Datasets
4.2.1. Visualization Results
This Section presents the clustering results on three representative synthetic datasets for visualization. The clustering results of all algorithms on the three datasets are shown in Figure 5, Figure 6 and Figure 7. The RDBSCAN, DPC, and ConDPC algorithms explicitly use cluster centers and perform cluster expansion based on these centers; in their visualization results, the cluster centers are represented by star shapes. In the visualization results of each algorithm, different colors correspond to different clusters.
Bottleneck is a linear, multimodal dataset. Its left and right elongated bar-shaped clusters have similar density distributions, with each cluster containing two high-density regions connected by a central low-density bottleneck region. The DPC algorithm selected one cluster center in each of the two high-density regions of the right-side cluster, leading to poor clustering performance. The DBSCAN-DPC algorithm, on the other hand, split the same cluster into three distinct clusters. The other algorithms performed well on the Bottleneck dataset.
The Cloud dataset is a mixture of two Gaussian distributions, in which the two clusters have overlapping points. The DBSCAN and OPTICS algorithms identified some points in the upper cluster as noise points, while other algorithms were able to better identify the main parts of the two clusters. For the overlapping region of the two clusters, the assignments made by RDBSCAN and LGD were more reasonable.
The CircleSquare dataset consists of a circular cluster with two square clusters inside. The density of the two square clusters is significantly higher than that of the circular cluster. Due to this significant density difference, DBSCAN was unable to find a suitable (ε, MinPts) combination to distinguish all of the clusters. DPC identified three density peaks within the two denser square clusters, but no points in the circular cluster were selected as density peaks due to its lower density, resulting in poor clustering performance. Among all of the algorithms, only the proposed RDBSCAN algorithm could perfectly identify the three clusters. It found the point with the highest original density (marked as a red star) and completed the upper red square cluster by means of knn-based expansion; the green square cluster and the outer blue ring were then clustered in turn.
4.2.2. Results of Evaluation Metrics
The clustering performance metrics for all eight synthetic datasets are shown in Table 3. RDBSCAN achieved the best performance on all eight datasets. It can be seen from Figure 5, Figure 6 and Figure 7 and Table 3 that, compared to DPC and DBSCAN, RDBSCAN has higher clustering accuracy and yields better visual results. Compared to the LGD, ConDPC, and DBSCAN-DPC algorithms, RDBSCAN performed better overall, especially on the Halfkernel, 2sp2glob, and CircleSquare datasets.
4.3. Results on Real-World Datasets
Eight real-world datasets were selected for the comparative experiments. Information on the attributes, cluster sizes, and other details of these datasets is provided in Table 2, and the experimental results are shown in Table 4.
In this paper, statistical tests were conducted on the performance metrics of the nine algorithms. Taking the F-measure as an example, each algorithm's results on the eight real-world datasets were treated as one group. A Shapiro–Wilk test was applied to each of the nine groups, and all were found to follow a normal distribution. A one-way ANOVA was then performed on the nine groups, yielding a p-value of 0.3693, which is greater than 0.05. It can therefore be concluded that the differences among the nine algorithms' F-measure results are not statistically significant.
To better observe the experimental results on the real-world datasets, Figure 8 presents bar charts of the corresponding metric values.
From Table 4 and Figure 8, it can be seen that RDBSCAN presented a clear improvement in accuracy over DBSCAN and DPC. Note that a lower VI score indicates better clustering performance. Across the three metrics, LGD achieved the best overall performance on the real-world datasets, followed by RDBSCAN.
4.4. Results on Different Datasets with Different k-τ Values
Figure 9 shows how the ARI results on the different datasets changed as the k and τ values were varied. For the purpose of this test, the KDE bandwidth was fixed at a value of 1. From the figure, it can be seen that, on the Ecoli and Seeds datasets, setting the number of neighbors below 20 results in relatively poor average performance. On the other six datasets, the ARI values change more smoothly with variations in the k and τ values.
In practical experiments, an initial range for k between 10 and 20, and for τ between 0.7 and 0.9, can be set to identify optimal solutions. In datasets such as Libras and Wine, local optima were observed when the number of nearest neighbors (k) was set to 40. For real-world datasets, expanding the search range for k-nearest neighbors may lead to better solutions.
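A sketch of this search procedure is given below; it assumes the rdbscan() function from the sketch in Section 3.2 is available in the same module and uses a labelled synthetic dataset so that ARI can drive the selection. It is purely illustrative of the parameter ranges suggested above.

```python
import numpy as np
from itertools import product
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Assumes rdbscan() from the Section 3.2 sketch is defined or imported in this module.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

best = (None, -1.0)
for k, tau in product(range(10, 21, 2), np.arange(0.7, 0.91, 0.05)):
    labels = rdbscan(X, k=k, tau=tau)                 # hypothetical call to the earlier sketch
    score = adjusted_rand_score(y, labels)
    if score > best[1]:
        best = ((k, round(float(tau), 2)), score)
print("best (k, tau):", best[0], "ARI:", round(best[1], 3))
```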
4.5. Runtime Comparison
As analyzed in Section 3.3, the time complexity of RDBSCAN is O(n log n). DBSCAN, DPC, and the other six comparison algorithms in Table 1 all have a time complexity of O(n²). The time taken for the nine algorithms to achieve their optimal clustering results on the 16 datasets is shown in Table 5. To ensure that all tests were run in the same environment, the OPTICS algorithm was called and executed in MATLAB R2017b here.
From the average results, the RDBSCAN algorithm has the least execution time among the nine algorithms. When the dataset has a lower dimensionality, such as in the 2sp2glob and bottleneck datasets, RDBSCAN demonstrates significant performance advantages. However, when the dataset has a higher dimensionality, such as in the Libras and Olivetti faces datasets, RDBSCAN’s execution time is comparable to that of algorithms like DPC and ConDPC. On the Olivetti faces dataset, RDBSCAN takes more time than DBSCAN-DPC.
5. Discussion
DPC calculates ρ and δ for every point in order to select all density peaks at once. As analyzed in Section 2.2, this approach can lead to multiple density peaks being selected in high-density clusters, while the density peaks of low-density clusters are overlooked. In the proposed algorithm, the selection of density peaks is dynamic: points are processed in descending order of local density, and the highest-density unprocessed point is selected as the next density peak. Once the high-density areas have been handled, the density peaks of the low-density areas are selected as well. The clustering results on the Jain and CircleSquare datasets (shown in Figure 4 and Figure 7) demonstrate the effectiveness of RDBSCAN in selecting density peaks in datasets with uneven density distributions.
If the number of samples in clusters formed in low-density regions is too small, we do not consider them to be true clusters. Points in these micro-clusters are reallocated through a secondary processing step. The minimum cluster size can also be set as an input parameter to adjust the clustering accuracy of the algorithm. For example, when the minimum cluster size was set to 6, the ARI and F-measure values for the wine dataset improved to 0.780 and 0.910, respectively.
6. Conclusions
This study proposed a clustering algorithm, RDBSCAN, based on local relative density. The algorithm uses a dynamic method to obtain density peaks, which does not require prior knowledge of the number of clusters. Based on the density peaks, the algorithm calculates the local relative density of neighboring points, identifies core objects according to a given relative density threshold, and then performs cluster expansion. For points in micro-clusters, their parent points are identified based on the adjusted neighbor distance, and they inherit the cluster labels of these parent points.
Experimental results showed that the proposed algorithm demonstrates a significant improvement in accuracy when compared to the original DPC and DBSCAN. In comparison with the original algorithms and other improved algorithms, the proposed algorithm achieved the best metric values on the synthetic datasets and the second-best on the real-world datasets. At the same time, the algorithm does not require the pre-specification of the number of clusters, which provides an advantage over improved algorithms such as LGD, ConDPC, and DCHDP.
The experiment compared the execution times of nine algorithms across all datasets. The results show that RDBSCAN has the lowest average execution time and the best performance on these test datasets. However, when the dimensionality of the datasets is higher, RDBSCAN does not demonstrate a significant performance advantage over the other algorithms.
AIS (Automatic Identification System) ship trajectory clustering analyzes ship trajectories using clustering techniques, which can be applied to port management, hazard prevention, and other areas. Due to varying conditions, such as waterway class and depth, the density of ship trajectories in different waterways varies. The RDBSCAN algorithm, which is good at handling data with uneven densities and does not require pre-setting of the number of clusters, can provide technical support for such clustering analyses. However, in practical applications, AIS data volumes are typically large, and the efficiency of the algorithm’s k-nearest neighbor search, among other factors, needs to be optimized.
Future work will explore the application of CoverTree for nearest neighbor search and storage to optimize the performance of the algorithm, as well as its further application to AIS data analysis.
Author Contributions
Conceptualization, Y.Z. and Z.W.; methodology, Y.Z. and Z.W.; software, Y.Z.; validation, Y.Z., X.W. and T.L.; formal analysis, Y.Z. and X.W.; resources, Y.Z. and Z.W.; data curation, X.W.; writing—original draft preparation, Y.Z. and X.W.; writing—review and editing, Y.Z., X.W. and T.L.; visualization, T.L.; supervision, Z.W.; funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 23KJA580002), the Excellent Teaching Team of the 2022 QingLan Project of the Jiangsu Higher Education Institutions of China (Big Data Technology Teaching Team with Shipping Characteristic), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 22KJB580002).
Data Availability Statement
The data will be shared on request.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
To analyze the time complexity of the proposed algorithm, we conducted comparative experiments using bottleneck datasets of different sizes. The datasets were generated using the code from [26] and share a similar structure, with sizes of 500, 1000, 1500, 2000, 3000, 4000, 5000, and 6000 points. Experiments on these datasets were conducted using the same parameter combination (k = 10, τ = 0.5, KDE bandwidth = 1), and all experiments achieved perfect clustering results (ARI = 1, F-measure = 1, VI = 0). Based on the results in Table A1, the total runtime was fitted to an n log n curve, as shown in Figure A1.
Table A1. RDBSCAN running times (in seconds) on Bottleneck datasets of different sizes.

| Dataset | Bottleneck 500 | Bottleneck 1000 | Bottleneck 1500 | Bottleneck 2000 | Bottleneck 3000 | Bottleneck 4000 | Bottleneck 5000 | Bottleneck 6000 |
|---|---|---|---|---|---|---|---|---|
| Total time | 0.00394 | 0.00571 | 0.00753 | 0.00914 | 0.01142 | 0.01449 | 0.01918 | 0.02224 |
Figure A1. Time complexity analysis: (a) execution time vs. dataset size with an n log n fit; (b) execution time vs. n log n (standardized) with a linear fit.
As illustrated in Figure A1, the total runtime aligns well with O(n log n) complexity.
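The fit in Figure A1 can be reproduced approximately from the Table A1 values with a simple least-squares fit of t ≈ a·n·log(n) + b; the coefficients obtained below are illustrative only.

```python
import numpy as np

# Runtimes from Table A1 (seconds) on Bottleneck datasets of increasing size.
n = np.array([500, 1000, 1500, 2000, 3000, 4000, 5000, 6000], dtype=float)
t = np.array([0.00394, 0.00571, 0.00753, 0.00914, 0.01142, 0.01449, 0.01918, 0.02224])

x = n * np.log(n)                      # transform the size to n*log(n)
a, b = np.polyfit(x, t, 1)             # linear fit: t ~ a*x + b
residual = t - (a * x + b)
print(f"a = {a:.3e}, b = {b:.3e}, RMSE = {np.sqrt((residual ** 2).mean()):.2e}")
```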
References
- Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
- Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2011; pp. 443–450. [Google Scholar]
- Xie, Y.; Lin, B.; Qu, Y.; Li, C.; Zhang, W.; Ma, L.; Wen, Y.; Tao, D. Joint deep multi-view learning for image clustering. IEEE Trans. Knowl. Data Eng. 2020, 33, 3594–3606. [Google Scholar] [CrossRef]
- Li, G.; Qin, X.; Liu, H.; Jiang, K.; Wang, A. Modulation Recognition of Digital Signal Using Graph Feature and Improved K-Means. Electronics 2022, 11, 3298. [Google Scholar] [CrossRef]
- Ping, J.; Ying, Z.; Hao, N.; Miao, P.; Ye, C.; Liu, C.; Li, W. Rapid and non-destructive identification of Panax ginseng origins using hyperspectral imaging, visible light imaging, and X-ray imaging combined with multi-source data fusion strategies. Food Res. Int. 2024, 192, 114758. [Google Scholar] [CrossRef]
- Cao, L.; Liu, Y.; Wang, D.; Wang, T.; Fu, C. A Novel Density Peak Fuzzy Clustering Algorithm for Moving Vehicles Using Traffic Radar. Electronics 2020, 9, 46. [Google Scholar] [CrossRef]
- Liu, R.; Wang, H.; Yu, X. Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 2018, 450, 200–226. [Google Scholar] [CrossRef]
- Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD 96, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
- Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60. [Google Scholar]
- Hinneburg, A.; Gabriel, H.H. Denclue 2.0: Fast clustering based on kernel density estimation. In Proceedings of the 7th International Symposium on Intelligent Data Analysis, Ljubljana, Slovenia, 6–8 September 2007; pp. 70–80. [Google Scholar]
- Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Disc. 1998, 2, 169–194. [Google Scholar] [CrossRef]
- Wang, Y.; Qian, J.; Hassan, M.; Zhang, X.; Zhang, T.; Yang, C.; Zhou, X.; Jia, F. Density peak clustering algorithms: A review on the decade 2014–2023. Expert Syst. Appl. 2023, 238, 121860. [Google Scholar] [CrossRef]
- Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef]
- Guan, J.; Li, S.; He, X.; Zhu, J.; Chen, J. Fast hierarchical clustering of local density peaks via an association degree transfer method. Neurocomputing 2021, 455, 401–418. [Google Scholar] [CrossRef]
- Zhu, Q.; Tang, X.; Elahi, A. Application of the novel harmony search optimization algorithm for dbscan clustering. Expert. Syst. Appl. 2021, 178, 115054. [Google Scholar] [CrossRef]
- Long, Z.; Gao, Y.; Meng, H.; Yao, Y.; Li, T. Clustering based on local density peaks and graph cut. Inf. Sci. 2022, 600, 263–286. [Google Scholar] [CrossRef]
- Lotfi, A.; Moradi, P.; Beigy, H. Density peaks clustering based on density backbone and fuzzy neighborhood. Pattern Recognit. 2020, 107, 107449. [Google Scholar] [CrossRef]
- Zhu, Y.; Ting, K.M.; Jin, Y.; Angelova, M. Hierarchical clustering that takes advantage of both density-peak and density-connectivity. Inform. Syst. 2022, 103, 101871. [Google Scholar] [CrossRef]
- Li, R.; Yang, X.; Qin, X.; Zhu, W. Local gap density for clustering high-dimensional data with varying densities. Knowl. Based Syst. 2019, 184, 104905. [Google Scholar] [CrossRef]
- Guan, J.; Li, S.; He, X.; Chen, J. Clustering by fast detection of main density peaks within a peak digraph. Inf. Sci. 2023, 628, 504–521. [Google Scholar] [CrossRef]
- Zou, Y.; Wang, Z. ConDPC: Data connectivity-based density peak clustering. Appl. Sci. 2022, 12, 12812. [Google Scholar] [CrossRef]
- Hou, J.; Lin, H.; Yuan, H.; Pelillo, M. Flexible Density Peak Clustering for Real-World Data. Pattern Recognit. 2024, 156, 110772. [Google Scholar] [CrossRef]
- Zhu, Y.; Ting, K.M.; Carman, M.J. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognit. 2016, 60, 983–997. [Google Scholar] [CrossRef]
- Zou, Y.; Wang, Z.; Xu, P.; Lv, T. An Improved Density Peaks Clustering Algorithm Based on Density Ratio. Comput. J. 2024, 67, 2515–2528. [Google Scholar] [CrossRef]
- Maggioni, M.; Murphy, J.M. Learning by Unsupervised Nonlinear Diffusion. J. Mach. Learn. Res. 2019, 20, 1–56. [Google Scholar]
- Murphy, J.M.; Polk, S.L. A Multiscale Environment for Learning by Diffusion. Appl. Comput. Harmon. Anal. 2022, 57, 58–100. [Google Scholar] [CrossRef]
- Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. In Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan, 5–8 April 2005; pp. 341–352. [Google Scholar]
- Jain, A.K.; Law, M.H.C. Data clustering: A user’s dilemma. In Proceedings of the Pattern Recognition and Machine Intelligence, Kolkata, India, 20–22 December 2005; pp. 1–10. [Google Scholar]
- Piantoni, J.; Faceli, K.; Sakata, T.C.; Pereira, J.C.; de Souto, M.C.P. Impact of Base Partitions on Multi-objective and Traditional Ensemble Clustering Algorithms. In Proceedings of the ICONIP2015, Istanbul, Turkey, 3–8 November 2015. [Google Scholar]
- Du, M.; Ding, S.; Xue, Y.; Shi, Z. A novel density peaks clustering with sensitivity of local density and density-adaptive metric. Knowl. Inf. Syst. 2019, 59, 285–309. [Google Scholar] [CrossRef]
- Zelnik-Manor, L.; Perona, P. Self-tuning spectral clustering. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 1 December 2004; pp. 1601–1608. [Google Scholar]
- UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ (accessed on 20 October 2024).
- Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
- Rijsbergen, V. Foundation of Evaluation. J. Doc. 1974, 30, 365–373. [Google Scholar] [CrossRef]
- Meilà, M. Comparing clusterings—An information based distance. J. Multivariate Anal. 2007, 98, 873–895. [Google Scholar] [CrossRef]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).