A Clustering Algorithm Based on Local Relative Density

Zou, Yujuan; Wang, Zhijian; Wang, Xiangchen; Lv, Taizhi

doi:10.3390/electronics14030481

Open AccessArticle

A Clustering Algorithm Based on Local Relative Density

¹

College of Information Engineering, Jiangsu Maritime Institute, Nanjing 211199, China

²

College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China

³

Department of Computer Science, Southern University of Science and Technology, Shenzhen 518055, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(3), 481; https://doi.org/10.3390/electronics14030481

Submission received: 1 November 2024 / Revised: 9 January 2025 / Accepted: 20 January 2025 / Published: 24 January 2025

(This article belongs to the Topic New Applications of Big Data Technology: Integration of Data Mining and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

DBSCAN and DPC are typical density-based clustering algorithms. These two algorithms have their drawbacks, such as difficulty in clustering when there are significant differences in density between clusters. This study proposes a clustering algorithm, RDBSCAN, which is based on local relative density, drawing on the extension strategy of DBSCAN and the allocation mechanism of DPC. The algorithm first uses k-nearest neighbors to calculate the original local density, then sorts the points in descending order of this density. It then selects the point with the highest original local density from the unprocessed points as the local center of the next cluster. Based on this local center, RDBSCAN calculates the local relative density, determines the core objects, and performs cluster expansion. Drawing on the allocation mechanism of DPC, the algorithm performs a secondary allocation for points in clusters that are too small to complete the final clustering. Comparative experiments using RDBSCAN and eight other clustering algorithms were conducted, and the test results show that RDBSCAN ranks first in clustering performance metrics among all algorithms on synthetic datasets and second on real-world datasets.

Keywords:

clustering; local relative density; density peak; core point

1. Introduction

Clustering algorithms divide objects into clusters based on the similarity between them, ensuring high similarity among objects within the same cluster and lower similarity between objects in different clusters [1,2]. Researchers have applied these algorithms for applications such as image clustering, modulation recognition, and vehicle detection [3,4,5,6,7].

As an important branch of clustering algorithms, density clustering algorithms are methods that group data based on the density distribution of data points [8,9,10,11,12,13]. DBSCAN [8] and DPC [13] are two typical density-based clustering algorithms. DBSCAN does not require the pre-specification of the number of clusters and identifies core objects by filtering the number of neighbors contained within a given radius; then, it expands the cluster by continuously extending the reachable range of the core objects. DBSCAN can be used for clustering data of any shape; however, when there are significant differences in density within the dataset, it struggles to distinguish different clusters using the same radius. OPTICS [9] is an improved algorithm for DBSCAN that reduces the dependence on the parameter radius. By sorting the objects in the dataset, an ordered object list is obtained, which can assist in cluster detection using different parameters. However, the OPTICS algorithm cannot explicitly generate clustering results, and using the above object list to extract clusters is dependent on the user’s judgment.

DPC is a clustering algorithm based on density peaks. It constructs a decision graph by analyzing the density of each point and the distance to nearby points. Based on this decision graph, it identifies cluster centers and guides the clustering process using these center points. However, when the data distribution is uneven, this clustering center selection mechanism is prone to errors (as discussed in Section 2). Some variant algorithms of DPC streamline this process by directly using the number of clusters as an input parameter to determine the cluster centers, rather than relying on manual selection [14,15,16,17,18,19,20].

Many researchers have proposed improved algorithms based on DBSCAN and DPC. Zhu et al. proposed the K-DBSCAN algorithm to solve the problem related to the difficulty of setting appropriate parameters in DBSCAN [15]. Guan et al. constructed sub-clusters using k-nearest neighbors (knn) and then merged these sub-clusters based on the target number of clusters [14]. Long et al. used

knn

to determine the similarity between family trees and tree branches, then obtained the final clustering results using graph segmentation ideas [16]. Abdulrahman et al. re-defined density using a fuzzy neighborhood and further constructed backbone and boundary points [17]. Some algorithms have drawn on the strengths of both DBSCAN and DPC. DCHDP is a density-connected hierarchical clustering algorithm based on these two algorithms, which improves the accuracy of identifying clusters with arbitrary shapes and different densities [18]. LGD utilizes local differential density to identify core points and boundary points and adopts the allocation mechanism from DPC to assign the remaining points to the initial clusters [19]. ConDPC draws on the idea of DBSCAN to construct connected groups, then modifies the allocation method to distinguish clusters in manifold structures [21]. All of the above algorithms, however, require the number of clusters to be pre-defined. DBSCAN-DPC extracts initial clusters based on the DBSCAN algorithm and performs cluster expansion based on DPC [22]. This algorithm is better at identifying cluster centers, compared to DPC; however, it performs poorly in clustering under manifold structures.

This study draws on the extension of DBSCAN and the label allocation mechanism of DPC, introduces local relative density, and proposes a clustering algorithm that automatically identifies cluster centers and the number of clusters in order to solve the problem of poor clustering performance caused by the manual selection of cluster centers or preset cluster numbers, as well as the large differences between clusters in DPC and some of its variants.

The contributions of this study are as follows:

A method for automatically identifying cluster centers and the number of clusters is proposed. The algorithm selects the point with the highest original density from all points as the center of the first cluster. After completing the cluster expansion, the point with the highest original density from the remaining points is selected as the center of the second cluster. This process continues, and the algorithm does not require pre-specification of the number of clusters, allowing it to automatically identify all cluster centers.
A local relative density calculation method is proposed and applied for cluster expansion. Compared to the original density calculation methods in DBSCAN and DPC, the local relative density can more accurately reflect the density level of a point within its cluster. Based on this density, the algorithm identifies core points and performs cluster expansion around the core points and their k-nearest neighbors (knn). The use of local relative density is more effective for clustering in datasets with uneven density.
A secondary allocation mechanism for micro-clusters is proposed. The algorithm identifies small clusters and reallocates the points within these micro-clusters to the appropriate clusters, thereby improving the accuracy of clustering.

We review the relevant algorithms in Section 2. Section 3 presents the implementation logic of RDBSCAN and demonstrates how the algorithm works. In Section 4, comparative experiments are conducted. Finally, a Discussion and Conclusion are provided.

2. Related Works

This Section introduces the implementation logic of DBSCAN and DPC.

2.1. DBSCAN

In DBSCAN, for each point

i

in a dataset, its neighbors

N_{i}

are defined as follows:

N_{i} = {x_{j} \in X | d i s (i, j) \leq e p s}

(1)

where

e p s

is a given radius. If

|N_{i}| \geq minpts

, then point

i

is considered a core object.

In DBSCAN, the reachability between points is divided into two categories: direct density reachability and density reachability. For core points, all neighboring points are directly density-reachable from the core point. For any two points

i

and

j

, if there exists a chain

p_{1} (i), p_{2}, \dots, p_{n} (j)

, such that each

p_{k + 1}

is directly density-reachable from

p_{k}

, then point

j

is considered density-reachable from point

i

.

Based on the concept of density reachability, the algorithm selects any core point

i

and its neighbor

N_{i}

to create a new cluster C. Then, the unassigned density-reachable points of any point

i

within C are successively added to C until no further expansion is possible. The process is repeated until all core points have been processed.

From the above information, it can be concluded that the algorithm uses fixed

e p s - m i n P t s

combinations to perform global clustering. If there is a large difference in density between clusters, there may be a problem of not being able to select a suitable

e p s - minpts

combination.

2.2. DPC

The DPC algorithm assumes that cluster centers have higher densities compared to their surrounding points and that the cluster centers are far apart from each other. In this algorithm, any point i has two important attributes: its density and the distance to the nearest point j with a higher density than itself. For simplicity in this paper, point j is referred to as the “parent neighbor” of point i, and the distance between i and j is called the “neighbor distance” of point i.

In DPC, the density and neighbor distance of point i are defined as shown in Equations (2) and (3) [13]:

ρ_{i} = \sum_{i \neq j} \exp (- {(\frac{d i s (i, j)}{d_{c}})}^{2})

(2)

δ_{i} = \underset{ρ_{i} < ρ_{j}}{\min (d i s (i, j))}

(3)

Based on the density and neighbor distance of each point, DPC can construct a decision graph where the x-axis represents density values, and the y-axis represents neighbor distance values. This decision graph helps identify points with both high density and large neighbor distance, as these points are most likely to become cluster centers.

Using dataset 2sp2glob as an example, the DPC algorithm can plot a decision graph based on the density and neighbor distance of each point, as shown in Figure 1a.

From Figure 1a, it is evident that four points have significantly higher density and neighbor distance compared to others. Using this decision graph, the algorithm selects these four points as cluster centers. Subsequently, the remaining points inherit the labels of their parent neighbors in descending order of density, completing the clustering process.

However, when there is a significant density difference between clusters, this cluster center selection mechanism is prone to errors. Taking the Jain dataset as an example, the optimal result of DPC on this dataset is shown in Figure 2. Figure 2a displays the corresponding decision graph, where two points in the upper-right corner clearly have an advantage in terms of density and neighbor distance. However, as shown in Figure 2b, both of these points belong to the high-density bottom cluster, while the sparse upper cluster fails to select any density peak as its center.

2.3. Relative Density

Relative density is used to measure the density level of a data point within its local region, which helps in identifying cluster centers and expanding cluster boundaries. It is especially useful for handling data with uneven density.

The density ratio is a method for calculating relative density. ReconDBSCAN is a density-ratio-based improved version of the DBSCAN algorithm [23]. This algorithm uses the ratio of density estimates in two differently sized neighborhoods of the same point as the new density for that point. It then uses this density and a ratio threshold to identify the core points and expand clusters. Dratio-DPC is a density-ratio-based improvement of DPC [24]. This algorithm does not directly use the density ratio as a density estimate but instead uses the density ratio as a parameter to adjust the original density. These two algorithms outperform DBSCAN and DPC in clustering datasets with uneven density. However, both algorithms introduce an additional parameter for the density ratio—namely, ReconDBSCAN, which adds a neighborhood parameter, while Dratio-DPC adds a nearest neighbor count parameter—which increases the difficulty of parameter selection.

LGD is a clustering algorithm based on local differential density [19]. The algorithm uses the ratio of the density of point i to the maximum density of its neighbors as the new density. This algorithm significantly improves the clustering accuracy, compared to the DPC algorithm. However, it also requires the number of clusters to be pre-defined as a parameter.

3. Proposed Algorithm

Inspired by DBSCAN and DPC, RDBSCAN proposes the concept of local relative density and a new method for determining core points, using knn for expansion. During the secondary label assignment process, RDBSCAN draws on the label assignment approach of DPC, making adjustments and improvements.

To better identify the core points in data with uneven density, this study introduces the concept of relative density and uses the ratio of local density to local center as the local relative density.

As described in Section 2.2, DPC does not account for uneven density between clusters when comparing density values during label assignment. This assignment mechanism can result in a point being labeled with the cluster identifier of a closer point from another cluster. To address this issue, this study considers the density differences within the k-nearest neighbor neighborhood when searching for parent neighbors.

The related concepts are defined in Section 3.1.

3.1. Definitions

Definition 1.

Original local density.

This study defines a local-density-based kernel density estimate (KDE). For point i, the closer its k-nearest neighbors are, the higher its local density becomes. The definition is as follows:

ρ_{i} = \exp (- \sum_{x_{j} \in K N N_{i}} \frac{d i s {(i, j)}^{2}}{σ_{0}^{2}})

(4)

where

K N N_{i}

represents the set of k-nearest neighbors of point i, and

σ_{0}

is a KDE bandwidth.

Definition 2.

Processing Status Indicator.

Let

x_{i}

be a point in the dataset. We define the processing-status indicator.

I (i) = \{\begin{cases} 1, i f point x_{i} has been processed \\ 0, i f point x_{i} has not been processed \end{cases}

(5)

Definition 3.

Local center.

Let j denote the index of the current cluster. If no clusters have been created, then j = 0.

RDBSCAN selects the point with the highest density from the unprocessed points as the local center of the next cluster. Then, the local center of the next cluster can be defined as follows:

l o c a l C e n t e r_{(j + 1)} = a r g m a x (ρ_{i} ∣ x_{i} \in X a n d I (i) = 0)

(6)

Definition 4.

Local relative density.

This article uses the ratio of the original local density of a point

p_{i}

to that of the corresponding local center as the local relative density, defined as follows:

r d e n s i t y_{i} = ρ_{i} / ρ_{(l o c a l C e n t e r)}

(7)

r d e n s i t y_{i}

reflects the density of a point relative to the cluster center. If it is close to a local center, the relative density is larger; if it is far from the cluster center and at the boundary, the relative density is smaller.

Definition 5.

Core point.

A point with a local relative density larger than the given threshold is called a core point, which is defined as follows:

c o r e p = {x_{i} \in X | r d e n s i t y_{i} > threshold}

(8)

Definition 6.

Micro-cluster.

We call a cluster a micro-cluster if the number of samples in the cluster is less than the minimum cluster size (denoted as

minsize

here), as described in Formula (9):

microC = {c_{i} \in C | | c_{i} | < minsize}

(9)

minsize

is set to 3 in this paper. For micro-clusters, a secondary assignment will be performed.

Definition 7.

Neighborhood density difference.

Since the number of points within the neighborhood is fixed (all are k), this study uses the ratio of the average distances within the neighborhood to reflect the density differences. Therefore, for

x_{i}

and

x_{j}

, their corresponding neighborhood density differences can be defined as follows:

d e n D i f f (i, j) = \max (m e a n \underset{x_{p} \in K N N_{i}}{(d i s (i, p))} / m e a n \underset{x_{p} \in K N N_{j}}{(d i s (j, p))}, m e a n \underset{x_{p} \in K N N_{j}}{(d i s (j, p))} / m e a n \underset{x_{p} \in K N N_{i}}{(d i s (i, p))})

(10)

Definition 8.

Balanced neighbor distance.

By using the neighborhood density difference, the neighbor distances can be adjusted, as shown in the following formula:

δ_{i} = \underset{ρ_{i} < ρ_{j}}{\min (d e n D i f f (i, j) * d i s (i, j))}

(11)

From Formula (10), it can be seen that

d e n D i f f (i, j)

reflects the density difference between the

k n n

sets of points

i

and

j

. The smaller the density difference, the closer

d e n D i f f (i, j)

approaches 1, and the adjusted distance tends to the original

δ

. The larger the density difference, the larger

d e n D i f f (i, j)

becomes, making the adjustment more significant, with the adjusted distance greater than the original

δ

.

3.2. RDBSCAN

The algorithm performs clustering processing from high to low based on local density. The algorithm selects the point with the highest local density as the local center and expands it according to

k n n

until there are no more points to expand, forming a cluster. Then, select the point with the highest local density from the remaining points for expansion, forming a cluster. Repeat this process until all points are visited. For points in micro-clusters, the algorithm performs secondary processing and allocation.

The main structure of the algorithm is shown in Algorithm 1. In this algorithm, Algorithm 2 is called, which is primarily used for cluster expansion.

Algorithm 1: Clustering based on local relative density (RDBSCAN).

Input: dataset X; number of nearest neighbors

k

; threshold of local relative density

τ

Output: clustering result

normalize the dataset $X$ ;
construct the k-nearest neighbor distance matrix using knnsearch function;
calculate the original density $ρ$ ;
sort the points in descending order of $ρ$ to set $X^{'}$ ;
mark all points as unprocessed;
for point $p$ in $X^{'}$
if $p$ is not processed
create a new cluster C;
mark $p$ as processed and the local center of C;
Expand( $p$ , $p$ ’s k-nearest neighbors, C, $τ$ );
end if
end for
sort the points in micro-clusters in descending order of $ρ$ to set S
for point $p$ in S
find $p$ ’s parent neighbor $j$ according to Formula (11);
assign $j$ ’s cluster label to $p$ .
end for
return result

Algorithm 2: Expand.

Input:

p

;

p

’s k-nearest neighbors denoted as pneighbors; C;

τ

Output: clustering result of C

mark $p$ ’s cluster label as C;
for point $i$ in pneighbors
if $i$ is unprocessed
mark $i$ as processed and $i$ ’s cluster label as C;
$r d e n s i t y_{i} = ρ_{i} / ρ_{p}$ ;
if $r d e n s i t y_{i}$ > $τ$
add the $i$ ’s k-nearest neighbors to pneighbors;
end if
end if
end for
return clustering result of C

We have provided the flowchart corresponding to the above algorithm, as shown in Figure 3.

3.3. Computational Complexity Analysis

In this Section, we assume that the number of samples is n, the dimensionality of the samples is d, and the number of nearest neighbors is k. We will discuss the time complexity of the RDBSCAN algorithm (Algorithm 1).

The time complexity of normalizing the dataset in line 1 is

O (d n)

.

The time complexity of computing the k-nearest neighbor distance matrix in line 2 is

O (nlogn)

.

The time complexity of calculating the original density for each point using knn in line 3 is

O (k n)

.

The time complexity of sorting the original density in line 4 is

O (nlogn)

.

The time complexity of initializing the processing status in line 5 is

O (n)

.

Lines 6–12: For each point, the algorithm checks whether it is a core point. If it is, the k-nearest neighbors of that point need to be queried. The time complexity associated with this process is

O (k n)

.

The time complexity of sorting the points in the small clusters in line 13 is

{O (n}_{1} {logn}_{1})

, where

n_{1}

is the number of points in the small cluster (

n_{1} ≪ n

).

Lines 14–17: In the for-loop, for each point in the small cluster, we find the nearest point whose adjusted distance is the smallest to serve as its parent. The worst-case time complexity for this step is

{O (n}_{1} n)

.

In summary, the time complexity of this algorithm is

O (nlogn)

. We will further discuss the time complexity of the algorithm in Appendix A. The runtime of all algorithms on the test datasets is compared in Section 4.5.

3.4. Execution Demonstration

This Section uses the Jain dataset to explain how the proposed algorithm RDBSCAN works.

First, we find the point with the highest original density among all points, which is used as the first local center (red star in Figure 4), denoted by

p 1

here. As

p 1

has not been processed, create a new cluster

C 1

, add

p 1

to

C 1

, and mark it as processed. Add the unprocessed points from the six nearest neighbors of

p 1

(red dots in Figure 4) to the set

p 1 n e i g h b o r s

and to

C 1

, and mark them as processed. If the local relative density of point

j

in

p 1 n e i g h b o r s

is larger than

τ

(0.7), then

j

is the core point, and the unprocessed points in

j

’s six nearest neighbors are also added to

p 1 n e i g h b o r s

and

C 1

. Repeat this process until cluster

C 1

can no longer be expanded (thus forming the red cluster at the bottom in Figure 4).

Then, all the points in the lower arc-shaped cluster have been marked as processed, and the point with the highest original local density is selected among the remaining points as the second local center (the blue star in Figure 4), denoted by

p 2

here. As

p 2

has not been processed, create a new cluster

C 2

, add

p 2

to

C 2

, and mark it as processed. Similar to the first step, expansion is performed based on

knn

until cluster

C 2

can no longer be extended (forming the upper blue cluster in Figure 4).

4. Results

4.1. Experimental Settings

4.1.1. Comparison Algorithms and Datasets

The proposed algorithm was inspired by DBSCAN [8] and DPC [13] and introduces improvements based on these algorithms. Therefore, this study takes DBSCAN and DPC, as well as the improved DCHDP [18], LGD [19], ConDPC [21], and DBSCAN-DPC [22] algorithms based on these two algorithms, for comparison. The density-based clustering algorithm OPTICS [9] was also chosen for comparison. Meanwhile, a clustering algorithm based on graph-based diffusion geometry, LUND [25], was also compared with the aforementioned clustering algorithms. This algorithm addresses the shortcomings of the DPC algorithm, demonstrating superior effectiveness over DPC.

Among the nine algorithms, all except OPTICS were executed in MATLAB 2017B. The OPTICS algorithm does not directly provide cluster extraction results. The OPTICS implementation from the scikit-learn library was used, which utilizes the xi parameter to identify cluster boundaries and thus perform cluster extraction, and the algorithm was run in Python 3.8. We conducted the experiments on a computer equipped with an Intel Core Ultra5 Processor 125H 1.20 GHz CPU and 32.0 GB of RAM.

The range of values for each algorithm’s parameters is shown in Table 1. Among them,

k

in the RatioDBSCAN, LGD, DBSCANDPC, and LUND algorithms represents the number of nearest neighbors;

p

in both DPC and ConDPC algorithms represents the percentage of all samples;

τ

in RatioDBSCAN represents the local relative density threshold; and

τ

in LGD represents the local difference density threshold. For LUND, σ is the diffusion scale parameter, and

σ_{0}

is the KDE bandwidth. In addition to the three parameters listed in Table 1, LUND has an additional two parameters: β, which represents the sampling rate (with a default value of 2), and τ, which represents the diffusion stationarity threshold (with a default value of 10⁻⁵). The T in Table 1 is calculated based on these two parameters, and the specific formula can be found in the original paper [26].

The best results of each algorithm on the corresponding parameter combinations were selected for comparison.

The experiments were conducted on sixteen datasets, detailed information of which is provided in Table 2. The first eight datasets were synthetic datasets, while the last eight were real-world datasets. The datasets were normalized before clustering. To improve the running efficiency, we preprocessed the Olivetti faces dataset using the settings from the paper [7], which first scaled the original image size to 15 × 15, then applied PCA to select features with a cumulative contribution rate greater than 90%, and finally applied the processed data to clustering algorithms.

4.1.2. Clustering Metrics

Clustering metrics are tools used to evaluate the performance of clustering algorithms, which help to measure the quality of clustering results. The results were evaluated using commonly used metrics, including ARI [34], F-measure [35], and VI [36].

The Adjusted Rand Index (ARI) is a commonly used clustering evaluation metric that measures the consistency or similarity between two clustering results. It is an improved version of the Rand Index (RI).

The formula for ARI is shown in Equation (12):

A R I = \frac{R I - E [R I]}{\max (R I) - E [R I]}

(12)

The F-measure is the harmonic mean of precision and recall. It combines these two metrics to evaluate the performance of a classification or clustering model.

Assuming there are m clusters, the overall F-measure is calculated as the average of the F-measures across all clusters [18]:

F - m e a s u r e = \frac{1}{m} \sum_{i = 1}^{m} \frac{2 P_{i} R_{i}}{P_{i} + R_{i}}

(13)

VI (Variation of Information) combines the concepts of mutual information (MI) and Conditional Entropy to assess the similarity between clustering results and true labels. The corresponding formula is defined as shown in Formula (14):

VI (C, T) = H (C) + H (T) - 2 I (C, T)

(14)

where H(C) is the entropy of the result set C, H(T) is the entropy of the result set T, and I(C, T) is the mutual information between the clustering result C and the true labels T.

4.2. Results on Synthetic Datasets

4.2.1. Visualization Results

This Section presents the clustering results of three representative synthetic datasets for visualization. The clustering results of all algorithms on the three datasets are shown in Figure 5, Figure 6 and Figure 7. The RDBSCAN, DPC, and ConDPC algorithms explicitly use cluster centers and perform cluster expansion based on these centers. In their visualization results, the cluster centers are represented by star shapes. In the visualization results of each algorithm, different colors correspond to different clusters.

Bottleneck is a linear, multimodal dataset. Its left and right elongated bar-shaped clusters have similar density distributions, with each cluster containing two high-density regions connected by a central low-density bottleneck region. The DPC algorithm selected one cluster center in each of the two high-density regions of the right-side cluster, leading to poor clustering performance. The DBSCAN-DPC algorithm, on the other hand, recognizes the same cluster as three distinct clusters. The other algorithms performed well on the Bottleneck dataset.

The Cloud dataset is a mixture of two Gaussian distributions, in which the two clusters have overlapping points. The DBSCAN and OPTICS algorithms identified some points in the upper cluster as noise points, while other algorithms were able to better identify the main parts of the two clusters. For the overlapping region of the two clusters, the assignments made by RDBSCAN and LGD were more reasonable.

The CircleSquare dataset consists of a circular cluster with two square clusters inside. The density of the two square clusters is significantly higher than that of the circular cluster. Due to the significant difference in density between the square and circular clusters, DBSCAN was unable to find a suitable combination of

e p s - minpts

to distinguish all clusters. DPC identified three density peaks within the two denser square clusters, but no points were selected as density peaks in the circular cluster due to its lower density, resulting in poor clustering performance. Among all of the algorithms, only the proposed RDBSCAN algorithm could perfectly identify three clusters. This algorithm found the point with the highest original density (marked as a red star) and completed the upper red square cluster by means of

knn

-based expansion. Then, the clustering of the green square cluster and the outer blue ring was completed sequentially.

4.2.2. Results of Evaluation Metrics

The clustering performance metrics for all eight synthetic datasets are shown in Table 3. In the experimental results on the eight datasets, RDBSCAN achieved the best performance on all of them. It can be seen from Figure 5, Figure 6 and Figure 7 and Table 3 that, compared to DPC and DBSCAN, RDBSCAN has higher clustering accuracy and yields better visual results. Compared to the LGD, ConDPC, and DBSCANDPC algorithms, RDBSCAN performed better overall, especially on the Halfkernel, 2sp2glob, and CircleSquare datasets.

4.3. Results on Real-World Datasets

Eight real-world datasets were selected for comparative experiments. Information on the attributes, cluster sizes, and other details of these datasets is provided in Table 2, and the experimental results are shown in Table 4.

In this paper, statistical tests were conducted on the performance metrics of the nine algorithms. Taking the F-measure as an example, the test results of each algorithm on eight real-world datasets were considered as a set of result data. A Shapiro–Wilk test was applied to each of the nine sets of result data, and all were found to follow a normal distribution. Subsequently, a one-way ANOVA was performed on the nine sets of results, yielding a p-value of 0.3693, which is greater than 0.05. Based on this, it can be concluded that there is no significant difference among the nine sets of result data. Thus, the performance of the nine algorithms does not exhibit significant differences.

To better observe the experimental results for the real-world datasets, Figure 8 presents bar charts corresponding to the metric values.

From Table 4 and Figure 8, it can be seen that RDBSCAN presented a significant improvement in accuracy over DBSCAN and DPC. It is important to note that a lower VI score indicates better clustering performance. From the results of the three metrics, it can be observed that LGD achieves the best performance, followed by RDBSCAN.

4.4. Results on Different Datasets with Different k-τ Values

Figure 9 shows how the ARI results on the different datasets changed as the

k - τ

values were varied. For the purpose of testing, the KDE bandwidth was fixed at a value of 1.

From this Figure, it can be seen that on the Ecoli and Seeds datasets, setting the number of neighbors to below 20 results in relatively poor average performance. On the other six datasets, the ARI values change more smoothly with variations in the value.

In practical experiments, an initial range for k between 10 and 20, and for τ between 0.7 and 0.9, can be set to identify optimal solutions. In datasets such as Libras and Wine, local optima were observed when the number of nearest neighbors (k) was set to 40. For real-world datasets, expanding the search range for k-nearest neighbors may lead to better solutions.

4.5. Runtime Comparison

The time complexity of RDBSCAN is

O (nlogn)

, as analyzed in Section 3.3. DBSCAN, DPC, and the other six comparison algorithms in Table 1 all have a time complexity of

O (n^{2})

. The time taken for the nine algorithms to achieve the optimal clustering results on 16 datasets is shown in Table 5. To ensure the tests are run in the same environment, the OPTICS algorithm was called and executed in MATLAB 2017b.

From the average results, the RDBSCAN algorithm has the least execution time among the nine algorithms. When the dataset has a lower dimensionality, such as in the 2sp2glob and bottleneck datasets, RDBSCAN demonstrates significant performance advantages. However, when the dataset has a higher dimensionality, such as in the Libras and Olivetti faces datasets, RDBSCAN’s execution time is comparable to that of algorithms like DPC and ConDPC. On the Olivetti faces dataset, RDBSCAN takes more time than DBSCAN-DPC.

5. Discussion

DPC calculates

ρ

and

δ

to select all density peaks. As analyzed in Section 2.2, this approach can lead to multiple density peaks being selected in high-density clusters, while potentially overlooking the density peaks in low-density clusters. The selection of density peaks in the proposed algorithm is dynamic. It processes points in descending order of local density, selecting the highest-density unprocessed point as the density peak. Once high-density areas are handled, the corresponding density peaks in low-density areas are also selected. The clustering results on the Jain and CircleSquare datasets (shown in Figure 4 and Figure 7) demonstrate the effectiveness of RDBSCAN in selecting density peaks in datasets with uneven density distributions.

If the number of samples in clusters formed in low-density regions is too small, we do not consider them to be true clusters. Points in these micro-clusters are reallocated through a secondary processing step. The minimum cluster size can also be set as an input parameter to adjust the clustering accuracy of the algorithm. For example, when the minimum cluster size was set to 6, the ARI and F-measure values for the wine dataset improved to 0.780 and 0.910, respectively.

6. Conclusions

This study proposed a clustering algorithm, RDBSCAN, based on local relative density. The algorithm utilizes a dynamic method to obtain density peaks, which does not require prior knowledge of the number of clusters. Based on the density peaks, the algorithm calculates the local relative density of neighboring points and identifies core objects according to a given relative density threshold, followed by cluster expansion. For points in micro-clusters, their parent points are identified based on the adjusted

δ

, and they inherit the cluster labels from these parent points.

Experimental results showed that the proposed algorithm demonstrates a significant improvement in accuracy, when compared to the original DPC and DBSCAN. In comparison to the original algorithms and other improved algorithms, the proposed algorithm achieved the best performance in terms of metrics on the synthetic datasets and second-best on the real-world datasets. At the same time, the algorithm does not require the pre-specification of the number of clusters, which provides an advantage over improved algorithms such as LGD, ConDPC, and DCHDP.

The experiment compared the execution times of nine algorithms across all datasets. The results show that RDBSCAN has the lowest average execution time and the best performance on these test datasets. However, when the dimensionality of the datasets is higher, RDBSCAN does not demonstrate a significant performance advantage over the other algorithms.

AIS (Automatic Identification System) ship trajectory clustering analyzes ship trajectories using clustering techniques, which can be applied to port management, hazard prevention, and other areas. Due to varying conditions, such as waterway class and depth, the density of ship trajectories in different waterways varies. The RDBSCAN algorithm, which is good at handling data with uneven densities and does not require pre-setting of the number of clusters, can provide technical support for such clustering analyses. However, in practical applications, AIS data volumes are typically large, and the efficiency of the algorithm’s k-nearest neighbor search, among other factors, needs to be optimized.

Future work will explore the application of CoverTree for nearest neighbor search and storage to optimize the performance of the algorithm, as well as its further application to AIS data analysis.

Author Contributions

Conceptualization, Y.Z. and Z.W.; methodology, Y.Z. and Z.W.; software, Y.Z.; validation, Y.Z., X.W. and T.L.; formal analysis, Y.Z. and X.W.; resources, Y.Z. and Z.W.; data curation, X.W.; writing—original draft preparation, Y.Z. and X.W.; writing—review and editing, Y.Z., X.W. and T.L.; visualization, T.L.; supervision, Z.W.; funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 23KJA580002), the Excellent Teaching Team of the 2022 QingLan Project of the Jiangsu Higher Education Institutions of China (Big Data Technology Teaching Team with Shipping Characteristic), and the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 22KJB580002).

Data Availability Statement

The data will be shared on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To analyze the time complexity of the proposed algorithm, we conducted comparative experiments using bottleneck datasets of different sizes. The datasets were generated using the code from paper [26], with a similar structure. The sizes of the datasets are 500, 1000, 1500, 2000, 3000, 4000, 5000, and 6000. Experiments on these datasets were conducted using the same parameter combination (k = 10,

τ

= 0.5,

σ_{0}

= 1), and all experiments achieved perfect clustering results (ARI = 1, F-measure = 1, VI = 0).

Based on the results in Table A1, the total runtime was fitted to

O (nlogn)

, as shown in Figure A1.

Table A1. RDBSCAN Running Times (in Seconds) on Bottleneck Datasets of Different Sizes.

Dataset	Bottleneck 500	Bottleneck 1000	Bottleneck 1500	Bottleneck 2000	Bottleneck 3000	Bottleneck 4000	Bottleneck 5000	Bottleneck 6000
Total time	0.00394	0.00571	0.00753	0.00914	0.01142	0.01449	0.01918	0.02224

The number in the dataset name represents the size of the dataset.

Figure A1. Time complexity analysis: (a) execution time vs. dataset size (nlogn) fitting; (b) execution time vs. nlogn (standardized) with linear fit.

As illustrated in Figure A1, the total runtime aligns well with the

O (nlogn)

complexity.

References

Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Waltham, MA, USA, 2011; pp. 443–450. [Google Scholar]
Xie, Y.; Lin, B.; Qu, Y.; Li, C.; Zhang, W.; Ma, L.; Wen, Y.; Tao, D. Joint deep multi-view learning for image clustering. IEEE Trans. Knowl. Data Eng. 2020, 33, 3594–3606. [Google Scholar] [CrossRef]
Li, G.; Qin, X.; Liu, H.; Jiang, K.; Wang, A. Modulation Recognition of Digital Signal Using Graph Feature and Improved K-Means. Electronics 2022, 11, 3298. [Google Scholar] [CrossRef]
Ping, J.; Ying, Z.; Hao, N.; Miao, P.; Ye, C.; Liu, C.; Li, W. Rapid and non-destructive identification of Panax ginseng origins using hyperspectral imaging, visible light imaging, and X-ray imaging combined with multi-source data fusion strategies. Food Res. Int. 2024, 192, 114758. [Google Scholar] [CrossRef]
Cao, L.; Liu, Y.; Wang, D.; Wang, T.; Fu, C. A Novel Density Peak Fuzzy Clustering Algorithm for Moving Vehicles Using Traffic Radar. Electronics 2020, 9, 46. [Google Scholar] [CrossRef]
Liu, R.; Wang, H.; Yu, X. Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 2018, 450, 200–226. [Google Scholar] [CrossRef]
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD 96, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA, 1–3 June 1999; pp. 49–60. [Google Scholar]
Hinneburg, A.; Gabriel, H.H. Denclue 2.0: Fast clustering based on kernel density estimation. In Proceedings of the 7th International Symposium on Intelligent Data Analysis, Ljubljana, Slovenia, 6–8 September 2007; pp. 70–80. [Google Scholar]
Sander, J.; Ester, M.; Kriegel, H.P.; Xu, X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Disc. 1998, 2, 169–194. [Google Scholar] [CrossRef]
Wang, Y.; Qian, J.; Hassan, M.; Zhang, X.; Zhang, T.; Yang, C.; Zhou, X.; Jia, F. Density peak clustering algorithms: A review on the decade 2014–2023. Expert Syst. Appl. 2023, 238, 121860. [Google Scholar] [CrossRef]
Rodriguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496. [Google Scholar] [CrossRef]
Guan, J.; Li, S.; He, X.; Zhu, J.; Chen, J. Fast hierarchical clustering of local density peaks via an association degree transfer method. Neurocomputing 2021, 455, 401–418. [Google Scholar] [CrossRef]
Zhu, Q.; Tang, X.; Elahi, A. Application of the novel harmony search optimization algorithm for dbscan clustering. Expert. Syst. Appl. 2021, 178, 115054. [Google Scholar] [CrossRef]
Long, Z.; Gao, Y.; Meng, H.; Yao, Y.; Li, T. Clustering based on local density peaks and graph cut. Inf. Sci. 2022, 600, 263–286. [Google Scholar] [CrossRef]
Lotfi, A.; Moradi, P.; Beigy, H. Density peaks clustering based on density backbone and fuzzy neighborhood. Pattern Recognit. 2020, 107, 107449. [Google Scholar] [CrossRef]
Zhu, Y.; Ting, K.M.; Jin, Y.; Angelova, M. Hierarchical clustering that takes advantage of both density-peak and density-connectivity. Inform. Syst. 2022, 103, 101871. [Google Scholar] [CrossRef]
Li, R.; Yang, X.; Qin, X.; Zhu, W. Local gap density for clustering high-dimensional data with varying densities. Knowl. Based Syst. 2019, 184, 104905. [Google Scholar] [CrossRef]
Guan, J.; Li, S.; He, X.; Chen, J. Clustering by fast detection of main density peaks within a peak digraph. Inf. Sci. 2023, 628, 504–521. [Google Scholar] [CrossRef]
Zou, Y.; Wang, Z. ConDPC: Data connectivity-based density peak clustering. Appl. Sci. 2022, 12, 12812. [Google Scholar] [CrossRef]
Hou, j.; Lin, H.; Yuan, H.; Pelillo, M. Flexible Density Peak Clustering for Real-World Data. Pattern Recognit. 2024, 156, 110772. [Google Scholar] [CrossRef]
Zhu, Y.; Ting, K.M.; Carman, M.J. Density-ratio based clustering for discovering clusters with varying densities. Pattern Recognit. 2016, 60, 983–997. [Google Scholar] [CrossRef]
Zou, Y.; Wang, Z.; Xu, P.; Lv, T. An Improved Density Peaks Clustering Algorithm Based on Density Ratio. Comput. J. 2024, 67, 2515–2528. [Google Scholar] [CrossRef]
Maggioni, M.; Murphy, J.M. Learning by Unsupervised Nonlinear Diffusion. J. Mach. Learn. Res. 2019, 20, 1–56. [Google Scholar]
Murphy, J.M.; Polk, S.L. A Multiscale Environment for Learning by Diffusion. Appl. Comput. Harmon. Anal. 2022, 57, 58–100. [Google Scholar] [CrossRef]
Gionis, A.; Mannila, H.; Tsaparas, P. Clustering aggregation. In Proceedings of the 21st International Conference on Data Engineering, Tokyo, Japan, 5–8 April 2005; pp. 341–352. [Google Scholar]
Jain, A.K.; Law, M.H.C. Data clustering: A user’s dilemma. In Proceedings of the Pattern Recognition and Machine Intelligence, Kolkata, India, 20–22 December 2005; pp. 1–10. [Google Scholar]
Piantoni, J.; Faceli, K.; Sakata, T.C.; Pereira, J.C.; de Souto, M.C.P. Impact of Base Partitions on Multi-objective and Traditional Ensemble Clustering Algorithms. In Proceedings of the ICONIP2015, Istanbul, Turkey, 3–8 November 2015. [Google Scholar]
Du, M.; Ding, S.; Xue, Y.; Shi, Z. A novel density peaks clustering with sensitivity of local density and density-adaptive metric. Knowl. Inf. Syst. 2019, 59, 285–309. [Google Scholar] [CrossRef]
Zelnik-Manor, L.; Perona, P. Self-tuning spectral clustering. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 1 December 2004; pp. 1601–1608. [Google Scholar]
UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ (accessed on 20 October 2024).
Samaria, F.S.; Harter, A.C. Parameterisation of a stochastic model for human face identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
Rijsbergen, V. Foundation of Evaluation. J. Doc. 1974, 30, 365–373. [Google Scholar] [CrossRef]
Meilà, M. Comparing clusterings—An information based distance. J. Multivariate Anal. 2007, 98, 873–895. [Google Scholar] [CrossRef]

Figure 1. Decision graph and clustering result of dataset 2sp2glob (parameter p = 0.3): (a) decision graph; (b) clustering result where different colors correspond to different clusters (the star markers represent the density peaks corresponding to those in (a)).

Figure 2. Decision graph and clustering result of dataset Jain (parameter p = 0.3): (a) decision graph; (b) clustering result where different colors correspond to different clusters (the star markers represent the density peaks corresponding to those in (a)).

Figure 3. RDBSCAN flowchart.

Figure 4. Clustering process (

k = 6, τ = 0.7

): (a) ground truth; (b) find the first local center (shaped as the red star); (c) add

k n n

to the bottom cluster; (d) find the second local center (shaped as the blue star); (e) add

k n n

to the top cluster; (f) final result.

Figure 4. Clustering process (

k = 6, τ = 0.7

): (a) ground truth; (b) find the first local center (shaped as the red star); (c) add

k n n

to the bottom cluster; (d) find the second local center (shaped as the blue star); (e) add

k n n

to the top cluster; (f) final result.

Figure 5. Results on bottleneck dataset: (a) ground truth; (b) RDBSCAN; (c) DBSCAN; (d) DPC; (e) OPTICS; (f) LGD; (g) ConDPC; (h) DBSCAN-DPC; (i) DCHDP; (j) LUND.

Figure 6. Results on cloud dataset: (a) ground truth; (b) RDBSCAN; (c) DBSCAN; (d) DPC; (e) OPTICS; (f) LGD; (g) ConDPC; (h) DBSCAN-DPC; (i) DCHDP; (j) LUND.

Figure 7. Results on CircleSquare dataset: (a) ground truth; (b) RDBSCAN; (c) DBSCAN; (d) DPC; (e) OPTICS; (f) LGD; (g) ConDPC; (h) DBSCAN-DPC; (i) DCHDP; (j) LUND.

Figure 8. Values of evaluation metrics on real-world datasets. (a) ARI; (b) VI; (c) F-measure; (d) mean value of the three metrics.

Figure 9. ARI result on different

k - τ

values: (a) Ecoli; (b) Glass; (c) Libras; (d) Thyroid; (e) Wine; (f) Zoo; (g) Seeds; (h) Olivetti faces.

Figure 9. ARI result on different

k - τ

values: (a) Ecoli; (b) Glass; (c) Libras; (d) Thyroid; (e) Wine; (f) Zoo; (g) Seeds; (h) Olivetti faces.

Table 1. Parameters for the algorithms used.

Algorithm	Value Range
RDBSCAN	$k \in {2, 3, \dots, 40}$ ; $τ \in {0.5, 0.61, \dots, 0.99}$ ; $σ_{0} \in {0.01, 0.02, \dots, 1}$
DBSCAN	$e p s \in {0.01, 0.02, \dots, 0.9}$ ; $minpts \in {2, 3, \dots, 50}$
DPC	$p \in {0.1, 0.2, \dots, 10}$ ; n: count of clusters
OPTICS	$minpts \in {2, 3, \dots, 50}$ ; $xi \in {0.01, 0.02, \dots, 0.99}$
LGD	$k \in {2, 3, \dots, 20}$ ; $τ \in {0.5, 0.51, \dots, 0.62}$ ; n: count of clusters
ConDPC	$p \in {0.1, 0.2, \dots, 10}$ ; n: count of clusters
DBSCAN-DPC	$e p s \in {0.01, 0.011, \dots, 0.04}$ ; $k \in {5, 6, \dots, 20}$
DCHDP	$e p s \in {0.01, 0.02, \dots, 0.99}$ ; $minpts \in {2, 3, \dots, 50}$ ; n: count of clusters
LUND	$σ \in {0.01, 0.02, \dots, 1}$ ; $σ_{0} \in {0.01, 0.02, \dots, 1}$ ; $k \in {5, 10, \dots, 50}$ ; $i \in {1, 2, \dots, T + 2}$

Table 2. Dataset information.

Dataset	Sample Size	Attributes	Clusters
Aggregation [27]	788	2	7
Jain [28]	373	2	2
2sp2glob [29]	2000	2	4
Halfkernal [30]	1000	2	2
ThreeCircle [31]	299	2	3
CircleSquare [31]	238	2	3
Bottleneck [26]	796	2	3
Cloud	900	2	2
Ecoli [32]	336	7	8
Glass [32]	214	9	6
Thyroid [32]	215	5	3
Libras [32]	360	90	15
Wine [32]	178	13	3
Zoo [32]	101	16	7
Seeds [32]	210	7	3
Olivetti faces [33]	400	92 × 112	40

Table 3. Results of evaluation metrics on synthetic datasets.

Algorithm		Aggregation	Jain	2sp2glob	Halfkernal	ThreeCircle	CircleSquare	Bottleneck	Cloud	Mean
RDBSCAN	ARI	1	1	1	1	1	1	1	0.982	0.998
	F-measure	1	1	1	1	1	1	1	0.996	1.000
	VI	0	0	0	0	0	0	0	0.057	0.007
	pra	22/0.97/1	6/0.7/1	6/0.7/1	8/0.7/1	7/0.5/1	8/0.7/1	10/0.5/1	9/0.5/1
DBSCAN	ARI	0.992	0.976	0.995	1	1	0.840	1	0.910	0.964
	F-measure	0.996	0.976	0.998	1	1	0.851	1	0.978	0.975
	VI	0.033	0.089	0.023	0	0	0.497	0	0.200	0.105
	pra	0.08/21	0.08/2	0.04/2	0.08/5	0.09/5	0.08/2	0.05/2	0.07/6
DPC	ARI	0.996	0.715	1	0.422	0.214	0.533	0.676	0.960	0.690
	F-measure	0.998	0.910	1	0.819	0.586	0.634	0.745	0.990	0.835
	VI	0.026	0.421	0	0.710	1.394	0.905	0.534	0.098	0.511
	pra	3.1/7	0.3/2	1/4	0.1/2	0.2/3	7.1/3	0.6/3	2.6/2
OPTICS	ARI	0.985	1	0.999	1	1	1	1	0.790	0.972
	F-measure	0.991	0.500	0.9995	1	1	0.667	1	0.945	0.872
	VI	0.069	0	0.007	0	0	0	0	0.394	0.059
	pra	10/0.08	6/0.21	15/0.1	6/0.37	5/0.44	3/0.71	3/0.7	50/0.02
LGD	ARI	1	1	1	0.562	1	0.773	1	0.982	0.915
	F-measure	1	1	1	0.873	1	0.847	1	0.996	0.965
	VI	0	0	0	0.594	0	0.419	0	0.051	0.133
	pra	3/0.58/7	14/0.5/2	2/0.51/4	13/0.5/2	7/0.5/3	3/0.5/3	13/0.5/3	6/0.5/2
ConDPC	ARI	0.996	1	1	1	1	0.614	1	0.960	0.946
	F-measure	0.998	1	1	1	1	0.677	1	0.990	0.958
	VI	0.026	0	0	0	0	0.683	0	0.098	0.101
	pra	3.1/7	3/2	0.3/4	1.5/2	7.2/3	2/3	4.2/3	3.3/2
DBSCAN-DPC	ARI	0.996	1	0.714	0.836	1	0.797	0.431	0.960	0.842
	F-measure	0.998	1	0.667	0.943	1	0.893	0.707	0.990	0.900
	VI	0.026	0	0.347	0.254	0	0.485	0.917	0.098	0.266
	pra	0.033/5	0.015/14	0.04/7	0.012/18	0.01/8	0.01/8	0.026/18	0.037/9
DCHDP	ARI	0.998	1	0.995	1	1	0.971,	1	0.953	0.992
	F-measure	0.999	1	0.998	1	1	0.984	1	0.988	0.996
	VI	0.015	0	0.023	0	0	0.112	0	0.118	0.034
	pra	21/0.09/7	2/0.99/2	2/0.04/4	6/0.37/2	8/0.02/3	2/0.07/3	3/0.7/3	3/0.12/2
LUND	ARI	0.995	1	1	1	1	0.736	1	0.965	0.962
	F-measure	0.996	1	1	1	1	0.841	1	0.991	0.979
	VI	0.029	0	0	0	0	0.589	0	0.089	0.088
	pra	35/0.1/0.1/7	10/0.1/0.11/12	5/0.2/0.11/17	10/0.2/0.11/15	10/0.7/0.1/13	10/0.3/0.2/9	10/0.4/0.1/21	5/0.2/0.1/12

“pra” represents the corresponding parameter values.

Table 4. Results of evaluation metrics on real-world datasets.

Algorithm		Ecoli	Glass	Libras	Thyroid	Wine	Zoo	Seeds	Olivetti Faces	Mean Value
RDBSCAN	ARI	0.717	0.273	0.377	0.86	0.676	0.421	0.7	0.74	0.600
	F-measure	0.559	0.498	0.539	0.711	0.878	0.714	0.878	0.826	0.700
	VI	0.975	1.711	1.909	0.537	0.85	1.179	0.82	0.717	1.087
	pra	16/0.5/0.52	20/0.5/0.86	14/0.5/0.92	36/0.5/0.6	40/0.5/0.92	13/0.5/0.18	31/0.59/0.95	6/0.5/0.61
DBSCAN	ARI	0.669	0.246	0.197	0.742	0.538	0.059	0.524	0.617	0.449
	F-measure	0.298	0.316	0.391	0.464	0.526	0.403	0.543	0.736	0.460
	VI	1.110	1.487	2.367	0.797	0.816	2.108	0.912	0.868	1.308
	pra	0.23/30	0.27/2	0.90/2	0.09/2	0.51/23	0.5/2	0.3/35	0.73/2
DPC	ARI	0.349	0.200	0.319	0.165	0.672	0.514	0.767	0.690	0.460
	F-measure	0.259	0.408	0.424	0.511	0.885	0.452	0.913	0.733	0.573
	VI	1.508	1.868	2.078	1.284	0.628	0.909	0.582	0.701	1.195
	pra	0.4/8	1/6	0.3/15	1.3/3	2.0/3	9/7	0.7/3	5.9/40
OPTICS	ARI	0.643	0.252	0.076	0.683	0.588	0.508	0.565	0.293	0.451
	F-measure	0.296	0.229	0.366	0.482	0.762	0.582	0.818	0.722	0.532
	VI	1.135	1.468	2.363	0.748	0.901	0.94	0.929	1.288	1.222
	pra	28/0.01	19 /0.08	5/0.03	5/0.09	15/0.01	6/0.01	22/0.01	5/0.01
LGD	ARI	0.718	0.218	0.384	0.710	0.727	0.802	0.773	0.667	0.625
	F-measure	0.576	0.450	0.478	0.876	0.907	0.746	0.919	0.746	0.712
	VI	1.065	2.056	1.835	0.577	0.558	0.514	0.605	0.764	0.997
	pra	6/0.57/8	4/0.58/6	8/0.62/15	4/0.55/3	14/0.51/3	17/0.62/7	8/0.5/3	3/0.54/40
ConDPC	ARI	0.692	0.211	0.225	0.448	0.660	0.678	0.767	0.684	0.546
	F-measure	0.444	0.380	0.431	0.545	0.879	0.717	0.913	0.706	0.627
	VI	1.156	1.503	1.909	0.725	0.644	0.609	0.582	0.696	0.978
	pra	0.6/8	7.8/6	1.5/15	3.8/3	0.8/3	7.9/7	0.7/3	7.3/40
DBSCAN-DPC	ARI	0.763	0.230	0.396	0.502	0.518	0.208	0.735	0.470	0.478
	F-measure	0.562	0.383	0.476	0.627	0.810	0.189	0.903	0.471	0.553
	VI	0.870	1.976	1.686	0.880	0.988	1.268	0.724	1.189	1.198
	pra	0.035/9	0.01/8	0.035/6	0.01/19	0.01/8	0.01/5	0.028/7	0.029/5
DCHDP	ARI	0.629	0.216	0.085	0.71	0.762	0.046	0.739	0.652	0.480
	F-measure	0.464	0.476	0.389	0.778	0.91	0.344	0.904	0.747	0.627
	VI	1.03	1.816	1.935	0.682	0.513	2.185	0.686	0.76	1.201
	pra	2/0.23/8	6/0.28/6	6/0.95/15	6/0.27/3	43/0.66/3	2/0.5/7	10/0.38/3	3/0.84/40
LUND	ARI	0.703	0.288	0.306	0.524	0.949	0.678	0.767	0.473	0.586
	F-measure	0.494	0.333	0.419	0.630	0.983	0.698	0.913	0.46	0.614
	VI	0.868	1.507	1.935	0.844	0.156	0.976	1.680	1.565	1.191
	pra	10/0.2/0.5/6	10/0.3/0.3/25	10/1/0.9/6	10/0.1/0.4/1	15/0.4/0.1/5	15/0.8/0.8/6	30/0.4/0.3/2	30/0.4/0.2/8

“pra” represents the corresponding parameter values.

Table 5. Comparison of Running Times (in Seconds) for Different Algorithms.

	RDBSCAN	DBSCAN	DPC	OPTICS	LGD	ConDPC	DBSCAN-DPC	DCHDP	LUND
Aggregation	0.004	0.011	0.028	0.225	0.043	0.044	0.037	0.031	0.22
ThreeCircle	0.003	0.003	0.005	0.081	0.008	0.008	0.006	0.003	0.026
Jain	0.003	0.004	0.007	0.106	0.016	0.011	0.013	0.016	0.181
2sp2glob	0.013	0.064	0.18	0.624	0.411	0.301	0.342	0.14	1.231
Halfkernal	0.006	0.017	0.041	0.291	0.311	0.068	0.074	0.059	0.342
CircleSquare	0.003	0.003	0.004	0.067	0.011	0.005	0.016	0.007	0.171
Bottleneck	0.004	0.011	0.028	0.233	0.049	0.055	0.058	0.82	0.25
Cloud	0.005	0.015	0.036	0.262	0.059	0.059	0.057	0.045	0.31
Ecoli	0.005	0.161	0.011	0.097	0.013	0.01	0.072	0.012	0.171
Glass	0.004	0.061	0.004	0.061	0.007	0.006	0.028	0.007	0.162
Libras	0.022	0.232	0.024	7.82	0.026	0.027	0.058	0.231	0.191
Thyroid	0.003	0.061	0.004	0.063	0.007	0.005	0.026	0.071	0.171
Wine	0.003	0.044	0.004	0.052	0.006	0.005	0.011	0.054	0.162
seeds	0.003	0.003	0.004	0.057	0.007	0.005	0.008	0.011	0.014
zoo	0.002	0.016	0.003	0.051	0.004	0.003	0.004	0.019	0.152
Olivetti faces	0.505	0.512	0.513	8.84	0.511	0.525	0.038	0.521	0.645
mean value	0.037	0.086	0.056	1.330	0.103	0.071	0.053	0.144	0.275

Each row corresponds to the runtime of each algorithm on different datasets.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zou, Y.; Wang, Z.; Wang, X.; Lv, T. A Clustering Algorithm Based on Local Relative Density. Electronics 2025, 14, 481. https://doi.org/10.3390/electronics14030481

AMA Style

Zou Y, Wang Z, Wang X, Lv T. A Clustering Algorithm Based on Local Relative Density. Electronics. 2025; 14(3):481. https://doi.org/10.3390/electronics14030481

Chicago/Turabian Style

Zou, Yujuan, Zhijian Wang, Xiangchen Wang, and Taizhi Lv. 2025. "A Clustering Algorithm Based on Local Relative Density" Electronics 14, no. 3: 481. https://doi.org/10.3390/electronics14030481

APA Style

Zou, Y., Wang, Z., Wang, X., & Lv, T. (2025). A Clustering Algorithm Based on Local Relative Density. Electronics, 14(3), 481. https://doi.org/10.3390/electronics14030481

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Clustering Algorithm Based on Local Relative Density

Abstract

1. Introduction

2. Related Works

2.1. DBSCAN

2.2. DPC

2.3. Relative Density

3. Proposed Algorithm

3.1. Definitions

3.2. RDBSCAN

3.3. Computational Complexity Analysis

3.4. Execution Demonstration

4. Results

4.1. Experimental Settings

4.1.1. Comparison Algorithms and Datasets

4.1.2. Clustering Metrics

4.2. Results on Synthetic Datasets

4.2.1. Visualization Results

4.2.2. Results of Evaluation Metrics

4.3. Results on Real-World Datasets

4.4. Results on Different Datasets with Different k-τ Values

4.5. Runtime Comparison

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI