1. Introduction
In the field of machine learning and data mining, research on clustering has always attracted extensive attention [1,2,3,4]. Clustering methods are primarily categorized as partition-based, density-based, and hierarchical methods [5,6,7]. Partition-based methods assign samples to clusters on the basis of their features. Density-based methods, in general, group samples according to the number of samples in each region of the space. Hierarchical methods classify samples according to hierarchical (containment) relationships among them. Hierarchical clustering is suitable for small datasets but not for large ones. Partition-based clustering is one of the most widely used methods for large datasets. Density-based clustering divides data points into high-density and low-density regions and is also suitable for large datasets. In recent years, three-way soft clustering has emerged as a new direction in clustering research; it treats samples in the positive region as belonging to a cluster, samples in the boundary region as partially belonging to it, and samples in the negative region as not belonging to it [8,9,10]. Clustering divides the objects in a set into different clusters according to a certain criterion (typically distance) so as to increase intra-cluster similarity and reduce inter-cluster similarity. Clustering algorithms are further divided into hard and soft clustering: hard clustering assigns each sample to exactly one cluster, whereas soft clustering allows a sample to belong to several clusters. Among these approaches, K-means and FCM are the most widely used clustering methods [11,12]. These two algorithms have driven research on clustering, and much work has been devoted to improving them. However, noise and the initial cluster centers often influence their results, and clustering quality frequently degrades significantly on high-dimensional and complex data. A good clustering algorithm should be high-performing, robust, and scalable.
To resolve these difficulties, several improved clustering algorithms have been proposed [13,14,15,16]. The K-means method assigns each sample to exactly one cluster; in practical applications, however, clusters often overlap and the boundaries between them are fuzzy. To handle such uncertain data objects, soft clustering algorithms were introduced. Bezdek [17] introduced fuzzy set theory into K-means and proposed the FCM algorithm, which uses a membership function to quantify the relationship between objects and clusters; it has demonstrated efficient performance in a variety of applications [18,19]. However, the degree of membership does not always correspond to the cluster to which a sample actually belongs. FCM is a partition-based algorithm: the advantage of partition-based clustering is fast convergence, while the disadvantages are that the number of clusters must be reasonably estimated and that the choice of initial centers and the presence of noise can significantly affect the clustering results. In these two traditional methods, all features are given the same weight, which makes them easily influenced by noise [20,21]. These methods are also susceptible to random initial cluster centers, and poor initialization is likely to produce a locally optimal solution [22,23]. When facing high-dimensional and complex data, the data are usually very sparse in space, the sample size appears small compared with the dimensionality, and the features of the clusters are not obvious. Traditional clustering algorithms cannot guarantee robustness in this setting; hence, it is extremely important to make full use of the properties of the features.
To address these issues, it is critical to recognize that different features should be assigned different weights. In previous studies, feature weights have been assigned in one of two ways. The first is to assign weights globally: throughout the process, a given feature receives only one weight. The second is to assign local weights, which means that a feature may have different weights in different clusters. Numerous studies have demonstrated that local weighting outperforms global weighting [24]. Accordingly, the SCAD algorithm considers different feature weights in different clusters and simultaneously clusters the data while performing feature discrimination [25]. However, conventional FCM algorithms, including their improved variants, constrain the related variables through exponential regularization, which may lead to degenerate results and low precision when dealing with sparse and noisy data (the denominator becomes 0). Entropy has therefore been proposed to better constrain the features during the clustering process. Entropy-Weighting K-Means (EWKM) is a particularly prominent method among local-weight and entropy-weight algorithms [26]. EWKM pioneered the entropy-weight form and applied it to the memberships to better constrain the objective function; the entropy term was introduced to encourage more features to contribute to identifying clusters. This approach is more efficient in dealing with various data types, particularly high-dimensional sparse data. Entropy weights can be used to assign relevant feature weights within a cluster, whereas fuzzy partitions help identify the best partition. This method, however, ignores the limitations of the Euclidean distance and may produce inconsistent classifications [24,27]. Moreover, its weights depend strongly on the initial cluster centers and are sensitive to changes in those centers; if the initial centers vary, the algorithm is neither robust nor guaranteed to converge. Improvements to clustering algorithms should therefore focus on these aspects. To address the effect of noise and of features having different weights in different clusters, improved algorithms with classification criteria based on non-Euclidean distances have been proposed [28,29,30]. Some unsupervised clustering studies have also employed entropy weights: both the Entropy K-Means [31] and U-K-Means [32] algorithms use feature weights and entropy weights in unsupervised learning. However, the distance used in their objective functions is the Euclidean distance, which does not perform well on high-dimensional and noisy data, so noise and irrelevant features can still influence the overall clustering results. In recent years, feature weights and entropy weights for fuzzy clustering have attracted extensive interest, but these algorithms frequently consider only the Euclidean distance and do not constrain the features or use fuzzy partitioning [33,34,35].
To overcome the flaws mentioned above, this paper proposes an improved FCM algorithm for handling high-dimensional, sparse, and noisy data. The novelty of the proposed algorithm lies in using entropy to control both the features and the partitions and in introducing a non-Euclidean distance that reduces the interference of noise. Moreover, entropy weights are applied to the membership and weight variables to enhance efficiency in processing different datasets. To improve clustering on high-dimensional and complex data, the intra-cluster dispersion is minimized and the negative weight entropy is maximized, which motivates more features to help identify clusters. Furthermore, the proposed method updates the membership degrees of the samples in the different clusters and the feature weights in the different clusters during each iteration, so the objective function converges rapidly. The algorithm handles high-dimensional data and noise efficiently through feature weights and the non-Euclidean distance formula, and the entropy weights used to constrain the variables can be more advantageous than exponential constraints in some cases. Extensive experimental results on real datasets suggest that the proposed algorithm performs better in clustering, including on high-dimensional and complex datasets, and that it is robust and stable.
The remainder of the paper comprises the following sections: Section 2 introduces several of the most classic clustering methods. In Section 3, the proposed algorithm, its convergence proof, and its complexity analysis are provided. The performance of the proposed algorithm and other clustering algorithms is compared and evaluated using different clustering metrics in Section 4. Lastly, Section 5 presents a summary of the paper.
3. The Proposed Algorithm
In this section, a novel entropy-weighting algorithm based on Fuzzy C-Means is proposed. Motivated by the shortcomings of the traditional Fuzzy C-Means clustering algorithm, the new algorithm incorporates local feature weighting and entropy weights acting on both the features and the membership degrees, which reduces the sensitivity of clustering to random initial class centers and improves its accuracy. Furthermore, because the Euclidean distance is susceptible to noise and outliers [29], a non-Euclidean distance is introduced. The new distance formula makes the algorithm more robust and makes full use of the dataset's features to obtain the clustering result. The objective function is defined in Equation (15).
In Equation (15), $U = [u_{ik}]$ is an $N \times K$ matrix, in which $u_{ik}$ denotes the degree of membership of the $i$-th sample to the center of the $k$-th cluster; $Z = [z_{kj}]$ is a $K \times M$ matrix, whose $k$-th row represents the center of the $k$-th cluster; and $W = [w_{kj}]$ is a $K \times M$ matrix, where $w_{kj}$ denotes the weight of the $j$-th feature in the $k$-th cluster. In other words, $U$ is the membership matrix of each sample with respect to each cluster, containing $N$ samples and $K$ clusters; $Z$ is the feature center matrix of the clusters, containing $K$ clusters and $M$ features; and $W$ is the feature weight matrix of the clusters, containing $K$ clusters and $M$ features. The term $d(x_{ij}, z_{kj})$ denotes a non-Euclidean distance between the $i$-th sample and the $k$-th cluster center on the $j$-th feature; it is defined in terms of $\beta_j$, the inverse of the variance of the $j$-th feature of the data.
Minimizing Equation (15) subject to its constraints forms a class of constrained nonlinear optimization problems. The usual approach is partial optimization over $U$, $W$, and $Z$. First, $W$ and $Z$ are fixed, and the reduced objective is minimized with respect to $U$. Next, $U$ and $Z$ are fixed, and the reduced objective is minimized with respect to $W$. Finally, $U$ and $W$ are fixed, and the reduced objective is minimized with respect to $Z$. Iterating these steps yields the solution.
The Lagrange multiplier technique is used to convert the problem into an unconstrained minimization by attaching multipliers to the constraints on the memberships and the feature weights. By setting the gradient of the resulting Lagrangian with respect to $u_{ik}$, $w_{kj}$, $z_{kj}$, and the two Lagrange multipliers to zero, the necessary optimality conditions are obtained.
From Equations (21) and (22), expressions for the Lagrange multipliers follow. Combining these with Equations (19) and (23) and substituting back into Equation (23) yields the membership update in Equation (29). Likewise, combining them with Equations (20) and (24) and substituting back into Equation (24) yields the feature-weight update in Equation (32). For the clustering centers, setting the corresponding derivative to zero and solving gives the center update in Equation (35). It is evident that Equation (35) is independent of the parameters $\lambda$ and $\gamma$; the interdependence of the remaining terms promotes the detection of a better partition during the clustering process. The proposed algorithm therefore minimizes Equation (15) by iterating Equations (29), (32), and (35).
3.1. Parameter Selection
The values of $\lambda$ and $\gamma$ are essential for the proposed algorithm, since they control the significance of the second and third terms in Equation (15) relative to the first term. The parameter $\lambda$ plays two roles in the clustering process. When $\lambda$ is large, the second (membership-entropy) term has a more significant influence in minimizing Equation (15); the algorithm then tends to assign each sample to more than one cluster in order to make the second term more negative. The membership entropy is largest when a sample has equal membership in all clusters, so for a very large entropy weight all the cluster centers move to the same position once the positions of the samples are fixed. Conversely, when $\lambda$ is small, the first (dispersion) term plays the key role in minimizing Equation (15). The local feature weights are controlled by $\gamma$. Since $\gamma$ is positive, the weight $w_{kj}$ is inversely related to the dispersion of the $j$-th feature within the $k$-th cluster: a smaller dispersion results in a larger weight. If $\gamma$ is large, the third term controls the partition, and all features are assigned weights in the different clusters.
Assume that $J^{(t)}$ denotes the value of Equation (15) after the $t$-th iteration and $J^{(t+1)}$ denotes its value after the next iteration. The proposed algorithm is summarized below (Algorithm 1).
Algorithm 1. Proposed Clustering Algorithm.
Input: Dataset $X$, the number of clusters $K$, and the values of the parameters $\lambda$ and $\gamma$. Randomly set $K$ cluster centers, generate a set of initial weights, set $t = 0$, the maximum number of iterations $T_{\max}$, and the stopping threshold $\varepsilon$.
Output: The final partition matrix $U$.
Repeat
1: Compute the non-Euclidean distance matrix $D$.
2: Update the partition matrix $U$ using Equation (29).
3: Update the weight matrix $W$ using Equation (32).
4: Update the cluster center matrix $Z$ using Equation (35).
5: Until the decrease in the objective function is less than or equal to $\varepsilon$ or the maximum number of iterations $T_{\max}$ is reached.
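The Python sketch below mirrors the steps of Algorithm 1. Since Equations (29), (32), and (35) are not reproduced in this text, the updates are written in the softmax-like closed forms that entropy-regularized memberships and weights typically take, and the center step is a simple membership-weighted mean; all three are stand-ins rather than the paper's exact update rules, the bounded exponential distance from the earlier sketch is likewise an assumption, and the function name `fit` and the default parameter values are illustrative only.

```python
import numpy as np

def fit(X, K, lam=0.8, gamma=0.3, max_iter=100, eps=1e-5, seed=0):
    """Sketch of Algorithm 1 under the assumptions stated in the lead-in."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Z = X[rng.choice(N, size=K, replace=False)]     # random initial centers
    W = np.full((K, M), 1.0 / M)                    # uniform initial feature weights
    beta = 1.0 / np.maximum(X.var(axis=0), 1e-12)   # inverse per-feature variance
    prev_obj = np.inf
    for t in range(max_iter):
        # Step 1: per-feature non-Euclidean distances (assumed bounded exponential form)
        D = 1.0 - np.exp(-beta[None, None, :] * (X[:, None, :] - Z[None, :, :]) ** 2)
        # Step 2: membership update (softmax over clusters), stand-in for Equation (29)
        cost_u = np.einsum('kj,ikj->ik', W, D)
        U = np.exp(-cost_u / lam)
        U /= U.sum(axis=1, keepdims=True)
        # Step 3: weight update (softmax over features), stand-in for Equation (32)
        cost_w = np.einsum('ik,ikj->kj', U, D)
        W = np.exp(-cost_w / gamma)
        W /= W.sum(axis=1, keepdims=True)
        # Step 4: center update, membership-weighted mean stand-in for Equation (35)
        Z = np.einsum('ik,im->km', U, X) / U.sum(axis=0)[:, None]
        # Step 5: stop when the decrease of the objective falls below eps
        obj = (U[:, :, None] * W[None, :, :] * D).sum() \
              + lam * (U * np.log(U + 1e-12)).sum() \
              + gamma * (W * np.log(W + 1e-12)).sum()
        if abs(prev_obj - obj) <= eps:
            break
        prev_obj = obj
    return U, Z, W

# Example usage: U, Z, W = fit(np.random.default_rng(1).normal(size=(100, 4)), K=3)
```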
3.2. Convergence Analysis
It is important to note that the proposed algorithm converges within the iterations; the key observation is that each distinct partition $U$ can occur only once during the algorithm's execution. Suppose, to the contrary, that the same partition appears at two different iterations, i.e., $U^{(t_1)} = U^{(t_2)}$ with $t_1 \neq t_2$. Given $U^{(t_1)}$, the corresponding minimizers must be calculated: for $Z$ and $W$, the minimizers are $Z^{(t_1)}$ and $W^{(t_1)}$, respectively, and they are uniquely determined by $U^{(t_1)}$. Using $U^{(t_2)}$, the minimizers $Z^{(t_2)}$ and $W^{(t_2)}$ can be calculated in the same way. Since $U^{(t_1)} = U^{(t_2)}$, it follows that $J^{(t_1)} = J^{(t_2)}$. However, the objective function decreases strictly monotonically over the iterations, which is a contradiction. Therefore, each distinct partition $U$ occurs only once during the algorithm's execution, and since the number of possible partitions is finite, the algorithm converges in a finite number of iterations.
The membership $u_{ik}$ in Equation (29) is obtained by taking the derivative of Equation (18) and setting it equal to zero; such a stationary point could, in principle, be either a minimum or a maximum. If the second partial derivative of Equation (18) with respect to $u_{ik}$ is positive, then $u_{ik}$ as defined by Equation (29) is a local minimum of Equation (18). Since $\lambda > 0$ and $u_{ik} > 0$, the second partial derivative in Equation (37) is positive. Therefore, $u_{ik}$ in Equation (29) is a local minimum of Equation (18), and the proposed algorithm converges in a finite number of iterations.
3.3. Computational Complexity
As shown in Table 1, the computational complexity of the proposed algorithm is high when compared with other clustering algorithms; however, the added terms improve the clustering performance. The complexity of the algorithm is determined by four update processes: updating the distance matrix $D$, the cluster center matrix $Z$, the membership matrix $U$, and the weight matrix $W$. The computational complexity of each process is $O(NKM)$, where $N$ denotes the number of samples, $K$ the number of clusters, and $M$ the number of features. The processes execute independently; hence, the total computational complexity of one iteration is $O(NKM)$. Each iteration updates $D$, $U$, $W$, and $Z$ using Equations (29), (32), and (35), and the samples are finally classified according to the matrix $U$.
4. Experiments
In the experimental section, to test the performance of the proposed algorithm, it is evaluated and compared against other clustering algorithms on real-world datasets. These algorithms are the standard K-Means [11], the standard FCM [12], WK-Means [30], RLWHCM [28], SCAD [25], EWK-Means [31], and UK-Means [32]. These algorithms have their own standard parameters; if the parameter values are made uniform, their influence can be removed when observing the clustering results, so the various parameters were equalized to avoid inconsistency in algorithm performance. In the experiments, the maximum number of restarts was set to 100, and for both the proposed and the compared algorithms the clustering centroids were randomly selected from the original datasets. In practical applications, choosing an appropriate threshold $\varepsilon$ is a crucial issue: if the threshold is too small, the algorithm may converge very slowly or not at all; if it is too large, the algorithm may stop prematurely, resulting in less accurate clustering results. Therefore, experiments and adjustments are needed to determine an appropriate threshold. Considering that the best threshold $\varepsilon$ differs across datasets, it was reduced from 0.1 to 0.00001 by a factor of 10 at each step, and after testing, $\varepsilon$ was set accordingly for each dataset to obtain accurate clustering results. Each algorithm was run 100 times, and the best result was recorded.
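As an illustration of this protocol (100 random restarts combined with a sweep of the stopping threshold from 0.1 down to 0.00001), the sketch below uses scikit-learn's KMeans on the Iris data purely as a stand-in for any of the compared algorithms; selecting the "best" run by the lowest objective value is an assumption, since the selection criterion is not stated here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Protocol sketch: sweep the stopping threshold by factors of 10, restart each setting
# with random initial centers, and keep the run with the lowest objective value.
X, y = load_iris(return_X_y=True)
best = None
for tol in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    for restart in range(100):
        km = KMeans(n_clusters=3, init="random", n_init=1, tol=tol,
                    max_iter=100, random_state=restart).fit(X)
        if best is None or km.inertia_ < best[0]:
            best = (km.inertia_, tol, km.labels_)
print(f"best objective {best[0]:.3f} at tol={best[1]}")
```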
Seven real-world datasets from UCI [36] were used to assess the performance of the proposed approach and to compare its results with those of the other approaches. Furthermore, the text, face-image, and biological datasets [37] highlight the proposed algorithm's performance on high-dimensional and noisy data. These datasets are listed in Table 2. The best values for each dataset in Table 3 and Table 4 are shown in bold.
4.1. Evaluation Indicators
The clustering accuracy (ACC) is defined as
$\mathrm{ACC} = \frac{1}{N}\sum_{k=1}^{K} a_k$ (38)
In Equation (38), $a_k$ denotes the number of samples correctly classified into the $k$-th cluster, and $N$ represents the number of points in the dataset. A larger value of ACC [38] suggests better clustering performance.
Since the sample labels of the real-world datasets are known, the Rand index (RI) [38] was used to evaluate the similarity between the clustering partitions and the real partitions:
$\mathrm{RI} = \frac{TP + TN}{TP + FP + TN + FN}$ (39)
In Equation (39), $TP$ indicates the number of similar sample pairs assigned to a common partition, $FP$ indicates the number of dissimilar samples assigned to a common partition, $TN$ indicates the number of dissimilar samples assigned to two separate partitions, and $FN$ indicates the number of similar samples assigned to two different partitions. Larger values suggest better classification results.
Normalized mutual information (NMI) is often used in clustering to compute the similarity between the clustering results and the real labels of the dataset. The measurement method is given in Equation (40), where $P$ and $Q$ are the two partitions of the dataset, comprising clusters $C_i$ and $C'_j$, respectively; $p(i)$ indicates the probability that a randomly selected sample is allocated to cluster $C_i$; $p(i, j)$ represents the probability that a sample belongs to both clusters $C_i$ and $C'_j$; and $H(P)$ is the entropy associated with all the probabilities $p(i)$ in partition $P$ [39]. A larger value of NMI indicates more consistent clustering results.
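For reference, the snippet below computes the three indicators with standard tools: ACC via a Hungarian matching between predicted clusters and true classes (Equation (38)), RI via scikit-learn's rand_score (Equation (39)), and NMI via normalized_mutual_info_score; note that scikit-learn's default NMI normalization may differ from the exact form of Equation (40). The label arrays are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import rand_score, normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: match predicted clusters to true classes (Hungarian algorithm),
    then count the correctly assigned samples, as in Equation (38)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-counts)         # maximize matched samples
    return counts[row, col].sum() / len(y_true)

# Usage with hypothetical labels.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(clustering_accuracy(y_true, y_pred))            # ACC, Equation (38)
print(rand_score(y_true, y_pred))                     # RI, Equation (39)
print(normalized_mutual_info_score(y_true, y_pred))   # NMI, Equation (40)
```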
4.2. Clustering Results on Real-World Datasets
As demonstrated in Table 3 and Table 4, the clustering performance of the proposed algorithm in terms of ACC, RI, and NMI was much better than that of the other clustering methods on the different real-world datasets. The clustering results suggest that the proposed algorithm significantly improves clustering performance and provides the best results on most of the real-world and text datasets. It should also be noted that the proposed algorithm's performance was not exceptional on the Wine dataset. A careful examination of this dataset showed that it contains only three clusters and that the feature values vary little; the value of a feature in different clusters may differ by as little as 0.1, which negatively impacts any clustering algorithm and leads to poor performance of the entropy-weight terms and the non-Euclidean distance. Most classical clustering algorithms had ACC and NMI values of nearly 0.5 on the high-dimensional datasets, where the number of clusters is two, indicating that these algorithms failed due to the high dimensionality and noise. The reason is that high-dimensional data often contain many outliers and their distribution can be very sparse, and traditional clustering algorithms have difficulty clustering such sparse data. As the dimensionality increases, distance-based computation makes locating the cluster centers progressively more complex, and the centers are shifted by outliers. At the same time, because different feature weights in different clusters are not considered, the features of sparse high-dimensional data cannot be exploited, making the clustering results much less accurate. Using the entropy constraint on the features in the proposed algorithm, by contrast, improves performance on all three metrics. On the other hand, the clustering algorithms with exponential constraints equalize the sample membership degrees, resulting in an inaccurate ACC value of 0.5. The results also show that the proposed algorithm helps to detect noise and the main classification features in large datasets. The clustering results of the proposed algorithm were also better on the Face Image and Biological datasets, which shows that, with randomly initialized cluster centers, our algorithm handles high-dimensional and noisy datasets more accurately and more stably. From the clustering results, we find that the K-Means algorithm performed relatively well on the face dataset because it does not introduce feature weights, which can lead to unsatisfactory results for high-dimensional data and when the number of clusters is large. Furthermore, this demonstrates that the algorithm can be used for classification in both supervised and unsupervised learning.
Several points explain the performance of the proposed algorithm. First, the algorithm performs well even with poor initial centers, whereas the other algorithms are severely sensitive to initialization. Second, the introduction of entropy weighting allows the different features to contribute appropriately to the clustering process. Third, the non-Euclidean distance allows noisy and sparse data to be handled in the computation without degrading the clustering results.
The value range of the parameters was discussed in detail in the previous section. To further examine the sensitivity of $\lambda$ and $\gamma$, the algorithm was analyzed on the Iris and Zoo datasets. First, $\gamma = 0.3$ was fixed while $\lambda$ was increased from 0 to 2 in steps of 0.1; the sensitivity of $\lambda$ can be inferred from Figure 1. Then, $\lambda = 0.8$ was fixed while $\gamma$ was increased from 0 to 2 in steps of 0.1; the sensitivity of $\gamma$ can be observed in Figure 2. The figures show that ACC, RI, and NMI did not fluctuate much when the two parameters were varied, which highlights the excellent performance and robustness of our algorithm, and the high values of the three metrics directly reflect its efficiency. To better demonstrate the advantage of the proposed non-Euclidean distance in computing sample distances, the ACC of the proposed algorithm was analyzed with the non-Euclidean distance and with the Euclidean distance on the Iris and Zoo datasets. Figure 3 shows that the clustering results based on the non-Euclidean distance were more accurate than those based on the Euclidean distance: with the non-Euclidean distance, the average ACC values were 0.95 and 0.93, respectively, whereas with the Euclidean distance they were 0.79 and 0.80, respectively. The new distance formula increased the ACC of the algorithm by 18%, which indicates the advantage of the proposed non-Euclidean distance in dealing with high-dimensional sparse and noisy data. Furthermore, the variance of the ACC with the non-Euclidean distance was 0.025, while that with the Euclidean distance was 0.074, indicating that the non-Euclidean distance makes the algorithm more robust. To better distinguish the algorithms compared in the experiment, Table 5 summarizes the conditions under which they are applicable. Overall, the experimental results show that the proposed algorithm achieves high clustering accuracy on most real-world datasets from various domains; the sensitivity analysis of the two parameters demonstrates that the algorithm is robust and performs well; and the comparison of the Euclidean and non-Euclidean distances in the objective function shows that the non-Euclidean distance yields more accurate and stable clustering results.
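A sketch of this sensitivity sweep is given below, reusing the hypothetical `fit` and `clustering_accuracy` functions from the earlier sketches; it holds the second parameter fixed at 0.3 and varies the first, starting at 0.1 rather than 0 to avoid division by zero in the assumed softmax updates.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score, normalized_mutual_info_score

# Sensitivity sweep sketch: fix one entropy parameter, vary the other in steps of 0.1,
# and record ACC/RI/NMI. `fit` and `clustering_accuracy` refer to the earlier sketches.
X, y = load_iris(return_X_y=True)
for lam in np.arange(0.1, 2.01, 0.1):
    U, Z, W = fit(X, K=3, lam=lam, gamma=0.3)     # gamma held fixed at 0.3
    labels = U.argmax(axis=1)                     # hard assignment from memberships
    print(f"lam={lam:.1f}  ACC={clustering_accuracy(y, labels):.3f}  "
          f"RI={rand_score(y, labels):.3f}  "
          f"NMI={normalized_mutual_info_score(y, labels):.3f}")
```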
4.3. Discussion of Noise
To demonstrate the performance of the proposed algorithm under the influence of noise, a new experiment was designed on the Iris dataset: uniformly distributed values between 0 and 1 were randomly generated and added as a new noise feature. To compare the effect with and without noise, experiments were run on both the original dataset and the dataset with the noise feature. As shown in Figure 4, the noise only slightly affected the clustering results, which shows that the performance of the proposed algorithm remained good even though the dataset was contaminated with noise. Moreover, as shown in Figure 5, the noise did not influence the assignment of the feature weights. This confirms the high accuracy and stability of the algorithm.
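The noise experiment can be sketched as follows, again reusing the hypothetical `fit` function from Section 3: a uniform [0, 1] noise feature is appended to Iris, and the weight assigned to it and the agreement between the two sets of assignments are inspected.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score

# Append a uniformly distributed noise feature in [0, 1] to the Iris data and compare
# the clustering with and without it. `fit` refers to the sketch given in Section 3.
rng = np.random.default_rng(42)
X, y = load_iris(return_X_y=True)
X_noisy = np.hstack([X, rng.uniform(0.0, 1.0, size=(X.shape[0], 1))])

U_clean, _, W_clean = fit(X, K=3)
U_noisy, _, W_noisy = fit(X_noisy, K=3)

# With local feature weighting, the noise column is expected to receive small weights
# in every cluster, so the assignments should change little.
print("weight of the noise feature in each cluster:", W_noisy[:, -1])
print("RI between clean and noisy assignments:",
      rand_score(U_clean.argmax(axis=1), U_noisy.argmax(axis=1)))
```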
5. Conclusions
This paper proposed a new algorithm for clustering high-dimensional and noisy data on the basis of a non-Euclidean distance, combining feature weights and entropy weights. In this approach, two different entropy terms are added to the objective function, which helps to better identify the clustering features of complex data. The performance was compared with state-of-the-art methods in terms of different clustering measures, showing that the proposed approach partitions data with improved performance. Given the nature of the proposed algorithm and the results of extensive experiments on various datasets, it can be applied to medical research and textual information, facilitating the extraction of critical features and the production of clustering results under high-dimensional and complex data conditions. The proposed algorithm offers significant improvements in the following aspects:
- (1)
The clustering result is consistent and stable, as the algorithm is not susceptible to the initial cluster centers and assigns different feature weights to each cluster during the clustering process.
- (2)
The entropy weights improve the algorithm’s handling of partitioning during the clustering process and highlight the importance of distinguishing different features.
- (3)
The introduction of non-Euclidean distance makes the algorithm more robust and efficient in handling high-dimensional sparse and noisy data in the real world.
- (4)
The insensitivity to parameter changes ensures the flexibility of the algorithm.
In the future, EM and Gaussian mixture models will be used to improve the clustering algorithm, making it more useful in image processing.