1. Introduction
Clustering algorithms have received widespread attention and application in various fields, such as pattern recognition, biology, engineering systems, and image processing [1,2,3,4,5,6,7]. A clustering algorithm divides the data points of a given data set into several clusters such that the similarity between data points within the same cluster is greater than the similarity to points in other clusters; meanwhile, there are strong differences among the clusters, which helps reveal an asymmetric data structure. In the early stage, hard (Boolean) clustering was mainly studied, where every object strictly belonged to a single cluster. The density peak clustering (DPC) algorithm [8] is an excellent representative of this category.
Among the numerous fuzzy clustering algorithms, the fuzzy c-means (FCM) algorithm is one of the most commonly used methods [9,10,11,12,13,14]. In general, FCM attaches equal importance to all features of the data, establishing a symmetric structure that is often inconsistent with that of the original data. Therefore, weighted processing has become an important development direction. The weighted FCM (WFCM) algorithm [15] grouped data according to the weighted categories of the separated features, incorporating feature weights into the commonly used Euclidean distance for clustering. The feature-weighted fuzzy k-means (FWFKM) algorithm [16] was based on the fuzzy k-prototypes algorithm and a supervised algorithm; however, it still required two objective functions to optimize the data partition and the feature weights. The attribute weight algorithm (AWA) [17] was a fuzzy weighted subspace clustering algorithm that could effectively find the important features of each cluster, that is, discover the asymmetry of the data. However, AWA failed when the standard deviation of certain attributes was zero, since zero could then appear as a denominator in the learning rules. Improved versions have been proposed to overcome this weakness, including the fuzzy weighted k-means (FWKM) algorithm [18] and the fuzzy subspace clustering (FSC) algorithm [19,20]. In the objective function of FWKM, a small constant was added when calculating the distance, which effectively avoided the problem caused by the zero standard deviation of some attributes in AWA. Gan et al. proposed the FSC algorithm, which used a strategy similar to that of FWKM; however, the two differ significantly in how the constant parameter is set: in FSC it must be set manually, while in FWKM it is set through a predefined formula.
However, when clustering is performed in a high-dimensional space, traditional clustering algorithms expose obvious drawbacks [21,22,23]. For example, any given pair of data points in a cluster of a high-dimensional space may be far apart. Lacking a notion of multi-dimensional subspaces, traditional algorithms may deviate when calculating distances, resulting in unsatisfactory clustering performance. For most traditional clustering algorithms, a key challenge is that in many real-world problems, data points in different clusters are often related to different feature subsets; that is, clusters can exist in different subspaces [24,25,26]. Frigui and Nasraoui [27] proposed an approach called simultaneous clustering and attribute discrimination (SCAD). It used continuous feature weighting, providing a richer representation of feature relevance than feature selection, and it independently learned the feature relevance of each cluster in an unsupervised way. Later, Deng et al. [28] studied the use of intra-class and inter-class information and proposed a new clustering method called enhanced soft subspace clustering (ESSC). However, irrelevant features can degrade the performance of fuzzy clustering algorithms; in other words, different features should have different importance in clustering. Yang and Nataliani [29] proposed the feature-reduction FCM (FRFCM) algorithm, which automatically calculates the weight of each feature while reducing the influence of irrelevant features. Tang et al. [30] proposed a new kernel fuzzy clustering method called viewpoint-based kernel fuzzy clustering with weight information granules (VWKFC). Thanks to its kernel-based hypersphere density initialization algorithm, which selects cluster centers more efficiently, combined with weight information granules and a viewpoint-induction mechanism, VWKFC is significantly superior to eight existing related algorithms in processing high-dimensional data.
In the early stage, the clustering process was entirely data-driven. In fact, domain knowledge can be used to assist clustering. Pedrycz et al. [31] introduced knowledge into the data-driven process, where knowledge is embodied through viewpoints, giving the viewpoint-based fuzzy clustering algorithm V-FCM. Tang et al. [32] proposed a knowledge- and data-driven fuzzy clustering algorithm called the density viewpoint-induced possibilistic fuzzy clustering (DVPFCM) algorithm. Therein, a new calculation method for the density radius was proposed; based upon it, the hypersphere density-based cluster center initialization (HDCCI) algorithm was established to obtain initial cluster centers in densely sampled areas. The high-density points obtained by HDCCI were then used as viewpoints and integrated into the DVPFCM algorithm.
The current problems of fuzzy clustering algorithms are mainly as follows:
(1) Most fuzzy clustering algorithms are sensitive to the initialization of clustering. For example, FCM, V-FCM, SCAD, ESSC, and FRFCM all rely on the result of the initialization method. Compared with them, the initialization mechanism of the DVPFCM algorithm is better; that is, its initialization method HDCCI is based on the DPC algorithm. However, HDCCI still has a shortcoming in its calculation of the cut-off distance: it uses a fixed, rigid formula that lacks a solid foundation and proof and cannot adapt to various data sets.
(2) With the arrival of the big data era, the volume of information is constantly increasing and its dimensionality is growing at the same time, which forces clustering algorithms to handle high-dimensional data properly. However, most clustering algorithms are still weak in this respect. When faced with high-dimensional data, FCM, V-FCM, and DVPFCM have no corresponding measures. SCAD, ESSC, and FRFCM all use subspace processing methods with different weights and are relatively better. However, all of the algorithms mentioned above are purely data-driven, and their clustering efficiency and accuracy on high-dimensional data do not reach an ideal level.
In this study, we put forward the VSFCM algorithm; a brief introduction follows.
On the one hand, we put forward a new optimized cut-off distance calculation method based on the DPC algorithm, which can select the point with the highest density as the viewpoint to induce the algorithm to find the cluster center more accurately. That is, the cut-off distance-induced clustering initialization method CDCI provides a new initialization strategy and perspective in this field. Under the guidance of this initialization method, the number of iterations of the algorithm is reduced, and the accuracy of the algorithm is also improved. On the other hand, we introduce viewpoints into subspace clustering to improve the convergence speed of each subspace. Moreover, the separation term is added between the clusters to minimize the compactness of the subspace clusters and maximize the projection subspace of each cluster. On the basis of the CDCI and the viewpoint, combined with the above fuzzy feature weight processing mode, the viewpoint-driven subspace fuzzy c-means algorithm VSFCM is established.
The innovations of this study are reflected in the following aspects. First of all, a new cut-off distance is proposed, improving the cluster center initialization effect and also serving as a new viewpoint selection method, which can better adapt to the clustering data structure and speed up the clustering process. Secondly, the fuzzy weight is introduced, that is, a weight that is added to each cluster and dimension to reflect the contribution degree of each feature to the clustering. Among them, smaller weights are assigned to features with a larger proportion of noise point values to reduce their participation in clustering and indirectly weaken their impact on clustering results. Finally, a new method of separation between clusters is given. The average value of the initialized cluster centers is used as the reference point for separation between clusters. Additionally, combined with the optimized weight allocation, the distance between clusters can be effectively increased.
The paper is organized as follows.
Section 2 reviews existing related algorithms.
Section 3 describes the proposed clustering initialization method CDCI and the viewpoint-driven subspace fuzzy c-means algorithm VSFCM in detail.
Section 4 presents the experimental results of the VSFCM algorithm and several other relevant algorithms on artificial data sets and some data sets of machine learning.
Section 5 gives a summary and outlook for further research.
2. Related Work
In this section, we review several clustering algorithms closely related to our work.
Assume that the data set $X = \{x_1, x_2, \ldots, x_n\}$ is a set of $n$ samples; we divide the data into $c$ clusters to get a set of cluster centers $V = \{v_1, v_2, \ldots, v_c\}$. Each sample $x_j$ and cluster center $v_i$ is positioned in the space $\mathbb{R}^s$, where $s$ is the data dimension.
The objective function of the FCM algorithm [33] is as follows:

$$J_{\mathrm{FCM}} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \, \| x_j - v_i \|^2$$

Among them, $u_{ij}$ is the degree of membership, which ranges from 0 to 1 and needs to satisfy the constraint $\sum_{i=1}^{c} u_{ij} = 1$. $\|x_j - v_i\|$ represents the Euclidean distance between the $i$-th cluster center and the $j$-th sample, and $m > 1$ is a fuzzy coefficient.
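To make the update scheme concrete, the following is a minimal NumPy sketch of the standard FCM alternating optimization implied by this objective; the function name, the random membership initialization, and the numerical floor on distances are our own choices rather than part of [33].

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Alternate the FCM center and membership updates until convergence."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                  # columns sum to 1
    V = np.zeros((c, X.shape[1]))
    for _ in range(max_iter):
        Um = U ** m
        V_new = Um @ X / Um.sum(axis=1, keepdims=True)  # weighted-mean centers
        d2 = ((X[None, :, :] - V_new[:, None, :]) ** 2).sum(-1)  # ||x_j - v_i||^2
        inv = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)                       # membership update
        if np.linalg.norm(V_new - V) < eps:
            V = V_new
            break
        V = V_new
    return U, V
```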
Frigui and Nasraoui [27] established two versions of the SCAD algorithm. SCAD1 attempts to balance the two terms of a composite objective function and introduces a penalty term to determine the optimal attribute-relevance weights, while SCAD2 introduces a fuzzy weighting exponent to minimize a single-term criterion. In subsequent experiments we compare our algorithm with SCAD1, so only its objective is shown here:

$$J_{\mathrm{SCAD1}} = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \sum_{k=1}^{s} w_{ik} \, d_{ijk}^{2} + \sum_{i=1}^{c} \delta_i \sum_{k=1}^{s} w_{ik}^{2}$$

Thereinto, the per-feature distance is $d_{ijk} = |x_{jk} - v_{ik}|$, and $w_{ik}$ is a feature weight subject to the constraint $\sum_{k=1}^{s} w_{ik} = 1$. The penalty parameter $\delta_i$ is updated at each iteration from the quantities of the previous iteration; here $K$ is a constant, and the superscript $(t-1)$ denotes the iteration preceding the $t$-th. Frigui and Nasraoui proved that SCAD1 and SCAD2 have similar behavior and yield similar clustering results.
Deng et al. [28] proposed the enhanced soft subspace clustering (ESSC) algorithm. Its most prominent advantage is the ability to minimize the intra-class distance while maximizing the inter-class distance. The objective function of the algorithm is as follows:

$$J_{\mathrm{ESSC}} = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \sum_{k=1}^{s} w_{ik} (x_{jk} - v_{ik})^{2} + \gamma \sum_{i=1}^{c}\sum_{k=1}^{s} w_{ik} \ln w_{ik} - \eta \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \sum_{k=1}^{s} w_{ik} (v_{ik} - v_{0k})^{2}$$

Here $\gamma$ and $\eta$ are non-negative constants. Moreover, $w_{ik}$ is the feature weight, which satisfies $\sum_{k=1}^{s} w_{ik} = 1$. The third term is the weighted inter-class separation term, where $v_0$ represents the center point of the data set, expressed as

$$v_0 = \frac{1}{n} \sum_{j=1}^{n} x_j$$

Because the ESSC algorithm considers both intra-class and inter-class distances and uses the idea of entropy information, it has certain advantages among soft subspace clustering algorithms of the same category. One disadvantage of ESSC is that several parameters must be set manually, and how to select appropriate values for them is still an open question.
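As an illustration of the structure of this objective, the following sketch evaluates the ESSC objective as reconstructed above, assuming memberships U of shape (c, n) and centers V and weights W of shape (c, s); the function name and default parameter values are hypothetical.

```python
import numpy as np

def essc_objective(X, U, V, W, gamma=1.0, eta=0.1, m=2.0):
    # Reconstructed ESSC objective: weighted within-cluster distances,
    # plus a weight-entropy term, minus the weighted between-cluster term.
    v0 = X.mean(axis=0)                              # data-set center
    Um = U ** m                                      # (c, n)
    d2 = (X[None, :, :] - V[:, None, :]) ** 2        # (c, n, s) per-feature distances
    within = (Um[:, :, None] * W[:, None, :] * d2).sum()
    entropy = gamma * (W * np.log(np.fmax(W, 1e-12))).sum()
    between = (Um.sum(axis=1) * (W * (V - v0) ** 2).sum(axis=1)).sum()
    return within + entropy - eta * between
```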
Yang and Nataliani [29] came up with the feature-reduction FCM (FRFCM) algorithm. It computes a new weight for each feature by adding a feature-weighted entropy term to the objective function; during the iteration, the cluster centers and the fuzzy membership matrix are then updated with these new weights. FRFCM not only improves performance compared with FCM but can also select important features by weighting and reduce the feature dimension by discarding unimportant features. Therefore, the algorithm automatically reduces the feature dimension and achieves a good clustering effect.
The V-FCM algorithm [31] optimizes FCM by introducing viewpoints, which are represented by manually selected characteristic data such as the average, maximum, or minimum value. Its objective function retains the form of the FCM objective, with the cluster center in a designated row replaced by the viewpoint. The V-FCM algorithm still has problems: it is not suitable for processing high-dimensional data, it is sensitive to the initialization of cluster centers, and the selected type of viewpoint also affects the clustering results.
The DPC algorithm [8] assumes that a cluster center is surrounded by neighbors with lower local density and that it is at a relatively large distance from any point with a higher local density. For every data point $x_j$, it is necessary to calculate its local density $\rho_j$ and the minimum distance $\delta_j$ between itself and the points with higher local density. The formulas are:

$$\rho_j = \sum_{k \neq j} \chi(d_{jk} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases} \quad (7), (8)$$

$$\delta_j = \min_{k:\, \rho_k > \rho_j} d_{jk} \quad (9)$$

Here $d_{jk}$ is the distance between two data points and $d_c$ is the density radius (cut-off distance). The local density $\rho_j$ reflects the number of data points within the radius $d_c$. For the point with the largest local density, $\delta_j = \max_k d_{jk}$. In the DPC algorithm, a data point with high density that is also far from any point of higher density is regarded as a cluster center according to the decision graph. The reason is that a true cluster center has large values of both $\rho$ and $\delta$, while the $\rho$ value of a noise point is small.
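A small NumPy/SciPy sketch of the $\rho$ and $\delta$ computations in (7)–(9) is given below; the function name is ours, and the cut-off distance dc is assumed to be supplied by the caller.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dpc_rho_delta(X, dc):
    """Local density rho, (7)-(8), and distance to nearest denser point delta, (9)."""
    D = cdist(X, X)
    rho = (D < dc).sum(axis=1) - 1            # neighbors within dc, self excluded
    delta = np.empty(len(X))
    for j in range(len(X)):
        higher = rho > rho[j]                 # points with higher local density
        # for the highest-density point, take its maximum distance instead
        delta[j] = D[j, higher].min() if higher.any() else D[j].max()
    return rho, delta
```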
Recently, Tang et al. [32] proposed the density viewpoint-induced possibilistic fuzzy c-means (DVPFCM) algorithm. It provides a new initialization method, the hypersphere density-based cluster center initialization (HDCCI) method, which serves for viewpoint selection. On this basis, and combining the advantages of FCM and PFCM, the DVPFCM algorithm is built. In its objective function, $m$ and $p$ are the fuzzy coefficient and the fuzzy coefficient of the typicality matrix, respectively. With the help of the viewpoint, its convergence speed is improved and its robustness is strengthened.
Unfortunately, the six algorithms above (FCM, SCAD1, FRFCM, V-FCM, ESSC, and DVPFCM) all share the problem of being sensitive to the initialization of cluster centers. An improper initial value may cause the result to converge to a local optimum or slow down the clustering process, which has a great impact on the clustering result and also makes the algorithm unstable.
Moreover, these algorithms have other problems. The FCM algorithm is the most classic one, but its actual efficiency and accuracy on high-dimensional data are usually not ideal. Although the V-FCM algorithm uses viewpoints to simplify the clustering process and improve convergence speed, its noise resistance is still weak. The ESSC algorithm proposes an entropy-based weighted subspace clustering objective, which greatly increases intra-cluster compactness and inter-cluster separation; however, several parameters must be set manually, which increases the uncertainty of the algorithm. FRFCM selects important features by weighting and reduces the feature dimension by discarding unimportant features, but it fails to take the spatial structure of the data into account.
The DPC algorithm is a hard clustering algorithm with a fast running speed, but its density radius is difficult to determine, and the cluster centers must be obtained by observing the density–distance decision graph, which easily introduces human error. The DVPFCM algorithm is slightly weaker when processing high-dimensional data. In addition, the formula proposed in DVPFCM for calculating the density cut-off radius comes with no rigorous proof that the resulting radius is appropriate; in many cases this formula does not achieve good results, and its robustness is not ideal. The comparison of these algorithms is summarized in Table 1.
In a word, the existing algorithms still have great defects in the initialization of cluster centers and the processing of high-dimensional data. For this reason, we focus on solving these two types of problems.
3. Proposed VSFCM Algorithm
In this section, a new cluster initialization method named cut-off distance-induced clustering initialization (CDCI) is proposed first. Based on CDCI, the viewpoint-driven subspace fuzzy c-means algorithm (VSFCM) is then established, which introduces the subspace clustering mode and the fuzzy feature weight processing mechanism, combined with the inter-cluster separation formula optimized with the viewpoint. Its overall idea is shown in
Figure 1.
3.1. Cluster Initialization Method Induced by Cut-Off Distance
Here we present a new cluster center initialization method; the corresponding flowchart is shown in Figure 2.
The DPC algorithm is used as the starting point: the local density $\rho_j$ of a data point and its minimum distance $\delta_j$ to points with higher local density are still computed with the previous formulas, namely (7) and (9).
In this study, we put forward a new cut-off distance, defined in (12). We arrange the pairwise distances in ascending order and denote the resulting ordered sequence accordingly; moreover, $n$ is the number of data points and $c$ is the number of clusters. The cut-off distance recommended by the DPC algorithm should make the average number of neighbors of each data point about 1–2% of the total data, so we choose one candidate cut-off distance according to this upper limit ratio; this is the first factor. Then, the data are divided into $c$ clusters, each viewed as a ball whose radius is bounded by the maximum pairwise distance; the corresponding reference quantity, combined with the ordered distances, gives the second factor. Considering that the average number of neighbors should be about 1–2% of the total data, and selecting the smaller of the two factors, (12) is obtained. Then, naturally, we use the new local density calculation formula (14).
Next, we introduce the parameter $\gamma_j$ to calculate the initial cluster centers directly. The formula is as below:

$$\gamma_j = \rho_j \times \delta_j \quad (15)$$
The traditional DPC algorithm selects cluster centers subjectively from the decision graph, which easily causes human error. In contrast, our initialization method automatically selects a more appropriate cut-off radius, so the selected initial cluster centers are closer to the true ones. Specifically, we calculate the parameters $\gamma_j$, sort them in descending order, and select the point with the largest $\gamma$ value as the first initial cluster center. In the subsequent selection of high-density points, we require the distance between a candidate and every already-selected cluster center to be greater than the cut-off distance, ensuring that the initial cluster centers and the viewpoint are chosen more conveniently, efficiently, and accurately.
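A compact sketch of this selection rule follows, assuming $\gamma = \rho \times \delta$ as in (15) and a caller-supplied cut-off distance dc computed by (12), which is not reproduced here; it reuses the $\rho$/$\delta$ helper from the DPC sketch above, and all names are ours.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cdci_select(X, c, dc, rho, delta):
    """Pick c centers by descending gamma, enforcing pairwise distance > dc."""
    gamma = rho * delta                       # formula (15)
    order = np.argsort(-gamma)                # candidates in descending gamma
    centers = [order[0]]                      # top point: first center and viewpoint
    for idx in order[1:]:
        if len(centers) == c:
            break
        d = cdist(X[idx:idx + 1], X[centers]).ravel()
        if (d > dc).all():                    # skip candidates too close to chosen centers
            centers.append(idx)
    return X[np.array(centers)]
```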
3.2. The Mechanism of the VSFCM Algorithm
Here, we show the main idea of the viewpoint-driven subspace fuzzy c-means (VSFCM) algorithm. Its flowchart is shown below (
Figure 3).
The first initial cluster center selected by the CDCI method (i.e., the point with the largest value of $\gamma$) is taken as the viewpoint. The position of the viewpoint changes continually with the iterations: the cluster center closest to the viewpoint is replaced by the viewpoint, and its row number in the cluster center matrix is recorded.
We use three parts to construct the objective function. In the first part, the fuzzy feature weights are combined with the classical FCM algorithm, and the cluster centers are integrated with the viewpoint. The second part is an adaptive fuzzy weight penalty term, whose parameter can be calculated automatically. The third part is the inter-cluster separation term, in which the fuzzy feature weights and the viewpoint-integrated cluster centers are used, and the centroid of the initialized cluster centers serves as the reference point for inter-cluster separation. The objective function is expressed in (16), subject to the usual normalization constraints on the membership degrees and on the fuzzy feature weights of each cluster.
Here, the cluster centers fused with the viewpoint are used; the center in the designated row is replaced with the point of maximum $\gamma$ value, namely the viewpoint. The reference point of the separation between clusters is given by (17), the mean of the initialized cluster centers. The fuzzy weight $w_{ik}$ is the weight of the k-th feature of the i-th cluster, raised to a fuzzy coefficient that is generally greater than 1. One parameter is used to implement the fuzzy weight penalty, and another parameter adaptively adjusts the value of the inter-cluster separation term.
Note that (16) can be transformed into (19).
The solution process is given below. We use the Lagrangian multiplier method for (19), and it becomes an optimization problem of the following formula:
In order to minimize (20), the following three partial derivative relations need to be satisfied:
For the convenience of expression, we introduce a shorthand notation. Firstly, we give the solution process of the membership degrees. Starting from the stationarity condition with respect to the memberships, we get (24). Because the memberships of each sample sum to 1, it follows that (25) holds. Substituting (25) into (24), we obtain the iterative formula (26) of the membership degree, which involves the value of the separation parameter. Note that when this value is very large, the membership may become negative, which is obviously not what we want. For this reason, we can naturally impose a constraint that bounds this parameter by a constant factor.
Secondly, we show the solution process of the cluster centers. Starting from the stationarity condition with respect to the centers, we obtain an intermediate relation, which can be further calculated to get (30). Regarding the solution, there are two cases. Case 1: by (17), the center in the designated row is the viewpoint. Case 2: otherwise, the center update formula (32) follows from (30).
Finally, we provide the solving process of the feature weights. Starting from the stationarity condition with respect to the weights, we get (35). Similarly, for the convenience of expression, we introduce the shorthand (37). Substituting (37) into (35), we obtain the weight update formula (38).
Among them, a penalty term parameter is involved, which reflects the contribution of each attribute to the cluster centers, and the selection of its value is critical to the clustering performance. In actual processing, we can define it as the ratio of the sum of the first part of (19) to the fuzzy feature weights, scaled by a positive constant K.
So far, the derivation of the cluster centers, the membership degree matrix, and the weight matrix of the VSFCM algorithm has been fully explained.
3.3. Framework of the VSFCM Algorithm
The execution process of the CDCI method and the VSFCM algorithm is shown respectively in Algorithms 1 and 2.
Algorithm 1 Cut-off Distance-induced Clustering Initialization (CDCI)
Input: Data set X, number of clusters c. Output: cluster center matrix V.
1. procedure CDCI (Data X, Number c)
2. Calculate the cut-off distance according to (12);
3. Calculate the local density ρ of each point according to (14);
4. Calculate the distance δ of each point according to (9);
5. Calculate γ according to (15);
6. Sort γ from large to small, and get the corresponding sequence after the original data set is re-sorted by γ;
7. Take the point corresponding to the largest γ as the first cluster center and add it to V;
8. Let tt = 1, k = 2;
9. Repeat
10. // If the distance between the tt-th point of the sorted sequence and any selected cluster center is smaller than the cut-off distance, then skip it directly
11. Otherwise, take it as the k-th cluster center and add it to V;
12. k = k + 1;
13. tt = tt + 1;
14. Until k > c;
15. return V;
16. end procedure
Algorithm 2 Viewpoint-driven subspace fuzzy c-means (VSFCM) algorithm
Input: Data set X, number of clusters c. Output: Membership matrix U, cluster center matrix V, weight matrix W.
1. procedure VSFCM (Data X, Number c)
2. Set the threshold ε and the maximum number of iterations T;
3. Initialize the cluster centers by CDCI and take the point with the highest density as the viewpoint;
4. Do
5. t = t + 1;
6. Update the membership matrix U by calculating memberships using (26);
7. Update the cluster center matrix V by calculating centers using (32);
8. Update the weight matrix W by calculating weights using (38);
9. While the change of the cluster centers exceeds ε and t < T;
10. return U, V, W;
11. end procedure
4. Experimental Results
In this section, we validate the clustering ability of the proposed VSFCM algorithm through a series of experiments. In the comparative experiments, five relevant algorithms are selected: V-FCM, SCAD, ESSC, FRFCM, and DVPFCM. Of the two SCAD versions, the first has a structure that is more complex and closer to our algorithm, so we choose it for comparison. In terms of initialization methods, we compare our method with the HDCCI algorithm and present the comparison visually.
The testing data sets include two artificial data sets, 10 UCI machine-learning data sets, and the Olivetti face database. The artificial data sets DATA1 and DATA2 are composed of Gaussian-distributed points obtained by generation tools. The tested UCI data sets [34] include Iris, Wireless Indoor Localization, Wine, Breast Cancer Wisconsin, Seeds, Letter Recognition (A, B), Ionosphere, SPECT Heart, Aggregation, and Zoo. These UCI data sets are popular and representative in the field of machine learning. The Olivetti face database is a collection of 40 people with 10 pictures per person; we select the 200 pictures corresponding to 20 people, each picture having 1024 attributes.
Table 2 lists the basic information of the two artificial data sets and the 10 UCI machine-learning data sets, including the total numbers of instances, features, and reference clusters. For all experiments, default values are selected for the parameters. For convenience, the cut-off distance used in the CDCI algorithm is calculated from (12).
4.1. Evaluation Indicator
We use two types of evaluation indicators: hard clustering effectiveness indicators and soft clustering effectiveness indicators. Moreover, the superscript “(+)” indicates that the larger the value of the indicator, the better the clustering performance. The superscript “(−)” indicates the opposite meaning.
Since fuzzy clustering algorithms divide data points into the clusters with the highest corresponding membership, hard clustering indicators can also be used to evaluate their clustering effects.
The hard clustering effectiveness indicators selected are the following three kinds.
(1) Classification rate
The classification rate (CR) [32] reflects the proportion of data that are correctly classified. The formula is:

$$\mathrm{CR} = \frac{1}{n} \sum_{i=1}^{c} a_i$$

in which $a_i$ is the number of objects correctly found in the $i$-th cluster, and $n$ is the number of all objects in the data set.
(2) Normalized mutual information
Normalized mutual information (NMI) [35] reflects the statistical information shared between two partitions:

$$\mathrm{NMI} = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} \log \frac{n \, n_{ij}}{n_i^{(1)} n_j^{(2)}}}{\sqrt{\left(\sum_{i=1}^{I} n_i^{(1)} \log \frac{n_i^{(1)}}{n}\right)\left(\sum_{j=1}^{J} n_j^{(2)} \log \frac{n_j^{(2)}}{n}\right)}}$$

Here the two partitions of the data set have $I$ and $J$ clusters, respectively; $n_{ij}$ is the number of objects shared by the $i$-th cluster of the first partition and the $j$-th cluster of the second, while $n_i^{(1)}$ and $n_j^{(2)}$ are the numbers of objects contained in those clusters.
(3) Calinski–Harabasz indicator
The Calinski–Harabasz (CH) indicator [36] measures clustering quality from the perspective of within-cluster distance and between-cluster dispersion:

$$\mathrm{CH} = \frac{\sum_{i=1}^{c} n_i \|v_i - \bar{v}\|^2 / (c-1)}{\sum_{i=1}^{c} \sum_{x_j \in C_i} \|x_j - v_i\|^2 / (n-c)}$$

Here $n_i$ corresponds to the number of objects contained in cluster $C_i$, and $\bar{v}$ is the average of the cluster centers.
The following two kinds of soft clustering effectiveness indicators are used.
(4) The extension indicator of ARI
The EARI indicator [37] is a fuzzy extension of the adjusted Rand indicator (ARI) [38,39]; its purpose is to describe the similarity of two clustering results.
Assuming that R and Q are two hard partitions, corresponding to k and v clusters respectively, the ARI indicator involves the following quantities:
e represents the number of pairs of data belonging to the same class in R and, meanwhile, to the same cluster in Q.
f represents the number of pairs of data belonging to the same class in R and, meanwhile, to different clusters in Q.
g represents the number of pairs of data belonging to different classes in R and, meanwhile, to the same cluster in Q.
h represents the number of pairs of data belonging to different classes in R and, meanwhile, to different clusters in Q.
According to the above definitions, given two membership matrices B1 and B2 corresponding to two soft partitions, e, f, g, and h can be rewritten in fuzzy form. Among them, one set collects the data pairs belonging to the same cluster in B1, and Y is the analogous set for B2; another set collects the data pairs that do not belong to the same cluster in B1, and Z is the analogous set for B2. The fuzzy counts are combined with a t-norm and an s-norm, for which min and max, respectively, are often adopted in actual processing. EARI is then obtained from these fuzzy counts by the ARI formula.
(5) Xie–Beni indicator
The Xie–Beni (XB) indicator [40] is a highly recognized indicator of the effectiveness of fuzzy clustering, whose formula is as below:

$$\mathrm{XB} = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{2} \|x_j - v_i\|^2}{n \cdot \min_{i \neq k} \|v_i - v_k\|^2}$$
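For reference, the following sketch computes XB from the formula above and CR via an optimal matching between found clusters and reference classes, and shows how NMI and CH can be obtained from scikit-learn; the matching-based CR helper and all function names are our assumptions rather than the exact procedures used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, calinski_harabasz_score

def xie_beni(X, U, V):
    """XB: fuzzy compactness over minimal center separation; smaller is better."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # ||x_j - v_i||^2, (c, n)
    compact = ((U ** 2) * d2).sum()
    vd2 = ((V[None, :, :] - V[:, None, :]) ** 2).sum(-1)  # ||v_i - v_k||^2
    np.fill_diagonal(vd2, np.inf)                         # ignore i == k
    return compact / (X.shape[0] * vd2.min())

def classification_rate(true_labels, pred_labels, c):
    """CR: fraction correctly classified under the best cluster-class matching."""
    cont = np.zeros((c, c))
    for p, t in zip(pred_labels, true_labels):
        cont[p, t] += 1
    rows, cols = linear_sum_assignment(-cont)             # maximize matched counts
    return cont[rows, cols].sum() / len(true_labels)

# Hard labels induced by fuzzy memberships (highest membership wins):
# labels = U.argmax(axis=0)
# nmi = normalized_mutual_info_score(true_labels, labels)
# ch = calinski_harabasz_score(X, labels)
```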
4.2. Artificial Data Sets
Table 2 gives the basic information of all the data sets used in the experiment. Let us first discuss the artificial data sets DATA1 and DATA2.
Figure 4 is a data distribution diagram of DATA1.
As shown in Figure 4, the clusters of three colors correspond to the three classes of DATA1, the red triangle “△” represents the cluster center of each cluster, and the black solid square “■” represents the reference point for inter-cluster separation; the cluster centers here are obtained by using our proposed CDCI method. We express the weight of the full space as a single weight vector and the weights of the three subspaces as per-cluster weight vectors, and we compute the corresponding inter-cluster separations under each weighting.
In this example, the separation between clusters in the subspace is significantly greater than the separation between clusters in the full space, which means that the subspace clustering has higher inter-cluster separation and better clustering effect than the full-space clustering. In particular, the separation between full-space clusters and the separation between subspace clusters can also be used as separations under different distance metrics.
Compared with the ESSC algorithm, which is also a subspace algorithm, our algorithm improves the separation term: the data-set center $v_0$ in its formula is replaced with the mean of the initialized cluster centers. Moreover, the subspace separation effect of our algorithm is slightly better and more stable. When faced with more complex, scattered, or irregular data sets, our algorithm can still maintain a good result. In a word, the proposed algorithm achieves better separation between clusters and thus better results.
Figure 5 shows the distribution of each dimension of DATA2, and Table 3 gives the weight distribution obtained by each algorithm on DATA2.
From Table 3, we can find that the weights of the ESSC and SCAD1 algorithms fluctuate little around 0.3333, whereas our VSFCM algorithm differentiates the weights much more strongly. Taking Cluster 1 as an example, the three ESSC weights, 0.3433, 0.3321, and 0.3246, are very close, while the weights obtained by the VSFCM algorithm are 0.4394, 0.2980, and 0.2626. From the VSFCM weights, it can be seen that the first feature contributes most to the clustering result, followed by the second feature, with the third feature contributing least.
Table 4 shows the clustering results of the algorithms on the two artificial data sets, DATA1 and DATA2. Obviously, the performance of our algorithm is better than that of the other algorithms. Moreover, mainly owing to the contribution of the weight distribution, the values of the five clustering indicators for the subspace clustering algorithms are clearly higher than those of the other algorithms.
4.3. UCI Data Sets
The UCI data sets adopted here include Iris, Wireless Indoor Localization, Wine, Seeds, Breast Cancer Wisconsin, Letter Recognition (A, B), SPECT Heart, Aggregation, and Zoo. Among them, Breast Cancer Wisconsin is a common medical data set in machine learning, which can be divided into two clusters.
Figure 6 shows its Sammon mapping, where the yellow and purple point sets represent the two classes of the data set, respectively.
Figure 7 shows the ρ–δ distribution diagrams of Breast Cancer for the two cluster center initialization methods. In Figure 7, we can see that the first initial cluster center is well determined, namely the data point in the upper right corner. In the selection of the second initial cluster center, the boundary between the candidate points of the CDCI algorithm and the other data points is very clear, while that of the HDCCI algorithm is fuzzy; thus, when using HDCCI to select the initial cluster centers, it is easy to choose a wrong one. In contrast, the initial cluster centers selected by our CDCI method conform better to the characteristics of ideal cluster centers.
Table 5 reports the average results of each clustering algorithm (V-FCM, SCAD1, ESSC, FRFCM, DVPFCM, and ours) over 20 runs on the selected UCI data sets. The evaluation indicators are the three hard indicators (CR, CH, and NMI) and the two fuzzy indicators (EARI and XB) described above. According to these five indicators, the performance of the proposed VSFCM algorithm is evaluated and compared with the three existing subspace clustering algorithms and the two viewpoint-based fuzzy clustering algorithms. To make the results easier to read, the best result is shown in bold and the second-best is underlined.
As can be seen from Table 5, the six algorithms fall into three grades: the worst are V-FCM and DVPFCM; FRFCM and ESSC perform comparably to or better than SCAD1; and the best is the proposed VSFCM algorithm. Although V-FCM and DVPFCM are generally inferior to the other three comparison algorithms, they achieve good clustering performance on the Iris and Letter Recognition (A, B) data sets as measured by NMI and CR, respectively. This shows that no single algorithm is always better than the others on all data sets.
Through comparison, we further notice that the best clustering performance indicated by NMI and CR is not always consistent with the best clustering performance indicated by other indicators. That is, the clustering performance with higher NMI and CR values does not necessarily possess higher XB, CH, and EARI values. Therefore, it is necessary to comprehensively evaluate the performance of clustering algorithms with different metrics.
From
Table 5, the following conclusions can be drawn:
First of all, because small data sets have low dimensionality, the weights have little impact on the results, so V-FCM and DVPFCM (the viewpoint-oriented fuzzy clustering algorithms) can get better results on these data sets. In addition, the viewpoint can guide the clustering algorithm in a more correct direction, so the NMI and CR indicators of V-FCM and DVPFCM perform better on some data sets.
Secondly, among the weighted fuzzy clustering algorithms, SCAD, ESSC, and FRFCM, with their more complex structures, have a great advantage in processing multi-dimensional data sets, owing to the clustering efficiency and effect improved by the weight distribution.
Finally, our proposed VSFCM algorithm is obviously superior to the other algorithms mentioned above, because it integrates the advantages of both types of algorithms and successfully improves the clustering effect.
In general, our proposed VSFCM algorithm is more ideal for obtaining initial cluster centers and has better performance in various clustering evaluation indicators.
Table 6 reports the dimension-weight distribution of the subspace algorithms on the Wine data set, listing the weight values of each cluster for the 13 features of the data. It can be seen from Table 6 that different subspace clustering algorithms may attach different degrees of importance to the same feature. Similar results are obtained on other UCI data sets. For multi-dimensional data sets, a reasonable distribution of dimension weights can improve the accuracy and efficiency of clustering, so the proposed VSFCM algorithm achieves more ideal clustering results.
4.4. The Olivetti Face Database
The 400 pictures in the Olivetti face database were collected in different places, with different light intensities and different emotional states of the same person (eyes open or closed, smiling or not smiling). Our experiment uses the pictures of the first 20 people. Figure 8 shows the histogram of the running results, and Table 7 lists the clustering results of the relevant algorithms on this database.
In Table 7, the CR value of our VSFCM algorithm on this data set reaches 0.9850, while those of FRFCM, ESSC, SCAD1, DVPFCM, and V-FCM are significantly lower: 0.8300, 0.9450, 0.7500, 0.8350, and 0.6950, respectively. As can be seen from the histogram in Figure 8, the indicator values of our algorithm are obviously better than those of the other algorithms. When processing the Olivetti face database, the CR, NMI, and EARI values of our algorithm are all close to 1. This reveals that the weight distribution of our algorithm plays an important role in processing the high-dimensional data (200 × 1024): it reduces the weights of less important dimensions while increasing the weights of relatively important ones, improving the accuracy and efficiency of the algorithm. Moreover, the proposed algorithm introduces the inter-cluster separation term to increase the distance between clusters, and the initial cluster centers obtained by the CDCI method are closer to the real ones, which prevents the VSFCM algorithm from falling into a local optimum or diverging during iteration.
4.5. Time Complexity
Table 8 and
Table 9, respectively, present the average number of iterations of each algorithm and the average calculation time per run (each algorithm was executed 50 times). The number in parentheses indicates the ranking of the algorithm, and Arank indicates the average ranking of the algorithm. Note that there is no iterative process in the HDCCI algorithm, so its relevant statistical data is not needed here.
It can be seen from these two tables that the VSFCM algorithm ranks third in average number of iterations and second in average iteration time. Compared with the other clustering algorithms, VSFCM generally requires fewer iterations and less time, which shows that the convergence speed of our proposed algorithm is relatively fast.
4.6. Discussion
This study proposes a new and more effective cut-off distance, using high-density points to achieve the initialization of the cluster centers and the selection of the viewpoint. This effectively restrains the interference of noise, because an ideal cluster center has large values of both ρ and δ, while the ρ value of a noise point is small. Therefore, the cluster center initialization method CDCI in this study is better than DPC and HDCCI (the initialization method of DVPFCM).
In addition, we test the performance of our proposed VSFCM algorithm and other algorithms on two artificial data sets, 10 UCI data sets, and the Olivetti face database. Through comparison, it is found that the proposed VSFCM algorithm performs better than V-FCM, DVPFCM, SCAD1, ESSC, and FRFCM in five indicators, and the distribution of feature weights of our algorithm is more in line with the characteristics of the data.
Finally, we compare the average number of iterations and the running time of each algorithm and find that the VSFCM algorithm generally requires fewer iterations and less time, which leads to the conclusion that it converges faster.
From a deeper perspective, the proposed VSFCM algorithm has better performance due to the following aspects:
First, the proposed cluster center initialization method CDCI can get the initial cluster center close to the real data structure through the new cut-off distance. This not only prevents the iterative divergence of the algorithm but also speeds up the convergence of the algorithm and helps improve the accuracy of the algorithm;
Second, as part of the data structure of the VSFCM algorithm, the viewpoint serves the knowledge-induced clustering process. Combined with the data, it forms a clustering driving force driven by both knowledge and data, inducing the algorithm to obtain more accurate results. Furthermore, introducing the viewpoint into subspace fuzzy clustering can better guide the clustering process in each cluster subspace;
Third, the fuzzy weights are introduced, which can overcome the influence of outliers on cluster analysis and speed up convergence. Our algorithm adds a weight to each feature of each cluster to indicate its degree of contribution to the clustering and assigns smaller weights to features dominated by noise points, reducing their participation in clustering and thus weakening their impact on the clustering results;
Finally, the separation term between clusters is introduced into the objective function. The reference point for separation between clusters is the average value of the initial cluster centers, which is combined with the distribution of weights to improve the accuracy of the algorithm by enhancing the ability to process data spatial distance.
Compared with the ESSC algorithm in [28], the proposed VSFCM algorithm has the following advantages:
The reference point for inter-cluster separation has been improved. In this study, the mean of the initialized cluster centers is used instead of the data-set center $v_0$. When we initialize the cluster centers, outliers are automatically excluded, which weakens their influence on the reference point so that the clusters can be separated more clearly;
We use the viewpoint to guide the convergence of each projection subspace of soft subspace clustering, which conforms to the data structure of clustering and accelerates the clustering process.
Compared with the DVPFCM algorithm of [32], the proposed VSFCM algorithm has the following advantages:
The cut-off distance is optimized. The selection of the cut-off distance is very important and directly affects the clustering accuracy. Generally, the cut-off distance should be selected so that the sphere of the cut-off radius contains 1% to 2% of the total number of data points. However, the cut-off radius formula of HDCCI has no rigorous scientific proof, which makes it difficult for the initialized cluster centers to fully conform to the concept of DPC. In contrast, the improved cut-off distance achieves better results while meeting the DPC recommendation;
Fuzzy weights are introduced. By assigning smaller weights to features with a larger proportion of noise points and outliers, their influence on the clustering results can be indirectly weakened;
Subspace clustering is employed. When faced with high-dimensional data, traditional clustering algorithms have no effective processing measures, whereas an algorithm combined with subspace clustering adapts to such data better. By continuously adjusting the weight of each subspace, the contribution of each subspace to the clustering can be accurately described, making the clustering results better.
5. Conclusions
This paper has proposed a new subspace clustering algorithm, which is named viewpoint-driven subspace fuzzy c-means algorithm (VSFCM), and has achieved good clustering results. The main work and contributions of this paper are summarized as follows:
First of all, in view of the problem that a large number of clustering algorithms are sensitive to the initialization of cluster centers, we propose a new cut-off distance within the framework of the DPC algorithm and further provide the cut-off distance-induced clustering initialization method CDCI, which is used as an initialization strategy for cluster centers and also serves as a new viewpoint selection strategy. It makes the initial cluster centers closer to the true ones, which not only improves the convergence speed but also promotes the clustering accuracy to a certain extent.
Secondly, taking the viewpoint obtained by CDCI as a reflection of domain knowledge, we propose a fuzzy clustering idea driven by both knowledge and data. Subsequently, we introduce the subspace clustering mode and the fuzzy feature weight processing mechanism and propose the inter-cluster separation formula optimized with the viewpoint. On this basis, we establish the viewpoint-driven subspace fuzzy c-means algorithm (VSFCM). The viewpoint helps guide the clustering algorithm to discover the real data structure and thus obtain better clustering results. Additionally, the introduced inter-cluster separation term can maximize the distance between cluster centers in real time, achieve maximum separation between clusters, and also optimize the internal mechanism of clustering.
Finally, by applying our proposed VSFCM algorithm and comparison algorithms (V-FCM, SCAD, ESSC, FRFCM, and DVPFCM) to the three types of data sets, we can see that the proposed VSFCM algorithm performs best in terms of the five indicators, and exhibits stronger stability.
Our clustering algorithm has a certain degree of noise resistance [5,41,42], but its effectiveness is not outstanding and can be further improved in the future. Moreover, there is room for improvement in the number of iterations and the running time of our algorithm, which can be reduced by optimizing its structure and parameter settings.
In addition, the work of this study can be extended to the clustering of granular information [30,43,44,45], which will offer new directions for data analysis. Moreover, we can develop our fuzzy clustering algorithm in the field of fuzzy reasoning [46,47] and carry out clustering of fuzzy rules.