Article

One-Step Clustering with Adaptively Local Kernels and a Neighborhood Kernel

1 School of Computer Science and Engineering, Guangxi Normal University, 15 Yucai Road, Guilin 541004, China
2 School of Mathematics and Statistics, Guangxi Normal University, 15 Yucai Road, Guilin 541004, China
3 Department of Psychiatry, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
* Authors to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3950; https://doi.org/10.3390/math11183950
Submission received: 28 August 2023 / Revised: 14 September 2023 / Accepted: 15 September 2023 / Published: 17 September 2023

Abstract

Among the methods of multiple kernel clustering (MKC), some adopt a neighborhood kernel as the optimal kernel, and some use local base kernels to generate an optimal kernel. However, these two approaches have not been combined to leverage their complementary advantages, which limits the quality of the optimal kernel. Furthermore, most existing MKC methods require a two-step strategy: first learn an indicator matrix, then execute clustering on it. This does not guarantee the optimality of the final results. To overcome the above drawbacks, one-step clustering with adaptively local kernels and a neighborhood kernel (OSC-ALK-ONK) is proposed in this paper, where the two approaches are combined to produce an optimal kernel. In particular, the neighborhood kernel improves the expression capability of the optimal kernel and enlarges its search range, while local base kernels avoid the redundancy of base kernels and promote their variety. Accordingly, the quality of the optimal kernel is enhanced. Further, a soft block diagonal (BD) regularizer is utilized to encourage the indicator matrix to be BD, which helps obtain explicit clustering results directly and achieve one-step clustering, thus overcoming the disadvantage of the two-step strategy. In addition, extensive experiments on eight data sets and comparisons with six clustering methods show that OSC-ALK-ONK is effective.

1. Introduction

The data in real problems usually contain nonlinear structures. When clustering these data, it is necessary to use a clustering method that can capture the nonlinear structure. Multiple kernel clustering (MKC) has the advantage of not only processing nonlinear data but also fusing the information of multiple given kernels to yield an optimal kernel. Therefore, it attracts extensive attention from scholars. Recently, many MKC methods for generating an optimal kernel have been proposed.
One strategy is to form the optimal kernel as a linear combination of the given kernels. The weights of the given kernels in [1,2] are learned with an $\ell_1$-norm regularization term, while the weights in [3,4] come from an $\ell_2$-norm regularization term. More generally, an $\ell_p$-norm regularization term [5,6] is used to optimize the weights of the given kernels and learn an optimal kernel, which makes the choice of regularization more flexible. In addition, many studies adopt the linear-combination strategy to learn the optimal kernel [7,8,9,10,11]. In particular, a min-max model is utilized in a simple MKC method (SimpleMKKM) to learn the kernel coefficients and update the indicator matrix [12]. It is worth noting that this strategy rests on the assumption that the optimal kernel lies in the set of linear combinations of the given kernels. This assumption may not hold in practice, and it restricts the search scope of the optimal kernel and degrades its quality.
In order to expand the search scope of the optimal kernel, a neighborhood kernel is used in [13,14,15]. The optimal kernel in [13,14] is learned from a neighborhood of the consensus kernel, where a low-rank constraint [14] is applied to the neighborhood kernel to reveal the clustering structure among samples. In particular, base neighbor kernels with a block diagonal structure [15] are produced by defining the neighbor kernel of the base kernels, and an optimal kernel is then obtained by linearly combining these neighbor kernels. However, the neighborhood kernels in the above literature are generated from all the base kernels; since the correlation between the given kernels is not taken into account, this leads to redundancy among the base kernels.
Motivated by the correlation between given kernels, methods that select local base kernels to generate an optimal kernel have emerged. This can avoid the redundancy of the given kernels and promote their diversity. On the basis of SimpleMKKM [12], a localized SimpleMKKM is proposed by considering the similarity of the k-nearest neighbors between samples [16]. By selecting subsets from a predefined kernel pool to determine local kernels, an MKC method that uses representative kernels (MKKM-RK) to learn an optimal kernel is presented [17]. In [18], a matrix-induced regularization is applied in an MKC method (MKKM-MR) to measure the correlation between each pair of kernels when generating an optimal kernel, where kernels with strong correlation are assigned smaller coefficients and those with weak correlation are assigned larger coefficients. By constructing an index set of samples to select local base kernels, the optimal kernel is relaxed into a neighborhood of the combination of the local base kernels [19].
In recent years, various kernel evaluation methods for model selection have emerged, for example, kernel alignment [20], kernel polarization [21], and kernel class separability [22]. Among them, kernel alignment is one of the most commonly used evaluation methods on account of its simplicity, efficiency, and theoretical support. For example, centered kernel alignment is incorporated into an MKC method in [23]. In [24], a local kernel alignment strategy is proposed by requiring only that a sample align with its k-nearest neighbors. Further, global and local structure alignment, i.e., the internal structure of the data, is preserved in [25].
The research mentioned above shows that MKC has been widely studied. However, most of these methods adopt either a neighborhood kernel or local base kernels, but not both. Thus, they cannot simultaneously broaden the search area of the optimal kernel and promote the variety of the given kernels, and therefore cannot ensure the quality of the optimal kernel. In addition, most of the above methods require two steps: first obtain the indicator matrix, and then perform clustering. This two-step strategy does not guarantee the reliability and optimality of the final results, because errors from each step propagate and accumulate.
In the ideal case, there is only one nonzero element in each row of the indicator matrix, and the column in which that nonzero element resides corresponds to the cluster to which the sample belongs. That is, the indicator matrix in the ideal case directly displays the clustering results. In this case, multiplying the indicator matrix by its transpose yields a block diagonal (BD) matrix [23]. However, in the actual clustering process, the indicator matrix usually does not take this ideal form, so clustering results can only be obtained after a further clustering step is performed on the indicator matrix. This is why most MKC methods adopt the two-step operation, whose shortcomings have been mentioned above. In this case, the product of the indicator matrix and its transpose is not BD. Nevertheless, we can think in reverse: if the product is encouraged to be BD, the indicator matrix is guided towards the ideal state, and clustering results are obtained directly.
Inspired by the above idea, we impose a BD constraint on the product of the indicator matrix and its transpose to guide it to be BD, which aims to obtain clustering results directly from the indicator matrix, i.e., one-step clustering. Accordingly, we propose one-step clustering with adaptively local kernels and a neighborhood kernel (OSC-ALK-ONK) in this paper. This method not only merges the advantages of local base kernels and the neighborhood kernel but also achieves one-step clustering. The process of generating a neighborhood kernel is illustrated in Figure 1.
Here are the main contributions of this paper.
  • By considering the correlation between base kernels, a simple strategy for selecting local base kernels is used to produce a consensus kernel, which adjusts adaptively to avoid the redundancy of given kernels and promote variety.
  • By selecting a neighborhood kernel of the consensus kernel as the optimal kernel, the expression capability of the optimal kernel is improved and its search scope is expanded.
  • A soft BD regularizer is used to encourage the product of the indicator matrix and its transpose to be BD, which means that the clustering results are obtained from the indicator matrix directly. Therefore, one-step clustering is realized, which ensures the final clustering results are optimal.
  • A four-step iterative algorithm, including the Riemannian conjugate gradient method in [26], is used to overcome the difficulty of solving the model.
  • Extensive experimental results on eight benchmark datasets, together with comparisons against six clustering methods, indicate that OSC-ALK-ONK is effective.
The remaining sections of the paper are organized as follows. Section 2 presents the notations used and the background of MKKC. In Section 3, the proposed OSC-ALK-ONK method and its optimization are introduced in detail. Section 4 presents the experimental results and discussion. Conclusions are stated in Section 5.

2. Related Work

2.1. Notations

The details of notations used in this paper are listed in Table 1.

2.2. Kernel k-Means Clustering (KKC)

Let $X = \{x_i\}_{i=1}^{n}$ be a set of samples and $\phi(\cdot): X \rightarrow \mathcal{H}$ be a kernel mapping from the original space $X$ to a reproducing kernel Hilbert space $\mathcal{H}$. Kernel k-means clustering (KKC) is usually expressed as

$$\min_{Z \in \{0,1\}^{n \times k}} \sum_{i=1}^{n} \sum_{c=1}^{k} Z_{ic} \|\phi(x_i) - \mu_c\|_2^2 \quad \text{s.t.} \quad \sum_{c=1}^{k} Z_{ic} = 1, \tag{1}$$

where $Z \in \{0,1\}^{n \times k}$ is an assignment matrix, $k$ is the number of clusters, and

$$\mu_c = \frac{1}{n_c} \sum_{i=1}^{n} Z_{ic}\, \phi(x_i), \qquad n_c = \sum_{i=1}^{n} Z_{ic} \tag{2}$$

are the centroid and the size of the $c$-th ($1 \le c \le k$) cluster.

Denoting the design matrix as $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_n)] \in \mathbb{R}^{d \times n}$ and the centroid matrix as $U = [\mu_1, \mu_2, \ldots, \mu_k] \in \mathbb{R}^{d \times k}$, problem (1) can be rewritten as

$$\min_{Z \in \{0,1\}^{n \times k}} \mathrm{Tr}\big((\Phi - U Z^T)^T(\Phi - U Z^T)\big) \quad \text{s.t.} \quad Z \cdot \mathbf{1}_k = \mathbf{1}_n. \tag{3}$$

Taking $L = \mathrm{diag}([n_1^{-1}, n_2^{-1}, \ldots, n_k^{-1}])$, we have $Z^T Z = L^{-1}$ and $U = \Phi Z L$. Taking a kernel matrix $K$ with $K_{ij} = \phi(x_i)^T \phi(x_j)$, problem (3) can be simplified as

$$\min_{Z \in \{0,1\}^{n \times k}} \mathrm{Tr}\big(K - K Z L Z^T\big) \quad \text{s.t.} \quad Z \cdot \mathbf{1}_k = \mathbf{1}_n. \tag{4}$$

According to the matrix decomposition, problem (4) is equivalent to

$$\min_{Z \in \{0,1\}^{n \times k}} \mathrm{Tr}(K) - \mathrm{Tr}\big(L^{\frac{1}{2}} Z^T K Z L^{\frac{1}{2}}\big) \quad \text{s.t.} \quad Z \cdot \mathbf{1}_k = \mathbf{1}_n. \tag{5}$$

The difficulty of solving (5) comes from the discreteness of $Z$. To overcome this difficulty, the discrete $Z$ is usually relaxed to take arbitrary real values, and the relaxed solution is treated as an approximate solution of (5). Specifically, denoting $H = Z L^{\frac{1}{2}}$, the following relaxed form of (5) is derived:

$$\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}\big(K (I_n - H H^T)\big) \quad \text{s.t.} \quad H^T H = I_k, \tag{6}$$

where $I_k$ is the $k$-order identity matrix. The optimal $H$ for (6) is made up of the $k$ eigenvectors corresponding to the $k$ largest eigenvalues of $K$.
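As a concrete illustration of this relaxed solution, the following NumPy sketch (our own illustration, not code from the paper; the function name is hypothetical) computes H from a given kernel matrix K via the eigenvectors of its k largest eigenvalues.

```python
import numpy as np

def kkc_relaxed_indicator(K, k):
    """Relaxed kernel k-means (6): H is built from the k leading eigenvectors of K."""
    eigvals, eigvecs = np.linalg.eigh(K)   # ascending eigenvalues of the symmetric K
    H = eigvecs[:, -k:]                    # eigenvectors of the k largest eigenvalues
    return H                               # satisfies H^T H = I_k by construction
```

In the usual two-step pipeline, a final k-means step on the rows of H turns this relaxed solution into explicit cluster labels; the BD regularizer introduced later is what lets the proposed method skip that step.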

2.3. Multiple Kernel k-Means Clustering (MKKC)

In MKC, a consensus kernel is computed as

$$K_w = \sum_{p=1}^{m} w_p^2 K_p, \tag{7}$$

where $K_p$ is the p-th base kernel, $w_p$ is the p-th component of the weight vector $w = [w_1, w_2, \ldots, w_m]^T$, and m is the number of base kernels.

Replacing $K$ in (6) with $K_w$, the model of MKKC is

$$\min_{H \in \mathbb{R}^{n \times k},\, w \in \mathbb{R}_+^{m}} \mathrm{Tr}\big(K_w (I_n - H H^T)\big) \quad \text{s.t.} \quad H^T H = I_k,\ w^T \mathbf{1}_m = 1. \tag{8}$$

Problem (8) can be solved by updating $H$ and $w$ alternately: (i) update $H$ with $w$ fixed, i.e., solve a problem of the same form as (6); (ii) update $w$ with $H$ fixed, i.e., solve the quadratic programming problem

$$\min_{w \in \mathbb{R}_+^{m}} \sum_{p=1}^{m} w_p^2\, \mathrm{Tr}\big(K_p (I_n - H H^T)\big) \quad \text{s.t.} \quad w^T \mathbf{1}_m = 1. \tag{9}$$
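For illustration, one alternating pass of MKKC could look like the sketch below (our own sketch, not the authors' code): for fixed H, the quadratic program (9) has the closed-form solution $w_p \propto 1 / \mathrm{Tr}(K_p(I_n - HH^T))$ obtained from its Lagrangian, provided every trace term is positive.

```python
import numpy as np

def mkkc_update_w(kernels, H):
    """Update w for fixed H in (9); `kernels` is a list of n x n base kernel matrices."""
    n = H.shape[0]
    M = np.eye(n) - H @ H.T
    a = np.array([np.trace(Kp @ M) for Kp in kernels])    # a_p = Tr(K_p (I - H H^T)), assumed > 0
    w = (1.0 / a) / np.sum(1.0 / a)                       # minimizer of sum_p w_p^2 a_p with sum_p w_p = 1
    K_w = sum(wp**2 * Kp for wp, Kp in zip(w, kernels))   # consensus kernel (7)
    return w, K_w
```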

3. Proposed Method

3.1. Localized Kernel Selection

For a series of base kernels $K_1, K_2, \ldots, K_m$, considering the relationship between base kernel pairs, we define

$$y_{pq} = \begin{cases} 0, & \text{if } \big|\, \|G - K_p\|_F^2 - \|G - K_q\|_F^2 \,\big| < \delta, \\ 1, & \text{otherwise.} \end{cases} \tag{10}$$

For the matrix $G$ and a given positive parameter $\delta$, on the one hand, $\big|\, \|G - K_p\|_F^2 - \|G - K_q\|_F^2 \,\big| < \delta$ means that $K_p$ and $K_q$ are both in the neighborhood of $G$, i.e., they have large similarity. In this case, we set $y_{pq} = 0$, which aims to discard base kernels with high similarity. On the other hand, if $\big|\, \|G - K_p\|_F^2 - \|G - K_q\|_F^2 \,\big| < \delta$ does not hold, we set $y_{pq} = 1$, which means that we select base kernels with low similarity to yield the optimal kernel. In summary, (10) can effectively avoid the redundancy of base kernels while maintaining their variety.

Evidently, $y_{pq}$ in (10) reflects the dissimilarity between $K_p$ and $K_q$; then $\sum_{q=1}^{m} y_{pq}$ measures how distinct $K_p$ is from all the $K_q$ ($q = 1, 2, \ldots, m$), and $\mathrm{Tr}(Y^T \mathbf{1}_M)$ is the total of these measures over all pairs $K_p$, $K_q$.

Let $w_p = \frac{1}{\mathrm{Tr}(Y^T \mathbf{1}_M)} \sum_{q=1}^{m} y_{pq}$; then

$$w_p = \frac{1}{\mathrm{Tr}(Y^T \mathbf{1}_M)} \sum_{q=1}^{m} y_{pq} = \frac{1}{\mathrm{Tr}(Y^T \mathbf{1}_M)} (Y \cdot \mathbf{1}_m)_p \in [0, 1], \tag{11}$$

and

$$\sum_{p=1}^{m} w_p = \frac{1}{\mathrm{Tr}(Y^T \mathbf{1}_M)} \sum_{p=1}^{m} (Y \cdot \mathbf{1}_m)_p = \frac{\mathrm{Tr}(Y^T \mathbf{1}_M)}{\mathrm{Tr}(Y^T \mathbf{1}_M)} = 1. \tag{12}$$

Thereby, such a $w_p$ can balance the contributions of the different given kernels to generating an optimal kernel.
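The weight computation in (10)-(12) can be sketched in a few lines of NumPy (our own illustration; `delta` and the function name are hypothetical, and the sketch assumes at least one kernel pair is judged dissimilar so that the normalizer is nonzero).

```python
import numpy as np

def adaptive_local_weights(kernels, G, delta):
    """Adaptive local-kernel weights following (10)-(11)."""
    d = np.array([np.linalg.norm(G - Kp, 'fro') ** 2 for Kp in kernels])
    # y_pq = 0 if the two kernels are similarly close to G (redundant pair), else 1
    Y = (np.abs(d[:, None] - d[None, :]) >= delta).astype(float)
    total = Y.sum()                 # equals Tr(Y^T 1_M); assumed nonzero
    w = Y.sum(axis=1) / total       # each w_p lies in [0, 1] and the weights sum to 1
    return w, Y
```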

3.2. Block Diagonal Regularizer

The clustering indicator matrix $H$ in (6) and (8) is not a square matrix. In the ideal case, its elements are

$$H_{ij} = \begin{cases} \frac{1}{\sqrt{n_j}}, & \text{if } x_i \in C_j, \\ 0, & \text{if } x_i \notin C_j, \end{cases} \tag{13}$$

where $x_i$ denotes the i-th sample, $C_j$ denotes the j-th cluster, and $n_j$ is the number of samples in $C_j$. From (13), only one element in each row of $H$ is nonzero, which means the corresponding sample belongs to one and only one cluster. Further, if the samples are arranged from $C_1$ to $C_k$ according to the cluster they belong to, then $HH^T$ is a BD matrix:

$$HH^T = \begin{bmatrix} \frac{1}{n_1}\mathbf{1}_{n_1}\mathbf{1}_{n_1}^T & & & \\ & \frac{1}{n_2}\mathbf{1}_{n_2}\mathbf{1}_{n_2}^T & & \\ & & \ddots & \\ & & & \frac{1}{n_k}\mathbf{1}_{n_k}\mathbf{1}_{n_k}^T \end{bmatrix}. \tag{14}$$

Equation (14) prompts the following idea: if $HH^T$ itself has the structure of (14), then it in turn induces $H$ to have elements as in (13), which means explicit clustering results are obtained directly from $H$. Inspired by this idea, we hope that $HH^T$ possesses the BD property.

Since $HH^T$ is a square matrix, we view it as an adjacency matrix; then, following the definition of the Laplacian matrix in graph theory, its degree matrix $D$ is a diagonal matrix with $d_{ii} = \sum_{j=1}^{n} (HH^T)_{ij}$, i.e.,

$$D = \mathrm{Diag}(HH^T \cdot \mathbf{1}_n), \tag{15}$$

and thus

$$L_{HH^T} = \mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T. \tag{16}$$
There is an important connection between a nonnegative matrix and its Laplacian matrix.

Theorem 1

([27]). For any $A \in \mathbb{R}^{n \times n}$ with $A \ge 0$, the number of connected components (blocks) in $A$ equals the multiplicity $k$ of the eigenvalue 0 of the corresponding Laplacian matrix $L_A$.

Then, $A$ has $k$ connected components if and only if

$$\lambda_i(L_A) \begin{cases} > 0, & i = 1, \ldots, n-k, \\ = 0, & i = n-k+1, \ldots, n, \end{cases}$$

where $\lambda_i(L_A)$ ($i = 1, \ldots, n$) are the eigenvalues of $L_A$ in decreasing order.
Hence, the k-BD representation of $HH^T$ can be given as follows.

Definition 1

([27]). For $HH^T \in \mathbb{R}^{n \times n}$, the k-BD representation is defined as the sum of the k smallest eigenvalues of $L_{HH^T}$, i.e.,

$$\|HH^T\|_{[k]} = \sum_{i=n-k+1}^{n} \lambda_i(L_{HH^T}). \tag{17}$$

From Theorem 1 and Definition 1, $\|HH^T\|_{[k]} = 0$ means that $HH^T$ is k-BD. Minimizing $\|HH^T\|_{[k]}$ therefore encourages $HH^T$ to be BD, so it is natural to view $\|HH^T\|_{[k]}$ as a BD regularizer. Its advantages, such as controlling the number of blocks, being softer than the BD method in [28], and being preferable to the alternatives $\mathrm{Rank}(L_{HH^T})$ or the convex relaxation $\|L_{HH^T}\|_*$, are stated in detail in [27].
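A minimal sketch of evaluating this regularizer (our own illustration, not the authors' code): build the Laplacian of $HH^T$ and sum its k smallest eigenvalues.

```python
import numpy as np

def bd_regularizer(H, k):
    """Soft BD regularizer (17): sum of the k smallest eigenvalues of L_{HH^T}."""
    S = H @ H.T
    L = np.diag(S.sum(axis=1)) - S          # L = Diag(S 1_n) - S
    eigvals = np.linalg.eigvalsh(L)         # eigenvalues in ascending order
    return eigvals[:k].sum()                # zero exactly when HH^T is k-block-diagonal
```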

3.3. Objective Function

So far, combining the localized kernel selection, the block diagonal regularizer, and the choice of a neighborhood kernel as the optimal kernel, we formulate the final model as follows:

$$\begin{aligned} \min_{G, H, K_w} \ & \mathrm{Tr}\big(G(I_n - HH^T)\big) + \frac{\alpha}{2}\|G - K_w\|_F^2 + \frac{\beta}{2}\|HH^T\|_{[k]} \\ \text{s.t.} \ & H^T H = I_k,\ H \in \mathbb{R}^{n \times k},\ G \succeq 0,\ K_w = \sum_{p=1}^{m} w_p K_p, \end{aligned} \tag{18}$$

where $w_p$ is computed according to (10) and (11).

The loss term of the objective function is used to execute multiple kernel clustering, the consensus term is used to choose a neighborhood kernel, and the block diagonal term is used to encourage $HH^T$ to be block diagonal, the aim of which is to obtain an indicator matrix $H$ of the ideal form (13) and thus to implement one-step clustering.

3.4. Optimization

The regularizer $\|HH^T\|_{[k]}$ in problem (18) is non-convex, which makes the problem difficult to solve. For this reason, a theorem is introduced to reformulate $\|HH^T\|_{[k]}$.

Theorem 2

([29], p. 515). Let $L \in \mathbb{R}^{n \times n}$ and $L \succeq 0$. Then

$$\sum_{i=n-k+1}^{n} \lambda_i(L) = \min_{W} \ \langle L, W \rangle \quad \text{s.t.} \quad 0 \preceq W \preceq I,\ \mathrm{Tr}(W) = k. \tag{19}$$

By (17) and (19),

$$\|HH^T\|_{[k]} = \min_{W} \ \langle L_{HH^T}, W \rangle \quad \text{s.t.} \quad 0 \preceq W \preceq I,\ \mathrm{Tr}(W) = k.$$

Since $\langle L, W \rangle = \mathrm{Tr}(L^T W)$, problem (18) is equivalent to

$$\begin{aligned} \min_{G, H, K_w, W} \ & \mathrm{Tr}\big(G(I_n - HH^T)\big) + \frac{\alpha}{2}\|G - K_w\|_F^2 + \frac{\beta}{2}\mathrm{Tr}\big((\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T)^T W\big) \\ \text{s.t.} \ & H^T H = I_k,\ H \in \mathbb{R}^{n \times k},\ G \succeq 0,\ K_w = \sum_{p=1}^{m} w_p K_p,\ 0 \preceq W \preceq I,\ \mathrm{Tr}(W) = k. \end{aligned} \tag{20}$$

Although problem (20) is not jointly convex in $W$, $G$, $K_w$ and $H$, it is convex in each variable when the other variables are fixed. Thus, we optimize each variable alternately to solve (20).

3.4.1. Update W While Fixing G, $K_w$ and H

While $G$, $K_w$ and $H$ are fixed, problem (20) becomes

$$\min_{W} \ \mathrm{Tr}\big((\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T)^T W\big) \quad \text{s.t.} \quad 0 \preceq W \preceq I,\ \mathrm{Tr}(W) = k. \tag{21}$$

The optimal solution of (21) is $W^{t+1} = U U^T$, where $U \in \mathbb{R}^{n \times k}$ is composed of the k eigenvectors associated with the k smallest eigenvalues of $\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T$ [27].
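A direct NumPy rendering of this update (our own illustration) is:

```python
import numpy as np

def update_W(H, k):
    """Solve (21): W = U U^T, with U spanning the k smallest eigenvalues of L_{HH^T}."""
    S = H @ H.T
    L = np.diag(S.sum(axis=1)) - S     # Diag(HH^T 1_n) - HH^T
    _, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    U = eigvecs[:, :k]
    return U @ U.T
```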

3.4.2. Update G While Fixing W, $K_w$ and H

While $W$, $K_w$ and $H$ are fixed, problem (20) takes the following form:

$$\min_{G} \ \mathrm{Tr}\big(G(I_n - HH^T)\big) + \frac{\alpha}{2}\|G - K_w\|_F^2 \quad \text{s.t.} \quad G \succeq 0. \tag{22}$$

Problem (22) can be expressed as

$$\min_{G} \ \frac{1}{2}\|G - B\|_F^2 \quad \text{s.t.} \quad G \succeq 0, \tag{23}$$

where $B = K_w - \frac{1}{\alpha}(I_n - HH^T)$.

The optimal solution of problem (23) is $G = U_B \Sigma_B^{+} V_B^T$, where $B = U_B \Sigma_B V_B^T$ is the SVD of $B$, and $\Sigma_B^{+}$ is a diagonal matrix whose diagonal elements are the positive elements of $\Sigma_B$ and zeros elsewhere [13].
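A minimal sketch of this step, under our own assumption that B is symmetrized first: keeping only the nonnegative part of the spectrum of B is the standard projection onto the positive semi-definite cone and plays the role of the SVD-based rule above.

```python
import numpy as np

def update_G(K_w, H, alpha):
    """Solve (23): project B = K_w - (I_n - H H^T)/alpha onto {G : G is PSD}."""
    n = K_w.shape[0]
    B = K_w - (np.eye(n) - H @ H.T) / alpha
    B = (B + B.T) / 2                            # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(B)
    eigvals = np.clip(eigvals, 0.0, None)        # keep only the nonnegative spectrum
    return (eigvecs * eigvals) @ eigvecs.T
```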

3.4.3. Update $K_w$ While Fixing W, G and H

With $W$, $G$ and $H$ fixed, problem (20) reduces to

$$\min_{K_w} \ \frac{\alpha}{2}\|G - K_w\|_F^2 \quad \text{s.t.} \quad K_w = \sum_{p=1}^{m} w_p K_p. \tag{24}$$

By introducing a penalty parameter $\gamma$, problem (24) can be turned into

$$\min_{K_w} \ \frac{\alpha}{2}\|G - K_w\|_F^2 + \frac{\gamma}{2}\Big\|K_w - \sum_{p=1}^{m} w_p K_p\Big\|_F^2. \tag{25}$$

The closed-form solution of (25) is obtained by setting its derivative with respect to $K_w$ to zero:

$$K_w = \frac{1}{\alpha + \gamma}\Big(\alpha G + \gamma \sum_{p=1}^{m} w_p K_p\Big), \tag{26}$$

where $w_p$ is updated according to the newly generated $Y$, which is learned from the new $G$.
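The closed form (26) is a simple convex combination of G and the locally weighted kernel sum; for illustration (a sketch with a hypothetical function name, operating on NumPy arrays):

```python
def update_Kw(G, kernels, w, alpha, gamma):
    """Closed-form update (26): K_w = (alpha*G + gamma*sum_p w_p K_p) / (alpha + gamma)."""
    K_lin = sum(wp * Kp for wp, Kp in zip(w, kernels))
    return (alpha * G + gamma * K_lin) / (alpha + gamma)
```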

3.4.4. Update H While Fixing G, $K_w$ and W

Here, problem (20) becomes

$$\min_{H} \ -\mathrm{Tr}\big(G \cdot HH^T\big) + \beta\,\mathrm{Tr}\big((\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T) W\big) \quad \text{s.t.} \quad H^T H = I_k,\ H \in \mathbb{R}^{n \times k}. \tag{27}$$

The term $\beta\,\mathrm{Tr}\big((\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T) W\big)$ makes it difficult to solve (27) directly. By means of matrix operations and the properties of the trace, $\mathrm{Tr}\big(\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) \cdot W\big) = \mathrm{Tr}\big((\mathbf{1}_M \mathrm{Diag}(W)) \cdot HH^T\big)$, so (27) can be changed into

$$\min_{H} \ \mathrm{Tr}\big((\beta\,\mathbf{1}_M \mathrm{Diag}(W) - G) \cdot HH^T\big) \quad \text{s.t.} \quad H^T H = I_k,\ H \in \mathbb{R}^{n \times k}. \tag{28}$$

Because $\mathbf{1}_M \mathrm{Diag}(W)$ in (28) is not symmetric, the eigenvector-based solution used for kernel k-means clustering is not suitable for (28). Notably, (28) is similar to the problem on the Stiefel manifold considered in [26]; thus, the Riemannian conjugate gradient method in [26] can be used to solve it.
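To make the structure of this update concrete, the sketch below solves (28) with a plain Riemannian gradient descent on the Stiefel manifold using a QR retraction. This is not the Riemannian conjugate gradient method of [26]; it is only our own illustration of optimizing over the constraint $H^T H = I_k$, and the step size and iteration count are arbitrary choices.

```python
import numpy as np

def update_H(M, H0, step=0.1, iters=200):
    """Illustrative solver for (28): min_H Tr(M H H^T) s.t. H^T H = I_k."""
    H = H0.copy()
    for _ in range(iters):
        egrad = (M + M.T) @ H                                   # Euclidean gradient of Tr(M H H^T)
        rgrad = egrad - H @ ((H.T @ egrad + egrad.T @ H) / 2)   # projection onto the tangent space
        H, _ = np.linalg.qr(H - step * rgrad)                   # QR retraction back onto the manifold
    return H
```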
These are the main steps of our proposed algorithm.

4. Experiments

4.1. Data Sets

Eight real data sets are used in our experiments, and their sizes and classes are summarized in Table 2.

4.2. Comparison Methods

To demonstrate the clustering performance, we compare OSC-ALK-ONK with six clustering methods. Among them, KKM is a single-kernel clustering method, MKKM and RMKKM are two classic MKC methods, and MKKM-MR, SimpleMKKM, and MKKM-RK are three recently proposed MKC methods.
  • KKM integrates integral operator kernel functions in principal component analysis to deal with nonlinear data [30].
  • MKKM combines the fuzzy k-means clustering with multiple kernel learning, where the weights of base kernels are automatically updated to produce the optimal kernel [31].
  • RMKKM is an extension of MKKM whose robustness is ensured by an $\ell_{2,1}$-norm in kernel space [7].
  • MKKM-MR uses a matrix-induced regularization to measure the correlation between all the kernel pairs and implements MKC [18].
  • SimpleMKKM adopts a min-max model that minimizes the kernel alignment over the kernel coefficients and maximizes it over the clustering matrix; it is a simple MKC method [12].
  • MKKM-RK is an MKC method by selecting representative kernels from the base kernel pool to generate the optimal kernel [17].

4.3. Multiple Kernels’ Construction

In this paper, we construct a kernel pool of twelve base kernels (i.e., m = 12): seven radial basis function kernels $\ker(x_i, x_j) = \exp\big(-\|x_i - x_j\|_2^2 / (2\tau\sigma^2)\big)$, where $\tau$ is selected from $\{0.01, 0.05, 0.1, 1, 10, 50, 100\}$ and $\sigma$ is the maximum distance between samples; four polynomial kernels $\ker(x_i, x_j) = (a + x_i^T x_j)^b$, where a and b are chosen from $\{0, 1\}$ and $\{2, 4\}$, respectively; and a cosine kernel $\ker(x_i, x_j) = (x_i^T x_j) / (\|x_i\| \cdot \|x_j\|)$. All the kernels $\{K_p\}_{p=1}^{m}$ are normalized to the range $[0, 1]$.
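For reference, the kernel pool described above might be built as follows (a sketch under our own assumptions: SciPy is used for the pairwise distances, the samples are assumed to have nonzero norm, and the normalization to [0, 1] is done by min-max scaling, which is only one possible choice).

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_kernel_pool(X):
    """Build the twelve base kernels: 7 RBF, 4 polynomial, 1 cosine."""
    D2 = cdist(X, X, 'sqeuclidean')
    sigma = np.sqrt(D2.max())                        # maximum pairwise distance between samples
    kernels = [np.exp(-D2 / (2 * tau * sigma ** 2))  # RBF kernels
               for tau in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    lin = X @ X.T
    kernels += [(a + lin) ** b for a in (0, 1) for b in (2, 4)]   # polynomial kernels
    norms = np.linalg.norm(X, axis=1, keepdims=True)              # assumed nonzero
    kernels.append(lin / (norms @ norms.T))          # cosine kernel
    return [(K - K.min()) / (K.max() - K.min()) for K in kernels]
```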

4.4. Experimental Results and Analysis

To obtain better and more stable clustering performance, we utilize ten-fold cross-validation with five-fold cross-validation embedded in OSC-ALK-ONK. Specifically, we first randomly partition all the samples into ten subsets without repetition, where nine subsets are viewed as the training set and the remaining one as the test set. The nine training subsets are further partitioned into five subsets without repetition, where four are used for training and the remaining one is the validation set. Without loss of generality, the values of the two parameters $\alpha$ and $\beta$ are varied over $\{10^{-2}, 10^{-1}, \ldots, 10^{1}, 10^{2}\}$. The five-fold cross-validation selects the optimal combination of $\alpha$ and $\beta$, and the obtained optimal combination is then used on the test set to produce the final clustering results. The number of clusters k for each data set is set to its true number of classes.
For each method used for comparison, we set the parameters according to the corresponding literature.
The final experimental results of each method on each data set, namely the average ACC, NMI, and Purity of 15 experiments, are reported in Table 3. The best ACC, NMI, and Purity on each data set are highlighted in boldface. The last three rows in Table 3 are the mean ACC, NMI, and Purity of each method on all the data sets. Evidently, the proposed OSC-ALK-ONK performs best. The detailed analyses are as follows.
(1) OSC-ALK-ONK outperforms KKM by 54.41%, 62.44%, and 53.26% in terms of ACC, NMI, and Purity, respectively. This verifies that multiple kernel clustering is superior to single kernel clustering. (2) OSC-ALK-ONK exceeds MKKM and RMKKM by 58.62%, 67.01%, 60.53% and 31.29%, 34.36%, 32.89% in terms of ACC, NMI, and Purity, respectively. The reason should be that the combination of the local kernel method and the neighborhood kernel method avoids the redundancy of the base kernels and expands the search range of the optimal kernel. The clustering results of OSC-ALK-ONK are also better than those of SimpleMKKM, which should likewise be credited to the combination of the two methods. (3) Although MKKM-MR and MKKM-RK exceed KKM, MKKM, and RMKKM, they are inferior to OSC-ALK-ONK. The reason should be that the localized kernel strategy in OSC-ALK-ONK ensures the sparsity of the base kernels and successfully avoids their redundancy. In a word, OSC-ALK-ONK improves the quality of the optimal kernel and promotes clustering performance by combining local kernels and a neighborhood kernel. In addition, the BD representation ensures the reliability of the clustering results and further promotes the clustering performance of OSC-ALK-ONK.
Overall, the experiment results show that OSC-ALK-ONK is an effective clustering method.
In order to further substantiate the effectiveness of OSC-ALK-ONK, we present the visualization of clustering results for all methods on ISOLET (for convenience, only a fifth of samples in ISOLET are chosen). As can be seen from Figure 2, OSC-ALK-ONK achieves a good clustering effect.

4.5. Ablation Study

In OSC-ALK-ONK, the weights of the base kernels are adjusted adaptively, which aims to choose base kernels with small correlation and discard those with large correlation. These weights are automatically updated during the optimization of the model. To verify the effectiveness of the localized kernel selection strategy, we adopt the uniform weight strategy as a contrast, i.e., $w_p = \frac{1}{m}$, $p = 1, 2, \ldots, m$, to perform an ablation study. For convenience, this model is denoted as OSC-ONK-UW; that is, all the base kernels are selected in OSC-ONK-UW.
In addition, the BD regularization term is used in OSC-ALK-ONK. To validate its effect, we also conduct an ablation study on the model without this term, i.e., we consider the following model (ALK-ONK-NoBD):

$$\min_{G, H, K_w} \ \mathrm{Tr}\big(G(I_n - HH^T)\big) + \frac{\alpha}{2}\|G - K_w\|_F^2 \quad \text{s.t.} \quad H^T H = I_k,\ H \in \mathbb{R}^{n \times k},\ G \succeq 0,\ K_w = \sum_{p=1}^{m} w_p K_p, \tag{29}$$

where $w_p$ is computed according to (10) and (11).
The results of the ablation studies on the eight data sets, i.e., the comparison of OSC-ONK-UW, ALK-ONK-NoBD, and OSC-ALK-ONK, are shown in Figure 3, which indicates that OSC-ALK-ONK outperforms OSC-ONK-UW and ALK-ONK-NoBD. Accordingly, OSC-ALK-ONK improves the clustering performance through the localized kernel selection strategy and the BD regularizer.

4.6. Parameters’ Sensitivity

The model of OSC-ALK-ONK involves the parameters $\alpha$, $\beta$, and a penalty parameter $\gamma$. We set $\gamma$ to 0.1 in the experiments. To examine the sensitivity of OSC-ALK-ONK to $\alpha$ and $\beta$, they are tuned over the range $\{10^{-2}, 10^{-1}, \ldots, 10^{1}, 10^{2}\}$ via a grid search. Figure 4 shows the clustering performance of OSC-ALK-ONK for varying $\alpha$ and $\beta$, which indicates that the best parameter setting of OSC-ALK-ONK is data-dependent.

4.7. Convergence

In this section, we first prove the convergence of the objective function of (20). For convenience, we denote the objective function of problem (20) as

$$J(W, G, K_w, H) = \mathrm{Tr}\big(G(I_n - HH^T)\big) + \frac{\alpha}{2}\|G - K_w\|_F^2 + \frac{\beta}{2}\mathrm{Tr}\big((\mathrm{Diag}(HH^T \cdot \mathbf{1}_n) - HH^T)^T W\big), \tag{30}$$

subject to $H^T H = I_k$, $H \in \mathbb{R}^{n \times k}$, $G \succeq 0$, $K_w = \sum_{p=1}^{m} w_p K_p$, $0 \preceq W \preceq I$, and $\mathrm{Tr}(W) = k$.

When updating $W$ with fixed $G$, $K_w$, $H$, problem (21) is a convex programming problem [27], so it converges to the global optimal solution of this subproblem. Denoting the optimal solution as $W^{t+1}$, we have

$$J(W^{t+1}, G^t, K_w^t, H^t) \le J(W^t, G^t, K_w^t, H^t). \tag{31}$$

When updating $G$ with fixed $W$, $K_w$, $H$, problem (22) is convex and the global optimal solution can be obtained, which we denote as $G^{t+1}$; then

$$J(W^{t+1}, G^{t+1}, K_w^t, H^t) \le J(W^{t+1}, G^t, K_w^t, H^t). \tag{32}$$

When updating $K_w$ with fixed $W$, $G$, $H$, problem (25) is convex, so the global optimal solution can be obtained. Denoting it as $K_w^{t+1}$, we have

$$J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^t) \le J(W^{t+1}, G^{t+1}, K_w^t, H^t). \tag{33}$$

When updating $H$ with fixed $W$, $G$, $K_w$, since $\mathbf{1}_M \mathrm{Diag}(W)$ is not symmetric, it is difficult to prove that problem (28) for $H$ is convex. Nevertheless, the global convergence of the conjugate gradient method after a finite number of iterations has been proved in [26]; i.e., the conjugate gradient method ensures that problem (28) converges when updating $H$. Denoting the obtained solution as $H^{t+1}$, we have

$$J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^{t+1}) \le J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^t). \tag{34}$$

Combining (31)–(34), it is concluded that

$$J(W^{t+1}, G^{t+1}, K_w^{t+1}, H^{t+1}) \le J(W^t, G^t, K_w^t, H^t). \tag{35}$$

Therefore, $J(W^t, G^t, K_w^t, H^t)$ decreases monotonically at each iteration and, since it is bounded below, it converges.
The above proof shows that Algorithm 1 monotonically reduces the value of the objective function at each iteration, i.e., the objective function is monotonically decreasing. The convergence curves of OSC-ALK-ONK on all the data sets are shown in Figure 5, where the stopping criterion of the algorithm is $\frac{|obj^{(t+1)} - obj^{(t)}|}{|obj^{(t)}|} \le 10^{-3}$ and $obj^{(t)}$ denotes the objective function value at the t-th iteration. Evidently, the trend of the objective function value with respect to the iteration number in Figure 5 is monotonically decreasing. Further, the algorithm converges within 10 iterations on all the data sets, which demonstrates the rapid convergence of OSC-ALK-ONK.
   Algorithm 1: Pseudo code for solving problem (18).
            Input: m base kernels $\{K_p\}_{p=1}^{m}$ and parameters $\alpha$, $\beta$, $\gamma$.
            Initialize: $K_w^{1} = \frac{1}{m}\sum_{p=1}^{m} K_p$, $H^{1} = \mathrm{rand}(n, k)$, $w_p^{1} = \frac{1}{m}$ for $p = 1, \ldots, m$.
            While not converged do
                 (1) Update $W^{t+1}$ by solving (21).
                 (2) Update $G^{t+1}$ by solving (22).
                 (3) Update $K_w^{t+1}$ via (26).
                 (4) Update $H^{t+1}$ by solving (27) (in the form (28)) with the Riemannian conjugate gradient method, and compute $w$ via (10) and (11).
            end while
            Obtain the optimal $G^*$, $H^*$, $K_w^*$, $W^*$.
            Output: ACC, NMI and Purity.
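Pulling the sketches from Section 3 together, the outer loop of Algorithm 1 might be organized as follows. This is hypothetical glue code, not the authors' implementation: `update_W`, `update_G`, `update_Kw`, `update_H`, `adaptive_local_weights`, and `bd_regularizer` are the illustrative helpers sketched earlier, and `update_H` is only a stand-in for the Riemannian conjugate gradient method of [26].

```python
import numpy as np

def osc_alk_onk(kernels, k, alpha, beta, gamma, delta, tol=1e-3, max_iter=50):
    """Hypothetical outer loop of Algorithm 1 built from the earlier sketches."""
    m, n = len(kernels), kernels[0].shape[0]
    K_w = sum(kernels) / m                          # initial consensus kernel
    H, _ = np.linalg.qr(np.random.rand(n, k))       # random orthonormal indicator matrix
    w = np.full(m, 1.0 / m)
    prev_obj = None
    for _ in range(max_iter):
        W = update_W(H, k)                                        # step (1)
        G = update_G(K_w, H, alpha)                               # step (2)
        K_w = update_Kw(G, kernels, w, alpha, gamma)              # step (3)
        M = beta * np.ones((n, n)) @ np.diag(np.diag(W)) - G      # coefficient matrix of (28)
        H = update_H(M, H)                                        # step (4)
        w, _ = adaptive_local_weights(kernels, G, delta)          # refresh weights via (10)-(11)
        obj = (np.trace(G @ (np.eye(n) - H @ H.T))
               + alpha / 2 * np.linalg.norm(G - K_w, 'fro') ** 2
               + beta / 2 * bd_regularizer(H, k))
        if prev_obj is not None and abs(prev_obj - obj) / abs(prev_obj) <= tol:
            break                                                 # stopping criterion of Section 4.7
        prev_obj = obj
    return H, G, K_w, W
```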

5. Conclusions

In this paper, we proposed a novel MKC method called OSC-ALK-ONK. It adaptively selects local base kernels to generate a consensus kernel and uses a neighborhood kernel of this consensus kernel as the optimal one. The combination of these two techniques promotes the quality of the optimal kernel by enlarging its search area while avoiding the redundancy of the base kernels. Furthermore, a BD regularizer is applied to the indicator matrix to execute one-step clustering and avoid the two-step operation. Extensive experimental results indicate the effectiveness of OSC-ALK-ONK.
In real applications, much data is multi-view and may be incomplete for various reasons. Given the effectiveness of the local kernel selection strategy in this paper, combining it with the neighborhood kernel to obtain a high-quality optimal kernel for multi-view data is a promising direction. Likewise, owing to the advantages of the BD regularization term, it could also be applied to multi-view data, even incomplete multi-view data. All of these are worth studying in the future.

Author Contributions

Conceptualization, C.C., J.M. and Z.L.; methodology, C.C.; software, Z.H. and H.X.; validation, C.C.; formal analysis, J.M. and Z.L.; resources, C.C. and Z.H.; writing—original draft preparation, C.C.; writing—review and editing, J.M. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research fund of Guangxi Key Lab of Multi-source Information Mining and Security (No. MIMS22-03, No. MIMS21-M-01), the Guangxi Natural Science Foundation (No. 2023GXNSFBA026010).

Data Availability Statement

The datasets used in the experiments are available in the corresponding website at http://featureselection.asu.edu/datasets.php (accessed on 12 September 2022) (i.e., AR), http://www.cs.nyu.edu/roweis/data.html (accessed on 3 September 2022) (i.e., BA), https://jundongl.github.io/scikit-feature/datasets.html (accessed on 5 October 2022) (i.e., GLIOMA, LYMPHOMA, ORL, YALE), and https://archive-beta.ics.uci.edu/ml/datasets (accessed on 1 November 2022) (i.e., CCUDS10, ISOLET).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Valizadegan, H.; Jin, R. Generalized maximum margin clustering and unsupervised kernel learning. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; pp. 1417–1424.
  2. Zeng, H.; Cheung, Y.M. Feature selection and kernel learning for local learning-based clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 1532–1547.
  3. Cortes, C.; Mohri, M.; Rostamizadeh, A. L2 regularization for learning kernels. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 June 2009; pp. 109–116.
  4. Zhao, B.; Kwok, J.T.; Zhang, C.S. Multiple kernel clustering. In Proceedings of the SIAM International Conference on Data Mining, Sparks, NV, USA, 30 April–2 May 2009; pp. 638–649.
  5. Kloft, M.; Brefeld, U.; Sonnenburg, S.; Laskov, P.; Müller, K.R.; Zien, A. Efficient and accurate lp-norm multiple kernel learning. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; pp. 997–1005.
  6. Xu, Z.L.; Jin, R.; Yang, H.Q.; King, I.; Lyu, M.R. Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 1175–1182.
  7. Du, L.; Zhou, P.; Shi, L.; Wang, H.M.; Fan, M.Y.; Wang, W.J.; Shen, Y.D. Robust multiple kernel k-means using l21-norm. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 3476–3482.
  8. Kang, Z.; Lu, X.; Yi, J.F.; Xu, Z.L. Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2312–2318.
  9. Kang, Z.; Peng, C.; Cheng, Q.; Xu, Z.L. Unified spectral clustering with optimal graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3366–3373.
  10. Kang, Z.; Wen, L.J.; Chen, W.Y.; Xu, Z.L. Low-rank kernel learning for graph-based clustering. Knowl. Based Syst. 2019, 163, 510–517.
  11. Zhou, S.H.; Zhu, E.; Liu, X.W.; Zheng, T.M.; Liu, Q.; Xia, J.Y.; Yin, J.P. Subspace segmentation-based robust multiple kernel clustering. Inf. Fusion 2020, 53, 145–154.
  12. Liu, X.W. SimpleMKKM: Simple multiple kernel k-means. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5174–5186.
  13. Liu, X.W.; Zhou, S.H.; Wang, Y.Q.; Li, M.M.; Dou, Y.; Zhu, E.; Yin, J.P. Optimal neighborhood kernel clustering with multiple kernels. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2266–2272.
  14. Ou, Q.Y.; Gao, L.; Zhu, E. Multiple kernel k-means with low-rank neighborhood kernel. IEEE Access 2021, 9, 3291–3300.
  15. Zhou, S.H.; Liu, X.W.; Li, M.M.; Zhu, E.; Liu, L.; Zhang, C.W.; Yin, J.P. Multiple kernel clustering with neighbor-kernel subspace segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1351–1362.
  16. Liu, X.W.; Zhou, S.H.; Liu, L.; Tang, C.; Wang, S.W.; Liu, J.Y.; Zhang, Y. Localized simple multiple kernel k-means. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9273–9281.
  17. Yao, Y.Q.; Li, Y.; Jiang, B.B.; Chen, H.H. Multiple kernel k-means clustering by selecting representative kernels. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4983–4996.
  18. Liu, X.W.; Dou, Y.; Yin, J.P.; Wang, L.; Zhu, E. Multiple kernel k-means clustering with matrix-induced regularization. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1888–1894.
  19. Liu, J.Y.; Liu, X.W.; Xiong, J.; Liao, Q.; Zhou, S.H.; Wang, S.W.; Yang, Y.X. Optimal neighborhood multiple kernel clustering with adaptive local kernels. IEEE Trans. Knowl. Data Eng. 2022, 34, 2872–2885.
  20. Afkanpour, A.; Szepesvári, C.; Bowling, M. Alignment based kernel learning with a continuous set of base kernels. Mach. Learn. 2013, 91, 305–324.
  21. Wang, T.H.; Tian, S.F.; Huang, H.K.; Deng, D.Y. Learning by local kernel polarization. Neurocomputing 2009, 72, 3077–3084.
  22. Wang, L. Feature selection with kernel class separability. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1534–1546.
  23. Lu, Y.T.; Wang, L.T.; Lu, J.F.; Yang, J.Y.; Shen, C.H. Multiple kernel clustering based on centered kernel alignment. Pattern Recognit. 2014, 47, 3656–3664.
  24. Li, M.M.; Liu, X.W.; Wang, L.; Dou, Y.; Yin, J.P.; Zhu, E. Multiple kernel clustering with local kernel alignment maximization. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 1704–1710.
  25. Wang, C.L.; Zhu, E.; Liu, X.W.; Gao, L.; Yin, J.P.; Hu, N. Multiple kernel clustering with global and local structure alignment. IEEE Access 2018, 6, 77911–77920.
  26. Li, J.F.; Qin, S.J.; Zhang, L.; Hou, W.T. An efficient method for solving a class of matrix trace function minimization problem in multivariate statistical. Math. Numer. Sin. 2021, 43, 70–86.
  27. Lu, C.Y.; Feng, J.S.; Lin, Z.C.; Mei, T.; Yan, S.C. Subspace clustering by block diagonal representation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 487–501.
  28. Feng, J.S.; Lin, Z.C.; Xu, H.; Yan, S.C. Robust subspace segmentation with block-diagonal prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3818–3825.
  29. Dattorro, J. Convex Optimization and Euclidean Distance Geometry; 2016. Available online: http://meboo.convexoptimization.com/Meboo.html (accessed on 10 October 2022).
  30. Schölkopf, B.; Smola, A.J.; Müller, K.R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 1998, 10, 1299–1319.
  31. Huang, H.C.; Chuang, Y.Y.; Chen, C.S. Multiple kernel fuzzy clustering. IEEE Trans. Fuzzy Syst. 2012, 20, 120–134.
Figure 1. The process of generating a neighborhood kernel.
Figure 2. The visualization of clustering results for OSC-ALK-ONK and the comparison methods on ISOLET.
Figure 3. Comparison of clustering results among OSC-ONK-UW, ALK-ONK-NoBD and OSC-ALK-ONK on eight datasets.
Figure 4. ACC of OSC-ALK-ONK with different parameter settings.
Figure 5. Objective function value of OSC-ALK-ONK at each iteration.
Table 1. Details of notations.

Notation | Description
$\|A\|_F$ | Frobenius norm of $A$, i.e., $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$
$A^T$ | transpose of $A$
$\mathrm{Tr}(A)$ | trace of $A$
$\mathrm{Diag}(A)$ | diagonal matrix with the diagonal elements of $A$
$A \succeq 0$ | positive semi-definite $A$
$I_k$ | $k$-order identity matrix
$\mathbf{1}_n$ | all-one column vector
$\mathbf{1}_M$ | all-one matrix
Table 2. Summaries of data sets.

Data Sets | # Samples | # Features | # Classes
AR | 840 | 768 | 120
BA | 1404 | 320 | 36
CCUDS10 | 1944 | 101 | 10
GLIOMA | 50 | 4434 | 4
ISOLET | 1560 | 617 | 2
LYMPHOMA | 96 | 4026 | 9
ORL | 400 | 1024 | 40
YALE | 165 | 1024 | 15
Table 3. Clustering results of different methods.

Dataset | Metric | KKM | MKKM | RMKKM | MKKM-MR | SimpleMKKM | MKKM-RK | Proposed
AR | ACC | 0.3000 | 0.3167 | 0.3168 | 0.4863 | 0.5150 | 0.5047 | 0.6686
AR | NMI | 0.6360 | 0.6350 | 0.6608 | 0.7615 | 0.7644 | 0.7608 | 0.8890
AR | Purity | 0.3190 | 0.3437 | 0.3358 | 0.5398 | 0.5304 | 0.5305 | 0.7826
BA | ACC | 0.2863 | 0.3868 | 0.4088 | 0.4177 | 0.4496 | 0.3708 | 0.4211
BA | NMI | 0.4365 | 0.5301 | 0.5639 | 0.5882 | 0.5919 | 0.5194 | 0.6716
BA | Purity | 0.3226 | 0.4010 | 0.4329 | 0.4619 | 0.4780 | 0.3962 | 0.4773
CCUDS10 | ACC | 0.1280 | 0.1214 | 0.1285 | 0.1345 | 0.1287 | 0.1287 | 0.2031
CCUDS10 | NMI | 0.0093 | 0.0081 | 0.0091 | 0.0083 | 0.0102 | 0.0073 | 0.1054
CCUDS10 | Purity | 0.1318 | 0.1234 | 0.1331 | 0.1357 | 0.1327 | 0.1310 | 0.2182
GLIOMA | ACC | 0.5032 | 0.4880 | 0.5760 | 0.4955 | 0.5120 | 0.5640 | 0.7900
GLIOMA | NMI | 0.3256 | 0.2943 | 0.4818 | 0.3083 | 0.2957 | 0.4077 | 0.7178
GLIOMA | Purity | 0.5357 | 0.5400 | 0.6460 | 0.5341 | 0.5320 | 0.5787 | 0.8420
ISOLET | ACC | 0.5659 | 0.5269 | 0.5643 | 0.5282 | 0.5801 | 0.5558 | 0.6489
ISOLET | NMI | 0.0224 | 0.0021 | 0.0121 | 0.0023 | 0.0192 | 0.0090 | 0.0862
ISOLET | Purity | 0.5659 | 0.5269 | 0.5643 | 0.5282 | 0.5801 | 0.5558 | 0.6500
LYMPHOMA | ACC | 0.4982 | 0.5085 | 0.6135 | 0.5437 | 0.5932 | 0.5639 | 0.6929
LYMPHOMA | NMI | 0.5105 | 0.5070 | 0.6172 | 0.6495 | 0.6099 | 0.5963 | 0.7070
LYMPHOMA | Purity | 0.7163 | 0.7036 | 0.8031 | 0.7826 | 0.8266 | 0.8125 | 0.8350
ORL | ACC | 0.4308 | 0.3475 | 0.5521 | 0.6357 | 0.6391 | 0.5860 | 0.7127
ORL | NMI | 0.6383 | 0.5378 | 0.7406 | 0.8163 | 0.8073 | 0.7581 | 0.8898
ORL | Purity | 0.4797 | 0.3525 | 0.6001 | 0.6908 | 0.6860 | 0.6188 | 0.8107
YALE | ACC | 0.4182 | 0.3515 | 0.5218 | 0.5341 | 0.5512 | 0.5939 | 0.6962
YALE | NMI | 0.4330 | 0.4152 | 0.5558 | 0.5614 | 0.5826 | 0.5986 | 0.8262
YALE | Purity | 0.4424 | 0.3636 | 0.5364 | 0.5495 | 0.5555 | 0.5988 | 0.7691
Avg | ACC | 0.3913 | 0.3809 | 0.4602 | 0.4720 | 0.4961 | 0.4835 | 0.6042
Avg | NMI | 0.3765 | 0.3662 | 0.4552 | 0.4620 | 0.4602 | 0.4572 | 0.6116
Avg | Purity | 0.4392 | 0.4193 | 0.5065 | 0.5278 | 0.5402 | 0.5278 | 0.6731
