Article

Latent Feature Group Learning for High-Dimensional Data Clustering

1 Big Data Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
2 National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
3 School of Computer Science, McGill University, Montreal, QC H3A 0G4, Canada
* Author to whom correspondence should be addressed.
Information 2019, 10(6), 208; https://doi.org/10.3390/info10060208
Submission received: 1 April 2019 / Revised: 17 May 2019 / Accepted: 6 June 2019 / Published: 10 June 2019
(This article belongs to the Section Artificial Intelligence)

Abstract

In this paper, we propose a latent feature group learning (LFGL) algorithm to discover feature grouping structures and subspace clusters in high-dimensional data. The feature grouping structures, which are learned analytically, can enhance both the accuracy and the efficiency of high-dimensional data clustering. In the LFGL algorithm, a Darwinian evolutionary process is used to explore the optimal feature grouping structures, which are coded as chromosomes in a genetic algorithm. The feature grouping weighting k-means algorithm is used as the fitness function to evaluate the chromosomes, i.e., the feature grouping structures, in each generation of the evolution. To better handle the diverse densities of clusters in high-dimensional data, the original feature grouping weighting k-means is revised in two ways: the Euclidean distance is replaced with a mass-based dissimilarity measure, and the feature weights are optimized as a nonnegative matrix factorization problem under an orthogonality constraint on the feature weight matrix. The genetic operations of mutation and crossover generate the new chromosomes of each next generation. Compared with well-known clustering algorithms, the LFGL algorithm produced encouraging experimental results on real-world datasets, which demonstrates its better performance when clustering high-dimensional data.

1. Introduction

High-dimensional data analysis is a challenging task in the domains of machine learning and artificial intelligence. The thousands of features in high-dimensional data cause high complexity when tools designed for low-dimensional data are used for clustering [1]. High-dimensional data often contain many redundant, irrelevant and noisy features, which affect the clustering results. In the past decades, many subspace clustering algorithms have been proposed to handle high-dimensional data; they aim at finding clusters in subspaces of the data rather than in the entire data space [2]. Among the various subspace clustering methods, soft subspace clustering is an important technique. It assigns weights to individual features and uses the weights to identify the important features from which the subspace structures of clusters can be discovered [3,4].
The individual feature weighting method suffers from the high dimensionality of data with hundreds of thousands of features. The first problem is that the feature weights produced by individual feature weighting are unstable. For example, experimental results show that individual feature weights converge to similar values on low-dimensional data [5]; however, as the dimensionality increases, the feature weights calculated with different initial values become unstable and no longer converge [6]. The second problem is that many weak features exist in high-dimensional data, so it is hard to recover the subspace cluster structure from the data. The third problem is that the clusters are often buried in many noisy features.
To tackle these problems, the feature grouping weighting k-means algorithm, FG-k-means [7], was proposed for high-dimensional data. In FG-k-means, the features are divided into a small set of feature groups, each of which is treated as a group feature in the low-dimensional space of feature groups. High-dimensional data are clustered on the group features, and the clusters in different subspaces of group features are discovered by assigning weights to the group features [6]. Because the group features generalize the information of individual features in high-dimensional data, FG-k-means performs better than clustering algorithms that cluster the data on individual features. One limitation of FG-k-means is that the feature groups have to be known in advance; however, very few real-world high-dimensional datasets have feature groups available. This requirement limits the practical use of the FG-k-means algorithm.
In this paper, we propose a latent feature group learning (LFGL) algorithm to automatically learn the latent feature groups in the process of subspace clustering of high-dimensional data. The algorithm consists of two levels of optimization. The outer level uses a Darwinian evolutionary process to learn the optimal structure of feature groups: the feature group structures are coded as chromosomes in a genetic algorithm and the best chromosome is selected through the evolution of generations. The inner level is used in each generation to evaluate the chromosomes and select the stronger ones for the genetic operations that generate the new chromosomes of the next generation. The FG-k-means algorithm is used in the inner level as the fitness function for chromosome selection. To deal effectively with complex high-dimensional data, two revisions are made to FG-k-means. First, the mass-based dissimilarity measure replaces the Euclidean distance in calculating the dissimilarity between objects, so that different densities of clusters can be better handled. Second, the feature grouping and group weights are optimized with a convex penalty relaxation that uses an orthogonality constraint to ensure the orthogonality among the feature groups; the relaxed problem is solved by a nonnegative matrix factorization process. At the outer level, mutation and crossover operations manipulate the chromosomes for the next generation. We conducted experiments on six gene datasets and one text dataset and compared the results of the LFGL algorithm with five existing algorithms. LFGL obtained encouraging experimental results, which demonstrates its better performance.
The remainder of this paper is organized as follows. In Section 2, we review some related work. In Section 3, we present the latent feature grouping model for projection of high-dimensional data to a low-dimensional space. Section 4 presents the LFGL algorithm. The experimental results are presented and discussed in Section 5. The conclusions are drawn in Section 6.

2. Related Work

In the past decade, soft subspace clustering has been an important research topic in cluster analysis [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]. The representative works are summarized as follows.
  • Huang et al. [5] proposed the W-k-means clustering algorithm, which automatically computes the feature weights in the k-means clustering process. W-k-means extends the standard k-means algorithm with one additional step that computes the feature weights at each iteration. Each feature weight is inversely proportional to the sum of the within-cluster variances of the feature. As such, noisy features can be identified and their effect on the clustering result is significantly reduced. De Amorim and Mirkin [22] extended the W-k-means algorithm with the Minkowski metric, in which the exponent of the feature weights coincides with the Minkowski exponent.
  • Hoff [17] proposed a multivariate Dirichlet process mixture model, which is based on a Pólya urn cluster model for the multivariate means and variances. The model is learned by a Markov chain Monte Carlo process, but its computational cost is prohibitive. Fan et al. [23] proposed a variational inference framework for unsupervised non-Gaussian feature selection in the context of finite generalized Dirichlet (GD) mixture-based clustering. Under this principled variational framework, all the involved parameters are estimated and the complexity of the GD mixture (i.e., both model and feature selection) is determined simultaneously.
  • Domeniconi et al. [16] proposed the locally adaptive clustering (LAC) algorithm, which assigns a weight to each feature in each cluster and uses an iterative algorithm to minimize its objective function. Cheng et al. [21] proposed another weighting k-means approach that is very similar to LAC but allows further constraints to be incorporated. Jing et al. [4] proposed the entropy weighting k-means (EWKM) algorithm, which also assigns a weight to each feature in each cluster. Different from LAC, EWKM extends the standard k-means algorithm with one additional step that computes the feature weights for each cluster at each iteration; the weight is inversely proportional to the sum of the within-cluster variances of the feature in the cluster.
  • Tsai and Chiu [19] developed a feature-weight self-adjustment mechanism for k-means clustering of relational datasets, in which the feature weights are computed automatically by simultaneously minimizing the within-cluster discrepancy and maximizing the between-cluster discrepancy. Deng et al. [20] proposed an enhanced soft subspace clustering (ESSC) algorithm that employs both within-cluster and between-cluster information in the subspace clustering process. Xia et al. [24] proposed a multi-objective-evolutionary-approach-based soft subspace clustering (MOEASSC) algorithm that optimizes two minimization objective functions simultaneously with a multi-objective evolutionary approach; they also proposed a two-step method to reduce the difficulty of identifying the subspace of each cluster, the cluster memberships of objects and the number of clusters.
  • Chen et al. [7] proposed a two-level weighting method named FG-k-means for multi-view data, in which two types of weights are employed: a view weight is assigned to each view to identify its compactness, and a feature weight is assigned to each feature in a view to identify its contribution. Cai et al. [25] used FG-k-means for text clustering: they first used the topic model LDA to partition the words into several groups and then used FG-k-means to cluster the text data. Their experimental results show that the word grouping method improves the clustering performance on text data. Gan and Ng [26] extended FG-k-means to generate the feature groups automatically by clustering the feature weights; the objective function for clustering the feature weights is added as a penalty term to the objective function of FG-k-means. This indicates that there is a much higher chance of constructing all informative abstract features than all informative individual features.

3. Latent Feature Group Projection in Subspace Clustering

Although many features are used to describe data in a high-dimensional space, only a few of them are needed to distinguish a specific cluster from the others. Thus, subspace clustering algorithms [4,5,16,24] achieve better performance than the general k-means algorithm. However, as the number of dimensions increases, searching for the subspace of decisive features in the entire space often leads to suboptimal results because of the many noise features [6], which deteriorates the performance of subspace clustering.
To solve this problem, we project the high-dimensional objects into a low-dimensional latent feature group space. The features in a high-dimensional space are not fully independent; rather, they gather into nearly mutually exclusive groups. However, the feature groups are unknown in most real-world datasets; they are latent in the high-dimensional data. To make the projection possible, we need an effective learning process that can automatically find the latent feature groups and cluster the objects in the subspace of feature groups. We solve this problem in two steps. The first step builds a latent feature grouping model that can express the feature partition. The second step embeds the latent feature grouping model into the objective function of the subspace clustering process, so that grouping features and clustering data are performed simultaneously.
Figure 1 illustrates the latent feature grouping model. The left column shows the features of a dataset $X \in \mathbb{R}^{n \times m}$. The feature set $A = \{x_1, x_2, \ldots, x_m\}$ is mapped into $t$ groups $\{g_1, g_2, \ldots, g_t\}$ in the middle column. Each feature is mapped to only one group, which ensures the exclusiveness of features in different groups. The mapping is defined by a partition matrix $V_0$ whose rows correspond to the feature groups and whose columns correspond to the features. If feature $j$ is mapped to feature group $i$, the element $(i, j)$ is set to 1; otherwise, it is set to 0. Because of the exclusiveness, each column of $V_0$ has exactly one element equal to 1 and the rest are 0. When a feature is mapped to a feature group, we also assign it a weight that indicates the importance of the feature in the group. The feature weights are represented in the weight matrix $V_l'$, where $l$ indexes a cluster, and we let $V_l = V_0 \odot V_l'$, where $\odot$ denotes the element-wise (Hadamard) product. The feature groups are also weighted with a group weight vector $w_l \in \mathbb{R}_{+}^{t}$, written in Equation (1) as the diagonal matrix $W_l = \mathrm{diag}(w_l)$, to produce the weighted group values $\{g(x)_1, g(x)_2, \ldots, g(x)_t\}$ shown in the right column. The weights of the feature groups identify their importance in determining each cluster. Formally, the latent feature grouping model is written as
$g(x) = W_l (V_0 \odot V_l')\, x^{T}. \qquad (1)$
Below, we use a simple example to illustrate Equation (1). The matrix
$W_l = \begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 0.2 & 0 \\ 0 & 0 & 0.7 \end{pmatrix} \qquad (2)$
provides the weights of the three feature groups in cluster $l$, with the normalization constraint $\sum_{t} w_{l,t} = 1$. A partition of six features into three groups is specified in the partition matrix $V_0$ as
$V_0 = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{pmatrix}, \qquad (3)$
where the rows of $V_0$ correspond to the feature groups $G_1, G_2, G_3$, the columns correspond to the features $A_1, \ldots, A_6$, and the element 1 in each column indicates the feature group to which that feature is assigned. The features in each group are also weighted according to their importance in the feature group of each cluster. The weighting scheme for cluster $l$ is represented by the following weight matrix:
$V_l = \begin{pmatrix} 0.89 & 0.46 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.60 & 0 & 0.80 \\ 0 & 0 & 0.35 & 0 & 0.94 & 0 \end{pmatrix}. \qquad (4)$
The nonzero elements of $V_l$ are in the same positions as those of $V_0$. The matrices $V_l$ are initialized and optimized under the constraint $V_l V_l^{T} = I$; thus, we have
$\sum_{j=1}^{6} v_{i,j}^{2} = 1, \quad \text{for } 1 \le i \le 3. \qquad (5)$
Since $V_l = V_0 \odot V_l'$, the feature group structure $V_0$ and the individual feature weights $V_l'$ of each cluster can be optimized separately. In the existing feature grouping weighting algorithms such as FG-k-means, $V_0$ is assumed to be known in advance and only the feature weights are optimized in the clustering process. However, $V_0$ is usually unknown, so these algorithms cannot learn the feature grouping structure automatically. With the latent feature grouping model in Equation (1), we can learn $V_0$ with the latent feature group learning algorithm described in the next section.
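To make the grouping model concrete, the following minimal sketch (in Python with NumPy) evaluates Equation (1) for the example above. The matrices are copied from Equations (2)–(4); the object x is a hypothetical input introduced only for illustration.

```python
import numpy as np

# Partition matrix V0 (t = 3 groups x m = 6 features), as in Equation (3):
# feature j belongs to group i iff V0[i, j] == 1.
V0 = np.array([[1, 1, 0, 0, 0, 0],
               [0, 0, 0, 1, 0, 1],
               [0, 0, 1, 0, 1, 0]], dtype=float)

# Per-cluster feature weights with the same nonzero pattern, as in Equation (4).
Vl = np.array([[0.89, 0.46, 0.00, 0.00, 0.00, 0.00],
               [0.00, 0.00, 0.00, 0.60, 0.00, 0.80],
               [0.00, 0.00, 0.35, 0.00, 0.94, 0.00]])

# Diagonal matrix of the feature-group weights of cluster l, as in Equation (2).
Wl = np.diag([0.1, 0.2, 0.7])

# A hypothetical object with m = 6 feature values (for illustration only).
x = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 2.5])

# Equation (1): project the object into the 3-dimensional feature-group space.
g = Wl @ ((V0 * Vl) @ x)
print(g)  # the weighted group values g(x)_1, g(x)_2, g(x)_3
```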

4. Latent Feature Group Learning Algorithm

In this section, we introduce the latent feature group learning (LFGL) algorithm, which searches for the optimal latent feature groups and computes the group weights and individual feature weights of each cluster. This is achieved automatically with two levels of optimization: the outer level uses the Darwinian evolution process to select the optimal feature grouping structure $V_0$, and the inner level uses the revised FG-k-means algorithm to evaluate each feature grouping structure. In the following, we first present the revised FG-k-means and then the Darwinian evolution process for selecting the optimal feature grouping structure.

4.1. Revised FG-k-Means

In the LFGL algorithm, each given feature grouping structure $V_0$ is evaluated by the revised FG-k-means on the input dataset. The objective function of the revised FG-k-means is defined as
$P(U, Z, V, W) = \sum_{l=1}^{k} \left[ \sum_{i=1}^{n} \sum_{t=1}^{T} \sum_{j \in G_t} h_{i,l}\, w_{l,t}\, v_{l,j}\, d(x_{i,j}, z_{l,j}) + \lambda \sum_{t=1}^{T} w_{l,t} \log(w_{l,t}) \right] \qquad (6)$
subject to
$\sum_{l=1}^{k} h_{i,l} = 1 \ (i = 1, 2, \ldots, n), \qquad \sum_{t=1}^{T} w_{l,t} = 1 \ (l = 1, 2, \ldots, k), \qquad V_l V_l^{T} = I, \qquad (7)$
where $X = \{x_i \mid x_i \in \mathbb{R}^{m}\}_{i=1}^{n}$ is the set of $n$ objects with $m$ features, $H = \{h_{i,l} \mid h_{i,l} \in \{0, 1\}\}$ is the set of membership indicators of the $n$ objects in the $k$ clusters, $Z = \{z_l \mid z_l \in \mathbb{R}^{m}\}_{l=1}^{k}$ is the set of $k$ cluster centers, $G_t$ is the set of features in the $t$th of the $T$ feature groups, and $V_l$ is specified by $V_0$, i.e., $V_l = V_0 \odot V_l'$.
Two revisions are made to the original FG-k-means [7]: the mass-based dissimilarity replaces the Euclidean distance, and an orthogonality constraint is imposed on the matrices of individual feature weights $V_l$. Below, we discuss the techniques used to solve the above optimization problem.
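As a reference point before the optimization steps, the following sketch evaluates the objective in Equation (6) for given memberships, centers and weights. The array layout and the per-feature dissimilarity callable d(a, b) are our own assumptions for illustration; in the actual algorithm, d is the mass-based dissimilarity of Section 4.2.

```python
import numpy as np

def revised_fg_kmeans_objective(X, H, Z, W, V, groups, d, lam):
    """Value of Equation (6).
    X: (n, m) data; H: (n, k) binary memberships; Z: (k, m) cluster centers;
    W: (k, T) feature-group weights; V: (k, m) feature weights;
    groups: list of T index arrays (the feature groups G_1, ..., G_T);
    d(a, b): per-feature dissimilarity; lam: weight of the entropy term."""
    n, _ = X.shape
    k = Z.shape[0]
    total = 0.0
    for l in range(k):
        for t, G in enumerate(groups):
            for i in range(n):
                if H[i, l] == 1:
                    for j in G:
                        total += W[l, t] * V[l, j] * d(X[i, j], Z[l, j])
            total += lam * W[l, t] * np.log(W[l, t])  # entropy term on the group weights
    return total
```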

4.2. The Mass-Based Dissimilarity

Due to the curse of dimensionality, the Euclidean distances between objects in a high-dimensional space tend to be similar. To overcome this problem, we use the mass-based dissimilarity studied in [27]. This dissimilarity depends on the data distribution: it measures the probability mass of the smallest region covering the two objects. It provides a more effective match than distance measures in nearest-neighbor search for k-NN classifiers and information retrieval. It is named the $m_p$-dissimilarity and is defined in the same form as the $\ell_p$-norm, except that the dissimilarity in dimension $i$ is the probability mass of a region, $P(R_i(x; y))$, rather than the distance $|x_i - y_i|$.
Let $D$ be the data domain with probability density $F$. The mass-based dissimilarity of objects $x$ and $y$ in $D$ is defined via the probability of $R(x, y \mid H; D)$, where $H \in \mathcal{H}(D)$ is a hierarchical model that partitions the data space into non-overlapping and non-empty regions. The smallest local region covering $x$ and $y$ with regard to $H$ and $D$ is defined as
$R(x, y \mid H; D) = \underset{r\ \mathrm{s.t.}\ \{x, y\} \subseteq r}{\arg\min}\ \sum_{z \in D} \mathbb{1}(z \in r), \qquad (8)$
where $\mathbb{1}(\cdot)$ is an indicator function.
The mass-based dissimilarity of $x$ and $y$ with regard to $H$ and $D$ is then defined as the expected probability of $R(x, y \mid H; D)$, i.e.,
$d(x, y) = \mathbb{E}_{\mathcal{H}(D)} \big[ P_F\big( R(x, y \mid H; D) \big) \big], \qquad (9)$
where $P_F(\cdot)$ is the probability with regard to $F$ and the expectation is taken over all models in $\mathcal{H}(D)$. In practice, the mass-based dissimilarity is estimated from a finite number of models $H_i \in \mathcal{H}(D)$, $i = 1, 2, \ldots, t$, as
$d(x, y) = \frac{1}{t} \sum_{i=1}^{t} \tilde{P}\big( R(x, y \mid H_i; D) \big), \qquad (10)$
where $\tilde{P}(R) = \frac{1}{|D|} \sum_{z \in D} \mathbb{1}(z \in R)$. Note that $R(x, y \mid H; D)$ is the smallest local region covering $x$ and $y$; it plays a role analogous to that of the shortest distance between $x$ and $y$ in a geometric model.
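The expectation in Equation (10) can be estimated with randomly built partitioning models. The sketch below is only an illustration under our own simplifying assumptions (random axis-parallel splits, a fixed maximum depth, hypothetical function names); it is not the exact partitioning scheme used in [27].

```python
import numpy as np

def build_tree(idx, X, depth, max_depth, rng):
    """Recursively partition the objects in idx with a random feature and threshold."""
    if depth >= max_depth or len(idx) <= 2:
        return {"idx": idx}
    f = rng.integers(X.shape[1])
    lo, hi = X[idx, f].min(), X[idx, f].max()
    if lo == hi:
        return {"idx": idx}
    thr = rng.uniform(lo, hi)
    mask = X[idx, f] <= thr
    return {"idx": idx, "f": f, "thr": thr,
            "left":  build_tree(idx[mask],  X, depth + 1, max_depth, rng),
            "right": build_tree(idx[~mask], X, depth + 1, max_depth, rng)}

def region_mass(tree, X, a, b, n):
    """Mass of the smallest region of one tree that covers objects a and b."""
    node = tree
    while "f" in node:
        sa = X[a, node["f"]] <= node["thr"]
        sb = X[b, node["f"]] <= node["thr"]
        if sa != sb:          # a and b separate here: the current node is R(a, b | H)
            break
        node = node["left"] if sa else node["right"]
    return len(node["idx"]) / n

def mass_dissimilarity(X, a, b, n_trees=25, max_depth=8, seed=0):
    """Monte-Carlo estimate of Equation (10) over n_trees random partitioning models."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = [build_tree(np.arange(n), X, 0, max_depth, rng) for _ in range(n_trees)]
    return float(np.mean([region_mass(t, X, a, b, n) for t in trees]))
```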

4.3. Optimization of Equation (6)

We minimize the objective function in Equation (6) by iteratively solving the following four minimization problems:
1.
Problem $P_1$: Fix $Z = \hat{Z}$, $V = \hat{V}$ and $W = \hat{W}$, and solve the reduced problem $P(U)$. Problem $P_1$ is solved by
$u_{i,l} = 1 \ \text{if} \ D_l \le D_s \ \text{for} \ 1 \le s \le k; \qquad u_{i,s} = 0 \ \text{otherwise}, \qquad (11)$
where $D_s = \sum_{t=1}^{T} w_{s,t} \sum_{j \in G_t} v_{s,j}\, d(x_{i,j}, z_{s,j})$.
2.
Problem $P_2$: Fix $U = \hat{U}$, $V = \hat{V}$ and $W = \hat{W}$, and solve the reduced problem $P(Z)$. Problem $P_2$ is solved by updating the cluster centers with Algorithm 1.
Algorithm 1 Cluster center updating.
Input: U, W, V.
Output: The updated cluster centers Z.
  1:  Generate a similarity matrix $S \in \mathbb{R}^{n \times n}$, where $s_{i,j} = s_{j,i} = u_{i,j}\, d(x_i, x_j)$;
  2:  Generate a distance-sum matrix $S_{sum} \in \mathbb{R}^{1 \times n}$, where $s_i = \sum_{j=1}^{n} s_{j,i}$;
  3:  Generate a density matrix $D \in \mathbb{R}^{1 \times n}$; find the smallest value in every column $m$ of $DU$ and take the corresponding object as the center of cluster $m$, for $m = 1, 2, \ldots, k$;
  4:  return Z;
3.
Problem $P_3$: Fix $U = \hat{U}$, $Z = \hat{Z}$ and $V = \hat{V}$, and solve the reduced problem $P(W)$. Problem $P_3$ is solved by the following Theorem 1.
Theorem 1.
Let $U = \hat{U}$, $Z = \hat{Z}$ and $V = \hat{V}$ be fixed and $\lambda > 0$. Then $P(W)$ is minimized iff
$w_{l,t} = \exp\!\left( \frac{-D_{l,t}}{\lambda} \right) \Big/ \sum_{t'=1}^{T} \exp\!\left( \frac{-D_{l,t'}}{\lambda} \right), \qquad (12)$
where
$D_{l,t} = \sum_{i=1}^{n} \sum_{j \in G_t} \hat{u}_{i,l}\, \hat{v}_{l,j}\, d(x_{i,j}, \hat{z}_{l,j}). \qquad (13)$
Proof. 
Given $U = \hat{U}$, $Z = \hat{Z}$ and $V = \hat{V}$, we minimize the objective function with respect to $W$. Since there is a set of $k$ constraints $\sum_{t=1}^{T} w_{l,t} = 1$, we form the Lagrangian by isolating the terms that contain $\{w_{l,1}, w_{l,2}, \ldots, w_{l,T}\}$ and adding the corresponding Lagrange multiplier:
$L(w_{l,1}, w_{l,2}, \ldots, w_{l,T}) = \sum_{t=1}^{T} \big[ w_{l,t} D_{l,t} + \lambda\, w_{l,t} \log w_{l,t} \big] + \gamma_l \left( \sum_{t=1}^{T} w_{l,t} - 1 \right), \qquad (14)$
where $D_{l,t}$ is a constant for the $t$th feature group in the $l$th cluster when $U = \hat{U}$, $Z = \hat{Z}$ and $V = \hat{V}$ are fixed. By setting the gradient of $L$ with respect to $\gamma_l$ and $w_{l,t}$ to zero, we obtain
$\frac{\partial L}{\partial \gamma_l} = \sum_{t=1}^{T} w_{l,t} - 1 = 0 \qquad (15)$
and
$\frac{\partial L}{\partial w_{l,t}} = D_{l,t} + \lambda (1 + \log w_{l,t}) + \gamma_l = 0. \qquad (16)$
Solving these two equations for $w_{l,t}$, we obtain
$w_{l,t} = \exp\!\left( \frac{-D_{l,t}}{\lambda} \right) \Big/ \sum_{t'=1}^{T} \exp\!\left( \frac{-D_{l,t'}}{\lambda} \right). \qquad (17)$
 □
4.
Problem $P_4$: Fix $U = \hat{U}$, $Z = \hat{Z}$ and $W = \hat{W}$, and solve the reduced problem $P(V)$. Problem $P_4$ is solved as follows (a small code sketch of the resulting update is given after this list). Because of the additivity of the objective function in Equation (6), the problem can be divided into $k$ subproblems, one for each cluster. Let
$Q_l = \mathrm{diag}(w_l)^{T}\, \mathrm{diag}(w_l) \qquad (18)$
and
$q_{i,l} = h_{i,l}\, (x_i - z_l); \qquad (19)$
then the $l$th subproblem of the original problem can be written as
$\min_{V_l \in \mathbb{R}_{+}^{t \times m}} \ \sum_{i=1}^{n} q_{i,l}^{T} V_l^{T} Q_l V_l\, q_{i,l} \quad \mathrm{s.t.} \quad V_l V_l^{T} = I. \qquad (20)$
The subproblem in Equation (20) has nonnegativity and orthogonality constraints on the matrix $V_l$, which make it NP-hard to solve directly. The method used here is analogous to that of nonnegative matrix factorization (NMF). We replace the orthogonality constraint with a Frobenius-norm measure of orthogonality as a relaxation, i.e.,
$\min_{V \in \mathbb{R}_{+}^{t \times m}} \ \sum_{i=1}^{n} q_{i,l}^{T} V^{T} Q_l V q_{i,l} + \frac{\eta}{2} \left\| V V^{T} - I \right\|_{F}^{2}, \qquad (21)$
where $\eta \ge 0$ is a parameter that controls the orthogonality of $V$ explicitly. The Lagrangian of Equation (21) is
$L(V, \Lambda) = \sum_{i=1}^{n} q_{i,l}^{T} V^{T} Q_l V q_{i,l} + \frac{\eta}{2} \left\| V V^{T} - I \right\|_{F}^{2} - \mathrm{tr}(\Lambda V^{T}), \qquad (22)$
where $\Lambda$ is the Lagrange multiplier for the constraint $V \in \mathbb{R}_{+}^{t \times m}$. By setting $\nabla_{V} L = 0$, we have
$2 Q_l V S S^{T} + 2 \eta (V V^{T} - I) V - \Lambda = 0, \qquad (23)$
where $S = [q_{1,l}, q_{2,l}, \ldots, q_{n,l}] \in \mathbb{R}^{m \times n}$. According to the KKT complementary condition on $[V]_{i,j} \ge 0$, taking the Hadamard product with $V$ on both sides of Equation (23), we obtain
$\big( Q_l V S S^{T} + \eta (V V^{T} - I) V \big) \odot V = 0. \qquad (24)$
The multiplicative updating rule for $V_l$ is derived as
$V_{i,j} \leftarrow V_{i,j}\, \frac{\big[ Q_l V (S S^{T})^{-} + \eta V \big]_{i,j}}{\big[ Q_l V (S S^{T})^{+} + \eta V V^{T} V \big]_{i,j}}, \qquad (25)$
where $\eta$ controls the orthogonality among the rows of $V$, and $(\cdot)^{+}$ and $(\cdot)^{-}$ are the operators taking the positive and negative parts of the input matrix, respectively, i.e.,
$[(A)^{+}]_{i,j} = \begin{cases} [A]_{i,j} & \text{if } [A]_{i,j} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (26)$
and
$[(A)^{-}]_{i,j} = \begin{cases} \big| [A]_{i,j} \big| & \text{if } [A]_{i,j} < 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (27)$
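As referenced in Problem $P_4$ above, the following sketch applies the multiplicative rule of Equation (25) for one cluster. The inputs Q and S are assumed to be formed as in Equations (18) and (19); the iteration count and the small constant eps are our own choices for illustration.

```python
import numpy as np

def update_V(V, Q, S, eta, n_iter=50, eps=1e-12):
    """Multiplicative updates of Equation (25).
    V: (t, m) nonnegative feature-weight matrix for one cluster;
    Q: (t, t) diagonal group-weight matrix of Equation (18);
    S: (m, n) matrix stacking the residuals q_{i,l} of Equation (19);
    eta: orthogonality penalty of Equation (21)."""
    A = S @ S.T                                            # (m, m) residual Gram matrix
    A_pos, A_neg = np.maximum(A, 0), np.maximum(-A, 0)     # (SS^T)^+ and (SS^T)^-
    for _ in range(n_iter):
        num = Q @ V @ A_neg + eta * V
        den = Q @ V @ A_pos + eta * (V @ V.T @ V) + eps
        V = V * num / den                                  # element-wise update keeps V nonnegative
    return V
```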
In the next section, we introduce the method to optimize the latent feature grouping structure $V_0$.

4.4. Evolutionary Method to Select the Best Feature Grouping Structure $V_0$

The Darwinian evolutionary process [28] is used to search for the best feature grouping structure $V_0$. In this process, the feature grouping structures $V_0$ are encoded as chromosomes and the revised FG-k-means is used as the fitness function to evaluate the chromosomes. The best $V_0$ is selected through the evolution of generations. To our knowledge, this is the first attempt to use an evolutionary process to search for the best feature grouping structure of high-dimensional data.
We first present a classical evolutionary method in which the population searches for the best grouping of the feature set. Each individual chromosome encodes a feature grouping structure $V_0$. The chromosome $A^{i,g}$ of the $i$th individual in the $g$th generation is defined as
$A^{i,g} = \big( V_1^{i,g}, \ldots, V_k^{i,g}, \ldots, V_m^{i,g} \big), \qquad (28)$
where each $V_k^{i,g}$ is a binary sequence of length $t$ corresponding to the $k$th column of the matrix $V_0$, and $m$ is the number of features in the dataset. For example, the matrix $V_0$ in Equation (3) is encoded as
$A^{i,g} = \big\{ (1, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 1), (0, 1, 0) \big\}. \qquad (29)$
We can see that each binary sequence $V_k^{i,g}$ has exactly one element equal to 1 and the rest equal to 0; this is a constraint on the structure of the chromosomes. To start the evolutionary process, 20 chromosomes are generated randomly as the first generation of individuals. To generate the binary sequences of each chromosome, one position is randomly selected from the $t$ possible positions and set to 1, and the remaining $t - 1$ positions are set to 0.
After all chromosomes are initialized, they are evaluated by the revised FG-k-means algorithm. From each chromosome, the matrix $V_0$ is constructed. The matrices $V_l'$ are initialized by solving $V_l' V_l'^{T} = I$; since the solutions are not unique, different initial $V_l'$ are used for different clusters. The initial $V_l$ are then obtained by $V_l = V_0 \odot V_l'$.
The initial feature group weights and initial cluster centers are generated randomly, and the number of clusters $k$ is given. The revised FG-k-means algorithm is executed on the input dataset once for each chromosome to produce one clustering result. The Bayesian information criterion (BIC) is used to evaluate the clustering result and score the chromosome. After all chromosomes are scored, the genetic operations of selection, crossover and mutation are applied to produce the new chromosomes of the next generation as follows (a minimal sketch of these operations is given after the list).
  • The 10 strongest chromosomes, i.e., those with the best scores, are selected.
  • The crossover is performed in the following steps. The 10 selected chromosomes are randomly grouped into five pairs. For each pair of chromosomes $i$ and $j$, the corresponding binary sequences $V_k^{i,g}$ and $V_k^{j,g}$ are compared. If the two sequences are the same, the sequence is copied into the new chromosome as $V_k^{s,g+1}$. For the remaining pairs of differing binary sequences, we randomly select the sequence of one chromosome to replace the corresponding sequence of the other chromosome according to a probability $\alpha_k \in [0, 1]$. Finally, we encode the resulting $V_0$ as a new chromosome of the next generation. The rule to generate $V_k^{s,g+1}$ is defined as
    $V_k^{s,g+1} = \begin{cases} V_k^{i,g} & \text{if } V_k^{i,g} = V_k^{j,g} \text{ or } \alpha_k \le 0.5 \\ V_k^{j,g} & \text{otherwise}, \end{cases} \qquad (30)$
    where $\alpha_k$ is randomly generated for each $V_k^{s,g+1}$.
  • For the mutation, we randomly choose five chromosomes from the 10 selected chromosomes. For each chosen chromosome $A^{i,g}$, we randomly generate a new chromosome $A^{rand} = (V_1^{rand}, \ldots, V_k^{rand}, \ldots, V_m^{rand})$. The rule to generate $V_k^{i,g+1}$ is
    $V_k^{i,g+1} = \begin{cases} V_k^{i,g} & \text{if } \alpha_k \le 0.5 \\ V_k^{rand} & \text{otherwise}, \end{cases} \qquad (31)$
    where $\alpha_k$ is randomly generated for each $V_k^{i,g+1}$.
In this way, we generate 10 new chromosomes and combine them with the 10 strongest chromosomes to form a new population for exploration and exploitation in the next generation of evolution.
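A minimal sketch of the chromosome encoding and of the two genetic operators described above is given below (Python/NumPy). The function names and the random-number handling are ours; the selection probability of 0.5 follows Equations (30) and (31).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chromosome(m, t):
    """A chromosome is a (t, m) binary partition matrix V0 with exactly one 1 per column."""
    V0 = np.zeros((t, m), dtype=int)
    V0[rng.integers(t, size=m), np.arange(m)] = 1
    return V0

def crossover(Vi, Vj):
    """Column-wise crossover of Equation (30): equal columns are copied, otherwise the
    column of the first parent is kept with probability 0.5."""
    keep_i = (Vi == Vj).all(axis=0) | (rng.random(Vi.shape[1]) <= 0.5)
    return np.where(keep_i, Vi, Vj)

def mutate(Vi):
    """Column-wise mutation of Equation (31): a column is kept with probability 0.5,
    otherwise it is replaced by the corresponding column of a random chromosome."""
    Vr = random_chromosome(Vi.shape[1], Vi.shape[0])
    keep = rng.random(Vi.shape[1]) <= 0.5
    return np.where(keep, Vi, Vr)
```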

4.5. LFGL Algorithm

The process of learning the latent feature grouping structure $V_0$, together with the individual feature weights, the feature group weights and the clusters of the input dataset, consists of three stages. The initialization stage generates the first generation of 20 chromosomes representing 20 initial $V_0$ structures. The second stage uses the revised FG-k-means algorithm to score the 20 chromosomes. The third stage selects the 10 strongest chromosomes according to the scores and performs the genetic operations on the selected chromosomes to produce the new generation of chromosomes. This process continues until the termination criterion is met. The evolution process of the LFGL algorithm is summarized in Algorithm 2, and a high-level sketch of the outer loop is given after it.
Algorithm 2 LFGL algorithm.
Input: The dataset X, the number of clusters k, two positive parameters $\lambda$ and $\eta$, and the number of feature groups t.
Output: Local optimal values of $H$, $Z$, $V$ and $W$.
  1:  Initialize 20 chromosomes representing 20 different possibilities of feature grouping;
  2:  For each chromosome, initialize $W$ by sampling positive values $[w_l]_i \sim N(1, 0.01)$, then normalize $w_l$ so that $\mathbf{1}^{T} w_l = 1$;
  3:  Initialize $V$ with the method described in Section 4.4 to build the $V$ matrix, then normalize $V_l$ so that the $\ell_2$-norm of each row of $V_l$ is 1;
  4:  Randomly choose k cluster centers $Z^{0}$;
  5:  Update $H^{t+1}$, $Z^{t+1}$, $W^{t+1}$ and $V^{t+1}$, respectively;
  6:  Repeat Step 5 until the objective function P reaches its local minimum;
  7:  Calculate the BIC of the 20 clustering results from the 20 chromosomes, choose the best 10 and create 10 new chromosomes by crossover and mutation;
  8:  Repeat Steps 2–7 ten times and return the best clustering solution found.
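To complement Algorithm 2, the following sketch shows only the outer evolutionary loop. It reuses the random_chromosome, crossover and mutate helpers from the sketch in Section 4.4, and fitness is a placeholder that is assumed to run the revised FG-k-means for one chromosome and return its BIC score (we assume here that a lower score is better).

```python
import numpy as np

def lfgl_search(X, t, fitness, n_generations=10, pop_size=20, seed=0):
    """Outer evolutionary loop of Algorithm 2 (sketch only).
    fitness(V0) is assumed to run the revised FG-k-means with grouping V0 and
    return a BIC score for the resulting clustering (lower is assumed better)."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    population = [random_chromosome(m, t) for _ in range(pop_size)]
    best_V0, best_score = None, np.inf
    for _ in range(n_generations):
        scores = [fitness(V0) for V0 in population]
        order = np.argsort(scores)
        if scores[order[0]] < best_score:
            best_V0, best_score = population[order[0]], scores[order[0]]
        parents = [population[i] for i in order[:pop_size // 2]]   # the 10 strongest chromosomes
        pairs = rng.permutation(len(parents)).reshape(-1, 2)       # 5 random pairs for crossover
        children = [crossover(parents[a], parents[b]) for a, b in pairs]
        picks = rng.choice(len(parents), size=5, replace=False)    # 5 chromosomes for mutation
        children += [mutate(parents[i]) for i in picks]
        population = parents + children                            # the new population of 20
    return best_V0, best_score
```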

5. Experiments

We tested the performance of the LFGL algorithm on several high-dimensional datasets. The results were compared with those of five existing clustering algorithms: k-means [29,30], TWKM [6], EWKM [4], LAC [16] and FG-k-means [7].

5.1. Datasets

Seven real-world high-dimensional datasets were used to evaluate the LFGL algorithm: six gene datasets downloaded from http://archive.ics.uci.edu/ml/datasets.html and one text dataset obtained from http://www.escience.cn/people/fpnie/papers.html. The common characteristic of these datasets is a small number of objects with a large number of features. The details of the datasets are listed in Table 1.

5.2. Evaluation Measures

Five measures were used to evaluate the clustering results of the six algorithms. Each algorithm was run on each dataset 100 times, and the average of the 100 results was used as the measure of the algorithm on that dataset. Let $L$ be the partition of a dataset given by the labeled classes and $\hat{L}$ the partition produced by a clustering algorithm. A confusion matrix, as shown in Table 2, can be generated to describe the correspondence between the true classes and the clustering results. The entries $TP$, $FN$, $FP$ and $TN$ are the numbers of objects satisfying the corresponding true and predicted conditions. The five measures are defined from the confusion matrix as follows (a small sketch computing them is given after the list).
  • Accuracy is defined as
    $Accuracy = \frac{TP + TN}{TP + FP + TN + FN};$
  • Rand Index is defined as
    $Rand\ Index = \frac{TP + FN}{TP + FP + TN + FN};$
  • Precision is defined as
    $Precision = \frac{TP}{TP + FP};$
  • Recall is defined as
    $Recall = \frac{TP}{TP + FN};$
  • F-measure is defined as
    $F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$
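As noted above, the five measures can be computed directly from the counts of Table 2; a minimal sketch follows (the function name is ours).

```python
def clustering_measures(TP, FP, FN, TN):
    """The five measures of Section 5.2, computed from the confusion-matrix counts of Table 2."""
    total = TP + FP + TN + FN
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return {
        "accuracy":   (TP + TN) / total,
        "rand_index": (TP + FN) / total,   # Rand Index as defined in the text
        "precision":  precision,
        "recall":     recall,
        "f_measure":  2 * precision * recall / (precision + recall),
    }
```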

5.3. Parameters Settings

Algorithm 2 involves two parameters, $\lambda$ and $\eta$, which may affect the performance of LFGL. We set $\lambda$ to values in $\{1, 2, 3, 4, 5, 8, 10, 14, 16, 20\}$ and $\eta$ to values in $\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$. For each combination of $\lambda$ and $\eta$, we ran LFGL 100 times on the Prostate dataset and recorded the five measures mentioned above. The results are illustrated in Figure 2. We did not observe a significant effect of the parameters, so $\lambda = 1$ and $\eta = 1$ were used in the subsequent experiments.

5.4. Clustering Results and Analysis

The clustering results on the six gene datasets and the text dataset are shown in Table 3. From the results, we can see that the LFGL algorithm outperformed the five other clustering algorithms on most datasets. In particular, LFGL significantly outperformed all five other algorithms on the Prostate dataset, while on the other datasets it produced results similar to those of the other algorithms. These results show that the LFGL algorithm is effective in clustering high-dimensional data. LFGL was designed particularly with the gene datasets in mind, to investigate the relations between human genes and diseases; it is an extra gain that it also performed well on the text dataset.

5.5. Feature Grouping Analysis

We conducted experiments to investigate the evolution of the chromosomes, i.e., of the feature grouping structure $V_0$, during the evolutionary process. Figure 3 shows the results on the SRBCT dataset. Figure 3a shows the Rand Index of the 10 strongest chromosomes in different generations. We can see that the clustering results of the LFGL algorithm improved with the increase of generations and became stable after some generations, which indicates that the evolutionary process optimized the clustering by searching for the optimal feature grouping structure $V_0$. Figure 3b shows the standard deviation of the mutual dissimilarities among the 10 strongest chromosomes in each generation. The standard deviation drops continuously as the generations increase, which indicates that the strongest chromosomes tend to become similar during the evolution and implies the convergence of the optimal feature grouping structure.

6. Conclusions and Future Works

In this paper, we presented a new method to automatically find the latent feature grouping structure in high-dimensional data. The latent feature group learning (LFGL) algorithm is proposed to cluster high-dimensional data in subspaces of feature groups and individual features. The Darwinian evolution process is used to search for the optimal grouping structures, and the revised FG-k-means is used to evaluate the feature groupings and cluster the data. The experimental results on different kinds of datasets show that the LFGL algorithm outperformed five existing clustering algorithms; in addition, the clustering results were evaluated with respect to the accuracy of the feature groupings. Future work will focus on two directions. First, we will seek real applications for the LFGL algorithm; integrating LFGL with feature selection [31,32] to improve the generalization of the learning algorithm is a promising direction. Second, we will extend the LFGL algorithm to big data analysis and management [33,34].

Author Contributions

Methodology, W.W.; Conceptualization, Y.H.; Validation, L.M.; Writing-Original Draft Preparation, W.W.; Writing-Review & Editing, Y.H.; Supervision, J.Z.H.

Funding

We would like to thank the editors and two anonymous reviewers whose meticulous readings and valuable suggestions helped us to improve this paper significantly. This paper was supported by National Key R&D Program of China (2017YFC0822604-2), China Postdoctoral Science Foundation (2016T90799) and Scientific Research Foundation of Shenzhen University for Newly-introduced Teachers (2018060).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Donoho, D.L. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Chall. Lect. 2000, 1, 375. [Google Scholar]
  2. Parsons, L.; Haque, E.; Liu, H. Subspace clustering for high dimensional data: A review. ACM Sigkdd Explor. Newsl. 2004, 6, 90–105. [Google Scholar] [CrossRef]
  3. Kriegel, H.P.; Kröger, P.; Zimek, A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 2009, 3, 1. [Google Scholar] [CrossRef]
  4. Jing, L.; Ng, M.K.; Huang, J.Z. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans. Knowl. Data Eng. 2007, 8, 1026–1041. [Google Scholar] [CrossRef]
  5. Huang, J.Z.; Ng, M.K.; Rong, H.; Li, Z. Automated variable weighting in k-means type clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 5, 657–668. [Google Scholar] [CrossRef]
  6. Chen, X.; Xu, X.; Huang, J.Z.; Ye, Y. TW-k-means: Automated two-level variable weighting clustering algorithm for multiview data. IEEE Trans. Knowl. Data Eng. 2013, 25, 932–944. [Google Scholar] [CrossRef]
  7. Chen, X.; Ye, Y.; Xu, X.; Huang, J.Z. A feature group weighting method for subspace clustering of high-dimensional data. Pattern Recognit. 2012, 45, 434–446. [Google Scholar] [CrossRef]
  8. Gnanadesikan, R.; Kettenring, J.R.; Tsao, S.L. Weighting and selection of variables for cluster analysis. J. Classif. 1995, 12, 113–136. [Google Scholar] [CrossRef]
  9. De Soete, G. Optimal variable weighting for ultrametric and additive tree clustering. Qual. Quant. 1986, 20, 169–180. [Google Scholar] [CrossRef]
  10. De Soete, G. OVWTRE: A program for optimal variable weighting for ultrametric and additive tree fitting. J. Classif. 1988, 5, 101–104. [Google Scholar] [CrossRef]
  11. Fowlkes, E.B.; Gnanadesikan, R.; Kettenring, J.R. Variable selection in clustering. J. Classif. 1988, 5, 205–228. [Google Scholar] [CrossRef]
  12. Makarenkov, V.; Leclerc, B. An algorithm for the fitting of a tree metric according to a weighted least-squares criterion. J. Classif. 1999, 16, 3–26. [Google Scholar] [CrossRef]
  13. Makarenkov, V.; Legendre, P. Optimal variable weighting for ultrametric and additive trees and K-means partitioning: Methods and software. J. Classif. 2001, 18, 245–271. [Google Scholar]
  14. Modha, D.S.; Spangler, W.S. Feature weighting in k-means clustering. Mach. Learn. 2003, 52, 217–237. [Google Scholar] [CrossRef]
  15. Friedman, J.H.; Meulman, J.J. Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. B 2004, 66, 815–849. [Google Scholar] [CrossRef]
  16. Domeniconi, C.; Gunopulos, D.; Ma, S.; Yan, B.; Al-Razgan, M.; Papadopoulos, D. Locally adaptive metrics for clustering high dimensional data. Data Min. Knowl. Discov. 2007, 14, 63–97. [Google Scholar] [CrossRef]
  17. Hoff, P.D. Model-based subspace clustering. Bayesian Anal. 2006, 1, 321–344. [Google Scholar] [CrossRef]
  18. Bouveyron, C.; Girard, S.; Schmid, C. High-dimensional data clustering. Comput. Stat. Data Anal. 2007, 52, 502–519. [Google Scholar] [CrossRef] [Green Version]
  19. Tsai, C.Y.; Chiu, C.C. Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm. Comput. Stat. Data Anal. 2008, 52, 4658–4672. [Google Scholar] [CrossRef]
  20. Deng, Z.; Choi, K.S.; Chung, F.L.; Wang, S. Enhanced soft subspace clustering integrating within-cluster and between-cluster information. Pattern Recognit. 2010, 43, 767–781. [Google Scholar] [CrossRef]
  21. Cheng, H.; Hua, K.A.; Vu, K. Constrained locally weighted clustering. Proc. VLDB Endow. 2008, 1, 90–101. [Google Scholar] [CrossRef] [Green Version]
  22. De Amorim, R.C.; Mirkin, B. Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering. Pattern Recognit. 2012, 45, 1061–1075. [Google Scholar] [CrossRef]
  23. Fan, W.; Bouguila, N.; Ziou, D. Unsupervised hybrid feature extraction selection for high-dimensional non-Gaussian data clustering with variational inference. IEEE Trans. Knowl. Data Eng. 2013, 25, 1670–1685. [Google Scholar] [CrossRef]
  24. Xia, H.; Zhuang, J.; Yu, D. Novel soft subspace clustering with multi-objective evolutionary approach for high-dimensional data. Pattern Recognit. 2013, 46, 2562–2575. [Google Scholar] [CrossRef]
  25. Cai, Y.; Chen, X.; Peng, P.X.; Huang, J.Z. A LDA feature grouping method for subspace clustering of text data. In Proceedings of the Pacific-Asia Workshop on Intelligence and Security Informatics, Tainan, Taiwan, 13 May 2014; pp. 78–90. [Google Scholar]
  26. Gan, G.; Ng, M.K.P. Subspace clustering with automatic feature grouping. Pattern Recognit. 2015, 48, 3703–3713. [Google Scholar] [CrossRef]
  27. Ting, K.M.; Zhu, Y.; Carman, M.; Zhu, Y.; Zhou, Z.H. Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1205–1214. [Google Scholar]
  28. Gançarski, P.; Blansché, A. Darwinian, Lamarckian, and Baldwinian (Co) Evolutionary Approaches for Feature Weighting in K-means-Based Algorithms. IEEE Trans. Evol. Comput. 2008, 12, 617–629. [Google Scholar] [CrossRef]
  29. Hu, Q.; Xie, S.; Lin, S.; Wang, S.; Philip, S.Y. Clustering embedded approaches for efficient information network inference. Data Sci. Eng. 2016, 1, 29–40. [Google Scholar] [CrossRef]
  30. Mai, S.T.; Amer-Yahia, S.; Bailly, S.; Pépin, J.L.; Chouakria, A.D.; Nguyen, K.T.; Nguyen, A.D. Evolutionary active constrained clustering for obstructive sleep apnea analysis. Data Sci. Eng. 2018, 3, 359–378. [Google Scholar] [CrossRef]
  31. Song, Q.; Ni, J.; Wang, G. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE Trans. Knowl. Data Eng. 2013, 25, 1–14. [Google Scholar] [CrossRef]
  32. García-Torres, M.; Gómez-Vela, F.; Melián-Batista, B.; Moreno-Vega, J.M. High-dimensional feature selection via feature grouping: A Variable Neighborhood Search approach. Inf. Sci. 2016, 326, 102–118. [Google Scholar] [CrossRef]
  33. Ur Rehman, M.H.; Liew, C.S.; Abbas, A.; Jayaraman, P.P.; Wah, T.Y.; Khan, S.U. Big data reduction methods: a survey. Data Sci. Eng. 2016, 1, 265–284. [Google Scholar] [CrossRef]
  34. Vargas-Solar, G.; Zechinelli-Martini, J.L.; Espinosa-Oviedo, J.A. Big Data Management: What to Keep from the Past to Face Future Challenges? Data Sci. Eng. 2017, 2, 328–345. [Google Scholar] [CrossRef]
Figure 1. Latent feature grouping model.
Figure 2. The parameter control for Prostate dataset.
Figure 3. The comparison of chromosomes during the evolution process.
Table 1. Dataset Description.
Dataset | Objects | Features | Classes
SRBCT | 63 | 2308 | 4
Lymphoma | 62 | 4026 | 3
Prostate | 102 | 6033 | 2
Adenocarcinoma | 76 | 9868 | 2
Breast2classes | 77 | 4870 | 2
CNS | 60 | 7129 | 2
WebKB texas | 814 | 4029 | 7
Table 2. Confusion matrix.
Data | Predicted Positive | Predicted Negative
True condition positive | True Positive (TP) | False Negative (FN)
True condition negative | False Positive (FP) | True Negative (TN)
Table 3. Summary of clustering results corresponding to six clustering algorithms.
Data | Evaluation | k-Means | EWKM | TWKM | LAC | FG-k-Means | LFGL
SRBCT | Rand Index | 0.661 | 0.603 | 0.654 | 0.597 | 0.654 | 0.777
SRBCT | Accuracy | 0.528 | 0.476 | 0.530 | 0.413 | 0.555 | 0.512
SRBCT | Precision | 0.376 | 0.304 | 0.366 | 0.296 | 0.375 | 0.351
SRBCT | Recall | 0.353 | 0.344 | 0.380 | 0.339 | 0.385 | 0.428
SRBCT | F-Measure | 0.293 | 0.163 | 0.359 | 0.293 | 0.287 | 0.339
Lymphoma | Rand Index | 0.727 | 0.804 | 0.612 | 0.488 | 0.921 | 0.929
Lymphoma | Accuracy | 0.855 | 0.838 | 0.755 | 0.419 | 0.951 | 0.845
Lymphoma | Precision | 0.844 | 0.854 | 0.677 | 0.488 | 0.966 | 0.663
Lymphoma | Recall | 0.563 | 0.736 | 0.441 | 0.336 | 0.857 | 0.926
Lymphoma | F-Measure | 0.675 | 0.791 | 0.534 | 0.488 | 0.918 | 0.771
Prostate | Rand Index | 0.507 | 0.504 | 0.507 | 0.495 | 0.509 | 0.524
Prostate | Accuracy | 0.578 | 0.606 | 0.576 | 0.500 | 0.578 | 0.593
Prostate | Precision | 0.503 | 0.49 | 0.502 | 0.491 | 0.500 | 0.507
Prostate | Recall | 0.541 | 0.515 | 0.573 | 0.510 | 0.523 | 0.671
Prostate | F-Measure | 0.521 | 0.507 | 0.532 | 0.491 | 0.511 | 0.579
Adenocarcinoma | Rand Index | 0.528 | 0.730 | 0.605 | 0.588 | 0.552 | 0.730
Adenocarcinoma | Accuracy | 0.842 | 0.842 | 0.832 | 0.615 | 0.842 | 0.843
Adenocarcinoma | Precision | 0.722 | 0.743 | 0.731 | 0.642 | 0.708 | 0.738
Adenocarcinoma | Recall | 0.576 | 0.964 | 0.723 | 0.562 | 0.661 | 0.807
Adenocarcinoma | F-Measure | 0.641 | 0.839 | 0.719 | 0.692 | 0.682 | 0.768
Breast2classes | Rand Index | 0.546 | 0.507 | 0.479 | 0.491 | 0.508 | 0.547
Breast2classes | Accuracy | 0.662 | 0.584 | 0.574 | 0.470 | 0.584 | 0.611
Breast2classes | Precision | 0.537 | 0.507 | 0.452 | 0.404 | 0.506 | 0.523
Breast2classes | Recall | 0.696 | 0.757 | 0.608 | 0.603 | 0.682 | 0.763
Breast2classes | F-Measure | 0.607 | 0.608 | 0.527 | 0.481 | 0.599 | 0.611
CNS | Rand Index | 0.493 | 0.506 | 0.496 | 0.500 | 0.512 | 0.544
CNS | Accuracy | 0.661 | 0.661 | 0.661 | 0.615 | 0.555 | 0.595
CNS | Precision | 0.531 | 0.542 | 0.536 | 0.507 | 0.540 | 0.485
CNS | Recall | 0.583 | 0.584 | 0.545 | 0.609 | 0.683 | 0.737
CNS | F-Measure | 0.556 | 0.559 | 0.539 | 0.513 | 0.540 | 0.485
WebKB texas | Rand Index | 0.507 | 0.507 | 0.496 | 0.488 | 0.501 | 0.523
WebKB texas | Accuracy | 0.578 | 0.563 | 0.516 | 0.528 | 0.545 | 0.593
WebKB texas | Precision | 0.502 | 0.502 | 0.492 | 0.509 | 0.496 | 0.511
WebKB texas | Recall | 0.757 | 0.514 | 0.557 | 0.562 | 0.564 | 0.612
WebKB texas | F-Measure | 0.603 | 0.508 | 0.520 | 0.513 | 0.496 | 0.507
