1. Introduction
Information networks, an essential means of expressing complex relationships between objects, are ubiquitous in our daily lives and work. They include the social information networks formed by social media platforms such as WeChat and the semantic information networks made up of webpages on the Internet.
Spectral clustering, detailed in numerous studies [1,2,3,4], comprises three main steps: constructing a similarity graph, transforming the graph into low-dimensional vectors, and applying the K-means clustering algorithm. For example, Cai et al. [1,2] proved that a similarity graph based on the representative point strategy can be successfully applied to spectral clustering for information networks. Luo et al. [3] used K-means to select representative points from the original data and constructed a multi-layer kernel-free similarity graph based on these representative points. Shaham et al. [4] used a deep learning approach to overcome the scalability and generalization issues of low-dimensional embedding in spectral clustering. In summary, current optimization of spectral clustering algorithms mainly concerns the construction of similarity graphs and low-dimensional embedding representations.
With the development of information networks, designing effective algorithms that construct information network graphs to represent their semantic information has become a focal research issue. Traditional methods [5,6] generally adopt high-dimensional sparse vectors, representing the structure and semantic information of an information network through adjacency matrices and designing various application algorithms on top of this representation. However, high-dimensional sparse representations require more computational space and time as the network scale increases, leading to poor scalability and limited accuracy for graph-structured data.
The k-nearest neighbor (KNN) graph is a graph construction method used in manifold learning, underpinning methods such as Isomap [7] and locally linear embedding [8]. This type of model has good generalization capabilities and can map topological manifolds: when a high-dimensional manifold is embedded in a low-dimensional space, the properties of Euclidean space are still preserved locally. The KNN algorithm is known for its stability, simplicity, and efficiency. However, it presents limitations when handling large datasets or high-dimensional data, which can result in increased computational complexity and limited clustering quality. In recent years, building on nearest neighbor search methods, Cai et al. [1,3,9] improved the KNN graph by introducing a KNN graph clustering algorithm that utilizes representative points. These representative points stand in for each cluster during computation, and noise has minimal influence on this approach. Moreover, for KNN graph construction, the representative point graph has been applied in spectral clustering: by computing a low-rank approximation of the affinity tensor graph, the initial unified affinity matrix is refined with reliable information from the confidence affinity matrix.
For low-dimensional vector representation, there are many effective methods available. Most models [10,11] that learn graph construction first use linear encoding and then apply singular value decomposition (SVD) [12] for linear dimension reduction. Linear encoding [6,13] produces a sparse representation, where each data point is expressed as a linear combination of a small number of basis vectors. Mathematically, linear encoding is defined as finding two matrices U and Z whose product best approximates the data, i.e., $X \approx UZ$. Large-scale sparse coding clustering (LSC) [9] combines linear encoding, SVD, and K-means into a spectral clustering method. It has a solid theoretical basis and has contributed to significant advancements in clustering research on large-scale data. GraRep [14] is a model designed for learning weighted graph vertex representations by generating low-dimensional vectors to represent graphs. However, these methods all rely on SVD for linear dimension reduction, and more effective non-linear dimension reduction techniques have yet to be explored.
In recent years, deep learning [4,15] has been widely applied in graph representation learning. Implementing a greedy layer-wise training strategy can improve computational performance. Stacked autoencoders [16] are considered an effective method for mapping from the original representation space to low-dimensional spaces, with the capability to produce highly non-linear projections.
More recently, sparse autoencoders have been used to replace the SVD step in spectral clustering. Shao et al. [17] proposed a deep architecture called DLC for fast graph clustering with linear encoders; their method combines feature embedding and linear encoding and is implemented through a deep hierarchical architecture. Cao et al. [18] proposed the DNGR model for learning graph representations. The model utilizes pointwise mutual information (PMI) to represent the original graph and then projects this representation into a low-dimensional space using stacked denoising autoencoders. The authors theoretically and empirically demonstrated the effectiveness of the model for clustering and classification. Tian et al. [13] indicated in their study that the computational complexity of autoencoders, which scales roughly linearly with the number of nodes, is considerably lower than the $O(n^3)$ eigendecomposition time of spectral clustering. They proposed a non-linear spectral clustering algorithm based on deep learning in which the graph is embedded through autoencoders, followed by clustering with the K-means algorithm. This study on non-linear spectral clustering algorithms made significant contributions to the field; however, it presents the following issues: (1) for large-scale data, directly representing graph structures with high-dimensional similarity matrices does not reflect the essential features of graph structures; and (2) the authors simply propose a framework for non-linear spectral clustering algorithms without providing theoretical proof.
In our study, we utilized a sparse representation of the graph similarity matrix, which typically enhances clustering accuracy by eliminating noise that can degrade clustering results. For representation learning, deep structures are used to construct features. Deep structures can benefit from hierarchical training and fine-tuning, which helps in finding improved local solutions that meet the expectation maximization objectives. The method proposed in this paper is called sparse encoder graph spectral clustering.
The algorithm proposed in this paper makes the following contributions:
In this study, we investigate the manifold-learning-based similarity matrix and the non-linear low-dimensional mapping of semantic information networks, and we define a model structure that, like spectral clustering algorithms, uses the K-means clustering algorithm.
Theoretically, the results presented in this paper prove that the model adopts a clustering method consistent with the spectral clustering objective function and captures the non-linear features of semantic information networks.
Empirically, our results demonstrate that the model can enhance the clustering performance of semantic information networks by capturing meaningful semantic, relational, and structural information, and that it can effectively complete clustering tasks.
The remainder of this paper is arranged as follows: in Section 2, we review related work; in Section 3, we describe the proposed framework, its theoretical analysis, its computational complexity analysis, and the optimization algorithm; in Section 4, we report the experimental results; and finally, we conclude the paper in Section 5.
3. Model Framework
The model proposed in this paper is driven by the three-step scheme of spectral clustering, which includes the construction of a similarity graph, low-dimensional graph embedding, and the K-means clustering algorithm. In our study, we first implemented the reconstruction of the similarity matrix for spectral clustering, followed by creating a low-dimensional embedded graph. Our spectral clustering model is intended for large-scale clustering; the basis of this model is therefore an effective method for constructing the similarity graph and performing the eigendecomposition of the Laplacian matrix. Specifically, we design the similarity graph S to be symmetric and positive semi-definite, satisfying the following form:

$$S = Z^{\top} Z,$$

where $Z \in \mathbb{R}^{p \times n}$ represents the p-dimensional representation of the original graph, with $p \ll n$.
The purpose of the spectral clustering model is to first represent the information network as a graph $G = (V, S)$, where S is the node similarity matrix. Next, it embeds S into a low-dimensional graph embedding matrix R and then identifies a disjoint partition $\{C_1, \dots, C_k\}$ using the K-means algorithm, where k represents the number of clusters.
3.1. Similarity Matrix
KNN [26] is a commonly used manifold learning method. For a given sample, we integrate the concept of k-nearest-neighbor information and distance measurement with the normalized Laplacian matrix of spectral clustering to derive a suitable weighted information network similarity matrix. This matrix can be represented by the following Formula (8):

$$S_{ij} = \begin{cases} \exp\!\left( -\dfrac{\| x_i - x_j \|^2}{2 \sigma^2} \right), & x_j \in N_k(x_i), \\ 0, & \text{otherwise}, \end{cases} \qquad (8)$$

where $N_k(x_i)$ denotes the set of k-nearest-neighbor nodes of $x_i$ and $\sigma$ is the kernel bandwidth.
S defines the similarity relationship between each node and all of its adjacent nodes; however, this representation is very time-consuming to build, especially for large-scale data. According to the SC method, for any data point $x_i$, our approximation can be represented as shown in Formula (9):

$$x_i \approx \hat{x}_i = \sum_{j=1}^{p} z_{ji} u_j = U z_i, \qquad (9)$$

where the landmark matrix U is formed from p landmark nodes selected randomly from the information network graph, generating the landmark node set $\{u_1, \dots, u_p\}$, and $z_i$ denotes the i-th column of the sparse representation $Z \in \mathbb{R}^{p \times n}$. Calculating the relationship values between U and the k-nearest-neighbor nodes, we obtain the similarity graph. According to the Nadaraya–Watson method [27], the sparse representation Z is defined as shown in Formula (10):

$$z_{ji} = \begin{cases} \dfrac{K(x_i, u_j)}{\sum_{j' \in U_{\langle i \rangle}} K(x_i, u_{j'})}, & u_j \in U_{\langle i \rangle}, \\ 0, & \text{otherwise}, \end{cases} \qquad (10)$$

where $K(\cdot)$ is the Gaussian kernel function and $U_{\langle i \rangle}$ denotes the set of nearest landmark nodes of $x_i$. Therefore, the similarity matrix S is defined as shown in Formula (11):

$$S = \hat{Z}^{\top} \hat{Z}, \qquad \hat{Z} = D^{-1/2} Z, \qquad (11)$$

where D is the diagonal matrix whose entries are the row sums of Z, i.e., $D_{jj} = \sum_{i=1}^{n} z_{ji}$.
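To make this construction concrete, the following Python sketch (a minimal illustration under our own assumptions, not the authors' released code) selects p random landmarks, computes Nadaraya–Watson weights over each node's nearest landmarks with a Gaussian kernel of assumed bandwidth sigma, and returns the normalized factor Ẑ of Formula (11).

```python
import numpy as np

def landmark_sparse_similarity(X, p=1000, n_neighbors=5, sigma=1.0, seed=0):
    """Landmark-based sparse similarity factor following Formulas (9)-(11).

    X           : (n, d) feature matrix of the information network nodes
    p           : number of randomly selected landmark nodes
    n_neighbors : nearest landmarks kept per node (sparsity level)
    sigma       : Gaussian kernel bandwidth (assumed hyperparameter)
    Returns Z_hat of shape (p, n) such that S = Z_hat.T @ Z_hat.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = X[rng.choice(n, size=min(p, n), replace=False)]        # landmark set U

    # Squared Euclidean distances between every node and every landmark
    # (for very large n this computation should be chunked).
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)   # (n, p)

    # Nadaraya-Watson weights over each node's nearest landmarks (Formula (10)).
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    K = np.exp(-np.take_along_axis(d2, idx, axis=1) / (2.0 * sigma ** 2))
    K /= K.sum(axis=1, keepdims=True)

    Z = np.zeros((U.shape[0], n))                               # sparse representation
    Z[idx.T, np.arange(n)] = K.T

    # Degree normalization Z_hat = D^{-1/2} Z with D_jj = sum_i z_ji (Formula (11)).
    D = Z.sum(axis=1)
    return Z / np.sqrt(np.maximum(D, 1e-12))[:, None]
```

The full n × n matrix S only needs to be materialized (as `Z_hat.T @ Z_hat`) when a dense similarity matrix is explicitly required; otherwise, downstream steps can work with Ẑ directly.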
3.2. Non-Linear Low-Dimensional Graph Embedding
First, we analyze the non-linear spectral clustering properties of the single-layer network. Second, we study the model representation and describe the deep embedding graph of the multi-layer network. Third, we analyze the computation process of the multi-layer network based on the objective function. Finally, we analyze the time performance of the algorithm.
3.2.1. Spectral Analysis on Single-Layer Graph
Using the method of graph dimension reduction based on non-linear projection, the projection of the sample points into the low-dimensional coordinate system is as follows:

$$r_{ij} = g\big( w_j^{\top} s_i \big), \qquad j = 1, \dots, d,$$

where $s_i$ is the i-th row vector of the similarity matrix S, $w_j$ is the j-th row vector of the parameter matrix W, $g(\cdot)$ is a non-linear mapping function, and $r_i = (r_{i1}, \dots, r_{id})$ is the low-dimensional embedding of $s_i$. The reconstruction $\hat{s}_i$ approximates $s_i$, which can be represented as

$$\hat{s}_i = \sum_{j=1}^{d} r_{ij} w_j = W^{\top} r_i.$$

The objective formula for the difference in distance between the original sample $s_i$ and the projected, reconstructed sample $\hat{s}_i$ is as follows:

$$\min_{W} \sum_{i=1}^{n} \big\| \hat{s}_i - s_i \big\|_2^2,$$

where $W W^{\top} = I$; hence, the objective formula solving process can be approximated as

$$\max_{W} \sum_{i=1}^{n} \big\| g\big( W s_i \big) \big\|_2^2 = \max_{W} \operatorname{tr}\Big( g\big( S W^{\top} \big)^{\top} g\big( S W^{\top} \big) \Big). \qquad (15)$$
Therefore, the model ensures that the hyperplane is sufficiently close to the sample points, while the projections of the sample points onto the hyperplane are separated as far as possible.
Notably, when the function $g(\cdot)$ is a linear function, this objective model transforms into the objective function of regular spectral clustering, as shown in Formula (16):

$$\max_{W} \operatorname{tr}\big( W S^{\top} S W^{\top} \big), \qquad \text{s.t. } W W^{\top} = I. \qquad (16)$$

Therefore, Formula (15) is defined as the non-linear objective model of spectral clustering.
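As a quick numerical illustration of this linear special case (a toy check we add here, not an experiment from the paper), the snippet below compares the reconstruction error obtained when the rows of W are the top right singular vectors of S, which maximize tr(WSᵀSWᵀ) under WWᵀ = I, with the error of a random orthonormal projection.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((200, 50))   # toy similarity matrix (rows are s_i)
d = 5                                # embedding dimension

def recon_error(W):
    # Linear case: s_hat_i = W^T (W s_i); error = ||S - S W^T W||_F^2
    return np.linalg.norm(S - S @ W.T @ W) ** 2

# Optimal linear choice: rows of W = top-d right singular vectors of S,
# which maximize tr(W S^T S W^T) subject to W W^T = I.
_, _, Vt = np.linalg.svd(S, full_matrices=False)
W_opt = Vt[:d]

# Random orthonormal rows for comparison.
W_rand = np.linalg.qr(rng.standard_normal((50, d)))[0].T

print(recon_error(W_opt), "<", recon_error(W_rand))   # the SVD subspace always wins
```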
3.2.2. Deep Graph Embedding on a Multi-Layer Graph
As previously stated, we can consider S as a training set containing n nodes, and we stack single-layer networks to form a multi-layer information network. The objective of Formula (15) is then applied to each layer in turn, with the output of layer l serving as the input of layer l + 1.
Assuming the multi-layer deep architecture of the information network shown in Figure 2, the input of the first layer is denoted as $a^{(1)} = S$; additional hidden layers can then be stacked on top of this architecture. It is widely known that multi-layer information networks are divided into an input layer, hidden layers, and an output layer; the structure of the i-th neuron in layer (l + 1) is shown in Figure 3.
The formula for the activation $a_i^{(l+1)}$ of this neuron is

$$a_i^{(l+1)} = f\left( \sum_{j=1}^{s_l} W_{ij}^{(l)} a_j^{(l)} + b_i^{(l)} \right),$$

where f is the neuron activation function, which is generally a sigmoid or tanh function, $a^{(l)}$ represents the data (activations) of layer l, and the number of neural units in layer l is denoted by $s_l$.
Assuming there are r training samples $\{(x^{(1)}, y^{(1)}), \dots, (x^{(r)}, y^{(r)})\}$ passed through the deep neural information network, if the input sample is $x^{(m)}$, the output of the output layer is $h_{W,b}(x^{(m)})$, and the actual (desired) output is $y^{(m)}$. Therefore, a loss function with a regularization term is defined as

$$J(W, b) = \frac{1}{r} \sum_{m=1}^{r} \frac{1}{2} \big\| h_{W,b}(x^{(m)}) - y^{(m)} \big\|^2 + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l+1}} \sum_{j=1}^{s_l} \big( W_{ij}^{(l)} \big)^2,$$

where the first term is the sum of squared errors of the output layer, L denotes the number of layers in the entire information network, $s_l$ indicates the number of units in layer l (excluding bias units), and $\lambda$ represents the regularization term parameter.
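The following sketch shows how the feedforward pass and the regularized loss J(W, b) above can be evaluated with a sigmoid activation; the layer widths, λ, and the random initialization are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Feedforward pass: returns the activations a^{(1)}, ..., a^{(L)}."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def loss(X, Y, weights, biases, lam=1e-4):
    """Regularized squared-error loss J(W, b) over the r samples in X (rows)."""
    r = X.shape[0]
    sq_err = sum(0.5 * np.sum((forward(x, weights, biases)[-1] - y) ** 2)
                 for x, y in zip(X, Y)) / r
    reg = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return sq_err + reg

# Illustrative random (non-zero) initialization of W^{(l)} and b^{(l)}.
sizes = [16, 8, 16]                       # example layer widths s_1, s_2, s_3
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
X = rng.random((5, 16))                   # 5 toy samples
print(loss(X, X, weights, biases))        # autoencoder-style target: Y = X
```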
3.2.3. Computation Analysis Based on Deep Graph Embedding
The regularization term stabilizes the algorithm and suppresses overfitting. $J(W, b)$ is a non-convex function; thus, the gradient descent method can easily converge to a local optimum. In addition, each parameter $W_{ij}^{(l)}$ and $b_i^{(l)}$ must be initialized randomly rather than all set to zero; otherwise, the hidden units in a layer would compute identical outputs for any input.
The update formula for the parameters $W_{ij}^{(l)}$ and $b_i^{(l)}$ is

$$W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b), \qquad b_i^{(l)} \leftarrow b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b),$$

where $\alpha$ is the learning-rate (step) parameter. The partial derivatives are obtained with the back-propagation algorithm, where $z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$ denotes the weighted input of layer $l + 1$, so that $a^{(l+1)} = f(z^{(l+1)})$. Assuming the residual (error) of unit i in layer $l + 1$ is denoted by $\delta_i^{(l+1)}$, then

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) = a_j^{(l)} \delta_i^{(l+1)} + \lambda W_{ij}^{(l)}.$$

Similarly, we calculate $\frac{\partial}{\partial b_i^{(l)}} J(W, b) = \delta_i^{(l+1)}$, where the residuals are computed backwards from the output layer; hence, for the output layer, $\delta_i^{(L)} = -\big(y_i - a_i^{(L)}\big) f'\big(z_i^{(L)}\big)$; thus, the entire output vector is $\delta^{(L)} = -\big(y - a^{(L)}\big) \odot f'\big(z^{(L)}\big)$, where $f'(\cdot)$ is the derivative of the activation function and ⊙ represents the Hadamard product. Therefore, the residual of a middle (hidden) layer is

$$\delta^{(l)} = \Big( \big(W^{(l)}\big)^{\top} \delta^{(l+1)} \Big) \odot f'\big(z^{(l)}\big),$$

where $l = L - 1, \dots, 2$. Thus, the parameters are trained using the gradient descent method:

$$W^{(l)} \leftarrow W^{(l)} - \alpha \left[ \frac{1}{r} \sum_{m=1}^{r} \delta^{(l+1, m)} \big( a^{(l, m)} \big)^{\top} + \lambda W^{(l)} \right], \qquad b^{(l)} \leftarrow b^{(l)} - \alpha \frac{1}{r} \sum_{m=1}^{r} \delta^{(l+1, m)},$$

where r represents the number of samples and the superscript m indexes the training samples.
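A minimal numpy sketch of the back-propagation recursion and the averaged gradient-descent update described above follows; the learning rate α, the regularization weight λ, and the use of the sigmoid derivative f′(z) = f(z)(1 − f(z)) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(X, Y, weights, biases, alpha=0.1, lam=1e-4):
    """One gradient-descent step on J(W, b) over the r samples in X (rows)."""
    r = X.shape[0]
    grad_W = [np.zeros_like(W) for W in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    for x, y in zip(X, Y):
        # Forward pass, keeping weighted inputs z^{(l)} and activations a^{(l)}.
        a, zs = [x], []
        for W, b in zip(weights, biases):
            zs.append(W @ a[-1] + b)
            a.append(sigmoid(zs[-1]))

        # Output-layer residual: delta^{(L)} = -(y - a^{(L)}) ⊙ f'(z^{(L)}).
        delta = -(y - a[-1]) * a[-1] * (1 - a[-1])
        for l in range(len(weights) - 1, -1, -1):
            grad_W[l] += np.outer(delta, a[l])
            grad_b[l] += delta
            if l > 0:  # propagate the residual to the previous hidden layer
                delta = (weights[l].T @ delta) * a[l] * (1 - a[l])

    # Averaged gradient-descent update with L2 regularization.
    for l in range(len(weights)):
        weights[l] -= alpha * (grad_W[l] / r + lam * weights[l])
        biases[l] -= alpha * (grad_b[l] / r)
    return weights, biases

# Example: one training step of a toy 16-8-16 autoencoder (Y = X).
# rng = np.random.default_rng(0)
# weights = [rng.normal(0, 0.1, (8, 16)), rng.normal(0, 0.1, (16, 8))]
# biases = [np.zeros(8), np.zeros(16)]
# X = rng.random((20, 16)); backprop_step(X, X, weights, biases)
```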
3.2.4. Computational Complexity Analysis of the Algorithm on a Multi-Layer Graph
In this paper, we refer to the proposed deep model as Sparse Encoding Graph Spectral Clustering, and its implementation process is shown in Algorithm 1.
Suppose we possess the original data $X \in \mathbb{R}^{n \times d}$ and use p landmark points; we then need $O(ndp)$ operations to construct the similarity matrix and on the order of $O\big(n \sum_{l} s_l s_{l+1}\big)$ operations per training epoch to complete the multi-layer graph embedding. K-means requires $O(n k s_m t)$, where $s_m$ is the dimension of the final embedding, k is the number of clusters, and t is the number of iterations. Hence, the overall computational complexity of Algorithm 1 grows linearly with the number of nodes n.
Algorithm 1: Sparse Encoding Graph Spectral Clustering
Require: Original information network X, number of landmark nodes p, and m-layer information network structure;
Ensure: Clustering partitions $\{C_1, \dots, C_k\}$.
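Putting the pieces together, the sketch below mirrors the embedding and clustering steps of Algorithm 1. It uses scikit-learn's MLPRegressor, trained to reconstruct S, as a stand-in for the stacked sparse encoder; this substitution, the hidden-layer sizes, and the helper name are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

def deep_embed_and_cluster(S, n_clusters, hidden=(64, 16, 64), seed=0):
    """Embedding and clustering steps of Algorithm 1 on a precomputed
    similarity matrix S (n x n).  An off-the-shelf MLP trained to
    reconstruct S stands in for the stacked sparse encoder; its middle
    hidden layer is used as the low-dimensional embedding R."""
    ae = MLPRegressor(hidden_layer_sizes=hidden, activation="logistic",
                      max_iter=500, random_state=seed)
    ae.fit(S, S)                               # autoencoder-style training

    # Forward pass up to the middle hidden layer to obtain R.
    R = S
    n_encode = len(hidden) // 2 + 1            # weight matrices up to the bottleneck
    for W, b in list(zip(ae.coefs_, ae.intercepts_))[:n_encode]:
        R = 1.0 / (1.0 + np.exp(-(R @ W + b)))

    # K-means on the embedding R gives the clustering partitions.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(R)

# Usage with the landmark-based similarity from Section 3.1 (hypothetical data):
# Z_hat = landmark_sparse_similarity(X, p=1000)
# labels = deep_embed_and_cluster(Z_hat.T @ Z_hat, n_clusters=k)
```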
4. Experiment
In the following section, we report on the experiments conducted in our study. First, we introduce the datasets used in the experiments, the evaluation metrics, and the benchmark algorithms. Subsequently, we demonstrate the effectiveness of the proposed algorithm through empirical testing.
4.1. Experimental Data
To evaluate model performance, we tested the model on various real-world datasets obtained from the UCI Machine Learning Repository and on social information network datasets. A brief description of each dataset is provided below:
Wine: The Wine dataset, sourced from the UCI Machine Learning Repository (Asuncion and Newman, 2007), comprises 178 instances and 13 features. The data are the results of chemical analyses of wines, and all samples are labeled with one of three wine categories. In our study, we established a similarity graph from these features.
Pendigit: This dataset is a handwritten digit dataset consisting of 3498 samples and 16 features created by 44 writers.
MNIST: This dataset consists of handwritten digits and is available on Yann LeCun's webpage. Each image is represented as a 784-dimensional vector, and 4000 images are used in total.
Blogdata: This dataset is a social relationship information network provided by blog authors. This information network contains 10,321 nodes, 333,983 edges, and 39 different cluster labels. These labels represent the blog interests inferred from or provided by the blog authors.
20 newsgroups: This dataset is a news social network containing 18,846 newsgroup documents and 26,214 features divided into 20 different groups.
Seismic: This dataset is used to classify different types of moving vehicles; we selected 98,000 samples with 50 features and 3 classes.
Covtype: This dataset is a forest cover-type dataset containing 581,012 instances, 54 features, and 2 classes. It contains tree observations from four areas of the Roosevelt National Forest in Colorado, USA.
Detailed information on all of these semantic networks is summarized in
Table 1.
4.2. Benchmark Algorithms
To demonstrate the effectiveness of the proposed model, the following methods were used as benchmark algorithms. A brief description of each method is provided below:
PCA [20]: A common matrix decomposition method used for dimensionality reduction or feature extraction. PCA defines a loss function based on the distance between the original sample and the reconstructed sample after projection. PCA is a linear graph embedding method. The clustering result is then obtained using the K-means algorithm.
SC [2]: Spectral clustering with an anchor graph based on set-to-set distances derived from a local covariance matrix representation.
GraphEncoder [13]: A method for learning graph representations. This method explores non-linear graph embedding using deep learning and obtains clustering results by running the K-means algorithm on the embedding.
DCN (deep clustering network) [28]: The DCN performs joint dimensionality reduction and clustering. Reduction is achieved by training a deep neural network that can approximate any non-linear function.
4.3. Evaluation Criteria
In this paper, we evaluate the performance of the aforementioned clustering algorithms using the accuracy (ACC) and normalized mutual information (NMI) of the clustering results. ACC evaluates the performance of an algorithm by comparing the clustering labels it produces with the actual clustering labels of the nodes. ACC is defined as follows:

$$ACC = \frac{1}{n} \sum_{i=1}^{n} \delta\big( b_i, \mathrm{map}(a_i) \big),$$

where n represents the number of nodes, $a_i$ represents the clustering label assigned to the i-th node by the algorithm, $b_i$ represents the real clustering label of the i-th node, $\mathrm{map}(\cdot)$ is the optimal permutation mapping between the algorithm's labels and the real labels, and $\delta(x, y)$ equals 1 if $x = y$ and 0 otherwise.
NMI is the mutual information entropy between the algorithm's clustering labels and the actual clustering labels. Mutual information entropy treats a partition as the probability distribution of nodes falling into clusters. It is defined as follows:

$$NMI(C^a, C^b) = \frac{-2 \sum_{i=1}^{k_a} \sum_{j=1}^{k_b} n_{ij} \log \dfrac{n_{ij}\, n}{n_i^a\, n_j^b}}{\sum_{i=1}^{k_a} n_i^a \log \dfrac{n_i^a}{n} + \sum_{j=1}^{k_b} n_j^b \log \dfrac{n_j^b}{n}},$$

where $C^a$ and $C^b$ represent the two clustering partitions, $k_a$ and $k_b$ are their numbers of clusters, $n_{ij}$ represents the number of nodes belonging to both cluster $C_i^a$ and cluster $C_j^b$, $n_i^a$ indicates the number of nodes belonging to the i-th cluster $C_i^a$, and $n_j^b$ indicates the number of nodes belonging to the j-th cluster $C_j^b$. The value of NMI ranges from 0 to 1, taking the value 1 when the partitions a and b are identical.
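For completeness, ACC with the usual best-permutation label matching (computed with the Hungarian algorithm) and NMI can be evaluated as follows; this is a standard evaluation sketch with hypothetical labels, not code from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC with the optimal one-to-one mapping between predicted and true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                          # contingency counts n_ij
    rows, cols = linear_sum_assignment(-cost)    # maximize the matched count
    return cost[rows, cols].sum() / len(y_true)

# Example with hypothetical labels:
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))            # 1.0 after label matching
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0
```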
4.4. Parameter Settings
In our experiments, each sample from all datasets was normalized to unit length. We set the neighborhood size in the sparse KNN graph search to five. We then examined the relationship between the parameters p (the number of landmark points) and m (the number of depth layers) and the performance indicators ACC and NMI on real small- and medium-sized datasets from various sources.
Experiments were conducted on the proposed model, setting the landmark number parameter p from 200 to 1200 in intervals of 100, with the derived ACC results shown in Figure 4. From Figure 4, it can be seen that the ACC value varies with the number of landmarks, which is attributed to the random landmark selection method. Evidently, having more landmarks can enhance the final ACC performance; however, the change was not found to be significant in our study. Figure 5 illustrates the relationship between the number of landmarks and NMI. The NMI value generally increases as the number of landmarks increases, but with limited magnitude.
Examining the settings of the depth structure (the number of layers), the ACC value of the first layer of the proposed model on the datasets in Figure 6 was obtained by directly applying K-means to the input normalized sparse similarity graph. It can be observed from Figure 6 that as the layers become deeper, the ACC value increases. This suggests that deep structures play a crucial role in generating high-quality clustering results. In Figure 7, as the number of layers increases, the NMI values for large-scale datasets such as Blogdata and 20 newsgroups also increase.
In summary, the results presented in this paper suggest using five layers for medium- to large-scale datasets; small-scale datasets, in contrast, do not require more than three layers. Table 2 shows the number of nodes in each layer. Moreover, without loss of generality, we set the number of landmarks to 1000 in this study.
All neural networks use the sigmoid function as the activation function of each layer, and the sparsity target (the target value for hidden-layer activations) is 0.05. Specific parameter information is shown in Table 3.
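The sparsity target of 0.05 is the desired average activation of the hidden units; in sparse autoencoders this is commonly enforced with a KL-divergence penalty added to the loss. The snippet below shows that common penalty form as an illustration (the exact penalty used in the paper is not spelled out, and the weight beta is an assumption).

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """KL-divergence sparsity penalty, summed over hidden units.

    hidden_activations : (r, h) sigmoid activations over r samples
    rho  : sparsity target (desired average activation), here 0.05
    beta : penalty weight (illustrative value)
    """
    rho_hat = hidden_activations.mean(axis=0).clip(1e-8, 1 - 1e-8)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return beta * kl.sum()
```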
4.5. Algorithm Comparison
Table 4 presents the average running time of the proposed algorithm in comparison with similar deep neural network models such as GraphEncoder, indicating that their running times are approximately equal. This result meets the goal of reducing running time through the sparse embedding representation.
In terms of clustering performance, Table 5 summarizes the clustering results on the seven datasets, indicating that the proposed model offers relatively higher ACC. Table 6 displays the comparison values of the various algorithms on large-scale data, demonstrating that the proposed model's NMI advantage grows stronger as the data scale increases.
From the results shown in Table 5 and Table 6, it can be seen that the model proposed in this paper exhibits superior performance compared to the other models. As shown in Table 5, the proposed model achieved higher ACC; this improvement over the other algorithms is significant and is attributed to the optimization of graph similarity. In addition, as the graph uses a non-linear projection, it can be observed from Table 6 that the NMI values tend to be optimal across the various types of large-scale datasets.
The advantage of this model lies in using sparse similarity graphs to replace the original graph's similarity matrix and adopting a non-linear encoding scheme instead of eigendecomposition. In particular, in the final step, the low-dimensional features input into K-means play the role of the spectral embedding in standard spectral clustering. Furthermore, the non-linear encoding scheme reduces the dimensionality of the data.