1. Introduction
Graph-structured data contain abundant node features and topological information. Owing to this expressive power, graphs are widely used to model drug discovery [1], social networks [2], and recommender systems [3].
Moreover, graph clustering has attracted extensive attention as an important part of unsupervised learning on graphs. Graph clustering covers several directions, such as node-wise clustering and graph-wise clustering. In recent years, Graph Neural Networks (GNNs) have become a popular field in deep learning and have effectively improved the performance of graph clustering. Graph Convolutional Networks (GCNs) [4] are one of the representative methods that exploit node features and topology information to obtain low-dimensional embeddings. On this basis, many graph clustering models combined with deep learning techniques have been proposed and achieve state-of-the-art results. Kipf et al. [5] learn embeddings through GCN layers and then use a decoder to reconstruct features that are as similar as possible to the original ones. Ahn et al. [6] optimize this model to address the norm-zero tendency of isolated nodes. In addition, Pan et al. [7] combine graph convolution with adversarial training. However, the above models fail to account for the importance of individual nodes. Therefore, Veličković et al. [8] introduce the attention mechanism into graph convolution, which aggregates information according to node importance. Wang et al. [9] adopt an attention-based receptive field to encode the features and jointly optimize the clustering and embedding modules. Although models with a graph auto-encoder as the backbone network can generate embeddings effectively, they ignore the information in the data structure. Combining different orders of embeddings with structural information, Bo et al. [10] integrate structural information into deep clustering for the first time to improve the clustering effect. In addition, Li et al. [11] combine deep clustering with a graph auto-encoder and design a triple self-supervised module to supervise the embedding and clustering modules.
However, the above models have the following limitations: (1) existing models are limited by the computation pattern of graph convolution, so their computational costs increase exponentially as the graph grows; (2) stacking too many convolutional layers introduces over-smoothing noise and ignores the local graph structure; (3) the model depth is orthogonal to the neighborhood range, which makes it difficult to expand both.
Therefore, researchers have sought more scalable models. Klicpera et al. [12] propose a simple model that exploits the relation between graph convolution and PageRank, enabling efficient neighborhood expansion. Wu et al. [13] greatly reduce the computational costs by decoupling feature propagation from the training process. Following the idea of SGC, Frasca et al. [14] consider the features of different receptive layers and concatenate them so that no information is discarded, while Zhu et al. [15] average them to generate combined features with the same dimension. Meanwhile, Zhang et al. [16] simplify the GNN from the perspective of spectral graph theory so that it can select different orders of high-order information according to different graphs. Despite this, the above models ignore differences in node importance during aggregation. Chen et al. [17] adopt a constant decay factor to address this issue, while Zhang et al. [18] aggregate neighborhood information with receptive-field weighting based on an attention mechanism. However, these methods target supervised or semi-supervised learning scenarios and lack a task-oriented framework for unsupervised clustering.
In response to the above problems, we propose a network that effectively exploits the various types of information in a graph while remaining highly scalable. We adopt a dual self-supervised module to guide the training of the Quasi-GNN module and the DNN module, so the entire model can be trained end-to-end for graph clustering. In addition, our method requires the data in vector form as input in addition to the graph.
In summary, our contributions are described as follows:
We propose a highly scalable deep network for graph-structured data, which effectively combines topological information and node features to obtain latent embeddings for clustering tasks.
We propose a linear propagation scheme based on personalized PageRank, which improves clustering performance and alleviates the over-smoothing issue.
We conduct extensive experiments on five real-world datasets and achieve superior performance with fewer iterations. The experimental results show that our model outperforms the current state-of-the-art methods.
2. Related Work
Graph clustering divides unlabeled nodes into different clusters according to a certain metric, after which the relationships between different nodes in a graph can be mined. Early graph clustering models, such as matrix factorization [19] and DeepWalk [20], perform poorly on real-world datasets because of their shallow architectures and limited learning capability. In addition, Sieranoja et al. [21] propose two complementary algorithms for graph clustering, the K-algorithm and the M-algorithm. Their combination can reach several local optima on the graph, and they can be used with different cost functions. However, the two algorithms fail to integrate the graph topology information, which limits their final performance.
Recently, more effective models for unsupervised learning have been proposed, such as the auto-encoder [22] and generative adversarial networks (GANs) [23]. On this basis, many graph clustering models combined with deep learning techniques have been proposed and achieve good performance. The Graph Auto-Encoder (GAE) [5] combines the auto-encoder with graph convolution: it first utilizes two GCN layers to capture the information in graph topology and node features, and then reconstructs an adjacency matrix that is as similar as possible to the original one. ARGA [7] adopts an adversarial training scheme to regularize the embedding process and obtain more robust embeddings. However, none of the above models are clustering-oriented joint optimization methods. DAEGC [9] jointly optimizes the embedding module and the clustering module, which improves both the quality of the embeddings and the clustering effect. Meanwhile, SDCN [10] integrates structural information into deep clustering by using a transfer operator to combine the auto-encoder with the GCN module, and it conducts end-to-end clustering training with a dual self-supervised module. Although these methods achieve superior performance, they still use a GCN module based on the message-passing mechanism as the backbone network, which limits their scalability.
However, the above models also have several drawbacks: (1) they offer no solution to the orthogonal relationship between the model depth and the neighborhood range; (2) too many smoothing iterations or stacked GCN layers lead to over-smoothing.
To overcome the limitations of traditional GNN models, scalable GNN models have been proposed. Early scalable GNNs simplify the model by sampling the graph: GraphSAGE [24] samples the neighbors around the target node with equal probability, while FastGCN [25] samples nodes according to their importance. Because of their node-wise or layer-wise sampling, they fail to learn large-scale sparse graphs effectively. On this basis, GraphSAINT [26] proposes a highly scalable subgraph-wise sampling method, which decouples sampling from the GNN and further reduces the computational costs.
The other recent direction for scalable GNNs is to simplify the model structure. SGC [13] transforms the nonlinear GCN into a simple linear model by removing the nonlinear functions between GCN layers and collapsing the result into a single linear transformation. PPNP [12] modifies the propagation scheme by exploiting the relationship between GCN and PageRank. AGC [16] advocates high-order graph convolution to capture the global features of the graph and adaptively selects the appropriate order for different graphs, while AGE [27] optimizes GAE by decoupling the GCN layers and revisits GNN models from the perspective of graph signal processing. S2GC [15] adopts a modified Markov diffusion kernel to derive a simpler variant of GCN that captures the global and local context of each node. Furthermore, SIGN [14] points out that features from multiple layers should be considered together instead of taking only a single layer; it concatenates features with different degrees of smoothness and uses them for downstream tasks. Meanwhile, GBP [17] adopts a constant weighted-average decay factor to account for the difference in importance between the receptive fields of different nodes. On this basis, GAMLP [18] integrates multi-scale node features with three different attention mechanisms to improve scalability and computational efficiency. However, these highly scalable models lack a joint training framework for graph clustering tasks.
In contrast, our proposed scalable deep network not only effectively integrates the graph structure and node features, but also decouples the encoding process from the propagation process. It improves the scalability of existing models and alleviates the over-smoothing issue. Moreover, SDN resolves the orthogonal relationship between the model depth and the neighborhood range through the Quasi-GNN module. Meanwhile, we utilize a dual self-supervised module to train the clustering task end-to-end, which yields high-confidence clustering results along with high-quality embeddings.
3. The Proposed Method
In this section, we first introduce the definitions of the graph and the clustering task, and then present our proposed Scalable Deep Network (SDN). The overall framework of SDN is shown in Figure 1. Specifically, SDN consists of three modules: a DNN module for the auto-encoder, a Quasi-GNN module, and a dual self-supervised module. We first utilize the DNN module and the linear encoder to generate the intermediate embeddings, and then use the linear propagation module to obtain the final embeddings. Meanwhile, the dual self-supervised module supervises the training of these two modules. The details of each module are introduced below.
3.1. Problem Formalization
Graph-structured data can be defined as $G = (V, E, X)$, where $V$ is a vertex set with $m$ vertices, $E$ is an edge set with $n$ edges, and $X \in \mathbb{R}^{m \times d}$ is the feature matrix (input data). The topology of graph $G$ is described by an adjacency matrix (with self-loops) $\tilde{A} = A + I$, where $A = [a_{ij}] \in \mathbb{R}^{m \times m}$. If there is an edge between $v_i$ and $v_j$, then $a_{ij} = 1$; otherwise $a_{ij} = 0$. For non-graph data, we obtain the adjacency matrix by constructing a KNN graph with the dot-product similarity: we first calculate the similarity between different nodes by $S = X X^{\top}$ and then select the $K$ nodes with the highest similarity for each sample as its neighbors. The degree matrix is $D = \mathrm{diag}(d_1, \dots, d_m)$, where $d_i = \sum_{j} a_{ij}$ represents the degree of node $v_i$.
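As an illustration, the following sketch shows one way to build such a KNN adjacency matrix from vector data with dot-product similarity; the function name, the value of k, and the symmetrization step are our own assumptions, not part of the paper.

```python
import torch

def knn_adjacency(x, k=5):
    """Construct a KNN graph from vector data x of shape (m, d) using
    dot-product similarity; k and the symmetrization are illustrative."""
    s = x @ x.T                          # dot-product similarity S = X X^T
    s.fill_diagonal_(float("-inf"))      # a node is not its own neighbor
    idx = s.topk(k, dim=1).indices       # k most similar nodes per sample
    a = torch.zeros_like(s)
    a.scatter_(1, idx, 1.0)              # mark the selected neighbors
    return torch.maximum(a, a.T)         # symmetrize the adjacency matrix
```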
Graph clustering divides the nodes into $t$ disjoint clusters $\{c_1, c_2, \dots, c_t\}$ according to a selected criterion, with $c_i \cap c_j = \varnothing$ for $i \neq j$. When node $v_i$ is assigned to a certain cluster $c_j$, this can be expressed as $v_i \in c_j$.
3.2. DNN Module for Auto-Encoder
It is not sufficient to obtain embeddings based only on node features and topology information, so we utilize an auto-encoder to obtain high-dimensional representations of the node features and integrate them into the embedding learning process of the multi-layer perceptron. To accommodate different data types, we adopt the most basic auto-encoder to obtain the high-dimensional embeddings of nodes. First, the initial feature matrix (vector data) $X$ is fed into the fully connected neural network of the DNN module to obtain the high-dimensional embedding $H^{(l)}$. The specific process is defined as follows:

$$H^{(l)} = \phi\big(W_e^{(l)} H^{(l-1)}\big),$$

where $H^{(l)}$ represents the encoding result of the $l$-th layer, and for the 0-th layer of the network we set $H^{(0)} = X$. $W_e^{(l)}$ represents the encoding weight matrix of the $l$-th layer, and $\phi$ represents the nonlinear activation function, such as $\mathrm{ReLU}$. After encoding with $L$ layers, we decode the embedding using a decoder that is fully symmetric to the encoder:

$$D^{(l)} = \phi\big(W_d^{(l)} D^{(l-1)}\big),$$

where $D^{(l)}$ represents the decoding result of the $l$-th layer, and for the 0-th layer of the decoding network $D^{(0)} = H^{(L)}$. $W_d^{(l)}$ represents the decoding weight matrix of the $l$-th layer. After that, we set $\hat{X} = D^{(L)}$ and take the reconstruction error as the objective function:

$$L_{\mathrm{res}} = \frac{1}{2m}\,\lVert X - \hat{X}\rVert_F^2.$$
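The following PyTorch sketch illustrates this module with illustrative layer sizes; it also exposes the per-layer outputs $H^{(l)}$, which the linear encoder in Section 3.3.1 fuses with its own representations. All names and dimensions here are our own assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class DNNAutoEncoder(nn.Module):
    """Minimal sketch of the DNN auto-encoder: a stack of fully connected
    layers with a fully symmetric decoder. `dims` lists the layer sizes,
    e.g. [d, 500, 500, 2000, 10] (illustrative, not the paper's setting)."""
    def __init__(self, dims):
        super().__init__()
        self.enc_layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        rev = dims[::-1]
        self.dec_layers = nn.ModuleList(
            [nn.Linear(rev[i], rev[i + 1]) for i in range(len(rev) - 1)]
        )
        self.act = nn.ReLU()

    def forward(self, x):
        feats = []                        # per-layer outputs H^(1), ..., H^(L)
        h = x
        for layer in self.enc_layers:
            h = self.act(layer(h))        # H^(l) = ReLU(W_e^(l) H^(l-1))
            feats.append(h)
        d = h
        for layer in self.dec_layers[:-1]:
            d = self.act(layer(d))        # symmetric decoding layers
        x_hat = self.dec_layers[-1](d)    # reconstruction X_hat of the input
        return h, feats, x_hat
```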
3.3. Quasi-GNN Module
Although the auto-encoder can learn embeddings from the data themselves, it ignores the relationships between nodes. Traditional deep clustering methods therefore rely on a GCN module to capture the relationships between nodes as a supplement. The GCN module can address this issue, but it is difficult to expand its depth and the neighborhood range together, which limits the learning ability and architecture of such models. Therefore, we propose a Quasi-GNN module that decouples the encoding process from the propagation process. It not only captures the relationships between nodes, but also allows the neighborhood range and the model depth to be expanded together, reducing the computational cost and improving scalability.
3.3.1. Linear Encoder
We utilize a multilayer perceptron (MLP) as our encoder to obtain the embeddings. The result of each layer can be defined as

$$Z^{(l)} = \phi\big(W^{(l)} \tilde{Z}^{(l-1)}\big),$$

where $Z^{(l)}$ is the embedding of the $l$-th layer and $W^{(l)}$ is the weight matrix of the MLP; in particular, $\tilde{Z}^{(0)} = X$. To obtain a more complete and powerful embedding, we combine the high-dimensional representation $H^{(l)}$ learned by the DNN module with $Z^{(l)}$:

$$\tilde{Z}^{(l)} = (1 - \sigma)\, Z^{(l)} + \sigma H^{(l)},$$

where $H^{(l)}$ is the output of the $l$-th layer of the DNN module and $\sigma$ is the balance coefficient, which is fixed during training. The fused embedding $\tilde{Z}^{(l)}$ is then used as the input of the $(l+1)$-th MLP layer to generate the embedding $Z^{(l+1)}$, and the propagation operation described next aggregates information in the neighborhood.
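A corresponding sketch of the linear encoder is shown below, fusing each MLP layer's output with the matching DNN representation through the balance coefficient; setting σ = 0.5 here is purely illustrative, as are the class and argument names.

```python
import torch
import torch.nn as nn

class LinearEncoder(nn.Module):
    """Sketch of the linear (MLP) encoder of the Quasi-GNN module. Each layer
    output Z^(l) is fused with the DNN representation H^(l) of the same size
    via the balance coefficient sigma (value assumed, not from the paper)."""
    def __init__(self, dims, sigma=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        self.act = nn.ReLU()
        self.sigma = sigma                             # balance coefficient

    def forward(self, x, dnn_feats):
        """dnn_feats[l] is the DNN output matching layer l's output size."""
        z = x
        for layer, h in zip(self.layers, dnn_feats):
            z = self.act(layer(z))                     # Z^(l) = phi(W^(l) Z_tilde^(l-1))
            z = (1 - self.sigma) * z + self.sigma * h  # Z_tilde^(l), fed to the next layer
        return z
```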
3.3.2. Linear Propagation Module
We first briefly review the message-passing scheme of traditional GCNs. A traditional two-layer GCN model can be defined as

$$Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}\big(\hat{A} X W^{(0)}\big) W^{(1)}\big),$$

where $\hat{A}$ is the normalized adjacency matrix built from $\tilde{A} = A + I$ and the degree matrix $\tilde{D}$; by choosing different normalization exponents, we obtain different normalization schemes, such as $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, $\tilde{D}^{-1}\tilde{A}$, and $\tilde{A}\tilde{D}^{-1}$. The predicted label matrix is $Z$. In a traditional two-layer GCN model, the calculation of each layer depends on the result of the previous layer. Limited by this calculation pattern, the computational costs of traditional GNN models increase exponentially, and the orthogonal relationship between the model depth and the neighborhood range makes it difficult to expand both. According to Xu et al. [28], the influence score of node $x$ on node $y$ in a GNN can be defined as

$$I(x, y) = \sum_{i}\sum_{j}\left|\frac{\partial h_y^{(k)}[i]}{\partial h_x^{(0)}[j]}\right|.$$

In a $k$-layer GNN, the normalized influence score is, in expectation, a slightly modified $k$-step random walk distribution starting at the root node $x$. When $k \to \infty$, if the graph is irreducible and aperiodic, this distribution approaches a stationary distribution that is independent of $x$ (i.e., the influence is the same up to scaling), which indicates that the influence of $x$ on $y$ eventually becomes independent of the local graph structure. Assuming that this stationary distribution is $\pi_{\lim}$, it can be calculated by

$$\pi_{\lim} = \hat{A}_{\mathrm{rw}}\, \pi_{\lim}, \qquad \hat{A}_{\mathrm{rw}} = \tilde{A}\tilde{D}^{-1}.$$

Obviously, this result depends only on the structure of the whole graph and is unrelated to the starting point of the random walk, which means that the model ultimately uses the information of the whole graph while ignoring the nodes themselves. The original PageRank adopts the same calculation to capture the full graph structure.
Based on this, we can adopt a variant of PageRank, namely personalized PageRank, to reconsider the root node. Assuming that $i_x$ is the indicator (one-hot) vector of node $x$, its vector representation after multiple propagations can be defined as

$$\pi_{\mathrm{ppr}}(i_x) = \alpha\big(I_m - (1 - \alpha)\hat{A}\big)^{-1} i_x,$$

where $\alpha \in (0, 1]$ is the transmission probability and $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ is the normalized adjacency matrix. In this way, we can obtain an approximate post-propagation matrix for the entire graph by iterating

$$Z^{(k+1)} = (1 - \alpha)\,\hat{A} Z^{(k)} + \alpha Z^{(0)},$$

where $Z^{(k)}$ is the result of the $k$-th propagation and $Z^{(0)}$ is the embedding obtained by the linear encoder. Therefore, we can derive the final embeddings by combining the intermediate embeddings of Section 3.3.1. The last layer of the linear propagation module applies the softmax function for multi-class prediction:

$$Z = \mathrm{softmax}\big(Z^{(K)}\big).$$

As a result, $z_{ij}$ indicates the probability that node $i$ belongs to cluster $j$, and we can consider $Z$ as a kind of aggregated class distribution.
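A minimal PyTorch sketch of this parameter-free propagation is given below, using the iterative personalized-PageRank approximation reconstructed above; the transmission probability 0.3 follows Algorithm 1, while the number of propagation steps, the dense adjacency, and the function names are our own assumptions.

```python
import torch

def normalize_adjacency(adj):
    """Symmetrically normalized adjacency with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}, with `adj` a dense (m, m) tensor."""
    a_tilde = adj + torch.eye(adj.shape[0], device=adj.device)
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)

def linear_propagation(z0, a_hat, alpha=0.3, k=10):
    """Iterative personalized-PageRank propagation:
    Z^(t+1) = (1 - alpha) * A_hat Z^(t) + alpha * Z^(0).
    It has no trainable parameters; k propagation steps are illustrative."""
    z = z0
    for _ in range(k):
        z = (1 - alpha) * (a_hat @ z) + alpha * z0   # restart at the encoder output
    return z
```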
3.4. Dual Self-Supervised Module
Through the above two modules, the DNN module and the Quasi-GNN module are only combined mechanically; both were originally designed for unsupervised or supervised learning in other scenarios and cannot be applied directly to our deep clustering task. Therefore, we unify the Quasi-GNN module and the DNN module under the same optimization objective. We set the goal of both modules to approximate the target distribution P, which makes their results tend to be consistent during training; because of the strong connection between the two modules, we call this a dual self-supervised module. It does not require labels during training.
First, for the DNN module, we utilize Student's t-distribution as a kernel to measure the similarity between the node embedding $h_i$ and the cluster center vector $\mu_j$:

$$q_{ij} = \frac{\big(1 + \lVert h_i - \mu_j \rVert^2 / v\big)^{-\frac{v+1}{2}}}{\sum_{j'} \big(1 + \lVert h_i - \mu_{j'} \rVert^2 / v\big)^{-\frac{v+1}{2}}},$$

where $h_i$ is the $i$-th row of the embedding $H$, $\mu_j$ is initialized by K-means on the representations learned by the pre-trained auto-encoder, and $v$ is the degree of freedom of Student's t-distribution. $q_{ij}$ can be seen as the probability of assigning sample $i$ to cluster $j$, so we obtain the cluster assignment distribution $Q = [q_{ij}]$ over the nodes. To enable nodes to be assigned to clusters with higher confidence, we calculate the target distribution $P = [p_{ij}]$:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},$$

where $f_j = \sum_i q_{ij}$ is the soft clustering frequency. The target distribution $P$ normalizes the squared entries of each row of the cluster distribution $Q$. By using the two distributions to constrain different embeddings, the embeddings obtained by the DNN module and the Quasi-GNN module can be considered simultaneously to jointly optimize the embedding and clustering quality. On this basis, we obtain the corresponding objective function:
$$L_{\mathrm{clu}} = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$

This objective constrains the DNN module: by reducing the KL divergence between the two distributions $P$ and $Q$, we obtain embeddings that are better suited for clustering. In addition, we use the distribution $P$ to constrain the Quasi-GNN module:

$$L_{\mathrm{gnn}} = \mathrm{KL}(P \,\Vert\, Z) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{z_{ij}}.$$

To sum up, the final loss can be defined as

$$L = L_{\mathrm{res}} + \beta L_{\mathrm{clu}} + \gamma L_{\mathrm{gnn}},$$

where $\beta > 0$ and $\gamma > 0$ are constraint coefficients that balance the three terms. Algorithm 1 shows the training process of our proposed model.
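To make the dual self-supervised objective concrete, the sketch below computes the distributions Q and P and the combined loss in PyTorch, following the formulas reconstructed above; the values of v, β, and γ are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def student_t_assignment(h, mu, v=1.0):
    """Soft cluster assignment Q via a Student's t-distribution kernel.
    h: (m, d) node embeddings, mu: (t, d) cluster centers, v: degree of freedom."""
    dist2 = torch.cdist(h, mu) ** 2
    q = (1.0 + dist2 / v) ** (-(v + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target distribution P: squares each assignment and
    normalizes by the soft cluster frequency f_j = sum_i q_ij."""
    weight = q ** 2 / q.sum(dim=0)
    return weight / weight.sum(dim=1, keepdim=True)

def dual_self_supervised_loss(x, x_hat, q, z, beta=0.1, gamma=0.01):
    """Total loss L = L_res + beta * KL(P || Q) + gamma * KL(P || Z);
    beta and gamma are illustrative constraint coefficients."""
    p = target_distribution(q).detach()                   # fixed target within a step
    l_res = F.mse_loss(x_hat, x)                          # reconstruction loss (DNN module)
    l_clu = F.kl_div(q.log(), p, reduction="batchmean")   # KL(P || Q), constrains the DNN module
    l_gnn = F.kl_div(z.log(), p, reduction="batchmean")   # KL(P || Z), constrains the Quasi-GNN module
    return l_res + beta * l_clu + gamma * l_gnn
```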
Algorithm 1 Training process of SDN.

Require: Initial features $X$; graph $G$; number of clusters $K$; adjacency matrix $A$; number of training iterations; number of linear encoder layers; number of linear propagation steps.
Ensure: Clustering results.
1: Initialize the parameters of the DNN module with the pre-trained auto-encoder;
2: Initialize the cluster centers $\mu$ with K-means on the representations learned by the pre-trained auto-encoder;
3: Initialize the MLP weights $W^{(l)}$ randomly;
4: for each training iteration do
5:   Generate the DNN embeddings $H^{(1)}, H^{(2)}, \dots, H^{(L)}$;
6:   Use them to calculate the distribution $Q$ by Equation (17);
7:   Calculate the target distribution $P$ by Equation (18);
8:   for each layer $l$ of the linear encoder do
9:     Apply the balance coefficient $\sigma$ to calculate $\tilde{Z}^{(l)}$ by Equation (5);
10:    Calculate the embedding of the next MLP layer by Equation (6);
11:  end for
12:  Set $Z^{(0)}$ to the output of the linear encoder;
13:  for each propagation step $k$ do
14:    Calculate the embedding $Z^{(k+1)}$ by Equation (15);
15:  end for
16:  Set the transmission probability $\alpha = 0.3$ to calculate the distribution $Z$ by Equation (16);
17:  Feed the DNN embedding into the decoder to obtain the reconstructed features $\hat{X}$;
18:  Calculate $L_{\mathrm{res}}$, $L_{\mathrm{clu}}$, and $L_{\mathrm{gnn}}$, respectively;
19:  Calculate the overall loss function by Equation (21);
20:  Back-propagate and update the parameters of SDN;
21: end for
22: Obtain the clustering results from the distribution $Z$;
23: return the clustering results.
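For illustration only, the schematic loop below strings together the sketches from the previous subsections and mirrors Algorithm 1; the pre-training of the auto-encoder and the K-means initialization of the cluster centers are assumed to have been done beforehand, and the final classification layer, layer sizes, and hyperparameters are our own simplifications, not the authors' implementation.

```python
import torch

def train_sdn(x, adj, n_clusters, dims=(500, 500, 2000, 10), epochs=200, lr=1e-3):
    """Schematic end-to-end training of SDN following Algorithm 1, reusing the
    DNNAutoEncoder, LinearEncoder, normalize_adjacency, linear_propagation,
    student_t_assignment, and dual_self_supervised_loss sketches above."""
    layer_dims = [x.shape[1], *dims]
    dnn = DNNAutoEncoder(layer_dims)                 # would be pre-trained in practice
    enc = LinearEncoder(layer_dims)
    head = torch.nn.Linear(dims[-1], n_clusters)     # final layer mapping to cluster logits
    mu = torch.nn.Parameter(torch.randn(n_clusters, dims[-1]))  # K-means init in practice
    a_hat = normalize_adjacency(adj)
    opt = torch.optim.Adam(
        [*dnn.parameters(), *enc.parameters(), *head.parameters(), mu], lr=lr
    )
    for _ in range(epochs):
        h, feats, x_hat = dnn(x)                     # DNN embeddings and reconstruction
        q = student_t_assignment(h, mu)              # distribution Q (Equation (17))
        z0 = head(enc(x, feats))                     # fused linear-encoder output
        z = torch.softmax(linear_propagation(z0, a_hat), dim=1)   # distribution Z
        loss = dual_self_supervised_loss(x, x_hat, q, z)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.argmax(dim=1)                           # cluster assignment for each node
```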
6. Ablation Study
To further verify the effectiveness of our proposed model, we evaluate two variants to assess the contribution of each module: one employs only the Quasi-GNN module, and the other uses only the DNN module for encoding. We then perform K-means on the embeddings obtained from each variant. Compared with the full model and the other baselines, the experimental results show that removing either module leads to a decrease in accuracy and the other metrics, which indicates that the two components of our proposed model are inseparable.
Moreover, it is worth noting that some results obtained by the variants still exceed the baselines. For instance, on Reuters, ACM, and CiteSeer the variant's accuracy is about 2%, 0.5%, and 1% higher than that of AGCN, respectively. Other metrics also surpass AGCN to varying degrees, which indicates that the decoupled method we propose retains superior performance while remaining highly scalable.
7. Analysis of Transmission Probability
The linear propagation layer adopts the propagation strategy defined by Equation (11). Since it has no trainable parameters, it only performs linear operations on the input matrix, which greatly improves scalability and reduces computational costs. We set the transmission probability $\alpha$ to a series of values up to 1 on each dataset and measure the final clustering performance. In this way, we obtain the most suitable setting of $\alpha$; the experimental results are shown in Figure 3. The results generally show a trend of first increasing and then decreasing as $\alpha$ grows. The transmission probability essentially indicates the probability that the target node learns from itself rather than from its neighbor nodes. Based on the experimental results, we can infer that learning from only one of them is not sufficient. Therefore, it is necessary to find a suitable $\alpha$ that integrates the information of the target node and its neighbors to obtain the deep embeddings.
8. Conclusions
In this paper, we propose a scalable deep network (SDN) with a Quasi-GNN module and a DNN module. The Quasi-GNN module captures the information of graph topology and node features in different dimensions, while the DNN module for the auto-encoder supplements the structural information. In addition, these two components can be combined with other post-processing methods so that nodes are assigned to clusters with higher confidence, which gives the model high scalability. Moreover, our model resolves the orthogonal relationship between the model depth and the neighborhood range, reduces the computational costs of traditional GCN models, and alleviates the over-smoothing issue caused by stacking multiple GCN layers. Experiments on benchmark datasets show that our model achieves state-of-the-art performance.
For future work, we plan to optimize the Quasi-GNN module with an attention mechanism to account for the difference in importance between nodes. In addition, we may incorporate variants of GAE/VGAE to obtain more robust embeddings, or design different self-supervised modules to supervise the training of the deep embeddings and the clustering more effectively.