1. Introduction
The Graph Neural Network (GNN) is a powerful tool designed to handle graph-structured data such as social networks, knowledge graphs, and citation networks. The invention of the GNN has greatly facilitated graph-related tasks such as graph classification [1,2,3], neural machine translation [4,5], relation extraction [6,7], relational reasoning [8,9], and graph clustering [10,11,12]. Unlike traditional clustering methods such as K-means, GNN-based graph clustering models use deep neural networks for representation learning before clustering. Adaptive graph convolution (AGC) [11] adaptively chooses the neighborhood order for different graphs. The deep attentional embedded graph clustering model (DAEGC) [13] learns to aggregate neighbors by weighting their importance. The adversarially regularized graph autoencoder (ARGA) [14] introduces adversarial regularization to improve the robustness of the learned representations. The work on attributed graph embedding (AGE) [15] proposed a Laplacian filtering mechanism that effectively denoises features. The deep fusion clustering network (DFCN) [16] is a hybrid method that integrates embeddings from autoencoder (AE) [17] and graph autoencoder (GAE) [18] modules for representation learning.
Graph clustering is a fundamental but challenging task in graph analysis. Recently, there has been growing interest in contrastive learning, and applying it to deep graph clustering has become increasingly common. The principle of contrastive learning is to pull similar (positive) sample pairs closer together and push dissimilar (negative) sample pairs further apart. The contrastive multi-view representation learning method (MVGRL) [19] achieves its best performance by contrasting node embeddings with sampled sub-graph embeddings. Specifically, it constructs an extra diffusion graph for contrastive learning, and the node embeddings from one view are contrasted with the sub-graph embeddings from the other; the method determines which node–sub-graph pairs are positive and which are negative. The self-consistent contrastive attributed graph clustering method (SCAGC) [20] maintains consistency between the learned representation and the cluster structure by performing contrastive learning between clusters and between nodes under the guidance of the clustering results. Inspired by the deep graph infomax method (DGI) [21], the community detection-oriented deep graph infomax method (CommDGI) [22] introduced a community mutual information loss to capture community structural information for nodes.
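To make the stated principle concrete, the following minimal PyTorch sketch shows a generic InfoNCE-style node-level contrastive loss, in which a node's embeddings from two views form the positive pair and all other nodes serve as negatives. The names z1, z2, and temperature are illustrative; this is not the loss of any particular method cited above.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Generic InfoNCE-style loss: pull each node's two views together,
    push it away from all other nodes (illustrative sketch only)."""
    z1 = F.normalize(z1, dim=1)          # (N, d) embeddings from view 1
    z2 = F.normalize(z2, dim=1)          # (N, d) embeddings from view 2
    logits = z1 @ z2.t() / temperature   # (N, N) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```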
Although promising performance has been achieved, several problems remain. Firstly, the manual augmentations used in existing methods, such as feature masking and edge dropping, can cause semantic drift. Secondly, most methods perform contrastive learning only on feature-based (first-order) contrastive similarity and ignore structure-based (second-order) contrastive similarity. Both issues lead to sub-optimal performance.
Figure 1 shows the difference between first-order contrastive learning and second-order contrastive learning.
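To make this distinction concrete, the sketch below (with assumed function and variable names) computes a feature-based (first-order) similarity matrix directly from node embeddings, and a structure-based (second-order) similarity matrix from the rows of the first-order matrix, i.e., from the nodes' similarity profiles. It is an illustrative reading of Figure 1, not the exact construction used in our model.

```python
import torch
import torch.nn.functional as F

def first_order_similarity(z):
    """Feature-based similarity: cosine similarity between node embeddings."""
    z = F.normalize(z, dim=1)
    return z @ z.t()                      # (N, N)

def second_order_similarity(z):
    """Structure-based similarity: similarity between the nodes'
    similarity profiles (rows of the first-order matrix)."""
    s1 = F.normalize(first_order_similarity(z), dim=1)
    return s1 @ s1.t()                    # (N, N)
```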
To solve the above-mentioned problems, we propose a contrastive graph clustering method termed Graph Clustering with High-Order Contrastive Learning (GCHCL). To address the first problem, we build two views by applying Laplacian smoothing with different normalizations to the same features and construct a similarity matrix for each view, where each element denotes the similarity between a pair of nodes. We argue that an alignment loss between the two similarity matrices maps the corresponding embeddings into the same space. To address the second problem, we build a contrastive similarity matrix from the two similarity matrices. Inspired by [23], we perform contrastive learning by minimizing the loss between the contrastive similarity matrix and an identity matrix. In this way, our model implements contrastive learning at the structure level; since the contrastive similarity matrix is built from feature-based similarity matrices, it can also be regarded as operating at the feature level to some degree. Furthermore, our model learns clustering-friendly representations naturally, without the manual sampling used in most contrastive methods and without any extra clustering algorithm during training, and it can be trained on large datasets. A schematic sketch of this pipeline is given after the contribution list below. The key contributions of this paper are as follows:
Without any manual augmentations, we use two different Laplacian smoothing methods to build two views for contrastive learning and design an alignment loss to force the learned embeddings to map into the same space.
We design a novel structure-based contrastive loss without a sampling phase. By contrasting two similarity matrices, our model can learn clustering-friendly representations. It is worth noting that our model can also be applied to large-scale datasets.
Extensive experiments on five open datasets validate the effectiveness of our model.
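The following PyTorch sketch outlines the pipeline described above under stated assumptions: the two views are built by Laplacian smoothing with symmetric and random-walk normalization (our reading of "different normalizations"), a shared encoder produces two embeddings, the two similarity matrices are aligned with an MSE loss, and their product (the contrastive similarity matrix) is pushed toward the identity. The exact normalizations, loss form, and weighting are assumptions, not the authors' verbatim implementation.

```python
import torch
import torch.nn.functional as F

def smooth(x, adj, t, mode="sym"):
    """Apply t rounds of Laplacian smoothing X <- A_norm X, where A_norm is the
    normalized adjacency with self-loops (assumed form of the smoothing filter)."""
    a = adj + torch.eye(adj.size(0))
    deg = a.sum(dim=1)
    if mode == "sym":                                  # symmetric normalization D^-1/2 A D^-1/2
        d = deg.pow(-0.5)
        a = d.unsqueeze(1) * a * d.unsqueeze(0)
    else:                                              # random-walk normalization D^-1 A
        a = a / deg.unsqueeze(1)
    for _ in range(t):
        x = a @ x
    return x

def gchcl_losses(x, adj, encoder, t=2):
    """Alignment + contrastive losses on the two high-order similarity matrices."""
    z1 = F.normalize(encoder(smooth(x, adj, t, "sym")), dim=1)   # view 1 embeddings
    z2 = F.normalize(encoder(smooth(x, adj, t, "rw")), dim=1)    # view 2 embeddings
    s1, s2 = z1 @ z1.t(), z2 @ z2.t()                            # similarity matrices
    align = F.mse_loss(s1, s2)                                   # structure alignment loss
    contrast = F.mse_loss(s1 @ s2.t(), torch.eye(x.size(0)))     # contrastive loss toward identity
    return align + contrast
```

In this sketch, encoder can be as simple as a single torch.nn.Linear layer, which matches the single-layer MLP used in our experiments (Section 4.3).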
4. Experiments
4.1. Datasets
We conducted extensive experiments on five widely used benchmark datasets: Cora, Dblp, Amap, Corafull, and Reddit. More details can be found in Table 2.
Cora [18] is a citation dataset. Each node denotes a machine learning paper, and each edge denotes a citation relationship between two papers. The papers are divided into seven classes: case-based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning, and theory. Each node's feature is a binary (0/1) vector, where each dimension indicates the presence of a keyword from a fixed vocabulary.
Dblp [24] is a co-authorship network. The authors are categorized into four classes: database, data mining, machine learning, and information retrieval. Each edge represents a collaborative relationship between two authors. The node features are bag-of-words vectors constructed from keywords.
Amap [33] is a co-purchase graph dataset. Each node denotes a product, and each edge connects two products that are frequently purchased together. The nodes are divided into eight classes according to product category.
Corafull [33] is similar to Cora but larger, and its papers are divided into 70 classes.
Reddit [1] is constructed from Reddit posts from September 2014. Each node denotes a post, and each edge connects two posts commented on by the same user. The posts are divided into 41 classes. The node features combine the averaged 300-dimensional GloVe word vectors of the post title and comments with the post's score and number of comments.
4.2. Experimental Setup
All experiments were run on a computer with a GeForce GTX 1080 Ti GPU, 64 GB of RAM, and PyTorch 1.8.1. We set the maximum number of training iterations to 100 for all datasets and optimized our model using the Adam optimizer. After training, we ran the K-means clustering algorithm on the learned embeddings. To reduce the impact of randomness, we repeated each experiment 10 times and report the average results.
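A minimal sketch of this evaluation protocol is given below, assuming scikit-learn's KMeans and using NMI as an example metric; the function name and arguments are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def evaluate(embeddings, labels, n_clusters, runs=10):
    """Run K-means on the learned embeddings (numpy array of shape (N, d))
    several times and average the scores to reduce the impact of randomness."""
    scores = []
    for seed in range(runs):
        pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
        scores.append(normalized_mutual_info_score(labels, pred))
    return float(np.mean(scores))
```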
4.3. Parameter Setting
In our model, we used a single-layer MLP as the encoder. The output dimension was 100 for Reddit and 500 for the other datasets. For simplicity, the encoder applied only a linear transformation without an activation function. Instead of feeding the whole feature matrix into the model, we performed training in batches of an assigned size, denoted by b. The batch size for Dblp was 1024; the batch sizes for Amap and Reddit and for Cora and Corafull are given in Table 3. For the compared baselines, we used the settings specified in their respective papers. The details of the hyperparameters are shown in Table 3.
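For illustration, the following sketch shows a single-layer linear encoder trained on batches of precomputed smoothed features rather than the full feature matrix; the input dimension, learning rate, and placeholder feature tensor are assumptions, and the loss computation is omitted.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Single-layer MLP encoder: a linear transformation with no activation function.
encoder = nn.Linear(in_features=1433, out_features=500)      # e.g., 1433-dim inputs -> 500-dim embeddings
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)  # learning rate is an assumption

smoothed_x = torch.randn(2708, 1433)                         # placeholder for the precomputed smoothed features
loader = DataLoader(TensorDataset(smoothed_x), batch_size=1024, shuffle=True)

for (batch,) in loader:                                      # train on batches of features, not the full matrix
    h = encoder(batch)                                       # (b, 500) embeddings for this batch
    # ...compute the alignment and contrastive losses on this batch, then step the optimizer
```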
4.4. Metrics
The clustering performance was evaluated using four widely used metrics: ACC (Accuracy) [34], NMI (Normalized Mutual Information) [35], ARI (Adjusted Rand Index) [36], and F1 (macro-F1 score) [37].
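For reference, the following sketch shows one common way to compute these metrics: NMI and ARI via scikit-learn, and ACC and macro-F1 after mapping predicted cluster IDs to ground-truth labels with the Hungarian algorithm. The helper name is illustrative, and this is not necessarily the exact evaluation code used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score, f1_score

def clustering_metrics(y_true, y_pred):
    """Compute ACC, NMI, ARI, and macro-F1 for a clustering result."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):                  # count co-occurrences of (cluster, label)
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)         # best cluster-to-label assignment
    remap = dict(zip(rows, cols))
    y_mapped = np.array([remap[p] for p in y_pred])   # relabel predictions before ACC/F1
    return {
        "ACC": float((y_mapped == y_true).mean()),
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
        "F1": f1_score(y_true, y_mapped, average="macro"),
    }
```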
4.5. Performance Comparison
In our experiments, we compared our model to K-means and 14 deep graph clustering methods on five benchmark datasets. Specifically, K-means is a classical clustering algorithm. GAE, VGAE, MGAE, and DAEGC [12,13,18] are reconstructive learning methods. ARGE and ARVGE [14] are adversarial regularization methods. AGCN, SDCN, and DFCN [16,24,25] are hybrid methods. SCAGC, GDCL, MVGRL, AutoSSL, and Sublime [19,20,26,28,31] are contrastive learning-based methods. Details of the performance comparison can be found in Table 4, Table 5, Table 6, Table 7 and Table 8, where the best results are marked in bold. From these tables, we can make the following observations:
The proposed model achieved the best performance in most cases. For example, on the Amap dataset, our model achieved ACC, NMI, ARI, and F1 scores of 79.18%, 70.37%, 62.22%, and 72.93%, respectively. Moreover, we observed relative improvements of 1.1%, 1.5%, 2.7%, and 5.3% over the second-best baseline on the Cora, Dblp, Amap, and Corafull datasets, respectively.
K-means performed clustering directly on the raw features and could, to some degree, indicate the quality of a dataset's attributes. As can be seen, the attributes of the Cora dataset demonstrated the highest quality for clustering. The baselines from GAE to DFCN are classical deep graph clustering models, mostly trained by reconstructing the raw features or the given graph. GAE, VGAE, MGAE, ARGE, ARVGE, AGCN, and DAEGC were sub-optimal compared to our model because they use only a single view for embeddings, which limits the diversity of features available for representation learning. SDCN and DFCN learn representations through a cross-module approach, which enriches the information available for learning. Our model outperformed SDCN and DFCN because they rely heavily on the provided graph, which cannot fully reveal the complete connections between nodes and may mislead representation learning; the similarity matrix used in our model greatly alleviates this issue.
The baselines from SCAGC to Sublime are graph clustering models based on contrastive learning. All of them implement contrastive learning at the feature level, which cannot effectively capture the neighborhood of each node, an important aspect of clustering tasks. Our model instead performs contrastive learning directly at the structural level, which allows its contrastive learning to facilitate the clustering task more effectively.
On the Reddit dataset, most of the baselines struggled with the training cost, leading to OOM (out-of-memory) failures. There are two reasons for this: (1) they usually feed the whole dataset into the model during training, and (2) the entire adjacency matrix participates in every training step. In contrast, our model is fed batches of features rather than the whole feature matrix, which greatly reduces the computation.
4.6. Ablation Study
We performed an ablation study from two perspectives: (1) To validate the effectiveness of high-order contrastive learning, we implemented two experiments, one on first-order contrastive learning and one on second-order contrastive learning. (2) To assess the effectiveness of each component in our model, we conducted experiments by individually removing the structure alignment and contrastive learning.
In Table 9, we can observe that contrastive learning on the first-order similarity matrix consistently underperformed contrastive learning on the second-order similarity matrix. This is because first-order contrastive learning is based on feature similarity, which may introduce representation bias, whereas second-order contrastive learning is based on neighborhood similarity, which alleviates this bias. In addition, compared to first-order contrastive learning, second-order contrastive learning learns clustering-oriented representations more effectively.
In Table 10, we can observe that each component of our model contributed to the performance. Specifically, when we removed the contrastive learning (CL) part, the performance decreased significantly on all datasets, because without CL the representation bias degraded the performance across all datasets. When the structure alignment (SA) was omitted, the impact on the Cora, Dblp, Amap, and Corafull datasets was minimal, but the impact on the Reddit dataset was significant. This is because CL carries a risk of suppressing useful relationships, which can harm performance, whereas SA preserves these relationships and alleviates this issue. The model conducted graph convolution five times on Reddit and no more than three times on the other datasets; by aggregating more neighbors, the number of nodes similar to the target node increased in the embedding space. When the model performed contrastive learning on the similarity matrices of the Reddit dataset, it therefore suppressed more useful relationships than on the other datasets, and the performance decreased more on Reddit than on the others.
4.7. Hyperparameter Analysis
In this paper, we introduced two hyperparameters, b and t. Here, b denotes the batch size of the input features, and t controls the power of the Laplacian smoothing performed before training.
In Figure 3, we show how the performance varied with the batch size over the tested range. From this figure, we can see that the performance on the Amap, Cora, and Corafull datasets was not sensitive to changes in the batch size. However, a larger batch size enhanced the clustering performance on Dblp, with the best results achieved at a batch size of 1024, whereas on Reddit a smaller batch size was more beneficial for representation learning. This is because Dblp aggregates the first-order neighborhood for its representation, whereas Reddit aggregates the fifth-order neighborhood. A larger batch size facilitates the reduction of redundant relationships in Dblp but increases the risk of suppressing useful relationships in Reddit.
In Figure 4, we illustrate how the performance varied with the Laplacian smoothing power. From this figure, we can see that the ACC stabilized once the power reached 2, except on the Reddit dataset, where the model achieved its best performance at t = 5 and remained stable within the range [3, 6]. In summary, our model demonstrated low sensitivity to these two hyperparameters, even when they varied within considerable ranges.
4.8. Visualization Analysis
To demonstrate the effectiveness of our model in the clustering task, we illustrate a series of similarity matrices in Figure 5, showing the quality of the learned representations within each cluster. From Figure 5, we can observe that our model outperformed the other methods with respect to both the number of clusters and the clarity of the clustering structure.
5. Conclusions
In this paper, we propose GCHCL, a high-order contrastive learning method for graph clustering that requires no manual augmentation. We contrast two high-order structures, constructed using two different Laplacian smoothing methods, to reveal node similarity at the structural level, and we align the high-order structures to map the corresponding embeddings into the same space. After building a contrastive structure from the high-order structures, we perform contrastive learning by aligning the contrastive structure with an identity matrix. In this way, our model naturally learns clustering-friendly representations. Extensive experiments on datasets of various scales validate the effectiveness of the proposed model.