1. Introduction
In today’s data-driven world, graph exploration is a crucial field of study. From social networks to biological data to communication systems, graphs help us make sense of complex systems: entities are represented by nodes and the relationships between them by edges. Understanding the structure and dynamics of a graph requires analyzing the interactions and the strength of the relationships between nodes. This makes it possible to identify communities, i.e., closely linked groups of nodes that share similar characteristics or purposes within the graph.
Community detection is a fundamental aspect of understanding the structure and dynamics of a graph. It identifies closely connected clusters of nodes in a network, and this grouping reveals hidden yet significant details of the relations and structure present in the graph, such as the influence of certain nodes, the clusters present, and the flow through the network. For example, in social networks it is used to segment groups by common interests or demographic information; in biology, communities might represent species that tend to coexist in the same ecosystems; in communications, they could be clusters of people who frequently contact each other. Whether one is a sociologist, a biologist, or a network engineer, graph exploration and community detection provide powerful tools to make sense of complex systems and uncover hidden patterns and structures.
With advancements in graph theory and machine learning, new methods [1] have emerged for detecting communities in large graphs more quickly. Traditional approaches that use only the structure of the graph and an optimization metric, such as Girvan–Newman, Louvain, spectral clustering, and Leiden, are effective for smaller networks. However, when dealing with larger, more complex systems that carry additional information or features on the nodes and edges, these traditional methods may not be as scalable or expressive as needed. Therefore, learning-based approaches, either probabilistic graphical models or deep learning models, are considered; they scale to big data and can incorporate additional features.
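As a concrete reference point, the classical baselines named above are available off the shelf. The snippet below is a minimal sketch using NetworkX on a toy graph (louvain_communities requires NetworkX 2.8 or later); it is illustrative only and not part of the evaluation in this paper.

```python
import networkx as nx
from networkx.algorithms import community

# Toy illustration of two classical baselines on Zachary's karate club.
# girvan_newman yields successively finer partitions driven by edge betweenness;
# louvain_communities greedily maximizes modularity.
G = nx.karate_club_graph()

louvain_parts = community.louvain_communities(G, seed=42)
first_gn_split = next(community.girvan_newman(G))

print("Louvain:", [sorted(c) for c in louvain_parts])
print("Girvan-Newman (first split):", [sorted(c) for c in first_gn_split])
```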
Deep learning models working on graphs, such as graph autoencoders [2], graph neural networks (GNNs) [3], and graph convolutional networks (GCNs) [4], have shown potential to solve graph problems by learning representations of nodes and graphs while considering both node features and graph structure. However, these deep learning models often underperform because they do not explicitly incorporate crucial prior information about the community structure of the nodes or appropriate edge information such as weights. By developing new deep learning frameworks that better encode the underlying community memberships and edge relationships, we can create more powerful and flexible community detection approaches that handle a wide range of real-world graph-based applications.
The goal of this paper is to advance the state of the art in community detection by designing deep learning models that effectively capture the inherent community structure and incorporate edge weights within graphs. This involves novel neural network architectures, optimization techniques, and other innovations that allow the models to better leverage these crucial graph properties. The resulting approaches exhibit good results across diverse graph datasets and application domains compared to existing community detection methods.
More specifically, we focus on the application of variational graph autoencoders (VGAEs) [5] for enhancing community detection in complex systems. VGAEs are a type of deep learning model that uses a probabilistic approach to learn compact, low-dimensional representations of graphs. We choose VGAEs to embed nodes for the community detection problem because their probabilistic framework makes them powerful, scalable, and flexible. Our proposed technique, called Enhanced Community Detection with Structural Information VGAE (VGAE-ECF), aims to improve VGAEs for community detection by incorporating preliminary insights about community structure, obtained with community-preserving algorithms, together with edge weights into the learning process. This integration is expected to make learning more efficient and powerful, as the VGAE can leverage the enriched, low-dimensional latent-space embeddings to detect communities in a more informed way.
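Before describing VGAE-ECF, it helps to fix what a plain VGAE looks like. The sketch below is a minimal, self-contained PyTorch rendition in the spirit of Kipf and Welling [5]; the class and variable names are ours, and it is illustrative only, not the implementation evaluated in this paper.

```python
import torch
import torch.nn as nn

class SimpleVGAE(nn.Module):
    """Minimal VGAE sketch following Kipf and Welling [5]: a GCN encoder with
    Gaussian mean/log-variance heads and an inner-product decoder."""

    def __init__(self, in_dim, hid_dim, lat_dim):
        super().__init__()
        self.w_hid = nn.Linear(in_dim, hid_dim, bias=False)
        self.w_mu = nn.Linear(hid_dim, lat_dim, bias=False)
        self.w_logvar = nn.Linear(hid_dim, lat_dim, bias=False)

    def encode(self, a_norm, x):
        # One shared GCN layer (A_hat X W), then separate mean / log-variance heads.
        h = torch.relu(a_norm @ self.w_hid(x))
        return a_norm @ self.w_mu(h), a_norm @ self.w_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps, keeping sampling differentiable.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    def decode(self, z):
        # Inner-product decoder: probability of an edge between nodes i and j.
        return torch.sigmoid(z @ z.t())

    def forward(self, a_norm, x):
        mu, logvar = self.encode(a_norm, x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar

# Training would minimize the usual ELBO: binary cross-entropy between the
# reconstructed and observed adjacency plus the KL divergence of q(z | X, A)
# from a standard normal prior. The latent z (or mu) is what gets clustered.
```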
The problem, therefore, centers on how to effectively integrate community detection into deep learning models without compromising the generalization and robustness of these models. The proposed solution seeks to bridge the gap between classical community detection methods and modern deep learning techniques, to obtain an accurate community detection method.
The structure of the paper is as follows. Section 2 introduces the problem of community detection, relevant terminology, and background to provide context for the subsequent sections. Section 3 reviews the existing literature related to the methodology employed in this study and previous works in the field of community detection using VGAEs or similar approaches. Section 4 describes the methodology adopted in this paper. Section 5 presents the experiments conducted and the results obtained: it discusses the datasets used for evaluation, the evaluation metrics employed, and provides a thorough analysis and discussion of the findings. Finally, Section 6 concludes the paper by summarizing the key findings and insights gained from the research and suggests potential avenues for future research in this area.
3. Related Work
Historically, algorithms like Girvan–Newman [7], modularity optimization [8,9], and spectral clustering [10] have been popular for community detection. These methods often rely on network topologies and heuristic approaches. While they can be effective for smaller or less complex networks, their performance often degrades with increasing network size and complexity. Moreover, they might not be well suited for dynamic networks where community structures evolve over time.
In 2018, J. J. Choong et al. [11] presented a unique deep generative model called the Variational Graph Autoencoder for Community Detection (VGAECD). This model takes the standard VGAE framework and customizes it specifically for community detection. Unlike regular VAEs that use simple prior distributions, VGAECD generates the latent representations from multivariate distributions to better capture the intricate relationships between nodes in the same community. It factorizes the joint probability of node attributes, latent variables, and community labels to generate those latent vectors. When reconstructing node attributes, it ties the reconstruction directly to community-specific latent vectors, so the model can pick up on community characteristics more accurately.
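For orientation, a factorization commonly adopted by such community-aware generative models is shown below; this is our reading of the VGAECD idea, and the exact priors and notation in [11] may differ.

```latex
p(\mathbf{A}, \mathbf{Z}, c) \;=\; p(\mathbf{A} \mid \mathbf{Z})\, p(\mathbf{Z} \mid c)\, p(c),
\qquad
p(c{=}k) = \pi_k, \quad
p(\mathbf{z}_i \mid c_i{=}k) = \mathcal{N}\!\left(\boldsymbol{\mu}_k, \sigma_k^2 \mathbf{I}\right), \quad
p(A_{ij} \mid \mathbf{z}_i, \mathbf{z}_j) = \mathrm{Bernoulli}\!\left(\sigma(\mathbf{z}_i^{\top}\mathbf{z}_j)\right)
```

In this view, edge reconstruction is conditioned on latent vectors drawn from community-specific mixture components.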
In 2020, J. J. Choong et al. [12] took this idea a step further by introducing OPT-VGAECD, or Optimized VGAECD, which linearizes the encoder within the VGAE framework as proposed by Wu et al. [13]. Traditional VGAEs use graph convolutional networks (GCNs) as the encoder, which are powerful but computationally expensive. By removing the non-linear activation and simplifying the GCN, a more streamlined model called the simplified graph convolutional network (SGC) was proposed. This method focuses on the main objective, discovering communities, which may be compromised when only the VGAE loss is minimized.
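The effect of this linearization is that the K propagation steps collapse into a one-off feature pre-processing pass. A hedged sketch in our own notation (not the released SGC code) follows.

```python
import torch

def sgc_features(adj, x, k=2):
    """SGC-style propagation (after Wu et al. [13]): drop the non-linearities and
    pre-multiply the features k times by S = D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + torch.eye(adj.size(0))                 # add self-loops
    d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)              # D^{-1/2} of A + I
    s = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    for _ in range(k):
        x = s @ x                                        # k propagation steps
    return x  # a single linear (or logistic-regression) layer is trained on these features
```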
Another key aspect of community detection models is the loss function. The VGAECD framework has a dual loss that includes both a reconstruction loss and a community quality loss. Optimizing this dual loss requires a model powerful enough to minimize the reconstruction error while ensuring the detected communities are of high quality. Recent research introduces a dual-optimization algorithm inspired by the Neural Expectation–Maximization (NEM) algorithm [14]. Unlike the classic Expectation–Maximization algorithm [15], NEM can be trained end-to-end with gradient descent, providing a way to tackle the dual-optimization challenge. It enables the VGAECD model to learn robust node embeddings and recognize communities for the nodes at the same time.
Previous GAE and VGAE models employed encoders that, while adept at capturing node features for reconstruction tasks, did not specifically preserve community structures during the encoding process. This limitation led to the development of modularity-aware GAE and VGAE models by Salha-Galvan et al. [2] in 2022. These models introduce a revised encoding mechanism that incorporates a modularity-aware message-passing operator. This approach respects the original graph structure while emphasizing modularity-based node communities, thereby enriching the node embedding space with community-centric information. The Louvain algorithm [16], which is efficient on massive networks, is integrated with the deep learning VGAE framework [2] to derive modularity-aware node communities. Running the Louvain community detection step before feeding the input to the VGAE encoder resulted in an overall performance boost: the input to the VGAE becomes the communities detected by the Louvain method along with the original graph structure.
Moreover, earlier models often relied on optimization strategies that prioritized link prediction tasks, consequently sidelining community detection. This bias stemmed from the use of standard loss functions like cross-entropy and the evidence lower bound (ELBO) [6]. The contemporary approach redefines this by integrating a modularity-inspired loss function, which acts as a regularization mechanism over the pairwise reconstruction losses. This alternative loss function aligns the optimization process with the dual objectives of link prediction and community detection, fostering a balance between local pairwise interactions and the global community architecture. Additionally, the integration of the Louvain algorithm for maximizing modularity has been a crucial step forward. This technique not only automates the selection of the number of communities but also proves to be scalable and complementary to the encoding–decoding processes of GAEs and VGAEs. The Louvain algorithm, known for its efficiency in large-scale networks, initializes the community detection process and iteratively refines the detected communities, enhancing the overall performance of the model.
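To illustrate how such a modularity-inspired regularizer can look, the sketch below computes a soft modularity score from a soft community-assignment matrix; this is a generic formulation in the spirit of [2], not their exact loss.

```python
import torch

def soft_modularity(adj, soft_assign):
    """Soft modularity Q = Tr(C^T B C) / 2m with B = A - d d^T / 2m, where
    `soft_assign` is an n x K matrix of (softmax) community memberships."""
    deg = adj.sum(dim=1, keepdim=True)          # node degrees d
    two_m = adj.sum()                           # 2m = sum of degrees
    b = adj - (deg @ deg.t()) / two_m           # modularity matrix B
    return torch.trace(soft_assign.t() @ b @ soft_assign) / two_m

# Illustrative combined objective: reconstruction + KL - lambda * modularity,
# so that maximizing modularity is traded off against the usual ELBO terms.
# loss = recon_bce + kl_term - lam * soft_modularity(adj, soft_assign)
```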
Another significant contribution to the domain is presented in [17]. This work proposed a novel approach to extract the core structure of networks using the K-truss algorithm, which, combined with a variational autoencoder, enhances the detection of nodes and links within communities. The K-truss [18] is a subgraph in which every edge participates in at least (K-2) triangles within the subgraph. This approach maintains the relations of community structures with a reduced computational burden compared to K-core, making the discovery of community structures in complex networks more practical.
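NetworkX ships a K-truss routine, so the definition above can be checked directly on a toy graph; the snippet is illustrative only.

```python
import networkx as nx

# K-truss illustration: in the k-truss, every edge is supported by at least
# (k-2) triangles. nx.k_truss returns the maximal such subgraph.
G = nx.karate_club_graph()
truss = nx.k_truss(G, k=4)   # every surviving edge lies in >= 2 triangles

print(f"original graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
print(f"4-truss core:   {truss.number_of_nodes()} nodes, {truss.number_of_edges()} edges")
```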
The above works combine state-of-the-art deep learning methods with traditional community detection approaches. Integrating the K-truss technique to preserve the strength of relationships between nodes is the main goal of this study.
Author Contributions
Conceptualization, J.H.P. and K.P.; methodology, J.H.P.; software, J.H.P.; validation, P.P., W.B.A. and K.P.; formal analysis, K.P.; investigation, J.H.P. and K.P.; resources, J.H.P.; data curation, J.H.P.; writing—original draft preparation, J.H.P. and K.P.; writing—review and editing, P.P. and W.B.A.; visualization, J.H.P.; supervision, K.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
VGAE | Variational graph autoencoder
GNN | Graph neural network
GCN | Graph convolutional network
VAE | Variational autoencoder
VGAE-ECF | Enhanced Community Detection with Structural Information VGAE
References
- Jin, D.; Yu, Z.; Jiao, P.; Pan, S.; He, D.; Wu, J.; Yu, P.S.; Zhang, W. A Survey of Community Detection Approaches: From Statistical Modeling to Deep Learning. IEEE Trans. Knowl. Data Eng. 2023, 35, 1149–1170.
- Salha-Galvan, G.; Lutzeyer, J.F.; Dasoulas, G.; Hennequin, R.; Vazirgiannis, M. Modularity-aware graph autoencoders for joint community detection and link prediction. Neural Netw. 2022, 153, 474–495.
- Khoshraftar, S.; An, A. A Survey on Graph Representation Learning Methods. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–55.
- Zhang, L.; Song, H.; Aletras, N.; Lu, H. Graph Node-Feature Convolution for Representation Learning. arXiv 2022, arXiv:1812.00086.
- Kipf, T.N.; Welling, M. Variational Graph Auto-Encoders. arXiv 2016, arXiv:1611.07308.
- Kingma, D.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
- Despalatović, L.; Vojković, T.; Vukicević, D. Community structure in networks: Girvan-Newman algorithm improvement. In Proceedings of the 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 26–30 May 2014; pp. 997–1002.
- Chen, M.; Kuzmin, K.; Szymański, B.K. Community Detection via Maximization of Modularity and Its Variants. IEEE Trans. Comput. Soc. Syst. 2014, 1, 46–65.
- Zhang, X.S.; Wang, R.S.; Wang, Y.; Wang, J.; Qiu, Y.; Wang, L.; Chen, L. Modularity optimization in community detection of complex networks. Europhys. Lett. 2009, 87, 38002.
- Bach, F.; Jordan, M. Learning Spectral Clustering. In Advances in Neural Information Processing Systems; Thrun, S., Saul, L., Schölkopf, B., Eds.; MIT Press: Cambridge, MA, USA, 2003; Volume 16.
- Choong, J.J.; Liu, X.; Murata, T. Learning Community Structure with Variational Autoencoder. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 69–78.
- Choong, J.J.; Liu, X.; Murata, T. Optimizing variational graph autoencoder for community detection with dual optimization. Entropy 2020, 22, 197.
- Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; ML Research Press: Philadelphia, PA, USA, 2019; Volume 97, pp. 6861–6871.
- Greff, K.; van Steenkiste, S.; Schmidhuber, J. Neural Expectation Maximization. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6694–6704.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–38.
- Blondel, V.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast Unfolding of Communities in Large Networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008.
- Fei, R.; Wan, Y.; Hu, B.; Li, A.; Li, Q. A Novel Network Core Structure Extraction Algorithm Utilized Variational Autoencoder for Community Detection. Expert Syst. Appl. 2023, 222, 119775.
- Cohen, J. Trusses: Cohesive subgraphs for social network analysis. Natl. Secur. Agency Tech. Rep. 2008, 16, 1–29.
- Traag, V.A.; Waltman, L.; van Eck, N.J. From Louvain to Leiden: Guaranteeing well-connected communities. Sci. Rep. 2019, 9, 1–12.
- Panduranga, N.K.; Gao, J.; Yuan, X.; Stanley, H.E.; Havlin, S. Generalized model for k-core percolation and interdependent networks. Phys. Rev. E 2017, 96, 032317.
- Bahmani, B.; Moseley, B.; Vattani, A.; Kumar, R.; Vassilvitskii, S. Scalable K-Means++. arXiv 2012, arXiv:1203.6402.
- Caragea, C.; Wu, J.; Ciobanu, A.; Williams, K.; Fernández-Ramírez, J.; Chen, H.H.; Wu, Z.; Giles, L. CiteSeerX: A Scholarly Big Dataset. In Advances in Information Retrieval; de Rijke, M., Kenter, T., de Vries, A.P., Zhai, C., de Jong, F., Radinsky, K., Hofmann, K., Eds.; Springer: Cham, Switzerland, 2014; pp. 311–322.
- Available online: https://ieee-dataport.org/documents/cora (accessed on 11 September 2024).
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
- Salha, G.; Hennequin, R.; Vazirgiannis, M. Simple and Effective Graph Autoencoders with One-Hop Linear Models. arXiv 2020, arXiv:2001.07614.
- Kinney, J.B.; Atwal, G.S. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA 2014, 111, 3354–3359.
Figure 1. Architecture of a VGAE with encoder and decoder.
Figure 2. Process flow from an input graph to detecting communities. The VGAE produces embeddings that reflect community structure and edge weights, which are clustered with K-means.
Figure 3. Proposed architecture to generate communities.
Figure 4. Comparison of performance of different models for the Citeseer dataset. Note that the blue line is for the GCN_VAE method.
Figure 5. Comparison of performance of different models for the Cora dataset. The orange line is for the GCN_VAE method.
Figure 6. Ground truth communities in the Citeseer dataset.
Figure 7. Ground truth communities in the Cora dataset.
Figure 8. K-truss+GCN_VAE on Citeseer.
Figure 9. K-truss+Linear_VAE on Citeseer.
Figure 10. Leiden+GCN_VAE on Citeseer.
Figure 11. Leiden+Linear_VAE on Citeseer.
Figure 12. K-truss+GCN_VAE on Cora.
Figure 13. K-truss+Linear_VAE on Cora.
Figure 14. Leiden+GCN_VAE on Cora.
Figure 15. Leiden+Linear_VAE on Cora.
Table 1. Community detection results on Citeseer dataset considering features.
Model | Iterations | Method | Mean AMI (%) | Mean ARI (%) | Time (s) |
---|---|---|---|---|---|
Linear_VAE | 100 | Louvain [2] | 25.39 | 10.58 | 559.73 |
GCN_VAE | 100 | Louvain [2] | 26.59 | 26.09 | 393.51 |
Linear_VAE | 100 | Leiden | 21.30 | 12.80 | 585.73 |
GCN_VAE | 100 | Leiden | 24.65 | 23.57 | 330.17 |
Linear_VAE | 100 | K-truss | 30.27 | 24.80 | 667.77 |
GCN_VAE | 100 | K-truss | 29.71 | 28.19 | 378.43 |
Linear_VAE | 100 | Random | −0.012 | −0.013 | 335.64 |
GCN_VAE | 100 | Random | 0.37 | 0.29 | 339.03 |
Table 2. Experimental results on Citeseer dataset without considering features.
Model | Iterations | Method | Mean AMI (%) | Mean ARI (%) | Time (s) |
---|---|---|---|---|---|
Linear_VAE | 100 | Louvain [2] | 15.83 | 7.94 | 621.86 |
GCN_VAE | 100 | Louvain [2] | 10.95 | 7.75 | 422.83 |
Linear_VAE | 100 | Leiden | 15.68 | 5.72 | 618.01 |
GCN_VAE | 100 | Leiden | 10.53 | 7.98 | 412.26 |
Linear_VAE | 100 | K-truss | 1.01 | 0.78 | 826.20 |
GCN_VAE | 100 | K-truss | 3.86 | 3.04 | 480.62 |
Table 3. Experimental results on Cora dataset considering features.
Model | Iterations | Method | Mean AMI (%) | Mean ARI (%) | Time (s) |
---|---|---|---|---|---|
Linear_VAE | 100 | Louvain [2] | 50.12 | 43.32 | 405.82 |
GCN_VAE | 100 | Louvain [2] | 49.02 | 38.67 | 352.50 |
Linear_VAE | 100 | Leiden | 50.75 | 46.12 | 392.30 |
GCN_VAE | 100 | Leiden | 52.30 | 49.81 | 341.77 |
Linear_VAE | 100 | K-truss | 45.09 | 39.09 | 476.85 |
GCN_VAE | 100 | K-truss | 47.76 | 40.58 | 428.69 |
Linear_VAE | 100 | Random | 0.29 | 0.15 | 243.93 |
GCN_VAE | 100 | Random | 0.69 | 0.59 | 330.69 |
Table 4. Experimental results on Cora dataset without considering features.
Model | Iterations | Method | Mean AMI (%) | Mean ARI (%) | Time (s) |
---|---|---|---|---|---|
Linear_VAE | 100 | Louvain [2] | 50.12 | 43.39 | 481.55 |
GCN_VAE | 100 | Louvain [2] | 35.76 | 27.09 | 394.40 |
Linear_VAE | 100 | Leiden | 40.63 | 32.80 | 465.28 |
GCN_VAE | 100 | Leiden | 38.98 | 32.46 | 365.97 |
Linear_VAE | 100 | K-truss | 2.57 | 1.54 | 715.39 |
GCN_VAE | 100 | K-truss | 8.20 | 4.08 | 461.72 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).