Stacked Community Prediction: A Distributed Stacking-Based Community Extraction Methodology for Large Scale Social Networks
Abstract
1. Introduction
- Section 2 presents the existing bibliography regarding community detection and community prediction.
- Section 3 analyzes the proposed methodology and explains its implementation.
- Section 4 describes the conducted experimentation process and assesses the generated results.
- Section 5 outlines the conclusions and sets the future steps.
2. Background
2.1. Community Detection Definitions & Classic Algorithms
- m_int is the number of the intra-connection edges of the group C,
- n is the number of nodes within the group C, and
- ½ ∗ n ∗ (n − 1) is the number of all possible internal edges of the group C.
- m_ext is the number of the inter-connection edges of the group C,
- n is the number of nodes within the group C,
- N is the number of nodes of the given information network, and
- n ∗ (N − n) is the number of all possible external edges of the group C (both density formulas are written out below).
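Putting the above definitions together, the intra-connectivity and inter-connectivity of a group C correspond to its internal and external edge densities. The following is a reconstruction of the two formulas using the symbols defined above; the density symbols δ_int and δ_ext are notational assumptions rather than the paper's exact notation:

```latex
\delta_{\mathrm{int}}(C) = \frac{m_{\mathrm{int}}}{\tfrac{1}{2}\, n \,(n - 1)},
\qquad
\delta_{\mathrm{ext}}(C) = \frac{m_{\mathrm{ext}}}{n \,(N - n)}
```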
2.2. Community Prediction
3. Proposed Methodology
3.1. Subgraph Extraction
3.2. Feature Enrichment
- The intersection: The number of common neighbor vertices up to the kth depth.
- The loose similarity: The fraction of the intersection’s cardinality divided by the union’s cardinality of the individual sets of the distinct neighbor vertices up to the kth depth.
- The dissimilarity: The fraction of the symmetric difference’s cardinality divided by the union’s cardinality of the individual sets of the distinct neighbor vertices up to the kth depth.
- The edge balance: The absolute difference of the cardinalities of the individual sets of the distinct neighbor vertices up to the kth depth.
- The edge information: The examined edge’s interconnection strength, indicated by the number of different k-length paths between the vertices a and b [33] (a computational sketch of these features follows this list).
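The following is a minimal sketch of how these per-edge topological features could be computed with NetworkX. The helper names (k_depth_neighbors, edge_features), the use of single-source shortest paths for the k-depth neighborhoods, and the adjacency-matrix power for the edge information are illustrative assumptions, not the paper's actual implementation.

```python
import networkx as nx
import numpy as np

def k_depth_neighbors(G, v, k):
    """Distinct neighbor vertices of v up to the k-th depth (v itself excluded)."""
    lengths = nx.single_source_shortest_path_length(G, v, cutoff=k)
    return set(lengths) - {v}

def edge_features(G, a, b, k):
    """Topological features of the vertex pair (a, b) based on k-depth neighborhoods."""
    na, nb = k_depth_neighbors(G, a, k), k_depth_neighbors(G, b, k)
    union = na | nb
    intersection = na & nb
    sym_diff = na ^ nb

    # Edge information: number of different k-length walks between a and b,
    # read off the (a, b) entry of the k-th power of the adjacency matrix [33].
    A = nx.to_numpy_array(G, nodelist=list(G))
    idx = {v: i for i, v in enumerate(G)}
    k_paths = int(np.linalg.matrix_power(A, k)[idx[a], idx[b]])

    return {
        "intersection": len(intersection),
        "loose_similarity": len(intersection) / len(union) if union else 0.0,
        "dissimilarity": len(sym_diff) / len(union) if union else 0.0,
        "edge_balance": abs(len(na) - len(nb)),
        "edge_information": k_paths,
    }

# Example usage on a toy graph.
G = nx.karate_club_graph()
print(edge_features(G, 0, 33, k=2))
```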
3.3. Stacking Ensemble Learner
- a distributed bagging ensemble of L2 regularized, binary logistic regression classifiers, and
- a distributed gradient-boosted trees ensemble model [35], hereafter referred to as the distributed GBT ensemble model.
- The loss function, which is set to log loss to reflect the classification objective of the stacking ensemble classifier.
- The number of decision trees.
- The maximum depth of the decision trees, which depends on the degree of interaction among the input variables.
- The learning rate, i.e., the shrinkage, which determines the weighting of the new decision trees added to the model (an illustrative configuration sketch follows this list).
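The following is a minimal Apache Spark MLlib [36] sketch of how the two ensemble components' estimators could be instantiated with the hyperparameters listed above. The column names ("features", "label"), the regularization strength, and the number of trees are illustrative assumptions; the maximum depth and shrinkage values mirror the execution parameters reported in Table 5.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, GBTClassifier

spark = SparkSession.builder.appName("stacked-community-prediction").getOrCreate()

# L2-regularized binary logistic regression base learner
# (elasticNetParam=0.0 keeps the penalty purely L2).
lr = LogisticRegression(
    featuresCol="features", labelCol="label",
    regParam=0.1,            # illustrative regularization strength
    elasticNetParam=0.0,     # 0.0 => pure L2 (ridge) penalty
    family="binomial",
)

# Distributed GBT ensemble configured with the hyperparameters listed above.
gbt = GBTClassifier(
    featuresCol="features", labelCol="label",
    lossType="logistic",     # log loss, matching the classification objective
    maxIter=5,               # number of decision trees (illustrative)
    maxDepth=12,             # maximum depth of each decision tree
    stepSize=0.03,           # learning rate / shrinkage
)

# Each estimator is fitted on a DataFrame of enriched edge features, e.g.:
# model = gbt.fit(training_df)
```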
3.4. Complexity Analysis
- The subgraph extraction process requires the application of multiple linear BFS crawlers, whose number is constant.
- The number of the extracted subgraphs is also constant.
- The size of the extracted subgraphs is considerably limited. In particular, as shown in the experimentation's execution parameters in Table 4, this size was consistently smaller than the input graph's size by up to two orders of magnitude.
- The feature enrichment process is of linear complexity and is intrinsically distributed on the edge level.
- The edge labelling process requires multiple executions of the Louvain algorithm [7] on each extracted subgraph, where the number of executions is constant (a sketch of the subgraph extraction and edge labelling steps follows this list).
- The Louvain algorithm's [7] time complexity is considered to be O(k log k), where k denotes the number of the subgraph's nodes.
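The following is a minimal sketch of the BFS-based subgraph extraction and the Louvain-based edge labelling steps referenced above, using NetworkX and the python-louvain package [37]. The seed selection, the subgraph size limit, and the labelling convention (1 for inter-community, 0 for intra-community edges) are illustrative assumptions.

```python
import random
import networkx as nx
import community as community_louvain  # python-louvain package [37]

def bfs_subgraph(G, seed, max_nodes):
    """Extract a subgraph via a linear BFS crawl starting from a seed vertex."""
    visited = []
    for node in nx.bfs_tree(G, seed):  # nodes are yielded in BFS discovery order
        visited.append(node)
        if len(visited) >= max_nodes:
            break
    return G.subgraph(visited).copy()

def label_edges(subgraph):
    """Run Louvain [7] on the subgraph and label each edge as
    intra-community (0) or inter-community (1)."""
    partition = community_louvain.best_partition(subgraph)
    return {
        (u, v): int(partition[u] != partition[v])
        for u, v in subgraph.edges()
    }

# Example usage on a toy graph.
G = nx.karate_club_graph()
seed = random.choice(list(G))
sub = bfs_subgraph(G, seed, max_nodes=20)
labels = label_edges(sub)
print(sum(labels.values()), "inter-community edges out of", len(labels))
```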
4. Experimentation Process & Results Discussion
4.1. Classification Performance Evaluation
- Accuracy: The ratio of the correctly classified predictions over all the predictions made.
- Recall: The fraction of the correctly predicted inter-connection edges over all truly inter-connection edges. This is practically the indicator of how well a community prediction model classifies the truly inter-connection edges.
- Precision: The ratio of the correctly classified inter-connection edges over all inter-connection predictions made.
- Specificity: The fraction of the correctly predicted intra-connection edges over all truly intra-connection edges. Analogously to recall, this is the indicator of how well a community prediction model classifies the truly intra-connection edges.
- F1-Score: The harmonic mean of precision and recall (the corresponding formulas are given below).
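For reference, denoting true/false positives and negatives as TP, FP, TN, and FN, and taking the inter-connection edges as the positive class (an assumption consistent with the recall definition above), these metrics are:

```latex
\mathrm{Accuracy}    = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Recall}      = \frac{TP}{TP + FN}, \qquad
\mathrm{Precision}   = \frac{TP}{TP + FP},
\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```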
4.2. Network Quality Evaluation
- Coverage [1]: The average, over the individual communities, of the ratio of the number of intra-community edges to the total number of edges in the graph.
- Performance [1]: The average, over the individual communities, of the ratio of the number of intra-community edges plus inter-community non-edges to the total number of potential edges (a computational sketch of both measures follows).
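The following is a minimal NetworkX-based sketch of how coverage and performance could be computed for a given node partition. It computes the common global formulation from [1], whereas the paper reports averaged per-community ratios; the partition format (a list of disjoint node sets) and the use of greedy modularity communities in the example are illustrative assumptions.

```python
from itertools import combinations
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def coverage_performance(G, communities):
    """Global coverage and performance of a node partition [1].

    `communities` is a list of disjoint node sets covering G. Note that the
    paper averages the corresponding per-community ratios; this sketch uses
    the standard global formulation for simplicity.
    """
    n = G.number_of_nodes()
    m = G.number_of_edges()
    membership = {v: i for i, c in enumerate(communities) for v in c}

    # Coverage: fraction of edges falling inside communities.
    intra_edges = sum(1 for u, v in G.edges() if membership[u] == membership[v])
    coverage = intra_edges / m

    # Performance: intra-community edges plus inter-community non-edges,
    # over all possible node pairs.
    total_pairs = n * (n - 1) // 2
    inter_pairs = sum(len(a) * len(b) for a, b in combinations(communities, 2))
    inter_non_edges = inter_pairs - (m - intra_edges)
    performance = (intra_edges + inter_non_edges) / total_pairs

    return coverage, performance

# Example usage on a toy graph.
G = nx.karate_club_graph()
communities = list(greedy_modularity_communities(G))
print(coverage_performance(G, communities))
```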
4.3. Execution Performance Evaluation
5. Conclusions
- Automating the stacking ensemble learner's parameterization regarding the number of extracted subgraphs, their size, and the depth considered in the feature enrichment process.
- Enhancing the set of calculated network topology features with additional topological properties.
- Automating the parameterization process of the distributed GBT component to optimally minimize the prediction bias.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fortunato, S. Community detection in graphs. arXiv 2009, arXiv:0906.0612. [Google Scholar] [CrossRef] [Green Version]
- Schaeffer, S.E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
- Souravlas, S.; Sifaleras, A.; Tsintogianni, M.; Katsavounis, S. A classification of community detection methods in social networks: A survey. Int. J. Gen. Syst. 2021, 50, 63–91. [Google Scholar] [CrossRef]
- Khediri, N.; Karoui, W. Community Detection in Social Network with Node Attributes Based on Formal Concept Analysis. In Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), Hammamet, Tunisia, 30 October–3 November 2017; pp. 1346–1353. [Google Scholar] [CrossRef]
- Li, Y. Community Detection with Node Attributes and its Generalization. arXiv 2016, arXiv:1604.03601. [Google Scholar]
- Devi, J.C.; Poovammal, E. An Analysis of Overlapping Community Detection Algorithms in Social Networks. Procedia Comput. Sci. 2016, 89, 349–358. [Google Scholar] [CrossRef] [Green Version]
- Blondel, V.D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of community hierarchies in large networks. J. Stat. Mech. Theory Exp. 2008, 2008, P10008. [Google Scholar] [CrossRef] [Green Version]
- Newman, M.E.J.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 2004, 69, 026113. [Google Scholar] [CrossRef] [Green Version]
- Lancichinetti, A.; Kivelä, M.; Saramäki, J.; Fortunato, S. Characterizing the community structure of complex networks. PLoS ONE 2010, 5, e11976. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Bouhatem, F.; El Hadj, A.A.; Souam, F. Density-based Approach with Dual Optimization for Tracking Community Structure of Increasing Social Networks. Int. J. Artif. Intell. Tools 2020, 29, 2050002. [Google Scholar] [CrossRef]
- Jarukasemratana, S.; Murata, T. Edge Weight Method for Community Detection on Mixed Scale-Free Networks. Int. J. Artif. Intell. Tools 2015, 24, 1540007. [Google Scholar] [CrossRef]
- Wakita, K.; Tsurumi, T. Finding community structure in mega-scale social networks. In Proceedings of the 16th International Conference on Information Integration and Web-Based Applications & Services, Hanoi, Vietnam, 4–6 December 2007; Volume 105, pp. 1275–1276. [Google Scholar]
- Backlinko.com. Available online: https://backlinko.com/social-media-users (accessed on 3 December 2020).
- Takaffoli, M.; Rabbany, R.; Zaïane, O.R. Community evolution prediction in dynamic social networks. In Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), Beijing, China, 17–20 August 2014; pp. 9–16. [Google Scholar] [CrossRef]
- Makris, C.; Pettas, D.; Pispirigos, G. Distributed Community Prediction for Social Graphs Based on Louvain Algorithm. In IFIP International Conference on Artificial Intelligence Applications and Innovations; Springer: Cham, Switzerland, 2019; pp. 500–511. [Google Scholar]
- Makris, C.; Pispirigos, G.; Rizos, I.O. A Distributed Bagging Ensemble Methodology for Community Prediction in Social Networks. Information 2020, 11, 199. [Google Scholar] [CrossRef] [Green Version]
- Hamster Friendships Network Data Set—KONECT. April 2017. Available online: http://konect.uni-koblenz.de/networks/petster-friendships-hamster (accessed on 15 September 2020).
- Kumar, S.; Hooi, B.; Makhija, D.; Kumar, M.; Subrahmanian, V.S.; Faloutsos, C. REV2: Fraudulent User Prediction in Rating Platforms. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), Los Angeles, CA, USA, 5–9 February 2018; Available online: https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html (accessed on 15 September 2020).
- Yin, H.; Benson, A.R.; Leskovec, J.; Gleich, D.F. Local Higher-Order Graph Clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; Available online: https://snap.stanford.edu/data/email-Eu-core.html (accessed on 15 September 2020).
- McAuley, J.; Leskovec, J. Learning to Discover Social Circles in Ego Networks. NIPS. 2012. Available online: https://snap.stanford.edu/data/ego-Facebook.html (accessed on 15 September 2020).
- Klimmt, B.; Yang, Y. Introducing the Enron corpus. In Proceedings of the CEAS Conference, Mountain View, CA, USA, 30–31 July 2004; Available online: https://snap.stanford.edu/data/email-Enron.html (accessed on 15 September 2020).
- Douban Network Data Set—KONECT. April 2017. Available online: http://konect.uni-koblenz.de/networks/douban (accessed on 15 September 2020).
- Richardson, M.; Agrawal, R.; Domingos, P. Trust Management for the Semantic Web. ISWC. 2003. Available online: https://snap.stanford.edu/data/soc-Epinions1.html (accessed on 15 September 2020).
- Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [CrossRef] [Green Version]
- Clauset, A.; Newman, M.E.J.; Moore, C. Finding community structure in very large networks. Phys. Rev. E 2004, 70, 066111. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gregory, S. Finding overlapping communities in networks by label propagation. New J. Phys. 2010, 12, 103018. [Google Scholar] [CrossRef]
- Zhang, X.-K.; Ren, J.; Song, C.; Jia, J.; Zhang, Q. Label propagation algorithm for community detection based on node importance and label influence. Phys. Lett. A 2017, 381, 2691–2698. [Google Scholar] [CrossRef]
- Zhao, W.; Martha, V.; Xu, X. PSCAN: A Parallel Structural Clustering Algorithm for Big Networks in MapReduce. In Proceedings of the 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), Barcelona, Spain, 25–28 March 2013; pp. 862–869. [Google Scholar]
- Newman, M.E.J. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 2006, 103, 8577–8582. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Shao, J.; Zhang, Z.; Yu, Z.; Wang, J.; Zhao, Y.; Yang, Q. Community Detection and Link Prediction via Cluster-driven Low-rank Matrix Completion. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3382–3388. [Google Scholar] [CrossRef]
- Messaoudi, I.; Kamel, N. Community Detection Using Fireworks Optimization Algorithm. Int. J. Artif. Intell. Tools 2019, 28, 1950010. [Google Scholar] [CrossRef]
- Amelio, A.; Pizzuti, C. An Evolutionary and Local Refinement Approach for Community Detection in Signed Networks. Int. J. Artif. Intell. Tools 2016, 25, 1650021. [Google Scholar] [CrossRef]
- Dan, K. The Square of Adjacency Matrices. arXiv 2012, arXiv:1207.3122. [Google Scholar]
- Peel, L.; Larremore, D.B.; Clauset, A. The ground truth about metadata and community detection in networks. Sci. Adv. 2017, 3, e1602548. [Google Scholar] [CrossRef] [Green Version]
- Meng, X.; Bradley, J.K.; Yavuz, B.; Sparks, E.R.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.B.; Amde, M.; Owen, S.; et al. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. 2016, 17, 34. [Google Scholar]
- python-louvain: Louvain Implementation on GitHub. Available online: https://github.com/taynaud/python-louvain (accessed on 15 September 2020).
- NetworkX: Girvan–Newman Implementation. Available online: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html (accessed on 15 September 2020).
Graph | Nodes | Edges | Average Degree |
---|---|---|---|
Hamster [17] | 1858 | 12534 | 13.49 |
Bitcoin [18] | 3783 | 24186 | 12.78 |
Email-Eu-Core [19] | 1005 | 25571 | 31.96 |
Facebook [20] | 4039 | 88234 | 43.69 |
Email-Enron [21] | 36692 | 183831 | 10.02 |
Douban [22] | 154908 | 327162 | 4.22 |
Epinions [23] | 75879 | 508837 | 10.69 |
Graph | Feature Enrichment Depth | Training Data Set Size Ratio Over Input Graph |
---|---|---|
Hamster [17] | 3 | 30% |
Bitcoin [18] | 3 | 32% |
Email-Eu-Core [19] | 3 | 30% |
Facebook [20] | 2 | 25% |
Email-Enron [21] | 2 | 35% |
Douban [22] | 2 | 14% |
Epinions [23] | 2 | 16% |
Graph | Feature Enrichment Depth | Base Learners Number | Base Learners’ Training Data Set Size Ratio Over Input Graph |
---|---|---|---|
Hamster [17] | 3 | 3 | 10% |
Bitcoin [18] | 3 | 4 | 8% |
Email-Eu-Core [19] | 3 | 5 | 6% |
Facebook [20] | 2 | 5 | 5% |
Email-Enron [21] | 2 | 7 | 5% |
Douban [22] | 2 | 7 | 2% |
Epinions [23] | 2 | 8 | 2% |
Graph | Feature Enrichment Depth | Bagging’s Base Learners Number | Training Base Learner’s Data Set Size Ratio Over Input Graph |
---|---|---|---|
Hamster [17] | 3 | 3 | 10% |
Bitcoin [18] | 3 | 4 | 8% |
Email-Eu-Core [19] | 3 | 5 | 6% |
Facebook [20] | 2 | 5 | 5% |
Email-Enron [21] | 2 | 7 | 5% |
Douban [22] | 2 | 7 | 2% |
Epinions [23] | 2 | 8 | 2% |
Graph | Feature Enrichment Depth | GBT Decision Trees Number | Decision Trees Maximum Depth | GBT Shrinkage | GBT Training Data Set’s Size Ratio Over Input Graph |
---|---|---|---|---|---|
Hamster [17] | 3 | 3 | 12 | 0.03 | 30% |
Bitcoin [18] | 3 | 4 | 12 | 0.03 | 32% |
Email-Eu-Core [19] | 3 | 5 | 12 | 0.03 | 30% |
Facebook [20] | 2 | 5 | 12 | 0.03 | 25% |
Email-Enron [21] | 2 | 7 | 12 | 0.03 | 35% |
Douban [22] | 2 | 7 | 12 | 0.03 | 14% |
Epinions [23] | 2 | 8 | 12 | 0.03 | 16% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).