Next Article in Journal
Long-Term Lifetime Prediction of Power MOSFET Devices Based on LSTM and GRU Algorithms
Previous Article in Journal
Bertrand’s Paradox Resolution and Its Implications for the Bing–Fisher Problem
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

GLOD: The Local Greedy Expansion Method for Overlapping Community Detection in Dynamic Provenance Networks

1
Department of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2
R&D Center, Beijing Guowang Fuda Science & Technology Development Co., Ltd., Beijing 100070, China
3
Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450002, China
*
Authors to whom correspondence should be addressed.
Mathematics 2023, 11(15), 3284; https://doi.org/10.3390/math11153284
Submission received: 16 June 2023 / Revised: 19 July 2023 / Accepted: 21 July 2023 / Published: 26 July 2023
(This article belongs to the Section Mathematics and Computer Science)

Abstract

:
Local overlapping community detection is a hot problem in the field of studying complex networks. It is the process of finding dense clusters based on local network information. This paper proposes a method called local greedy extended dynamic overlapping community detection (GLOD) to address the challenges of detecting high-quality overlapping communities in complex networks. The goal is to improve the accuracy of community detection by considering the dynamic nature of community boundaries and leveraging local network information. The GLOD method consists of several steps. First, a coupling seed is constructed by selecting nodes from blank communities (i.e., nodes not assigned to any community) and their similar neighboring nodes. This seed serves as the starting point for community detection. Next, the seed boundaries are extended by applying multiple community fitness functions. These fitness functions determine the likelihood of nodes belonging to a specific community based on various local network properties. By iteratively expanding the seed boundaries, communities with higher density and better internal structure are formed. Finally, the overlapping communities are merged using an improved version of the Jaccard coefficient, which is a measure of similarity between sets. This step ensures that overlapping nodes between communities are properly identified and accounted for in the final community structure. The proposed method is evaluated using real networks and three sets of LFR (Lancichinetti–Fortunato–Radicchi) networks, which are synthetic benchmark networks widely used in community detection research. The experimental results demonstrate that GLOD outperforms existing algorithms and achieves a 2.1% improvement in the F-score, a community quality evaluation metric, compared to the LOCD framework. It outperforms the best existing LOCD algorithm on the real provenance network. In summary, the GLOD method aims to overcome the limitations of existing community detection algorithms by incorporating local network information, considering overlapping communities, and dynamically adjusting community boundaries. The experimental results suggest that GLOD is effective in improving the quality of community detection in complex networks.

1. Introduction

Many real-world networks, such as social networks and citation graphs [1,2,3,4], contain community structure, and analyzing network structure and knowledge extraction is an important research topic [5,6]. The purpose of community detection is to partition the entire network into different communities, ensuring that nodes within a community are tightly connected, while nodes between different communities are relatively sparsely connected. However, the concept of community itself lacks a universally accepted definition, as discussed in Fortunato [7]. The dynamic provenance network refers to changes in the number of nodes, the number of edges of nodes, and the structure of local communities. Real-world community structures often produce overlapping intersections. Therefore, discovering overlapping communities is more helpful in understanding the topological and functional structures of networks, and using community networks as a guide is of great significance for analyzing information dissemination and information tracing.
Community detection algorithms typically require global information of the entire network [8]. However, global information of real-world networks may be unavailable or costly to obtain [9]. Local community detection starts from the perspective of local structures and detects node community membership based on local information. Therefore, local expansion methods are more suitable for discovering communities in large-scale networks. The accuracy and running time of community detection are two important evaluation metrics for local expansion-based community discovery methods. However, these two metrics are mutually constrained, and improving both accuracy and running time is an NP-hard problem. Currently, existing local overlapping detection algorithms, such as LFM [10], are inefficient in detecting the local maximum of fitness functions for community detection, and there is a problem of creating empty communities by deleting nodes. The LOCD framework [11] selects representative nodes to join communities but does not consider the analysis and merging issues after community detection. Reid et al. [12] pointed out that overlap is indeed an important feature of many real-world networks. Baltsou et al. [13] proposed that local community detection can also evaluate the community structure of a network. Gupta et al. [14] proposed faction propagation in networks to detect overlapping nodes. Shi et al. [15] pointed out that local expansion methods are more suitable for large-scale networks. Newman et al. [16] proposed modularity to measure the quality of detected communities.
The existing LOCD local overlapping community detection algorithm [11] has three problems that need to be improved in order to enhance the quality of community detection: First, selecting a single node to expand the community leads to the problem of seed invalidity. Second, when nodes within a community are deleted, there may be nodes that do not belong to any community. Third, when there are too many overlapping nodes in overlapping communities, there is a problem of redundant communities that multiple communities contain similar overlapping nodes, that is, multiple overlapping nodes belong to the same multiple communities. This creates redundancy in some communities. Redundancy is defined as a one-third threshold, and communities exceeding this threshold are considered redundant communities.
This paper proposes a method called GLOD for local overlapping community detection, which addresses three common issues in existing methods. The main contributions are as follows:
(1) Utilizing the same computational process and neighbor similarity to expand nodes into coupled seeds can reduce seed invalidity and avoid the issue of creating blank communities due to node deletion.
(2) Extending community boundaries based on centrality seeds through a blended community fitness function yields locally well-adapted and structurally stable community discovery results. This approach mitigates the problem of excessive reliance on the influence of central nodes and avoids community jitter.
(3) Comparing and merging results of overlapping communities based on an improved Jaccard coefficient helps maintain high community coverage while reducing redundancy in communities.
The rest of this paper is organized as follows. Section 2 introduces some related methods. In Section 3, we describe the problems of local overlapping community detection and some methods used in our framework. Section 4 presents our method and the complexity analysis of the method. Section 5 presents the experimental results and analysis. Finally, we provide conclusions.

2. Related Work

Local overlapping community detection is helpful in understanding the topological and functional structure of the propagation graph; therefore, utilizing community networks as a guide is of great significance for analyzing information propagation and information tracing. In this section, we will elaborate on the progress and existing issues of community detection from three aspects: local community detection, overlapping community detection, and local overlapping community detection.

2.1. Local Community Detection

The purpose of local community detection is to find the community membership of nodes based on local information [17,18]. Luo et al. [8] proposed two local community detection algorithms, DMF_M and DMF_R. DiTursi et al. [17] extended traditional local community detection in static networks to dynamic networks that change over time. Fanrong et al. [19] discussed the problem of ineffective seeds, where seed nodes sometimes do not participate in community detection. Therefore, community expansion starts not from a single node but from the maximum clique that includes the seed nodes. They proposed a local community detection algorithm based on maximum clique expansion (LCD-MC). Christopoulos [20] proposed a framework to track the evolution of a community in dynamic networks by focusing on a key node called the anchor, overcoming the identity problem and validating its effectiveness through experiments. Huang [21] introduced Community Contrastive Learning (Community-CL), an online framework that combines contrastive learning with graph views to simultaneously learn node representations and detect communities in networks, outperforming state-of-the-art baselines in community detection tasks. Qing’s [22] approach combines weighted modularity with spectral clustering to accurately estimate the number of communities in weighted networks, including those with negative edge weights and signed networks. Yao [23] introduced a new modularity function 𝐹2 and proposed a constrained Louvain algorithm that outperforms other methods, including the classical Louvain algorithm and Newman’s fast method, in community detection on various networks. Wan [24] introduced a novel spatial autoregressive and gravity model united method (SARGM) to simulate benchmark spatial networks and proposed a spectral-clustering-based spatial community detection method (SCSCD) that outperforms other methods in accuracy and effectiveness, with a case study on a high-speed train network in China.

2.2. Overlapping Community Detection

Overlapping is one of the important features of propagation graphs. Tanmoy et al. [25] proposed a vertex-based metric, GenPerm, but the problem of ineffective seeds (vertices) still exists. Bai et al. [26] developed the OCDDP overlapping community detection algorithm based on density peaks. The algorithm first sets a distance matrix based on similarity and then selects community cores. Rezvani et al. [27] proposed a fitness-based method based on triangles to address the free-rider and separation effects of overlapping nodes. The local fitness method (LFM) [10] found that the local maximum of the fitness function is inefficient for detecting communities, as the algorithm repeats detecting nodes when they are included in multiple communities. The clique percolation method (CPM) [28] finds k small clusters in the network and shares (k − 1) nodes to obtain a single community. However, this algorithm cannot assign nodes that do not belong to any community. LOC [29] solves this problem by selecting nodes with the highest density and using an algorithm based on the clique expansion strategy for community partitioning, which improves the efficiency and quality of community detection. In addition, compared with overlapping community detection algorithms that focus on the global structure of the network, the computational cost of local overlapping community detection methods is greatly reduced. Xu H et al. [30] proposed the influence-based community overlap propagation algorithm (INF-COPRA) that improves the accuracy of overlapping community detection by ranking the influence of nodes and labels, outperforming existing methods in benchmark and real networks. Li X et al. [31] proposed algorithm combines local expansion and label propagation to detect overlapping components in community structures, achieving higher effectiveness, accuracy, and stability across diverse network structures. Lin H et al. [32] proposed an algorithm for detecting overlapping communities in social networks using an augmented attribute graph and an improved weight adjustment strategy, demonstrating its effectiveness through extensive experiments on synthetic and real-world datasets. Huang M et al. [33] proposed approach, ESNMF, utilizes symmetric non-negative matrix factorization and outperforms existing methods for community detection in large-scale networks by dividing the network into sub-graphs and extracting precise communities through non-negative matrix factorization. Peng Y et al. [34] proposed the DSNE algorithm and community density metric (D), which offers an effective approach for detecting overlapping communities in bipartite networks, outperforming existing algorithms and providing a more comprehensive evaluation of community structures compared to modularity. Gao R et al. [35] introduced an algorithm for overlapping community detection in complex networks based on membership degree propagation, which achieves improved accuracy and speed compared to other methods while reducing computational complexity. Li Y et al. [36] proposed the Two Expansions of Seeds (TES) algorithm, which effectively discovers overlapping communities in complex networks, outperforming existing algorithms in terms of robustness, accuracy, and redundancy reduction. Tsung et al. [37] proposed a novel algorithm for detecting overlapping communities by introducing a node weight allocation problem and extending the modularity measure, showing improved results in detecting valuable overlapping nodes.

2.3. Local Overlapping Community Detection

Kamuhanda et al. [38] used the non-negative matrix factorization (NMF) algorithm to address multiple local community detection, but the quality of community detection on sparse networks is not ideal. Hollocou et al. [39] proposed a method called MULTICOM to discover multiple local communities, but community detection based on seed influence will have a large error compared to the true communities. The (α,γ)-OCS model [40] is based on the α-adjacency-γ-quasi-k-clique method, and uses the idea of local maximum cliques for local overlapping community detection. However, the difficulty lies in determining the number of nodes in the local maximum clique, as the size of the maximum cliques in different networks is unknown, making it difficult to estimate the parameters. Ni et al. [11] proposed a framework called LOCD, but the LOCD framework uses nodes as seeds, which leads to the problem of seed inefficiency and does not consider the issue of community similarity. Riolo M A et al. [41] proposed an information theoretic method for identifying building blocks in specific networks, highlighting the valuable insights provided by traditional community detection approaches.

3. Preliminaries

In order to clarify the proposed overlapping community detection method in this paper, we first introduce some background knowledge, including the problem description of local overlapping community detection, the neighbor similarity function used for seed community formation, the community fitness function used for local community expansion, the Jaccard coefficient used for community merging, and the F-score used for overall analysis in experiments.

3.1. Problem Statement

For graph G and given node v, node v is an overlapping node belonging to both communities c1 and c2. Our goal is to find communities c1 and c2 using only local information for node v.

3.2. The Common Neighbor Similarity

The rough seed nodes (i.e., rough communities) can weaken the influence of a single seed on community expansion. The calculation of similarity is Equation (1), where N ( v i ) is the set of neighbors of node vi. The rough node V, the process of roughening, and the neighbors of the node are classified into a set according to the similarity. Rough nodes are used as seeds, which contain multiple nodes with high similarity as a whole, and are initially composed of vi and its most similar neighbors, following the common neighbor similarity NC. By sorting the centrality of the rough node V, the largest node V with the highest centrality is selected as the seed s.
N C ( v i , v j ) = N ( v i ) N ( v j )  

3.3. Community Fitness Function

The community fitness function serves as a criterion for determining whether a node can join the current community and determines who can be added from the set of neighboring nodes to the initial community.

3.3.1. f(C) Method

The fitness function f(C) of a community C in the graph G = (V,E) is defined as follows. Here, d in C and d out C denote the weighted node degrees of all nodes in C within and outside C, respectively, and α > 0 is a tunable parameter that controls the size of the community, called the resolution parameter. Based on this, v C N ( C ) , the fitness of node v is defined by Equation (3). The LFM algorithm [10] adds the node with the maximum fitness to the current community C in each iteration to obtain a new community C , then computes the fitness of all nodes in C and removes nodes with negative fitness to obtain a new community C . This process is repeated until the fitness function no longer increases. Chen et al. [42] found that the NMI value of overlapping communities is highest when α = 0.8 in practical applications. Therefore, in this experiment, we adopt the overlapping community partitioning strategy with the highest NMI value.
f ( C ) = d in C ( d in C + d out C ) α ,
f ( v ) = { f ( C { v } ) f ( C ) , v N ( C ) ; f ( C ) f ( C { v } ) , v C .

3.3.2. ω(vi) Method

During the community expansion phase in the DMF_M algorithm [8], the node with the highest value of ω(vi) is selected by computing the node’s fitness and updating C, N, and M accordingly. Here, N ( v i ) denotes the set of neighbors of node v i , N 2 ( v i ) = { x y N ( v i ) , x N ( y ) } , and N C i is the intersection of N ( v i ) and C.
ω ( v i ) = { m a x v j N C i ( | N ( v i ) N ( v j ) | + 1 N ( v j ) | + 0.1 | N 2 ( v i ) N 2 ( v j ) | + 1 | N 2 ( v j ) | ) 1.1       Δ M 0 0 ,       Δ M < 0 ,

3.3.3. F(v, s). Method

In a graph G = (V,E), vi is the selected node, s = { v i , v j } is the constructed seed where v k N ( v i ) , and N ( v i ) is the set of all neighboring sub-nodes of vi. The value of F ranges from 0 to 1, and when F = 1, all elements in c are neighbors of v k . Equation (5) describes the degree of affiliation of v i ’s neighboring nodes with the number of elements in seed s. It reflects how closely v i ’s neighbors are connected to seed s. To increase the density of internal nodes in s, we use this equation to expand the seed. Therefore, when the F value of a node is too high, it indicates that the node is highly influenced by s, and the node is added to s. Then, the local gain of s will increase.
F ( v k , s ) = ( N ( v k ) s ) | s |

3.4. The Jaccard Coefficient for Community Merging

The Jaccard coefficient [43] is used to evaluate the effectiveness of overlapping community detection results by measuring the similarity between the detected communities and the true community structure. It is defined as Equation (6). The GREESE method [44] merges communities when the Jaccard coefficient exceeds one-third of the nodes in the smallest community.
The modified community merging coefficient J is defined as shown in Equation (8). A larger J value indicates that communities should be merged, and when J = 0, the two communities are completely independent and unrelated. Here, C r represents the true community structure, while C f represents the community structure discovered using a community detection method. O V is the set of only overlapping nodes, O V i is one of the shared overlapping node sets, o i v i is an overlapping node in the O V i overlap set, and C is the set of multiple communities to which O V i belongs.
J = C r , j C r m a x C j , C f J ( C f , j , C r , i ) N r
J ( C f , j , C r , i ) = | C f , j C r , i | | C f , j C r , i |
J ( o i v i ) = Σ o i v i C , C i C , C j C , i j 1 | C i C j |

3.5. F-Score

F-score [45,46] is the harmonic mean of p r e c i s i o n and r e c a l l , used to measure the similarity/correlation between the given community detection result and the true community structure, and can assess the accuracy of the community detection result. Its calculation Equation is as follows.
F s = C r , i C r m a x C f , j C f F s ( C f , j , C r , i ) N r  
F s ( C f , j , C r , i ) = 2 p r e c i s i o n ( C f , j , C r , i ) r e c a l l ( C f , j , C r , i ) p r e c i s i o n ( C f , j , C r , i ) r e c a l l ( C f , j , C r , i )  
p r e c i s i o n ( C f , j , C r , i ) = | C f , j C r , i | | C f , j |  
r e c a l l ( C f , j , C r , i ) = | C f , j C r , i | | C r , i |  
precision   = C r , i C r m a x C f , j C f p r e c i s i o n ( C f , j , C r , i ) N r  
recall   = C r , i C r m a x C f , j C f r e c a l l ( C f , j , C r , i ) N r  
where p r e c i s i o n ( C f , j , C r , i ) is the precision of community detection and r e c a l l ( C f , j , C r , i ) is the recall of community detection, as shown in the formula. Since the calculation requires information about the true community structure, the calculation of the F-score must include a dataset with the true community structure. The larger the value of the F-score, the better the quality of overlapping community detection.

4. Materials and Methods

To address the problem that global information is unattainable in dynamic provenance graphs, this paper proposes a local greedy expansion-based overlapping community detection method. The method integrates the fitness between local nodes and local communities, as well as the similarity between local nodes and nodes within the community, and expands the community boundaries through a variety of mixed similarity functions. Finally, the overlapping communities with shared nodes are analogized and merged to obtain the final community detection results.

4.1. Seeding Phase

In real-world networks, core nodes contribute significantly to information propagation and are usually located in densely connected regions of the network. Our method selects core nodes and their neighbors as seeds (i.e., the coarsening process). First, we select a transitional node vi that has no community label yet from the newly added nodes in the dynamic traceability graph and aggregate its most similar neighboring nodes using Equation (1) to construct a new seed node Vi (i.e., a coarse community). Node vi is the core of the rough node Vi, and the most similar neighboring nodes are used to guide the expansion of the seed node. Vi is a node set of G, where Vf is the node set obtained after the node expansion stage. Then, we rank the rough nodes based on the number and weight of nodes in the seed node, as well as the centrality measure of the seed node, to select a subset of core nodes as seeds S. The Algorithm 1 for the seed selection process is as follows:
Algorithm 1 Seeding Phase
Input: Graph G = (V, E);
Input: NL collection of nodes without community tags
Output: Seed S
1     S
2        while NL do
3           for v i NL do
4              //Filter the nodes with the greatest similarity
5               V f = Perform C the expansion phase for { v i , v j }
6               N C ( v i )   = N C ( v i ) + V f
7              //Iterate step 4 until no node with the greatest similarity
8              V i ←({ v i } ∪ N ( v i ) )
9           end for
10         end while
11     MaxDegreeSet   arg m a x V i V ( V i .   degree   + V i .   nodeCount   + V i .   edgeCount ,   )
12     return S
The benefit of this seed selection strategy is the ability to discover dispersed rough nodes with high local degree in the network. The communities extended through these nodes have a good local community structure, and the community detection results achieve local full coverage of the local network. This avoids the problem of nodes not belonging to any community due to node deletion.

4.2. Expansion Phase

The local expansion process aims to iteratively add the neighbors of seed nodes S to construct stable and effective communities. The current seed S is augmented or pruned to maximize the fitness function f(C), the ω(vi) function, and the F(v,s) function. The process is repeated until none of the remaining neighbors meet the threshold for belonging to the current community. The Algorithm 2 for the local expansion of a given seed node S to a community is as follows.
Algorithm 2 Expansion Phase
Input: Graph G = (V, E); Seed S
Output: Community C
1     for s ∈ S do
2        Ne = neighbors of s
3        for v i  ∈ Ne do
4           f ( v i ) , w ( v i ) , F ( v i , s )
5        end for
6        if v i arg m a x v i , f ( v i ) ( { f ( v i ) } ) or
arg m a x v i , w ( v i ) ( { w ( v i ) } ) or
arg m a x v i , F ( v i ) ( { F ( v i , s ) } ) then
7           add v i to C
8        end if
9     end for
10 return C
During the process of expanding communities, it is necessary to greedily assign isolated neighboring nodes to appropriate communities. During the propagation phase, expansion involving multiple seed nodes does not involve duplicate calculations of merged communities in the iterative process, and overlapping nodes will be assigned to multiple communities.

4.3. Merge Phase

The problem of overlapping community detection involves the existence of common overlapping nodes among different communities, leading to some degree of similarity between communities. When the similarity is high, “overlapping excessively” may occur, resulting in redundant communities. Therefore, we merge communities with a large number of shared overlapping nodes to reduce redundancy and simplify the community structure. We use the community merging coefficient J as a measure, and when the value of J exceeds 1/3, multiple overlapping communities are merged. The Algorithm 3 for the merge phase process is as follows:
Algorithm 3 Merging Phase
Input: Overlapping node set O V = { O V 1 , O V 2 O V k }
1     for O V i O V do
2        for o v i O V i do
3           //computing J ( o v i )
4           J ( O V i ) = J ( o v i )   + J ( O V i )
5        end for
6        if J ( O V i ) >= 1/3 then
7           if O V i     C = { c 1 , c 2 c k } then
8               merge C
9           end if
10      end if
11  end for
O V is a set of only overlapping nodes, O V i is a set of overlapping nodes that are shared by multiple communities in C, and o i v i is one overlapping node in the set. When the overlap similarity J ( O V i ) is greater than 1/3, all overlapping communities C that O V i belongs to will be merged.

4.4. Complexity Analysis

The first stage of seed selection in the GLOD method uses neighbor similarity to form candidate seeds, with a time complexity of O ( n ) , where n is the number of nodes without community labels. The second stage of local community expansion expands the community boundary based on the node with the maximum fitness function, with a time complexity of O ( s N ) , where s is the size of the seed community and N ( v i ) is the set of neighbors of node vi. The third stage of community comparison and merging has a time complexity of O ( c 2 ) , where c is the total number of nodes in the community. Typically, the number of nodes in community c is much larger than that in the seed community s, but not all communities c are larger than seed community s. Overall, the worst-case time complexity of the GLOD method is O ( c 2 ) .

5. Results

This section introduces the datasets and evaluations used in the experiments. The quality of overlap community detection is evaluated by comparing the F-score values with the LOCD1, LOCD2, LOCD3, LOCD4, LOCD5, LOCD6 [11], MULTICOM [39], and OCS [40] algorithms. The experimental results demonstrate the effectiveness of the GLOD algorithm.

5.1. Datasets

The experimental evaluation in this section is divided into two parts. The first part evaluates the performance of local overlap community detection algorithms using synthetic networks generated by the LFR benchmark [47]. The LFR benchmark test network is controlled by several parameters that determine the topology of the network. The parameter N controls the number of nodes in the network; k ¯ represents the average node degree of the network; k m a x represents the maximum node degree of the network; the parameter O n sets the number of overlapping nodes in the network; the parameter O m controls the number of communities to which each overlapping node belongs; C m i n and C m a x are used to control the range of community sizes that can be taken in the network; and μ is the overlapping parameter, which controls the probability that the nodes within a community are connected to nodes outside the community. By adjusting the different parameters of the LFR benchmark network, we can obtain topological graphs with different properties, which simulate real network data. In addition, the LFR benchmark network can also generate a community partition of network nodes, which can be used to compare the results of overlap community discovery with the generated true community structure and validate the effectiveness of the algorithm.
The second part is a real-world dataset, and in this experiment, we used the Amazon dataset (https://snap.stanford.edu/data/ accessed on 1 December 2022), which is a product sales network and can be found on their website. The Amazon dataset contains 75,149 communities, 334,863 nodes, and 925,872 edges. The Provenance Graph dataset, which is a computing-data provenance log, contains 14,575 communities, 134,882 nodes, and 285,376 edges.

5.2. Synthetic Networks

The simulation experiments evaluated the effectiveness of the algorithm using LFR benchmarks. First, three groups of LFR benchmark test networks were generated based on parameter settings with N = 5000 and μ { 0.1 , 0.3 , 0.4 } , where each group of LFR networks had the same node size but different levels of community mixing. Each group contained seven network datasets with varying degrees of overlap, where O m [ 2 , 8 ] . Other parameters were set to the same values: average node degree k = 10; maximum node degree k m a x  = 50; number of overlapping nodes O n = 0.1   N ; and community size upper and lower bounds c m i n = 25 and c m a x = 200 . As the number of communities to which overlapping nodes belong O m increases, the difficulty of detecting overlapping communities also increases.
  • LFR dataset with N = 5000 and μ = 0.1 .
From Table 1, we can draw the following conclusions: Overall, as Om increases, the boundaries of overlapping communities become increasingly blurred, and the problem of identifying overlapping nodes becomes more difficult. This leads to a gradual decrease in the quality of overlapping community detection algorithms, as evidenced by a decreasing trend in F-score values. When O m { 2 , 3 , 5 , 8 } , GLOD achieves the best F-score values, while LOCD6 performs best when O m { 4 , 6 , 7 } . However, overall, GLOD outperforms the other algorithms. LOCD6 uses the maximal clique algorithm and has an advantage in identifying communities with a larger number of nodes when the community boundary is not very blurred. On the other hand, GLOD adopts a method of maximizing local correlation and the fitness function, resulting in a higher degree of separation and a larger number of communities when O m { 4 , 6 , 7 } , leading to a decrease in F-score values.
Figure 1 shows the trend charts of the GLOD, LOCD, OCS, and MULTICOM algorithms. For the high-quality community detection algorithms, GLOD and LOCD, the F-score decreases, and the community boundaries become blurred as Om increases. Overall, the GLOD algorithm has an advantage over the others.
  • LFR dataset with N = 5000, μ = 0.3
From Table 2, the following conclusions can be drawn: When O m { 2 , 3 , 4 , 6 , 8 } , the GLOD algorithm obtains the best F-score value, LOCD1 obtains the maximum F-score value when O m { 5 } and LOCD2 obtains the maximum F-score value when O m { 7 } . GLOD, LOCD1, LOCD2, and LOCD3 all adopt the DMF_M algorithm. It can be clearly seen that the F-score values of algorithms using DMF_M are higher than other algorithms when O m { 5 , 6 , 7 , 8 } , indicating that the DMF_M algorithm has higher quality of overlapping community detection when dealing with overlapping nodes that belong to multiple communities.
As shown in Figure 2, the trend charts of the GLOD, LOCD, OCS, and MULTICOM algorithms are presented. When O m { 3 , 5 , 6 } , the GLOD, LOCD, and OCS algorithms oscillate, but the GLOD algorithm has smaller fluctuations and a more stable trend. Overall, the community detection quality of the GLOD algorithm is more stable.
  • LFR dataset with N = 5000, μ = 0.4
The following conclusions can be drawn from Table 3: when O m { 2 , 3 , 4 , 5 , 7 , 8 } , the GLOD algorithm obtains the best F-score value, and LOCD5 obtains the maximum F-score value when O m { 6 } . The GLOD algorithm has a high degree of separation for overlapping communities in highly overlapping networks, and there are no nodes with unknown community labels in small parts of the network, indicating a high level of local expandability.
Figure 3 shows the trend graphs of the GLOD, LOCD, OCS, and MULTICOM algorithms. When O m > 4 , the GLOD algorithm tends to stabilize and remains above 0.5. The trend graph also indicates that the GLOD algorithm has a higher effectiveness in community detection than other algorithms. Overall, the community detection quality of the GLOD algorithm is more stable.
In summary, as the number of communities that have overlapping nodes belonging to O m increases, the quality of overlapping community detection decreases, resulting in a decreasing trend in F-score values for all 11 algorithms. However, algorithms that use a hybrid function during the expansion phase show higher quality in detecting overlapping communities as O m and μ increase, compared to algorithms using a single principle. Additionally, the GLOD algorithm has an extra integration phase, which better implements the effectiveness of local expansion compared to other algorithms.

5.3. Real-World Networks

For real-world network data, since the actual community structure is unknown, it is not possible to compare the community detection results with the true community results. Therefore, in this experiment, the Amazon dataset with real community structure and its dataset of top5000 communities were selected to verify the effectiveness of overlapping community detection.
As shown in Figure 4, on the Amazon dataset, the GLOD, LOCD1, LOCD2, and LOCD3 algorithms are significantly better than the LOCD4, LOCD5, LOCD6, OCS, and MULTICOM algorithms. Moreover, it is evident that the GLOD, LOCD1, LOCD2, and LOCD3 algorithms all use the DMF_M algorithm in the expansion phase, which significantly improves the quality of overlap community detection compared to those without the DMF_M algorithm. In the high-quality community dataset of the top5000 communities in Amazon, as the overlap degree of communities increases and the separability becomes clearer, the quality of community detection improves. The F-score values of all algorithms on the top5000 dataset have significantly increased, with the GLOD algorithm’s community detection effectiveness being 2.1% higher than the best LOCD2 algorithm. In the experiment, the OCS algorithm may generate blank communities where single nodes do not belong to any community, leading to a decrease in the quality of overlap community detection, which the GLOD algorithm addresses. Although LOCD’s six algorithms solve the blank community problem, it is challenging to define the fuzzy relationship threshold for calculating community centers, resulting in multiple community centers. GLOD’s focus in the expansion phase is on boundary problems, and community center weight is not the most critical issue to solve. During community merging, GLOD addresses the problem of high similarity between overlapping nodes belonging to multiple communities, merging related communities when the similarity between overlapping nodes exceeds 1/3, indirectly improving the quality of communities. In conclusion, in a dynamic graph with high-quality communities and a large number of nodes and edges, finding the global optimum as the graph scales up is an NP-hard problem. The GLOD algorithm can effectively detect overlap communities in the local graph through a hybrid method of local expansion.
On the Provenance Networks dataset, we can see that the overall trend is declining. The GLOD algorithm has an obvious advantage in the Provenance Networks data. The first stage is mainly on the seed selection of the neighbor similarity function. Although the boundary is not obvious, the community is found. The general orientation of the second part uses the f(C), ω ( v i ) and F ( v , s ) hybrid algorithm to expand and clarify the boundaries of the community. Finally, the J algorithm is used to merge redundant communities to further improve the quality of the community.

6. Conclusions

In this paper, we propose a method of local greedy expansion, called GLOD. This method uses neighbor similarity to form seed communities, which avoids the problem of invalid seeds caused by a single node forming a seed and the problem of blank communities. Furthermore, the method extends the boundaries of communities based on a mixed community similarity function, strictly controlling the separation boundaries of communities and avoiding community jitter. The analysis and merging of community detection results can reduce redundant communities while maintaining high coverage. Experiments on LFR test networks and real network datasets show that the GLOD method can achieve good community detection results on datasets with high overlap and good community quality.
In addition, GLOD may exhibit unstable community detection quality when the community boundaries are not clear or the dataset has low community quality. This is mainly observed in dynamic graphs, where the instability of local nodes causes floating computation issues. The next step will be to conduct further research on the adaptability function of community extension, so as to maintain stable community detection quality even when the community boundaries are not clear or the dataset has low community quality.

Author Contributions

Conceptualization, Y.S. (Ying Song); methodology, Z.Z.; software, Z.Z.; validation, Z.Z.; formal analysis, Z.Z. and Y.S. (Ying Song); investigation, B.W. and Z.Z.; resources, Y.S. (Ying Song) and Z.Z.; data curation, Z.Z. and B.W.; writing—original draft preparation, Z.Z. and Y.S. (Yunmei Shi); writing—review and editing, Z.Z. and Y.S. (Ying Song); supervision, Z.Z. and Y.S. (Ying Song). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (Grant No. 61872043) and the State Key Laboratory of Computer Architecture (ICT, CAS) under Grant No. CARCHA202103.

Data Availability Statement

The data presented in this study are available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dourisboure, Y.; Geraci, F.; Pellegrini, M. Extraction and classification of dense communities in the web. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 461–470. [Google Scholar]
  2. Krause, A.E.; Frank, K.A.; Mason, D.M.; Ulanowicz, R.E.; Taylor, W.W. Compartments revealed in food-web structure. Nature 2003, 426, 282–285. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Battle, L.; Heer, J. Characterizing exploratory visual analysis: A literature review and evaluation of analytic provenance in tableau. Comput. Graph. Forum 2019, 38, 145–159. [Google Scholar] [CrossRef]
  4. Visser, R.M. Dendrochronological Provenance Patterns. Network Analysis of Tree-Ring Material Reveals Spatial and Economic Relations of Roman Timber in the Continental North-Western Provinces. J. Comput. Appl. Archaeol. 2021, 4, 230–253. [Google Scholar] [CrossRef]
  5. Gao, W.; Luo, W.; Bu, C. Adapting the Top Leaders algorithm for dynamic social networks. J. Supercomput. 2020, 76, 7883–7905. [Google Scholar] [CrossRef]
  6. Whang, J.J.; Gleich, D.F.; Dhillon, I.S. Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans. Knowl. Data Eng. 2016, 28, 1272–1284. [Google Scholar] [CrossRef]
  7. Fortunato, S.; Hric, D. Community detection in networks: A user guide. Phys. Rep. 2016, 659, 1–44. [Google Scholar] [CrossRef] [Green Version]
  8. Luo, W.; Zhang, D.; Jiang, H.; Ni, L.; Hu, Y. Local community detection with the dynamic membership function. IEEE Trans. Fuzzy Syst. 2018, 26, 3136–3150. [Google Scholar] [CrossRef]
  9. Stanton, I.; Kliot, G. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 August 2012; pp. 1222–1230. [Google Scholar]
  10. Lancichinetti, A.; Fortunato, S.; Kertész, J. Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 2009, 11, 033015. [Google Scholar] [CrossRef]
  11. Ni, L.; Luo, W.; Zhu, W. Local overlapping community detection. ACM Trans. Knowl. Discov. Data (TKDD) 2019, 14, 1–25. [Google Scholar] [CrossRef]
  12. Reid, F.; McDaid, A.; Hurley, N. Partitioning breaks communities. In Mining Social Networks and Security Informatics; Springer: Berlin/Heidelberg, Germany, 2013; pp. 79–105. [Google Scholar]
  13. Baltsou, G.; Christopoulos, K.; Tsichlas, K. Local Community Detection: A. Survey. IEEE Access 2022, 10, 110701–110726. [Google Scholar] [CrossRef]
  14. Gupta, S.K.; Singh, D.P.; Choudhary, J. A review of clique-based overlapping community detection algorithms. Knowl. Inf. Syst. 2022, 64, 2023–2058. [Google Scholar] [CrossRef]
  15. Shi, Y.; Wang, Y.; Zhao, Q. Research status of community detection based on local expansion. J. Commun. 2019, 40, 149–162. [Google Scholar]
  16. Newman, M.E.J.; Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 2004, 69, 26113. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. DiTursi, D.J.; Ghosh, G.; Bogdanov, P. Local community detection in dynamic networks. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 847–852. [Google Scholar]
  18. Yin, H.; Benson, A.R.; Leskovec, J.; Gleich, D.F. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 555–564. [Google Scholar]
  19. Meng, F.; Zhu, M.; Zhou, Y.; Zhou, R. Local community detection in complex networks based on maximum cliques extension. Math. Probl. Eng. 2014, 2014, 653670. [Google Scholar]
  20. Christopoulos, K.; Baltsou, G.; Tsichlas, K. Local Community Detection in Graph Streams with Anchors. Information 2023, 14, 332. [Google Scholar] [CrossRef]
  21. Huang, Z.; Xu, W.; Zhuo, X. Community-CL: An Enhanced Community Detection Algorithm Based on Contrastive Learning. Entropy 2023, 25, 864. [Google Scholar] [CrossRef] [PubMed]
  22. Qing, H. Estimating the number of communities in weighted networks. Entropy 2023, 25, 551. [Google Scholar] [CrossRef]
  23. Yao, B.; Zhu, J.; Ma, P.; Gao, K.; Ren, X. A Constrained Louvain Algorithm with a Novel Modularity. Appl. Sci. 2023, 13, 4045. [Google Scholar] [CrossRef]
  24. Wan, Y.; Tan, X.; Shu, H. Finding and Evaluating Community Structures in Spatial Networks. ISPRS Int. J. Geo-Inf. 2023, 12, 187. [Google Scholar] [CrossRef]
  25. Chakraborty, T.; Kumar, S.; Ganguly, N.; Mukherjee, A.; Bhowmick, S. GenPerm: A unified method for detecting non-overlapping and overlapping communities. IEEE Trans. Knowl. Data Eng. 2016, 28, 2101–2114. [Google Scholar] [CrossRef] [Green Version]
  26. Bai, X.; Yang, P.; Shi, X. An overlapping community detection algorithm based on density peaks. Neurocomputing 2017, 226, 7–15. [Google Scholar] [CrossRef]
  27. Rezvani, M.; Liang, W.; Liu, C.; Yu, J.X. Efficient detection of overlapping communities using asymmetric triangle cuts. IEEE Trans. Knowl. Data Eng. 2018, 30, 2093–2105. [Google Scholar] [CrossRef]
  28. Palla, G.; Derényi, I.; Farkas, I.; Vicsek, T. Uncovering the overlapping community structure of complex networks in nature and society. Nature 2005, 435, 814–818. [Google Scholar] [CrossRef] [Green Version]
  29. Ma, J.; Fan, J. Local optimization for clique-based overlapping community detection in complex networks. IEEE Access 2019, 8, 5091–5103. [Google Scholar] [CrossRef]
  30. Xu, H.; Ran, Y.; Xing, J. An Influence-Based Label Propagation Algorithm for Overlapping Community Detection. Mathematics 2023, 11, 2133. [Google Scholar] [CrossRef]
  31. Li, X.; Sun, Q. Detect Overlapping Community Based on the Combination of Local Expansion and Label Propagation. Algorithms 2021, 14, 237. [Google Scholar] [CrossRef]
  32. Lin, H.; Zhan, Y.; Zhao, Z.; Chen, Y.; Dong, C. Overlapping Community Detection Based on Attribute Augmented Graph. Entropy 2021, 23, 680. [Google Scholar] [CrossRef]
  33. Huang, M.; Jiang, Q.; Qu, Q.; Rasool, A. An overlapping community detection approach in ego-splitting networks using symmetric nonnegative matrix factorization. Symmetry 2021, 13, 869. [Google Scholar] [CrossRef]
  34. Peng, Y.; Zhang, B.; Chang, F. Overlapping community detection of bipartite networks based on a novel community density. Future Internet 2021, 13, 89. [Google Scholar] [CrossRef]
  35. Gao, R.; Li, S.; Shi, X.; Liang, Y.; Xu, D. Overlapping community detection based on membership degree propagation. Entropy 2020, 23, 15. [Google Scholar] [CrossRef]
  36. Li, Y.; He, J.; Wu, Y.; Lv, R. Overlapping community discovery method based on two expansions of seeds. Symmetry 2020, 13, 18. [Google Scholar] [CrossRef]
  37. Tsung, C.K.; Ho, H.J.; Chen, C.Y.; Chang, T.W.; Lee, S.L. Detecting overlapping communities in modularity optimization by reweighting vertices. Entropy 2020, 22, 819. [Google Scholar] [CrossRef] [PubMed]
  38. Kamuhanda, D.; He, K. A nonnegative matrix factorization approach for multiple local community detection. In Proceedings of the 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Barcelona, Spain, 28–31 August 2018; pp. 642–649. [Google Scholar]
  39. Hollocou, A.; Bonald, T.; Lelarge, M. Multiple local community detection. ACM SIGMETRICS Perform. Eval. Rev. 2018, 45, 76–83. [Google Scholar] [CrossRef] [Green Version]
  40. Cui, W.; Xiao, Y.; Wang, H.; Lu, Y.; Wang, W. Online search of overlapping communities. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 277–288. [Google Scholar]
  41. Riolo, M.A.; Newman, M. Consistency of community structure in complex networks. Phys. Rev. E 2020, 101, 052306. [Google Scholar] [CrossRef] [PubMed]
  42. Chen, J.; Zhou, G.; Nan, Y. Semi-Supervised Local Expansion Method for Overlapping Community Detection. J. Comput. Res. Dev. 2016, 53, 1376–1388. [Google Scholar]
  43. Jeub, L.G.S.; Mahoney, M.W.; Mucha, P.J.; Porter, M.A. A Local Perspective on Community Structure in Multilayer Networks; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
  44. Asmi, K.; Lotfi, D.; Abarda, A. The greedy coupled-seeds expansion method for the overlapping community detection in social networks. Computing 2022, 104, 295–313. [Google Scholar] [CrossRef]
  45. Xiao, M.; Meng, X.W.; Shi, Y.C. A Circuits Merging Community Discovery Algorithm Based on Mobile User Behaviors. J. Electron. Inf. Technol. 2012, 34, 2369–2374. [Google Scholar] [CrossRef]
  46. Zhou, Y.; Sun, G.; Xing, Y.; Zhou, R.; Wang, Z. Local Community Detection Algorithm Based on Minimal Cluster. Appl. Comput. Intell. Soft Comput. 2016, 2016, 3217612. [Google Scholar] [CrossRef] [Green Version]
  47. Lancichinetti, A.; Fortunato, S. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys. Rev. E 2009, 80, 016118. [Google Scholar] [CrossRef]
Figure 1. F-score on LFR networks with N = 5000, μ   = 0.1.
Figure 1. F-score on LFR networks with N = 5000, μ   = 0.1.
Mathematics 11 03284 g001
Figure 2. F-score on LFR networks with N = 5000, μ   = 0.3.
Figure 2. F-score on LFR networks with N = 5000, μ   = 0.3.
Mathematics 11 03284 g002
Figure 3. F-score on LFR networks with N = 5000, μ = 0.4.
Figure 3. F-score on LFR networks with N = 5000, μ = 0.4.
Mathematics 11 03284 g003
Figure 4. F-score on Amazon real networks.
Figure 4. F-score on Amazon real networks.
Mathematics 11 03284 g004
Table 1. F-score on LFR networks with N = 5000, μ   = 0.1.
Table 1. F-score on LFR networks with N = 5000, μ   = 0.1.
Om2345678
GLOD0.96230.83860.79350.76470.72760.69530.6863
LOCD10.81180.79160.76730.74510.71230.6830.6555
LOCD20.82750.79440.78470.7590.73350.70190.6764
LOCD30.81280.7910.77570.75390.73270.69830.6647
LOCD40.92990.80720.7890.74090.71050.6940.6606
LOCD50.93010.81740.79980.7560.72790.70310.6719
LOCD60.93680.8220.80380.76070.73440.70850.6765
(2, 1) OCS0.70070.48670.5660.46250.42460.52750.5462
(3, 0.8) OCS0.67070.43760.54930.44760.4060.51550.5584
(3, 0.9) OCS0.41160.37080.37640.39390.39860.41970.3945
MULTICOM0.2380.23730.23290.24780.24170.22840.2261
Table 2. F-score on LFR networks with N = 5000, μ   = 0.3.
Table 2. F-score on LFR networks with N = 5000, μ   = 0.3.
Om2345678
GLOD0.78940.66320.68640.65220.63970.62310.6095
LOCD10.72760.6390.66180.6930.61950.63720.6014
LOCD20.73290.64270.65230.6870.60840.64140.5936
LOCD30.72650.63390.64870.68790.60350.63710.5931
LOCD40.64760.61650.64370.60260.55960.6040.5717
LOCD50.67980.63480.68070.61240.52810.60470.5703
LOCD60.67310.6340.6820.6160.53630.61030.5747
(2, 1) OCS0.28040.17120.30780.25170.31060.25630.2275
(3, 0.8) OCS0.26520.14590.33170.21130.30190.28480.2065
(3, 0.9) OCS0.17950.22130.22020.24130.2930.240.2152
MULTICOM0.16890.16790.15570.16730.17220.18020.1674
Table 3. F-score on LFR networks with N = 5000, μ   = 0.4.
Table 3. F-score on LFR networks with N = 5000, μ   = 0.4.
Om2345678
GLOD0.71740.60380.55430.53770.53710.52960.5291
LOCD10.38660.45440.42570.45270.47620.51360.5089
LOCD20.38360.47620.40030.44390.46620.49640.5024
LOCD30.37770.45290.38690.42930.44990.48260.4958
LOCD40.64340.54890.49020.50480.51570.5150.524
LOCD50.70550.570.48170.46210.54920.48620.4635
LOCD60.70390.57150.48280.46330.55020.48890.4663
(2, 1) OCS0.30640.30530.23360.26460.29220.35990.3455
(3, 0.8) OCS0.29840.2870.21450.24960.28620.36070.3341
(3, 0.9) OCS0.04910.09760.0480.04990.05860.0860.1212
MULTICOM0.14680.16420.1460.14450.14660.14690.1482
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Song, Y.; Zheng, Z.; Shi, Y.; Wang, B. GLOD: The Local Greedy Expansion Method for Overlapping Community Detection in Dynamic Provenance Networks. Mathematics 2023, 11, 3284. https://doi.org/10.3390/math11153284

AMA Style

Song Y, Zheng Z, Shi Y, Wang B. GLOD: The Local Greedy Expansion Method for Overlapping Community Detection in Dynamic Provenance Networks. Mathematics. 2023; 11(15):3284. https://doi.org/10.3390/math11153284

Chicago/Turabian Style

Song, Ying, Zhiwen Zheng, Yunmei Shi, and Bo Wang. 2023. "GLOD: The Local Greedy Expansion Method for Overlapping Community Detection in Dynamic Provenance Networks" Mathematics 11, no. 15: 3284. https://doi.org/10.3390/math11153284

APA Style

Song, Y., Zheng, Z., Shi, Y., & Wang, B. (2023). GLOD: The Local Greedy Expansion Method for Overlapping Community Detection in Dynamic Provenance Networks. Mathematics, 11(15), 3284. https://doi.org/10.3390/math11153284

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop