1. Introduction
Enumerating or identifying the occurrences of small subgraphs, also known as graphlets or motifs, is crucial in solving problems on large networks and has attracted considerable interest from the research community. Graphlets such as triangles, cycles, and cliques are widely used in various domains, such as spam detection [1,2], link prediction [3], uncovering patterns in biological networks [4,5], anomaly detection [6,7], community detection [8], social network analysis [9,10,11], clustering [12], and classification [13] tasks. They provide valuable insights into the structural properties of complex networks. Various researchers have developed algorithms to count the occurrences of graphlets in different types of networks [14,15,16,17].
A set of k pairwise-connected vertices forms a dense subgraph known as a k-clique. The triangle, the fundamental k-clique with k = 3, is a widely used graphlet because its robust structure captures complex interactions between entities, often representing social connections, communication patterns, or dependencies in networks such as social media [18,19,20].
Clique counting algorithms are essential in several domains and for analyzing different networks, including social network analysis [21], biological network analysis [22], recommendation systems [23], fraud detection [24], network similarity [25], and many others [26,27]. These algorithms facilitate identifying cohesive groups, clusters, or communities within complex networks, providing insights, optimizations, and decision support in various real-world applications. In social networks, it is crucial to identify close-knit groups, such as groups of friends or professional circles, to better understand the structure and dynamics of interactions. Similarly, in biological networks, recognizing repeating patterns is important for understanding the functions and interactions of proteins; such repeating structures can be modeled with the clique pattern. This pattern can indicate tight connections and strong, consistent interactions, and it offers deeper insights into the underlying structure of the network. Additionally, in the realm of AI, Exploratory Data Analysis (EDA) is crucial for the preparation and preprocessing phases of machine learning model development; our methodology can be applied during the EDA phase of data science projects to gain insights and analyze data. However, counting k-cliques for larger values of k poses significant algorithmic challenges, mainly due to the exponential growth of the search space associated with large cliques.
There are many algorithms for exact or approximate clique counting [28,29,30,31,32,33]. On large datasets, exact algorithms are often infeasible, leading researchers to use sampling-based methods. Turán-shadow [31] introduces an innovative method for approximate clique counting, using Turán's theorem [34] to count cliques up to size 10. Meanwhile, Pivoter [32] has revolutionized clique counting by eliminating the need for enumeration, allowing the exact computation of k-clique counts for all k values across different graphs. However, its authors note that it is unsuitable for large datasets, such as "com-lj", beyond a certain clique size. The YACC algorithm [33] approximates k-cliques up to k = 40 by revisiting the Turán-shadow algorithm and incorporating various insights to achieve faster clique counting. Ye et al. [35] introduce three dynamic-programming- and sampling-based approximate k-clique counting algorithms and combine them with Pivoter. However, the trade-off between execution time, accuracy, and sample requirements remains a notable limitation. Achieving a balance between computational efficiency and result accuracy remains a significant challenge in clique counting algorithms.
Our algorithm, Boundary-driven approximations of k-cliques (BDAC), uses Turán's theorem [34] and additional theorems [36,37,38,39] from the previous literature to approximate the number of k-cliques. The proposed method differs from existing methods in eliminating the need for sampling procedures and repetitive recursive calls. Instead, it provides lower and upper bounds for local (per-vertex) and global k-clique counts over entire datasets, and it can compute these approximations in roughly the same time for any value of k. As a result, our algorithm is suitable for various k values, although our tests have been limited to k = 50. This limitation is not due to any specific issue with the algorithm but simply reflects the scope of our current testing. This approach allows the examination of high-order k-cliques in other application areas and in large datasets from various domains, such as social, biological, and web networks, thereby providing a more comprehensive analysis of these datasets.
In research areas like social networks and biological networks, small k-cliques are commonly used to discover communities or clusters. This is partly due to the challenges posed by the combinatorial explosion in complexity. As the graph size increases, the number of possible k-cliques grows exponentially, making it difficult to count larger cliques, especially as the graph’s density increases. However, our algorithm, capable of handling larger k values, offers the potential to re-evaluate these fields. It allows for the exploration of larger k-cliques, providing a different perspective that may uncover new insights into the structure and interactions within these networks.
The remainder of this work is structured as follows: Section 2 presents the necessary terminology and several key notations, details the theorems used in our approach, and explains the contribution of the work. Section 3 provides the specifics of the proposed method, illustrated with a practical example. In Section 4, datasets and their properties, the working environment, and experimental results are presented. Section 5 compares the proposed algorithm with other algorithms in the literature regarding performance and complexity, and explains the limitations of our algorithm. The paper is concluded in Section 6, and future directions are given in Section 7.
2. Preliminaries
This part introduces basic graph terms and clique density theorems. Following that, we will address the problem and our contribution to it.
2.1. Graph Terminology
In the context of an undirected simple graph $G = (V, E)$, where V represents the set of vertices with cardinality $n = |V|$ and E represents the set of edges with cardinality $m = |E|$, the degree of a node is defined as the cardinality of its neighboring set, representing the count of adjacent edges connected to that node.
A maximal subgraph in which every pair of vertices is connected by an edge, thereby constituting a complete subgraph, formally defines a clique. A k-clique ($K_k$) is a clique formed by a set of k nodes, where $k \le n$. In graph theory, a 1-clique ($K_1$) denotes a single vertex, a 2-clique ($K_2$) denotes an edge between two vertices, and a 3-clique ($K_3$) denotes a triangle formed by three connected vertices.
The edge density of a graph is the ratio of the number of edges present in the graph to the total number of possible edges, calculated as $\rho = m / \binom{n}{2}$. Let us denote by $t_v$ the number of triangles that include node v and by $d_v$ the degree of node v. The triangle density of node v is calculated as $\delta_v = t_v / \binom{d_v}{2}$.
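As a concrete illustration of these two densities, here is a minimal Python sketch, assuming a simple adjacency-set representation of an undirected graph (the toy graph below is hypothetical, not one of the paper's datasets):

```python
from itertools import combinations

def edge_density(n, m):
    """Edge density: observed edges over all C(n, 2) possible edges."""
    return m / (n * (n - 1) / 2)

def triangle_density(adj, v):
    """Triangle density of v: edges among v's neighbors over C(deg(v), 2).

    Each edge between two neighbors of v closes exactly one triangle
    through v, so this ratio equals t_v / C(d_v, 2).
    """
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    triangles = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return triangles / (d * (d - 1) / 2)

# Toy graph: a 4-clique {0, 1, 2, 3} plus a pendant vertex 4 attached to 3.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
```

Here node 0's three neighbors are fully interconnected (density 1.0), while node 3's four neighbors span only three of the six possible edges (density 0.5).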
An induced subgraph is similar to a snapshot of a section of the original graph. It consists of specific vertices from the original graph and all the edges that connect these vertices.
Let H be a graph with k vertices and G be a graph with n vertices. We denote the set of induced copies of H in G as $\mathrm{ind}(H, G)$. The induced density of H is calculated as [40]:

$$d(H, G) = \frac{|\mathrm{ind}(H, G)|}{\binom{n}{k}}$$
In this paper, the k-clique density of node v is the proportion obtained by dividing the number of k-cliques that include node v by the total number of potential k-cliques that could include v. The total number of potential k-cliques is the number of possible combinations of k − 1 nodes selected from the set of neighbors of node v. The k-clique density of graph G is defined analogously over the whole graph.
In graph theory, degeneracy ordering involves iteratively removing the vertex with the lowest degree. If multiple vertices share the lowest degree, one is selected according to a predefined rule, such as the vertex ID, or by an arbitrary choice. After a vertex is removed, the degrees of its neighbors are decreased by one. This process is repeated until all vertices are removed, and the vertices are sorted in the order in which they were removed from the graph. The objective of degeneracy ordering is to arrange the vertices so that each vertex has few neighbors appearing later in the ordering. The degeneracy of a graph is defined as the maximum out-degree among all vertices in the degeneracy ordering. In a graph with low degeneracy, all induced subgraphs are composed of vertices with low degrees [41]. In degree ordering, nodes are arranged based on their degrees. The color ordering technique utilizes a greedy coloring algorithm [42,43] to color graphs with m colors. This approach assigns one of the m colors to each node while ensuring that no two neighboring nodes share the same color.
When searching for cliques in an undirected graph, they can be repeatedly discovered by examining each constituent edge. To avoid counting the same clique multiple times, the graph is transformed into a Directed Acyclic Graph (DAG), which is a directed graph with no cycles. To create this DAG, the nodes are first ordered using a specific ordering technique. Each edge (u, v) is then directed based on this ordering: if node u precedes node v in the order, the edge is directed from u to v.
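The two steps above, computing a degeneracy ordering and then orienting each edge from the earlier vertex to the later one, can be sketched as follows. This is a simple lazy-deletion heap implementation, not the paper's own code:

```python
import heapq

def degeneracy_ordering(adj):
    """Repeatedly remove a minimum-degree vertex (ties broken by vertex id)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry; a fresher one exists
        removed.add(v)
        order.append(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

def orient(adj, order):
    """Direct each edge from the earlier vertex to the later one, yielding a DAG."""
    pos = {v: i for i, v in enumerate(order)}
    return {v: {u for u in adj[v] if pos[v] < pos[u]} for v in adj}

# Triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
order = degeneracy_ordering(adj)
dag = orient(adj, order)
```

Because the pendant vertex is removed first, the maximum out-degree in the resulting DAG equals the degeneracy of this graph (here, 2).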
An undirected graph consists of symmetric relations. For example, let us consider a social network whose participants collaborate on a software development project. Participants A and B collectively decide on the project’s direction by discussing it with each other. If participant A says it collaborates with B, that means B collaborates with A. It is unreasonable for B to say it does not cooperate with A. If we know the relation between A and B is symmetric, there is only one edge between them, and this edge can be called AB or BA. Furthermore, a clique pattern also comprises symmetric relations, as each node in the clique maintains a direct, mutual relation with every other node in that clique.
2.2. Clique Density Theorems
This section presents the theorems utilized within our algorithm. To access the formal proofs of the referenced theorems in this paper, refer to the provided references, which offer thorough and rigorous demonstrations of their validity.
Theorem 1 (Turán's theorem [34]). For any integer k, a $K_k$-free graph on n vertices has at most $\left(1 - \frac{1}{k-1}\right)\frac{n^2}{2}$ edges. In extremal graph theory, Turán posed a question regarding a positive number n and a graph F: what is the maximum number of edges a graph with n vertices can have without containing F as a subgraph? Turán provided a complete solution for the case where F is a clique.

The expression $1 - \frac{1}{k-1}$ is denoted as the Turán threshold. For clarity, if the edge density exceeds the Turán threshold, the graph contains at least one k-clique ($K_k$).
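A minimal numeric check of this guarantee, using the edge-count form of the Turán bound (the function names here are illustrative):

```python
def turan_threshold(k):
    """The Turán threshold 1 - 1/(k-1)."""
    return 1.0 - 1.0 / (k - 1)

def turan_guarantees_clique(n, m, k):
    """True when m exceeds (1 - 1/(k-1)) * n^2 / 2, forcing a k-clique."""
    return m > turan_threshold(k) * n * n / 2
```

For example, the complete graph K4 (n = 4, m = 6) is forced to contain a 4-clique, while K4 minus one edge (m = 5) is not.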
Theorem 2 (Erdős's theorem [36]). In a graph with n vertices, if the edge density exceeds the threshold relative to k, the graph contains at least a guaranteed minimum number of k-cliques, expressed as a function of n and k.

This theorem provides a lower bound for the number of k-cliques in a graph that meets the Turán threshold.
Theorem 3 (Zykov's theorem [37]). For integers $r \le k$, if a graph contains no k-clique, then its $K_r$-density is at most $\prod_{i=1}^{r-1}\left(1 - \frac{i}{k-1}\right)$ [40]. Zykov generalized Turán's theorem; we refer to this expression as the Zykov threshold. From another point of view, if the $K_r$-density of a graph exceeds the Zykov threshold, the graph contains at least one k-clique.
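A small helper for this threshold; the product form below is a reconstruction that reduces to the Turán threshold for r = 2 and matches the worked example later in the paper (≈ 0.22 for k = 4, r = 3):

```python
from math import prod

def zykov_threshold(k, r):
    """K_r-density threshold prod_{i=1}^{r-1} (1 - i/(k-1)).

    For r = 2 this is exactly the Turán threshold 1 - 1/(k-1).
    """
    return prod(1 - i / (k - 1) for i in range(1, r))
```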
Theorem 4 (Kruskal–Katona theorem [38,44]). If the number of edges of a graph can be written as $m = \binom{x}{2}$ for a real number $x \ge 2$, then the number of k-cliques satisfies $n_k \le \binom{x}{k}$, where $n_k$ denotes the count of k-cliques. This theorem establishes an upper bound on the number of k-cliques within a given graph; we derive the upper bound by applying this formula.
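Numerically, the bound amounts to solving $\binom{x}{2} = m$ for a real x and evaluating the generalized (real-argument) binomial coefficient $\binom{x}{k}$; a sketch with illustrative names:

```python
from math import sqrt, factorial, prod

def kruskal_katona_upper(m, k):
    """Upper bound on the number of k-cliques in a graph with m edges.

    Solves x(x-1)/2 = m for real x, then returns C(x, k) via the
    generalized binomial coefficient.
    """
    x = (1 + sqrt(1 + 8 * m)) / 2
    if x < k:
        return 0.0
    return prod(x - i for i in range(k)) / factorial(k)
```

The bound is tight on complete graphs: K4 has 6 edges and exactly $\binom{4}{3} = 4$ triangles.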
Theorem 5 (Reiher's clique density theorem [39]). Reiher addresses the question of determining the minimum number of k-cliques that must exist in graphs with n vertices and more edges than the Turán bound [39]. Consider a graph G with n vertices and m edges whose edge density exceeds the Turán threshold: how many k-cliques, at a minimum, is this graph guaranteed to contain?

According to this theorem, given the number of vertices (n) and edges (m) of a graph, we first compute the edge density, then rearrange the interval condition of the theorem in terms of the integer parameter s to obtain an admissible interval for s. We then compute the implicitly defined parameter by substituting the density and s into the theorem's equation, and finally obtain the guaranteed minimum number of k-cliques, provided that the edge density is at least the Turán threshold. If the density falls below the Turán threshold, the relevant binomial coefficient becomes zero and, consistent with Turán's theorem, no k-clique is predicted.
2.3. Main Contribution
The main challenge of counting k-cliques arises from the overwhelming number of possibilities, known as a combinatorial explosion. Hence, methods have been developed to sample large, dense subgraphs, mainly to count larger k-cliques. However, these sampling methods require assumptions about the distribution or sufficiency of the sample count.
This study introduces a novel algorithm to approximate the number of k-cliques (for all k) in a given graph. The algorithm functions by examining each node in the graph and identifying the triangles formed by that node. Utilizing this localized information with established theorems from the existing literature, the algorithm can determine lower and upper bounds for the number of k-cliques linked to each node. By consolidating these individual boundaries throughout the entire dataset, we obtain a comprehensive range for the total number of k-cliques in the graph. This methodology provides an efficient computational approach to approximate the number of k-cliques.
The Boundary-driven approximations of k-cliques (BDAC) algorithm draws inspiration from the Turán-shadow algorithm [31], which leverages classic extremal combinatorics principles concerning clique densities. Turán [34] and Erdős [36] established lower bounds on the number of cliques within sufficiently dense graphs. The novelty of our algorithm lies in integrating additional theorems, namely those of Zykov [37], Kruskal–Katona [38,44], and Reiher [39], to provide both lower and upper bounds for dense graphs.
The Turán-shadow [31] method involves constructing a set of dense subgraphs, known as a Turán-shadow, to cover graph G comprehensively and include all cliques. Then, it uses standard techniques to design an unbiased estimator for the clique count. The Turán-shadow algorithm dedicates a significant portion of its overall runtime to constructing the shadow; sampling, in contrast, constitutes a small fraction of the total execution time.
The BDAC algorithm bypasses the requirement for the Turán-shadow and sampling techniques, offering a direct and efficient method for estimating k-clique counts in complex networks. By traversing each node in the graph, we establish lower and upper bounds for potential k-cliques formed by each vertex (a local bound per vertex) and aggregate these bounds to determine the approximation of k-cliques without needing a Turán-shadow (a global bound). This streamlined approach ensures high-speed performance, significantly reducing computational overhead and facilitating efficient approximation of k-clique counts. The algorithm additionally provides insights into the local k-clique counts formed by each vertex, is designed to work for any k value, and has complexity unaffected by the value of k.
We compare the BDAC algorithm and Turán-shadow regarding their execution time and results, mainly focusing on relatively small datasets. When dealing with large datasets, the applicability of Turán-shadow diminishes significantly, so we compared the BDAC algorithm with YACC [33], a modified version of Turán-shadow designed for such scenarios. Unfortunately, we could not access the source code of YACC, which prevents a direct comparison of execution times; we can therefore only compare the algorithms' estimation results on identical datasets.
The outcomes of the Pivoter [32] algorithm provide exact values for assessing the algorithms in terms of relative error. However, Pivoter does not scale to larger datasets (like com-lj [45]). We also compare BDAC with the algorithm proposed by Ye et al. [35] regarding estimation results and execution time. A comprehensive comparison of these state-of-the-art algorithms is provided, focusing on k values ranging from 8 to 50, to analyze the effect of k on various datasets. Consequently, the unavailability of exact values for large datasets prevents a full comparative assessment.
2.4. Related Work
Cliques have attracted significant attention, as they provide valuable insights into networks' intricate relationships and structural patterns. Triangles serve as the basic building block of cliques, which researchers use to compute the clustering coefficient and the transitivity ratio [46]. Larger clique counts are commonly used in various applications, such as community detection [47], clustering [48], epilepsy prediction [49], graph compression [50], and finding correlated genes [51].
The literature documents two categories of k-clique counting algorithms: exact methods and approximate approaches. Chiba and Nishizeki [28] propose the first k-clique algorithm, which lists all k-cliques and provides a novel graph orientation technique.
Table 1 provides an overview of the distinct characteristics of the algorithms. These characteristics include the year of publication, whether the algorithm is exact or approximate, the bounds utilized, the range of k tested on a large dataset, the ordering heuristics employed, the usage of recursion trees, the theorems utilized, and the time and space complexity.
Finocchi et al. [29] parallelize this algorithm by leveraging the MapReduce technique [52], integrating a degree ordering strategy to enhance its computational efficacy. Danisch et al. [30] additionally introduce a parallel version (kclist) of this algorithm, employing a degeneracy ordering technique. These algorithms are exact and rely on the k-clique enumeration technique. However, combinatorial explosion makes these algorithms unsuitable for large k values on large datasets.
Table 1.
Characteristics of clique counting algorithms. In the time and space complexity columns, k indicates clique size, m refers to the number of edges, and n denotes the number of vertices; the remaining symbols denote the degeneracy of the graph, the number of colors used for coloring, and the number of triangles of the input graph.
Algorithms | Year | Exact/Approximate | Used Bounds | Tested K | Ordering Heuristic | Recursion Tree | Theorems | Time Complexity | Space Complexity
---|---|---|---|---|---|---|---|---|---
Arbo [28] | 1985 | exact | x | k <= 3 | degree ordering | x | x | |
Clique counting [29] | 2015 | exact | x | k <= 6 | degree ordering | x | Arbo [28] | |
Turán-shadow [31] | 2017 | approx (sampling) | lower | k <= 10 | degeneracy ordering | √ | Turán [34], Erdős [36] | |
Kclist [30] | 2018 | exact | x | k <= 10 | degeneracy ordering | x | Arbo [28] | |
Pivoter [32] | 2020 | exact | x | k <= 10 | degeneracy ordering | √ | Bron–Kerbosch [53] | |
YACC [33] | 2022 | approx (sampling) | lower | k <= 40 | degeneracy ordering | √ | Turán [34], Erdős [36] | |
DP-color path [35] | 2023 | approx (sampling) | x | k <= 15 | degeneracy & color ordering | x | x | |
BDAC | - | approx (no sampling) | lower & upper | k <= 50 | degeneracy ordering | x | Turán [34], Erdős [36], Zykov [37], Kruskal–Katona [38,44], Reiher [39] | |
Another exact k-clique algorithm, Pivoter, is proposed by Jain and Seshadri [32]. This algorithm constructs a recursion tree called a Succinct Clique Tree (SCT) using the pivoting technique based on the traditional strategy introduced by Bron and Kerbosch [53]. This method is essential for reducing the recursion tree of backtracking algorithms used to identify maximal cliques. The central concept behind the SCT structure is to preserve a distinct representation of all k-cliques, yet its size is significantly smaller than the total number of k-cliques. Based on current knowledge, Pivoter is considered the leading exact k-clique algorithm. It is highly efficient on sparse graphs but may face performance issues in dense graph regions, where numerous cliques with complex overlap relationships cause a large recursion tree to expand during execution.
Due to the combinatorial explosion, there has been a shift towards approximation solutions based on sampling methods. The Turán-shadow algorithm [31], introduced by Jain and Seshadri, stands as the state-of-the-art sampling-based approximate k-clique algorithm. This method involves creating a set of dense subgraphs, a recursion tree known as the Turán-shadow, to cover graph G comprehensively and include all cliques. Afterward, Turán-shadow employs an unbiased estimator to count the cliques using standard techniques. The construction of the shadow consumes a significant portion of the algorithm's time.
YACC [33] extends the capabilities of Turán-shadow to larger k values (up to 40) by reducing the recursion tree size through a systematic relaxation of the stopping condition during tree creation, leading to a more efficient algorithm. The YACC algorithm addresses the challenges of the Turán-shadow algorithm and introduces a strategy to minimize the recursion tree size. However, this size reduction comes at the cost of algorithm accuracy, and achieving an even smaller tree demands a considerable increase in the number of samples.
The algorithms proposed by Ye et al. [35] combine an exact approach with sampling strategies. The algorithm divides the input graph into sparse and dense regions based on average degree, then performs exact counting with Pivoter [32] on the sparse regions. For the dense regions, it proposes three color-based sampling algorithms: k-color set, k-color path, and k-triangle path. It employs a linear-time coloring process for dense regions using greedy coloring algorithms [42,43]. K-color set sampling selects k-color sets, each consisting of k nodes with unique colors. K-color path sampling selects connected k-color sets, called k-color paths, ensuring that the subgraph induced by the k vertices remains connected. K-triangle path sampling selects connected k-color sets in which every three consecutive nodes form a triangle; these sets are referred to as k-triangle paths. Each of these algorithms applies a dynamic programming strategy for uniform sampling. The time complexities of these algorithms are presented in Table 1 in the order expressed here.
Obtaining the count of k-cliques for larger k is algorithmically problematic due to a combinatorial explosion in the search space of large cliques. The Pivoter algorithm struggles to compute results for large k values on specific extensive datasets like com-lj and soc-lj [45]. However, YACC is the first algorithm to count large k-cliques (up to 40) in several graphs.
Several recent algorithms focus on identifying k-clique densest subgraphs, rather than simply counting cliques [54,55]. These algorithms aim to find subgraphs that maximize the density of k-cliques in relation to their size. However, these approaches are beyond the scope of our current research.
The efficient enumeration of k-cliques without relying on extensive recursion trees or intricate sampling strategies remains an open problem. Sampling strategies may still require a substantial number of samples to achieve high accuracy. The efficiency and practicality of handling high k values and large datasets remain concerns; proposed solutions must emphasize both while ensuring accurate approximation, especially on dense datasets.
3. Materials and Methods
This section introduces the fundamental concept underlying our study and enhances the clarity of the proposed method with a numerical example. The basis of this study is the Turán-shadow algorithm [31], which utilizes principles outlined in Turán's theorem [34]. Like Turán-shadow, the proposed algorithm begins by transforming the input graph into a directed acyclic graph (DAG) using degeneracy ordering; this prevents the redundant discovery of cliques. During graph traversal, the algorithm explores only the out-neighborhood of each node. To be self-contained, we start with an overview of the Turán-shadow algorithm. The last row in Table 1 shows the characteristics of our algorithm in comparison to the state-of-the-art k-clique counting algorithms, and we elucidate the distinctive characteristics of the proposed method below.
Turán's theorem [34] establishes that if a graph's edge density exceeds the Turán threshold $1 - \frac{1}{k-1}$, the graph contains at least one k-clique. The Turán-shadow theorem states that identifying all k-cliques containing a vertex v can be achieved by identifying all (k − 1)-cliques among the vertices adjacent to v. Turán-shadow views each node's neighbor list as a subgraph comprising those nodes and the edges connecting them. If the edge density of this subgraph meets the Turán threshold, then this node together with its neighbors contains at least one k-clique.

In the Turán-shadow algorithm, the Turán threshold determines the size of the recursion tree. As increasing values of k require the exploration of larger cliques, deeper levels of recursion are necessary, leading to an expansion of the recursion tree. Therefore, acquiring k-cliques for larger values of k on large datasets becomes infeasible with the Turán-shadow algorithm.
3.1. Boundary-Driven Approximations of K-Cliques (BDAC)
The primary innovation of BDAC lies in extracting edge density information from the neighborhood of a node, which yields the triangle density of that node, because the number of edges connecting the neighbors of a given node equals the number of triangles formed by that node. At this point, we use Zykov's theorem instead of Turán's theorem; Zykov's theorem extends Turán's theorem. Formally, for any positive integers r and k with $r \le k$, if the r-clique density satisfies the Zykov threshold, then the graph contains at least one k-clique.
The pseudocode of BDAC is shown in Algorithm 1. In line 5 of the pseudocode, H represents the subgraph induced by the current node's neighbors together with the edges between them. If the number of vertices of H, which also equals the current node's out-degree, is insufficient to construct the clique, the algorithm continues with the next vertex. Line 12 calculates the edge density of subgraph H, which is also equal to the triangle density of the current node.

Now that we have triangle density information for each node, we can determine the presence of k-cliques. By using triangle density information instead of edge density, we obtain a relatively lower threshold. If a node's triangle density satisfies the Zykov threshold, the next step is to derive the upper bound of k-cliques obtained from that node and its neighborhood using the Kruskal–Katona theorem explained in Theorem 4.

In line 15, a procedure computes the maximum number of k-cliques in subgraph H utilizing Kruskal–Katona's theorem. It requires as input parameters the triangle density of node v (equal to the edge density of subgraph H), the number of nodes in subgraph H, and the size of the clique we search for in H.
If the density satisfies the threshold, we also establish a lower bound on the number of k-cliques, initially employing Reiher's theorem as outlined in Theorem 5.
In line 18, a procedure calculates the minimum clique count in subgraph H utilizing Reiher's theorem. The inputs of this procedure are the parameter s explained in Theorem 5, the number of nodes in subgraph H, and the size of the clique we search for in H. However, when the density falls below the Turán threshold, the diminished value of the binomial coefficient renders Reiher's theorem inadequate for establishing the presence of k-cliques. In this scenario, the procedure returns −1, and we then verify whether the condition of Erdős's theorem holds. If it does, we apply Erdős's theorem (line 22); otherwise, we accept the lower bound as 0.
Algorithm 1 BDAC
1:  procedure approximate_k_clique(Graph G, DAG D, Integer k)
2:      totalMin ← 0
3:      totalMax ← 0
4:      for all v in G do
5:          H ← induced subgraph from the current node's out-neighbors in D
6:          k′ ← k − 1
7:          n ← the no. of vertices of subgraph H
8:          if n < k′ then
9:              continue
10:         end if
11:         m ← the no. of edges of subgraph H
12:         ρ ← 2m/(n(n − 1))                        ▹ edge density of H = triangle density of v
13:         τ ← (1 − 1/(k′ − 1)) · (1 − 2/(k′ − 1))  ▹ Zykov threshold for r = 3
14:         r ← 3
15:         maxCnt ← max_k_cliques(ρ, n, k′)         ▹ Kruskal–Katona upper bound
16:         totalMax ← totalMax + maxCnt
17:         if ρ ≥ τ then
18:             minCnt ← min_k_cliques(s, n, k′)     ▹ Reiher lower bound
19:             if minCnt = −1 then
20:                 τ′ ← 1 − 1/(k′ − 1)              ▹ Turán threshold
21:                 if ρ ≥ τ′ then
22:                     minCnt ← erdos_k_cliques(ρ, n, k′)
23:                 else
24:                     minCnt ← 0
25:                 end if
26:             end if
27:             totalMin ← totalMin + minCnt
28:         end if
29:     end for
30:     Print totalMin
31:     Print totalMax
32: end procedure
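The control flow of Algorithm 1 can be condensed into a short runnable sketch. This is not the paper's implementation: degeneracy ordering is replaced by vertex-id ordering, and the Reiher/Erdős lower-bound procedures are replaced by a trivial "at least one clique" placeholder, so only the overall structure (per-vertex density test, Kruskal–Katona upper bound, aggregation) is reflected:

```python
from itertools import combinations
from math import sqrt, factorial, prod

def zykov_threshold(k, r=3):
    """Reconstructed K_r-density threshold prod_{i=1}^{r-1}(1 - i/(k-1))."""
    return prod(1 - i / (k - 1) for i in range(1, r))

def kk_upper(m, k):
    """Kruskal-Katona upper bound: C(x, k) where C(x, 2) = m, x real."""
    x = (1 + sqrt(1 + 8 * m)) / 2
    return 0.0 if x < k else prod(x - i for i in range(k)) / factorial(k)

def bdac_sketch(adj, k):
    """Aggregate per-vertex lower/upper bounds on the k-clique count."""
    total_min, total_max = 0.0, 0.0
    for v in adj:
        nbrs = [u for u in adj[v] if u > v]    # out-neighbors under id ordering
        n = len(nbrs)
        if n < k - 1:                          # too few neighbors for a k-clique
            continue
        m = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        density = m / (n * (n - 1) / 2)
        total_max += kk_upper(m, k - 1)        # upper bound via Kruskal-Katona
        if density >= zykov_threshold(k - 1):  # lower bound only past threshold
            total_min += 1                     # placeholder for Reiher/Erdos count
    return total_min, total_max

# K5: every pair of the five vertices is connected.
K5 = {v: set(range(5)) - {v} for v in range(5)}
```

On the complete graph K5 with k = 4, the sketch returns bounds that bracket the exact count of C(5, 4) = 5 four-cliques, and the aggregated upper bound is tight.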
An additional differentiation from the Turán-shadow algorithm occurs when the edge density of a node and its neighborhood fails to meet the Turán threshold. In such cases, the algorithm recurses on the set of vertices within that neighborhood, constructing a recursion tree named Turán-shadow within the Turán-shadow algorithm. Significantly, the algorithm dedicates a notable portion of its overall computational time to establishing this recursion tree.
Conversely, if a node's triangle density does not satisfy the Zykov threshold, we omit that particular node from consideration. Consequently, in this scenario, we abstain from constructing any recursion tree.
As a final step, once we have obtained both the lower and upper bounds from each node that meets the Zykov threshold, we combine these bounds to calculate the overall lower and upper bounds. These aggregated bounds enable us to approximate the final count of k-cliques (lines 16 and 27).
3.2. Example
This section presents an illustrative example aimed at elucidating the theorems discussed.
Figure 1 illustrates a sample graph
G and the out-degrees of each node after degeneracy ordering and constructing DAG. In this example, we aim to estimate 5-cliques. Following the ordering, the algorithm traverses each node individually. If the neighbor count satisfies the desired clique count minus one, which is 4 (excluding the current node itself), we check whether the density satisfies the threshold. In this example, only nodes 2 and 5 have sufficient neighbors to form the desired clique.
Figure 2 illustrates the out-neighbors of nodes 2 and 5, along with the edges between these neighbors (the induced graph), represented as a subgraph H. Both nodes yield the same induced subgraph. The density of H is then calculated and checked against the Zykov threshold (threshold = 0.22). As previously mentioned, the edge density of subgraph H provides the triangle density of nodes 2 and 5; here, the r-value is 3. The algorithm then applies Reiher's theorem to establish the lower bound, giving a minimum clique count of 1, and Kruskal–Katona's theorem provides a maximum clique count of 1. Since nodes 2 and 5 produce the same results, totalMin is 2 and totalMax is 2, with the exact value also being 2: the two 5-cliques in G are those formed by nodes 2 and 5 together with their respective out-neighbors.
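As a sketch of the preprocessing this example relies on (degeneracy ordering followed by DAG construction), the snippet below uses a simple quadratic-time peeling loop. The helper names are ours; a bucket-based, linear-time implementation would be preferred for large graphs.

```python
def degeneracy_order(adj):
    """Repeatedly peel a minimum-degree vertex (Matula-Beck peeling)."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    order = []
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        order.append(v)
        for u in remaining.pop(v):   # remove v from its neighbors' lists
            remaining[u].discard(v)
    return order

def orient_by_order(adj, order):
    """Build the DAG: keep only edges pointing from earlier to later vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return {v: {u for u in adj[v] if pos[u] > pos[v]} for v in adj}
```

For instance, on a triangle {a, b, c} with a pendant vertex d attached to a, the pendant is peeled first, every edge appears exactly once in the orientation, and the out-degrees are bounded by the degeneracy (2 here).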
5. Discussion
This section discusses the results of our algorithm compared to state-of-the-art algorithms, the performance of BDAC, and its limitations.
Comparison with the other algorithms: Turán-shadow is suitable for relatively small datasets, since its effectiveness diminishes significantly on larger ones. The BDAC algorithm's capacity to handle dense subgraphs provides a distinct advantage in specific contexts, although in certain cases BDAC shows a notably wider gap between the minimum and maximum values than the Turán-shadow algorithm (refer to Table 3). Its capability to handle large, dense subgraphs exceeds that of both the Pivoter and Turán-shadow algorithms. The limited scalability of Pivoter hinders its applicability to larger datasets; as a result, the lack of exact values for larger datasets poses challenges for a comparative evaluation.
In most cases, the DP-color path algorithm outperforms BDAC in execution time, although for larger datasets and larger k it requires a larger sample size. It is also observed that the execution time of the DP-color path algorithm grows with k on large datasets and approaches the BDAC execution time. In contrast, the BDAC algorithm has approximately the same running time on a given dataset for all k values, because it neither builds a recursion tree nor applies sampling strategies.
Table 4 shows that the YACC and DP-color path results on “soc-LJ” for k = 40 and “com-orkut” for k = 20 and 40 are inconsistent. Without knowing the exact values, we cannot determine which algorithm’s sampling results are more accurate or which algorithm requires more samples. Our algorithm provides minimum and maximum k-clique counts in such cases, offering guarantees based on theoretical foundations.
Based on the dataset results, the exact values obtained do not surpass our estimated maximum value on any dataset. However, for some datasets, there is a significant difference between the BDAC’s minimum and maximum k-clique counts. The limitation part below explains this variance.
Complexity analysis: Both the Turán-shadow and YACC algorithms share the same time and space complexity. Each algorithm iterates over all vertices, which takes O(n) time, where n is the number of vertices. During each iteration, the algorithm recursively searches for (k − 1)-cliques within the neighborhoods of vertices. This recursive search takes O(α^(k−2)) time, where α represents the degeneracy of the graph, indicating that each subgraph has at most α neighbors. In each recursive search, the algorithm also calculates the edge density; that is, it checks whether any two neighbors of the current node form an edge, which takes O(α²) time. Therefore, constructing the recursion tree for these operations has a time complexity of O(n·α^(k−2)·α²) = O(n·α^k), and the total complexity is O(n·α^k + m), where the O(m) term accounts for the degeneracy orientation and m denotes the number of edges.
This complexity suggests that the algorithm's performance scales linearly with the number of vertices (n) but exponentially with the size of the structure (k), adjusted by the degeneracy (α). The α^(k−2) term indicates that, for each vertex, the algorithm explores combinations of neighbors, but the degeneracy (α) limits the growth of these combinations, making the algorithm more efficient than a naive approach for dense graphs.
In summary, this complexity indicates an algorithm that is efficient for sparse graphs (where α is low) and for finding relatively small structures (where k is not too large), as the cost grows significantly with larger k values, especially in denser graphs where α is higher. The space complexity is O(k·α) for the recursion tree and the subsets of neighbors stored at each level, plus O(n + m) for storing the original graph.
The algorithms proposed by Ye et al. [35] comprise three samplers: k-color set sampling, k-color path sampling, and k-triangle sampling, where χ is the number of colors of the graph G obtained by the greedy coloring algorithm [42,43], k is the clique size, n is the number of vertices, and m is the number of edges. The k-color set sampling algorithm considers all possible sets of k different colors; in the worst case, there are C(χ, k) such sets, and checking whether each set forms a k-clique dominates the running time. Its space complexity is O(n + m) to store the graph and its colors, plus the space for the dynamic programming (DP) table. The running time of the k-color path algorithm is governed by the possible colorings of paths of length k over the n vertices, plus the traversal of each edge (an additive m term); its space complexity covers the DP table and the DAG. The k-triangle algorithm samples a triangle and extends it to a k-clique, so its time complexity depends on the number of triangles of the input graph and on the clique size, and its space complexity is that of storing the DP table.
The BDAC algorithm also iterates through each vertex in the graph. For every vertex, it examines pairs of its neighbors to determine if they form an edge. This checking operation, for each vertex, has a time complexity of O(α²), where α is the degeneracy of the graph. Consequently, the overall time complexity of BDAC is O(n·α²), where n represents the number of vertices in the graph.
This complexity indicates that the algorithm's running time scales linearly with the number of vertices (n), while the time taken for each vertex scales quadratically with the degeneracy (α), because the algorithm checks pairs of neighbors for each vertex to determine whether they form an edge. The space complexity is O(α) for storing the nodes of the induced subgraph of each vertex, plus O(n + m) for storing the original graph.
BDAC achieves better time and space complexity than Turán-shadow, as it eliminates the recursion tree construction. Comparing k-color set sampling with BDAC, in the worst case χ is close to n, so the number of color sets approaches C(n, k), which grows far faster than BDAC's O(n·α²). Compared with the k-color path algorithm, the number of colorings of paths of length k can grow extremely large, making that approach impractical for large datasets (large n) and larger k; hence it is less efficient than BDAC in that regime. The time complexity of the k-triangle algorithm depends on the number of triangles of the input and on the clique size; compared with BDAC, this algorithm can be more efficient for datasets with few triangles, but it can be comparable to or worse than BDAC for dense datasets with many triangles.
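To illustrate the scale of this gap with made-up but plausible numbers (n, α, χ, and k below are assumptions for illustration, not measurements from the paper's datasets), one can compare BDAC's n·α² term with the C(χ, k) worst case of k-color set sampling:

```python
from math import comb

# Hypothetical sizes, chosen only to illustrate the asymptotic gap.
n, alpha, chi, k = 10**6, 100, 500, 20

bdac_cost = n * alpha**2        # O(n * alpha^2) elementary steps
color_set_cost = comb(chi, k)   # worst-case number of k-color sets

print(f"BDAC ~ {bdac_cost:.1e}, k-color sets ~ {color_set_cost:.1e}")
print(color_set_cost > bdac_cost)   # the color-set term dwarfs BDAC's cost
```

Even with only 500 colors, C(χ, 20) exceeds 10³⁵, dwarfing BDAC's roughly 10¹⁰ steps under these assumed sizes.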
As a result, the BDAC algorithm is generally more efficient than the sampling-based methods for large graphs, because α is typically much smaller than n. Sampling approaches are effective in certain settings, such as when the number of colors is small; however, they become more complex for larger graphs, especially as k increases.
Performance of BDAC: Applying the theorems in the algorithmic process lets us determine the dataset's minimum and maximum k-clique counts both per vertex and globally. This approach ensures rigorous analysis and guarantees that the reported bounds adhere to established mathematical principles. Additionally, understanding the range between the minimum and maximum values offers insights into the diversity and distribution of k-cliques, facilitating informed decision-making in data analysis and interpretation.
The sampling-based algorithms mostly outperform BDAC in execution time. However, their accuracy depends on the sample sizes, and their complexity grows with k. BDAC's complexity does not depend on k; it depends on the degeneracy of the graph. The BDAC algorithm is well suited to large and densely connected datasets such as com-lj, soc-pokec, and soc-LJ. In these datasets, the node densities typically meet the given threshold, allowing us to obtain both the minimum and maximum k-clique counts per node, which leads to a smaller difference between the lowest and highest values overall. This work represents the first attempt to provide lower and upper bounds and results for k = 50.
Limitation of BDAC: For some datasets, there is a significantly larger disparity between the estimated minimum and maximum k-clique counts than for others. This discrepancy highlights a limitation of our algorithm. Specifically, accurately estimating the minimum k-clique count becomes problematic when the edge density of a node's neighborhood is only at or slightly above the threshold, which also affects the final estimation of k-cliques. Additionally, if a node has a high degree but its density is lower than the given threshold, the node yields a sparse induced subgraph. This can lead to a significantly higher potential maximum k-clique count, particularly for large n but lower k, because of the binomial coefficients used in calculating the maximum value.
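The blow-up described here can be seen numerically. Assuming a hypothetical node with 1,000 neighbors whose induced subgraph has edge density 0.25 (illustrative numbers, not results from the paper), a Kruskal–Katona-style maximum (solving e = C(x, 2) and taking C(x, k − 1)) is astronomically large for moderate k, even though such a sparse neighborhood may contain no large clique at all:

```python
from math import ceil, comb, sqrt

d, density, k = 1000, 0.25, 8           # hypothetical high-degree, sparse node
e = int(density * comb(d, 2))           # edges in the induced subgraph
x = ceil((1 + sqrt(1 + 8 * e)) / 2)     # solve e = x * (x - 1) / 2
upper = comb(x, k - 1)                  # Kruskal-Katona-style (k-1)-clique bound

print(f"{e} edges -> upper bound {upper:.2e} on (k - 1)-cliques")
```

Here x lands near 500, so the bound C(501, 7) is on the order of 10¹⁵ for a neighborhood that is only a quarter full.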
This observation emphasizes a crucial aspect of our algorithm’s performance and offers valuable insights into its limitations. In cases where there is a substantial variance in estimates, it might be beneficial to assess metrics like geometric mean or utilize the minimum or maximum value.
6. Conclusions
The BDAC demonstrates significant efficacy in estimating k-cliques across various values of k, focusing mainly on k = 8, 15, 25, 40, and 50. The aim of providing results for different k values is to measure the algorithms' capability from small to large k. By leveraging theorems in the algorithmic process, we can determine the minimum and maximum k-clique counts locally per vertex and globally within the entire dataset, ensuring adherence to established mathematical principles and providing insights into the diversity and distribution of k-cliques. Our work represents the first attempt to provide both lower and upper bounds and results for k = 50, contributing to the advancement of k-clique counting algorithms.
Comparison with state-of-the-art algorithms, including Turán-shadow, Pivoter, YACC, and DP-color path, reveals distinctive characteristics of BDAC. Compared to Turán-shadow, its ability to handle large dense subgraphs offers a unique advantage, addressing a crucial limitation of existing algorithms and underscoring the potential utility of BDAC in specific contexts. Similarly, compared to YACC, our BDAC algorithm demonstrates competitive performance, delivering dependable estimations across datasets by offering lower and upper bounds. The DP-color path algorithm mostly outperforms the BDAC regarding execution time, but it requires a much larger sample size for large datasets and larger k. In such a situation, the execution time of the BDAC and DP-color algorithm becomes competitive, as is shown in the experimental results.
However, the BDAC exhibits limitations, particularly on datasets like “com-orkut”. The significant disparity between the estimated minimum and maximum k-clique counts highlights challenges in accurately estimating minimum k-clique sizes, especially when the edge density of node neighborhoods falls below the Zykov threshold.
In summary, we present a direct method for estimating k-cliques (for values of k up to 50) without relying on sampling techniques or the construction of recursion trees. Using established theorems, BDAC provides upper and lower bounds for k-cliques per vertex and globally, offering a reliable and efficient alternative to traditional methods. This advancement significantly enhances the accuracy and speed of analyzing complex networks and graph structures.