1. Introduction
Enumerating or identifying the occurrences of small subgraphs, also known as graphlets or motifs, is crucial in solving problems on large networks and has attracted considerable interest from the research community. Graphlets such as triangles, cycles, and cliques are widely used in various domains, such as spam detection [1,2], link prediction [3], uncovering patterns in biological networks [4,5], anomaly detection [6,7], community detection [8], social network analysis [9,10,11], clustering [12], and classification [13] tasks. They provide valuable insights into the structural properties of complex networks. Various researchers have developed algorithms to count the occurrences of graphlets in different types of networks [14,15,16,17].
A set of k pairwise-connected vertices forms a dense subgraph known as a k-clique. The triangle, the fundamental k-clique with k = 3, is a widely used graphlet because its robust structure captures complex interactions between entities, often representing social connections, communication patterns, or dependencies in networks such as social media [18,19,20].
Clique counting algorithms are essential in several domains and for analyzing different networks, including social network analysis [21], biological network analysis [22], recommendation systems [23], fraud detection [24], network similarity [25], and many others [26,27]. These algorithms facilitate identifying cohesive groups, clusters, or communities within complex networks, providing insights, optimizations, and decision support in various real-world applications. In social networks, it is crucial to identify close-knit groups, such as groups of friends or professional circles, to better understand the structure and dynamics of interactions. Similarly, in biological networks, recognizing repeating patterns is important for understanding the functions and interactions of proteins; such repeating structures can be modeled with the clique pattern. This pattern can indicate tight connections and strong, consistent interactions, and it offers deeper insights into the underlying structure of the network. Additionally, in the realm of AI, Exploratory Data Analysis (EDA) is crucial for the preparation and preprocessing phases of machine learning model development; our methodology can be applied during the EDA phase of data science projects to gain insights and analyze data. However, counting k-cliques for larger values of k poses significant algorithmic challenges, mainly due to the exponential growth of the search space associated with large cliques.
There are many algorithms for exact or approximate clique counting [28,29,30,31,32,33]. On large datasets, exact algorithms are often infeasible, leading researchers to use sampling-based methods. Turán-shadow [31] introduces an innovative method for approximate clique counting, using Turán's theorem [34] to count cliques up to size 10. Meanwhile, Pivoter [32] has revolutionized clique counting by eliminating the need for enumeration, allowing the exact computation of k-clique counts for all k values across different graphs. However, its authors note that it is unsuitable for large datasets, such as "com-lj", beyond a certain clique size. The YACC algorithm [33] approximates k-cliques up to k = 40 by revisiting the Turán-shadow algorithm and incorporating various insights to achieve faster clique counting. Ye et al. [35] introduce three dynamic-programming- and sampling-based approximate k-clique counting algorithms and combine them with Pivoter. However, the trade-off between execution time, accuracy, and sample requirements remains a notable limitation. Achieving a balance between computational efficiency and result accuracy remains a significant challenge in clique counting algorithms.
Our algorithm, Boundary-driven approximations of k-cliques (BDAC), uses Turán's theorem [34] and additional theorems [36,37,38,39] from the previous literature to approximate the number of k-cliques. The proposed method differs from existing methods in eliminating the need for sampling procedures and repetitive recursive calls. Instead, it provides lower and upper bounds for local (per-vertex) and global k-clique counts over entire datasets, and it can compute these approximations in roughly the same time for any value of k. As a result, our algorithm is suitable for various k values, although our tests have been limited to k = 50. This limitation is not due to any specific issue with the algorithm but simply reflects the scope of our current testing. This approach allows the examination of high-order k-cliques in other application areas and in large datasets from various domains, such as social, biological, and web networks, thereby providing a more comprehensive analysis of these datasets.
In research areas like social networks and biological networks, small k-cliques are commonly used to discover communities or clusters. This is partly due to the challenges posed by the combinatorial explosion in complexity. As the graph size increases, the number of possible k-cliques grows exponentially, making it difficult to count larger cliques, especially as the graph’s density increases. However, our algorithm, capable of handling larger k values, offers the potential to re-evaluate these fields. It allows for the exploration of larger k-cliques, providing a different perspective that may uncover new insights into the structure and interactions within these networks.
The remainder of this work is structured as follows: Section 2 presents the necessary terminology and several key notations, details the theorems used in our approach, and explains the contribution of the work. Section 3 provides the specifics of the proposed method, illustrated with a practical example. In Section 4, datasets and their properties, the working environment, and experimental results are presented. Section 5 compares the proposed algorithm with other algorithms in the literature regarding performance and complexity, and explains the limitations of our algorithm. The paper is concluded in Section 6, and future directions are given in Section 7.
2. Preliminaries
This part introduces basic graph terms and clique density theorems. Following that, we will address the problem and our contribution to it.
2.1. Graph Terminology
In the context of an undirected simple graph $G = (V, E)$, where V represents the set of vertices with cardinality $n = |V|$ and E represents the set of edges with cardinality $m = |E|$, the degree of a node is defined as the cardinality of its neighboring set, representing the count of adjacent edges connected to that node.
A maximal subgraph in which every pair of vertices is connected by an edge, thereby constituting a complete subgraph, formally defines a clique. A k-clique ($K_k$) is a clique formed by a set of k nodes, where $k \le n$. In graph theory, a 1-clique ($K_1$) denotes a single vertex, a 2-clique ($K_2$) denotes an edge between two vertices, and a 3-clique ($K_3$) denotes a triangle formed by three connected vertices.
The edge density of a graph is the ratio of the number of edges present in the graph to the total number of possible edges, calculated as $\rho = m / \binom{n}{2}$. Let us denote by $t_v$ the number of triangles that include node v and by $d_v$ the degree of node v. The triangle density of node v is calculated as $\delta_v = t_v / \binom{d_v}{2}$.
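As a concrete illustration of these two densities, here is a minimal Python sketch, assuming a simple adjacency-set representation of an undirected graph (the toy graph below is hypothetical, not one of the paper's datasets):

```python
from itertools import combinations

def edge_density(n, m):
    """Edge density: observed edges over all C(n, 2) possible edges."""
    return m / (n * (n - 1) / 2)

def triangle_density(adj, v):
    """Triangle density of v: edges among v's neighbors over C(deg(v), 2).

    Each edge between two neighbors of v closes exactly one triangle
    through v, so this ratio equals t_v / C(d_v, 2).
    """
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    triangles = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return triangles / (d * (d - 1) / 2)

# Toy graph: a 4-clique {0, 1, 2, 3} plus a pendant vertex 4 attached to 3.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
```

Here node 0's three neighbors are fully interconnected (density 1.0), while node 3's four neighbors span only three of the six possible edges (density 0.5).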
An induced subgraph is similar to a snapshot of a section of the original graph. It consists of specific vertices from the original graph and all the edges that connect these vertices.
Let H be a graph with k vertices and G be a graph with n vertices. We denote the set of induced copies of H in G as $\mathrm{ind}(H, G)$. The induced density of H is calculated as [40]:

$$d(H, G) = \frac{|\mathrm{ind}(H, G)|}{\binom{n}{k}}$$
In this paper, the k-clique density of node v is the proportion obtained by dividing the number of k-cliques that include node v by the total number of potential k-cliques that could include v. The total number of potential k-cliques is the number of possible combinations of k − 1 nodes selected from the set of neighbors of node v. The k-clique density of graph G is defined analogously over the whole graph.
In graph theory, degeneracy ordering involves iteratively removing the vertex with the lowest degree. If multiple vertices share the lowest degree, one is selected according to a predefined rule, such as the vertex ID, or by an arbitrary choice. After a vertex is removed, the degrees of its neighbors are decreased by one. This process is repeated until all vertices are removed, and the vertices are sorted in the order in which they were removed from the graph. The objective of degeneracy ordering is to arrange the vertices so that each vertex has few neighbors appearing later in the ordering. The degeneracy of a graph is defined as the maximum out-degree among all vertices in the degeneracy ordering. In a graph with low degeneracy, all induced subgraphs are composed of vertices with low degrees [41]. In degree ordering, nodes are arranged based on their degrees. The color ordering technique utilizes a greedy coloring algorithm [42,43] to color graphs with m colors. This approach assigns one of the m colors to each node while ensuring that no two neighboring nodes share the same color.
When searching for cliques in an undirected graph, they can be repeatedly discovered by examining each constituent edge. To avoid counting the same clique multiple times, the graph is transformed into a Directed Acyclic Graph (DAG), which is a directed graph with no cycles. To create this DAG, the nodes are first ordered using a specific ordering technique. Each edge (u, v) is then directed based on this ordering: if node u precedes node v in the order, the edge is directed from u to v.
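The two steps above, computing a degeneracy ordering and then orienting each edge from the earlier vertex to the later one, can be sketched as follows. This is a simple lazy-deletion heap implementation, not the paper's own code:

```python
import heapq

def degeneracy_ordering(adj):
    """Repeatedly remove a minimum-degree vertex (ties broken by vertex id)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry; a fresher one exists
        removed.add(v)
        order.append(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return order

def orient(adj, order):
    """Direct each edge from the earlier vertex to the later one, yielding a DAG."""
    pos = {v: i for i, v in enumerate(order)}
    return {v: {u for u in adj[v] if pos[v] < pos[u]} for v in adj}

# Triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
order = degeneracy_ordering(adj)
dag = orient(adj, order)
```

Because the pendant vertex is removed first, the maximum out-degree in the resulting DAG equals the degeneracy of this graph (here, 2).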
An undirected graph consists of symmetric relations. For example, let us consider a social network whose participants collaborate on a software development project. Participants A and B collectively decide on the project’s direction by discussing it with each other. If participant A says it collaborates with B, that means B collaborates with A. It is unreasonable for B to say it does not cooperate with A. If we know the relation between A and B is symmetric, there is only one edge between them, and this edge can be called AB or BA. Furthermore, a clique pattern also comprises symmetric relations, as each node in the clique maintains a direct, mutual relation with every other node in that clique.
2.2. Clique Density Theorems
This section presents the theorems utilized within our algorithm. To access the formal proofs of the referenced theorems in this paper, refer to the provided references, which offer thorough and rigorous demonstrations of their validity.
Theorem 1 (Turán's theorem [34]). For any integer k, a $K_k$-free graph on n vertices has at most $\left(1 - \frac{1}{k-1}\right)\frac{n^2}{2}$ edges. In extremal graph theory, Turán posed a question regarding a positive number n and a graph F: what is the maximum number of edges a graph with n vertices can have without containing F as a subgraph? Turán provided a complete solution for the case where F is a clique.

The expression $1 - \frac{1}{k-1}$ is denoted as the Turán threshold. For clarity, if the edge density exceeds the Turán threshold, the graph contains at least one k-clique ($K_k$).
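A minimal numeric check of this guarantee, using the edge-count form of the Turán bound (the function names here are illustrative):

```python
def turan_threshold(k):
    """The Turán threshold 1 - 1/(k-1)."""
    return 1.0 - 1.0 / (k - 1)

def turan_guarantees_clique(n, m, k):
    """True when m exceeds (1 - 1/(k-1)) * n^2 / 2, forcing a k-clique."""
    return m > turan_threshold(k) * n * n / 2
```

For example, the complete graph K4 (n = 4, m = 6) is forced to contain a 4-clique, while K4 minus one edge (m = 5) is not.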
Theorem 2 (Erdős's theorem [36]). In a graph with n vertices, if the edge density exceeds the threshold relative to k, the graph contains at least a guaranteed minimum number of k-cliques, expressed as a function of n and k.

This theorem provides a lower bound for the number of k-cliques in a graph that meets the Turán threshold.
Theorem 3 (Zykov's theorem [37]). For integers $r \le k$, if a graph contains no k-clique, then its $K_r$-density is at most $\prod_{i=1}^{r-1}\left(1 - \frac{i}{k-1}\right)$ [40]. Zykov generalized Turán's theorem; we refer to this expression as the Zykov threshold. From another point of view, if the $K_r$-density of a graph exceeds the Zykov threshold, the graph contains at least one k-clique.
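A small helper for this threshold; the product form below is a reconstruction that reduces to the Turán threshold for r = 2 and matches the worked example later in the paper (≈ 0.22 for k = 4, r = 3):

```python
from math import prod

def zykov_threshold(k, r):
    """K_r-density threshold prod_{i=1}^{r-1} (1 - i/(k-1)).

    For r = 2 this is exactly the Turán threshold 1 - 1/(k-1).
    """
    return prod(1 - i / (k - 1) for i in range(1, r))
```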
Theorem 4 (Kruskal–Katona theorem [38,44]). If the number of edges of a graph can be written as $m = \binom{x}{2}$ for a real number $x \ge 2$, then the number of k-cliques satisfies $n_k \le \binom{x}{k}$, where $n_k$ denotes the count of k-cliques. This theorem establishes an upper bound on the number of k-cliques within a given graph; we derive the upper bound by applying this formula.
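Numerically, the bound amounts to solving $\binom{x}{2} = m$ for a real x and evaluating the generalized (real-argument) binomial coefficient $\binom{x}{k}$; a sketch with illustrative names:

```python
from math import sqrt, factorial, prod

def kruskal_katona_upper(m, k):
    """Upper bound on the number of k-cliques in a graph with m edges.

    Solves x(x-1)/2 = m for real x, then returns C(x, k) via the
    generalized binomial coefficient.
    """
    x = (1 + sqrt(1 + 8 * m)) / 2
    if x < k:
        return 0.0
    return prod(x - i for i in range(k)) / factorial(k)
```

The bound is tight on complete graphs: K4 has 6 edges and exactly $\binom{4}{3} = 4$ triangles.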
Theorem 5 (Reiher's clique density theorem [39]). Reiher addresses the question of determining the minimum number of k-cliques that must exist in graphs with n vertices and more edges than the Turán bound [39]. Consider a graph G with n vertices and m edges whose edge density exceeds the Turán threshold: how many k-cliques, at a minimum, is this graph guaranteed to contain?

According to this theorem, given the number of vertices (n) and edges (m) of a graph, we first compute the edge density, then rearrange the interval condition of the theorem in terms of the integer parameter s to obtain an admissible interval for s. We then compute the implicitly defined parameter by substituting the density and s into the theorem's equation, and finally obtain the guaranteed minimum number of k-cliques, provided that the edge density is at least the Turán threshold. If the density falls below the Turán threshold, the relevant binomial coefficient becomes zero and, consistent with Turán's theorem, no k-clique is predicted.
2.3. Main Contribution
The main challenge of counting k-cliques arises from the overwhelming number of possibilities, known as a combinatorial explosion. Hence, methods have been developed to sample large, dense subgraphs, mainly to count larger k-cliques. However, these sampling methods require assumptions about the distribution or sufficiency of the sample count.
This study introduces a novel algorithm to approximate the number of k-cliques (for all k) in a given graph. The algorithm functions by examining each node in the graph and identifying the triangles formed by that node. Utilizing this localized information with established theorems from the existing literature, the algorithm can determine lower and upper bounds for the number of k-cliques linked to each node. By consolidating these individual boundaries throughout the entire dataset, we obtain a comprehensive range for the total number of k-cliques in the graph. This methodology provides an efficient computational approach to approximate the number of k-cliques.
The Boundary-driven approximations of k-cliques (BDAC) algorithm draws inspiration from the Turán-shadow algorithm [31], which leverages classic extremal combinatorics principles concerning clique densities. Turán [34] and Erdős [36] established lower bounds on the number of cliques within sufficiently dense graphs. The novelty of our algorithm lies in integrating additional theorems, namely those of Zykov [37], Kruskal–Katona [38,44], and Reiher [39], to provide both lower and upper bounds for dense graphs.
The Turán-shadow [31] method involves constructing a set of dense subgraphs, known as a Turán-shadow, to cover graph G comprehensively and include all cliques. Then, it uses standard techniques to design an unbiased estimator for the clique count. The Turán-shadow algorithm dedicates a significant portion of its overall runtime to constructing the shadow; sampling, in contrast, constitutes a small fraction of the total execution time.
The BDAC algorithm bypasses the requirement for the Turán-shadow and sampling techniques, offering a direct and efficient method for estimating k-clique counts in complex networks. By traversing each node in the graph, we establish lower and upper bounds for potential k-cliques formed by each vertex (a local bound per vertex) and aggregate these bounds to determine the approximation of k-cliques without needing a Turán-shadow (a global bound). This streamlined approach ensures high-speed performance, significantly reducing computational overhead and facilitating efficient approximation of k-clique counts. The algorithm additionally provides insights into the local k-clique counts formed by each vertex, is designed to work for any k value, and has complexity unaffected by the value of k.
We compare the BDAC algorithm and Turán-shadow regarding their execution time and results, mainly focusing on relatively small datasets. When dealing with large datasets, the applicability of Turán-shadow diminishes significantly, so we compared the BDAC algorithm with YACC [33], a modified version of Turán-shadow designed for such scenarios. Unfortunately, we could not access the source code of YACC, which prevents a direct comparison of execution times; we can therefore only compare the algorithms' estimation results on identical datasets.
The outcomes of the Pivoter [32] algorithm provide exact values for assessing the algorithms in terms of relative error. However, Pivoter does not scale to larger datasets (like com-lj [45]). We also compare BDAC with the algorithm proposed by Ye et al. [35] regarding estimation results and execution time. A comprehensive comparison of these state-of-the-art algorithms is provided, focusing on k values ranging from 8 to 50, to analyze the effect of k on various datasets. Consequently, the unavailability of exact values for large datasets prevents a full comparative assessment.
2.4. Related Work
Cliques have attracted significant attention, as they provide valuable insights into networks' intricate relationships and structural patterns. Triangles serve as the basic building block of cliques, which researchers use to compute the clustering coefficient and the transitivity ratio [46]. Larger clique counts are commonly used in various applications, such as community detection [47], clustering [48], epilepsy prediction [49], graph compression [50], and finding correlated genes [51].
The literature documents two categories of k-clique counting algorithms: exact methods and approximate approaches. Chiba and Nishizeki [28] propose the first k-clique algorithm, which lists all k-cliques and provides a novel graph orientation technique.
Table 1 provides an overview of the distinct characteristics of the algorithms. These characteristics include the year of publication, whether the algorithm is exact or approximate, the bounds utilized, the range of k tested on a large dataset, the ordering heuristics employed, the usage of recursion trees, the theorems utilized, and the time and space complexity.
Finocchi et al. [29] parallelize this algorithm by leveraging the MapReduce technique [52], integrating a degree ordering strategy to enhance its computational efficacy. Danisch et al. [30] additionally introduce a parallel version (kclist) of this algorithm, employing a degeneracy ordering technique. These algorithms are exact and rely on the k-clique enumeration technique. However, combinatorial explosion makes these algorithms unsuitable for large k values on large datasets.
Table 1.
Characteristics of clique counting algorithms. In the time and space complexity columns, k indicates clique size, m refers to the number of edges, and n denotes the number of vertices; the remaining symbols denote the degeneracy of the graph, the number of colors used for coloring, and the number of triangles of the input graph.
Algorithms | Year | Exact/Approximate | Used Bounds | Tested K | Ordering Heuristic | Recursion Tree | Theorems | Time Complexity | Space Complexity
---|---|---|---|---|---|---|---|---|---
Arbo [28] | 1985 | exact | x | k <= 3 | degree ordering | x | x | |
Clique counting [29] | 2015 | exact | x | k <= 6 | degree ordering | x | Arbo [28] | |
Turán-shadow [31] | 2017 | approx (sampling) | lower | k <= 10 | degeneracy ordering | √ | Turán [34], Erdős [36] | |
Kclist [30] | 2018 | exact | x | k <= 10 | degeneracy ordering | x | Arbo [28] | |
Pivoter [32] | 2020 | exact | x | k <= 10 | degeneracy ordering | √ | Bron–Kerbosch [53] | |
YACC [33] | 2022 | approx (sampling) | lower | k <= 40 | degeneracy ordering | √ | Turán [34], Erdős [36] | |
DP-color path [35] | 2023 | approx (sampling) | x | k <= 15 | degeneracy & color ordering | x | x | |
BDAC | - | approx (no sampling) | lower & upper | k <= 50 | degeneracy ordering | x | Turán [34], Erdős [36], Zykov [37], Kruskal–Katona [38,44], Reiher [39] | |
Another exact k-clique algorithm, Pivoter, is proposed by Jain and Seshadri [32]. This algorithm constructs a recursion tree called a Succinct Clique Tree (SCT) using the pivoting technique based on the traditional strategy introduced by Bron and Kerbosch [53]. This method is essential for reducing the recursion tree of backtracking algorithms used to identify maximal cliques. The central concept behind the SCT structure is to preserve a distinct representation of all k-cliques, yet its size is significantly smaller than the total number of k-cliques. Based on current knowledge, Pivoter is considered the leading exact k-clique algorithm. It is highly efficient on sparse graphs but may face performance issues in dense graph regions, where numerous cliques with complex overlap relationships cause a large recursion tree to expand during execution.
Due to the combinatorial explosion, there has been a shift towards approximation solutions based on sampling methods. The Turán-shadow algorithm [31], introduced by Jain and Seshadri, stands as the state-of-the-art sampling-based approximate k-clique algorithm. This method involves creating a set of dense subgraphs, a recursion tree known as the Turán-shadow, to cover graph G comprehensively and include all cliques. Afterward, Turán-shadow employs an unbiased estimator to count the cliques using standard techniques. The construction of the shadow consumes a significant portion of the algorithm's time.
YACC [33] extends the capabilities of Turán-shadow to larger k values (up to 40) by reducing the recursion tree size through a systematic relaxation of the stopping condition during tree creation, leading to a more efficient algorithm. The YACC algorithm addresses the challenges of the Turán-shadow algorithm and introduces a strategy to minimize the recursion tree size. However, this size reduction comes at the cost of algorithm accuracy, and achieving an even smaller tree demands a considerable increase in the number of samples.
The algorithms proposed by Ye et al. [35] combine an exact approach with sampling strategies. The algorithm divides the input graph into sparse and dense regions based on average degree, then performs exact counting with Pivoter [32] on the sparse regions. For the dense regions, it proposes three color-based sampling algorithms: k-color set, k-color path, and k-triangle path. It employs a linear-time coloring process for dense regions using greedy coloring algorithms [42,43]. K-color set sampling selects k-color sets, each consisting of k nodes with unique colors. K-color path sampling selects connected k-color sets, called k-color paths, ensuring that the subgraph induced by the k vertices remains connected. K-triangle path sampling selects connected k-color sets in which every three consecutive nodes form a triangle; these sets are referred to as k-triangle paths. Each of these algorithms applies a dynamic programming strategy for uniform sampling. The time complexities of these algorithms are presented in Table 1 in the order expressed here.
Obtaining the count of k-cliques for larger k is algorithmically problematic due to a combinatorial explosion in the search space of large cliques. The Pivoter algorithm struggles to compute results for large k values on specific extensive datasets like com-lj and soc-lj [45]. However, YACC is the first algorithm to count large k-cliques (up to 40) in several graphs.
Several recent algorithms focus on identifying k-clique densest subgraphs, rather than simply counting cliques [54,55]. These algorithms aim to find subgraphs that maximize the density of k-cliques in relation to their size. However, these approaches are beyond the scope of our current research.
The efficient enumeration of k-cliques without relying on extensive recursion trees or intricate sampling strategies remains an open problem. Sampling strategies may still require a substantial number of samples to achieve high accuracy. The efficiency and practicality of handling high k values and large datasets remain concerns; proposed solutions must emphasize both while ensuring accurate approximation, especially on dense datasets.
3. Materials and Methods
This section introduces the fundamental concept underlying our study and enhances the clarity of the proposed method with a numerical example. The basis of this study is the Turán-shadow algorithm [31], which utilizes principles outlined in Turán's theorem [34]. Like Turán-shadow, the proposed algorithm begins by transforming the input graph into a directed acyclic graph (DAG) using degeneracy ordering; this prevents the redundant discovery of cliques. During graph traversal, the algorithm explores only the out-neighborhood of each node. To be self-contained, we start with an overview of the Turán-shadow algorithm. The last row in Table 1 shows the characteristics of our algorithm in comparison to the state-of-the-art k-clique counting algorithms, and we elucidate the distinctive characteristics of the proposed method below.
Turán's theorem [34] establishes that if a graph's edge density exceeds the Turán threshold $1 - \frac{1}{k-1}$, the graph contains at least one k-clique. The Turán-shadow theorem states that identifying all k-cliques containing a vertex v can be achieved by identifying all (k − 1)-cliques among the vertices adjacent to v. Turán-shadow views each node's neighbor list as a subgraph comprising those nodes and the edges connecting them. If the edge density of this subgraph meets the Turán threshold, then this node together with its neighbors contains at least one k-clique.

In the Turán-shadow algorithm, the Turán threshold determines the size of the recursion tree. As increasing values of k require the exploration of larger cliques, deeper levels of recursion are necessary, leading to an expansion of the recursion tree. Therefore, acquiring k-cliques for larger values of k on large datasets becomes infeasible with the Turán-shadow algorithm.
3.1. Boundary-Driven Approximations of K-Cliques (BDAC)
The primary innovation of BDAC lies in extracting edge density information from the neighborhood of a node, which yields the triangle density of that node, because the number of edges connecting the neighbors of a given node equals the number of triangles formed by that node. At this point, we use Zykov's theorem instead of Turán's theorem; Zykov's theorem extends Turán's theorem. Formally, for any positive integers r and k with $r \le k$, if the r-clique density satisfies the Zykov threshold, then the graph contains at least one k-clique.
The pseudocode of BDAC is shown in Algorithm 1. In line 5 of the pseudocode, H represents the subgraph induced by the current node's neighbors together with the edges between them. If the number of vertices of H, which also equals the current node's out-degree, is insufficient to construct the clique, the algorithm continues with the next vertex. Line 12 calculates the edge density of subgraph H, which is also equal to the triangle density of the current node.

Now that we have triangle density information for each node, we can determine the presence of k-cliques. By using triangle density information instead of edge density, we obtain a relatively lower threshold. If a node's triangle density satisfies the Zykov threshold, the next step is to derive the upper bound of k-cliques obtained from that node and its neighborhood using the Kruskal–Katona theorem explained in Theorem 4.

In line 15, a procedure computes the maximum number of k-cliques in subgraph H utilizing Kruskal–Katona's theorem. It requires as input parameters the triangle density of node v (equal to the edge density of subgraph H), the number of nodes in subgraph H, and the size of the clique we search for in H.
If the density satisfies the threshold, we also establish a lower bound on the number of k-cliques, initially employing Reiher's theorem as outlined in Theorem 5.
In line 18, a procedure calculates the minimum clique count in subgraph H utilizing Reiher's theorem. The inputs of this procedure are the parameter s explained in Theorem 5, the number of nodes in subgraph H, and the size of the clique we search for in H. However, when the density falls below the Turán threshold, the diminished value of the binomial coefficient renders Reiher's theorem inadequate for establishing the presence of k-cliques. In this scenario, the procedure returns −1, and we then verify whether the condition of Erdős's theorem holds. If it does, we apply Erdős's theorem (line 22); otherwise, we accept the lower bound as 0.
Algorithm 1 BDAC
1:  procedure approximate_k_clique(Graph G, DAG D, Integer k)
2:      totalMin ← 0
3:      totalMax ← 0
4:      for all v in G do
5:          H ← induced subgraph from the current node's out-neighbors in D
6:          k′ ← k − 1
7:          n ← the no. of vertices of subgraph H
8:          if n < k′ then
9:              continue
10:         end if
11:         m ← the no. of edges of subgraph H
12:         ρ ← 2m/(n(n − 1))                        ▹ edge density of H = triangle density of v
13:         τ ← (1 − 1/(k′ − 1)) · (1 − 2/(k′ − 1))  ▹ Zykov threshold for r = 3
14:         r ← 3
15:         maxCnt ← max_k_cliques(ρ, n, k′)         ▹ Kruskal–Katona upper bound
16:         totalMax ← totalMax + maxCnt
17:         if ρ ≥ τ then
18:             minCnt ← min_k_cliques(s, n, k′)     ▹ Reiher lower bound
19:             if minCnt = −1 then
20:                 τ′ ← 1 − 1/(k′ − 1)              ▹ Turán threshold
21:                 if ρ ≥ τ′ then
22:                     minCnt ← erdos_k_cliques(ρ, n, k′)
23:                 else
24:                     minCnt ← 0
25:                 end if
26:             end if
27:             totalMin ← totalMin + minCnt
28:         end if
29:     end for
30:     Print totalMin
31:     Print totalMax
32: end procedure
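The control flow of Algorithm 1 can be condensed into a short runnable sketch. This is not the paper's implementation: degeneracy ordering is replaced by vertex-id ordering, and the Reiher/Erdős lower-bound procedures are replaced by a trivial "at least one clique" placeholder, so only the overall structure (per-vertex density test, Kruskal–Katona upper bound, aggregation) is reflected:

```python
from itertools import combinations
from math import sqrt, factorial, prod

def zykov_threshold(k, r=3):
    """Reconstructed K_r-density threshold prod_{i=1}^{r-1}(1 - i/(k-1))."""
    return prod(1 - i / (k - 1) for i in range(1, r))

def kk_upper(m, k):
    """Kruskal-Katona upper bound: C(x, k) where C(x, 2) = m, x real."""
    x = (1 + sqrt(1 + 8 * m)) / 2
    return 0.0 if x < k else prod(x - i for i in range(k)) / factorial(k)

def bdac_sketch(adj, k):
    """Aggregate per-vertex lower/upper bounds on the k-clique count."""
    total_min, total_max = 0.0, 0.0
    for v in adj:
        nbrs = [u for u in adj[v] if u > v]    # out-neighbors under id ordering
        n = len(nbrs)
        if n < k - 1:                          # too few neighbors for a k-clique
            continue
        m = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        density = m / (n * (n - 1) / 2)
        total_max += kk_upper(m, k - 1)        # upper bound via Kruskal-Katona
        if density >= zykov_threshold(k - 1):  # lower bound only past threshold
            total_min += 1                     # placeholder for Reiher/Erdos count
    return total_min, total_max

# K5: every pair of the five vertices is connected.
K5 = {v: set(range(5)) - {v} for v in range(5)}
```

On the complete graph K5 with k = 4, the sketch returns bounds that bracket the exact count of C(5, 4) = 5 four-cliques, and the aggregated upper bound is tight.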
An additional differentiation from the Turán-shadow algorithm occurs when the edge density of a node and its neighborhood fails to meet the Turán threshold. In such cases, the algorithm recurses on the set of vertices within that neighborhood, constructing a recursion tree named Turán-shadow within the Turán-shadow algorithm. Significantly, the algorithm dedicates a notable portion of its overall computational time to establishing this recursion tree.
Conversely, if a node's triangle density does not satisfy the Zykov threshold, we omit that particular node from consideration. Consequently, in this scenario, we abstain from constructing any recursion tree.
As a final step, once we have obtained both the lower and upper bounds from each node that meets the Zykov threshold, we combine these bounds to calculate the overall lower and upper bounds. These aggregated bounds enable us to approximate the final count of k-cliques (lines 16 and 27).
3.2. Example
This section presents an illustrative example aimed at elucidating the theorems discussed.
Figure 1 illustrates a sample graph
G and the out-degrees of each node after degeneracy ordering and constructing DAG. In this example, we aim to estimate 5-cliques. Following the ordering, the algorithm traverses each node individually. If the neighbor count satisfies the desired clique count minus one, which is 4 (excluding the current node itself), we check whether the density satisfies the threshold. In this example, only nodes 2 and 5 have sufficient neighbors to form the desired clique.
Figure 2 illustrates the out-neighbors of nodes 2 and 5, along with the edges between these neighbors (the induced graph), represented as a subgraph H. Both nodes yield the same induced subgraph. The density of H is then calculated and checked against the Zykov threshold (threshold = 0.22). As previously mentioned, the edge density of subgraph H provides the triangle density of nodes 2 and 5; here, the r-value is 3. The algorithm then applies Reiher's theorem to establish the lower bound, giving a minimum clique count of 1, and Kruskal–Katona's theorem provides a maximum clique count of 1. Since nodes 2 and 5 produce the same results, totalMin is 2 and totalMax is 2, with the exact value also being 2: the two 5-cliques in G are those formed by nodes 2 and 5 together with their respective out-neighbors.
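As a sketch of the preprocessing this example relies on (degeneracy ordering followed by DAG construction), the snippet below uses a simple quadratic-time peeling loop. The helper names are ours; a bucket-based, linear-time implementation would be preferred for large graphs.

```python
def degeneracy_order(adj):
    """Repeatedly peel a minimum-degree vertex (Matula-Beck peeling)."""
    remaining = {v: set(ns) for v, ns in adj.items()}
    order = []
    while remaining:
        v = min(remaining, key=lambda u: len(remaining[u]))
        order.append(v)
        for u in remaining.pop(v):   # remove v from its neighbors' lists
            remaining[u].discard(v)
    return order

def orient_by_order(adj, order):
    """Build the DAG: keep only edges pointing from earlier to later vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return {v: {u for u in adj[v] if pos[u] > pos[v]} for v in adj}
```

For instance, on a triangle {a, b, c} with a pendant vertex d attached to a, the pendant is peeled first, every edge appears exactly once in the orientation, and the out-degrees are bounded by the degeneracy (2 here).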
5. Discussion
This section discusses the results of our algorithm compared to state-of-the-art algorithms, the performance of BDAC, and its limitations.
Comparison with the other algorithms: Turán-shadow is suitable for relatively small datasets, since its effectiveness diminishes significantly on larger ones. The BDAC algorithm's capacity to handle dense subgraphs provides a distinct advantage in specific contexts, although in certain cases BDAC shows a notably wider gap between the minimum and maximum values than the Turán-shadow algorithm (refer to Table 3). Its capability to handle large, dense subgraphs exceeds that of both the Pivoter and Turán-shadow algorithms. The limited scalability of Pivoter hinders its applicability to larger datasets; as a result, the lack of exact values for larger datasets poses challenges for a comparative evaluation.
In most cases, the DP-color path algorithm outperforms BDAC in execution time, although for larger datasets and larger k it requires a larger sample size. It is also observed that the execution time of the DP-color path algorithm grows with k on large datasets and approaches the BDAC execution time. In contrast, the BDAC algorithm has approximately the same running time on a given dataset for all k values, because it neither builds a recursion tree nor applies sampling strategies.
Table 4 shows that the YACC and DP-color path results on “soc-LJ” for k = 40 and “com-orkut” for k = 20 and 40 are inconsistent. Without knowing the exact values, we cannot determine which algorithm’s sampling results are more accurate or which algorithm requires more samples. Our algorithm provides minimum and maximum k-clique counts in such cases, offering guarantees based on theoretical foundations.
Based on the dataset results, the exact values obtained do not surpass our estimated maximum value on any dataset. However, for some datasets, there is a significant difference between the BDAC’s minimum and maximum k-clique counts. The limitation part below explains this variance.
Complexity analysis: Both the Turán-shadow and YACC algorithms share the same time and space complexity. Each algorithm iterates over all vertices, which takes O(n) time, where n is the number of vertices. During each iteration, the algorithm recursively searches for (k − 1)-cliques within the neighborhoods of vertices. This recursive search takes O(α^(k−2)) time, where α represents the degeneracy of the graph, indicating that each subgraph has at most α neighbors. In each recursive search, the algorithm also calculates the edge density; that is, it checks whether any two neighbors of the current node form an edge, which takes O(α²) time. Therefore, constructing the recursion tree for these operations has a time complexity of O(n·α^(k−2)·α²) = O(n·α^k), and the total complexity is O(n·α^k + m), where the O(m) term accounts for the degeneracy orientation and m denotes the number of edges.
This complexity suggests that the algorithm's performance scales linearly with the number of vertices (n) but exponentially with the size of the structure (k), adjusted by the degeneracy (α). The α^(k−2) term indicates that, for each vertex, the algorithm explores combinations of neighbors, but the degeneracy (α) limits the growth of these combinations, making the algorithm more efficient than a naive approach for dense graphs.
In summary, this complexity indicates an algorithm that is efficient for sparse graphs (where α is low) and for finding relatively small structures (where k is not too large), as the cost grows significantly with larger k values, especially in denser graphs where α is higher. The space complexity is O(k·α) for the recursion tree and the subsets of neighbors stored at each level, plus O(n + m) for storing the original graph.
The algorithms proposed by Ye et al. [35] comprise three samplers: k-color set sampling, k-color path sampling, and k-triangle sampling, where χ is the number of colors of the graph G obtained by the greedy coloring algorithm [42,43], k is the clique size, n is the number of vertices, and m is the number of edges. The k-color set sampling algorithm considers all possible sets of k different colors; in the worst case, there are C(χ, k) such sets, and checking whether each set forms a k-clique dominates the running time. Its space complexity is O(n + m) to store the graph and its colors, plus the space for the dynamic programming (DP) table. The running time of the k-color path algorithm is governed by the possible colorings of paths of length k over the n vertices, plus the traversal of each edge (an additive m term); its space complexity covers the DP table and the DAG. The k-triangle algorithm samples a triangle and extends it to a k-clique, so its time complexity depends on the number of triangles of the input graph and on the clique size, and its space complexity is that of storing the DP table.
The BDAC algorithm also iterates through each vertex in the graph. For every vertex, it examines pairs of its neighbors to determine if they form an edge. This checking operation, for each vertex, has a time complexity of O(α²), where α is the degeneracy of the graph. Consequently, the overall time complexity of BDAC is O(n·α²), where n represents the number of vertices in the graph.
This complexity indicates that the algorithm's running time scales linearly with the number of vertices (n), while the time taken for each vertex scales quadratically with the degeneracy (α), because the algorithm checks pairs of neighbors for each vertex to determine whether they form an edge. The space complexity is O(α) for storing the nodes of the induced subgraph of each vertex, plus O(n + m) for storing the original graph.
BDAC achieves better time and space complexity than Turán-shadow, as it eliminates the recursion tree construction. Comparing k-color set sampling with BDAC, in the worst case χ is close to n, so the number of color sets approaches C(n, k), which grows far faster than BDAC's O(n·α²). Compared with the k-color path algorithm, the number of colorings of paths of length k can grow extremely large, making that approach impractical for large datasets (large n) and larger k; hence it is less efficient than BDAC in that regime. The time complexity of the k-triangle algorithm depends on the number of triangles of the input and on the clique size; compared with BDAC, this algorithm can be more efficient for datasets with few triangles, but it can be comparable to or worse than BDAC for dense datasets with many triangles.
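To illustrate the scale of this gap with made-up but plausible numbers (n, α, χ, and k below are assumptions for illustration, not measurements from the paper's datasets), one can compare BDAC's n·α² term with the C(χ, k) worst case of k-color set sampling:

```python
from math import comb

# Hypothetical sizes, chosen only to illustrate the asymptotic gap.
n, alpha, chi, k = 10**6, 100, 500, 20

bdac_cost = n * alpha**2        # O(n * alpha^2) elementary steps
color_set_cost = comb(chi, k)   # worst-case number of k-color sets

print(f"BDAC ~ {bdac_cost:.1e}, k-color sets ~ {color_set_cost:.1e}")
print(color_set_cost > bdac_cost)   # the color-set term dwarfs BDAC's cost
```

Even with only 500 colors, C(χ, 20) exceeds 10³⁵, dwarfing BDAC's roughly 10¹⁰ steps under these assumed sizes.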
As a result, the BDAC algorithm is generally more efficient than the sampling-based methods for large graphs, because α is typically much smaller than n. Sampling approaches are effective in certain settings, such as when the number of colors is small; however, they become more complex for larger graphs, especially as k increases.
Performance of BDAC: Applying the theorems in the algorithmic process lets us determine the dataset's minimum and maximum k-clique counts both per vertex and globally. This approach ensures rigorous analysis and guarantees that the reported bounds adhere to established mathematical principles. Additionally, understanding the range between the minimum and maximum values offers insights into the diversity and distribution of k-cliques, facilitating informed decision-making in data analysis and interpretation.
The sampling-based algorithms mostly outperform BDAC in execution time. However, their accuracy depends on the sample sizes, and their complexity grows with k. BDAC's complexity does not depend on k; it depends on the degeneracy of the graph. The BDAC algorithm is well suited to large and densely connected datasets such as com-lj, soc-pokec, and soc-LJ. In these datasets, the node densities typically meet the given threshold, allowing us to obtain both the minimum and maximum k-clique counts per node, which leads to a smaller difference between the lowest and highest values overall. This work represents the first attempt to provide lower and upper bounds and results for k = 50.
Limitation of BDAC: For some datasets, there is a significantly larger disparity between the estimated minimum and maximum k-clique counts than for others. This discrepancy highlights a limitation of our algorithm. Specifically, accurately estimating the minimum k-clique count becomes problematic when the edge density of a node's neighborhood is only at or slightly above the threshold, which also affects the final estimation of k-cliques. Additionally, if a node has a high degree but its density is lower than the given threshold, the node yields a sparse induced subgraph. This can lead to a significantly higher potential maximum k-clique count, particularly for large n but lower k, because of the binomial coefficients used in calculating the maximum value.
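The blow-up described here can be seen numerically. Assuming a hypothetical node with 1,000 neighbors whose induced subgraph has edge density 0.25 (illustrative numbers, not results from the paper), a Kruskal–Katona-style maximum (solving e = C(x, 2) and taking C(x, k − 1)) is astronomically large for moderate k, even though such a sparse neighborhood may contain no large clique at all:

```python
from math import ceil, comb, sqrt

d, density, k = 1000, 0.25, 8           # hypothetical high-degree, sparse node
e = int(density * comb(d, 2))           # edges in the induced subgraph
x = ceil((1 + sqrt(1 + 8 * e)) / 2)     # solve e = x * (x - 1) / 2
upper = comb(x, k - 1)                  # Kruskal-Katona-style (k-1)-clique bound

print(f"{e} edges -> upper bound {upper:.2e} on (k - 1)-cliques")
```

Here x lands near 500, so the bound C(501, 7) is on the order of 10¹⁵ for a neighborhood that is only a quarter full.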
This observation emphasizes a crucial aspect of our algorithm’s performance and offers valuable insights into its limitations. In cases where there is a substantial variance in estimates, it might be beneficial to assess metrics like geometric mean or utilize the minimum or maximum value.
6. Conclusions
The BDAC demonstrates significant efficacy in estimating k-cliques across various values of k, focusing mainly on k = 8, 15, 25, 40, and 50. The aim of providing results for different k values is to measure the algorithms' capability from small to large k. By leveraging theorems in the algorithmic process, we can determine the minimum and maximum k-clique counts locally per vertex and globally within the entire dataset, ensuring adherence to established mathematical principles and providing insights into the diversity and distribution of k-cliques. Our work represents the first attempt to provide both lower and upper bounds and results for k = 50, contributing to the advancement of k-clique counting algorithms.
Comparison with state-of-the-art algorithms, including Turán-shadow, Pivoter, YACC, and DP-color path, reveals distinctive characteristics of BDAC. Compared to Turán-shadow, its ability to handle large dense subgraphs offers a unique advantage, addressing a crucial limitation of existing algorithms and underscoring the potential utility of BDAC in specific contexts. Similarly, compared to YACC, our BDAC algorithm demonstrates competitive performance, delivering dependable estimations across datasets by offering lower and upper bounds. The DP-color path algorithm mostly outperforms the BDAC regarding execution time, but it requires a much larger sample size for large datasets and larger k. In such a situation, the execution time of the BDAC and DP-color algorithm becomes competitive, as is shown in the experimental results.
However, the BDAC exhibits limitations, particularly on datasets like “com-orkut”. The significant disparity between the estimated minimum and maximum k-clique counts highlights challenges in accurately estimating minimum k-clique sizes, especially when the edge density of node neighborhoods falls below the Zykov threshold.
In summary, we present a direct method for estimating k-cliques (for values of k up to 50) without relying on sampling techniques or the construction of recursion trees. Using established theorems, BDAC provides upper and lower bounds for k-cliques per vertex and globally, offering a reliable and efficient alternative to traditional methods. This advancement significantly enhances the accuracy and speed of analyzing complex networks and graph structures.