1. Introduction
The influence maximization (IM) problem has attracted many researchers in the last few decades, especially due to its applications in various situations, such as viral marketing [
1,
2,
3], terrorist attack prevention [
4], and network control [
5]. Viral marketing has proved to be more interesting for researchers. Chen et al. [
6] justify the strong motivation for studying information and influence diffusion models with viral marketing. That is, if a company needs to introduce a new product in the market through word-of-mouth effects in a social network, it must select influencers in the network and convince them to adopt the new product. The goal is that the selected influencers will generate a very big cascade effect in the network, driving other people to use the new product. Thus, diffusion is the transmission of information among persons, such as the transmission of viruses.
The literature on IM is very extensive. Recent studies range from using reinforcement learning to solve IM problems [
7] to investigating the equilibrium between information influence and the cocoon effect [
8]. Several surveys on IM have been published; see [
9,
10,
11,
12,
13,
14].
The IM problem is a discrete optimization problem that was enunciated by Kempe et al. [
15]. Given a social network modelled by a directed graph
with influence weights on edges, in the influence maximization problem, we are interested in finding the
k (pre-defined parameter) nodes of the network which maximize the spread of the influence once activated under a propagation model. The influence diffusion process occurs in discrete time steps. Here, for the propagation, we consider the linear threshold model, which is classified as a progressive diffusion model, since activated nodes cannot be deactivated [
11].
The IM problem is NP-hard under the linear threshold model (and under other diffusion models such as the Independent Cascade model); see [
15] for details. Therefore, heuristics play an important role in the solution techniques. As stated in [
11], “existing works focus on approximate solutions, and a keystone of these algorithmic IM studies is the greedy framework”. Among these heuristics, probably the most well-known is the SimpleGreedy algorithm proposed by Kempe et al. [
15]. This algorithm employs a greedy approach by selecting the node with the highest marginal gain over the network and including it in the seed set at each iteration.
In this paper, we propose a new algorithm framework based on greedy approaches for the IM problem. This framework consists of creating a partition of the graph nodes into clusters, solving the IM problem within each cluster to obtain a set of candidate nodes from each cluster, and choosing the seed set from those candidate nodes. To solve the IM problem in each cluster, we use a greedy approach. Following the taxonomy proposed in [
11], the proposed algorithms are classified into simulation-based methods, since Monte-Carlo simulations are employed for evaluating the influence spread of the selected seed set.
First, we introduce the ClusterGreedy algorithm for the IM problem under the LT diffusion model. The ClusterGreedy creates a partition of the given network and implements the SimpleGreedy in the smaller networks resulting from that partition. Then, it solves a subproblem to determine the seed set of the original network from the seed set obtained for each of the smaller networks. With this approach, we solve, using a greedy approach, more problems and, consequently, more simulations (one for each iteration of the SimpleGreedy). However, each problem is solved in a smaller network, which is important in order to reduce the running times of the simulations, since this step is computationally very expensive and depends on the size of the network. By exploring the submodularity property of the diffusion function, we improve the ClusterGreedy by reducing the number of iterations that need to be performed. The resulting algorithm, called Improved ClusterGreedy, solves a few more iterations than the SimpleGreedy; however, all simulations run in smaller networks resulting from the clustering operation. The computational experiments show that this improved algorithm allows for a reduction in the runtime to 25% when we compare it with the algorithm proposed by Kempe et al. [
15], under Watts–Strogatz random graphs and 54% in the three datasets selected. One important aspect in our approach is that by splitting the network into smaller networks and analysing each network separately, we lose the interactions between nodes in different clusters. This may be observed as a disadvantage. However, as reported in the computational section, we observe that there may be an advantage in that operation, since analysing each network separately to find the candidate solutions for the seed set in each cluster and choosing the seed set later by combining the different sets of candidates helps in avoiding the myopic behaviour of the greedy approach.
This paper is structured as follows: In
Section 2, we review the literature most related to our work. The IM problem, the diffusion model, and the algorithm proposed by Kempe et al. [
15] are described in
Section 3. In
Section 4, the ClusterGreedy and the Improved ClusterGreedy algorithms are introduced, and the main components, namely the Markov cluster algorithm and the subproblem to determine the final seed set from the seed of each smaller network, are explained. Experimental results are presented in
Section 5. Finally,
Section 6 summarizes the main conclusions of this paper.
2. Related Literature
The literature for IM is quite vast. In this section, we review the literature that is most related to our work. In particular, we focus on works related to greedy approaches.
Kempe et al. [
15] introduced the SimpleGreedy algorithm to solve the problem of maximizing the influence. This algorithm uses the concept of marginal gain, which is the influence spread increase obtained by adding a node to the seed set. The SimpleGreedy algorithm employs a greedy approach by selecting the node with the highest marginal gain over the network and including it in the seed set at each iteration. This process is repeated
k times [
16]. Given the spread function
, the marginal gain of a given node
u can be computed by
, which Kempe et al. [
15] estimated with Monte Carlo simulations. This algorithm is known to have a poor running time performance. That inefficiency has motivated several alternative approaches and motivated our study to reduce the computation time.
The CELF (Cost-Effective Lazy Forward) algorithm was introduced by Leskovec et al. [
17]. In the first round, the algorithm calculates an estimate of the influence spread of each node using Monte Carlo simulations. However, since the marginal gain in each iteration is not superior to its marginal gain of previous iterations, they reduced the number of Monte Carlo simulations [
18]. CELF was presented as being 700 times faster than the SimpleGreedy. Goyal et al. [
19] introduced CELF++ and described it as being 35–55% faster than CELF.
The NewGreedy algorithm was introduced by Chen et al. [
20]. This algorithm improved the one proposed by Kempe et al. [
15]. NewGreedy reuses the results of the Monte Carlo simulations from previous iterations to calculate the influence spread for each node in the same iteration [
18].
Cheng et al. [
21] introduced the StaticGreedy algorithm. They proposed a solution to the warranty submodularity and monotonicity in an effective way. Instead of generating a significant quantity of Monte Carlo simulations in each iteration, a not-particularly large number of Monte Carlo snapshots are generated at the start and then reused in all subsequent iterations. This approach guarantees that the estimated influence spreads of any seed set remain consistent over successive iterations. It also ensures that the features of submodularity and monotonicity are upheld.
The SIMPATH algorithm was introduced by Goyal et al. [
22]. The algorithm uses simpath-spread and is based on the assumption that the influence propagated from a node can be calculated as the sum of the weights of all arcs from that node. When the precedent idea is combined with the CELF algorithm, the resulting algorithm can be applied to solve the IM problem.
Cluster algorithms for graphs are also known as community detection algorithms for networks. Community-based algorithms detect communities on social networks and solve the IM problem through the structure of the given network data; see [
23,
24,
25,
26]. Rahimkani et al. [
23] consider an approach that first identifies the communities on social networks and then creates a new network based on those communities. In the new network, each node represents a community. Then, the most central nodes of the new network are selected based on the measures of betweenness and centrality. Each node in the new network includes nodes in the corresponding community. Therefore, a number of nodes, proportional to the size of the community, was selected based on the degree centrality to form the candidate set. The
k most influential nodes are then selected from the set based on the betweenness centrality measure. Community-based algorithms should not be confused with our clustering approach. Although such algorithms use clusters (corresponding to communities), their approach is quite different from ours. We split the graph into clusters and consider them separately. In the Community-based algorithms, the clusters are simplified by choosing a subset of nodes that represents the cluster.
Ghayour-Baghbani et al. [
27] propose an efficient method using linear programming (LP) for the IM problem in the linear threshold propagation model. The method works in three steps: first, the sparse matrix multiplication is used to estimate the influence graph from the social network graph; second, the influence graph is used to model the IM problem in the LT model as a linear program. Finally, the candidate is created. The linear program’s output is seeded using a randomized rounding approach.
Some methods solve the IM problem by LP in the LT model, but either their runtime is higher than the greedy algorithm or they tend to produce worse solutions (the spread of the produced seed set is less than the greedy algorithm [
28,
29,
30,
31]).
3. Preliminary
In this section, we formulate the IM problem and provide an overview of the linear threshold model and the SimpleGreedy algorithm.
Table 1 provides the notation used in this paper.
Given a graph
, a stochastic diffusion model on
G, and a budget
k, the IM problem is the stochastic optimization problem of finding a set
with
, such that the influence spread of
, under the given stochastic diffusion model is maximized. That is, determine
, such that
Regarding the problem formulation above, it should be noted that finding a group of k nodes with the greatest combined influence is not the same as finding the top k nodes with the greatest individual influences. It makes sense to identify two top influencers who both have a significant impact on the same group of individuals as top influencers, but only one of them should be chosen as a seed to maximize the influence.
One of the two primary diffusion models, the independent cascade (IC) model, can be used to explain simple contagions where activations might result from a single source, like the spread of viruses or information. However, in several circumstances, people need to be exposed to various independent sources for their behavior to change. The linear threshold (LT) model was introduced by Kemp et al. [
15] as a stochastic diffusion model to represent this kind of behavior.
In this work, the LT diffusion model is considered. Each arc
in the LT model has an effect weight
, which indicates how likely
u influences
v. The weights are normalized such that
, for all
[
6]. Formally, we have the following definition [
6]:
Definition 1 (Linear Threshold Model).
Given a graph , with influence weights assigned to the arcs, and a seed set , the LT model creates the active sets , for all repeatedly, using the following randomized operation procedure: first, a threshold value is created for each using a uniform distribution ranging from 0 to 1. For each time step , the set is initialized to the same value as . For each node v in the set that is not active, if the combined weight of the arcs from its active neighbors is equal to or greater than , denoted asthen node v is included in (i.e., node v becomes active at time t). An important property in IM is submodularity,
Definition 2 (Submodularity [
6]).
Let V be a finite set, and denote by the power set of V. A function is called submodular if, for each and any , we have In the context of IM, this property says that the marginal gain obtained by adding a node to the seed set decreases with the increase in the seed set.
Kempe et al. [
15] proved the following theorem:
Theorem 1. The influence spread function in the linear threshold model is monotone and submodular.
Based on the monotony, non-negativity and submodularity properties of the spread function
, Kempe et al. [
15] devised an algorithm to address the influence maximization problem—the SimpleGreedy algorithm (Algorithm 1). The algorithm iteratively selects the node with the highest marginal gain and includes it in the seed set. The algorithm stops when the seed set has
k elements. Under the LT model (and IC model), computing exact marginal gains, i.e., the exact expected spread, is #P−hard [
6].
Since the SimpleGreedy algorithm needs to compute the exact marginal gains, or the exact expected spread (Line 3 of Algorithm 1), Kempe et al. proposed to estimate them by Monte Carlo simulations, with
r runs. With this change, we obtain Algorithm 2.
Algorithm 1 Simple Greedy |
Input: : cardinality of the returning set; : monotone and submodular set function |
Output: seed set S |
1: initialize |
2: for to k do |
3: |
4: |
5: end for |
6: return S |
Algorithm 2 SimpleGreedy |
Input: : social graph; arc weights ; k: cardinality of the returning set |
Output: seed set S |
1: function mc-estimate() |
2: counter |
3: for to r do |
4: Perform a simulation of the diffusion process on the graph G using the seed set S. |
5: The number of active nodes once the diffusion concludes. |
6: |
7: end for |
8: return |
9: end function |
10: initialize |
11: for to k do |
12: mc-estimate
|
13: |
14: end for |
15: return S |
Notice that both Algorithms 1 and 2 generate a sequence of sets and the corresponding estimative of spread. This observation will be used later. Algorithm 2 has time complexity , where n is the number of vertices and e is the number of edges.
Author Contributions
Conceptualization, J.M.S. and A.A.; methodology, J.M.S. and A.A.; software, J.M.S.; validation, J.M.S. and A.A.; formal analysis, J.M.S. and A.A.; investigation, J.M.S. and A.A.; data curation, J.M.S.; writing—original draft preparation, J.M.S.; writing—review and editing, J.M.S. and A.A.; visualization, J.M.S.; supervision, A.A. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement
The Julia code used to obtain reported results can be shared by sending an email to
[email protected].
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
CELF | Cost-effective lazy forward |
IM | Influence maximization |
LT | Linear threshold |
LSP | Linking set problem |
LP | Linear programming |
MCL | Markov clustering |
References
- Domingos, P.; Richardson, M. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; pp. 57–66. [Google Scholar]
- Richardson, M.; Domingos, P. Mining knowledge-sharing sites for viral marketing. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 23–26 July 2002; pp. 61–70. [Google Scholar]
- Rui, X.; Meng, F.; Wang, Z.; Yuan, G. A reversed node ranking approach for influence maximization in social networks. Appl. Intell. 2019, 49, 2684–2698. [Google Scholar] [CrossRef]
- Huang, H.; Shen, H.; Meng, Z. Community-based influence maximization in attributed networks. Appl. Intell. 2020, 50, 354–364. [Google Scholar] [CrossRef]
- Leskovec, J.; Adamic, L.A.; Huberman, B.A. The dynamics of viral marketing. ACM Trans. Web TWEB 2007, 1, 5. [Google Scholar] [CrossRef]
- Chen, W.; Castillo, C.; Lakshmanan, L.V. Information and Influence Propagation in Social Networks; Morgan & Claypool Publishers: San Rafael, CA, USA, 2013. [Google Scholar]
- Haleh, S.; Dizaji, S.; Patil, K.; Avrachenkov, K. Influence Maximization in Dynamic Networks Using Reinforcement Learning. SN Comput. Sci. 2024, 5, 169. [Google Scholar]
- Ni, P.; Guidi, B.; Michienzi, A.; Zhu, J. Equilibrium of individual concern-critical influence maximization in virtual and real blending network. Inf. Sci. 2023, 648, 119646. [Google Scholar] [CrossRef]
- Arora, A.; Galhotra, S.; Ranu, S. Debunking the myths of influence maximization: An in-depth benchmarking study. In Proceedings of the 2017 ACM International Conference on Management of Data, New York, NY, USA, 14–19 May 2002; pp. 651–666. [Google Scholar]
- Guille, A.; Hacid, H.; Favre, C.; Zighed, D.A. Information diffusion in online social networks: A survey. In SIGMOD Rec. 2013, 42, 17–28. [Google Scholar] [CrossRef]
- Li, Y.; Fan, J.; Wang, Y.; Tan, K.-L. Influence Maximization on Social Graphs: A Survey. IEEE Trans. Knowl. Data Eng. 2018, 30, 1852–1872. [Google Scholar] [CrossRef]
- Sun, J.; Tang, J. A survey of models and algorithms for social influence analysis. In Social Network Data Analytics; Springer: New York, NY, USA, 2011; pp. 177–214. [Google Scholar]
- Tejaswi, V.; Bindu, P.V.; Thilagam, P.S. Diffusion models and approaches for influence maximization in social networks. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 1345–1351. [Google Scholar]
- Zheng, Y. A Survey: Models, Techniques, and Applications of Influence Maximization Problem; Southern University of Science and Technology: Shenzhen, China, 2018. [Google Scholar]
- Kempe, D.; Kleinberg, J.; Tardos, E. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 24–27 August 2003; pp. 137–146. [Google Scholar]
- Ko, Y.-Y.; Cho, K.-J.; Kim, S.-W. Efficient and effective influence maximization in social networks: A hybrid-approach. Inf. Sci. 2018, 465, 144–161. [Google Scholar] [CrossRef]
- Leskovec, J.; Krause, A.; Guestrin, C.; Faloutsos, C.; VanBriesen, J.; Glance, N. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, 12–15 August 2007; pp. 420–429. [Google Scholar]
- Taherinia, M.; Esmaeili, M.; Minaei Bidgoli, B. Optimizing CELF Agorithm for Influence Maximization Problem in Social Networks. J. Data Min. 2022, 10, 25–41. [Google Scholar]
- Goyal, A.; Lu, W.; Lakshmanan, L.V.S. Celf++ optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th International Conference Companion on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 47–48. [Google Scholar]
- Chen, W.; Wang, Y.; Yang, S. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 199–208. [Google Scholar]
- Cheng, S.; Shen, H.; Huang, J.; Zhang, G.; Cheng, X. Staticgreedy: Solving the scalability-accuracy dilemma in influence maximization. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013. [Google Scholar]
- Goyal, A.; Lu, W.; Lakshmanan, L.V.S. Simpath: An efficient algorithm for influence maximization under the linear threshold model. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, Vancouver, BC, Canada, 11–14 December 2011; IEEE: New York, NY, USA, 2011; pp. 211–220. [Google Scholar]
- Rahimkhani, K.; Aleahmad, A.; Rahgozar, M.; Moeini, A. A fast algorithm for finding most influential people based on the linear threshold model. Expert Syst. Appl. 2015, 42, 1353–1361. [Google Scholar] [CrossRef]
- Shang, J.; Zhou, S.; Li, X.; Liu, L.; Wu, H. COFIM: A community-based framework for influence maximization on large-scale networks. Knowl. Based Syst. 2017, 117, 88–100. [Google Scholar] [CrossRef]
- Bozorgi, A.; Haghighi, H.; Zahedi, M.S.; Rezvani, M. Incim: A community-based algorithm for influence maximization problem under the linear threshold model. Inf. Process. Manag. 2016, 52, 1188–1199. [Google Scholar] [CrossRef]
- Girvan, M.; Newman, M.E.J. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 2002, 99, 7821–7826. [Google Scholar] [CrossRef]
- Ghayour-Baghbani, F.; Asadpour, M.; Faili, H. Mlpr: Efficient influence maximization in linear threshold propagation model using linear programming. Soc. Netw. Anal. Min. 2021, 11, 3. [Google Scholar] [CrossRef]
- Baghbani, F.G.; Asadpour, M.; Faili, H. Integer linear programming for influence maximization. Iran. J. Sci. Technol. Trans. Electr. Eng. 2019, 43, 627–634. [Google Scholar] [CrossRef]
- Keskin, M.E.; Güler, M.G. Influence maximization in social networks: An integer programming approach. Turk. J. Electr. Eng. Comput. 2018, 26, 3383–3396. [Google Scholar] [CrossRef]
- Wu, H.-H.; Küçükyavuz, S. A Two-stage Stochastic Programming Approach for Influence Maximization in Social Networks. Comput. Optim. Appl. 2018, 69, 563–595. [Google Scholar] [CrossRef]
- Ackerman, E.; Ben-Zwi, O.; Wolfovitz, G. Combinatorial model and bounds for target set selection. Theor. Comput. Sci. 2010, 411, 4017–4022. [Google Scholar] [CrossRef]
- Schaeffer, S.E. Graph clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
- Vlasblom, J.; Wodak, S.J. Markov Clustering versus Affinity Propagation for the Partitioning of Protein Interaction Graphs. BMC Bioinform. 2009, 10, 99. [Google Scholar] [CrossRef]
- Van Dongen, S. Graph Clustering Via a Discrete Uncoupling Process. SIAM J. Matrix Anal. Appl. 2008, 30, 121–141. [Google Scholar] [CrossRef]
- Van Dongen, S. Graph Clustering by Flow Simulation. Ph.D. Thesis, Utrecht University, Utrecht, The Netherlands, 2000. [Google Scholar]
- Agra, A.; Requejo, C. The linking set problem: A polynomial special case of the multiple-choice knapsack problem. J. Math. Sci. 2009, 161, 919–929. [Google Scholar] [CrossRef]
- Fairbanks, J.; Besançon, M.; Simon, S.; Hoffiman, J.; Eubank, N.; Karpinski, S. Juliagraphs/Graphs.jl: An Optimized Graph Package for the Julia Programming Language. 2021. Available online: https://github.com/JuliaGraphs/Graphs.Jl (accessed on 6 February 2024).
- Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 6 February 2024).
- Moody, J. Peer Influence Groups: Identifying Dense Clusters in Large Networks. Soc. Netw. 2001, 23, 261–283. [Google Scholar] [CrossRef]
- Agra, A.; Cerdeira, J.O.; Requejo, C. A decomposition approach for the p-median problem on disconnected graphs. Comput. Oper. Res. 2017, 86, 79–85. [Google Scholar] [CrossRef]
| Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).