# Local structure can identify and quantify influential global spreaders in large scale social networks

^{a}School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China;^{b}School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031, China;^{c}Key Laboratory for Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China;^{d}Computing Science, Institute of High Performance Computing, Agency for Science, Technology, and Research, Singapore 138632;^{e}Department of Physics, National University of Singapore, Singapore 117551;^{f}Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215;^{g}Department of Physics, Bar-Ilan University, Ramat-Gan 52900, Israel

See allHide authors and affiliations

Contributed by H. Eugene Stanley, December 31, 2017 (sent for review August 31, 2017; reviewed by Marc Barthelemy and Zoltán Toroczkai)

## Significance

Identification and quantification of influential spreaders in social networks are challenging due to the gigantic network sizes and limited availability of the entire structure. Here we show that such difficulty can be overcome by reducing the problem scale to a local one, which is essentially independent of the entire network. This is because in viral spreading the characteristic spreading size does not depend on network structure outside the local environment of the seed spreaders. Our approach may open the door to solve various big data problems such as false information surveillance and control, viral marketing, epidemic control, and network protection.

## Abstract

Measuring and optimizing the influence of nodes in big-data online social networks are important for many practical applications, such as the viral marketing and the adoption of new products. As the viral spreading on a social network is a global process, it is commonly believed that measuring the influence of nodes inevitably requires the knowledge of the entire network. Using percolation theory, we show that the spreading process displays a nucleation behavior: Once a piece of information spreads from the seeds to more than a small characteristic number of nodes, it reaches a point of no return and will quickly reach the percolation cluster, regardless of the entire network structure; otherwise the spreading will be contained locally. Thus, we find that, without the knowledge of the entire network, any node’s global influence can be accurately measured using this characteristic number, which is independent of the network size. This motivates an efficient algorithm with constant time complexity on the long-standing problem of best seed spreaders selection, with performance remarkably close to the true optimum.

Modern online social platforms are replacing traditional media (1) for the spreading of information and communication of opinions (2⇓⇓⇓–6). A common feature of today’s online social networks (OSNs) is their gigantic sizes—for example, as of the second quarter of 2016, there are about 1.5 billion monthly active users on Facebook. Notably, multiplicative explosions of some information may take place at a global scale in such gigantic OSNs, which is the foundation of viral marketing strategies (7). Because of this, quantification of viral spreading is traditionally believed to need global network information. Indeed, most measures, such as k-shell (2), degree discount (8), cost-effective lazy forward (9), betweenness (10), closeness (11), and Katz index (12), evaluate the influence of nodes based on the knowledge of global network structures. In general, these methods become impractical for giant OSNs, because either the full network structural data are unavailable or the computational time is nonscalable. On the other hand, based on massive social experiments, Christakis and Fowler (13, 14) proposed the so-called three degrees of influence (TDI) theory, which states that any individual’s social influence ceases beyond three degrees (friends’ friends’ friends) and therefore suggests the existence of an unknown yet local effect. A recent study also shows that a local approximation works fairly well for a qualitative global measure of collective influence (4). The above situation reveals an apparent paradox, which inspires us to ask a fundamental question: Could local network structure accurately determine the size of global spreading?

## Results

Here we recover a local characteristic infection size *A* and *C*). The global and local phases are unambiguously separated. Using the mapping between the SIR family model and bond percolation (18, 26), we provide a concrete physical understanding of these two well-separated phases. We show that the local phase can be used to accurately quantify the node’s (or nodes’) spreading power (Fig. 2*A*). In particular, the statistical properties of infected cluster size distribution allow us to use solely local network structural information for selecting the best seed spreaders in significantly short constant time complexity.

## Methods

Our study is carried out for an SIR spreading mechanism on connected networks. The central quantity of interest in the spreading model is the final number of activated nodes or the spreading influence (17). A common definition of the spreading influence of node i is the expected number of active nodes that originated from i,*i*) It consists of two peaks, which correspond to local and viral spreadings. The local peak is located at small s, while the viral peak is centered at significantly larger s (Fig. 1 *A* and *C*). Furthermore, the viral peak is a δ-like function, whose location is independent of node i and different stochastic realizations (*SI Appendix*, section II). (*ii*) The two peaks are separated by a wide gap, which implies that one may introduce a small filtering size,

The statistical properties underlying these two features can be explored and better understood using the framework of percolation theory. This can be done through mapping the SIR process to bond percolation (18, 19), where every link (bond) has a probability *SI Appendix*, section XII for the more general case where β is link dependent). The final network forms many connected clusters of different sizes. It has been proved that the probability distribution function **1** is exactly equivalent to the cluster size distribution function *B*), where *C*). The size of **1** as*SI Appendix*, section III).

In artificial random networks with structure purely determined by the degree distributions, we can give the analytical solution for this influence quantity **1**, with*SI Appendix*, sections IV and V.

For real networks whose structures are much more complex than random networks, an exact solution to the spreading influence is not possible. But the critical phenomenon and the statistical properties of the two phases remain the same (Fig. 1*C*). We can leverage on these properties, in particular the wide gap between these two phases to distinguish between viral and local spreadings, and construct methods to estimate the spreading influence of nodes in the network. In SIR processes, once the number of activated nodes reaches a threshold parameter m, the simulation could be terminated since this process is known to become most likely viral. We thus obtain a second approximated form for the node spreading power—the truncated spreading power,*A* for a comparison between the real *B* shows that the relative error *SI Appendix*, sections IV and V for a theoretical calculation of the error in random networks). Similar to the giant component size, the characteristic component size *SI Appendix*, section VI).

The average of the cluster size distribution, *D*),*SI Appendix*, section IV). An expansion of this expression around the percolation transition **5** (Fig. 1*D* and *SI Appendix*, section VII). Fig. 1*D*, *Inset* shows that

To reveal the topological meaning of the characteristic size *C*. This result shows that if an SIR spreading is local, then it would vanish within three to four steps; otherwise, it will spread to about *D*), whose scaling is discussed in *SI Appendix*, section IV. This behavior is analogous to a critical phenomenon of a continuous phase transition: At the critical point, the correlation length diverges, but as long as it moves beyond the critical point, a characteristic scale appears.

The above analysis explains the following paradox: While it is shown that the information spreading is in general a global process due to the viral spreading in the supercritical phase, the influence of any node basically depends only on its local network environment. While the computation time for **1** grows linearly with N, it is reduced to an N-independent constant **4**. An important extension of this finding is that the method can be combined with many search algorithms for detecting the best spreaders and reduce their time complexity by one order of N.

Next, we aim to find the best M spreaders

Now we demonstrate one example of how to improve other algorithms and introduce quantification capabilities through the combination of our approach with the natural greedy algorithm (NGA) (8, 17). We call this algorithm the percolation-based greedy algorithm (PBGA): We (*i*) first find the best spreader *ii*) then fix *iii*) repeat this process M times until M spreaders *SI Appendix*, section IX for more details).

As expected, the simulation results show that the computational time in terms of execution count of PBGA is independent of network size N (Fig. 3). This reduction becomes significant for a worldwide online social network with billions of nodes. In *SI Appendix*, Table S1, we compare and summarize the theoretical time complexities of our PBGA, the NGA, and other widely used algorithms, including brute-force search, genetic algorithm, maximum degree, maximum k-shell (2), degree discount heuristic (8), maximum betweenness (10), maximum closeness (11), maximum Katz index (12), eigenvector method, and maximal collective influence (MCI) (4). Although maximum degree, degree discount, and MCI have N-independent theoretical computational complexities, the maximum degree and degree discount performances are much less than that of the PBGA and MCI is much slower than PBGA. This is because MCI needs the information of the nodes up to a distance ℓ of the seed nodes. In real networks which are small world, a small ℓ would lead to thousands or more nodes. On the other hand, PBGA’s complexity depends on *SI Appendix*, Fig. S16 presents a graphical illustration of this difference.

We quantify the algorithm performance by comparing the collective spreading power *A* and *B*). Our results show that for the entire range of studied M, the three algorithms, PBGA, NGA, and genetic algorithm (GA), have the best performances. Remarkably, the three algorithms give solutions indistinguishable from the true optimum obtained by the brute-force algorithm, when M is small (Fig. 4 *A* and *B*, *Insets*). In particular, comparing the performance of the PBGA and MCI in Fig. 5, we see that the PBGA significantly outperforms MCI when the number M of seed nodes is small. This can be understood since the original CI method (4) deals with best nodes for breaking down the network, which are not necessarily the best spreaders. This is likely the reason behind the relatively lower performance of MCI (more detailed discussion in *SI Appendix*, section IX, B9). When M becomes large, the performance difference diminishes, similar to the performance of any other algorithms as seen in Fig. 4*A*. In fact, we conjecture that for any M, the solution of the PBGA should be nearly optimal.

Another important aspect of this maximization problem is to have a sense of how good the solution is compared with the true optimal solution *C*): (*i*) a combined bound *ii*) an approximated bound *SI Appendix*, section X). The rigorous boundaries,

In practical situations, the information spreading rate β is usually unknown. However, our PBGA method could find close to optimal solutions without knowing the exact β value, as long as the information spreading is viral, i.e., a supercritical region with *SI Appendix*, section XI for the nonviral subcritical regions *D*, the solutions found at an arbitrary spreading rate *SI Appendix*, section XII.

## Summary

In this work, we show from first principles that any node’s influence can be quantified purely from its local network environment, based on the nature of the spreading dynamics. Our approach is distinct from other local attempts, which usually use some distance truncation strategies to approximate a relative global measure without the ability to quantify the actual influence. Although our framework is demonstrated on the basic SIR model, its applicability can be extended to several other spreading models if the following two properties hold: (*i*) For a collection of seed spreaders, the final steady states have two different outcomes of being either a localized outbreak with a small and finite number of infections or a global epidemic with the infectious/recovered population being proportional to the network size. (*ii*) When it is in the global outbreak, the size of the influence does not correlate with that of the initial spreader. See *SI Appendix*, section XII for discussions on a more general family of SIR models as well as for more complex models that include stiflers (34) and the susceptible–infected–susceptible model (22).

## Acknowledgments

Y.H., S.J., and L.F. are supported by The National Nature Science Foundation of China Grants 61773412, U1711265, 71731002; Guangzhou Science and Technology Project Grant 201804010473; and Three Big Constructions Supercomputing Application Cultivation Projects sponsored by National Supercomputer Center in Guangzhou. Y.J. is supported by Chinese Academy of Sciences Hundred-Talent Program. S.H. acknowledges the Israel Science Foundation, the Israel Ministry of Science and Technology (MOST) with the Italy Ministry of Foreign Affairs, MOST with the Japan Science and Technology Agency, the Bar-Ilan University Center for Research in Applied Cryptography and Cyber Security, and Defense Threat Reduction Agency Grants HDTRA-1-10-1-0014 for financial support. The Boston University Center for Polymer Studies is supported by National Science Foundation Grants PHY-1505000, CMMI-1125290, and CHE-1213217, and by Defense Threat Reduction Agency Grant HDTRA1-14-1-0017.

## Footnotes

↵

^{1}Y.H., S.J., Y.J., L.F., H.E.S., and S.H. contributed equally to this work.- ↵
^{2}To whom correspondence may be addressed. Email: huyanq{at}mail.sysu.edu.cn or hes{at}bu.edu.

Author contributions: Y.H., L.F., and S.H. designed research; Y.H., S.J., and Y.J. performed research; Y.H., S.J., H.E.S., and S.H. analyzed data; and Y.H., L.F., H.E.S., and S.H. wrote the paper.

Reviewers: M.B., Centre Commissariat à l’Energie Atomique; and Z.T., University of Notre Dame.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1710547115/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- Rust RT,
- Oliver RW

- ↵
- ↵
- Wang P, et al.

- ↵
- ↵
- Aral S,
- Walker D

- ↵
- ↵
- Ferguson R

- ↵
- Chen W,
- Wang Y,
- Yang S

- ↵
- Leskovec J, et al.

- ↵
- Newman ME

- ↵
- Freeman LC

- ↵
- ↵
- ↵
- ↵
- Barrat A,
- Barthelemy M,
- Vespignani A

- ↵
- Colizza V,
- Barrat A,
- Barthélemy M,
- Vespignani A

- ↵
- Kempe D,
- Kleinberg J,
- Tardos É

- ↵
- Newman MEJ

- ↵
- ↵
- Meloni S,
- Arenas A,
- Moreno Y

- ↵
- ↵
- ↵
- ↵
- Iribarren JL,
- Moro E

- ↵
- Grabowski A,
- Kruszewska N,
- Kosiński RA

- ↵
- ↵
- Newman MEJ,
- Strogatz SH,
- Watts DJ

- ↵
- Yuan X,
- Hu Y,
- Stanley HE,
- Havlin S

- ↵
- Borge-Holthoefer J,
- Rivero A,
- Yamir Moreno Y

- ↵
- ↵
- Bunde A,
- Havlin S

- ↵
- Cohen R,
- Shlomo Havlin S

- ↵
- Bakshy E,
- Hofman JM,
- Mason WA,
- Watts DJ

- ↵
- Borge-Holthoefer J,
- Moreno Y

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Applied Physical Sciences

- Social Sciences
- Social Sciences