## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Scalable detection of statistically significant communities and hierarchies, using message passing for modularity

Edited by Giorgio Parisi, University of Rome, Rome, Italy, and approved October 31, 2014 (received for review May 28, 2014)

## Significance

Most work on community detection does not address the issue of statistical significance, and many algorithms are prone to overfitting. We address this using tools from statistical physics. Rather than trying to find the partition of a network that maximizes the modularity, our approach seeks the consensus of many high-modularity partitions. We do this with a scalable message-passing algorithm, derived by treating the modularity as a Hamiltonian and applying the cavity method. We show analytically that our algorithm succeeds all the way down to the detectability transition in the stochastic block model; it also performs well on real-world networks. It also provides a principled method for determining the number of groups or hierarchies of communities and subcommunities.

## Abstract

Modularity is a popular measure of community structure. However, maximizing the modularity can lead to many competing partitions, with almost the same modularity, that are poorly correlated with each other. It can also produce illusory ‘‘communities’’ in random graphs where none exist. We address this problem by using the modularity as a Hamiltonian at finite temperature and using an efficient belief propagation algorithm to obtain the consensus of many partitions with high modularity, rather than looking for a single partition that maximizes it. We show analytically and numerically that the proposed algorithm works all of the way down to the detectability transition in networks generated by the stochastic block model. It also performs well on real-world networks, revealing large communities in some networks where previous work has claimed no communities exist. Finally we show that by applying our algorithm recursively, subdividing communities until no statistically significant subcommunities can be found, we can detect hierarchical structure in real-world networks more efficiently than previous methods.

Community detection, or node clustering, is a key problem in network science, computer science, sociology, and biology. It aims to partition the nodes in a network into groups such that there are many edges connecting nodes within the same group and comparatively few edges connecting nodes in different groups.

Many methods have been proposed for this problem. These include spectral clustering, where we classify nodes according to the eigenvectors of a linear operator such as the adjacency matrix, the random walk matrix, the graph Laplacian, or other linear operators (1⇓–3); statistical inference, where we fit the network with a generative model such as the stochastic block model (4⇓⇓–7); and a wide variety of other methods, e.g., refs. 8⇓–10. See ref. 11 for a review.

We focus here on a popular measure of the quality of a partition, the modularity (e.g., refs. 8 and 12⇓–14). We think of a partition *q* groups as a function *i* belongs. The modularity of a partition *n* nodes and *m* edges is defined as*i*, and *δ* is the Kronecker delta function. The modularity is proportional to the number of edges connecting nodes in the same community minus the expected number of such edges if the graph were random conditioned on its degree distribution, that is, the expectation in a null model where *i* and *j* are connected with probability proportional to

However, maximizing over all possible partitions often gives a large modularity even in random graphs with no community structure (15⇓⇓–18). Thus, maximizing the modularity can lead to overfitting, where the “optimal” partition simply reflects random noise. Even in real-world networks, the modularity often exhibits a large amount of degeneracy, with multiple local optima that are poorly correlated with each other and are not robust to small perturbations (19).

Thus, we need to add some notion of statistical significance to our algorithms. One approach is hypothesis testing, comparing various measures of community structure to the distribution we would see in a null model such as Erdős–Rényi (ER) graphs (20⇓–22). However, even when communities really exist, the modularity of the true partition is often no higher than that of random graphs. In Fig. 1, we show partitions of two networks with the same size and degree distribution: an ER graph (*Left*) and a graph generated by the stochastic block model (*Right*), in the detectable regime where it is easy to find a partition correlated with the true one (5, 6). The true partition of the network in Fig. 1, *Right* has a smaller modularity than the partition found for the random graph in Fig. 1, *Left*. We can find a partition with higher modularity (and lower accuracy) in Fig. 1, *Right*, using, e.g., simulated annealing, but then the modularities we obtain for the two networks are similar. Thus, the usual approach of null distributions and *P* values for hypothesis testing does not appear to work.

We propose to solve this problem with the tools of statistical physics. As in ref. 16, we treat the modularity as the Hamiltonian of a spin system. We define the energy of a partition *β*, *i* belongs to group *t*, then *t* achieves the maximum. We call

We give an efficient belief propagation (BP) algorithm to approximate these marginals, which is derived from the cavity method of statistical physics. This algorithm is highly scalable; each iteration takes linear time on sparse networks if the number of groups is fixed, and it converges rapidly in most cases. It is optimal in the sense that for synthetic graphs generated by the stochastic block model, it works all of the way down to the detectability transition. It provides a principled way to choose the number of communities, unlike other algorithms that tend to overfit. Finally, by applying this algorithm recursively, subdividing communities until no statistically significant subcommunities exist, we can uncover hierarchical structure.

We validate our approach with experiments on real and synthetic networks. In particular, we find significant large communities in some large networks where previous work claimed there were none. We also compare our algorithm with several others, finding that it obtains more accurate results, both in terms of determining the number of communities and in terms of matching their ground-truth structure.

## Results

### Results on the Stochastic Block Model.

Also called the planted partition model, the stochastic block model (SBM) is a popular ensemble of networks with community structure. There are *q* groups of nodes, and each node *i* has a group label *p*, by connecting each pair of nodes *q* groups have equal size and where *p* has only two distinct entries, *ε* is small, the community structure is strong; when

For a given average degree

For larger numbers of groups, the situation is more complicated. For *ε*; that is, the thresholds for reconstruction and robust reconstruction become different. Our claim is that our algorithm succeeds down to the Kesten–Stigum bound, i.e., throughout the detectable regime for

In Fig. 2 we compare the behavior of our BP algorithm on ER graphs and a network generated by the SBM in the detectable regime. Both graphs have the same size and average degree *Left*) there are just two phases, separated by a transition at

In contrast, the SBM network in Fig. 2, *Right* has strong community structure. In addition to the paramagnetic and spin-glass phases, there is now a retrieval phase in a range of *β*, where BP finds a retrieval state describing statistically significant community structure. The retrieval modularity jumps sharply at *β* increases; for comparison, the modularity of the planted partition is

We can compute two of these transition points analytically by analyzing the linear stability of the factorized fixed point (*Methods*). Stability against random perturbations gives

In Fig. 3, *Left* we show the phase diagram of our algorithm on SBM networks, including the paramagnetic, retrieval, and spin-glass phases as a function of *ε*, with **4**]. For *Right* we show the accuracy of the retrieval partition

We emphasize that *β*; i.e., it is not on the Nishimori line (23, 31, 32). However, the optimal *β* depends on the parameters of the SBM (*SI Text*). Our claim is that setting *Right* we compare our algorithm with that of refs. 5 and 6, which learns the SBM parameters using an expectation-maximization (EM) algorithm. Our algorithm provides nearly the same overlap, without the need for the EM loop.

### Results on Real-World Networks and Choosing the Number of Groups.

We tested our algorithm on a number of real-world networks. As for networks generated by the SBM in the detectable regime, we find a retrieval phase between the paramagnetic and spin-glass phases (*SI Text*). Rather than attempting to learn the optimal parameters or temperature for these networks, we simply set **3**], where *c* is the average degree. Again, this value of *β* is not optimal, and varying *β* may improve the algorithm’s performance; however, setting

When the number of groups is not known, determining it is a classic model-selection problem. The maximum modularity typically grows with *q*. In contrast, the retrieval modularity stops growing when *q* exceeds the correct value, giving us a principled method of choosing *SI Text*). For those networks where

As shown in Table 1, our algorithm finds a retrieval state in all these networks, with high retrieval modularity and high overlap with the ground truth. For the Gnutella, Epinions, and web-Google networks, no ground truth is known; but in contrast with ref. 37, our algorithm finds significant large-scale communities.

Whereas most of these networks are assortative, one network in the table, the adjacency network of common adjectives and nouns in the novel *David Copperfield* (40) (see ref. 2), is disassortative, because nouns are more likely to be adjacent to adjectives than other nouns and vice versa. In this case, we found a retrieval state with negative modularity and high overlap with the ground truth, by setting *β* to

### Results on Hierarchical Clustering.

Many networks appear to have hierarchical structure with communities and subcommunities on many scales (2, 8, 24, 38, 39). We can look for such structures by working recursively: We determine the optimal number

For networks generated by the SBM, each subgraph is an ER graph. Our algorithm finds no retrieval state in the subgraphs, so it stops after one level of division. The same occurs in some small real-world networks, e.g., Zachary’s karate club. In some larger real-world networks, on the other hand, our algorithm repeatedly finds a retrieval state in the subgraphs, suggesting a deep hierarchical structure.

An example is the network of political blogs (34). Our algorithm first finds two large communities corresponding to liberals and conservatives and agreeing with the ground-truth labels on 95% of the nodes. However, as shown in Fig. 4, it splits these into subcommunities, eventually finding a hierarchy five levels deep with a total of 14 subgroups. We show the adjacency matrix with nodes ordered by this final partition in Fig. 4, *Right* and the hierarchical structure is clearly visible. The modularity of the second through fifth levels is 0.426, 0.331, 0.285, and 0.282, respectively. This decreasing modularity may explain why the algorithm did not immediately split the network all of the way down to the subcommunities.

A nested SBM was used to explore hierarchical structure in ref. 39, where the blog network was also reported to have hierarchical structure. Our results are slightly different, giving 14 rather than 17 subgroups, but the first three levels of subdivision are similar.

### Comparison with Other Algorithms.

In this section we compare the performance of our algorithm with two popular algorithms: Louvain (9) and OSLOM (21). In particular, OSLOM tries to focus on statistically significant communities.

Louvain gives partitions with similar modularity to that of our algorithm, but with a much larger number of groups, particularly on large networks. For example, on the Gnutella and Epinions network (37), our algorithm finds

We show results on synthetic networks in Fig. 5. In Fig. 5, *Left* we apply Louvain, OSLOM, and our algorithm to SBM networks with *Right* we show the number of groups that each algorithm infers for an ER graph with *n*. In *SI Text* we report on experiments on benchmark networks with heavy-tailed degree distributions (42), with similar results.

## Discussion

We have presented a physics-based method for finding statistically significant communities. Rather than using an explicit generative or graphical model, it uses a popular measure of community structure, namely the modularity. It does not attempt to maximize the modularity, which is both computationally difficult and prone to overfitting. Instead it estimates the marginals of the Gibbs distribution, using a scalable BP algorithm derived from the cavity method (next section), and defines the retrieval partition by assigning each node to its most likely community according to these marginals.

In essence, the algorithm looks for the consensus of many partitions with high modularity. When this consensus exists, it indicates statistically significant community structure, as opposed to random fluctuations. Moreover, by testing for the existence of this retrieval state, as opposed to a spin-glass state where the algorithm fluctuates between many unrelated local optima, we can determine the correct number of groups and decompose a network hierarchically.

We note that this algorithm is related to BP for the degree-corrected stochastic block model (DCSBM). Specifically, for a fixed *β*, the modularity is linearly related to the log-likelihood of the DCSBM with particular parameters (*SI Text*). However, our algorithm does not have to learn the parameters of the block model with an EM algorithm or perform model selection between the stochastic block model and its degree-corrected variant (43). To be clear, *β* is still a tunable parameter that can be optimized, but the heuristic value

In addition to the detectability transition in the SBM, another well-known barrier to community detection is the resolution limit (44) where communities become difficult to find when their size is *SI Text*, we give some evidence that our hierarchical clustering algorithm overcomes this barrier. Namely, for the classic example of a ring of cliques, at the second level our algorithm divides the graph precisely into these cliques.

Another recent proposal for determining the number of groups is to use the number of real eigenvalues of the nonbacktracking matrix, outside the bulk of the spectrum (3). For some networks, such as the political blogs, this gives a larger number than the

Our approach can be extended to generalizations of the modularity, where the graph is weighted or where a parameter *γ* represents the relative importance of the expected number of internal edges (16). Finally, it would be interesting to apply BP to other objective functions, such as normalized cut or conductance, devising Hamiltonians from them and considering the resulting Gibbs distributions.

Finally, we note that rather than running BP once and using the resulting marginals, we could use decimation (45) to fix the labels of the most biased nodes, run BP again to update the marginals, and so on. This would increase the running time of the algorithm, but it may improve its performance. Another approach would be reinforcement (45), where we add external fields that point toward the likely configuration. We leave this for future work.

## Methods

### Defining Statistical Significance.

As described above, an ER random graph has many partitions with high modularity. However, these partitions are nearly uncorrelated with each other. In the language of disordered materials, the landscape of partitions is glassy: Although the optimal one might be unique, there are many others whose modularity is almost as high, but have a large Hamming distance from the optimum and from each other. If we define a Gibbs distribution on the partitions, we encounter either a paramagnetic state where the marginals are uniform or a spin glass with replica symmetry breaking where we jump between local optima. In either case, focusing on any one of these optima is simply overfitting.

For networks such as Fig. 1, *Right* in contrast, there are many high-modularity partitions that are correlated with each other and with the ground truth. As a result, the landscape has a smooth valley surrounding the ground truth. At a suitable temperature, the Gibbs distribution is in a retrieval phase with both low energy (high modularity) and high entropy, giving it a lower free energy than that of the paramagnetic state, with its marginals biased toward the ground truth. When BP converges to a fixed point, it finds a (local) minimum of the Bethe free energy, approximating this lower free energy phase.

We propose the existence of this retrieval phase as a physics-based definition of statistical significance. When it exists, the retrieval partition defined by the maximum marginals is an optimal prediction of which nodes belong to which groups.

The idea of using the free energy to separate real community structure from random noise, and using the Gibbs marginals to define a partition, also appeared in refs. 5 and 6. However, that work is based on a specific generative model, namely the stochastic block model, and the energy is (minus) the log-likelihood of the observed network. In contrast, we avoid explicit generative models and focus directly on the modularity as a measure of community structure.

### The Cavity Method and Belief Propagation.

Our goal is to compute the marginal probability distribution that each node belongs to a given group and the free energy of the Gibbs distribution. We could do this using a Monte Carlo Markov chain algorithm. However, to obtain marginals we would need many independent samples, and to obtain the free energy we would need to sample at many different temperatures. Thus, the Monte Carlo Markov chain (MCMC) is prohibitively slow for our purposes.

Instead, for sparse networks, we can use belief propagation (46), known in statistical physics as the cavity method (47). BP makes a conditional independence assumption, which is exact only on trees; however, in the regimes we consider (the detectable regime of the stochastic block model and typical real-world graphs), its estimates of the marginals are quite accurate. It also provides an estimate of the free energy, called the Bethe free energy, which is a function of one- and two-point marginals.

BP works with ‘‘messages’’ *i* to node *k*, of the marginal probability that *i*’s interactions with nodes

Here *i*’s neighbors, and *t*, which we update after each BP iteration. We refer to *SI Text* for detailed derivations of the BP update equations and Bethe free energy.

For *q* groups and *m* edges, each iteration of [**5**] takes time *q* is fixed, this is linear in the number of edges and linear in the number of nodes when the network is sparse (i.e., when the average degree is constant). Moreover, these updates can be easily parallelized. Empirically, the number of iterations required to converge appears to depend very weakly on the network size, although in some cases it must grow at least logarithmically.

### The Factorized Solution and Local Stability.

Observe that the factorized solution, **5**]. If BP converges to this solution, we cannot label the nodes better than chance, and the retrieval modularity is zero. This is the paramagnetic state.

There are two other possibilities: BP fails to converge, or it converges to a nonfactorized fixed point, which we call the retrieval state. In the latter case, we can compute the marginals by

and define the retrieval partition

On the other hand, if BP does not converge, this means that neither the factorized solution nor any other fixed point is locally stable; the spin-glass susceptibility diverges, and replica symmetry is broken. In other words, the space of partitions breaks into an exponential number of clusters, and BP jumps from one to another. The retrieval partition obtained using the current marginals will change to a very different partition if we run BP a bit longer or if we perturb the initial BP messages slightly. In the spin-glass phase, we are free to define a retrieval modularity from the current marginals, but it fluctuates rapidly and does not represent a consensus of many partitions.

The linear stability of the factorized solution can be characterized by computing the derivatives of messages with respect to each other at the factorized fixed point. Using [**5**], we find that

Its largest eigenvalue (in magnitude) is

On locally tree-like graphs with Poisson degree distributions and average degree *c*, the factorized fixed point is then unstable with respect to random noise whenever *β* must exceed a critical **3**]. If the network has some other degree distribution but is otherwise random, [**3**] holds where *c* is the average excess degree, i.e., the expected number of additional neighbors of the endpoint of a random edge.

If there is no statistically significant community structure, then BP has just two phases, the paramagnetic one and the spin glass: For **3**] assumes that the network is random conditioned on its degree distribution; in principle *β* in the vicinity of

To estimate *T* be the **7**]. The matrix of derivatives of all *B* is the nonbacktracking matrix (3). The adaptive external field in the BP equations suppresses eigenvectors where every node is in the same community. As a result, the relevant eigenvalue is *λ* is the largest eigenvalue of *T*, and *μ* is the second-largest eigenvalue of *B*, and the factorized fixed point is unstable whenever

Combining this with [**8**] and setting **4**.

However, this assumes that the corresponding eigenvector of *B* is correlated with the community structure, so that perturbing BP away from the factorized fixed point will lead to the retrieval state. This is true as long as *μ* is outside the bulk of *B*’s eigenvalues, which are confined to a disk of radius

We note that the paramagnetic, retrieval, and spin-glass states were also studied in ref. 51, using a generalized Potts model and a heat bath MCMC algorithm. However, their Hamiltonian depends on a tunable cut-size parameter, rather than on a general measure of community structure such as the modularity. Moreover, it is difficult to obtain analytical results on phase transitions using MCMC algorithms, whereas the stability of BP fixed points is quite tractable.

### Defining the Spin-Glass Phase.

Although we have identified the spin-glass phase with the nonconvergence of belief propagation, the true phase diagram is potentially more complicated. The spin-glass phase is defined by the divergence of the spin-glass susceptibility. If this phase appears continuously, then in sparse problems this is equivalent to the sensitivity of the BP messages to noise, i.e., whether it converges to a stable fixed point. However, if the spin-glass phase appears discontinuously, it could be that BP converges even though the true susceptibility diverges (e.g., ref. 52).

We expect this to happen above the Nishimori line when the hard but detectable phase exists (5), when there is a retrieval state with lower free energy than the factorized fixed point but with an exponentially small basin of attraction, so that BP starting with random messages fails to converge to the true minimum of the free energy. Detecting this spin-glass phase would require us to go beyond the replica-symmetric BP equations used here to equations with one-step replica symmetry breaking (45). In the assortative case of the stochastic block model, the hard-but-detectable phase exists for

## Acknowledgments

We are grateful to Silvio Franz, Florent Krzakala, Mark Newman, Federico Ricci-Tersenghi, Christophe Schulke, and Lenka Zdeborová for helpful discussions and to Tiago de Paula Peixoto for drawing Fig. 4, *Left*, using his software at graph-tool.skewed.de/. This work was supported by the Air Force Office of Scientific Research and the Defense Advanced Research Planning Agency under Grant FA9550-12-1-0432.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: moore{at}santafe.edu.

Author contributions: P.Z. and C.M. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1409770111/-/DCSupplemental.

## References

- ↵
- ↵
- ↵.
- Krzakala F, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Rosvall M,
- Bergstrom CT

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Zdeborová L,
- Boettcher S

- ↵.
- Sulc P,
- Zdeborová L

- ↵
- ↵
- ↵
- ↵.
- Wilson JD,
- Wang S,
- Mucha PJ,
- Bhamidi S,
- Nobel AB

- ↵
- ↵
- ↵
- ↵.
- Mossel E,
- Neeman J,
- Sly A

*arXiv*:1202.1499 - ↵.
- Massoulie L

*arXiv*:1311.3085 - ↵.
- Mossel E,
- Neeman J,
- Sly A

*arXiv*:1311.4115 - ↵
- ↵
- ↵.
- Nishimori H

- ↵
- ↵.
- Zachary WW

- ↵.
- Adamic LA,
- Glance N

*Proceedings of the Third International Workshop on Link Discovery*(ACM, New York), pp 36–43 - ↵
- ↵.
- Krebs V

- ↵
- ↵.
- Sales-Pardo M,
- Guimerà R,
- Moreira AA,
- Amaral LAN

- ↵.
- Peixoto TP

- ↵.
- Dickens C

- ↵
- ↵
- ↵
- ↵.
- Fortunato S,
- Barthélemy M

- ↵.
- Mézard M,
- Montanari A

- ↵.
- Yedidia J,
- Freeman W,
- Weiss Y

- ↵.
- Mézard M,
- Parisi G

- ↵
- ↵
- ↵
- ↵
- ↵.
- Zdeborová L

## Citation Manager Formats

### More Articles of This Classification

### Physical Sciences

### Computer Sciences

### Related Content

- No related articles found.