New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
A nonparametric view of network models and Newman–Girvan and other modularities

Edited by Stephen E. Fienberg, Carnegie Mellon University, Pittsburgh, PA, and approved October 13, 2009 (received for review July 2, 2009)
Abstract
Prompted by the increasing interest in networks in many fields, we present an attempt at unifying points of view and analyses of these objects coming from the social sciences, statistics, probability and physics communities. We apply our approach to the Newman–Girvan modularity, widely used for “community” detection, among others. Our analysis is asymptotic but we show by simulation and application to real examples that the theory is a reasonable guide to practice.
The social sciences have investigated the structure of small networks since the 1970s, and have come up with elaborate modeling strategies, both deterministic, see Doreian et al. (1) for a view, and stochastic, see Airoldi et al. (2) for a view and recent work. During the same period, starting with the work of Erdös and Rényi (3), a rich literature has developed on the probabilistic properties of stochastic models for graphs. A major contribution to this work is Bollobás et al. (4). On the whole, the goals of the analyses of ref. 4, such as emergence of the giant component, are not aimed at the statistical goals of the social science literature we have cited.
Recently, there has been a surge of interest, particularly in the physics and computer science communities in the properties of networks of many kinds, including the Internet, mobile networks, the World Wide Web, citation networks, email networks, food webs, and social and biochemical networks. Identification of “community structure” has received particular attention: the vertices in networks are often found to cluster into small communities, where vertices within a community share the same densities of connecting with vertices in the their own community as well as different ones with other communities. The ability to detect such groups can be of significant practical importance. For instance, groups within the worldwide Web may correspond to sets of web pages on related topics; groups within mobile networks may correspond to sets of friends or colleagues; groups in computer networks may correspond to users that are sharing files with peertopeer traffic, or collections of compromised computers controlled by remote hackers, e.g. botnets (5). A recent algorithm proposed by Newman and Girvan (6), that maximizes a socalled “Newman–Girvan” modularity function, has received particular attention because of its success in many applications in social and biological networks (7).
Our first goal is, by starting with a model somewhat less general than that of ref. 4, to construct a nonparametric statistical framework, which we will then use in the analysis, both of modularities and parametric statistical models. Our analysis is asymptotic, letting the number of vertices go to ∞. We view, as usual, asymptotics as being appropriate insofar as they are a guide to what happens for finite n. Our models can, on the one hand, be viewed as special cases of those proposed by ref. 4, and on the other, as encompassing most of the parametric and semiparametric models discussed in Airoldi et al. (2) from a statistical point of view and in Chung and Lu (8) for a probabilistic one. An advantage of our framework is the possibility of analyzing the properties of the Newman–Girvan modularity, and the reasons for its success and occasional failures. Our approach suggests an alternative modularity which is, in principle, “failsafe” for rich enough models. Moreover, our point of view has the virtue of enabling us to think in terms of “strength of relations” between individuals not necessarily clustering them into communities beforehand.
We begin, using results of Aldous and Hoover (9), by introducing what we view as the analogues of arbitrary infinite population models on infinite unlabeled graphs which are “ergodic” and from which a subgraph with n vertices can be viewed as a piece. This development of Aldous and Hoover can be viewed as a generalization of deFinetti's famous characterization of exchangeable sequences as mixtures of i.i.d. ones. Thus, our approach can also be viewed as a first step in the generalization of the classical construction of complex statistical models out of i.i.d. ones using covariates, information about labels and relationships.
It turns out that natural classes of parametric models which approximate the nonparametric models we introduce are the “blockmodels” introduced by Holland, Laskey and Leinhardt ref. 10; see also refs. 2 and 11, which are generalizations of the Erdös–Rényi model. These can be described as follows.
In a possibly (at least conceptually) infinite population (of vertices) there are K unknown subcommunities. Unlabeled individuals (vertices) relate to each other through edges which for this paper we assume are undirected. This situation leads to the following set of probability models for undirected graphs or equivalently the corresponding adjacency matrices {A_{ij} : i,j ≥ 1}, where A_{ij} = 1 or 0 according as there is or is not an edge between i and j.

Individuals independently belong to community j with probability π_{j}, 1 ≤ j ≤ K, ∑_{j = 1}^{K} π_{j} = 1.

A symmetric K × K matrix {P_{kl} : 1 ≤ k,l ≤ K} of probabilities is given such that P_{ab} is the probability that a specific individual i relates to individual j given that i ∈ a,j ∈ b. The membership relations between individuals are established independently. Thus 1 − ∑_{1≤a,b≤K} π_{a} π_{b} P_{ab} is the probability that there is no edge between i and j.
The Erdös–Rényi model corresponds to K = 1.
We proceed to define Newman–Girvan modularity and an alternative statistically motivated modularity. We give necessary and sufficient conditions for consistency based on the parameters of the block model, properties of the modularities, and average degree of the graph. By consistency we mean that the modularities can identify the members of the block model communities perfectly. We also give examples of inconsistency when the conditions fail. We then study the validity of the asymptotics in a limited simulation and apply our approach to a classical small example, the Karate Club and a large set of Private Branch Exchange (PBX) data. We conclude with a discussion and some open problems.
Random Graph Models
Consider any probability distribution ℙ on an infinite undirected graph, or equivalently a probability distribution on the set of all matrices ∥A_{ij} : i,j ≥ 1∥ where A_{ij} = 1 or 0, A_{ij} = A_{ji} for all i,j pairs, and A_{ii} = 0 for all i, thus excluding self relation. If the graph is unlabeled, it is natural to restrict attention to ℙ such that ∥A_{σi σj}∥ ∼ ℙ for any permutation σ of {1,2,3,…}. Hoover (see ref. 9) has shown that all such probability distributions can be represented as, where α, {ξ_{i}} and {λ_{ij}} are i.i.d. U(0,1) variables and for all u,v,w,z. The variables ξ correspond to latent variables, λ being completely individual specific, ξ generating relations between individuals and α a mixture variable which is unidentifiable even for an infinite graph. Note that g is unidentifiable and the ξ and λ could be put on another scale, e.g. Gaussian. We note that, this point of departure was also recently proposed by Hoff (12) but was followed to a different end.
It is clear that the distributions representable as, where λ_{ij} = λ_{ji}, are the extreme points of this set and play the same role as sequences of i.i.d. variables play in de Finetti's theorem. Since given ξ_{i} and ξ_{j}, the λ_{ij} are i.i.d., these distributions are naturally parametrized by the function As Diaconis and Janson (13) point out h(.,.) does not uniquely determine ℙ but if h_{1} and h_{2} define the same ℙ, then there exists φ : [0,1] → [0.1] which is: measure preserving, i.e. such that φ(ξ_{1}) has a U(0,1) distribution; and h_{1}(u,v) = h_{2}(φ(u),φ(v)).
Given any h corresponding to ℙ, let It is well known (see section 10 of ref. 14) that there exists a measure preserving φ_{g} such that, g(φ_{g}(v)) is monotone non decreasing.
Define We claim that where F is the cdf of g_{CAN}(ξ_{i}), and h_{CAN} is unique up to sets of measure 0. To see this note that if h corresponds to ℙ and g(u) ≡ ∫_{0}^{1} h(u,v)dv is non decreasing, then since F is determined by ℙ only, g(u) = F^{−1}(u). But g(φ_{g}(u)) = g_{CAN}(u) and φ_{g}(u) = g^{−1}g_{CAN}(u) = u.
There is a reparametrization of h_{CAN} (we drop the CAN subscript in the future) which enables us to think of our model in terms more familiar to statisticians.
Let Then the conditional density of (ξ_{i},ξ_{j}) given that there is an edge between i and j is w(u,v) = ρ^{−1}h(u,v). This parametrization also permits us to decouple ρ ∝ E(Degree) of the graph from the inhomogeneity structure. It is natural finally to let ρ depend on n but w(·,·) to be fixed. If λ_{n} ≡ 𝔼(Degree) → ∞, we have what we may call the “dense graph” limit. If λ_{n} = Ω(1), we are in the case most studied in probability theory where, for instance, λ_{n} = 1 is the threshold at which the so called “giant component” appears. This is the situation Bollobas et al. focus on.
As we have noted, block models are of this type. Here we can think of the reparametrization as being ρ, π, ∥S_{ab}∥ = ∥ρ^{−1}P_{ab}∥, or ∥W_{ab}∥ ≡ ∥ρ^{−1}P_{ab}π_{a}π_{b}∥. The models studied by Chung and others (8) given by, also fall under our description. The mixture model of Newman and Leicht (15) where, given communities 1,…,K, is not of this type, since it is not invariant under permutations. It can be made invariant by summing over all permutations of {1,…,n}, but is then generally not ergodic. Such models can be developed from our framework by permitting covariates Z_{i} depending on vertex identity or Z_{ij} depending on edge identity. Newman and Leicht's example where the communities are WEB pages falls under this observation. From a statistical point of view, these models bear the same relation to our models as regression models do to single population models.
Block models or models where for known functions {a_{k}} can be used to approximate general h. The latent eigenvalue model of Hoff (12) is of this type, but with a_{k} which are extremely rough and unidentifiable since the a_{j}(ξ) are independent, and for which no unique choice exists. We can think of the canonical version of the block model as corresponding to a labeling 1,…,K of the communities in the order W_{1} ≤… ≤ W_{K} where W_{j} = ∑_{k} W_{jk}, which is proportional to the expected degree of a member of community j. The function h(·,·) then takes value P_{ab} on the (a,b) block of the product partition in which each axis is divided into consecutive intervals, of lengths π_{1},…,π_{K}. Each corresponding vertical slice exhibits the relation pattern for that community with the diagonal block identifying the members of the community. The nonparametric h(·,·) gives the same intuitive picture on an arbitrarily fine scale. We note that, as in nonparametric statistics, to estimate h or w, regularization is needed. That is, we need to consider K_{n} → ∞ at rates which depend on n and λ_{n} to obtain good estimates of h or w by using estimates of θ above or of block model parameters. We will discuss this further later.
Newman–Girvan and Likelihood Modularities
The task of determining K communities corresponds to finding a good assignment for the vertices e ≡ {e_{1},…,e_{n}} where e_{j} ∈{1,…,K}. There are K^{n} such assignments. Suppose that the distribution of A follows a K block model with parameters π = (π_{1},…,π_{K}) and P = ∥P_{ab}∥_{K×K}. The observed A is a consequence of a realization c = (c_{1},…,c_{n}) of n independent Multinomial(1,π) variables. Evidently we can measure the adequacy of an assignment through the matrix where R_{ab} = n^{−1} ∼_{i=1}^{i} I(c_{i} = b, e_{i} = a), the fraction of b members classified as a members if we use e. It is natural to ask for consistency of an assignment e, that is, e = c, i.e. where f_{a}(e) = n^{−1}n_{a}(e) and n_{a} (e) = ∑_{i=1}^{n} I(e_{i} = a). The Newman–Girvan modularity Q_{NG}(e,A) is defined as follows. Let {i : e_{i} = k} denote ecommunity k, i.e. as estimated by e. Define to be the block sum of A. Obviously, O_{kk} is twice the number of edges among nodes in the kth ecommunity and for k ≠ l, O_{kl} is the number of edges between nodes in the kth ecommunity and nodes in the lth ecommunity. Let D_{k}(e,A) = ∑_{l=1}^{K} O_{kl}(e,A). It is easy to verify that D_{k} is the sum of degrees for nodes in the kth ecommunity. Let L = ∑_{k=1}^{K} D_{k} be twice the number of edges among all nodes. Then the Newman–Girvan modularity is defined by The Newman–Girvan algorithm then searches for the membership assignment vector e that maximizes Q_{NG}. Notice that if edges are randomly generated uniformly among all pairs of nodes with given node degrees, then the number of edges between the kth ecommunity and lth ecommunity is expected to be L^{−1}D_{k}D_{l}. Therefore, the Newman–Girvan modularity measures the fraction of the edges on the graph that connect vertices of the same type (i.e. withincommunity edges) minus the expected value of the same quantity on a graph with the same community divisions but random connections between the vertices (6). Newman (16) contrasts and compares his modularity with spectral clustering, another common “community identification” method which we will also compare to the likelihood modularity below. We seek conditions under which the official NG assignment is consistent with probability tending to 1. Before doing so we consider alternative modularities.
For fixed e, the conditional, given {n_{k}(e)}_{1}^{K}, loglikelihood of A is
We write general modularities in the form,
where μ_{n} = E(L) = n(n − 1)ρ_{n} and the matrix O(e,A) is defined by its elements O_{ab}(e,A), f(e) =
We define asymptotic consistency of a sequence of assignments ĉ by as n → ∞.
We will assume that there exists a function F : M×ℝ^{+}×G → ℝ such that F_{n} is approximated by F evaluated at the conditional expectation given c of the argument of F_{n}. Suppose first F_{n} ≡ F. It is intuitively clear from [3] that if ĉ → c, then R(c,ĉ) → D(π). Then, since f(c) → π, the following condition is natural.
I. F(RSR^{T},1,R1) is uniquely maximized over R = {R : R ≥ 0, R^{T} 1 = π} by R = D(π), for all (π,S) in an open set Θ.
This means c with f(c) = π is the right assignment for the limiting problem. Note that since F is not concave in R, this is a strong condition.
For (π,S) to be identifiable uniquely, we clearly also need that:
II. S does not have two identical columns (Two communities cannot have identical probabilities of being related to other communities and within themselves) and π has all entries positive (Each community has some members).
We also need a few more technical conditions of a standard type.
III. a) F is Lipschitz in its arguments. b) The directional derivatives
Theorem 1. Suppose F,S and π satisfy I–III and ĉ is the maximizer of Q(e,A). Suppose
The proof is given in SI Appendix.
We note that Snijders and Nowicki (11) established a related result, exponential convergence to 0 of the misclassification probability for λ_{n} = Ω (n) using node degree Kmeans clustering for K = 2.
Let F(c,e) ≡ F(R(c,e)SR^{T}(c,e),f^{T}(c)Sf(c),f(e)). In the general case it suffices to show, for all δ > 0, some γ > 0, where and ∥e − c∥ = ∑_{i=1}^{n} 1(e_{i} ≠ c_{i}). We show in SI Appendix that Q_{NG} and Q_{LM} satisfy I and III and Eq. 5 for selected (π,S) for Q_{NG} and all (π,S) for Q_{LM}.
An immediate consequence of the theorem is:
Corollary 1: If the conditions of Theorem 1 hold and, then, with A · B denoting pointwise product. The limiting variances are what we would get for maximum likelihood estimates if ĉ = c, i.e. we knew the assignment to begin with. So consistent modularities lead to efficient estimates of the parameters.
This follows since with probability tending to 1, ĉ = c.
To estimate w(·,·) in the nonparametric case we need K → ∞, and w(·,·) and π(·) smooth. We approximate by W_{K} ∼ K^{−2}∥w(aK^{−1},bK^{−1})∥, π_{K}(a) ∼ K^{−1}π(aK^{−1}), where w(·,·), W_{K} are canonical and the modularity defining F_{K}, F_{K}(·,·,·) is of order K^{−2}. We have preliminary results in that direction but their formulation is complicated and we do not treat them further here.
Consistency of NG, LM
We show in SI Appendix using the appropriate F_{NG}, F_{LM} that the likelihood modularity is always consistent while the Newman–Girvan is not. This is perhaps not surprising since NG focuses on the diagonal of O. In fact, we would hope that NG is consistent under the submodel {(ρ, π W) : W_{aa} > ∑_{b≠a} W_{ab} for all a}, which corresponds to Newman and Girvan's motivation. We have shown this for K = 2 but it surprisingly fails for K > 2. Here is a counterexample. Let K = 3, π = (1/3,1/3,1/3)^{T} and As n → ∞, with true labeling, Q_{NG} approaches 0.033. However, the maximum Q_{NG}, about 0.038, is achieved by merging the first two communities. That is, two sparser communities are merged. This is consistent with an observation of Fortunato and Barthelemy (17).
If for the profile likelihood we maximize only over e such that Ŵ_{aa} (e) > ∑_{b≠a} Ŵ_{ab} (e) for all a, we obtain ĉ which is consistent under the submodel above, and in the Karate Club example performs like NG.
Computational Issues
Computation of optimal assignments using modularities is, in principle, NP hard. However, although the surface is multimodal, in the examples we have considered and generally when the signal is strong, optimization from several starting points using a label switching algorithm (19) works well.
Simulation
We generate random matrices A and maximize Q_{NG}, Q_{LM} to obtain node labels respectively, where Q_{LM} is maximized using a label switching algorithm. To make a fair comparison, the initial labeling for Q_{NG} and Q_{LM} is to randomly choose 50% of the nodes with correct labels and the other 50% with random labels. For spectral clustering, we adopt the algorithm of (18) by using the first K eigenvectors of D(d)^{−1/2} AD(d)^{−1/2}, where d = (d_{1},…,d_{n})^{T} and d_{i} is the degree of the ith node. We generate the P matrix randomly by forcing symmetry and then add a constant to diagonal entries such that I holds. The π is generated randomly from the simplex. To be precise, the values for Fig. 1 are π = (.203,.286,.511)^{T} and where n varies from 200 to 1,500 and b varies from 10 to 100. Obviously, Fig. 1 says that the likelihood method exhibits much less incorrect labeling than Newman–Girvan and spectral clustering. This is consistent with theoretical comparison.
Data Examples
We compare the LM and NG modularity algorithms below with applications to two real data sets. To deal with the issue of nonconvex optimization, we simply use many restarting points.
Zachary's “Karate Club” Network.
We first compare LM and NG with the famous “Karate Club” network of ref. 20, from the social science literature, which has become something of a standard test for community detection algorithms. The network shows the patterns of friendship between the members of a karate club at a US university in the 1970s. The example is of particular interest because shortly after the observation and construction of the network, the club in question split into two components separated by the dashed line as shown in Figs. 2 and 3 as a result of an internal dispute. Fig. 2 Left shows two communities identified by maximizing the likelihood modularity where the shapes of the vertices denote the membership of the corresponding individuals, and similarly the right panel shows communities identified by NG. Obviously,the NG communities match the two subdivisions identified by the split save for one misclassified individual. The LM communities are quite different, and obviously one community consists of five individuals with central importance that connect with many other nodes while the other community consists of the remaining individuals. Although not reflecting the split this corresponds to other plausible distinguishing characteristics of the individuals. However, if we force the constraint that withincommunity density is no less than the density of relationship to all other communities, the submodel we discussed, then we obtain two LM communities that match the split perfectly. The same partitions as ours with and without constraint have also been reported by Rosvall and Bergstrom using a data compression criterion (21), which is closely related to LM. We note that, as is usual in clustering, there is no ground truth, only features which can be validated ex post fact. It is interesting to note that, if instead of K = 2, we put K = 4, as in Fig. 3, it is evident for both modularities that merging the communities on either side of the eigenvector split, gives the “correct” Karate Club split. This suggests that the standard policy mentioned by Newman (16) of increasing the number of communities by splitting is not necessarily ideal since in this case the “misclassified” individual of Fig. 2 would never be “correctly” classified.
Private Branch Exchange.
Our second example is of a telephone communication network where connections are made among the internal telephones of a private business organization, a so called PBX. PBXs are differentiated from “key systems” in that users of key systems usually select their own outgoing lines, while PBXs select the outgoing line automatically. Our data contains 621 individuals. Fig. 4 Left shows the results of community detection by LM, where the adjacency matrix is plotted but the nodes are sorted according to the membership of the corresponding individuals identified by maximizing the likelihood modularity. Similarly, the right panel of Fig. 4 shows the communities identified by NG, where the maximum Newman–Girvan modularity is 0.4217. Note that the identified communities by LM have sizes 323, 81, 78, 97, 41, and 1, respectively. The communities are ordered simply by their average node degrees, essentially the order for h_{CAN}. Interestingly, the last LM community has only one node that communicates with almost everyone else, nodes in the second community only communicate with internal nodes, nodes in the fourth community and the sixth community, but not with others; Similarly, the third community only communicates with the fifth and sixth communities, and so on. In other words, communication between communities is sparse. However, the communities identified by NG are quite different with only the fifth community heavily overlapping with a community identified by LM. This difference appears to be caused by the nodes in the 5th and 6th LM communities which have many more betweencommunity connections than withincommunity connections while NG more or less maximizes withincommunity connections. We have verified that the group communicating with all others is a service group.
Discussion

As we noted, under our conditions the usual statistical goal of estimating the parameters π and P is trivial, since, once we have assigned individuals to the K communities consistently, the natural estimates, Ŵ and π̂, are not just consistent but efficient. However, in the more realistic case where λ_{n} = Ω(1), or even just λ_{n} = Ω (logn), this is no longer true. Elsewhere, we shall show that, indeed, estimation of parameters by maximum likelihood and Bayes classification of individuals (no longer perfect) is optimal.

A difficulty faced by all these methods, modularities or likelihoods, is that if K is large, searching over the space of classifications becomes prohibitively expensive. In subsequent work we intend to show that this difficulty may be partly overcome by using the method of moments to first estimate π and P, and then study the likelihood in a neighborhood of the estimated values.
Open Problems

A fundamental difficulty not considered in the literature is the choice of K. From our nonparametric point of view, this can equally well be seen as, how to balance bias and variance in the estimation of w(·,·). We would like to argue that, as in nonparametric statistics, estimating w(·,·) without prior prejudices on its structure is as important an exploratory step in this context as, using histograms in ordinary statistics.

The linking of this framework to covariates depending on vertice or edge identity is crucial, permitting relationship strength to be assessed as a function of vector variables.

The links of our approach to spectral graph clustering and more generally clustering on the basis of similarities seem intriguing.
Acknowledgments
We thank Tin K Ho for help in obtaining the PBX data and for helpful discussions. We also thank the referees, whose references and comments improved this article immeasurably.
Footnotes
 ^{1}To whom correspondence should be addressed. Email: bickel{at}stat.berkeley.edu

Author contributions: P.J.B. and A.C. performed research and analyzed data.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0907096106/DCSupplemental.
References
 ↵
 Doreian P,
 Batagelj V,
 Ferligoj A
 ↵
 Airoldi EM,
 Blei DM,
 Fienberg SE,
 Xing XP
 ↵
 Erdös P,
 Rényi A
 ↵
 ↵
 Zhao Y,
 et al.
 ↵
 ↵
 ↵
 Chung FRK,
 Lu L
 ↵
 Kallenberg O
 ↵
 ↵
 ↵
 Platt J,
 Koller D,
 Roweis S
 Hoff PD
 ↵
 Diaconis P,
 Janson S
 ↵
 Hardy G,
 Littlewood J,
 Polya G
 ↵
 Newman MEJ,
 Leicht EA
 ↵
 ↵
 Fortunato S,
 Barthélemy M
 ↵
 Ng AY,
 Jordan MI,
 Weiss Y
 ↵
 ↵
 Zachary W
 ↵
 Rosvall M,
 Bergstrom C
Citation Manager Formats
More Articles of This Classification
Physical Sciences
Related Content
 No related articles found.