A likelihood approach to analysis of network data

Edited by David O. Siegmund, Stanford University, Stanford, CA, and approved March 31, 2006 (received for review January 4, 2006)
Abstract
Biological, sociological, and technological network data are often analyzed by using simple summary statistics, such as the observed degree distribution, and nonparametric bootstrap procedures to provide an adequate null distribution for testing hypotheses about the network. In this article we present a full-likelihood approach that allows us to estimate parameters for general models of network growth that can be expressed in terms of recursion relations. To handle larger networks we have developed an importance sampling scheme that allows us to approximate the likelihood and draw inference about the network and how it has been generated, estimate the parameters in the model, and perform parametric bootstrap analysis of network data. We illustrate the power of this approach by estimating growth parameters for the Caenorhabditis elegans protein interaction network.
Complex biological, sociological, and technological networks vary in size, form, structure, and the mechanisms by which they grow. They are widely seen as convenient and coherent descriptions for the whole set of interactions in biological, social, or technological systems, and their empirical properties have attracted considerable attention. A range of statistical ensembles [in the sense of an “ensemble” in statistical physics (1)] of networks (or probability spaces over graphs) has been studied, notably Erdős–Rényi and scale-free random graphs (2–4). The former has been the canonical model in random-graph theory but does not capture some important aspects of real networks. These often have a fixed number of nodes (e.g., the number of proteins in an organism is fixed) and much broader degree distributions than the Poisson distribution that characterizes the degree distribution of Erdős–Rényi random graphs; i.e., some nodes have a very high degree (number of interactions), whereas most nodes have degree k = 1 or 2. A range of mechanistic models has been suggested where the network grows through the addition of nodes and the asymptotic shape of the degree distribution takes on the form of a power law (2).
Testing hypotheses about a network, its form, and structure and how it has evolved will be difficult, even if a plausible model for network growth can be found. Typically, the analysis of networks has therefore involved either the use of summary statistics, such as the degree distribution or the clustering coefficient, or, in the case of hypothesis tests, rewiring the network while keeping the degree of each node fixed. In the latter case each node has a number of “stumps” equal to its degree and the stumps are connected randomly to create a randomly rewired replicate, e.g., ref. 5. This procedure is well defined and meaningful, but it means that the replicates are uncorrelated (degree–degree correlations depend only on the degree sequence) and any potential coarse structure of the network (such as community structure) is ignored. Thus, although bootstrap methods can, in principle, be more informative than simple summary statistics and most structural analyses rely on them at least to some extent, it is important to keep in mind that the rewired instances of the network will often be systematically and qualitatively different from the true network. The answer to a hypothesis test might depend crucially on the part of the data that is kept fixed and the part that is changed by the bootstrapping procedure. Quite different answers might be obtained, e.g., if the skeleton of the network is fixed rather than just the (observed) degree distribution, although both approaches might appear reasonable in a given situation.
Alternatively, one might turn to likelihood methods. These methods require a probabilistic model reflecting the nature of the data and how the network has evolved. One popular broad class of mathematical models of networks and network evolution includes duplication-attachment (DA) models (4, 6, 7), where a set of parameters specifies the probabilities for including new nodes and edges. The network is considered the result of an evolutionary stochastic process such that the number of nodes has increased from a smaller number through a series of node-adding events. New nodes can be (partial) copies of existing nodes and their links or completely new ones.^{¶} This class of models includes the Barabási–Albert model (2) and the duplication model of Chung et al. (6) as special cases and can interpolate smoothly between them: both are nested inside the same DA model. Estimating their parameters (the probability of a duplication event and the probability of the duplicate node inheriting an edge from the original node, respectively) allows us, for example, to test the extent to which the assumptions of the Barabási–Albert model are supported by the data.
Mathematical models of networks have been used among other things to explain evolutionary aspects of biological networks, growth and structure of sociological networks, and how certain features of networks seem to appear naturally and globally, such as fat-tailed degree distributions, e.g., ref. 2. However, to our knowledge, mathematical models have been used only indirectly in statistical analysis; for example, by comparing the observed degree distribution to a probability model for the degree distribution (which can be seen as a composite likelihood approach), e.g., ref. 9 and references therein. In principle these models allow for a deeper and fuller statistical analysis of network data, including estimation of the set of parameters that provides the best fit to the model and hypothesis testing, and subsequently interpretation of parameters in relation to the mechanisms underlying the generation of the data (for example, the underlying biological causes and processes). Hypothesis testing is here naturally performed by using the parametric bootstrap: the null distribution is obtained by simulation of networks under the model with the estimated parameters.
Here we present a method that in principle allows us to calculate the likelihood of the full network under a given mathematical model, thereby using the full network data and all of the information embedded in the data about the network structure. It has been argued (10) that the mechanistic process underlying network evolution is less important than the ability of an ensemble to qualitatively describe the data. It is, for example, well known that protein interaction networks do not grow according to the simple preferential attachment model that underlies simple scale-free models. For example, ≈120 million years before present a whole-genome duplication occurred in the lineage ancestral to Saccharomyces cerevisiae and other yeast species belonging to the sensu stricto clade, and such exceedingly rare events, even though they are of fundamental biological importance, cannot easily be incorporated into a probability model of the sort used in other contexts. But even if the model is (necessarily, and indeed desirably) oversimplified, the likelihood approach allows us to gain additional insights. For DA models, for example, we can infer a maximum-likelihood estimate for the duplication probability. This estimate, in turn, can be interpreted as an “effective” node duplication probability,^{‖} which, in the case of biological networks, can be compared with estimates of the extent of historical duplication activity obtained from sequence analysis.
A General DA Model
By a graph we always mean an undirected graph without multiple edges (or links) and self-loops. A randomly growing graph is a Markov chain {G_{t}}_{t=t_{0}}^{∞} such that G_{t} is a graph of size t and the edges and nodes of G_{t} are subsets of the edges and nodes of G_{t+1}. This implies that one node is added at a time.
We suppose that when G_{t}_{+1} is created, new edges are added between the new node v_{new} and the old nodes that were already present in G_{t}. One kind of stochastic rule that does this is duplication. An old node v_{old} is picked, and the nodes linked to v_{new} are chosen among the nodes linked to v_{old}. If, in addition, it is also possible to let v_{new} attach to v_{old}, then we speak of DA. For short, we say that v_{new} is copied from v_{old}. See Fig. 1.
Consider a graph G_{t} and let δ(G_{t}, v) denote the graph with v deleted, i.e., v and all links to it are removed. A node v in G_{t} is said to be removable if G_{t} can be created by copying a node in δ(G_{t}, v). If G_{t} contains removable nodes, it is said to be reducible, otherwise G_{t} is irreducible.
A surprising fact, which we prove in Supporting Text, which is published as supporting information on the PNAS web site, is that if removable nodes are deleted repeatedly from G_{t} until the graph is irreducible, we always end up with graphs that are isomorphic to each other, irrespective of the order in which the nodes were removed. This fact will be important when computing the likelihood.
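The reduction procedure is easy to make concrete. The following Python sketch (all names are ours and purely illustrative; a graph is a dict mapping each node to the set of its neighbors) tests removability and reduces a graph to an irreducible one. A node v is removable exactly when some other node w could have served as its copy source, i.e., every neighbor of v other than w is also a neighbor of w.

```python
def is_removable(g, v):
    # v is removable if some other node w could have been its copy source,
    # i.e., every neighbor of v other than w is also a neighbor of w.
    return any(g[v] - {w} <= g[w] for w in g if w != v)

def delete(g, v):
    # delta(G, v): remove v and every edge incident to it.
    return {u: nbrs - {v} for u, nbrs in g.items() if u != v}

def reduce_to_irreducible(g):
    # Repeatedly delete removable nodes; by the isomorphism result proved in
    # Supporting Text, the end result is unique up to isomorphism.
    while len(g) > 1:
        v = next((u for u in sorted(g) if is_removable(g, u)), None)
        if v is None:      # no removable node left: g is irreducible
            break
        g = delete(g, v)
    return g
```

For example, a 5-cycle is irreducible under this criterion, whereas a triangle reduces all the way to a single node.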
We say that a randomly growing graph is a DA model if

G_{t0} is irreducible.

P(G_{t+1} | G_{t}) > 0 if, and only if, G_{t+1} can be obtained by copying one node in G_{t}.
In the next section we consider for simplicity a specific DA model, although the theory applies generally to all DA models (including the Barabási–Albert model mentioned above).
The Likelihood
Assume the model is described by a possibly vector-valued parameter θ and let G_{t} be an observation from the model. We are interested in finding the maximum-likelihood estimate of θ. To that end we calculate the likelihood as a function of θ. Let ℛ(G_{t}) be the set of removable nodes in G_{t}; then

L(θ; G_{t}) = Σ_{v∈ℛ(G_{t})} ω(θ, G_{t}, v) L(θ; δ(G_{t}, v)), [1]

where

ω(θ, G_{t}, v) = (1/t) P_{θ}(G_{t} | δ(G_{t}, v)).

The factor 1/t is the probability that v is the last added node.
Thus, in principle, it is possible to compute the likelihood recursively. As soon as we arrive at an irreducible graph, it is isomorphic to the true starting graph and has the same likelihood as the true starting graph. If we could end up with nonisomorphic graphs (potentially with different numbers of nodes), then a nontrivial distribution would have been required for the initial graph.
In the remainder of the section, we consider the following DA model. Let θ = (π, p, q, r), where π, p, q, r ∈ [0, 1]. The model is defined by the following two rules.

Choose v_{old} in G_{t} uniformly. Link v_{new} to each node that links to v_{old}, each independently with probability p. Link v_{old} to v_{new} with probability q.

Choose v_{old} in G_{t} uniformly. Create a link between v_{old} and v_{new} with probability r.
In every step, we follow rule 1 with probability π, and rule 2 with probability 1 − π. We call this model duplication with random attachment.
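Forward simulation of duplication with random attachment is direct. A minimal Python sketch (function and variable names are ours, chosen for illustration), growing a graph node by node from the single-node irreducible graph:

```python
import random

def grow(t, pi, p, q, r, seed=0):
    # Grow a graph to t nodes under duplication with random attachment,
    # starting from the single-node irreducible graph (t0 = 1).
    # A graph is a dict mapping each node to the set of its neighbors.
    rng = random.Random(seed)
    g = {0: set()}
    for v_new in range(1, t):
        v_old = rng.randrange(v_new)      # choose an old node uniformly
        nbrs = set()
        if rng.random() < pi:             # rule 1: duplication
            for w in g[v_old]:
                if rng.random() < p:      # inherit each link of v_old w.p. p
                    nbrs.add(w)
            if rng.random() < q:          # attach to v_old itself w.p. q
                nbrs.add(v_old)
        elif rng.random() < r:            # rule 2: random attachment
            nbrs.add(v_old)
        g[v_new] = nbrs
        for w in nbrs:
            g[w].add(v_new)
    return g
```

With π = 1 the model is pure DA; with π = 0 it reduces to pure random attachment. Simulated graphs of this kind also supply the replicates needed for the parametric bootstrap discussed above.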
To compute the likelihood, we make a list of all nodes in ℛ(G_{t}). To every node v in the list there is a nonempty set 𝒞(G_{t}, v) of nodes w in G_{t} such that v could have been copied from w. Let e(w, v) = 1 if there is an edge between w and v and let e(w, v) = 0 otherwise. Further, let d(v) be the degree of node v. Then

ω(θ, G_{t}, v) = (1/(t(t − 1))) Σ_{w∈𝒞(G_{t},v)} φ(θ, w, v),

where

φ(θ, w, v) = π p^{d(v)−e(w,v)}(1 − p)^{d(w)−d(v)} q^{e(w,v)}(1 − q)^{1−e(w,v)} + (1 − π) I(d(v) = e(w, v)) r^{e(w,v)}(1 − r)^{1−e(w,v)},

the first term is included only for those w whose neighbors include every neighbor of v other than w itself (i.e., v could be a duplicate of w), and I is the indicator function. Even though this, in principle, provides the means to compute the likelihood, the method is computationally too intensive even for moderately sized networks; the number of recursive calls that has to be made becomes astronomical in a short time. For a fully connected graph (or a graph with no connections at all) the number of recursive calls is easily seen to be ⌊(e − 1)t!⌋, where e is Euler's number, e ≈ 2.71, and ⌊x⌋ denotes the largest integer smaller than or equal to x. Keeping a list of already calculated likelihood values reduces the number of calls to t·2^{t−1} − t + 1 (see Supporting Text), but this number is still extremely large even for small values of t. In terms of computational time it is worth mentioning that each lookup in the list can be done in at most log_{2}(2^{t} − 1) ≈ t operations, because there are at most 2^{t} − 1 entries (the number of nonempty subsets of t objects) in the list. Fig. 2 provides some examples. For the investigated parameter values the number of calls appears to increase subexponentially, whereas the number for a complete graph is superexponential. Still the numbers become very large; e.g., for t = 20 the number of calls is of the order of 10^{6}, compared with 10^{7} for a complete graph.
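The recursion and the memoization over node subsets can be sketched in Python as follows. The helper `omega` computes the weight ω(θ, G, v) of v being the last added node, derived by us from the two rules of duplication with random attachment; all names and the exact factorization are ours, and the sketch is feasible only for small t.

```python
def omega(g, alive, v, pi, p, q, r):
    # Single-event weight: (1/t) * P(G | delta(G, v)) under duplication
    # with random attachment, restricted to the node set `alive` (t = |alive|).
    t, rest = len(alive), alive - {v}
    nv = g[v] & rest                  # v's neighbors in the current subgraph
    total = 0.0
    for w in rest:
        e = 1 if w in g[v] else 0
        nw = g[w] & rest              # w's neighbors after deleting v
        if nv - {w} <= nw:            # w is a feasible copy source (rule 1)
            total += pi * p ** (len(nv) - e) \
                        * (1 - p) ** (len(nw) - len(nv) + e) \
                        * q ** e * (1 - q) ** (1 - e)
        if nv <= {w}:                 # rule 2 could only have created edge (v, w)
            total += (1 - pi) * r ** e * (1 - r) ** (1 - e)
    return total / (t * (t - 1))

def likelihood(g, theta):
    # Exact likelihood via the recursion, memoized on the surviving node set;
    # at most t * 2^(t-1) - t + 1 distinct subproblems, so only small t works.
    memo = {}

    def L(alive):
        if len(alive) == 1:
            return 1.0                # single-node (irreducible) start graph
        if alive not in memo:
            total = 0.0
            for v in alive:
                w = omega(g, alive, v, *theta)
                if w > 0.0:
                    total += w * L(alive - {v})
            # A graph with no positive-weight removal is treated as the
            # irreducible start; its likelihood enters as the constant 1.
            memo[alive] = total if total > 0.0 else 1.0
        return memo[alive]

    return L(frozenset(g))
```

As a sanity check, for a two-node graph grown by pure duplication (π = 1) the likelihood of the single edge is simply q, and of no edge 1 − q.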
Importance Sampling (IS)
IS is an efficient simulation (variance reduction) technique that in many cases provides ways to approach quantities for which exact or numerical results are otherwise difficult to obtain; see, e.g., refs. 12 and 13 for a comprehensive treatment. The IS scheme implemented here for computing the likelihood is inspired by the recursion relation (Eq. 1), which allows the likelihood to be written as an expectation over a Markov chain (see below). Our approach makes direct use of the recursive form of the likelihood function. Similar schemes have been proposed in different contexts; see, for example, refs. 14–16 for proposed schemes for inference on the so-called coalescent (17). We apply IS to random graph models, but the flexibility of IS [or sequential IS (13)] schemes makes them powerful tools for the statistical analysis of (biological) network models.
We rewrite Eq. 1 in the form

L(θ; G_{s}) = Σ_{v∈ℛ(G_{s})} S(θ_{0}, θ, G_{s}, v) (ω(θ_{0}, G_{s}, v)/W(θ_{0}, G_{s})) L(θ; δ(G_{s}, v)), [2]

where

S(θ_{0}, θ, G_{s}, v) = ω(θ, G_{s}, v) W(θ_{0}, G_{s})/ω(θ_{0}, G_{s}, v) and W(θ_{0}, G_{s}) = Σ_{v∈ℛ(G_{s})} ω(θ_{0}, G_{s}, v)

for s > t_{0}, and S(θ_{0}, θ, G_{t_{0}}, v) = 1. Using lemma 1 in ref. 14 (the lemma relates generally to Markov chains with a stopping rule), the likelihood in Eq. 2 can be written as an expectation

L(θ; G_{t}) = E[∏_{s=t_{0}}^{t} S(θ_{0}, θ, G_{s}, v_{s})], [3]

where v_{s} denotes the node whose removal turns G_{s} into G_{s−1}.
The usability of Eq. 3 rests on the fact that all irreducible graphs derived from G_{t} are isomorphic and thus have the same likelihood. In Eq. 3, the expectation is with respect to the probabilities proportional to ω(θ_{0}, G_{s}, v), s = t_{0} + 1, …, t, that define a Markov chain on graphs. This Markov chain is not the same as the one defined by the DA model. However, it motivates the following simulation scheme:

Let G_{t}^{(i)} = G_{t}.

For s = t, t − 1, …, t_{0} + 1, choose v_{i} with probability proportional to ω(θ_{0}, G_{s}^{(i)}, v_{i}) and let G_{s−1}^{(i)} = δ(G_{s}^{(i)}, v_{i}).

Let l_{θ0}^{(i)}(θ) = ∏_{s=t0}^{t} S(θ_{0}, θ, G_{s}^{(i)}, v_{i}).

Repeat steps 1–3 N times and approximate L(θ; G_{t}) by

L̂(θ; G_{t}) = (1/N) Σ_{i=1}^{N} l_{θ_{0}}^{(i)}(θ). [4]
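Steps 1–4 can be sketched in Python as follows. The helper `omega` computes the single-event weight ω(θ, G, v) implied by the two rules of duplication with random attachment (the derivation, factorization, and all names are ours); each pass of the inner loop removes one node with probability proportional to its weight under the driving value θ_{0} and accumulates the corresponding importance weight S.

```python
import random

def omega(g, alive, v, pi, p, q, r):
    # Single-event weight: (1/t) * P(G | delta(G, v)) under duplication
    # with random attachment, restricted to the node set `alive` (t = |alive|).
    t, rest = len(alive), alive - {v}
    nv = g[v] & rest
    total = 0.0
    for w in rest:
        e = 1 if w in g[v] else 0
        nw = g[w] & rest
        if nv - {w} <= nw:            # w is a feasible copy source (rule 1)
            total += pi * p ** (len(nv) - e) \
                        * (1 - p) ** (len(nw) - len(nv) + e) \
                        * q ** e * (1 - q) ** (1 - e)
        if nv <= {w}:                 # rule 2 could only have created edge (v, w)
            total += (1 - pi) * r ** e * (1 - r) ** (1 - e)
    return total / (t * (t - 1))

def is_likelihood(g, theta, theta0, n_paths, seed=0):
    # Importance-sampling approximation of L(theta; G_t), steps 1-4.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        alive, contrib = frozenset(g), 1.0
        while len(alive) > 1:
            nodes = sorted(alive)
            w0 = [omega(g, alive, v, *theta0) for v in nodes]
            W0 = sum(w0)
            if W0 == 0.0:             # an irreducible graph has been reached
                break
            v = rng.choices(nodes, weights=w0)[0]
            # S = omega(theta) * W(theta0) / omega(theta0) for the chosen node.
            contrib *= omega(g, alive, v, *theta) * W0 / omega(g, alive, v, *theta0)
            alive = alive - {v}
        total += contrib
    return total / n_paths
```

For highly symmetric toy graphs every path carries the same weight, so the estimate is exact for any N; in general the paths differ and the average in Eq. 4 converges to the likelihood as N grows.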
Each of the N draws is called a path. The value θ_{0} is the so-called driving value (12), which can be chosen either arbitrarily or in some more informed way, e.g., by using summary statistics. Fig. 3 provides an example for t = 10. There are at least two things that are worth pointing out. First, the overall form of the simulated likelihood curve is similar to the true likelihood curve even for small N, and thus the relative likelihood L(θ; G_{t})/L(θ_{1}; G_{t}) (for fixed θ_{1}) might be estimated accurately even for small N (depending on t); this observation is also made by other authors (14, 18). Second, the driving value influences the accuracy of the simulated likelihood curve. A driving value close to the true value is likely to provide faster convergence to the true likelihood curve than a driving value far from the true value. However, for all θ_{0} and θ the convergence is of order 1/√N, because Eq. 4 is an unbiased estimator of the likelihood. Hence, the rate of convergence depends solely on the standard deviation of the terms l_{θ0}^{(i)}(θ) in Eq. 4.
We calculated the average run time for one path for different network sizes. Graphs with different numbers of nodes were generated with parameter θ_{1} = (1, 0.66, 0.33, 0) or θ_{2} = (1, 0.33, 0.33, 0), and a path was drawn 15 times for each parameter value by using driving value θ_{0} = θ_{i}, i = 1, 2. The observed run time was approximately polynomial with an estimated degree of 2.34 and 2.84, respectively; see Supporting Text for a description of the algorithm and Fig. 6, which is published as supporting information on the PNAS web site, for a plot of run times. For θ_{1} the average number of links per node increases with network size, whereas for θ_{2} it stabilizes and is lower than the average for θ_{1}. Apparently, in these cases a higher average number of links per node has the opposite of the expected effect on the run times.
Application
We applied the IS method to protein interaction data from Caenorhabditis elegans (19). The largest connected component was selected (2,368 nodes; as described in ref. 20), and the data were analyzed by fixing three of the four parameters for the sake of demonstration: π = 1, q = 0.33, and r = 0; θ_{0} = (1, 0.66, 0.33, 0) was used as driving value; and p was varied between 0 and 1. Fig. 4 shows simulated likelihood curves. The maximum-likelihood estimate of p is ≈0.28. In other words, assuming the model and the other parameters are correct, only ≈28% of a node's links survive when the node is duplicated.
Two things transpire from the likelihood curves. The first is that the variance of the contribution from one path l_{θ0}^{(i)}(θ) (see Eq. 4) is small compared with the log-likelihood L(θ; G_{t}) itself. We take this observation as evidence that even with a small number of paths the importance-sampled likelihood is a good approximation to the true likelihood; possibly this is a large-sample (or network-size) effect. The second is that the confidence intervals (CIs) of the maximum-likelihood estimator of p appear to be wide. The CIs are likely to be even wider if all four parameters are estimated. After removing all removable nodes there are still 735 nodes left, which is a considerable number. Both degree sequences could well be explained by a power law with exponent ≈2 (see Fig. 5).
Discussion
While the inference of, e.g., coexpression networks and the statistical inference and analysis of graphical models and Bayesian networks have received great attention from the statistics community (see, e.g., refs. 21–23), the analysis of experimentally derived molecular networks has largely relied on ad hoc descriptions and summary statistics. This situation contrasts with the social sciences, where networks are much smaller and inferential procedures have been used with considerable success (24). Thus there is a rich literature on the statistical inference of networks and, perhaps to a lesser extent, also on the analysis of smaller (social) network structures.
In this article we have attempted to calculate the full likelihood of a network, G_{t}, assuming a mathematical model of network growth that was inspired by basic evolutionary processes.** For the class of DA models we consider here, we can show that it is possible to calculate the likelihood without specifying a distribution on the initial graph, simply because it turns out that any two paths consistent with G_{t} initiate with isomorphic graphs. One can also choose to think of the likelihood of G_{t} as conditional on the isomorphism class of the initial graph.
We further showed that the likelihood, in principle, can be calculated by using a recursion, but also that this approach is computationally too demanding to be practical for even moderate networks. As an alternative, we suggested adopting an IS approach that samples paths consistent with G_{t}, from which the likelihood can be computed. In our implementation this approach also runs into time constraints, but only for graphs exceeding ≈2,500 nodes.
Our work leaves room for improvements and also raises some questions. The application raises the question of whether the DA models are adequate models for describing the C. elegans network. In previous work (9) we have shown that power laws describe the degree distribution of the C. elegans data statistically better than other types of distributions (derived from normal, exponential, and other standard distributions) (see also Fig. 5). However, the initial (irreducible) graph comprises almost onethird of the nodes in the entire network (735/2,368 = 31%), implying that there are loops and cycles in the network that are not consistent with how DA models build up graphs (see also Lemma 6 in Supporting Text).
This observation leads us to our second point: development of more realistic models. We have stuck with the class of DA models because they are widely used and discussed in the literature. Generalization to directed DA models should be straightforward, but models that evolve by other mechanisms than duplication and attachment should also be possible to handle in a way similar to that described here. Very general models allowing for insertion and deletion of edges at any time are straightforwardly handled by the theory because the graph eventually is reduced to a single node. However, this simplicity comes at the cost of computational complexity because the number of paths from G_{t} to G_{t0} (containing a single node) is now even larger and may quickly become unmanageable. More importantly, such a model is biologically implausible. Nature is not likely to remove functions and interactions without having reasonable substitutes for them. Models that allow for moderate deletions of nodes and/or edges are biologically much more realistic. For example, one could allow a link between two nodes, v and w, to be removed if a copy v′ of v exists that also has a link to w. Such features are biologically plausible but potentially require more bookkeeping for calculating the likelihood.
Finally, it would be natural to engage in Markov chain Monte Carlo and Gibbs sampling methods to improve the speed and perhaps accuracy of the computations, but also to try other IS schemes. As discussed in ref. 18, a recursion like Eq. 1 opens more possibilities than the one presented here. These, including the one presented here, fall under the general principle of sequential IS, in which one builds up the sampling distribution sequentially; see, e.g., ref. 13 for general discussion and examples. Sequential IS might be particularly useful for random graphs because one can envisage the graph as being built up step by step. However, the shapes of the simulated likelihood curves in Fig. 4 also raise the question of whether, in large networks, individual paths provide a good estimate of the likelihood.
Acknowledgments
We thank Sylvia Richardson for helpful discussions. C.W. and M.P.H.S. are supported by the Carlsberg Foundation and the Royal Society. C.W. is supported by the Danish Cancer Society. M.P.H.S. is supported by a Wellcome Trust Fellowship and a European Molecular Biology Organization Young Investigator Award.
Footnotes
 ↵^{‡}To whom correspondence should be addressed. E-mail: wiuf@birc.au.dk

Author contributions: C.W. designed research; C.W., M.B., O.H., and M.P.H.S. performed research; C.W. and M.B. analyzed data; and C.W. and M.P.H.S. wrote the paper.

↵^{¶} In contrast, graphical models like Bayesian networks and chain graphs (8) consider a fixed graph that determines the probabilistic dependencies in the data; here the graph/network is the data, i.e. the graph is stochastic.

↵^{‖} This interpretation is similar in spirit to subsuming details of the demographic history of a population into the effective population size N_{e} in population genetics, see e.g. ref. 11.

↵** Duplication is a straightforward biological process. For the attachment process there is, apart from horizontal gene transfer in bacteria, no easy biological equivalent; rather it is convenient to subsume all nonduplication processes into this mechanism.

Conflict of interest statement: No conflicts declared.

This paper was submitted directly (Track II) to the PNAS office.
Abbreviations
 DA, duplication-attachment;
 IS, importance sampling.
 Received January 4, 2006.
 © 2006 by The National Academy of Sciences of the USA
References
1. Thompson, C. J.
2. Barabási, A.-L. & Albert, R.
3. Bollobás, B.
4. Dorogovtsev, S. N. & Mendes, J. F. F.
5. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskij, D. & Alon, U.
6. Chung, F., et al.
7. Kumar, S. R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. & Upfal, E.
8. Lauritzen, S. L.
9. Stumpf, M. P. H., Ingram, P. J., Nouvel, I. & Wiuf, C.
10.
11. Ewens, W. J.
12. Ripley, B.
13. Liu, J. S.
14.
15.
16.
17. Kingman, J. F. C.
18.
19. Li, S., Armstrong, C. M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.-O., Han, J.-D. J., Chesneau, A., Hao, T., et al.
20.
21.
22. Schafer, J. & Strimmer, K.
23. Pournara, I. & Wernisch, L.
24. Snijders, T. A. B.