New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Estimating the size of the human interactome

Edited by Burton H. Singer, Princeton University, Princeton, NJ, and approved February 19, 2008 (received for review August 27, 2007)
Abstract
After the completion of the human and other genome projects it emerged that the number of genes in organisms as diverse as fruit flies, nematodes, and humans does not reflect our perception of their relative complexity. Here, we provide reliable evidence that the size of protein interaction networks in different organisms appears to correlate much better with their apparent biological complexity. We develop a stable and powerful, yet simple, statistical procedure to estimate the size of the whole network from subnet data. This approach is then applied to a range of eukaryotic organisms for which extensive protein interaction data have been collected and we estimate the number of interactions in humans to be ≈650,000. We find that the human interaction network is one order of magnitude bigger than the Drosophila melanogaster interactome and ≈3 times bigger than in Caenorhabditis elegans.
One of the perhaps most surprising results of the genomesequencing projects was that the number of genes is much lower than had been expected and is, in fact, surprisingly similar for very different organisms (1, 2). For example, the nematode Caenorhabditis elegans appears to have a similar number of genes as humans, whereas rice and maize appear to have even more genes than humans. It was then quickly suggested that the biological complexity of organisms is not reflected merely by the number of genes but by the number of physiologically relevant interactions (1, 3). In addition to alternative splice variants (4), posttranslational processes (5), and other (e.g., genetic) factors influencing gene expression (6, 7), the structure of interactome is one of the crucial factors underlying the complexity of biological organisms. Here, we focus on the wealth of available protein interaction data and demonstrate that it is possible to arrive at a reliable statistical estimate for the size of these interaction networks. This approach is then used to assess the complexity of protein interaction networks in different organisms from present incomplete and noisy protein interaction datasets.
There are now fairly extensive protein interaction network (PIN) datasets in a number of species, including humans (8, 9). These have been generated by a variety of experimental techniques (as well as some in silico inferences). Although these techniques and the resulting data are (i) notoriously prone to false positives and negatives (10, 11), and (ii) result in highly idealized and averaged network structures (12), such interaction datasets are increasingly turning into useful tools for the analysis of the functional (e.g., ref. 13) and evolutionary properties (14) of biological systems. In particular, in Saccharomyces cerevisiae we are beginning to have a fairly complete description of the protein interaction network that is accessible with current experimental technologies; the recent highquality literaturecurated dataset of Reguly et al. (15) provides us with a dataset that should be almost completely free from false positives. For most other organisms, however, interaction data are still far from complete and it has recently been shown that subnetworks, in general, have qualitatively different properties from the true network (16–18). Although the importance of networksampling properties had only been realized relatively recently, this aspect of most systems biology data are increasingly being recognized (11, 19) as important.
There are, however, some properties of the true network that can be inferred even from subnet data, and here we show that the total network size is one property for which this is the case. Present proteininteraction datasets enable us to estimate the size of the interactomes in different species by using graph theoretical invariants. This is particularly interesting for species where more than one experimental dataset is available. Below we first describe a robust and very general estimator of network size from partial network data that overcomes this problem. We then apply it to available PIN data in a range of eukaryotic organisms. In supporting information (SI) Text we demonstrate the power of this approach by using extensive simulation studies.
Estimating Interactome Size
Here, we develop an approach for estimating the size of a network from incomplete data. We will show below (and by using extensive simulations in SI Text) that for a given species estimates from different independent datasets—generated by different methods such as yeasttwohybrid and TAP tagging—yield estimates for the interactome size that are in excellent agreement.
We are concerned with a true network, 𝒩, which has N_{𝒩} nodes and M_{𝒩} edges. The sets of nodes and edges are given by 𝒱_{𝒩} and ℰ_{𝒩}, respectively; these define the graph representation of the true network: We pick a subset of nodes 𝒱_{𝒮} ⊆ 𝒱_{𝒩} and study properties of the subgraph G_{𝒮} induced by the nodes in 𝒱_{𝒮} where the set of edges observed in the 𝒮 is a subset of the total set of edges, ℰ_{𝒮} ⊆ ℰ_{𝒩}. Our aim is to predict the number of interactions in the true network G_{𝒩} based on the available data in the subnet, G_{𝒮}.
We assume that the network, G_{𝒩}, is generated according to some (unknown) model characterized by a parameter (vector) θ, and subsequently the observed network, G_{𝒮} is sampled from it. Then where it is assumed that the sampling is independent of the networkgenerating model. The parameter p refers to a general sampling process, and not only independent node sampling. Furthermore, we assume the order N_{𝒩} of the network is known and allow nodes to be annotated with information not related to the wiring of the network (e.g., GO terms or protein family classes). Consequently, the sum is over networks, G_{𝒩}, with N_{𝒩} (labeled) nodes only. For convenience, we take labeling information to be included in N_{𝒩} and N_{𝒮} (the order of G_{𝒮}).
If sampling only depends on the nodes in the network and not on their connections, then P_{p}(G_{𝒮}G_{𝒩}) splits into a product of two terms, where Q_{p}(N_{𝒮}) is a term denoting the probability of sampling the nodes in the observed PIN and q(G_{𝒮}, G_{𝒩}) denotes how many ways this can be done given the (labeled) nodes in G_{𝒩}—by assumption, labeling of nodes is the same in all possible G_{𝒩}s. For example, if the nodes are unlabeled and have degree zero, then q(G_{𝒮}, G_{𝒩}) = (_{N𝒮}^{N𝒩}). If all nodes have degree one, a similar factor can be derived based on the number of degree one (d_{1}) and degree zero (d_{0}) nodes that are observed in the PIN: q(G_{𝒮}, G_{𝒩}) is the number of ways one can choose d_{0} and d_{1} out of the N_{𝒩}/2 pairs of connected nodes in G_{𝒩}. If all nodes are labeled then q(G_{𝒮}, G_{𝒩}) = 1, because one can only select the nodes in the PIN in one way.
It follows that Q_{p}(N_{𝒩}) is sufficient for inference on p (the remaining part of the likelihood does not depend on p). In the case of independent node sampling, each node with probability p, we have Q_{p}(N_{𝒮}) = p^{N𝒩} (1 − p)^{N𝒩 − N𝒮} and the maximum likelihood estimate of p is which is unbiased and consistent.
From the likelihood Eq. 4 it follows that where G*_{𝒩} is a specific network (to distinguish it from the sum over all networks in the denominator). Note that this conditional probability does not depend on p and that, in principle, we can only gain knowledge about the interactome if something is assumed about the networkgenerating model. Note also that this is a general restriction that is not related to independent node sampling alone.
A reasonable estimate of the edge probability in G_{𝒩} is where M_{𝒮} is the number of edges in the PIN. It leads to the following estimate of the interactome size: The estimate is unbiased and consistent provided the networkgenerating mechanism ensures some form of uniformity, as is the case for random graphs (Figs. S1 and S2). For example, if G_{𝒩} has a star topology with one node of degree N_{𝒩} − 1 and the remaining of degree 1, then M_{𝒮} = 0 with probability 1 − p and M_{𝒮} = (N_{𝒩} − 1)/p̂ with probability p; hence, M̂_{𝒩} is not consistent. We will demonstrate below that the assumption of independent sampling of nodes is not too restrictive and should apply to many, in particular, highthroughput, experimental studies.
So far we have assumed that the number of ORFs, N_{𝒩} in an organism is known from genome surveys. Total genome size is, however, still not precisely known in most organisms. Uncertainty in the N_{𝒩} is, however, easily incorporated. Assume that the value N_{𝒩} is associated with an error or uncertainty ε (i.e., if the genome contains N_{0} proteincoding genes of which N_{𝒩} are known, then ε = (N_{0} − N_{𝒩})/N_{0}. Then let N_{𝒩} := N_{0} (1 ± ε) and for ε ≲ 0.1 we have Replacing p̂ in Eq. 8 with ^{~}p yields the errorcorrected estimate for the true network size Thus, an uncertainty of ε in the number of nodes in the true network results in an uncertainty of 2ε for the number of edges in the true network.
To assess the variability of the estimator we can construct approximate bootstrap confidence intervals (CI) (20). The number of edges is given by in terms of the degree sequence. Now let d = {d_{1}, d_{2}, …, d_{𝒩𝒮}} be the set of degrees of all of the nodes in the graph 𝒢_{𝒮} describing the subnet. Then we generate bootstrap replicates, d*, by sampling the degrees of the nodes in the sample with replacement N_{𝒮}. For each bootstrap replicate, d*, we obtain an estimate M*_{𝒮} (which may be a noninteger because of the factor 1/2 in Eq. 11; this does not affect the estimator). Creating a sufficiently large number of bootstrap replicates, d*, thus allows us to calculate the bootstrap CIs; these have very good coverage properties, as shown in Figs. S3 and S4.
The derivation of Eq. 8 does not depend on any restrictive assumptions (see SI Text) but is a generic property of random graphs and their subnets. Crucially Eq. 8 is valid irrespective of the degree sequence or other summary statistics of the networks^{††}; confidence intervals (CI) and their coverage properties (20) may, however, depend on the degree sequence or network structure. Because there is no sufficient statistic for general networks (17) [i.e., a summary statistic that would include all information about the likelihood (21) of a network] it is also not possible to improve on these estimators by, for example., including the numbers of observed triangles or the clustering coefficient. The only limitation is the assumption of independent sampling. This is, however, also implicit in all previous attempts at estimating interactome sizes (22–24). Below we show how nonrandom sampling schemes can be described and how falsepositive and falsenegative rates of PIN data affect our estimate.
Other NodeSampling Schemes
The above approach can be generalized for datasets that are ascertained in certain ways and can thus also deal with experimental bias.
Independent but Nonuniform Sampling.
We assume independent sampling of nodes. Let node i have a probability p_{i} for being included in the subnet. We allow p_{i} ≠ p_{j} and only assume that the p_{i} values are drawn independently from the same probability distribution, where α is a parameter (potentially vector valued). The properties of F are not of importance. It follows that p̂ is unbiased and also consistent, because for large networks. Now consider an edge e_{ij} (i ≠ j); then the probability of observing this edge in the subnet is and Likewise p̂^{2} is consistent (25), hence also unbiased for large networks.
Dependent Sampling.
Here, we assume as above that p_{i} is drawn from some probability distribution, p_{i} ∼ F_{i}(α) that might, however, depend on information related to node i, for example, the degree or functional classification of i; that is, F_{i}(α) = F(α; D_{i}), where D_{i} denotes this information. Although measures for expression abundance may be such a factor, this appears not to be the case for the datasets considered here (Fig. S5). Hence, we might take D_{i} as an additional parameter in the function F.
In addition, we assume the network is uncorrelated with respect to this information, that is, P(D_{i}, D_{j}) = P(D_{i})P(D_{j}); and, given the probabilities p_{i}, we assume nodes are drawn independently of each other. This assumption is justified for all networks in which the degree–degree correlation of interacting nodes is determined by the degree distribution. This is approximately the case for the networks considered here^{‡‡}. It follows that that is, p̂ is unbiased. Note that which in turn leads to and consequently consistency. Likewise, it follows that E(π_{ij}) = E(p_{i}p_{j}) = 〈p〉^{2}, and that the edge sampling probability consistently is estimated by p̂^{2}.
Effects of Uncertain Data on Estimated Interactome Sizes
So far we have assumed that the interaction data are correct. This is not the case for protein interaction data (10–12, 26–28). Here, we show that it is possible to include noisy data and that the estimates given in Table 1(see also Fig. 2) are not likely to change severely (e.g., by an order of magnitude) for realistic rates of false positives and false negatives. We note that the sampling theory developed in the previous sections needs modification to take false positives and false negatives into account; for example, the sum in Eq. 4 should be over all possible networks and not just those containing the observed PIN data.
Let the number of true interactions in a network with N nodes be denoted by M; if the data collection process is not perfect, then (assuming independence) the number of reported interactions, M̃ will generally be different from M. Now let M_{TP}, M_{FN}, M_{FP}, and M_{TN} denote the truepositive, falsenegative, falsepositive, and truenegative results, respectively. We trivially have and The rates for true positives and false negatives are defined by Thus, for a given number of reported edges/interactions and estimates of the truepositive and falsenegative rates, μ̂ and ρ̂, we obtain an estimate for the true number of interactions Thus, for a fixed network (or subnet) the falsepositive and falsenegative rates affect the estimates of the true number of interactions in a simple linear manner (see Fig. S6).
Results
We use Eq. 8 to estimate interactome sizes in humans and three other eukaryotic organisms: S. cerevisiae (29–32), C. elegans (33), and D. melanogaster (34–36). But we begin with an illustration of the power of this simple estimator by applying it to S. cerevisiae PIN data; here, we have treated the presently available PIN data as a proxy for a complete “interaction network” whose size we are trying to predict. In Fig. 1A we show the distributions of estimates obtained from 1,000 randomly chosen subnets covering 20%, 40%, 60%, and 80% of the available PIN data [taken from the Database of Interacting Proteins (DIP) (37)]. In Fig. 1B we show the coverage properties of the bootstrap 95% CIs for sampling the same sampling fractions. Together with the simulation studies discussed in SI Text, the results in Fig. 1 suggest that the estimator M̂_{𝒩} provides an accurate and reliable way of estimating interactome sizes from present data. Interactome size estimates and their CIs for experimental PIN datasets are shown in Table 1 and Fig. 2 for the organisms considered here. The DIP datasets (always shown in green) are mainly based on highthroughput studies, supplemented by interactions collected from the literature; as such, they generally cannot be treated as independent from the other datasets. For humans, however, there is negligible overlap between the DIP databases and the two recent highthroughput surveys and we can treat the three estimators as approximately independent.
Based on the results in Table 1 and Fig. 2, we would therefore expect—given present experimental methods and ignoring multiple splice variants—the human interactome to contain ≈650,000 protein interactions. Thus, it is approximately an order of magnitude larger than the estimated D. melanogaster interactome, and a factor of 3 more complex than the estimated C. elegans interactome; this contrasts with relative genome sizes of ≈1.8 and ≈1.2, respectively. The results for the S. cerevisiae PIN suggest that it will ultimately contain ≈25,000–35,000 interactions (see also Table 1); this agrees well with previous estimates (22, 23). It also agrees well with estimates obtained from the recent data generated by Reguly et al. (15): for the pure literaturecurated set we obtained 37,000 interactions; for the complete network data we obtained an estimate of ≈35,000 interactions in the yeast PIN. These two datasets were, however, collected from the literature and the sampling process is thus much harder, perhaps even impossible, to model accurately.
By using Eq. 24 the impacts of falsepositive and falsenegative rates are easily assessed (see also ref. 38). We find that the linear effect of the error rates on the estimated number of true interactions results in a comparatively modest effect. The estimates of the truepositive rates in PIN datasets range from 35% (33) to 84% (34); there are fewer estimates for the falsenegative rate that are on the order of 20–40% (10) obtained for different S. cerevisiae datasets. It appears that, for realistic rates of true positive and false positive, the estimate of the human interactome size remains very similar compared to the simple estimate obtained in this article of ≈650,000 protein–protein interactions. Similar curves can be drawn for the other species, too, and in each case we obtain comparable values for most combinations of realistic error rates. Thus, we believe that error rates exert a comparatively moderate effect on the estimator (Eq. 8).
Overall, it therefore appears that estimates obtained from Eq. 8 should be accurate to within less than an order of magnitude even under the very worst circumstances. A much more realistic estimate, however, can be obtained from comparing the different and essentially independent estimates for S. cerevisiae. These findings suggest that an accuracy of approximately a factor of 2 is more realistic. Reassuringly, these results are confirmed when applying a recent multimodel inference procedure (39) that deals with incomplete network data.
Discussion
We have shown that it is possible to estimate the size of interactomes reliably from present partial interaction data. Our estimator is powerful and robust, relying on assumptions that appear to be met by typical systematic highthroughput studies. Unlike the previous approach of Hart et al. (24), who implicitly assume that interactions do not occur between surveyed proteins and those not yet surveyed, our estimate deals with missing data in a coherent and statistically meaningful manner; the route taken by Grigoriev (23) can be understood as a special case of the present approach when two or more datasets are available. Moreover, noise and different sampling/ascertainment strategies are straightforwardly included in the analysis (38, 40). We have illustrated the power of this approach by using simulated sampling processes in S. cerevisiae and have found that the estimator, Eq. 8, and the bootstrap confidence intervals have very good coverage properties. We have then applied this inferential framework to published datasets in four eukaryotic organisms. We found that the predicted interactome sizes differ quite considerably between these species. For example, the human interactome appears to be an order of magnitude larger than the D. melanogaster interactome. Unfortunately, for maize and rice, which have comparable or even larger number of genes to humans, only tiny PIN datasets are available and we cannot obtain useful estimates for their respective interactome sizes. If conventional assumptions about the different complexity of organisms are indeed correct, and if interactome size does reflect organismic complexity (1–3, 41), then we would expect these organisms to have smaller interactomes than humans. The increase of interactome size with number of proteins/ORFs should thus not be uniform or even monotonic. We note that the estimate of ≈ 650,000 interactions means that the human PIN will still be relatively sparse: this corresponds to only ≈0.2% of all possible pairwise interactions being present; for most other species, however, the network is even sparser.
There are a number of other factors that may contribute to an explanation of the increase in phenotypic bauplan complexity between species: the diversity of the transcriptome (42) and proteindomain architecture (43) have all been implicated in the literature. Here, we have demonstrated that interactome sizes are consistent with biological intuition about the complexity of eukaryotic organisms. We note that our estimator is very flexible and reflects the quality of present data: we predict the number of interactions that are detectable given present experimental technology. For example, we have not considered (physiologically probably very important) transient or conditionspecific interactions. Should more sensitive and reliable experimental methodologies or better estimates of experimental error rates become available in the future, then Eq. 8 can, of course, be used to predict an updated number of protein–protein interactions for an organism. Our formalism is also readily extended to directed network data (such as generegulation networks).
As a final note, we want to stress that the estimates necessarily reflect experimental technology. Thus, the estimates in Table 1 refer only to the types of interactions that are detectable given present experimental methods and protocols. The estimator for the size of the true network, however, will remain universally correct for suitable datasets and for all types of networks. We will thus be able to use it in the future and apply it to other network datasets as well.
Acknowledgments
This work was supported by the Wellcome Trust (M.P.H.S., E.d.S., and T.T.), the Royal Society and the Carlsberg Foundation (M.P.H.S. and C.W.), and an EMBO Young Investigator fellowship (to M.P.H.S.). C.W. is supported by the Danish Research Council.
Footnotes
 ^{§}To whom correspondence may be addressed. Email: m.stumpf{at}imperial.ac.uk or wiuf{at}birc.au.dk

Author contributions: M.P.H.S., M.L., and C.W. designed research; M.P.H.S., T.T., E.d.S., M.L., and C.W. performed research; M.P.H.S., T.T., E.d.S., R.S., H.J.A., and C.W. analyzed data; and M.P.H.S., M.L., and C.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

See Commentary on page 6795.

This article contains supporting information online at www.pnas.org/cgi/content/full/0708078105/DCSupplemental.

↵†† Eq. 8 is a general result for general (random) graphs; it is equally true for all ensembles of random graphs such as Erdös–Rényi and scalefree random graphs. In SI Text we further illustrate the simple quadratic relationship by using simulations.

↵‡‡ The degree–degree distribution is not significantly different from the product degree distribution (by using the Kolmogorov–Smirnov test); that is, P(k, l) ≈ P(k)P(l) for the datasets considered here.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 ↵
 Venter J,
 et al.
 ↵
 Copley R
 ↵
 Tian B,
 Pan Z,
 Lee JY
 ↵
 Henikoff S
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Deeds EJ,
 Ashenberg O,
 Shaknovich EI
 ↵
 de Silva E,
 Stumpf M
 ↵
 Rives AW,
 Galitski T
 ↵
 ↵
 ↵
 Stumpf M,
 Wiuf C,
 May R
 ↵
 Stumpf M,
 Wiuf C
 ↵
 Wiuf C,
 Stumpf M
 ↵
 ↵
 Efron B,
 Tibshirani R
 ↵
 Cox D,
 Hinkley D
 ↵
 Hazbun T,
 Fields S
 ↵
 Grigoriev A
 ↵
 ↵
 Silvey S
 ↵
 ↵
 ↵
 ↵
 ↵
 Ito T,
 et al.
 ↵
 ↵
 ↵
 Li S,
 et al.
 ↵
 Giot L,
 et al.
 ↵
 ↵
 Formstecher E,
 et al.
 ↵
 Duan X,
 Xenarios I,
 Eisenberg D
 ↵
 ↵
 Stumpf M,
 Thorne T
 ↵
 ↵
 ↵
 Carninci P,
 et al.
 ↵
 Chothia C,
 Gough J,
 Vogel C,
 Teichmann S
Citation Manager Formats
More Articles of This Classification
Physical Sciences
Applied Mathematics
Biological Sciences
Related Content
Cited by...
 Identification of two distinct peptidebinding pockets in the SH3 domain of human mixedlineage kinase 3
 Determining the factors driving selective effects of new nonsynonymous mutations
 Interactome disassembly during apoptosis occurs independent of caspase cleavage
 Global Membrane Protein Interactome Analysis using In vivo Crosslinking and Mass Spectrometrybased Protein Correlation Profiling
 A DoubleBarrel Liquid ChromatographyTandem Mass Spectrometry (LCMS/MS) System to Quantify 96 Interactomes per Day
 Uncovering diseasedisease relationships through the incomplete interactome
 2P2IHUNTER: a tool for filtering orthosteric proteinprotein interaction modulators via a dedicated support vector machine
 The Role of Structural Disorder in the Rewiring of Protein Interactions through Evolution
 Largescale De Novo Prediction of Physical ProteinProtein Association
 Toward an understanding of the protein interaction network of the human liver
 Incomplete and noisy network data as a percolation process
 Missing and spurious interactions and the reconstruction of complex networks
 Proteomewide Prediction of Signal Flow Direction in Protein Interaction Networks Based on Interacting Domains
 Discovery and Scoring of Protein Interaction Subnetworks Discriminative of Late Stage Human Colon Cancer
 Challenges and Rewards of Interaction Proteomics
 A truer measure of our ignorance