Freeenergy distribution of binary protein–protein binding suggests crossspecies interactome differences
See allHide authors and affiliations

Communicated by Ernest M. Henley, University of Washington, Seattle, WA, May 25, 2006 (received for review April 19, 2006)
Abstract
Major advances in largescale yeast twohybrid screening have provided a global view of binary protein–protein interactions across species as dissimilar as human, yeast, and bacteria. Remarkably, these analyses have revealed that all species studied have a degree distribution of protein–protein binding that is approximately scalefree (varies as a power law) even though their evolutionary divergence times differ by billions of years. The universal power law shows only the surface of the rich information harbored by these highthroughput data. We develop a detailed mathematical model of the protein–protein interaction network based on association free energy, the biochemical quantity that determines protein–protein interaction strength. This model reproduces the degree distribution of all of the largescale yeast twohybrid data sets available and allows us to extract the distribution of free energy, the likelihood that a pair of proteins of a given species will bind. We find that acrossspecies interactomes have significant differences that reflect the strengths of the protein–protein interaction. Our results identify a global evolutionary shift: more evolved organisms have weaker binary protein–protein binding. This result is consistent with the evolution of increased protein unfoldedness and challenges the dogma that only specific protein–protein interactions can be biologically functional.
Gaining a global view of protein–protein interaction (PPI) networks gives a new perspective to the understanding of all biological organisms (1–3). Advances in yeast twohybrid (Y2H) interaction have provided highthroughput readouts that generate maps of PPI networks in several organisms including man (4–11). These largescale interactomes revealed an approximately scalefreelike topology that is shared by each studied species. This means that in all organisms most proteins have one or two partners, but a few (so called hubs) have many partners. Thus the probability p(k) that a protein interacts with k others follows an approximate powerlaw distribution: p(k) ∝ 1/k ^{γ}. This conserved crossspecies PPI property is not surprising because networks with powerlaw distributions are ubiquitous appearing in systems as diverse as the internet, the citation index, and societies (12–14).
The discovery of a common topology of diverse systems whose functions are so strikingly different initiated the search for universal models to explain scalefree networks (13, 15). The concept of preferential attachment, in which in growing networks new vertices link preferentially to older nodes that are already highly connected (13), is very popular. Analysis of species separated by billions of years of evolution showed that this mechanism could also be involved in evolutionarily expanding protein–protein networks (16). Another newer idea of intrinsic fitness, in which two nodes are connected when the link is mutually beneficial, was proposed to explain scalefree networks (15). This class of models shows that intrinsic fitness is an essential property that underlies the powerlaw distribution and also allows the prediction and measurement of other properties. In particular, an exponential distribution of the fitness leads to a power law degree distribution of a network.
Protein–protein binding is determined by free energy of association as well as the concentrations of participating molecules (17). The biochemical manifestation of intrinsic fitness for protein–protein binding is that each protein has an inherent propensity for association. This idea opposes the view that PPIs are determined solely by a “lockandkey” mechanism involving complementarity. This article uses the properties of PPI networks to quantitatively explore the unorthodox view of protein interaction promiscuity.
The Y2H method reports binary results for protein–protein binding under a controlled setting (18). We assume that a Y2H measurement is an efficient way to measure a binary PPI, just as one can do for a pair of proteins in a test tube. Thus, the association reaction of two proteins, say A and B, is determined by the free energy difference ΔG ^{0} (19) between the state A + B and the AB final state (Fig. 1). The largescale Y2H data sets report the presence of an AB complex.
We discuss this in more detail. In Y2H screens, two fusion proteins are generated: one protein is constructed to have a DNAbinding domain attached to its N terminus, and its potential binding partner is fused to a transcriptional activation domain (18). Binding of the two proteins will form an intact and functional transcriptional activator. This newly formed transcriptional activator complex will then transcribe a reporter gene whose protein product can be assayed. Thus the presence of the reporter gene product generated is a measure of the association between two proteins. The probability of two proteins forming a complex is determined by their association constant, K _{a}, which is in turn related to the free energy measured in unit of RT.
In the largescale Y2H screens concentrations of all expressed hybrid proteins are expected to be approximately the same. Hence, the factor of protein concentrations, which surely plays a role in vivo, is negated in the binary interactions measured in the Y2H system. This is in contrast to PPI measurements using mass spectroscopy that depend on the native protein concentrations in cells (20). In this report, we focus on the data derived from Y2H screens and hence on the strength of protein–protein binding alone. The physical origin of PPI strength can be complex: hydrogen bonding, van der Waals, hydrophobic, and electrostatic interactions all play a role. However, a thermodynamic model can be developed irrespective of the nature of the freeenergy difference, ΔG ^{0}. In this study we developed a quantitative model using both exact simulation technique and semianalytic approximation to test whether the current largescale Y2H binding data sets obtained for multiple species can be interpreted in terms of a distribution of ΔG ^{0} of an organism, and furthermore whether distributions of free energy of interactions differ across species.
Results and Discussion
The overall strategy used to derive an organism’s freeenergy distribution of binary PPI is illustrated in Fig. 1 and detailed in Methods.
Largescale Y2H screens of protein interactions have been completed for several organisms including the bacteria Helicobacter pylori (8), the malaria parasite Plasmodium falciparum (6), yeast Saccharomyces cerevisiae (5, 11), worm Caenorhabditis elegans (7), fruit fly Drosophila melanogaster (4), and human (9, 10). For all of the organisms examined the distribution of Y2H PPIs approximately follows a power law (4–6, 8–11) form. We sidestep the issue of whether the PPI topology is exactly scalefree. Instead we apply our model, which uses free energy as the basis of the of PPI in a thermodynamic approach to understanding PPIs, to describe the Y2H data available for the different species. In this approach the powerlaw behavior of networks is derived from an exponential distribution giving the probability of variations in the free energy contributed by a protein. Our starting point is the PPI and the additivity principle (21). Simulation and semianalytic strategies were used in a complementary fashion (see Methods).
The computersimulated fit and semianalytically derived curves superimposed on the Y2H data for each organism are shown in Fig. 2. Modeling the data obtained from the largescale Y2H screens (4–6, 8–11) using our approach yielded two parameters, λ and μ, for each species (Table 1). The highthroughput Y2H maps represent a partial sample of the interactomes. Questions have been raised about the accuracy of inferring a complete PPI topology from only a partial sample (22). In our model, the nearly identical λ and μ parameters derived for the two independent human highthroughput Y2H screens not only provide independent validation of those largescale data sets but also support the current model. The analytically derived curves for the different species data sets reveal that the degree distribution resembles but does not strictly follow a power law. Importantly, unlike the prior analysis (10), the current model reveals species differences in the parameters that control the degree distribution. The value of λ ranges from 0.64 to 1.53 and is closely related to the slope of the curves representing p(k) (Fig. 3 A). The value of λ reflects the tightness of the fluctuations of the freeenergy difference, with a smaller value indicating increased fluctuation of the ability to interact. There is no obvious correlation between λ and divergence times among these organisms. The parameter μ is closely related to the height of the curves representing p(k) (Fig. 3 B).
Parameter μ is a measure of the average association freeenergy difference and therefore can be regarded as an indicator of the average strength of all of the binary PPIs of an organism. The value of μ differs across species and unlike λ is positively correlated with divergence times (Fig. 4). This means that μ is lowest for H. pylori and progressively increases with increased evolutionary time. In other words, the average strength of binary PPIs is strongest in the least complex organism.
Recently, Deeds et al. (23) proposed a physical model for PPIs based on number of exposed hydrophobic residues that similarly recapitulates powerlaw distribution in yeast. Their model is a specific example of the “intrinsic fitness model” advanced by Caldarelli et al. (15), with the fitness number of each node having Gaussian distribution and the probability of interaction p(g, g′) taken as a step function. Our model and their model each obtain degree distributions that are approximately scalefree. Whereas their approach based on the Gaussian distribution only holds for a small range of parameters, our model based on exponential distribution yields a nearly scalefree distribution for almost any set of parameters λ and μ. We have not been able to use their model to reproduce the measured degree distributions of all of the species (data not shown).
Our statistical model, as well as those of others (23), underestimates the number of proteins with single partners. This is seen in Fig. 2 by comparing the empirical and predicted values of p(k) at k = 1. There are many classical biochemistry “lockandkey” PPIs that involve highly specific pairs of proteins. Such interactions are not included in our statistical model that assumes additivity of free energy. Hence, the experimental observation of p(1) is always much greater than the predicted value. In fact, the traditional paradigm for structurally based protein recognition always emphasizes the complementarity between interactive pairs. This has led to the important concept of specificity in biochemistry. The intrinsic fitnessbased model illustrates that “nonspecific” interactions where one protein can have multiple partners play an important role in the largescale PPI. And more importantly, our result shows that such interactions can be biologically functional.
We can quantify the strong and weak interactions by computing the probability h(Δg) for the association energy of a pair of proteins, ΔG ^{0}/RT, to take the specific value Δg. A straightforward computation yields the result The distribution h(Δg) peaks when Δg = μ + (1/λ). The mean value of the distribution is μ. The parameters λ and μ for each species were used to generate the different energy distributions, h(Δg), shown in Fig. 5.
The crossspecies comparison of the freeenergy distributions shown in Fig. 5 reveals a progressive lefttoright shift of freeenergy distribution with evolutionary time. This shifts toward weaker interactions mirrors changes in μ (Fig. 4). There was a disproportionably larger difference between human and fly/worm freeenergy distribution with respect to divergence time than the differences between fly/worm and the unicellular organisms. The Plasmodium protein complexes network has diverged from those of yeast, fly, and worm (24). Yet, we find that the freeenergy distribution for this malaria pathogen is similar to these other early organisms (Fig. 5).
What could be the evolutionary changes needed to account for the weaker interactions that seem to typify the human interactome compared with those of the lower organisms? Comparative genomic analysis reveals dramatic differences in the human proteome compared with lower metazoans such as the fly or the nematode (25).
Disordered protein regions are common, particularly in regulatory factors (26). These domains can bind a diversity of protein partners (26, 27). It has recently been recognized that there is an increased trend toward protein unfoldedness from lower to highly complexed organisms (28). The unstructured protein domains are often modified posttranslationally. These unfolded domains also permit multilateral binding and complex interactions required by highly evolved organisms. This evolutionary change expands a protein’s repertoire of partners and is a way for factors to assume new functions. The interactions involving disordered proteins are intrinsically weak. The ability to more readily dissociate a complex is considered an important attribute because it allows PPIs to be regulated by covalent modification and by other molecules.
Comparative genomic analysis also reveals other significant proteome changes that evolved in more complex organisms (25). For example, Src homology 2 (SH2) and Src homology 3 (SH3) bearing proteins are some of the most frequently represented families of factors in man, but their frequency in earlier species is orders of magnitude lower (25). The SH3 and SH2 interaction typically exhibit lower affinities and are also highly regulated.
These evolutionary changes are embodied in the hnRNP K. This protein contains three structured RNAbinding KH domains that are well conserved in fly, nematode, and yeast. KH domains are also found in bacteria (27). Mammalian K protein contains a large disordered KI region that contains several SH2 and SH3binding sites that are absent in fly and worm. The K interaction region mediates association with many protein partners, interactions that are highly regulated by phosphorylation (29–31). In vitro, many K protein binary interactions are weak (29–31). Yet, within cells K protein is a component of many and functionally diverse complexes (27, 32). These observations may reflect multilateral molecular cooperativity of binding that is amiable to regulation by intra and extracellular signals. K protein ability to interact in a regulated fashion with a diversity of other factors and nucleic acids explains its involvement in multiple processes that compose gene expression (27, 33). There are many mammalian proteins that exhibit similar properties (26).
Sequencing of many genomes revealed that the number of protein coding genes is surprisingly similar for organisms as disparate as human, fly, and even yeast (25). Yet, the differences in the complexity of these organisms are immense. The largescale Y2H screens across species provide opportunities to gain global views of the interactomes and their evolutionary trends. Our freeenergy model of PPIs identifies a global evolutionary tendency toward weaker binary protein–protein Y2H interactions. The result of this analysis is consistent with the notion that in highcomplexity organisms disordered regions assumed a greater role (28). These weaker interactions became more important for human PPI networks than for the networks of lower organisms, as viewed by Y2H screens. The evolution of weaker interaction, which is more easily modulated, provides new insight how cellular complexity could have evolved while maintaining genomic simplicity.
Undoubtedly, the future will bring many more largescale Y2H studies. The model developed here should be useful for following interactome changes that evolved between more closely related organisms, and also for studying differences between the freeenergy distributions of diverse tissues. In this regard it would be particularly interesting to compare the Y2H global view of the brain interactome to that of less complex organs. Comparing the freeenergy distribution of PPIs in normal and malignant tissues could also be very fruitful.
Methods
There are thousands of proteins in a typical PPI network and millions of possible binary interactions. Therefore, a statistical treatment of PPI networks, based on the concept of free energy of association—ΔG ^{0} = −RT ln K _{a}, where K _{a} is the association constant—is used to derive the degree distribution of Y2H binary protein interactions for an organism.
Derivation of the Model.
Our basic ideas and assumptions were as follows. There is a mean association free energy among all of the protein pairs in an organism: 〈ΔG ^{0}〉. For a particular pair of proteins, say A and B, their association free energy deviates from the 〈ΔG ^{0}〉, and the deviation is contributed by both proteins A and B additively: where the g _{A} and g _{B} represent the fluctuations of the values of the freeenergy difference, measured in RT units, due to the respective contributions of protein A and B. We assume additivity following the theoretical work (23, 34) and molecular studies (35, 36). For general discussion of additivity principles in biochemistry, see ref. 21. The empirical support for additivity assumption is also provided by the scalefree nature of Y2H PPI networks (4–11). The scalefree phenomenon suggests that if AB has strong interaction, then AC is likely to have strong association. Conversely, if XY is a weak complex, then XZ is more likely to be weak. The physical basis for this intrinsic property of proteins to interact with others remains to be better defined. However, there is a correlation between the number of interactions by a given protein and the fraction of hydrophobic amino acids on its surface (23), suggesting one potential mechanism. Conventionally, one would assume that g _{A} and g _{B} are Gaussiandistributed with zero mean. But it has been shown that the Gaussian distribution is inconsistent with networks exhibiting powerlaw topology (15). Rather, Caldarelli et al. (15) have shown that the robust powerlaw topology essentially dictates the g _{A} and g _{B} to be exponentially distributed; thus it is asymmetric. The exponential distribution leading to power law behavior is also seen in the kinetic study of protein folding (37, 38). Therefore, we take the mathematical expression for the distribution of both g _{A} and g _{B} to be where C is a normalization factor whose value can be determined to be λ/e. The distribution of Eq. 3 has a mean of zero and standard deviation of 1/λ. In summary, the Y2H PPI power law topology (4–11) suggests using an exponential distribution (15) to define an organism’s freeenergy distribution of binary PPIs; the actual numerical value of the power seems to be related to the fluctuations of the PPI.
We now ask, for a given protein A, what is the probability of it being associated with a protein B? This is a standard question of bimolecular association, and the probability is given by where K _{a,AB} is the association constant between A and B, and [B] is the concentration of molecule B. We assume that in all of the Y2H experiments, the concentrations of the expressed hybrid proteins are essentially the same. Then Eq. 4 can be simplified into where g _{A} and g _{B} are exponentially distributed according to Eq. 3 , and the parameter μ = (〈ΔG ^{0}〉/RT) − ln [B] contains information on the average binding strengths of all of the binary PPI of a given organism. Its value is expected to be different for different species. Since we assume that all of the Y2H measurements essentially have the same [B], the lower the mean association energy, the greater the association constant, and the smaller the value of μ. The [B] in Eq. 4 , the concentration of all of the protein B with its binding site for A being free and independent of other binding sites, is a function of the concentrations of the other proteins that compete for the A binding site of B. We do not take this effect into consideration in the present model.
In this article, we shall denote (e ^{gA+gB−μ})/(1+ e ^{gA+gB−μ}) = p(g _{A}, g _{B}). It is graphically convenient to use a simpler notation in which the gvalues of the proteinpair A, B are denoted by g and g′. Then for two interacting proteins with gvalues g and g′, the interaction probability p(g, g′) of Eq. 5 is determined solely by the quantity g + g′ − μ, with a positive value indicating a significant interaction probability.
Eq. 5 gives the probability of a protein A forming a complex with another protein A′ leading to functional transcriptional activator. The relation between this probability and eventual expression of reporter gene is the complex transcription activation process. This is the least understood step of the Y2H measurement. There are essentially two types of transcriptional activation responses: graded and allornone (39–41). The latter leads to a step function as used in ref. 23. Reporter gene systems have revealed that at a singlecell level expression is either maximal or not expressed at all, but the probabilities of expression are a function of the amount of transcriptional activators. This leads to graded responses in a cell population. We have adopted such a graded response in our simulations.
Computer Simulation.
The use of random sampling techniques is appropriate for any system that can be described statistically, so we simulate the protein interaction network using the parameters λ, μ, and N. The technique is to generate an N × N matrix with each element representing a chosen pair of proteins. A matrix element is 1 if the pair interacts or 0 if not. The first step in calculating the matrix element is to assign each protein a “gvalue” according to the exponential probability distribution ρ(g) (Eq. 3 ). The second step is to calculate the probability of interaction p(g, g′) from Eq. 5 . Increasing the value of μ decreases the value of p(g, g′) and therefore decreases the interaction probability, thus μ can be taken as an indicator of the strength of the interaction. Then a number q (0 ≤ q ≤ 1) is generated randomly from a uniform distribution. If p ≥ q, we say that there is an interaction between the two proteins, and the matrix element corresponding to these two proteins is set to be unity. Otherwise, this element is set to zero. This procedure is repeated for all of the (N(N − 1))/2 pairs of proteins in the network. The sum of the number of ones in each row of the matrix represents the number of partners (or degree) of the chosen protein. Tabulating the degree of each protein allows us to determine p(k) the probability that a protein has k partners. Carrying out this procedure five times is sufficient to achieve a stable degree distribution.
Semianalytic Approach.
We developed an average probability approximation to the exact formulation of ref. 34. One of the consequences of the model represented by Eq. 5 is that the probability distribution of interaction free energy between protein A and all of the other proteins is in fact different for different proteins. However, if we neglect this difference, and are only interested in the average distribution of interaction free energy between two proteins, a simple expression for p(k), the degree distribution, can be derived. This is done by approximating the p(g, g′) by its average.
Using the distribution given in Eq. 6 , we built a network by assuming that a single protein with given a gvalue binds another protein with a probability p̄(g). For a number of proteins, N (including itself), the probability of having k actually bind is given by a binomial distribution. Eqs. 6 and 7 give essentially the same degree distribution as the simulation procedure discussed above. Furthermore, if one replaces the probability of interaction of Eq. 5 by a step function, this model reduces to the intrinsic fitness models of refs. 15 and 34. In that case the approximate Eqs. 6 and 7 give the same results as an exact treatment.
Y2H Data Fitting.
The parameters λ, μ were varied so as to minimize the χ^{2} parameter defined to minimize the difference between the logarithms of the theory p(k) of Eq. 7 and the experimentally measured p ^{exp}(k).
The sum is over those values of k for which p(k) ≠ 0. The parameters from these fits were used in the simulation.
Acknowledgments
We thank Lynn Amon, Oleg Denisenko, Jay Hesselberth, Stan Fields, Tom Milac, and Ram Samudrala for their input on this article. This work was supported by National Institutes of Health Grants GM45134 and DK45978 (to K.B.).
Footnotes
 ^{§}To whom correspondence should be addressed at: University of Washington, Box 358050, Seattle, WA 98109. Email: karolb{at}u.washington.edu

Author contributions: Y.Y.S., G.A.M., H.Q., and K.B. designed research, performed research, analyzed data, and wrote the paper.

Conflict of interest statement: No conflicts declared.
 Abbreviations:
 PPI,
 protein–protein interaction;
 Y2H,
 yeast twohybrid.
 © 2006 by The National Academy of Sciences of the USA
References

↵
 Cusick M. E. ,
 Klitgord N. ,
 Vidal M. ,
 Hill D. E.

↵
 Deane C. M. ,
 Salwinski L. ,
 Xenarios I. ,
 Eisenberg D.

↵
 Salwinski L. ,
 Miller C. S. ,
 Smith A. J. ,
 Pettit F. K. ,
 Bowie J. U. ,
 Eisenberg D.

↵
 Giot L. ,
 Bader J. S. ,
 Brouwer C. ,
 Chaudhuri A. ,
 Kuang B. ,
 Li Y. ,
 Hao Y. L. ,
 Ooi C. E. ,
 Godwin B. ,
 Vitols E. ,
 et al.

↵
 Ito T. ,
 Chiba T. ,
 Ozawa R. ,
 Yoshida M. ,
 Hattori M. ,
 Sakaki Y.
 ↵

↵
 Li S. ,
 Armstrong C. M. ,
 Bertin N. ,
 Ge H. ,
 Milstein S. ,
 Boxem M. ,
 Vidalain P. O. ,
 Han J. D. ,
 Chesneau A. ,
 Hao T. ,
 et al.
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Barabasi A. L. ,
 Albert R.
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Nelson D. L. ,
 Cox M. M.
 ↵

↵
 Dill K. A.
 ↵

↵
 Deeds E. J. ,
 Ashenberg O. ,
 Shakhnovich E. I.
 ↵
 ↵
 ↵
 ↵

↵
 Pandey N. ,
 Ganapathi M. ,
 Kumar K. ,
 Dasgupta D. ,
 Das Sutar S. K. ,
 Dash D.

↵
 Ostrowski J. ,
 Schullery D. S. ,
 Denisenko O. N. ,
 Higaki Y. ,
 Watts J. ,
 Aebersold R. ,
 Stempka L. ,
 Gschwendt M. ,
 Bomsztyk K.

↵
 Schullery D. S. ,
 Ostrowski J. ,
 Denisenko O. N. ,
 Stempka L. ,
 Shnyreva M. ,
 Suzuki H. ,
 Gschwendt M. ,
 Bomsztyk K.

↵
 Van Seuningen I. ,
 Ostrowski J. ,
 Bustelo X. ,
 Sleath P. ,
 Bomsztyk K.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵