New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Assessing experimentally derived interactions in a small world

Edited by Lawrence A. Shepp, Rutgers, The State University of New Jersey–New Brunswick, Piscataway, NJ, and approved February 10, 2003 (received for review September 27, 2002)
Abstract
Experimentally determined networks are susceptible to errors, yet important inferences can still be drawn from them. Many real networks have also been shown to have the smallworld network properties of cohesive neighborhoods and short average distances between vertices. Although much analysis has been done on smallworld networks, smallworld properties have not previously been used to improve our understanding of individual edges in experimentally derived graphs. Here we focus on a smallworld network derived from highthroughput (and errorprone) protein–protein interaction experiments. We exploit the neighborhood cohesiveness property of smallworld networks to assess confidence for individual protein–protein interactions. By ascertaining how well each protein–protein interaction (edge) fits the pattern of a smallworld network, we stratify even those edges with identical experimental evidence. This result promises to improve the quality of inference from protein–protein interaction networks in particular and smallworld networks in general.
Until recently, network modeling often assumed the topology was either a random graph or a regular lattice. In recent years it has been shown that neither of these options capture the properties of most real networks well, and new graph models have been proposed. One such model, the smallworld networks of Watts and Strogatz (1), has inspired a plethora of new research directions (2). A smallworld network is one in which the length of the shortest path between any pair of vertices tends to be small (short characteristic path length), but also with densely connected local neighborhoods (high clustering coefficient) (1). The former property is held by random graphs but not regular lattices, whereas the latter property is held by regular lattices but not random graphs, placing smallworld networks between these two extremes. Real networks shown to be smallworld networks include the Internet (3), scientific collaboration networks (4), neural connections in Caenorhabditis elegans (1), the English lexicon (5), metabolic networks (6), and protein–protein interaction networks (7, 8). It is not surprising that so many real networks are small world, because Watts and Strogatz (1) showed that as you interpolate from either extreme (random or lattice), the graph quickly becomes small world, so that real graphs that are neither completely ordered nor completely random will often be smallworld graphs. Properties of smallworld networks have been investigated (9–12), various models for their origin have been formulated (13–16), and methods for separating cohesive regions have been devised (17, 18). Other network properties studied in recent years that are also common in real networks include scalefree (or powerlaw) degree distribution (3, 19–22), community structure (18), and hierarchy (23). Although much analysis has been done on these new network topologies, topological network properties have not previously been used to assess reliability of individual edges in an experimentally derived network.
Biological networks such as protein–protein interaction networks, metabolic networks, and gene regulatory networks are experimentally derived with substantial falsepositive and falsenegative errors (24, 25). Here we consider in detail a network of protein–protein interactions derived from highthroughput, errorprone yeast twohybrid (Y2H) studies (26, 27). These data can be depicted graphically with vertices representing proteins and edges (connecting lines) representing interactions between them. Estimates of error rates in Y2H studies range from 50% to 80% (24, 25, 27, 28). Although protein interaction networks have been used to predict protein function (26, 27, 29–32), experimental errors necessarily affect the quality of such inference.
Others have assessed edges in protein–protein interaction networks by using homology (25), gene expression (33), and network topology (considering the number of neighbors with only one neighbor) (34). Each of these methods used threshold values for their assessment, essentially classifying edges as either high or low confidence. We want to estimate the probability that each edge in the network represents a real interaction, or “true edge.” In addition to direct evidence about the veracity of each edge, we want to see whether a property of the overall topology of the true graph can be exploited to locally assess edges in the experimental graph. The neighborhood cohesiveness of a network is an average of a local measure, so it is a natural first choice of a network property to use. We incorporate a measure of neighborhood cohesiveness around the edge as an indication of how well the edge fits the expected topology of protein–protein interaction networks.
Our strategy is as follows. First, we define four variants of a mutual clustering coefficient, C_{vw}, to measure the neighborhood cohesiveness around an edge in a graph. Second, we show that for our network of protein–protein interactions, true edges (interactions) have distinctly higher C_{vw} than falsepositive edges, as expected if true edges form a smallworld network whereas falsepositive edges form a more random network. Third, we examine the degree to which the neighborhood cohesiveness of each edge is consistent with a smallworld network and show that this measure provides a way to score individual edges according to their likelihood of being true. Fourth, we rank protein interactions according to each variant of C_{vw} and determine the best definition of C_{vw} for protein–protein interaction networks. Fifth, we provide a probabilistic framework for integrating diverse types of evidence to better assess confidence for observed highthroughput interactions. Finally, we predict interactions for which we have no experimental evidence and show that upon further investigation a remarkable number of these predictions have already been noted in the biomedical literature.
Mutual Clustering Coefficient
Watts and Strogatz (1) defined a clustering coefficient to give a global measure of the cohesiveness or “cliquishness” of a graph. Smallworld networks have high clustering coefficients (1). The cliquishness of the neighborhood around an individual edge should therefore be an indication of how well this edge fits the pattern of a smallworld network. Quantifying this seems straightforward. Indeed, the clustering coefficient of a graph was originally defined as the average of a local measurement: The clustering coefficient of a graph defined by Watts and Strogatz used the average value of C_{v}, a measure of the neighborhood cohesiveness around each vertex v. However, we want to quantify neighborhood cohesiveness around individual edges rather than vertices.
The high clustering coefficients of smallworld networks indicate that neighbors of a given vertex are more likely to have edges between them than would be expected in a random graph. Such edges between neighbors of a vertex form triangles cornered at that vertex. This preponderance of triangles in a smallworld network means that an edge is likely to be a side of more triangles than would be expected in a random graph. Therefore, for an edge vw between vertices v and w, a neighbor of vertex v is more likely to have an edge to w if the edge is from a smallworld graph (illustrated in Fig. 1) than if it is from a random graph. Such “mutual neighbors” of the two endpoints serve to corroborate the edge.
We want to define a mutual clustering coefficient C_{vw} for a pair of vertices v and w to give a measure of such corroboration. We want this measure to be independent of the existence of an edge between v and w for a leaveoneout approach, so that direct experimental evidence about an interaction between two proteins does not influence our assessment of the neighborhood of the two proteins. We apply this measure not only to edges (vertex pairs with supporting evidence) but to any pair of vertices so we can form hypotheses about missing edges. In defining such a measure, we want to count the number of triangles in which this pair of vertices might be included, but the proper normalization factor (to account for the number of neighbors of the two proteins) is uncertain. Optimal normalization may depend on other aspects of the network topology, such as whether the smallworld network also exhibits a scalefree topology, i.e., has a distribution of degree (number of edges to each vertex) that follows a power law (19, 21).
We consider four alternative definitions of C_{vw}. In each of these definitions, N(x) represents the neighborhood of a vertex x, and Total represents the total number of proteins in the organism. Given fixed neighborhood sizes N(v) and N(w), these coefficients all increase with increasing overlap between the neighborhoods. For two vertices v and w, we define these measures of mutual clustering coefficient: The first three definitions all have as their numerator the number of triangles that contain the edge, although each definition uses a different normalization factor. The Jaccard index (35) is a natural and common graphtheoretic measure that has been used for hierarchical clustering (36), but it is inappropriate if one of the two endpoints of the edge we are considering has a large neighborhood. For example, if the two endpoints of an edge share 10 common neighbors, we might desire more significance if one endpoint has only these 10 neighbors and the other endpoint has 200 neighbors than if each endpoint has 105 neighbors (the union of the two neighborhoods is of size 200 in either case). Such situations can be expected in scalefree networks such as that of protein–protein interactions (21, 22). The meet/min coefficient removes this bias at the expense of discarding information about the larger neighborhood size. This measure is similar to the topological overlap defined by Ravasz et al. (23). The principal difference is that our measure is independent of any evidence of an edge between the two nodes measured (see below). The geometric coefficient is a compromise between the Jaccard and meet/min coefficients and is similar to the measure of signature overlap used by Ihmels et al. (37).
The cumulative hypergeometric distribution is frequently used to measure cluster enrichment (38) and significance of cooccurrence (39). The summation in the hypergeometric coefficient can be interpreted as a p value, the probability of obtaining a number of mutual neighbors between vertices v and w at or above the observed number by chance, under the null hypothesis that the neighborhoods are independent, and given both the neighborhood sizes of the two vertices and the total number of proteins in the organism. The hypergeometric coefficient is then defined to be the negative log of this p value.
To avoid zero denominators in the first three coefficients, we included the edge vw in computation of C_{vw}, regardless of direct experimental evidence for that edge. For the hypergeometric coefficient, we excluded the edge vw. This makes C_{vw} (for all definitions) independent of the direct experimental evidence for the edge we are assessing. To expedite computation of the hypergeometric coefficient, a numerical approximation of the Gamma function was used to calculate factorials (40) so that computation time was not an issue.
Protein–Protein Interaction Data
We derived our protein–protein interaction network from highthroughput, errorprone Y2H studies (26, 27) obtained from CuraGen's PathCalling Yeast Interaction Database (ref. 26; http://portal.curagen.com). Such interactions between proteins are known to form a smallworld network (7, 8). We chose to focus on data from Y2H studies because they are particularly errorprone and thus stand to maximally benefit from a better assessment of individual interactions. For validation, we used the more reliable conventional evidence (e.g., coimmunoprecipitation) also obtained from the PathCalling database. A total of 6,000 known and hypothetical Saccharomyces cerevisiae proteins were used, of which 1,658 have some type of evidence of interactions with another protein.
To validate predictions of interacting pairs of proteins for which there is no evidence in the PathCalling database, we used Incyte Genomics' Yeast Proteome Database (ref. 41; www.incyte.com/proteome), which is a more comprehensive, biomedical literaturederived database.
Correlation Between C_{vw} and Validity of Interactions
To determine whether there is a correlation between each definition of C_{vw} and true edges, we examined the Y2H network described above. As shown in Fig. 2, for each of these definitions, the average C_{vw} of interactions found by highconfidence conventional studies was an order of magnitude higher than the average C_{vw} of interactions found only in Y2H studies, which in turn was several orders of magnitude higher than the C_{vw} for pairs of proteins with no evidence of interaction. This finding implies that biologically relevant edges in our Y2H network have a distinct distribution of neighborhood cohesiveness that might be useful in separating truepositive from falsepositive edges. This general pattern was observed for each specification of C_{vw}, so that high neighborhood cohesiveness appears to be independent of our choice of measures and a property of the true protein interaction network. The standard errors shown in Fig. 2 suggest that these differences are significant enough to justify testing the value of C_{vw} in adjusting our confidence in the veracity of individual edges.
Ranking Individual Interactions by C_{vw}
The four definitions of mutual clustering coefficient C_{vw} were used to rank protein–protein interactions observed in Y2H studies. Fig. 3 describes these results. The fraction of interactions validated by highconfidence conventional evidence is shown for groups of protein pairs ranked in decreasing order by C_{vw}.
If C_{vw} were unrelated to confidence, the expected fraction would be constant across all bins with height near zero (<10^{−4}). The fraction of validated interactions dominates (is higher than) this line of expectation and is nearly monotonically decreasing for all specifications of C_{vw}, providing evidence that all specifications of C_{vw} contain information about the validity of edges in the graph. The hypergeometric curve generally dominates the others, indicating that for a given rank a larger fraction is validated. This finding justified our choice to proceed by using only the hypergeometric specification for C_{vw}.
Integrating C_{vw} with Direct Experimental Evidence
Frequently, there is other evidence that bears on confidence in a particular edge. Such integration of diverse evidence can be accomplished by using a Bayesian probabilistic framework. In this case we consider two types of evidence: observed C_{vw} and the presence of supporting Y2H evidence. Although we are considering two types of evidence, the framework we use can be used for any number of evidence types.
We want to compute the probability of an interaction being true given the experimental (Y2H) evidence and local network topology (C_{vw}). Because we do not know precisely which interactions are true, we use the existence of evidence from highconfidence conventional experiments as an indication of the truth of an interaction. Rather than compute the probability that two proteins truly interact, we instead estimate the probability that there is highconfidence evidence that the two proteins interact, having concealed the known status of conventional evidence for that pair of proteins. This probability should be correlated with the actual veracity of the protein–protein interaction, although it is likely an underestimate given the sparsity of highconfidence evidence currently available. For simplicity, we call this a posterior probability score of confidence in the interaction of two proteins.
We can compute this score (P^{+}) by using Bayes' rule and the naïve assumption of independence between evidence types as follows: or, by using Bayes' rule to introduce the term P(T_{vw} = 1Y_{vw}) and some algebraic manipulation, Here T_{vw} is an indicator variable for the truth of the interaction according to highconfidence evidence; Y_{vw} is an indicator variable for the existence of experimental Y2H data; and C_{vw} represents the hypergeometric coefficient. The assumption of independence between evidence types was used to limit overfitting the data.
To estimate P(C_{vw}T_{vw}), the likelihood of observing a particular value of C_{vw} for true or false edges, we “bin” edges by C_{vw}, initially binning C_{vw} in equal intervals and then merging bins containing <10 protein pairs. We then count the proportion of known true edges in each bin, using the existence of highconfidence conventional evidence as an indicator for true edges. P(T_{vw}Y_{vw}), the prior probability of the truth of an interaction given the Y2H evidence Y_{vw}, is estimated by the proportion of all pairs of proteins with that value of Y_{vw} that have been observed in an independent (higher reliability) conventional experiment. The resulting estimate of P(T_{vw} = 1Y_{vw}) is almost certainly an underestimate because many true interactions have not had confirmation by conventional methods.
We used a leaveoneout approach to compute P^{+} for each pair of proteins, wherein the likelihoods and priors were obtained by using all edges except the edge of interest. An indication of our method's ability to discriminate true interactions from false ones is shown in the receiver operating characteristic (ROC) curve in Fig. 4. Fig. 4 graphically depicts tradeoffs between truepositive rates and falsepositive rates for different possible thresholds in P^{+}. A curve for a random classifier would be expected to extend from the origin with a slope of 1. Our method substantially dominates this expectation. For example, to achieve a falsepositive rate (1specificity) of 10^{−5}, we have a sensitivity of 0.117, far above the expected sensitivity of 10^{−5}. It is also worth noting that the set of edges exceeding the threshold indicated in Fig. 4 by point d is essentially that set of edges supported by Y2H data. The ROC curve illustrates the ability of our method to stratify interactions, i.e., to allow a researcher to trade lower sensitivity for a lower falsepositive rate or vice versa.
Predicting Protein–Protein Interactions Without Direct Experimental Evidence
The current experimentally derived protein interaction network is incomplete; that is, many true interactions have not yet been seen in any study (24, 42). Although our original motivation was to assess the quality of protein interactions found in highthroughput studies so that future inferences (such as protein function prediction) could be improved, we would also like to predict protein interactions for which we currently have no evidence. Pairs of proteins with high P^{+} score and no direct supporting evidence represent predicted interactions.
We view these predicted interactions as a further test of the validity of our method. Many real edges (interactions) were not in our training set because not all true edges have been found (or even tested) and because our training set is based on summaries of the literature that are unlikely to be complete. These edges have essentially been deleted from our training graph of protein–protein interactions, and we would like to learn the lower limit of how well our method can recover these deleted edges by more exhaustively searching the literature for our top predictions.
In Table 1 we show the pairs of proteins with P^{+} > 0.25 for which we had no evidence of interaction (Y2H or conventional). We examined these interactions further by using the Yeast Proteome Database. Taken together, there are four known physical interactions included in our 13 predictions (p < 10^{−7}). This p value was calculated by using a cumulative hypergeometric distribution with the expected probability of success according to a (conservatively high) estimate of 16 for the global average number of physical interactions per protein (42). The fraction of predictions verified by the literature is likely to be an underestimate of the fraction of predictions that are true, because not all interactions have been tested. Interestingly, four known genetic interactions were observed among the 13 predictions. This result is surprising, given the scarcity of known genetic interactions in yeast (<1,200 such interactions are known among >10^{7} potential interactions) and suggests the possibility of predicting one interaction type from another. Furthermore, according to the Saccharomyces Genome Database (ref. 43; http://genomewww.stanford.edu/Saccharomyces/), in eight of the nine predicted interactions that did not involve a hypothetical ORF, the two proteins share a common molecular function or are known to be involved in the same biological pathway (all except Lsm1p–Lsm8p).
Conclusion
We have described an approach that exploits the local topology of smallworld networks to assess confidence in networks derived from data containing errors. Such an approach can be used to improve predictions of protein function that have previously relied on an “allornothing” view of interactions (26, 27, 29–32) or to determine the proteins most likely to be members of particular protein complexes (S. Asthana and F.P.R., unpublished work). This approach is applicable to other smallworld networks that are defined experimentally or by heuristic measures, e.g., measures of topic similarity between documents that are used in the field of information retrieval (44). The Bayesian framework we use here for combining local topology measures with Y2H evidence will allow integration of diverse measures of confidence, such as other informative measures of local topology (e.g., “interaction generality”) and other sources of interaction evidence (31, 32, 34).
The uncertain nature of experimentally or heuristically derived networks necessarily impacts networkderived inference. We expect that the measures of confidence in network edges described here will improve inference for protein interaction networks in particular and smallworld networks in general.
Acknowledgments
We are grateful to S. Asthana, G. Berriz, D. Miller, and P. D'haeseleer for computational advice and helpful comments. We thank S. Asthana, J. Kleinberg, S. Komili, and S. Wong for critical reading of the manuscript. We gratefully acknowledge the helpful remarks of E. Harlow and the anonymous referees. This research was supported in part by an institutional grant from the Howard Hughes Medical Institute Biomedical Research Support Program for Medical Schools.
Footnotes

↵* To whom correspondence should be addressed. Email: froth{at}hms.harvard.edu.

This paper was submitted directly (Track II) to the PNAS office.
Abbreviations

Y2H, yeast twohybrid
 Received September 27, 2002.
 Copyright © 2003, The National Academy of Sciences
References
 ↵
 ↵
 ↵
 ↵
 Newman M. E.
 ↵
 Sigman M.
 ↵
 ↵
 Wagner A.
 ↵
 ↵

 Watts D. J.
 ↵
 Amaral L. A.
 ↵
 Snel B.
 ↵
 Girvan M.
 ↵
 Barabasi A. L.
 ↵
 ↵
 Wuchty S.
 ↵
 Ravasz E.
 ↵
 ↵
 Deane C. M.
 ↵
 ↵
 Ito T.
 ↵
 Mrowka R.
 ↵

 Boulton S. J.
 ↵
 ↵
 ↵
 ↵
 Saito R.
 ↵
 ↵
 ↵
 ↵
 ↵
 Sudarsanam P.
 ↵
 Press W. H.
 ↵
 Costanzo M. C.
 ↵
 ↵
 Cherry J. M.
 ↵
Citation Manager Formats
Sign up for Article Alerts
Jump to section
 Article
 Abstract
 Mutual Clustering Coefficient
 Protein–Protein Interaction Data
 Correlation Between C_{vw} and Validity of Interactions
 Ranking Individual Interactions by C_{vw}
 Integrating C_{vw} with Direct Experimental Evidence
 Predicting Protein–Protein Interactions Without Direct Experimental Evidence
 Conclusion
 Acknowledgments
 Footnotes
 Abbreviations
 References
 Figures & SI
 Info & Metrics