A data integration methodology for systems biology

Contributed by Leroy Hood, October 5, 2005
Abstract
Different experimental technologies measure different aspects of a system, to differing depth and breadth. High-throughput assays have inherently high false-positive and false-negative rates. Moreover, each technology includes systematic biases of a different nature. These differences make network reconstruction from multiple data sets difficult and error-prone. Additionally, because of the rapid rate of progress in biotechnology, there is usually no curated exemplar data set from which one might estimate data integration parameters. To address these concerns, we have developed data integration methods that can handle multiple data sets differing in statistical power, type, size, and network coverage without requiring a curated training data set. Our methodology is general in purpose and may be applied to integrate data from any existing and future technologies. Here we outline our methods and then demonstrate their performance by applying them to simulated data sets. The results show that these methods select true-positive data elements much more accurately than classical approaches. In a companion paper, we demonstrate the applicability of our approach to biological data. We have integrated our methodology into a free open source software package named pointillist.
Systems biology (1, 2) aims to understand cellular behavior in terms of the spatiotemporal interactions among cellular components, such as genes, proteins, metabolites, and organelles. In systems biology, one typically perturbs a system, uses high-throughput measurements to identify the pertinent elements and their interactions, and integrates these into a biological network to understand the system's behavior. As such, systems biology is predicated on the integration of experimental data from an ever-increasing number of technologies, such as gene expression arrays, proteomics, and chromatin immunoprecipitation on chip assays (3). Integration achieves one of the most important imperatives of systems biology: it reduces the dimensionality of global data to deliver useful information about the system of interest.
A major challenge in systems biology is that technologies that globally interrogate biological systems have inherently high false-positive and false-negative rates (4); thus, each data type alone has limited utility. The integration of data from different sources provides an effective means of dealing with this issue by reinforcing bona fide observations and reducing false negatives. Moreover, because different experimental technologies provide different insights into a system, the integration of multiple data types offers the greatest information about a particular cellular process. For example, gene perturbation experiments (e.g., knockouts or RNA interference) reveal relationships between genes that may imply direct physical interactions or indirect logical interactions. In contrast, chromatin immunoprecipitation chip data can reveal direct protein–DNA interactions or cofactor associations with bound transcription factors (3). Combined, these technologies can provide a much more detailed view of a transcriptional regulatory network than either alone (5).
There are a number of confounding problems that make data integration nontrivial. First, the types of data to be integrated range from discrete (e.g., a protein molecule may be localized to one or more organelles) to continuous (e.g., mRNA expression level). Second, each technology used has a different degree of reliability and different amounts of the various types of error. Even when considering multiple data sets generated by a common method, simply taking the intersection of these data sets does not remove random errors completely (6). Third, each data set includes its own systematic biases (4, 7). For example, labeling-based mass spectrometry approaches (e.g., isotope-coded affinity tag) tend to favor identification of highly abundant proteins. Small-scale experiments tend to provide strong evidence for a small portion of a network but say little about what may have been missed. Finally, in addition to data generated by high-throughput technologies, there are other attractive sources of data, such as small-scale experiments, curated databases, and computational predictions. To fully realize the potential of systems biology, it is imperative to draw from all of these sources of data. However, curated databases often favor widely studied proteins and genes. Curated databases may also merge data from different strains/cell types or from various experimental conditions, and they may contain considerable data on one part of a system while omitting other parts. Computational predictions that extrapolate from earlier experimental data [e.g., prediction of protein–protein interactions from known interactions of homologous proteins (8)] run the risk of perpetuating any systematic bias in the source data (in addition to false-positive and false-negative errors). De novo predictions are even more error prone.
There is therefore a pressing need for more effective methods of integrating data. These methods should accommodate various sources of binary, categorical, and continuous-valued data acquired from high-throughput experiments, small-scale experiments, databases, and computational predictions; and should be suitable for dealing with missing data, high error rates, and systematic biases in each data set. In addition, there are few fully verified data sets available for training. Integration methods not requiring a training set have so far been limited to particular classes of data where specific assumptions hold true (9).
To address these concerns, we have developed a data integration methodology that can handle multiple data sets differing in type, size, and network coverage and does not require a training data set. This methodology uses an optimization algorithm to minimize the numbers of false positives and false negatives, and it makes no assumptions about the number of data sets integrated; rather, it is general in purpose and may be applied to integrate data from any existing and future technologies. We have integrated our methodology into a freely available software package named pointillist.
In this paper, we describe our methodology and its statistical foundations using simple illustrative example data sets. Also in this issue of PNAS (10), we demonstrate the utility and efficacy of pointillist by applying it to the integration of 18 data sets to arrive at a network model of the galactose utilization network in yeast. The resulting network recapitulates the known biology of galactose utilization and provides new insights and predictions, some of which we verified experimentally.
Methods
Simulated Data Sets. Although we developed and tested our methodologies by using many sets of real experimental data, for clarity we base this paper on the simplest set of data sufficient to illustrate the pertinent characteristics of the various methods presented. See the companion paper (10) for a demonstration of the applicability of our methods to a wide variety of experimental data. We generated simulated data mimicking real high-throughput data as follows. First, true differences (e.g., fold changes of gene expression levels) for the data elements affected by a perturbation (e.g., disease) were drawn with random signs from a noncentral distribution (a gamma distribution with a ≥ 1.25 and b ≥ 1.25). We used two parameters (φ and η) to define the proportion of affected elements (φ) and the fraction of affected elements with negative differences (η), respectively. True differences for the nonaffected elements (the remaining fraction 1 − φ) were set to zero (Figs. 5–11, which are published as supporting information on the PNAS web site, illustrate characteristic features of the data and our methods). Second, normal random noise (r_{i}e_{i}) was added to the true differences (t_{i}) of each data set (T_{i}): T_{i} = t_{i} + r_{i}e_{i}, where r_{i} is a noise level inversely related to the statistical power of the technology producing each data set i, and e_{i} is standard normal. The affected data elements in a data set with a small noise level produce high absolute T_{i} values, resulting in low P values in the following test. Finally, a two-tailed test was performed on T_{i} to produce P values for each data set (P_{i}): P_{i} = 2N_{cdf}(−|T_{i}|/r_{i}), where N_{cdf} is the standard normal cumulative distribution function and division by r_{i} rescales the noise to unit variance. The same two data sets were used for Figs. 1, 2, and 3 (r_{i} = 0.666 for data set 1 and 0.334 for data set 2, and a = b = 2). For the noisier, three-dimensional data used in Fig. 4B, a and b values were randomly selected in the range 1.25–2.0 to generate 3 × 10 = 30 data sets (for the example shown in Fig. 4A, a = 1.5 and b = 1.5). The noise level (r_{i}) was selected from a uniform distribution and multiplied by a scaling factor of three for half the data and four for the rest.
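As a concrete illustration, the simulation scheme above can be sketched in a few lines of Python. This is a minimal sketch: the parameter names (`phi`, `eta`, `a`, `b`, `r`) follow the text, the default values match data set 1 of Figs. 1–3, and taking the absolute value |T| inside the two-tailed test is our reading of the formula.

```python
import math
import numpy as np

def simulate_dataset(n, phi=0.2, eta=0.5, a=2.0, b=2.0, r=0.666, seed=0):
    """Simulate one data set: a fraction phi of elements gets a gamma(a, b)
    true difference with random sign (fraction eta negative), the rest are
    zero; normal noise scaled by r is added; two-tailed P values returned."""
    rng = np.random.default_rng(seed)
    n_affected = int(round(phi * n))
    t = np.zeros(n)
    magnitudes = rng.gamma(shape=a, scale=b, size=n_affected)
    signs = np.where(rng.random(n_affected) < eta, -1.0, 1.0)
    t[:n_affected] = signs * magnitudes
    T = t + r * rng.standard_normal(n)          # observed differences
    z = np.abs(T) / r                           # rescale noise to unit variance
    # two-tailed P value: 2 * N_cdf(-z) == erfc(z / sqrt(2))
    P = np.array([math.erfc(zi / math.sqrt(2.0)) for zi in z])
    return P, t != 0.0                          # P values, ground-truth mask
```

Because the ground-truth mask is returned alongside the P values, any integration method can be scored directly against the simulated truth.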
Weighted Integration Methods. Several integration methods (11), such as Fisher's χ^{2} (12) and Stouffer's Z (11), have been widely used in statistical meta-analysis to combine P values from k data sets. In this study, we used "weighted" versions of the following integration statistics to maximize the overall statistical power of the weighted sum of nonlinearly transformed P values in each method: F_{w} = −2 Σ_{i=1}^{k} w_{i} ln P_{i} (weighted Fisher), T_{w} = −c Σ_{i=1}^{k} w_{i} ln[P_{i}/(1 − P_{i})] (weighted MG, where c is a scaling constant chosen so that the unweighted statistic approximately follows a t distribution), and Z_{w} = Σ_{i=1}^{k} w_{i} N^{−1}(1 − P_{i}) (weighted LS), where N^{−1}(·) is the inverse of a standard normal cumulative density function (cdf). The weight (w_{i}) for the P values of each data set represents a relative measure of statistical power compared with the other data sets. Fig. 6 shows the definitions of false-positive and false-negative error rates and statistical power by using two distributions for nonaffected (background) and affected elements, when Fisher's weighted F was used. Also, Figs. 7 and 8 show the effect of random noise on the distribution of P values of the background and affected elements, respectively.
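Concretely, the three weighted statistics can be sketched as follows. This is a minimal sketch: the MG statistic is shown without its degrees-of-freedom scaling constant, which we omit because, for fixed k and weights, it does not change the relative ranking of data elements.

```python
import numpy as np
from statistics import NormalDist

def weighted_stats(P, w):
    """Weighted combination statistics for one data element with k P values
    P and data-set weights w (MG shown unscaled)."""
    P = np.asarray(P, dtype=float)
    w = np.asarray(w, dtype=float)
    F_w = -2.0 * np.sum(w * np.log(P))                   # weighted Fisher
    T_w = -np.sum(w * np.log(P / (1.0 - P)))             # weighted MG (logit)
    Z_w = sum(wi * NormalDist().inv_cdf(1.0 - pi)        # weighted LS
              for wi, pi in zip(w, P))
    return F_w, T_w, Z_w
```

With equal weights w_{i} = 1 these reduce to the classical unweighted statistics; e.g., two P values of 0.05 give F_{w} ≈ 11.98.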
Determination of Weights. We determined the weights by maximizing the overall statistical power (observed in the data sets; Fig. 6) of the weighted integration statistics for a given significance level (0 ≤ α ≤ 1). This maximization was implemented by using enhanced simulated annealing (ESA) (see ref. 13). (i) Guess an initial weight vector. (ii) For the weight vector, determine by using Monte Carlo simulations a cdf of background data elements (those satisfying the null hypothesis H_{0}) for a given integration statistic. The rejection method (14) was used to generate random numbers from a central distribution with k = 1 (i.e., χ^{2} distribution with two degrees of freedom for Fisher's method, t distribution with nine degrees of freedom for the MG method, or standard normal distribution for the LS method). (iii) Find an overall statistic value (c) corresponding to cdf(H_{0}) = 1 − α. (iv) Define a decision boundary for F_{w} = c by generating a grid matrix P_{−j} = {P_{1}, P_{2},..., P_{j−1}, P_{j+1},..., P_{k}} using Gauss–Legendre quadrature (15) and then computing, for each grid point, the P_{j} value that places it on the boundary; for Fisher's method, P_{j} = exp[−(c/2 + Σ_{i≠j} w_{i} ln P_{i})/w_{j}]. (v) Count the data elements selected by this boundary; this count is the objective function to be maximized. (vi) Try another weight vector and reject or accept it by using the Metropolis algorithm until the ESA stopping criteria (13) are met. Finally, the overall P value for each data element was calculated using cdf(H_{0}). The selected elements should have overall P values less than the value of α.
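Steps ii and iii can be sketched directly: under H_{0} every P_{i} is uniform on (0, 1), so the null distribution of, e.g., the weighted Fisher statistic can be tabulated by Monte Carlo and inverted at 1 − α. This is a sketch only; the paper's rejection-sampling detail is replaced here by direct uniform sampling.

```python
import numpy as np

def null_threshold(w, alpha, n_mc=200_000, seed=1):
    """Monte Carlo the H0 distribution of F_w = -2 * sum_i w_i ln P_i for
    uniform P values and return the cutoff c with cdf(H0)(c) = 1 - alpha."""
    rng = np.random.default_rng(seed)
    P0 = rng.random((n_mc, len(w)))                       # background P values
    F0 = -2.0 * (np.asarray(w, dtype=float) * np.log(P0)).sum(axis=1)
    return float(np.quantile(F0, 1.0 - alpha))
```

With w = (1, 1), F_{w} follows a χ^{2} distribution with four degrees of freedom, so the α = 0.05 cutoff should land near 9.49.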
Determination of Significance Threshold (α_{T}) for Combined P Value. For a given weight vector, we compute the ratio of the cumulative observed density to the cumulative expected density by using the real data distributions in the range of α between 0 and a: r(a) = [1 − D^{cdf}(c(a))]/a, where D^{cdf}(·) is the cdf of the observed data distribution, and c(a) corresponds to the integration statistic S_{w} (e.g., F_{w}) value when α = a. Note that c(α = 0) is infinite. As a decreases, the decision boundary moves toward the origin in P-value space. When there are true-positive elements, this results in a large cumulative density ratio because the expected density is small but the observed density is large. As a increases, however, the ratio decreases because the observed density slowly increases relative to the expected density. The plot of the ratio versus a shows that the curve can be split into two regions, one rich in true positives and the other in background data elements (Fig. 11). α_{T} was determined as the x-axis value of the point nearest to the origin.
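A sketch of this threshold search, assuming combined P values are already available for every element: the observed-to-expected ratio is traced over a grid of candidate a values, and the "point nearest the origin" is taken after scaling both axes to [0, 1]. That normalization is our assumption; the paper's Fig. 11 is not reproduced here.

```python
import numpy as np

def alpha_threshold(combined_p, a_grid=None):
    """For each candidate a, compute the ratio of the observed fraction of
    combined P values <= a to the expected fraction a (uniform background);
    return the grid point nearest the origin after [0, 1] axis scaling."""
    p = np.sort(np.asarray(combined_p, dtype=float))
    if a_grid is None:
        a_grid = np.linspace(1e-3, 0.5, 500)
    observed = np.searchsorted(p, a_grid, side="right") / p.size
    ratio = observed / a_grid
    x = a_grid / a_grid.max()        # normalized candidate threshold
    y = ratio / ratio.max()          # normalized density ratio
    return float(a_grid[np.argmin(x**2 + y**2)])
```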
Mixture Distribution Model. For a given weight vector, we define the probability density function (PDF) of a mixture distribution model M as PDF(M) = (1 − θ)PDF(H_{0}) + θPDF(H_{1}) (16, 17). We directly estimated an H_{1} distribution by fitting to the distribution of the real data elements as follows: (i) guess a θ (0 ≤ θ ≤ 1); (ii) define a PDF(H_{0}) by using Monte Carlo simulation; (iii) define a PDF(H_{1}) of integration statistic values (S_{w}) as a distribution of affected data elements by guessing two parameters (a and b) for a gamma distribution; (iv) compute the PDF(M) by using the equation above; (v) calculate the objective function as the squared sum of the difference between the cdf of M and the cdf of the real data, ‖D^{cdf}(S_{w}) − M^{cdf}(S_{w})‖_{2}^{2}; (vi) repeat steps ii–v with a different set of gamma distribution parameters (a and b) and θ and then recompute the objective function; (vii) accept or reject these new trials by using the Metropolis algorithm until the ESA stopping criteria are met.
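The fitting loop can be sketched with a crude random search standing in for the ESA/Metropolis machinery (an intentional simplification); `S_obs` holds the integration statistic values of the real data and `S_h0` holds Monte Carlo draws under H_{0}.

```python
import numpy as np

def fit_mixture(S_obs, S_h0, n_iter=300, seed=2):
    """Random-search stand-in for the ESA fit of PDF(M) = (1-theta)PDF(H0)
    + theta*PDF(H1): pick (theta, a, b) minimizing the squared cdf
    difference between the mixture model and the observed data."""
    rng = np.random.default_rng(seed)
    grid = np.sort(np.asarray(S_obs, dtype=float))
    emp = np.arange(1, grid.size + 1) / grid.size        # empirical data cdf
    cdf_h0 = np.searchsorted(np.sort(S_h0), grid, side="right") / len(S_h0)
    best, best_obj = None, np.inf
    for _ in range(n_iter):
        theta = rng.uniform(0.01, 0.6)                   # proportion of H1
        a, b = rng.uniform(1.0, 12.0), rng.uniform(0.5, 6.0)
        S_h1 = rng.gamma(a, b, size=4000)                # H1 as gamma(a, b)
        cdf_h1 = np.searchsorted(np.sort(S_h1), grid, side="right") / S_h1.size
        obj = np.sum((emp - ((1 - theta) * cdf_h0 + theta * cdf_h1)) ** 2)
        if obj < best_obj:
            best, best_obj = (theta, a, b), obj
    return best
```

On data that are genuinely a mixture of a χ^{2} background and a gamma component, the recovered θ should land near the true mixing proportion.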
Nonparametric (NP) Decision Boundary. For a given α_{T} (see below), the initial set of elements is selected with the union method: C = ∪{P_{i} < α_{T}}, i = {1, 2,..., k}. A weight for each data set was then computed as the nonoverlapping area between the P-value distributions of C and the set of random background elements (B): w_{i} = ½∫|PDF_{C}(S_{i}) − PDF_{B}(S_{i})| dS_{i}, where the instances of S_{i} are the transformed P values for data set i [e.g., when Fisher's method is used, S_{i} = −2 ln(P_{i})]. To quantify the overall statistical power for the two sets C and B, we compute as an objective function, f, the nonoverlapping area between the distributions of weighted integration statistic values (S_{w}) of C and B: f = ½∫|PDF_{C}(S_{w}) − PDF_{B}(S_{w})| dS_{w}. Note that f is related to the statistical power of C, i.e., 1 − cdf(S_{w}^{max}) for C, where S_{w}^{max} is the S_{w} value at which the cdf difference is at a maximum. Also, the final set of selected elements depends on α_{T}. The initial α_{T} value was determined in a manner similar to the mixture distribution method above, except that we approximated the PDF of H_{1} by using kernel density estimation (18).
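The nonoverlapping-area quantity used both for the per-data-set weights and for the objective f can be sketched with a shared-grid histogram estimate; kernel density estimation, as the paper uses for H_{1}, would be a smoother drop-in replacement.

```python
import numpy as np

def nonoverlap_area(x, y, bins=50):
    """Nonoverlapping area between the empirical distributions of samples x
    and y: 0 for identical distributions, approaching 1 for fully separated
    ones (equals half the integrated absolute density difference)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    fx, edges = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    fy, _ = np.histogram(y, bins=bins, range=(lo, hi), density=True)
    return float(0.5 * np.sum(np.abs(fx - fy)) * (edges[1] - edges[0]))
```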
Results
Example Data Sets. For the purposes of illustration, here we use two artificially generated example data sets. In this way, we know the data characteristics a priori and can illustrate and evaluate each step in our methodology unambiguously. The simulated data are designed to be as simple as possible while mimicking data from high-throughput technologies. In a companion paper (10), we demonstrate the application of the methodology to real biological data.
Fig. 1A presents two example data sets generated as described in Methods. Each of our data sets (assays for genes or proteins or their interactions) consists of two types of elements: true positives (the elements affected by an experimental perturbation, which we wish to detect) and true negatives (background elements not affected by the perturbation). As is common for high-throughput technologies, each data element is presented as a P value (the probability of observing a value when the corresponding data element is not affected by the experimental perturbation). Thus, data elements with lower P values are more likely to be true positives. In Fig. 1A, the true negatives are distributed uniformly (shown as gray dots). When plotting multiple data sets together, P values for true positives would ideally be expected to be distributed near the origin. In practice, because of the different noise characteristics of each measurement technology, P values for the same data element may be low in one measurement (along one axis) and not in another. This results in the distribution of true positives near the axes. Note that the true-positive elements in Fig. 1A are asymmetrically distributed. Data set 2 (measured along the y axis) was generated with a lower level of noise (and thus greater statistical power) and therefore has more true-positive data elements with low P values. Finally, note that, although the true positives are highly correlated (as expected), the true negatives are not correlated. For real high-throughput data, true positives will comprise a small proportion of the total data. Thus data sets from different technologies will be largely uncorrelated (19). For the example presented in the companion paper, only 69 of 6,000 genes were selected as potential positive elements; 16,985 protein–protein interactions were selected out of 6,000^{2}; and 8,555 protein–DNA interactions were selected out of 135 × 6,000^{2}. The correlation coefficients between the above data sets were all <0.3. To make it easier to understand and compare the methodologies presented, the same data are used for the analyses presented in Figs. 1, 2, and 3.
Classical Approaches to Data Integration. Fig. 1A shows the decision boundaries for three commonly used approaches to data integration: intersection, union (20), and unweighted Fisher's method (12). As described above, true-positive data elements tend to be distributed near the axes and especially near the origin. Thus, data integration methods typically divide the total P-value space into two regions by determining a decision boundary. Data elements in the region near the axes and/or origin, which is rich in true positives, are selected as significant. Data elements in the complementary region (away from the origin and axes) are considered true negatives. In the intersection and union methods, significant elements are identified independently in each data set using a cutoff P value (e.g., 0.05). For intersection, only the significant elements common to all data sets are selected. This method tends to miss many true-positive data elements near the P_{1} axis and away from the origin (see Fig. 1A). The union method improves this situation by selecting elements significant in any one data set but includes many false positives near the P_{2} axis and may still miss many elements with low geometric-mean P values. Note that the union and intersection methods treat each data set independently. In contrast, Fisher's method selects significant data elements by using a more sophisticated measure that results in a hyperbolic (curved) decision boundary, allowing selection of data elements near both axes and within a larger region near the origin.
Two additional integration methods widely used in meta-analysis to combine P values from multiple data sets are the MG T and LS Z statistics (see Methods for formulae). Fig. 1B illustrates the differences between these methods and Fisher's. The three methods involve different transformations of the P values (see Methods), resulting in different shapes of the decision boundary. As can be seen in Fig. 1B, the LS and MG methods favor data elements with low P values in all data sets (points near the origin). In contrast, Fisher's method permits the selection of elements as long as there is at least one very small P value. Thus, Fisher's method is most effective when P values for true-positive data elements in different data sets tend not to agree (many data elements scattered near the axes), and the MG or LS methods when data sets tend to agree (many data elements scattered near the origin). Importantly, all of these methods treat both data sets equally, producing a symmetric decision boundary, even though our data are highly asymmetric. The methods we present below overcome this limitation by producing asymmetric decision boundaries (pointillist provides all three methods; users should select the method appropriate to the characteristics of their data). For the remainder of this paper, for simplicity, we will use only Fisher's method without loss of generality.
Weighted Integration of P Values. To allow for various levels of statistical power (reliability) in different data sets, we use a weighted method for combining P values. In this approach, P values from different high-throughput assays are transformed, scaled [by a weight representing the reliability (statistical power) of the assay], and summed to generate a combined P value (see Methods). The combined P values are then tested by using Monte Carlo-generated H_{0} distributions as described in Methods. Data elements whose combined P values are below a threshold are accepted as components of the system of interest. All other data elements are rejected. The selection threshold corresponds to a significance level (α) reflecting the proportion of area below the decision boundary. Infinitely many combinations of weights can satisfy a given α (Fig. 9). For a given α, we select weights that maximize the number of elements chosen (Fig. 10, elements between the decision boundary and axes), which maximizes the number of true positives selected because the majority of data elements near the axes are true positives (for illustrative examples, see Fig. 1C). We used ESA to search for such optimal weights (see Methods). Fig. 1C compares the unweighted (green line) and weighted (blue line) Fisher's methods for our example data. In this example, P values from data set 2 are twice as reliable as those in data set 1, thus skewing the P_{2} values correspondingly closer to 0 (see Methods). As shown, a weighted integration method produces an asymmetric decision boundary and can therefore capture more of the true positives.
Selecting a Joint-Significance Threshold (α). As shown in Figs. 7 and 8, the choice of the α threshold is constrained by the opposing needs to minimize false positives and false negatives. A large α value can produce a high false-positive rate, whereas a small α value can lead to a high false-negative rate. To estimate a suitable threshold α value (α_{T}), we note that true-positive data elements cluster near the axes whereas the remaining data elements tend to be uniformly distributed. Therefore, one way to select α_{T} would be to find the value of α for which the resulting decision boundary best separates uniformly distributed data elements from the nonuniform distribution of data elements near the axes. For a given α, the cumulative density of the data elements measures the fraction of data elements selected. The cumulative expected density of the background (uniformly distributed) data elements is α. Thus, to compute α_{T}, we calculate the ratios of cumulative observed densities to cumulative expected densities as α is increased from zero. α_{T} is the x-axis value of the point nearest to the origin (see Methods and Fig. 11). Because the data integration weights are calculated for a given α, it is necessary to recalculate the weights for α_{T}. For these recalculated weights, another α_{T} is determined. This procedure is repeated until α_{T} converges. The black line in Fig. 1C shows the decision boundary arrived at using this method for selecting α_{T}. Note how this boundary is much more stringent, resulting in fewer false positives.
Improved Threshold Significance Level Selection by Using Mixture Distribution Models. The above approach to estimating α_{T} uses only the distribution of the putative background data elements. As such, this approach focuses on minimizing the false-positive rate only. By explicitly dividing the data elements into two populations (a putative true-positive set, H_{1}, and a putative background set, H_{0}), we can estimate false-positive and false-negative rates in each set. This approach allows optimization of the membership of H_{0} and H_{1} to minimize the false-positive and false-negative rates, as follows. (i) For an initial value of α (e.g., 0.05), determine the weight vector as described in Weighted Integration of P Values. (ii) Assume that the full set of observed data elements (green histogram in Fig. 2A) divides into two sets, H_{1} and H_{0}. (iii) The PDF of observed data elements can be approximated by a "mixture model" (MM), M: PDF(M) = (1 − θ)PDF(H_{0}) + θPDF(H_{1}), where 0 ≤ θ ≤ 1 is the proportion of H_{1} in M (red curve in Fig. 2A). (iv) For a given set of weights, the PDF of H_{0} (dashed blue curve in Fig. 2A) can be evaluated numerically by Monte Carlo sampling of a uniform distribution (or the corresponding χ^{2}, T, or Z distributions; see Methods for details). (v) Guess an initial value for θ. (vi) Estimate the PDF of H_{1} (black curve in Fig. 2A) as a gamma distribution that best satisfies the relationship in step iii above, using ESA (see Methods). (vii) Determine α_{T} as the right-hand tail of H_{0} measured from the point where the H_{1} and H_{0} PDFs meet [marked as c(α_{T}) in Fig. 2A]. (viii) Repeat steps i–vii (repeating step i with α = α_{T}) until the α_{T} estimates converge. Finally, the significant elements are selected by using the decision boundary given by α_{T}.
Fig. 2B shows the decision boundaries arrived at through the above procedure (black curve, α_{T} = 0.015) and, for comparison, for α = 0.05. Note that MM-based estimation of α_{T} reduces the number of false positives considerably while incurring relatively few additional false negatives (see Table 1, which is published as supporting information on the PNAS web site). The explicit MM also has fewer false negatives than the implicit method presented in the previous section (see Fig. 4 for a quantitative comparison using more demanding data, which confirms that the MM method provides better overall performance).
Using an NP Decision Boundary. The above methods are all based on smooth parametric decision boundaries derived from assigning a reliability parameter (the weight) to each type of data. In practice, there are many sources of error for a given measurement, each affecting the resulting P values in an irregular manner. In such cases, smooth parametric decision boundaries may be unable to accommodate these irregularities. To allow estimation of non-smooth decision boundaries, we have developed a heuristic NP method, as follows (see Methods for details).
First, we construct a first-pass set of candidate true-positive data elements C whose P values are less than a threshold α_{T} (see Methods) in at least one data set. In Fig. 3A, these are the data elements between the axes and the dashed blue lines. The remaining elements are grouped into the complement data set C^{c}. We also generate a set of random background elements (B; see Methods). Second, we compute a weight for C and B proportional to the nonoverlapping area between the two PDFs of transformed P values of the elements in C and B (see Methods). Third, we compute the combined P value statistic (S_{w}). Fourth, we quantify the statistical power of the current memberships of C and B as the nonoverlapping area between the corresponding S_{w} PDFs of C and B. Fifth, if the nonoverlapping area (f) is less than a user-specified threshold (e.g., f_{c} = 0.99), we remove putative false-positive elements with high S_{w} values from C and add them to C^{c}. The green crosses in Fig. 3A indicate data elements retained in C after this step. Sixth, steps 2 through 5 are repeated by using the new C and C^{c} sets until f reaches f_{c}. Finally, once the iteration process is terminated, potential false negatives are identified as elements in C^{c} whose S_{w} values exceed that of the point where the S_{w} PDFs of C and C^{c} meet. These elements are then added to the final set of candidate true-positive data elements C. Fig. 3B shows the final set of selected data elements (green crosses). For comparison, the dashed blue curve shows the decision boundary calculated using the MM (see previous section).
The above procedure is essentially the same as the parametric method described in the preceding section. The main difference is that, rather than moving a decision boundary (as in previous methods), here we move individual data elements between C and C^{c}. The final decision boundary is irregular because elements in different parts of the P-value space are removed as the weights change in each iteration. To speed up the optimization process and avoid potential local minima in the search process, we have limited our optimization process to pruning. To be sure that the final set C includes as many of the true-positive data elements as possible, we initialize C by using a value for α_{T} that assures the inclusion of most true data elements at the expense of many false positives. These false positives are then removed iteratively. An adaptive rule is used to determine the fraction of C removed in step 5: C_{f} = min[(f_{c} − f)/f_{c}, 0.01]. This rule removes 1% of C in the beginning but decreases the proportion of elements moved as f approaches f_{c}.
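The adaptive rule is small enough to state directly in code:

```python
def prune_fraction(f, f_c=0.99):
    """Adaptive pruning rule C_f = min[(f_c - f) / f_c, 0.01]: remove 1% of
    C early on, and a shrinking fraction as f approaches the target f_c."""
    return min((f_c - f) / f_c, 0.01)
```

At f = 0 the rule returns exactly 0.01; once f exceeds 0.99 · f_{c}, the returned fraction drops below 1% and tends to zero as f → f_{c}.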
Comparison of the Performance of Different Integration Methods. The example data used in the previous figures were intentionally simplistic for ease of understanding. To highlight differences between the methods when applied to more complex data sets, here we compare the performance of the methods discussed by using 10 examples of integration of three data sets. The individual data sets were generated by using the same scheme as described earlier, but with approximately three to four times more noise and less distinct populations of true positives (reduced a and b parameters for the gamma distributions). See Methods for details. Fig. 4A shows the distribution of the true-positive data elements for an example case.
Fig. 4B summarizes the performance of eight different data integration methods on the above 10 examples. For each run of each method, the falsepositive rate is plotted along the x axis and the truepositive rate is plotted along the y axis. The 10 examples are identified by numbers 1–10. Each method is identified by a different marker shape and color (see legend). Instances of the letter M mark the mean performance of the correspondingly colored method. In a manner similar to receiver operating characteristic graphs (21), ideal performance is given by X = 0, Y = 1. The Euclidean distance from this point represents the degree of performance degradation from the ideal. The dashed contour lines in Fig. 4B represent loci of equal performance as defined by the above measure.
The NP and MM methods outperform all other methods both case by case and on average. The performance of the union method depends crucially on the choice of significance level [compare U1 (α = 0.05) with U2 (α = 0.15), which happens to be close to the α of the MM and NP methods]. The intersection method I1 (for which α = 0.05) performs worse than all other methods. I2 performance (α = 0.05^{3}) is not shown because its true-positive rate was well below 0.5. As expected, the thresholded Fisher's method, in which we minimize the number of false positives automatically, produced the smallest false-positive error rate at the expense of lowering the true-positive rate. However, as shown in Table 1, for less noisy data, the false-positive reduction method can be very effective. Finally, comparison of the weighted and unweighted Fisher's methods highlights the performance advantage of the weighted method for data sets with unequal amounts of noise, whereas the contour gradient between the weighted Fisher's method and the MM and NP methods highlights the importance of optimizing α_{T}. Overall, MM and NP perform best for minimization of false-positive and false-negative rates, followed by the thresholded Fisher's method, which mainly reduces the false-positive rate.
Handling Missing Values. Missing data points can arise from systematic biases in high-throughput technologies, e.g., favoring high-abundance species. Missing values also occur when we integrate global data sets with curated database information. For methods that combine independent, uniformly distributed P values into a uniformly distributed P value, missing values may be handled by using only the available P values to compute each combined P value. The simplest approach would be to assign a fixed P value in place of the missing data (e.g., P = 0.5 to indicate an equal chance of being a true positive or a true negative, or P = 1.0 to discourage elements with missing values from being selected). If one or more data sets include many missing values, however, this approach can distort the calculated weights. To avoid this distortion, we exclude data elements with missing values from the calculation of weights, but we retain them in the data selection process so that elements with missing values can still be selected.
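A minimal sketch of the drop-or-fill strategies described above, using a weighted Liptak–Stouffer combination (the LS method of the paper's abbreviations; the function names and the equal weights in the example are illustrative):

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution

def liptak_stouffer(pvalues, weights):
    """Weighted Liptak-Stouffer combination of independent P values:
    Z = sum(w_i * z_i) / sqrt(sum(w_i^2)), with z_i = Phi^{-1}(1 - p_i)."""
    z = sum(w * _N.inv_cdf(1.0 - p) for p, w in zip(pvalues, weights))
    z /= sum(w * w for w in weights) ** 0.5
    return 1.0 - _N.cdf(z)

def combine_with_missing(pvalues, weights, fill=None):
    """Combine P values for one element when some data sets did not
    measure it (missing entries are None).

    fill=None drops the missing entries and combines the rest with
    their weights; fill=0.5 (say) substitutes a neutral P value instead.
    """
    if fill is not None:
        return liptak_stouffer([fill if p is None else p for p in pvalues],
                               weights)
    pairs = [(p, w) for p, w in zip(pvalues, weights) if p is not None]
    ps, ws = zip(*pairs)  # assumes at least one observed P value
    return liptak_stouffer(list(ps), list(ws))

# Element measured in only two of three data sets, equal weights:
print(round(combine_with_missing([0.01, None, 0.04], [1.0, 1.0, 1.0]), 4))
```

Note that substituting a neutral P = 0.5 dilutes the evidence from the observed data sets, whereas dropping the missing entry combines only what was actually measured.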
Conclusions
We have presented a generalized framework for data integration in systems biology. The applicability of our methodology to different types and sizes of data and to different numbers of data sets is demonstrated by application to five different types of data integration in a companion paper (10). Although we focused here on presenting our methodology from the perspective of maximizing statistical power, it can also be applied to scenarios in which the data types being integrated differ systematically, for example, combining mRNA and protein abundance measurements or in vivo and in vitro measurements. Examples of this type of integration are given in the companion paper (10). Data integration can never rule out inclusion of some false positives or loss of some true positives. Our methodology provides a framework for optimizing the trade-off between these opposing demands. We have implemented all of the methods presented in a free open-source software package (pointillist) that allows users to select the integration method most appropriate to their needs.
Our methodology provides a simple and efficient means for combining multiple sets of noisy data to produce probabilistic models. The final outcome of our data integration procedure is a network model in which nodes represent biomolecular species (e.g., genes or proteins) and edges represent interactions (e.g., transcriptional regulation). Our methodology associates a P value with each node and edge in the network model. These P values indicate the degree of confidence in a node or edge being a true component of the system of interest (compared with background/control).
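The kind of annotated network the procedure produces can be pictured with a minimal data structure (the names are illustrative, not the pointillist API):

```python
from dataclasses import dataclass, field

@dataclass
class Network:
    """Integration output: nodes and edges annotated with combined
    P values expressing confidence versus background/control."""
    node_p: dict = field(default_factory=dict)   # node -> P value
    edge_p: dict = field(default_factory=dict)   # (src, dst) -> P value

    def significant(self, alpha: float):
        """Return the nodes and edges whose P value falls below alpha."""
        return ({n for n, p in self.node_p.items() if p < alpha},
                {e for e, p in self.edge_p.items() if p < alpha})

# Hypothetical two-gene example with one regulatory edge:
net = Network(node_p={"geneA": 0.002, "geneB": 0.30},
              edge_p={("geneA", "geneB"): 0.01})
nodes, edges = net.significant(0.05)
print(sorted(nodes), sorted(edges))  # -> ['geneA'] [('geneA', 'geneB')]
```

Keeping the P values on the model, rather than thresholding once and discarding them, lets downstream analyses revisit the confidence cutoff.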
Acknowledgments
We thank Frederick Roth for insightful and critical analysis of an earlier version of our data integration methodology. A.F.S. holds the Grant I. Butterbaugh Professorship at the University of Washington.
Footnotes

‡ To whom correspondence may be addressed. E-mail: hbolouri@systemsbiology.org or lhood@systemsbiology.org.

† Present address: Pfizer Global Research and Development, Safety Sciences, Eastern Point Road, Groton, CT 06340.

Conflict of interest statement: No conflicts declared.

Abbreviations: LS, Liptak–Stouffer's; MG, Mudholkar–George's; NP, nonparametric; MM, mixture model; cdf, cumulative distribution function; PDF, probability density function; ESA, enhanced simulated annealing.
 Copyright © 2005, The National Academy of Sciences
References

Hood, L., Heath, J. R., Phelps, M. E. & Lin, B. (2004) Science 306, 640–643. pmid:15499008

Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., et al. (2002) Science 298, 799–804. pmid:12399584

Mrowka, R., Patzak, A. & Herzel, H. (2001) Genome Res. 11, 1971–1973. pmid:11731485

Deane, C. M., Salwinski, L., Xenarios, I. & Eisenberg, D. (2002) Mol. Cell. Proteomics 1, 349–356. pmid:12118076

Gilchrist, M. A., Salter, L. A. & Wagner, A. (2004) Bioinformatics 20, 689–700. pmid:15033876

Hwang, D., Smith, J. J., Leslie, D. M., Weston, A. D., Rust, A. G., Ramsey, S., de Atauri, P., Siegel, A. F., Bolouri, H., Aitchison, J. D. & Hood, L. (2005) Proc. Natl. Acad. Sci. USA 102, 17302–17307. pmid:16301536

Hedges, L. & Olkin, I. (1985) Statistical Methods for Meta-Analysis (Academic, San Diego).

Siarry, P., Berthiau, G., Durdin, F. & Haussy, J. (1997) ACM Trans. Math. Software 23, 209–228.

Hildebrand, F. (1956) Introduction to Numerical Analysis (McGraw–Hill, New York).

Bowman, A. W. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis (Oxford Univ. Press, Oxford).

Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N. J., Chung, S., Emili, A., Snyder, M., Greenblatt, J. F. & Gerstein, M. (2003) Science 302, 449–453. pmid:14564010

Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R. & Hood, L. (2001) Science 292, 929–934. pmid:11340206

Swets, J. A. (1988) Science 240, 1285–1293. pmid:3287615