## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Physically grounded approach for estimating gene expression from microarray data

Edited* by H. Eugene Stanley, Boston University, Boston, MA, and approved June 28, 2010 (received for review January 22, 2010)

## Abstract

High-throughput technologies, including gene-expression microarrays, hold great promise for the systems-level study of biological processes. Yet, challenges remain in comparing microarray data from different sources and extracting information about low-abundance transcripts. We demonstrate that these difficulties arise from limitations in the modeling of the data. We propose a physically motivated approach for estimating gene-expression levels from microarray data, an approach neglected in the microarray literature. We separately model the noises specific to sample amplification, hybridization, and fluorescence detection, combining these into a parsimonious description of the variability sources in a microarray experiment. We find that our model produces estimates of gene expression that are reproducible and unbiased. While the details of our model are specific to gene-expression microarrays, we argue that the physically grounded modeling approach we pursue is broadly applicable to other molecular biology technologies.

One thousand manuscripts are published each year involving microarray technology.† In spite of the 15-year history of the field, those manuscripts still describe a wide variety of data analysis methods, many of them poorly specified. Indeed, criticisms of the validity and reproducibility of microarray experiments have dogged the technology since its inception. There are two possible explanations for these shortcomings: (*i*) inherent limitations of the microarray technology that constrain its utility or (*ii*) modeling strategies that are not appropriate. The former is potentially a fundamental problem that can be overcome only with technological advances. This hypothesis has led to candid speculation that emerging sequencing technologies will quickly replace microarrays as the de facto genome-wide expression analysis technique (1, 2).

An alternative view is that current shortcomings result from gaps in our understanding of how to model the data generated in microarray experiments. In order to pursue this point, let us consider the motivation for the “standard” model (3). The fluorescence intensity *F*_{i} (Fig. 1*A*) detected at a spot *i* is surmised to be the sum of a background term and a term related to the expression level *E*_{i} we want to estimate,[1]Oddly, the standard model assumes that *B*_{i} can be directly determined from the fluorescence intensity measured in the nonfeature region surrounding the spot.^{‡} The dependence on *E*_{i} is assumed to be distorted by multiplicative noise (3). These assumptions yield [2]where *ν*^{sp} is normally distributed with zero mean, and *A*_{i} is a parameter capturing the effects of hybridization efficiency and dye-specific and experiment-specific factors.

Because of the difficulty in estimating systematic effects affecting the value of *A*, microarray experiments are frequently performed with an internal control, the goal being to determine change of expression *R*_{i} between two conditions, 1 and 2, instead of the expression level for each condition: [3]where , and are the best estimates of *R*_{i} and *E*_{i}. Because, according to Eq. **2**, *F*_{i} and *B*_{i} can be directly measured, the crux of the traditional approach is to estimate .

In dye-swap experiments, for which the two conditions are identical, one can develop a number of reasonable expectations for *p*(*R*_{i}) and *p*(*R*_{i}|*E*_{i}). Assuming no correlations in the values of , one expects the average value of *R*_{i} to be zero. Moreover, assuming that *A*_{i} is nonnegligible, one expects the standard deviation of *R*_{i} to decrease with increasing *E*_{i}. Unfortunately, neither of these expectations is typically obeyed by the data (Fig. 1 *B*, *C*, *D*).

As a result, the field has failed to converge on a single, robust model. Instead, publications reporting microarray data include a bevy of variations of this standard model. *In many cases, these models were “rescued” to achieve the aforementioned expected properties by the use of idiosyncratic nonlinear corrections.* Exemplifying this are the data reanalyzed in this manuscript—the authors of the studies considered have used different normalization techniques (4, 5).

Here, we argue that background fluorescence intensity cannot be correctly estimated by . Nonspecific hybridization is the dominant factor determining *B*_{i}. In order to correctly estimate *B*_{i}, we propose a dramatically distinct approach to determining gene-expression levels from microarray data. Instead of attempting to surmise a functional expression for *F*_{i}, we model each of the processes that constitute a microarray experiment. Remarkably, by propagating the fluctuations one expects in each stage of the protocol, we arrive at a *concise expression* relating *E*_{i} to measured quantities in the experiment.

We find that our model is able to capture the properties of microarray data for thousands of experiments. Moreover, our model yields *reproducible* estimates of changes in expression level.

## The Physically Grounded Approach

The protocol for two-color cDNA microarray experiments is now essentially standard (6). The measurement component has three main stages, each with its own characteristics. Sample preparation consists of mRNA extraction, purification, amplification, and labeling. Hybridization is the process by which differently labeled targets bind surface-associated probes. Detection is the excitation and scanning of surface-associated fluorophores. In the following, we describe and model each of these stages.

Consider a biological sample consisting of *E*_{i} copies of transcript *i*, with *i* = 1,…,*N*_{tr}. The quantity of RNA derived from a biological sample is typically insufficient for efficient quantification by current experimental methods. Thus, sample amplification is necessary. One of two methods is typically employed to amplify the original messenger RNA: (*i*) expression in a T7 viral vector or (*ii*) polymerase chain reaction. Amplification by T7 vector expression is currently the preferred method because it results in smaller variability for high expression levels (7); thus, we consider it here (*SI Text*).

cDNA vectors are prepared from sample mRNAs by incorporating the T7 polymerase promoter into reverse transcriptase primers. Approximately one vector arises from each mRNA. We assume that transcription of these vectors to RNA is kinetically limited by the rate *R*_{b} of binding of T7 polymerase to transcription start sites (8). In our model, we disregard sequence or length dependent effects on transcription rate (7, 9).

In a well-mixed solution, transcripts of gene *i* are produced at a characteristic rate, *E*_{i}*R*_{b}. Under experimental conditions, the number of transcripts present after running the process for a time *t* is described by a Poisson process with parameter *E*_{i}*R*_{b}*t*. We expect the amplification gain to be very high, that is, *R*_{b}*t*≫1. In this limit, the Poisson distribution of number of transcripts arising from this process converges to a Gaussian distribution. This implies that the number *n*_{i} of copies of cDNA for gene *i* available for hybridization is a Gaussian variate with mean and variance equal to *E*_{i}*R*_{b}*t*.

Consider now competitive hybridization in a solution that is well-mixed and let *p*_{ii} be the probability of specific hybridization of target *i* to feature *i*. *p*_{ii} may depend on the sequence of gene *i* and on experimental conditions such as temperature and buffer concentration, but typically probe sequences are selected so that *p*_{ii} is approximately constant. Thus, we assume that *p*_{ii} = *p*_{sp} for all *i* and let its fluctuations be incorporated into the noise. We suggest that the number of specifically hybridized probes in the feature follows a binomial distribution with parameter *p*_{sp}. If *n*_{i}*p*_{sp}≫1, then the central limit theorem holds, and is a Gaussian variate with mean *n*_{i}*p*_{sp}, [4]where and are Gaussian variates with zero mean.

Similarly, let *p*_{ji} be the nonspecific hybridization efficiency for gene *j* to probe *i*. The number of hybridized probes *j* in feature *i* will then be [5]where is again a Gaussian variate with mean zero. Note that *p*_{ji} ≪ *p*_{ii} for all *j* ≠ *i*. The total contribution of nonspecific hybridization from all targets to the observed signal will then be [6]Estimating *p*_{ji} directly for all pairs of transcripts is not feasible in practice. In order to proceed, we thus use a mean-field approximation. Specifically, we assume that no single gene is responsible for a significant fraction of all mRNA targets. We further assume that *p*_{ji} is not dependent strongly on *j* or *i*; that is *p*_{ji} ≈ *p*^{nsp}. Under these assumptions, Lyapunov’s central limit theorem applies, yielding [7]where *U*^{′} is the characteristic contribution of nonspecific hybridization and is a Gaussian variate with zero mean.

The fluorescence generated by the excitation of the spots on the chip will be amplified in the scanning process. Amplification using a photomultiplier is characterized by a dye-specific gain *G* that is a function, in principle, of dye incorporation rate, dye properties, laser power, and detector characteristics, yielding a detected fluorescence [8]where , the noise associated with stage *k* of amplification, is normally distributed with mean zero. We assume that the gain is constant and does not depend on the intensity of the signal, or on any other spot property; that is, *G*_{i} = *G*. We also assume that there is no specific interaction between a dye molecule and either target or probe sequence, an assumption that we find fails for some probes (*SI Text*).

Because the variability for each term in Eq. **8** is the product of several independent Gaussian variables, the terms will converge to a log-normal distribution. We can therefore write Eq. **8** as [9]where *A* ≡ *R*_{b}*tp*_{sp}*G*, *U* ≡ *U*^{′}*G*, and the noise terms and are normally distributed with mean zero and standard deviations *σ*_{sp} and *σ*_{nsp}.^{§} Our physically grounded model thus has four parameters that relate *E*_{i} to *F*_{i}: *A*, *U*, *σ*_{sp}, and *σ*_{nsp}.

Although formally similar, Eq. **9** differs significantly from the standard statistical model typically assumed for observed feature intensity, Eq. **2**. Here, *B*_{i}, the nonfeature intensity local to spot *i*, is measured from the data. This is in contrast to our interpretation of additive noise in a microarray experiment, which is dominated by nonspecific hybridization. Because background fluorescence and nonspecific hybridization cannot be decoupled, we explicitly model the latter.

## Model Validation

We next compare the predictions of Eqs. **2** and **9** for the distribution of observed feature intensities *p*(*F*). To this end, we must surmise a functional form for the distribution of expression levels *p*_{E}(*E*). We expect *p*_{E}(*E*) to be strictly decreasing; most genes have very low expression levels, whereas a few genes have high expression levels. Following recent reports (10–12), we assume that *p*(*E*) exhibits a power law decay, such that [10]

We derive *p*(*F*) for both models and obtain maximum likelihood estimates of the model parameters—including *α*—by the method of steepest descent (*SI Text*). We find that our model predicts the empirical distributions extraordinarily well, whereas the statistical model does not (Figs. 2 *A* and *B*). Note that because Eq. **2** includes two observed quantities (*F*_{i} and *B*_{i}), the distribution in Fig. 2*B* is expressed as a function of (*F*_{i} - *B*_{i}). For the *Arabidopsis thaliana* chips we considered, our results suggest that *p*_{E}(*E*) follows a power law decay with *α* ≈ 1.7, consistent with previous reports (12).

To summarize the abilities of the standard statistical model and the physically grounded model to reproduce the distributions of observed fluorescences, we fit parameters for both models to 894 Agilent gene-expression chips from the compendium of arrays in the National Center for Biotechnology Information Gene Expression Omnibus (GEO) for which raw data has been deposited. For each of these chips, we computed the error, *e*_{m} of the fit to the model, [11]where *p*_{e} and *p*_{m} are the empirical and model-derived probability density functions, respectively. *For 91% of the chips, our model results in a better description of the distribution of fluorescence intensities* (Fig. 3*A*).

### Intensity-Dependent Dye Bias.

Next, we consider a metric of relative expression change; see Eq. **3**. In the special case that for all *i*—as would occur in a dye-swap experiment—one expects the distribution of *R*_{i} to be symmetric about its mean, zero. In this special case, because there is no expression change, we denote the observed *R*_{i} = *R*^{0} to indicate the absence of an underlying signal. We utilize data from experiments employing a dye-swap design to investigate the presence of bias in the estimation of *E*_{i}. These experiments employ a technical replicate of each sample, alternating the labeling scheme on the second replicate. This procedure yields a pair of realizations and that arise from a single set of expression levels.

A common assumption in the literature is that the gain of a dye-detector system can depend on the signal intensity in a profound way, giving rise to intensity-dependent dye bias (Fig. 1*B*). Because no theoretical expression exists to describe the dependence of this bias on *F*_{i}, investigators use nonlinear regression techniques such as the scatterplot smoothing algorithm lowess to correct affected data (Fig. 1*C* and *SI Text*) (13, 14).

We propose that the prevalence of this bias is due in large part to an incorrect adjustment for the additive noise in the experiment, which, after data are logarithmically transformed, manifests itself nonlinearly. We investigated the existence of intensity-dependent dye bias in estimates from our model and found that *R*^{0} estimates using our model show only a very weak dependence on *E* (Fig. 1*D*).

To more completely address the extent of dye bias in estimates generated from these models, we quantified its presence in the 894 microarrays described above. While these chips are not typically performed in dye-swap arrangement, the extreme heterogeneity of the sample origins motivates the assumption that *R*^{0} is typically zero centered and does not depend on *E*.

We define the extent of bias of a model, *b*_{m}, [12]where is the 100-point moving average of for model *m*. *We found that this bias is greater for the estimates using the standard statistical model in 78.5% of the chips we considered* (Fig. 3*B*).

Not surprisingly, the lowess-corrected statistical model decreases the dependence of on *E*, compared to the standard statistical model. However, our model yields estimates of that are no more biased than those observed with the lowess-corrected model (Fig. 3*C*).

### Reproducibility.

We next assess intra- and interlab experimental reproducibility. We consider microarray experiments performed at three labs with identical reference samples from two different sources (Universal Human Reference; Human Brain Reference) (4). Each lab performed five replicates of four protocols, including either competitive hybridization of the same sample or of different samples. We measured the correlation between the estimated expression level changes across protocols, replicates, and labs for our model, the standard statistical model, and the statistical model used in ref. 4. For identical samples, we expect correlations across microarray experiments to be close to zero. We find an average correlation of 0.10 using our model’s estimates. For distinct samples we expect a correlation close to 1 and find a mean correlation of 0.79 for our model’s estimates (Fig. 4).

## Model Implications

Eq. **2** and similar models supply no direct means of determining the significance of gene-expression changes and therefore often rely on arbitrary thresholds. The intrinsic difficulty of quantitatively establishing significance of microarray results is highlighted in Fig. 1*D*—the scale of fluctuations varies nonlinearly with expression level. Our model enables us to quantify in a natural way the probability of rejecting the null hypothesis that a gene’s expression level is unchanged between samples. To identify genes that are expressed differentially between two samples, we consider a null model for *R*_{i}, which assumes that the expression in the two samples is identical. Specifically, given two fluorescence levels and , the probability density that the corresponding expression levels and are identical is [13]We denote the expression ratios estimated from the physically grounded and statistical models as and , respectively. One can determine the expected distribution of *R*^{0} values for any expression level if a gene’s expression level has not changed (*SI Text*). This distribution will depend on the parameter estimates for the model, but Fig. 5 shows that the confidence intervals for lower expression levels are larger than the corresponding intervals for more highly expressed genes. This is expected, because nonspecific hybridization constitutes a larger fraction of *S*_{i} (and consequently of *F*_{i}) for low *E*_{i}, resulting in *R*_{i} estimates with larger uncertainty.

## Enrichment of Experimental Power

To further illustrate the limitations of the current approaches and the significance of our approach, we next applied our methods to a dataset reported in ref. 5. In this study, three cohorts of animals were fed different diets: standard lab diet (SD), high-calorie, high-fat diet (HC), and the high-fat diet supplemented with the small molecule resveratrol (HCR).

Resveratrol has been shown to extend life span in several model organisms (15, 16). Baur et al. (5) suggested the existence of a molecular basis for the phenotypic similarity they observe between SD and HCR mice. As such, they performed microarray experiments to test the hypothesis that HCR animals are transcriptionally similar to SD animals.

RNA from the livers of animals from each feeding protocol were hybridized against a pool of RNA from the SD mice. From these chips we estimated (again, filtering for poor-quality spots and prevalent sequence-dependent dye bias, *SI Text*) using the statistical model, our physically grounded model, a lowess-corrected statistical model, and the z-score normalization used by Baur et al. (5, 17) (see *SI Text*). We computed the correlation between for each pair of chips for each model to understand the degree of specificity and sensitivity that each imparts and to test the hypothesis that HCR animals are transcriptionally similar to SD animals, whereas HC animals are distinct from both.

If there is a robust difference between expression changes between two samples, one expects the expression change to be consistent—a high correlation—across replicates. If there is no difference, one expects weak correlation (Fig. 6*A*). The similarity between the estimates derived from the statistical model for the HCR animals are statistically indistinguishable from the similarity between the control animals (Fig. 6*B*). The authors of ref. 5 had to use a higher-level pathway analysis (18) to distinguish between the HCR and HC mice. In contrast, estimates derived from our model strongly support the hypothesis that HCR animals are transcriptionally similar to SD animals (Fig. 6*B*). Our analysis indicates that the difference is clear at the level of correlation of individual gene-expression changes, an effect that is unobservable using other models. *This leads us to conclude that our model imparts greater statistical power.*

Having established the statistical legitimacy of our model, we investigated what practical implications it has for identifying consistent sets of up- and down-regulated transcripts. For our model and the three others, we determined the 100 genes most likely to be up- and down-regulated for each chip. For the HC and HCR chips, we aggregated genes that were represented in at least three of four sets (Fig. 6*C*). Given the hypothesis that HCR animals are similar to control animals, we expect that there is much less consistency between the HCR sets (i.e., small circles). Also, if the HCR and HC animals are distinct, they should have very few genes common across conditions (i.e., small overlap of the circles). This is the behavior we observe in the sets derived from our model—there is very little consistency between the HCR sets, but the HC sets are robust—but *not* for the other models.

## Discussion

We demonstrate here that a physically grounded approach successfully models the outcome of gene-expression microarray experiments. Whereas linear models assuming normally distributed error terms may be appropriate models for many experiments in biology, they fail in many high-throughput applications due to the multiplicative nature of propagating fluctuations. For these experiments, consideration of the physical processes responsible for the outcome is essential. Our model, although constructed with two-color expression microarrays in mind, is generalizable to other systems. As chip-based assays and other high-throughput technologies continue to evolve, it will become increasingly important to establish physically grounded models for the resulting data. Although the specifics of a particular protocol may vary, a physically grounded model can be derived to understand any procedure that is composed of serial, fundamentally understood stages. For these experiments, statistical models are often the first approach because physically grounded models may be perceived as difficult to develop. In many cases, the benefits of the physically grounded modeling approach are appreciable and may outweigh increased developmental difficulty.

We have found that this approach produces a model for microarray data that reproduces macroscopic properties of the chip and results in estimates of expression changes that are nearly free of intensity-dependent dye bias, an artifact that has been traditionally rectified using ad hoc approaches. As a result, the estimates we obtain of the expression levels are systematically reproducible within and across laboratories. In addition, our model allows us to assign confidence to expression changes, even in experiments devoid of technical and biological replicates.

Our study provides yet another cautionary tale of the ad hoc adjustment of models of complex data. The standard statistical model of microarray data has many laudable features: It is simple, it has easily understood parameters, and it is readily testable. Indeed, the model’s inability to capture even basic properties of the data (Figs. 1, 2, and 3) would strongly suggest the need to reject it. Surprisingly, instead of rejecting the model, the course followed by the field has been its “rescuing” with uncontrolled and unjustified corrections. Our study shows that these corrections are unnecessary.

## Acknowledgments

The authors thank J. Wang, J. Widom, W. Kibbe, and members of the Amaral and Morimoto groups for helpful discussion and friendly review of the manuscript. This work was supported by a National Institutes of Health P50 grant; the Keck Foundation (L.A.N.A.); and the National Institute for General Medical Science, the National Institute for Aging, the Rice Institute for Biomedical Research, and the Huntington’s Disease Society of America (R.I.M.).

## Footnotes

^{1}To whom correspondence should be addressed. E-mail: amaral{at}northwestern.edu.Author contributions: P.D.M., R.I.M., and L.A.N.A. designed research; P.D.M. performed research; P.D.M. and L.A.N.A. analyzed data; and P.D.M., R.I.M., and L.A.N.A. wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1000938107/-/DCSupplemental.

↵

^{†}Medical Subject Heading (MeSH) term “microarray analysis/methods.”↵

^{‡}Due to differences in surface chemistry of feature and nonfeature regions, one cannot reasonably expect that*B*_{i}is representative of the background fluorescence in the feature region.↵

^{§}For now, we assume that there are no features with quality problems (see*SI Text*).

Freely available online through the PNAS open access option.

## References

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵(2008) Two-Color Microarray-Based Gene Expression (Quick Amp Labeling) Protocol, Agilent Technologies, Technical Report G4140-90050 v.5.7.
- ↵
- ↵
- ↵
- ↵
- ↵
- Hoyle D,
- Rattray M,
- Jupp R,
- Brass A

- ↵
- Ueda H,
- et al.

- ↵
- Bittner ML,
- Chen Y,
- Dorsel AN,
- Dougherty ER

- Yang YH,
- Dudoit S,
- Luu P,
- Speed TP

- ↵
- ↵
- ↵
- ↵
- Cheadle C,
- Vawter MP,
- Freed WJ,
- Becker KG

- ↵
- ↵
- Fauteux F,
- Chain F,
- Belzile F,
- Menzies J,
- Belanger R

- ↵
- Reiner A,
- Yekutieli D,
- Benjamini Y

- ↵
- ↵

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Applied Physical Sciences