Previous Article |
Table of Contents
| Next Article
Departments of Statistics and Human Genetics, University of
California, Los Angeles, CA 90095
Edited by Peter J. Bickel, University of California, Berkeley,
CA, and approved October 30, 2000 (received for review August 21, 2000)
Recent advances in cDNA and oligonucleotide DNA arrays have made it
possible to measure the abundance of mRNA transcripts for many genes
simultaneously. The analysis of such experiments is nontrivial because
of large data size and many levels of variation introduced at different
stages of the experiments. The analysis is further complicated by the
large differences that may exist among different probes used to
interrogate the same gene. However, an attractive feature of
high-density oligonucleotide arrays such as those produced by
photolithography and inkjet technology is the standardization of chip
manufacturing and hybridization process. As a result, probe-specific
biases, although significant, are highly reproducible and predictable,
and their adverse effect can be reduced by proper modeling and analysis
methods. Here, we propose a statistical model for the probe-level data,
and develop model-based estimates for gene expression indexes. We also
present model-based methods for identifying and handling
cross-hybridizing probes and contaminating array regions. Applications
of these results will be presented elsewhere.
Oligonucleotide expression
array technology (1) has recently been adopted in many areas of
biomedical research. As reviewed in ref. 2, 14 to 20 probe pairs are
used to interrogate each gene, each probe pair has a Perfect Match (PM)
and Mismatch (MM) signal, and the average of the PM-MM differences for
all probe pairs in a probe set (called "average difference") is
used as an expression index for the target gene. Researchers rely on
the average differences as the starting point for "high-level
analysis" such as SOM analysis (3) or two way clustering (4).
Besides the original publications by Affymetrix scientists (1, 5), there have been very few studies on important "low-level"
analysis issues such as feature extraction, normalization, and
computation of expression indexes (6).
One of the most critical issues is the way probe-specific effects
are handled. We have found that even after making use of the control
information provide by the MM intensity, the information on expression
level provided by the different probes for the same gene are still
highly variable. We use a set of 21 HuGeneFL arrays to illustrate our
discussion. This data set is typical, in terms of quality and sample
size, of a data set from a single-laboratory experiment. We have
applied the methodology to many sets of arrays from different
laboratories and obtained similar results. Each of these 21 arrays
contains more than 250,000 features and 7,129 probe sets. Figs.
1 and 2
show data for one probe set in the first six arrays. This probe set
(no. 6,457) will be called probe set A hereafter. There are
considerable differences in the expression levels of this gene in the
samples being interrogated, as the between-array variation in PM-MM
differences is substantial. More noteworthy is the dramatic variation
among the PM-MM differences of the 20 probes that interrogate the
transcript level. ANOVA of the PM-MM differences of this probe set in
these 21 arrays shows that the variation due to probe effects is larger
than the variation due to arrays. Specifically, mean squares due to
probes and arrays are 38,751,018 and 17,347,098, respectively. This is a general phenomenon: for the majority of the 7,129 probe sets, the rms
due to probes is five times or more than that due to arrays. Thus, it
is clear that proper treatment of probe effects is an essential
component of any approach to the analysis of such expression array
data. Below, we introduce a statistical model for the probe-level data
to account for probe-specific effects in the computation of expression
indexes.
Statistics
Model-based analysis of oligonucleotide arrays: Expression index
computation and outlier detection
![]()
Abstract
Top
Abstract
Introduction
Results
Conclusions
References
![]()
Introduction
Top
Abstract
Introduction
Results
Conclusions
References

View larger version (54K):
[in a new window]
Fig. 1.
Black curves are the PM and MM data of gene A in the first six arrays.
Light curves are the fitted values to model 1. Probe pairs
are labeled 1 to 20 on the horizontal axis.

View larger version (54K):
[in a new window]
Fig. 2.
Black curves are the PM-MM difference data of gene A in the first six
arrays. Light curves are the fitted values to model 2.
In addition, human inspection and manual masking of image artifacts is currently very time consuming and represents a limiting factor in large-scale expression profiling projects. We show that the goodness of fit to our model can be used to construct diagnostics for cross-hybridizing probes, contaminated array regions, and other image artifacts. We use the diagnostics to develop automated procedures for detecting and handling of all these artifacts. This method makes it possible to process and analyze a large number of arrays in a speedy manner.
Statistical Model.
Suppose that a number (I > 1) of samples have been profiled in an
experiment. Then, for any given gene, our task is to estimate the
abundance level of its transcript in each of the samples. The
expression-level estimates are constructed from the 2 × I × 20 (assuming a probe set has 20 probe pairs) intensity values for the
PM and MM probes corresponding to this gene. The estimation procedure
is based on a model of how the probe intensity values respond to
changes of the expression levels of the gene. Let us denote by
i an expression index for the gene in the ith sample. We assume that the intensity value of a probe
will increase linearly as
i
increases, but that the rate of increase will be different for
different probes. It is also assumed that within the same probe pair,
the PM intensity will increase at a higher rate than the MM intensity.
We then have the following simple model:
|
|
[ 1 ] |
j is the baseline response
of the jth probe pair due to nonspecific hybridization,
j is the rate of increase of the
MM response of the jth probe pair,
j is the additional rate of
increase in the corresponding PM response, and
is a
generic symbol for a random error. The rates of increase are assumed to
be nonnegative.
We fit model 1 to the 2 × 21 × 20 data matrix
for probe set A and Fig. 1 shows the observed and fitted PM and MM
intensities for the first six arrays. The model fits the data well. The
residual sum of squares is only 1.03% of the sum of squares of the
original PM and MM intensities. Thus, this model is able to capture the main relations between the observed intensities for different arrays
and probes.
The model for individual probe responses implies an even simpler model
for the PM-MM differences:
|
s
to be J (the number of probes):
|
[ 2 ] |
s and
s, regarding the other set
as known. For comparison, we also perform least square fitting by using
the more standard additive model:
|
|
Conditional Mean and Standard Error.
Suppose for gene A, the
s have been learned from a large
number of arrays, we can then treat them as known constants and analyze
the mean and variance of the expression index estimate. For a single
array, model 2 becomes:
|
[ 3 ] |
s, the linear least square estimate for
is
|
|
|
s as fixed, we can
calculate standard errors of
s. These standard errors will play an
important role in outlier detection and probe selection. We note that
the above calculation is conditional in the sense that the
s are
regarded as known constants. This is valid if we have a large number of
arrays to estimate them accurately, otherwise, the uncertainty in
the estimation of these probe-specific parameters must be taken into
account in the standard error computation.
Probe selection and Automatic Outlier and Artifact Detection.
Conceptually we can extend expression 2 to model the
response of the probe set to all genes in the sample:
|
[ 4 ] |
i(k) is the
expression level of the kth gene in the ith
array,
j(k) is the sensitivity
of the jth probe to the kth gene, and n is the
total number of different human genes (we do not consider complications
such as alternative splicing here). Ideally, we want a probe set to be
specific: if a probe set is intended to interrogate gene k, then only
the
j(k)s should be nonzero
(thus sensitive) and all other
j(k)s should be 0 (thus
specific). In this case the observed yij are
specific signals coming from the target gene and model 4 is
reduced to model 2, and the expression indexes
i(k) can be correctly
estimated. We note that model 4 is formally a special case
of the factor analysis model that is widely used in the social sciences
(7).
Although Affymetrix has developed prediction rules to guide the
selection of probe sequences with high specificity and sensitivity (1),
inevitably there remains some probes hybridizing to one or more
nontarget genes. We expect most cross-hybridizing genes to have
expression patterns (in a large set of samples) different from that of
the target gene, and different probes in a probe set to cross-hybridize
to different nontarget genes. For a nontarget interfering gene
k', the sensitivity indexes
j(k') are expected to be small
except for one or two probes in the probe set. The mixed response of a
probe set to target and nontarget genes suggests that probe selection
(in the analysis stage) may enhance the specificity in estimating the
expression levels of the target gene
i(k).
In the standard analysis (5), the mean and standard deviation of the
PM-MM differences of a probe set in one array are computed after
excluding the maximum and the minimum. If a difference deviates by more
than 3 SD from the mean, a probe pair is marked as an outlier in
this array and discarded in calculating average differences of both the
baseline and the experiment arrays. One drawback to this approach is
that a probe with a large response might well be the most informative
but may be consistently discarded. Furthermore, if we want to compare
many arrays at the same time, this method tends to exclude too many probes.
We exploit our model to detect and handle cross-hybridizing probes,
image contamination, and outliers from other causes. For a particular
probe set, its 20
values constitute its "probe response
pattern," and the model hypothesizes that the 20 differences in an
array should follow this pattern and are scaled by the target gene's
expression index (
) in this array. The (conditional) standard error
attached to a fitted
is a good measure of how the 20 differences in
the corresponding array conform to the probe response pattern. For
example, in Fig. 4B, array 4 is identified as an "outlier array" because the estimated
4 has a large standard error. Close examination of Fig. 4A reveals that the probe
responses in array 4 deviates from the consistent patterns shown in the
other arrays. This could be due to various reasons including image
artifacts (Fig. 5). The probe responses
in this array will distort the fitting of the probe response pattern.
To guard against this, we exclude outlier arrays (identified by large
standard errors) and use the remaining arrays to estimate the probe
response pattern. For an outlier array, we still compute its expression
index conditional on the estimated probe response pattern, with the
attached large standard error indicating poor reliability of this
expression index.
|
|
and
are symmetric. Therefore,
we can use the conditional standard errors of the estimated
js to identify problematic probes. In Fig.
6, probe 17 (indicated by arrow in
several arrays) has peculiar behavior that is inconsistent with the
rise and fall of other probes. This inconsistency is probably due to
the cross-hybridization of this probe to nontarget genes. Fig.
6B shows that this nonspecific probe can be identified by
the large standard error associated with
17.
Finally, we must also consider a "single outlier" which might be
an image spike in one array affecting just one PM-MM difference. Such
a single outlier (say dij in the data matrix)
may affect estimates of both
i and
j and we can identify it by the large
residual for this data point. Once identified, single outliers are
regarded as "missing data" in the model fitting.
|
value. If it contributes more than 80% to the
sum of squares of the
js in the probe set, we
classify a probe as "high-leverage" and exclude it during the
fitting of the model. A similar procedure is used to identify high
leverage arrays.
|
s (arrays) with large standard error
(more than three times as large as the median standard error of all
s) or dominating magnitude (
in 2 is more than 80% of
the sum of squares of all
s), and mark these arrays as array
outliers. Next, with these array outliers excluded, we work on a data
table with fewer rows (discarding outlier arrays), and fit the model
again. This time we inspect the standard errors and magnitudes of
s,
in the hope of excluding probe outliers. If a
has a negative value,
we also regard it as probe outlier and exclude the corresponding probe.
In effect, the data table shrinks in columns and we fit the model again
to this new data table. Note that, although some arrays and probes may
not be used in fitting the model, we still can regress the data in one
array (excluding probes not used in model fitting) against the
estimated probe pattern (
s) to get an estimate of the expression
levels (
s) of the excluded arrays. After probe outliers are
excluded, we evaluate all arrays for outliers again and compare them to the set of array outliers in the previous round to see if there is any
change. This procedure is repeated until the set of probe and array
outliers does not change anymore. (In some cases, they may cycle in a
small number of slightly different sets.) Along the iteration we will
also identify some single data point outliers with large residuals and
mark them as missing data when fitting the model. In general, 5-10
iterations will lead to a converged set of outliers.
| |
Results |
|---|
|
|
|---|
We apply this model-based analysis to all 7,129 probe sets of the
HuGeneFL arrays. Fig. 8 shows excluded
array and single outliers for two arrays. As we have seen from Fig. 5,
image contamination can be handled automatically by reasonably marking
array and single outliers and excluding them from model fitting. Such
contamination would lead to incorrect expression, and fold change
calculation if left unattended in the data. There are arrays with a
large number of array and single outliers (Fig. 8B).
Presumably, such arrays underwent severe sample contamination which
destroyed the probe pattern of many probe sets and introduced many
single outliers. Again the model automatically excludes the array from
model fitting to avoid the influence of these bad arrays on good arrays
and attaches large standard errors to the expression indexes of
contaminated probe sets. Fig. 9 shows a
case where the output by the GENECHIP software
(Affymetrix, Santa Clara, CA) is presumably incorrect, most probably
because of a misaligned corner. This is detected by an unusually large
number of array and single outliers near the corner. We also
intentionally include a murine MU11KSUBA array with the 21 human arrays
to assess its effect on the analysis. This has little effect on the
outlier image of the human arrays, but more than two-thirds of
"probe sets" on the murine array are detected as outliers (Fig.
10). For the remaining "probe
sets," their signals on the murine array are mostly close to zero.
These are not detected as outliers as they can fit the human probe
patterns (
s) by taking low
values, but they will not bias the
estimation of the probe response patterns of the human arrays.
|
|
|
Fig. 11A shows that
for 60.2% of the probe sets, we use more than half of the probes to
fit the model. Fig. 11B shows that the explained energy is
high (R2 greater than 80%) for 63.3% of the
probe sets. To investigate the reason for the low probe usage and poor
R2 for some probe sets, we examine the
relationship between probe usage, explained energy, and the presence
percentage (percentage of arrays where a probe set is called
"present" by GENECHIP; Fig. 11C). Fig.
12A shows that high
presence percentage usually leads to high probe usage. Fig.
12B demonstrates that when a gene is present in many
arrays, the explained energy of the corresponding probe set tends to be
high. Clearly, when a gene is absent in most arrays, the variations in
the observed data are mostly due to the noise term and one should not
expect the model to explain a large fraction of this variation. In this
case, there is not much information in the data to determine the
s,
but the
s are still correctly estimated to be close to zero.
|
|
| |
Conclusions |
|---|
|
|
|---|
We have proposed a statistical model for oligonucleotide expression array data at the probe level. Based on this model, we are able to address several important analysis issues that are difficult to handle by using current approaches. These include accounting for individual probe-specific effects, and automatic detection and handling of outliers and image artifacts. Computer programs implementing these methods are available on request for nonprofit research. In a follow-up paper, we will discuss the computation of standard errors (SE) for expression indexes and confidence intervals (CI) for fold changes, how the availability of the SE and CI values impact downstream analysis, and metaanalysis of pooled data from different experiments.
| |
Acknowledgements |
|---|
We thank Eric Schadt for discussion; Sven de Vos for providing data; and Gary Churchill, Ker-Chau Li, David Lockhart, and Terry Speed for comments on the first draft. This work is supported in part by National Institutes of Health Grant 1R01HG02341-01 and National Science Foundation Grant DBI-9904701.
| |
Abbreviations |
|---|
PM, perfect match; MM, mismatch.
| |
Footnotes |
|---|
* To whom reprint requests should be addressed at: Department of Biostatistics, Harvard University, 655 Huntington Avenue, Boston, MA 02115. E-mail: wwong{at}hsph.harvard.edu.
This paper was submitted directly (Track II) to the PNAS office.
Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.011404098.
Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.011404098
| |
References |
|---|
|
|
|---|
| 1. | Lockhart, D. , Dong, H. , Byrne, M. , Follettie, M. , Gallo, M. , Chee, M. , Mittmann, M. , Wang, C. , Kobayashi, M. , Horton, H. , et al. (1996) Nat. Biotechnol. 14, 1675-1680. |
| 2. | Lipshutz, R. J. , Fodor, S. , Gingeras, T. & Lockhart, D. (1999) Nat. Genet., supplement 21, 20-24. |
| 3. | Tamayo, P. , Slonim, D. , Mesirov, J. , Zhu, Q. , Kitareewan, S. , Dmitrovsky, E. , Lander, E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. USA 96, 2907-2912. |
| 4. | Alon, U. , Barkai, N. , Notterman, D. A. , Gish, K. , Ybarra, S. , Mack, D. & Levine, A. J. (1999) Proc. Natl. Acad. Sci. USA 96, 6745-6750. |
| 5. | Wodicka, L. , Dong, H. , Mittmann, M. , Ho, M. & Lockhart, D. (1997) Nat. Biotechnol. 15, 1359-1367. |
| 6. | Schadt, E. , Li, C. , Su, C. & Wong, W. H. (2001) J. Cell. Biochem. 80, 192-202. |
| 7. | Press, J. (1972) Applied Multivariate Analysis (Holt, Rinehart and Winston, New York). |
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
N. Ono, S. Suzuki, C. Furusawa, T. Agata, A. Kashiwagi, H. Shimizu, and T. Yomo An improved physico-chemical model of hybridization on high-density oligonucleotide microarrays Bioinformatics, May 15, 2008; 24(10): 1278 - 1285. [Abstract] [Full Text] [PDF] |
||||
![]() |
L.-H. Ding, Y. Xie, S. Park, G. Xiao, and M. D. Story Enhanced identification and biological validation of differential gene expression via Illumina whole-genome expression arrays through the use of the model-based background correction methodology Nucleic Acids Res., May 1, 2008; (2008) gkn234v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Knox and J. C. Baker Genomic evolution of the placenta using co-option and duplication and divergence Genome Res., May 1, 2008; 18(5): 695 - 705. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. S. Nair, M. L. Bigelow, Y. W. Asmann, L. S. Chow, J. M. Coenen-Schimke, K. A. Klaus, Z.-K. Guo, R. Sreekumar, and B. A. Irving Asian Indians Have Enhanced Skeletal Muscle Mitochondrial Capacity to Produce ATP in Association With Severe Insulin Resistance Diabetes, May 1, 2008; 57(5): 1166 - 1175. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Ramirez-Valle, S. Braunstein, J. Zavadil, S. C. Formenti, and R. J. Schneider eIF4GI links nutrient sensing by mTOR to cell proliferation and inhibition of autophagy J. Cell Biol., April 16, 2008; 181(2): 293 - 307. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Medina, S. K. Zaidi, C.-G. Liu, J. L. Stein, A. J. vanWijnen, C. M. Croce, and G. S. Stein MicroRNAs 221 and 222 Bypass Quiescence and Compromise Cell Survival Cancer Res., April 15, 2008; 68(8): 2773 - 2780. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Zhang, A. Lenk, M. X. Andersson, T. Gjetting, C. Pedersen, M. E. Nielsen, M.-A. Newman, B.-H. Hou, S. C. Somerville, and H. Thordal-Christensen A Lesion-Mimic Syntaxin Double Mutant in Arabidopsis Reveals Novel Complexity of Pathogen Defense Signaling Mol Plant, April 15, 2008; (2008) ssn011v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. A. Solomon, J.-S. Kim, S. Jenkins, H. Ressom, M. Huang, N. Coppa, L. Mabanta, D. Bigner, H. Yan, W. Jean, et al. Identification of p18INK4c as a Tumor Suppressor Gene in Glioblastoma Multiforme Cancer Res., April 15, 2008; 68(8): 2564 - 2569. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Xu and X. Cui Robustified MANOVA with applications in detecting differentially expressed genes from oligonucleotide arrays Bioinformatics, April 15, 2008; 24(8): 1056 - 1062. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Gautschi, C. G. Tepper, P. R. Purnell, Y. Izumiya, C. P. Evans, T. P. Green, P. Y. Desprez, P. N. Lara, D. R. Gandara, P. C. Mack, et al. Regulation of Id1 Expression by Src: Implications for Targeting of the Bone Morphogenetic Protein Pathway in Cancer Cancer Res., April 1, 2008; 68(7): 2250 - 2258. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Koltai and C. Weingarten-Baror Specificity of DNA microarray hybridization: characterization, effectors and approaches for data correction Nucleic Acids Res., April 1, 2008; 36(7): 2395 - 2405. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. M. Pinto, S. Thanaviratananich, M. G. Hayes, R. M. Naclerio, and C. Ober A Genome-Wide Screen for Hyposmia Susceptibility Loci Chem Senses, April 1, 2008; 33(4): 319 - 329. [Abstract] [Full Text] [PDF] |
||||
![]() |
M.-I. Fernandez, B. Regnault, C. Mulet, M. Tanguy, P. Jay, P. J. Sansonetti, and T. Pedron Maturation of Paneth Cells Induces the Refractory State of Newborn Mice to Shigella Infection J. Immunol., April 1, 2008; 180(7): 4924 - 4930. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. B. Silkworth, E. A. Carlson, C. McCulloch, K. Illouz, S. Goodwin, and T. R. Sutter Toxicogenomic Analysis of Gender, Chemical, and Dose Effects in Livers of TCDD- or Aroclor 1254-Exposed Rats Using a Multifactor Linear Model Toxicol. Sci., April 1, 2008; 102(2): 291 - 309. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. T. Tuomisto, H. Lumivuori, E. Kansanen, S.-K. Hakkinen, M. P. Turunen, J. V. van Thienen, A. J. Horrevoets, A.-L. Levonen, and S. Yla-Herttuala Simvastatin has an anti-inflammatory effect on macrophages via upregulation of an atheroprotective transcription factor, Kruppel-like factor 2 Cardiovasc Res, April 1, 2008; 78(1): 175 - 184. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Giannakis, S. L. Chen, S. M. Karam, L. Engstrand, and J. I. Gordon Helicobacter pylori evolution during progression from chronic atrophic gastritis to gastric cancer and its impact on gastric stem cells PNAS, March 18, 2008; 105(11): 4358 - 4363. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Bengtsson, R. Irizarry, B. Carvalho, and T. P. Speed Estimation and assessment of raw copy numbers at the single locus level Bioinformatics, March 15, 2008; 24(6): 759 - 767. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. M. Roccaro, X. Leleu, A. Sacco, A.-S. Moreau, E. Hatjiharissi, X. Jia, L. Xu, B. Ciccarelli, C. J. Patterson, H. T. Ngo, et al. Resveratrol Exerts Antiproliferative Activity and Induces Apoptosis in Waldenstrom's Macroglobulinemia Clin. Cancer Res., March 15, 2008; 14(6): 1849 - 1858. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. A. Rao, S. Saraswathy, G. S. Wu, G. S. Katselis, E. F. Wawrousek, and S. Bhat Elevated Retina-Specific Expression of the Small Heat Shock Protein, {alpha}A-crystallin, Is Associated with Photoreceptor Protection in Experimental Uveitis Invest. Ophthalmol. Vis. Sci., March 1, 2008; 49(3): 1161 - 1171. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Handfield, H.V. Baker, and R.J. Lamont Beyond Good and Evil in the Oral Cavity: Insights into Host-Microbe Relationships Derived from Transcriptional Profiling of Gingival Cells J. Dent. Res., March 1, 2008; 87(3): 203 - 223. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. L. Shattuck, J. K. Miller, K. L. Carraway III, and C. Sweeney Met Receptor Contributes to Trastuzumab Resistance of Her2-Overexpressing Breast Cancer Cells Cancer Res., March 1, 2008; 68(5): 1471 - 1477. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Marselli, J. Thorne, Y.-B. Ahn, A. Omer, D. C. Sgroi, T. Libermann, H. H. Otu, A. Sharma, S. Bonner-Weir, and G. C. Weir Gene Expression of Purified {beta}-Cell Tissue Obtained from Human Pancreas with Laser Ca |