Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / APPLIED BIOLOGICAL SCIENCES
Accurately quantifying low-abundant targets amid similar sequences by revealing hidden correlations in oligonucleotide microarray data



*Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139;
Biomedical Engineering Department, Northwestern University, Evanston, IL 60208; and
Department of Biology, Temple University, Philadelphia, PA 19122
Edited by J. Craig Venter, The J. Craig Venter Institute, Rockville, MD, and approved June 13, 2006 (received for review February 22, 2006)
| Abstract |
|---|
|
|
|---|
microbial ecology | optimization algorithm | cross-hybridization | free energy | rRNA
In all microarray analysis, accuracy of specific signal detection is compromised by: (i) nonspecific interactions among probes and noncomplementary targets (cross-hybridization; refs. 9 and 10), (ii) stochastic variability in the measurements (noise; ref. 11), (iii) background level of the system (5), and (iv) unequal signal intensity among different complementary probe–target pairs for the same target abundance (unequal specific response; refs. 12 and 13). In single-organism expression analysis, some of these spurious contributions have been addressed by optimization of probe and array design (14–17) and by hybridization signal analysis using several probabilistic and free-energy-based models (6, 13, 18–20). These are primarily based on differential hybridization responses of short oligonucleotides (25-mer) differing by a single central mismatch. In this system, cross-hybridization is modeled as part of the noise/background, because low sequence identity among transcripts causes nonspecific interactions to be low. In addition, multiple probe pairs per target allow signal averaging and outlier rejection for improved signal quantitation. However, no current models fully meet the challenges presented by multiorganism assemblages, where large numbers of nontarget sequences with high similarity to the target can be present. In this situation, the specific target may be at such low abundance that the sum of even small levels of cross-hybridization in addition to noise and background may completely mask the specific signal. Moreover, no analytical tools are available, which can easily be transposed among different technological platforms.
We reasoned the above challenge would best be addressed by explicitly determining and simultaneously accounting for all spurious contributions, which obscure the underlying physical relationship between microarray signal intensity and target abundance. This led us to design an optimization algorithm based on a system of nonlinear equations, which express raw hybridization signal intensity through the true signal of a specific target and spurious signal contributions (Eq. 1; Materials and Methods). By iteratively solving these equations using best-linear-unbiased-estimation (BLUE) theory (Eq. 2; Materials and Methods), the algorithm determines the most likely values for all spurious contributions and the actual target abundances. These are initially inferred assuming a linear relationship between signal intensity and target abundance; however, because the signal saturates for higher target abundances as described by the Langmuir function (Eq. 3; Materials and Methods; see refs. 21–23), this relationship can be used to accurately estimate target abundances over the entire range. Furthermore, the algorithm depends on a priori knowledge of the cross-hybridization probability between any probe–target pair, yet experimental measurement is intractable for the number of probe–target interactions possible on typical microarrays. We therefore developed an analytical predictor (Eq. 4; Materials and Methods) to estimate the probability of cross-hybridization between any probe–target pair differing by an arbitrary number and location of mismatches. The estimate is based on the average free energy of binding (
G
) of realizations of different probe–target interactions calculated from their sequence identity (Eq. 5; Materials and Methods). This probability is then used by the algorithm to calculate the cross-hybridization signal observed for each probe as a function of abundances of targets. We focused on 50-mer oligonucleotides to test the ability of the methodology to determine and compensate cross-hybridization between sequences differing by a range in number and location of mismatches. We then tested the method through a series of experiments aimed at differentiating ribosomal RNA (rRNA) of closely related bacteria of the genus Vibrio in both artificially assembled and natural coastal communities. This is a particular challenge, because (i) rRNAs are difficult to differentiate due to their high conservation (24, 25), and (ii) thousands of different rRNA genes have been shown to coexist within coastal communities, of which specific Vibrio taxa typically represent <0.1% (2, 26, 27).
| Results and Discussion |
|---|
|
|
|---|
G
(rather than
G) for the prediction of probability of cross-hybridization among oligonucleotides of essentially any length. Although the model performance should be validated for oligonucleotides of differing length, we predict it should be feasible, because
G has already been shown to correlate well with the hybridization intensity of both long (50- to 70-mer; refs. 30, 31, and 33) and short (18- to 25-mer; refs. 12, 20, and 34) oligonucleotides. Thus, the method should be easily transposable to other microarray platforms.
|
|
The relationship between abundances returned by the algorithm,
alg, and true abundances, follows the Langmuir function (Eq. 3), where three regimes can be identified: linear regime for low-target abundance (0.25–20 ng), nonlinear regime (20–60 ng), and plateau range for high-target abundances when hybridization signal intensity saturates. Thus, an estimation of target abundance from microarray data,
est, can be obtained from
alg by solving the Langmuir equation. The accuracy of
est in estimating true target abundances depends on the regime it falls in. In the linear regime, the average relative error is 7.8% of the true abundance where the lowest RNA abundance of 0.25 ng (0.05% of total RNA) was detected with an accuracy of ±0.086 ng and relative error of 34.4%; in the nonlinear regime, the average relative error is 28.5%.
In a second set of experiments, the accuracy of identification of specific targets was further tested under near natural conditions by spiking known amounts of V. cholerae RNA at realistic concentrations into coastal water samples (2–150 ng of V. cholerae total RNA contributing 0.1–1% of total RNA). This tests the high probability of cross-hybridization, because at the same site, hundreds of bacterial rRNA variants including multiple Vibrio taxa other than V. cholerae have been detected (2, 27). In all cases, V. cholerae was correctly identified and followed the same relationship observed for the artificial communities (Fig. 3). Thus, the methodology overall returned quantitative results over at least a 250-fold range from 0.25 to 60 ng of RNA, above which the signal saturates; however, the small error of 0.086 ng for the lowest measurement suggests the detection limit can potentially be lower. This dynamic range correlates well with the 500-fold dynamic range measured for single organism expression studies using 60- and 70-mer microarrays (33, 35) and is >10-fold higher than described for environmental studies using both oligonucleotide and cDNA microarrays (29, 36, 37).
|
The last set of experiments was designed to test the ability of this methodology to differentiate closely related organisms within a real microbial community by evaluating the presence of several Vibrio (16S rRNA >95% sequence identity) for which the array contained probes. A total of seven taxa were identified in coastal seawater samples taken in June and August (Fig. 4). RNA of these taxa ranged from 1.80 to 60.07 pg/ml seawater representing between 0.005% and 0.15% of total community RNA with generally increasing representation during warmer water conditions. These observations are consistent with the relative abundance of the same Vibrio taxa measured by quantitative PCR (QPCR) at this and analogous sites (26, 27); we sought to further validate the quantification for Vibrio splendidus using reverse transcriptase QPCR (RT-QPCR). The estimates showed good agreement between microarray and RT-QPCR data with 10.12 ± 1.73 and 5.74 ± 0.08 and 3.28 ± 0.59 and 0.96 ± 0.12 copies of 16S rRNA (x106) per milliliter of seawater for June and August, respectively. These results therefore suggest that the new method extends both the sensitivity and accuracy of identification of low-abundant targets; previously reported detection limits were between 1% and 5% of the total RNA or DNA within communities (31, 32, 36, 38, 39). A recent publication by Brown and colleagues (8) describes an optimized experimental protocol, which includes probes designed to reduce cross-hybridization to a minimum; this enabled detection of individual bacteria species at <0.1% of total bacteria. Thus, we suggest that a combination of such optimized technical protocols with the analytical methodology presented here may even further improve sensitivity and accuracy of microarrays.
|
| Materials and Methods |
|---|
|
|
|---|
|
|
where for each array experiment, Yjp is the mean logarithm of observed hybridization signal intensity for a given target species j, probe region p, in the presence of background
, and noise
.
is the target abundance in the original sample, matrix
jk describes cross-hybridization between probe j and target k for each probe region p, and bjj is the signal intensity of a perfect match probe–target pair j normalized to the relative abundance of target j, and which can vary between different probes. Error term
is outside the logarithmic relationship to account for higher levels of noise intrinsically associated with higher signal intensities. The task of the inversion algorithm is to determine the abundance
of a specific target gene from the logarithm of hybridization signal intensity Y for all its complementary probes, jp, by accounting for all four major spurious contributions of cross-hybridization (
), noise (
), background (
), and unequal specific response (b). In the absence of cross-hybridization, the matrix
is diagonal, and the estimation of target abundance
is relatively straightforward. In the presence of cross-hybridization, however,
jk is a nondiagonal matrix. In this case, coefficients
are calculated by using Eq. 4, and the algorithm iteratively determines the optimal values for bjjp,
jp, and
k which justify the observed hybridization intensities.
This optimization is accomplished through iterative recursive linearization using best-linear-unbiased-estimation (BLUE) theory (43). The iterative process can be described in matrix notations. If an nth iterative solution of Eq. 1 is available, the (n + 1)th iteration is found as follows:
|
|
where X is the vector of all unknowns (
jp, bjjp,
k)T,
![]() |
with J the total number of species, P the total number of probe regions, index m varying from 1 to 2 JP +J,
jpm a unitary jp x jp matrix (i.e.,
jpm = 1 for jp = m, 0 otherwise), and
and 
are the prior covariance matrices of X(n) and
, respectively. To start the iterative process, the initial values of the unknowns b0,
0, and
0 were chosen as follows: (i) the signal intensity of a perfect-match probe-pair b0jj = constant, which in our experimental setup (ArrayWoRx scanner, Applied Precision CCD with counts from 1 to 60,000) was experimentally determined to be 30 ± 0.7 counts per nanogram of target [we chose b0jj = constant because the unequal specific response among different probe–target pairs (bjj) varied on average 20% among each other, as determined by array experiments containing 139 probe–target pairs where five bacteria were individually spiked]; (ii) the background
0jp = mean observed signal intensity in case of zero true abundances, which is approximated as the measured intensity of 100 negative control spots in each microarray experiment; (iii) the rRNA abundance
![]() |
and (iv) the prior covariance matrix
is diagonal with the variances of b,
, and
as diagonal elements. The optimal initial variances for b and
were determined in computational experiments with synthetic data sets, which contained intensity data simulating the presence of different bacterial rRNA present at low levels in real experimental conditions of noise and background. The criterion to determine convergence of the algorithm was the relative difference between consecutive iterations <0.001%. Typically, 50 iterations were sufficient to reach convergence.
The relationship between RNA abundance of a particular target determined by the algorithm,
alg, and its true abundance,
true, is essentially linear for low-target abundances and approaches a plateau in the high limit of
true. This overall dependence follows the Langmuir function, which has been shown to accurately describe microarray signals (21–23):
|
|
with
C = 20 ng. In this type of relationship, three regimes of dependence can be identified.
C, or critical abundance, provides the approximate upper bound for the regime within which
alg depends linearly on
true. For
true, between 20 and 60 ng (from
C to
3
C),
alg depends nonlinearly on
true, and Eq. 3 has to be solved to obtain
true from
alg. For even higher abundances (>3
C = 60 ng), the plateau regime is reached and
alg does not increase with
true.
Analytical Cross-Hybridization Predictor.
To estimate the probability of cross-hybridization between any given probe–target pair, we calculated the probability of dissociation of a probe j and target k pair having a given binding free-energy
Gjk. The relationship between binding free energy and cross-hybridization has been proposed (30, 31, 33, 44). Here we build upon these results. The probability of a target bound to a probe and, thus, the probability of cross-hybridization
jk can be estimated using the law of mass action
|
|
where C0,jk and C1,jk are the abundances of the bound and free targets after the reaction, respectively, as follows:
|
|
where coefficient b measures the probability of a perfect match target–probe pair remaining hybridized after the reaction and coefficient h is the ratio of energy of binding and energy of breaking the target–probe pair. Both coefficients b and h depend on the conditions of the reaction (e.g., temperature and salt concentration). Coefficients b = 0.75 and h = 8.47 were obtained by fitting Eq. 4 to experimental data consisting of 1,233 cross-hybridization values (for an experimental posthybridization washing temperature of 60°C; Fig. 1).
The
Gjk/
Gjj ratios can be calculated by using either mfold software (www.bioinfo.rpi.edu/applications/mfold) or the analytical approximation described by Eq. 5. To model free binding energy, we used the following set of assumptions: (i)
Gjk decreases with the number of base pair bonds; (ii) the presence of a mismatch in the sequence weakens the neighboring bonds; and (iii) in the case of a number of consecutive mismatches (e.g., "loop" formation), a target may bind the probe out of the sequence within the loop. Furthermore, we reasoned that a more accurate calculation of cross-hybridization requires
Gjk/
Gjj averaged over realizations of different probe–target pair interactions,
Gjk
/
Gjj, because most nonspecific binding interactions are characterized by a distribution of binding affinities due to multiple effects including conformational changes and variation in the point of attachment in the sequence (45, 46). Thus, the estimation of
Gjk
/
Gjj is reduced to the calculation of the expectation of the number of loops formed for a given sequence identity, and the ratio of the binding energies becomes:
|
|
where sequence identity is conventionally defined as
|
|
where N is the number of base pairs, pn and tn stand for probe and target base pairs in position n, and
(pn, tn) = 1 if pn and tn are complimentary or 0 otherwise (www.ebi.ac.uk/clustalw). For T < 70°C, coefficient
|
|
and coefficient
|
|
For T = 37°C,
= 0.83 and sc = 0.5, whereas for T = 60°C,
= 1.26 and sc = 0.6. As shown in Fig. 1b,
Gjk/
Gjj ratios obtained by mfold are scattered around
Gjk
/
Gjj, calculated by using Eq. 5.
Detailed experimental procedures are available in Supporting Text, which is published as supporting information on the PNAS web site.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: rRNA, ribosomal RNA; QPCR, quantitative PCR
¶To whom correspondence should be addressed. E-mail: mpolz{at}mit.edu
Freely available online through the PNAS open access option.
Present address: Department of Microbiology and Molecular Genetics, Harvard Medical School, Boston, MA 02115. ![]()
Author contributions: L.A.M., V.B., and D.V. designed research; L.A.M., A.D., C.S., J.R.T., S.P.P., and C.L. performed research; E.L. contributed new reagents/analytic tools; L.A.M., V.B., D.V., and M.F.P. analyzed data; and L.A.M., V.B., and M.F.P. wrote the paper.
Conflict of interest statement: No conflicts declared.
This paper was submitted directly (Track II) to the PNAS office.
Data deposition: The probe sequences reported in this paper have been deposited in the National Center for Biotechnology Information (NCBI) Probe Database (PUIDs 6103259–6103399).
© 2006 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||