New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
A segmental nearest neighbor normalization and gene identification method gives superior results for DNAarray analysis

Communicated by Lonnie O'Neal Ingram, University of Florida, Gainesville, FL (received for review October 3, 2002)
Abstract
An intuitive normalization and gene identification method is proposed. After segmentation of the entire expression range into intensity intervals, the mean and standard deviation of the logarithm of expression ratios are calculated for each interval using the nearest neighbor genes. Genes with high differential expression are excluded from these calculations. For glass arrays, normalization is performed for each interval by using the mean of the logarithm of expression ratios in the interval. For nylon/plastic membranes, the average of the means of the logarithm of ratios across the intervals of higher intensities is used for normalization. Compared with other normalization methods, this method delivered the smallest normalization errors for 42 nylon/plastic arrays used to analyze cultured T cells and 22 Clostridium acetobutylicum glass arrays. For identifying differentially expressed genes, upper and lower boundaries are constructed for each interval by using the standard deviation of the expression ratio logarithms. When a C. acetobutylicum pSOL1 megaplasmiddeficient strain M5 was used, this method identified more “downregulated” pSOL1 genes with fewer misidentifications in a comparative array analysis of M5 versus the parent strain. A comparison of quantitative RTPCR results with different gene identification methods indicates that the proposed method is superior to other methods.
The DNA array technology is emerging as a powerful tool for largescale gene expression analysis (1). Like any other measuring method, this technology contains inherent measurement errors. Because of variations in labeling, hybridization, spotting, or surface characteristics, expression intensities resulting from the same amount of mRNA can differ from experiment to experiment, and thus, a normalization method must be used to compare expression intensities obtained from two dyes or membranes. Normalization of array data involves two steps: (i) selection of genes to be used for normalization, and (ii) application of a mathematical operator or metric to normalize the data. Gene selection can include the entire gene set (global, G), housekeeping genes, rankinvariant genes, or genes with “constant expression.” Mathematical operators or metrics include expressionintensity mean (EM), expressionintensity median, expressionratio mean (ERM), expressionratio median (ERMD), mean of logarithm of expression ratio (LERM), median of logarithm of expression ratio (LERMD), expressionratio probability density (ERPD), and linear or nonlinear regression. Any combination of one of these mathematical operators or metrics with a set of genes can be used. However, there are only a few normalization methods reported in the literature. Global normalization using the mean or median of expression intensities and normalization based on housekeeping genes are the most widely used methods (2–4). Other methods include an iterative method using ERPD based on housekeeping genes (5) and a LERMD method using only genes with constant expression levels (6, 7). Because of different labeling and detection efficiencies for various fluorescent dyes, glass arrays require nonlinear normalization protocols rather than a constant normalization factor (8, 9).
After normalization, data must be processed to identify differentially expressed genes before clustering (2). A frequently used method for selection of differentially expressed genes is based on setting an intensity threshold and a minimal fold change (typically 2 to 3fold) to discard genes with low expression levels and insignificant fold differences, respectively (10). At high expression levels, the noisetosignalintensity ratio is small. A small change at very high expression levels might therefore be significant. On the other hand, at moderate expression levels, 2 or 3fold changes may not be significant due to relatively higher noisetosignal ratios. By using “calibration” experiments (4, 8) (identical samples labeled with both Cy3 and Cy5 dyes and hybridized on the same glass slide or labeled with radioactivity and hybridized on two separate membranes), Tien et al. (4) performed a segmental calculation of standard deviation (SD) or maximum/minimum fold changes, and developed masks for nondifferentially expressed genes. Statistical approaches have also been developed to identify differentially expressed genes. When a ratio probability density is used, upper and lower boundaries (masks) for differentially expressed genes can be derived at a given confidence level (5). Because such masks are intensitydependent, Newton et al. (11) used a Gamma–Gamma Bernoulli model to develop the ratio probability density, and presented intensitydependent geneidentification masks at various posterior odds based on hierarchical modeling. Such masks are wider at both low and high expression intensities. Using a Bayesian approach, Tseng et al. (8) developed a hierarchical Gaussian model for identification of differentially expressed genes. Another approach is application of an ANOVA model (12). However, no studies have compared the efficiencies of different normalization and geneidentification methods and their effects on transcriptional analysis.
We describe a superior array normalization and geneidentification method by segmenting the entire intensity range into a number of intensity intervals and determining the mean and SD of the logarithms of expression ratios (LERs) for each interval with the nearest neighbor nondifferentially expressed genes. Normalization is based on the mean of the LERs in each interval or in the intervals of higher expression intensities. For gene identification, upper and lower masks are constructed for each interval by using the interval SD of the LERs.
Materials and Methods
Nylon and Plastic Arrays.
Thirty nylon arrays (Human Atlas 1.2, CLONTECH) containing 1,191 singly spotted genes, and 12 plastic arrays (Atlas Plastic Human 8K, CLONTECH), containing over 8,300 doubly spotted genes, were used to analyze samples of primary human T cells cultured at 5% or 20% oxygen tension (pO_{2}) in serumfree medium (13). Geneexpression analysis was performed pairwise by using samples of 20% vs. 5% pO_{2} cultures initiated with the same donor cells. In addition, 6 nylon and 2 plastic calibration arrays were used to obtain a constant multiplier used for gene identification masks. RNA isolation, reverse transcription (RT), labeling (^{32}P/^{33}P), and hybridization were performed per manufacturer's instructions. Nylon or plastic arrays were exposed to GP/LE phosphor screens (Molecular Dynamics) for 24 or 72 h, respectively, and scanned by using a Storm PhosphorImager (Molecular Dynamics). imagequant (Molecular Dynamics) or ATLASIMAGE 2.01 (CLONTECH) was used for quantitation of signal intensities.
Glass Arrays.
cDNA arrays with triplicate spots representing 1,019 ORFs, approximately onefourth of the Clostridium acetobutylicum ATCC 824 genome, were printed by using the TIGR protocol (14). The list of genes on the arrays and all primary array data used in this paper are available on our web site (http://www.chemeng.northwestern.edu/faculty/papou.html). Three C. acetobutylicum strains were grown anaerobically in bioreactors (15): WT (ATCC and 824), 824(pSOS95del) (plasmidcontrol strain), and 824(pGroE1) (overexpresses GroESL operon genes using the thiolase promoter). Static flask cultures (400 ml) of the WT and M5 (lacks the pSOL1 megaplasmid; ref. 16) strains were also analyzed. Three sets of array experiments were performed, including eight slides of WT vs. 824(pSOS95del), six slides of WT vs. M5, and eight slides of 824(pSOS95del) vs. 824(pGroE1). Half of these slides were done with opposite labeling and were used for construction of a geneidentification mask based on ANOVA. In addition, three calibration glass arrays were used for setup of geneidentification masks. RNA was isolated by using the Trizol reagent (Invitrogen) with additional lysozyme treatment (15). Purified RNA was used to synthesize labeled cDNA in a hexamerprimed RT reaction (Roche Molecular Biochemicals) in the presence of Cy3dUTP or Cy5dUTP (Amersham Pharmacia) using Moloney murine leukemia virus reverse transcriptase (Promega). Slides were hybridized with probes overnight in a Corning hybridization chamber at 42°C, and were scanned with a GSI Lumonics ScanArray analyzer. Spot intensities were quantitated with QuantArray Microarray Analysis (Perkin–Elmer).
Quantitative RTPCR (QRTPCR).
For QRTPCR, all RT reactions were performed by using the RETROscript kit (Ambion, Austin, TX). Each PCR contained 0.5 μl of RT reaction product, 1× SYBR Green PCR Master Mix (Applied Biosystems), and primers (Table 3, which is published as supporting information on the PNAS web site, www.pnas.org). All PCRs were run on the ABI PRISM 7700 Sequence Detection System with the following program: 10 min at 95°C, 35–40 cycles of 15 sec at 95°C and 1 min at the annealing temperature (Table 3). Three or six QRTPCR replicates were performed for each gene in each T cell or C. acetobutylicum sample, respectively. The resulting average fold changes from these replicates were compared with a fixed value of one to determine the statistical significance (P < 0.05) of the fold changes using the singlesample t test (17).
Normalization and Gene Identification Method
Variations in several factors and processes of array analysis result in two types of errors, random and system errors, and affect the measured gene expression intensity. Random errors result from scanning errors and spottospot variations (on the same array) in the amount of deposited cDNA and arraysurface properties. These errors can generally be defined as noise with a mean of 0 across all spots in an array. System errors can be defined as those resulting from variations in array quality (array surface and printing, and amount of DNA spotted) as well as in sample preparation between two membranes or dyes (array storage, RNA amount used, reverse transcription and labeling, hybridization, and washing).
Let x* and y* be the true intensities of a nondifferentially expressed gene on two membranes or of two dyes. True intensities should be free of random errors. Thus, the difference between x* and y* is caused by system errors only. Practically, one will never know these true values. What is known are measurements that contain random errors (noise). If no such random errors were present, normalization would be accomplished by using the ratio of true intensities directly, namely λ(x, y) = x*/y*.
Assume that in the neighborhood of (x*, y*) there are K nondifferentially expressed genes with true expression intensities (x^{*}_{i}, y^{*}_{i} ), i = 1, … , K, where the neighborhood is defined by the space spanned by the K nondifferentially expressed genes closest to (x*, y*). The measured intensities, (x_{i}, y_{i}), after prefiltering (as described in the supporting text, which is published on the PNAS web site) for these K genes contain random errors (ɛ_{x,i}, ɛ_{y,i}). Thus, we can write that x_{i} = x^{*}_{i} + e_{x,i} and y_{i} = y^{*}_{i} + e_{y,i}. To treat upregulated and downregulated genes equally, natural LER is used. The logarithm of the normalization factor of two membranes or dyes can be estimated using the K nearest neighbors of (x*, y*) (18): 1 If the number K is large enough, one can rewrite this equation as 2 where E{⋅} represents the mean of the argument. When this normalization factor is used, the normalized expression ȳ in the second membrane or dye to the expression in the first membrane or dye can be written as log ȳ = log y + log λ(x, y) or log ȳ = λ(x, y)y.
Random Errors of Two Different Membranes or Dyes.
The first term on the right side of Eq. 2 aims to eliminate system errors, while the second term describes the effects of the noisetosignal ratios on normalization. The normalization objective is to capture and eliminate system errors in the presence of random errors (noise). In membrane array experiments, samples are hybridized on two separate arrays. Random errors in two different arrays should be independent of one another, and thus, ɛ_{x} and ɛ_{y} may have opposite signs, which may lead to a large variation in y/x = (y* + ɛ_{y})/(x* + ɛ_{x}), especially at low intensities. A typical distribution of LER against the logarithmic mean intensity in a nylon/plastic array is shown in Fig. 1 Ia and IIa. Expression ratios have a wide spread at lower intensities and gradually converge with increasing intensities. For random errors in two membranes, one can write 3 In glass arrays, samples are labeled with two different dyes, and in contrast to membrane arrays, are hybridized on the same slide; thus, both samples compete for binding with the cDNA of each array spot. Therefore, spot variations, which result in random errors, affect the two dyes simultaneously. Because random errors contain measurement errors in addition to spot variations, they are expected to be both related because of same spot variations and random because different measurement errors. As a result, variations in expression ratios are nearly uniformly distributed across the whole intensity range (Fig. 1IIIa), and one may write that y/x = (1 + ɛ_{0})y*/x*, where ɛ_{0} is a random signal with a mean of zero. The noisetosignal relationship between the two channels can then be written as 4
Normalization of Calibration Arrays.
For membrane arrays at higher intensities, ɛ_{x} ≪ x* and ɛ_{y} ≪ y*. Considering Eqs. 2 and 3, the normalization factor between two plastic or nylon membranes can then be obtained by using genes with higher expression intensities: 5 For glass arrays, assuming that ɛ_{0} ≪ 1, one can derive the following expression from Eqs. 2 and 4, 6
Normalization of Two Different Samples.
In Eq. 5 or 6, the genes used for normalization are only those that are nondifferentially expressed. In an array experiment, typically only a fraction of genes have altered expression when comparing two samples. Nevertheless, highly differentially expressed genes (outliers) may considerably affect the mean of the LERs, and therefore should be removed before normalization. One can use the increase or decrease in SD of a group of n + 1 data points to identify outliers. When equidistantly distributed, one can write 7 where Δd is the distance between two data points. Therefore, the SD increases approximately linearly with increasing number of data points. As a point with a larger distance to the neighboring point is added, the SD increases at a greater rate. To eliminate genes with strong differential expression, we first calculate the LER mean and SD for a set of genes, and then remove the gene with the LER the farthest from the LER mean from the set of genes. This calculation and removal procedure is repeated until only a few genes are left (e.g., five). By doing so, we obtain the SD profile as a function of the number of data points (genes) (Fig. 2). Data with a uniform or normal distribution (without tails) result in a relatively constant increase in SD as the number of data points increases, whereas data points in the tails of a normal distribution and strongly expressed genes cause a sharp increase in the SD. These highly expressed genes are excluded, and the LER mean derived from the remaining nearest neighbor genes is used for normalization.
Discretization of Intensities and Normalization Based on a MovingGene Neighborhood.
As shown, the validity of Eq. 5 is restricted to higher expression intensities. Thus, we can only use nondifferentially expressed genes at higher expression intensities to normalize two different membrane arrays. The expression ratio to intensity relationship between two dyes in a glass array is slidedependent and nonlinear (Fig. 1IIIa), and such nonlinearity cannot be corrected by a global normalization factor. Thus, we propose a piecewise normalization to account for this nonlinear relationship. To distinguish the higher from the lower expression in the case of membrane arrays or to obtain a piecewise normalization for glass arrays, the whole range of the natural logarithmic mean intensity was discretized into M equidistant intervals. K nondifferentially expressed genes around the middle of each interval were then used to determine the LER mean and its SD in the interval. The M and K values were determined by minimizing the total normalization errors, as discussed below. The normalization factor between two membrane arrays was evaluated based on the average of the LER means over the intervals of higher intensities (e.g., >5× background intensity). For glass arrays, the normalization of two dyes was performed by using the LER mean in each interval. This piecewise normalization factor across all intensity intervals was used to model the slidedependent nonlinear relationship between the expression ratios and intensities in glass arrays.
Masks for Identifying Differentially Expressed Genes.
After normalization, the normalized LERs should be distributed around zero. Because the SD may vary from interval to interval and from array to array, the normalized LERs of various intensity intervals in separate arrays are distributed differently. Therefore, masks for identification of differentially expressed genes obtained from one array experiment cannot be applied to other arrays (4). The distribution of the LERs can be standardized by dividing the LER by the corresponding SD. Similarity can be found for various intervals of different arrays (Figs. 5 and 6, which are published as supporting information on the PNAS web site). The standardized probability densities are normallike, but with heavier tails (Fig. 7, which is published as supporting information on the PNAS web site) and the confidence level cannot be accurately determined based on normal distributions and t statistics (12). Instead, the percentile method was used to estimate the confidence level. Because of the similarity of the standardized probability densities in various intervals and samples, one can apply the results from one interval to any other interval or array. Using the percentile method, the SD_{j} of LERs in an interval j along with a constant multiplier can be used to set up the upper and lower mask borders, ϑ = γSD_{j} and ϑ = −γSD_{j}, for that interval. The constant multiplier γ is the number of SDs required to ensure, with a certain confidence level (e.g., 95%), that genes with expression ratios falling between the upper and lower mask borders are not differentially expressed. In other words, we can state with 95% confidence that genes lying outside of the mask are differentially expressed. Because the standardized probability densities in different intervals are similar, γ is independent of the interval index j. To determine γ, a number of calibration array experiments and the percentile method were used. After normalization, any deviations of the LERs of the calibration arrays from zero are caused by noise. Ideally, γ should be chosen such that all genes are located inside the masks. However, because of significant noise in the data, an extremely large γ was needed to produce a mask that included all genes. For example, γ = 4.3, 5.1, and 3.9 were necessary for three pairs of nylon calibration arrays, one pair of plastic calibration arrays, and three calibration glass arrays, respectively. Hence, γ was chosen such that a maximum of 5% of all genes lie outside the mask for all calibration experiments. For nylon, plastic, and glass arrays, γ = 2.1, 2.3, and 2.0, respectively. More stringent masks can be constructed by increasing the value of γ. As in the case of normalization, the determination of the SD was performed after excluding highly differentially expressed genes. It should be noted that an improvement in the similarity of standardized LER probability densities will make masks more robust and accurate, and such an improvement can be achieved by a better standardization or discretization protocol.
Results and Discussion
Normalization of Array Data.
Fifteen pairs of nylon arrays and six pairs of plastic arrays hybridized with T cell samples (20% vs. 5% pO_{2}) were used for normalization (Fig. 8, which is published as supporting information on the PNAS web site). Normalization of glass array data were performed with 22 glass arrays cohybridized with various pairs of C. acetobutylicum samples. Three representative examples are depicted in Fig. 1. After normalization, the LERs were distributed around zero for all three array types (Figs. 1 and 5a).
If a normalization method is perfect, the ratio of the normalized expression ȳ to the expression x should be close (because of random errors) to 1. To compare different normalization methods, the following criterion is proposed based on the normalization error: 8 where n is the total number of genes, and p is the number of membrane pairs or arrays. The smaller J_{norm_error}, the better the normalization. Although most J_{norm_error} values range between 0 and 1, poor normalization methods will result in values >1 (see below).
The proposed normalization method is referred to as the segmental nearest neighbor LER mean method (SNNLERM). The search for the optimal number K of nearestneighbor genes and the optimal number M of intervals was performed by normalizing the array data with varying M and K values. If K is small, an interval is underrepresented by the K nearest neighbor genes and the normalization will not be very accurate (Fig. 9a, which is published as supporting information on the PNAS web site). At K = 0, the resulting normalization corresponds to the case of no normalization. If K is large and approaches the total number of genes n, the quality of SNNLERM is approximately that of the global LER mean (GLERM) normalization. A small M will result in a normalization method neglecting the ratiointensity nonlinear relationship (as in global normalization). As M increases, the normalization quality first improves, but eventually plateaus as M increases further (Fig. 9b). The optimal M and K values for nylon, plastic, or glass arrays, which were calculated by minimizing J_{norm_error}, were 20, 45, 25 and 250, 300, 200, respectively.
For all three forms of DNA arrays, SNNLERM produces the lowest normalization error (Fig. 3). For membrane arrays, the proposed method is only slightly better in terms of the total normalization error, when compared with other median and LER methods (Fig. 3 a and b). For pairwise (between the corresponding 5% and 20% pO_{2} T cell samples) normalization, the difference between normalization and no normalization was small, because the samples were processed at the same batch, and thus system errors were small. Often, different membrane arrays are compared with a single array (e.g., array hybridized with a global RNA pool; ref. 19). In this case, system errors are much greater (e.g., because of various batches of sample processing), and hence, compared with the no normalization case, the normalization error is reduced considerably (>45%) when using the proposed method (Fig. 3b). For glass arrays, normalization based on SNNLERM was far better than normalization based on any global method. The improvement in normalization using SNNLERM was also considerable (>11%) compared with other methods such as rankinvariant nonlinear regression (Fig. 3c). We also examined the SNNLER median method and found it to give results similar to those obtained using SNNLERM (Fig. 3). Among various normalization methods, we demonstrated that the proposed method delivers superior results.
Identification of Differentially Expressed Genes.
After SNNLERM normalization, 15 pairs of nylon arrays hybridized with samples taken from four sets of T cell cultures (20% vs. 5% pO_{2}) and nine C. acetobutylicum glass arrays were used to compare different geneidentification methods. The proposed method of using SNN genes and LER SD (SNNLERSD) delivered dynamic upper and lower mask boundaries, in contrast to the rigid minimal fold change with an intensity threshold (MF&T) and the expression ratio probability density (ERPD) methods (Fig. 4). Unlike the mask contours (CHM) of Newton et al. (11), SNNLERSD masks for membrane arrays are broader at lower intensities and narrower at higher intensities, allowing identification of differentially expressed genes of both higherfold changes at lower intensities and lowerfold changes at higher intensities (Fig. 4a). Successful identification includes properly identifying (i) differentially expressed genes and (ii) nondifferentially expressed genes, while avoiding misidentification of (iii) nondifferentially expressed genes as differentially expressed, (iv) differentially expressed genes as nondifferentially expressed, and (v) significantly upregulated genes as downregulated or vice versa. Let us assume that n_{di} and n_{nd} genes are known to be differentially and nondifferentially expressed, respectively. The total identification error J_{iden_error} may be defined as 9 where n_{C} to n_{E} are the numbers of genes in the above defined categories iii–v, respectively. J_{iden_error} values range from 0 to 1, and ideally should be equal to 0.
A robust method for assessing the absolute and relative accuracy of the proposed geneidentification method can be based on the geneexpression analysis of the megaplasmid (pSOL1) deficient C. acetobutylicum strain M5 relative to WT (Table 1). Strain M5 is isogenic to WT but lacking the pSOL1 plasmid. Up to 178 genes, which are expressed with a broad range of levels in WT, therefore, have null expression in M5 and can be used for method validation. All 178 ORFs of pSOL1 are included on the C. acetobutylicum glass slides. Thus, these 178 pSOL1 genes are expected to be classified as “downregulated” (if expressed in WT samples) or nondifferentially expressed (if not expressed in WT samples), when M5 samples are cohybridized with WT. Traditional methods such as MF&T and ERPD identified a number of “downregulated” pSOL1 genes, but at the same time misidentified a few pSOL1 genes as “upregulated”. Several misidentifications were made when using CHM, and even more when using ANOVA. SNNLERSD properly identified the largest number of “downregulated” pSOL1 genes and at the same time had no misidentifications of “upregulated” pSOL1 genes. The strategy of using the C. acetobutylicum pSOL1 genes to validate geneidentification methods is unique in the DNAarray literature.
We also used T cell samples hybridized on nylon arrays to validate different geneidentification methods. QRTPCR results for 10 genes from 15 pairs of 20% vs. 5% pO_{2} cultures (2 failed, leaving 148 RTPCR measurements) were used to determine n_{di} and n_{nd} (Table 2). Although fold changes obtained from methods such as Northern analysis and QRTPCR are often used to validate array data (3, 7, 20), there is no established way for comparing fold changes obtained from such methods to array fold changes. To determine the differential status of genes based on QRTPCR results, we first selected genes with their fold changes of replicate QRTPCR results statistically significantly >1. Among those genes, genes with fold changes >1.5 were considered as differentially expressed. We chose the 1.5 value as a conservative benchmark to test the power of the various geneidentification methods. Based on this criterion, we obtained that n_{di} = 34 and n_{nd} = 114. For various identification methods, different sets of n_{C} to n_{E} were determined, and the corresponding total identification errors were calculated. SNNLERSD delivered the smallest identification error. With decreasing fold change and threshold in MF&T and increasing posterior odds in CHM, these two methods may be able to correctly identify more genes are differentially or nondifferentially expressed genes but at the cost of increasing misidentifications. EPRD, SNNLERSD, and CHM (with posterior odds of 100:5) were similar in classifying differentially expressed genes. In identifying nondifferentially expressed genes, SNNLERSD was, however, superior to EPRD and CHM by making far fewer misidentifications. Similar conclusions are obtained when a higherfold change, such as 2, is used for selecting genes as differentially expressed by QRTPCR, although the number of genes that meet this criterion is smaller, that thus, the ability to discriminate among the various geneidentification method is reduced. In summary, for plastic/membrane arrays, though our proposed normalization method performed only slightly better than other normalization methods (Fig. 3 a and b), the proposed overall (normalization and gene identification) method produces significant improvements (Table 2).
To further test the accuracy of SNNLERSD, in C. acetobutylicum 824(pSOS95del)824(pGroE1) glass array experiments, three genes identified by SNNLERSD but not by other methods as differentially expressed and one gene identified by other methods (ERPD; 95% confidence level; MF&T with the minimal fold of 2.5 and threshold intensity of 500; CHM with posterior odds at 100:10) but not by SNNLERSD (95% confidence level) as differentially expressed were chosen for QRTPCR verification (two such genes are identified in Fig. 4b). Three genes identified by SNNLERSD as differentially expressed were assayed as differentially expressed by QRTPCR (average LERs of these three genes: 2, 1.4, 1.1). The array expression ratios of these genes were relative small (average LERs: 0.77, 0.83, 0.71), and only SNNLERSD was able to identify them correctly. The QRTPCR LER for the gene identified by SNNLERSD as nondifferentially expressed was −0.26. In contrast to other methods, SNNLERSD was able to discard this gene from being classified as differentially expressed (Fig. 4b).
Although QRTPCR results demonstrate that the proposed geneidentification method is superior to other methods, it should be noted that neither QRTPCR nor DNAarray analysis is error free. The discrepancy in the number of differentially expressed genes identified by QRTPCR and DNAarray analysis may be due to the fact that array fold changes are smaller compared with Northern analysis or QRTPCR (3, 7). Thus, a large number of genes assayed by QRTPCR as differentially expressed with a fold change of 1.5 are judged as nondifferentially expressed by array analysis (Table 2). As a result, array analysis is conservative in predicting differentially expressed genes compared with QRTPCR. At the same time, when using SNNLERSD, array analysis produces few false positives.
Acknowledgments
We thank Dr. N. Jafari of the Center for Genetic Medicine, Northwestern University, and L. T. Huang and J. Beamish for assistance. We acknowledge the use of Keck Biophysics and Center for Genetic Medicine facilities at Northwestern University. This work was supported by National Science Foundation Grant BES9905669, National Institutes of Health Grant R01GM65476, and a Whitaker Foundation Graduate Fellowship (to H.H.).
Footnotes
Abbreviations
 CHM,
 contour based on hierarchical modeling;
 ERM,
 expressionratio mean;
 ERMD,
 expressionratio median;
 ERPD,
 expressionratio probability density;
 LER,
 logarithm of expression ratio;
 LERM,
 mean of logarithm of expression ratio;
 MF&T,
 minimal fold change with an intensity threshold;
 QRTPCR,
 quantitative RTPCR;
 SNN,
 segmental nearest neighbor
 Received October 3, 2002.
 Accepted December 3, 2002.
 Copyright © 2003, The National Academy of Sciences
References
 ↵
 Schena M,
 Shalon D,
 Heller R,
 Chai A,
 Brown P O,
 Davis R W
 ↵
 Eisen M B,
 Spellman P T,
 Brown P O,
 Botstein D
 ↵
 Iyer V R,
 Eisen M B,
 Ross D T,
 Schuler G,
 Moore T,
 Lee G C F,
 Trent J M,
 Staudt L M,
 Hudson H Jr,
 Boguski M S,
 et al.
 ↵
 Tsien C L,
 Libermann T A,
 Gu X,
 Kohane I S
 ↵
 ↵
 Beissbarth T,
 Fellenberg K,
 Brors B,
 ArribasPart R,
 Boer J M,
 Hauser N C,
 Scheideler M,
 Hoheisel J D,
 Schuetz G,
 Poustka A,
 et al.
 ↵
 Steidl U,
 Kronenwett R,
 Rohr U P,
 Fenk R,
 Kliszewski S,
 Maercker C,
 Neubert P,
 Aivado M,
 Koch J,
 Modlich O,
 et al.
 ↵
 Tseng G C,
 Oh M K,
 Rohlin L,
 Liao J C,
 Wong W H
 ↵
 ↵
 ↵
 Newton M A,
 Kendziorski C M,
 Richmond C S,
 Blattner F R,
 Tsui K W
 ↵
 ↵
 ↵
 ↵
 Harris L M,
 Welker N E,
 Papoutsakis E T
 ↵
 Cornillot E,
 Nair R V,
 Papoutsakis E T,
 Soucaille P
 ↵
 ↵
 Cover D,
 Hart P
 ↵
 Garber M E,
 Troyanskaya O G,
 Schluens K,
 Petersen S,
 Thaesler Z,
 PacynaGengelbach M,
 Rijn M,
 Rosen G D,
 Perou C M,
 Whyte R I,
 et al.
 ↵