On the utility of pooling biological samples in microarray experiments
- *Department of Biostatistics and Medical Informatics and §McArdle Laboratory for Cancer Research, University of Wisconsin, Madison, WI 53703; and ‡Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205
Communicated by Grace Wahba, University of Wisconsin, Madison, WI, January 25, 2005 (received for review July 28, 2004)
Fig. 6.
Design accuracy. (A) Lists of fixed size. Solid lines give the average performer across 100 subsets; dashed lines give the worst-case performer. (B) Lists with fixed false discovery rate (FDR). Each vertical tick on the FDR plot marks 100 genes identified at the specified level of FDR (see Comparison of Designs for more details on the construction of each figure). Virtually identical results were obtained when CEL files were processed by using RMA within pool group (see Fig. 12).
Fig. 3.
Distorted gene. Expression values are shown for individuals, pools, and technical replicates. The + (x) marks the arithmetic mean of the raw (log) data; the m marks the median of the values. The numbers refer to arrays (control condition). Of importance here are arrays 3 and 10, where expression values for this gene differ from the majority. The effects of arrays 3 and 10 are attenuated by the values they are pooled with (11 and 2, respectively, for the pools of two).
Fig. 4.
Effects of distortion within and between conditions. (A) The mean difference between the pools of two and the corresponding averages across individuals (control condition) as a function of standard deviation (SD) estimated within the control condition (all genes are shown). The units are log base-2 expression. The percentiles of SD are shown (bottom) along with the percentage of genes (top) having values in the pools of two that are larger than the corresponding average across individuals. Genes with values in the pools that are higher (lower) than the corresponding averages are shown in blue (purple). For the 25% of the genes with the largest SD, >80% have values larger in the pools of two. Similar results were found by using estimates of either technical or biological SD. The treatment condition and pools of three give similar results. (B) The difference between the log fold change (FC) values (control/treatment) calculated from the pools of two and the individuals for the genes shown in A, plotted as a function of the difference in SD calculated across conditions. B is unitless because we are considering the difference in log fold change. Distortion affects both control and treatment and largely cancels out when FCs are considered, resulting in similar FC values in the individuals and pools.
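The pattern described in this caption can be illustrated with a small numerical sketch. It assumes an idealized pool in which mRNA is averaged on the raw scale before the log is taken; that is an assumption of the sketch, not a statement from the caption. The helper names `pool_log2` and `distortion` and all expression values are hypothetical. Under this assumption, the pooled log value is never smaller than the average of the individual log values, the gap grows with the spread of the values, and it cancels in the log fold change when both conditions have the same spread.

```python
import math
from statistics import mean


def pool_log2(log2_values):
    """log2 expression of an idealized pool: average raw values, then take log2."""
    return math.log2(mean(2.0 ** v for v in log2_values))


def distortion(log2_values):
    """Pool value minus the average of the individual log2 values."""
    return pool_log2(log2_values) - mean(log2_values)


# Hypothetical log2 expression values for two pools of two individuals.
low_sd_pair = [8.0, 8.2]    # individuals agree closely
high_sd_pair = [7.0, 9.2]   # same mean on the log scale, much larger spread

low_gap = distortion(low_sd_pair)
high_gap = distortion(high_sd_pair)

# Hypothetical control and treatment pairs with the same spread,
# shifted by 2 log2 units.
control = [7.0, 9.0]
treatment = [9.0, 11.0]

log_fc_individuals = mean(control) - mean(treatment)
log_fc_pools = pool_log2(control) - pool_log2(treatment)

print(f"distortion, low-SD pair:  {low_gap:.4f}")
print(f"distortion, high-SD pair: {high_gap:.4f}")
print(f"log FC from individuals:  {log_fc_individuals:.4f}")
print(f"log FC from pools:        {log_fc_pools:.4f}")
```

In this toy setting the two log fold changes agree because the distortion is identical in both conditions. When the spread differs between conditions, the cancellation is only partial, which is consistent with B being plotted against the difference in SD across conditions.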
Fig. 5.
Differential expression (DE) inferences without biological replication. Expression values from three genes are shown. Technical replicates for genes 1 and 2 are shown in columns C1 and C3. By considering these technical replicates only, the first two genes might be considered DE by some measures (because the averages in each group are quite different); when biological replicates are considered for these two genes (columns C2 and C4), it is obvious that the difference in means is caused by three outliers (first gene) and underestimation of the biological variance (second gene). DE calls for gene 3 would be the same, whether considering biological or technical replicates; +, x, and m are defined in Fig. 3.
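The effect of underestimating biological variance can be sketched with a toy calculation. The Welch t statistic is used here only as one possible DE measure; it is not taken from the caption. All expression values and the helper `welch_t` are hypothetical. Technical replicates of a single sample per condition show only measurement noise, so the statistic is large; biological replicates with a similar mean difference carry the additional between-individual variance and give a much smaller statistic.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch t statistic for the difference in means of two groups."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical log2 values: technical replicates of ONE sample per condition.
tech_control = [8.00, 8.05, 7.95]
tech_treatment = [8.60, 8.65, 8.55]

# Hypothetical log2 values: one array per individual (biological replicates).
bio_control = [8.0, 7.2, 8.9]
bio_treatment = [8.6, 9.3, 7.7]

t_technical = welch_t(tech_control, tech_treatment)
t_biological = welch_t(bio_control, bio_treatment)

print(f"t from technical replicates:  {t_technical:.2f}")
print(f"t from biological replicates: {t_biological:.2f}")
```

With these made-up numbers, the technical-replicate comparison would suggest strong evidence of DE, while the biological-replicate comparison would not, mirroring the situation described for the second gene.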
Footnotes
- Copyright © 2005, The National Academy of Sciences











