Metagene projection for cross-platform, cross-species characterization of global transcriptional states

Communicated by Edward M. Scolnick, The Broad Institute, Cambridge, MA, February 6, 2007 (received for review December 7, 2006)
Abstract
The high dimensionality of global transcription profiles, the expression level of 20,000 genes in a much smaller number of samples, presents challenges that affect the sensitivity and general applicability of analysis results. In principle, it would be better to describe the data in terms of a small number of metagenes, positive linear combinations of genes, which could reduce noise while still capturing the invariant biological features of the data. Here, we describe how to accomplish such a reduction in dimension by a metagene projection methodology, which can greatly reduce the number of features used to characterize microarray data. We show, in applications to the analysis of leukemia and lung cancer data sets, how this approach can help assess and interpret similarities and differences between independent data sets, enable cross-platform and cross-species analysis, improve clustering and class prediction, and provide a computational means to detect and remove sample contamination.
A major challenge in the analysis of global transcription profiles is the high level of noise and the lack of reproducibility across data sets, which results from fitting models to small numbers of samples in a high-dimensional space (i.e., thousands of genes). Ideally, we would prefer to reduce the data to a small number of metagenes that better capture the essential behavior of the samples.
There are many advantages to such a metagene approach. By capturing the major, invariant biological features and reducing noise, metagenes provide descriptions of data sets that allow them to be more easily combined and compared. This is especially important when we are considering cross-platform or cross-species data. Ultimately, this can result in more sensitive clustering and classification. In addition, interpretation of the metagenes, which characterize a subtype or subset of samples, can give us insight into underlying mechanisms and processes of a disease.
Here, we describe a general methodology, metagene projection, that creates a low-dimensional representation of a training (model) data set using nonnegative metagene factors into which an independently obtained new (test) set of samples or data can be projected and analyzed. The metagene factors are a small number of gene combinations that distinguish expression patterns of subclasses in a data set. We obtain the factors by the application of nonnegative matrix factorization (NMF) (1, 2), a technique originally used to extract facial features from images. We previously showed (3) how NMF can extract metagenes that provide stable, robust clustering of expression data. Moreover, by using gene set enrichment analysis (GSEA) to annotate the metagene factors themselves, we can gain insight into the underlying biology of both the training and test data sets.
We illustrate the utility of metagene projection by its application to leukemia and lung cancer data sets. We show how the projection of new data sets into the space of metagene factors reduces noise and emphasizes relevant biological correlations and thus (i) enables cross-platform analysis by removing technological noise from data, (ii) enables cross-species analysis and the assessment of disease models, (iii) improves the accuracy of classification and prediction methods in the mapping of disease types, and (iv) detects contamination in tumor samples.
Results
Overview of Method.
We consider a gene expression data set consisting of a collection of N_{M} model samples, which we use to characterize a domain of biological (transcriptional) states of interest. The model data are represented as an n_{M} × N_{M} matrix, M, whose rows contain the expression levels of the n_{M} genes in the N_{M} samples.
Using NMF, we find a small number, k, of metagenes, positive linear combinations of the n_{M} genes, which can be used to distinguish the transcription profiles of the subtypes contained in the model data set. Mathematically, this corresponds to finding an approximate factoring, M ≈ W_{M} × H_{M}, where both factors have only positive entries. W_{M} is an n_{M} × k matrix that defines the metagene decomposition model and whose columns specify how much each of the n_{M} genes contributes to each of the k metagenes. H_{M} is a k × N_{M} matrix whose entries represent the expression levels of the k metagenes for each of the N_{M} samples. The number of metagenes, k, is chosen in an unsupervised fashion by using either a knowledge-based or data-driven model selection approach. One can set k equal to the number of known phenotypes in the model set. Alternatively, optimal values of k can be determined based on projection stability by using consensus clustering techniques as described (3).
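As a rough illustration of the factoring M ≈ W_{M} × H_{M}, the following Python sketch applies the squared-error form of the Lee and Seung multiplicative updates to a toy matrix. It is not the implementation used in this work, and the objective and stopping criterion of ref. 3 may differ; the function name `nmf`, the random initialization, and the toy data are assumptions made only for the example.

```python
import numpy as np

def nmf(M, k, n_iter=2000, seed=0, eps=1e-9):
    """Factor a nonnegative matrix M (genes x samples) as W @ H.

    Illustrative only: multiplicative updates for the squared-error
    objective, run for a fixed number of iterations.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = M.shape
    W = rng.random((n_genes, k)) + eps   # genes x metagenes
    H = rng.random((k, n_samples)) + eps  # metagenes x samples
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy model set: 6 genes x 4 samples generated by 2 nonnegative programs.
true_W = np.array([[5, 0], [4, 0], [3, 0], [0, 6], [0, 2], [0, 4]], dtype=float)
true_H = np.array([[1, 2, 0, 0], [0, 0, 3, 1]], dtype=float)
M = true_W @ true_H

W, H = nmf(M, k=2)
rel_err = np.linalg.norm(M - W @ H) / np.linalg.norm(M)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout, which is what makes the columns of W interpretable as additive gene combinations.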
From the factoring of M, we are able to construct a mapping that allows us to project a data set into the space of the metagenes derived above. Mathematically, this can be accomplished by using the Moore–Penrose generalized pseudoinverse (4) of W_{M}, written here as (W_{M})^{−1}, so that Ĥ_{M} = (W_{M})^{−1} × M, where Ĥ_{M} ≈ H_{M}. For simplicity in notation, we refer to the projected matrix as H_{M}. After elimination of outlier samples and model refinement, we can apply the final resulting pseudoinverse to a new individual sample or entire data set and analyze those data in the context of the metagenes that characterized the model data.
We summarize the three major steps in the metagene projection method below (Fig. 1). More detail can be found in Methods. The software is freely available from The Broad Institute web site as both R code and a module in the GenePattern software package.
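The projection step can be sketched in a few lines of Python with NumPy's `pinv`, which computes the Moore–Penrose pseudoinverse. The helper name `project` and the small matrices are hypothetical and chosen so that the result is easy to check by hand.

```python
import numpy as np

def project(W, X):
    """Project an expression matrix X (genes x samples) into metagene space."""
    return np.linalg.pinv(W) @ X

# Hypothetical 4-gene, 2-metagene model with full column rank.
W = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [0.0, 3.0]])
H = np.array([[1.0, 0.0, 2.0],
              [0.0, 4.0, 1.0]])
M = W @ H                 # data that the model reproduces exactly

H_hat = project(W, M)     # recovers H because W has full column rank
print(np.round(H_hat, 6))
```

Unlike the NMF factors themselves, a pseudoinverse projection is not constrained to be nonnegative, so projected values for real data can be slightly negative.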
Step 1. Metagene Factor Extraction and Refinement of the Model Data Set.
We start with standard data preprocessing: thresholding and eliminating genes that do not vary sufficiently across the model set and rank normalizing to minimize platform idiosyncrasies. We apply NMF to factor the resulting expression matrix and derive the Moore–Penrose pseudoinverse of W_{M}. Next, we project the model data set into metagene space and, by using a support vector machine (SVM) (5) classification step, trim outliers from the model set (model data set refinement). Finally, we refactor the expression matrix M of the refined model set, M ≈ W_{M} × H_{M}, and define a refined pseudoinverse or projection map. We use this refinement of the projection map in the analysis of new test data sets.
Step 2. Metagene Factor Projection of the Test Data Set.
We threshold the expression values as in step 1 and then match the genes in each test set to the corresponding genes in the model set. We then rank normalize the test samples to yield the corresponding columns in the test data expression matrix, T. Finally, we apply the pseudoinverse (W_{M})^{−1} to both M and T to obtain H_{M} and H_{T}, their projections into metagene space.
Step 3. Analysis of Model and Test Data Set Projection Results.
In our experience, the use of metagenes, instead of genes, as features for analysis increases the signal-to-noise ratio and yields more robust, accurate results. Now that both the model and test data are represented in the lower dimensional metagene space, there are a variety of analyses we can apply. These include the following:
Visualization.
Model and test samples can be characterized and compared by using heat maps of the H matrices.
Clustering model and test projections.
The projection can provide a sample's class assignment by identifying the metagene with maximum expression. Alternatively, we can cluster the columns of H_{M} and H_{T}.
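Assigning each sample to the metagene with maximum expression amounts to a column-wise argmax over the H matrix. A minimal pure-Python sketch follows; the function name and the example values are made up for illustration.

```python
def assign_by_top_metagene(H):
    """For each sample (column of H), return the index of its largest metagene."""
    n_metagenes = len(H)
    n_samples = len(H[0])
    return [max(range(n_metagenes), key=lambda f: H[f][j])
            for j in range(n_samples)]

# Hypothetical 3-metagene x 4-sample projection.
H = [[0.90, 0.10, 0.20, 0.15],
     [0.05, 0.80, 0.10, 0.70],
     [0.05, 0.10, 0.70, 0.15]]

labels = assign_by_top_metagene(H)
print(labels)  # [0, 1, 2, 1]
```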
Classification of test samples.
We can use the projected data to build a multiclass predictor and assess any data set of test samples. Below, we use a one-versus-all SVM classifier (6, 7) to predict phenotypes by using the k metagenes as the input features. This method provides a predicted class and a predictive confidence by using a modified Brier score (see Methods for details).
GSEA-based metagene interpretation.
To gain biological insight into the different metagene factors, we use a variation of our GSEA methodology (8). Using the expression profile of a metagene, i.e., the corresponding row of the H_{M} matrix, as a template, we sort the genes according to the correlation of their expression profile from the M matrix with the metagene template. We can then evaluate the “enrichment” of gene sets representing a pathway or other biological process at the top of that ranked list by using GSEA. For each metagene, one obtains a list of “enriched” gene sets and their statistical significance [see supporting information (SI) Text].
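The gene-ranking step that precedes GSEA can be sketched as follows. The text does not specify which correlation measure is used, so Pearson correlation is an assumption here, as are the function name and the toy data; the enrichment scoring itself is not shown.

```python
import numpy as np

def rank_genes_by_metagene(M, h_row):
    """Order gene indices by Pearson correlation with a metagene profile.

    M is genes x samples; h_row is one row of H (length = number of samples).
    """
    Mc = M - M.mean(axis=1, keepdims=True)
    hc = h_row - h_row.mean()
    denom = np.linalg.norm(Mc, axis=1) * np.linalg.norm(hc)
    corr = (Mc @ hc) / np.where(denom == 0, 1.0, denom)  # constant rows get 0
    order = np.argsort(-corr)                            # most correlated first
    return order, corr

M = np.array([[1.0, 2.0, 3.0, 4.0],   # rises with the template
              [4.0, 3.0, 2.0, 1.0],   # falls as the template rises
              [2.0, 1.0, 2.0, 3.0]])  # follows the template only partly
h = np.array([0.1, 0.2, 0.3, 0.4])

order, corr = rank_genes_by_metagene(M, h)
print(order.tolist())  # [0, 2, 1]
```

The resulting ordered list is what a GSEA-style procedure would then scan for gene sets concentrated near the top.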
Examples.
Here, we describe three applications of the metagene projection method that highlight its utility in cross-platform analyses: to validate disease models, to improve classification of cross-platform data sets, to assess the similarities and differences of subtypes across data sets, and to detect contamination. We start with a simple example and then describe two more innovative results.
Cross-Platform Clustering of Leukemia Data.
We analyzed two leukemia data sets from different microarray platforms to test the method and demonstrate its power to enable cross-platform classification and to improve sensitivity in clustering. Often clustering of cross-platform data reveals the platform or originating lab as the strongest differentiating signal in the data. Importantly, we establish that the method was able to cluster the cross-platform data correctly and that these results are due to the metagene representation rather than the rank normalization step.
We considered two data sets containing samples representing three leukemia subclasses: B and T cell acute lymphoblastic leukemia (ALL-B, ALL-T) and acute myeloid leukemia (AML). The model data set consisted of 30 samples (10 ALL-B, 10 ALL-T, 10 AML) (from refs. 9 and 10). The test data set contained the 38 samples (19 ALL-B, 8 ALL-T, 11 AML) from ref. 11. The two data sets came from different laboratories and were acquired on different microarray technologies, Affymetrix U133 for the model set and Affymetrix HU6800 for the test set (Affymetrix, Santa Clara, CA).
We applied the metagene projection methodology as described above. In particular, we noted that the model data set is very consistent, and no model refinement was necessary. Because the number of subtypes was known, we used k = 3 metagene factors. Fig. 2 shows the resulting heat maps for the projected model and test sets. Clearly, the metagenes are associated with the biological phenotypes (F1 ≈ ALL-B, F2 ≈ ALL-T, F3 ≈ AML) in both.
Postprojection clustering of the model samples demonstrates reduction of noise and greater emphasis of the biologically invariant signal in the data. The clusters corresponding to each phenotype have higher intracluster correlation and greater intercluster distance than obtained with the original data (SI Fig. 6). More importantly, clustering of the merged set of projected model and test samples produces very clear results, with the three major clusters each consisting of one leukemia subtype independent of the data set of origin (Fig. 3A and SI Fig. 7A).
We next sought to confirm that this consistency of subtype clusters across the data sets was due to the metagene projection and not just the result of preprocessing and rank normalization. To this end, we performed two additional clusterings: one merging the model and test samples after rank normalization and clustering in the space of all filtered genes without using metagene projection (Fig. 3B and SI Fig. 7B) and another clustering the merged and rank normalized data in the space of the top 500 marker genes of each of the three subtypes in the model set, 1,500 genes in total (SI Fig. 8). This last procedure is often used for cross-platform analysis. In both alternative clusterings, which do not use metagene projection, the samples first split according to their data set of origin before the biological subclassification appears.
Leukemias: Improving Cross-Platform Classification and Interpretation of Subtypes.
We sought to ascertain whether metagene projection would be an effective procedure for unsupervised feature extraction (12) and dimension reduction to enable more robust and accurate classifiers. To this end, we considered 10 subclasses of leukemia (5 subtypes of ALL and 5 subtypes of AML) as represented in a model set of 170 samples from refs. 9 and 10. The test set consisted of 297 samples (13–20), obtained from eight independent published data sets. The model set samples were all acquired on the same platform in the same laboratory, whereas the test set came from multiple labs and three different microarray platforms (see SI Table 1).
We set the number of metagene factors to the number of known phenotypes in the model set, k = 10. Metagene projection, followed by model refinement, resulted in elimination of eight outlier samples from the model set [2 of 21 AML t(8;21); 4 of 23 AML MLL; 2 of 14 AML inv(16)] (for more detail see Methods). Fig. 4 shows the metagene expression matrices for both the model and test data sets after projection. Strikingly, we found that each leukemia subtype was characterized by essentially one metagene.
Next, we sought to determine whether we could build a classifier using the metagene projections that would accurately predict the subtype of the cross-platform samples in the test set. We noted that the data-driven model selection technique described in our previous work (3) indicated that k = 13 was the best choice (SI Fig. 9). Thus, we evaluated SVM classifiers using both the 10- and 13-metagene models and compared them with SVM and K-nearest neighbor (KNN) classifiers using all genes in common between the model and test data sets. SI Fig. 10 shows the comparative performance of the 10- and 13-metagene SVMs with the all-gene classifiers.
Our metagene-based classifier outperformed the classifiers based on all genes or markers selected in all-gene space. The 13-metagene classifier attained the “best” performance, with a correct call accuracy of 88% and fewer errors than the 10-metagene model. The correct call accuracies of the 10-metagene, all-gene SVM, and KNN classifiers were 86%, 82%, and 72%, respectively. We note that the SVM classifier using all common genes made fewer “confident” calls but made correspondingly fewer errors. We used 0.3 as the confidence threshold for all of the SVM multiclass predictors. Increasing this threshold will reduce both the number of correct calls and the number of errors (see SI Tables 2 and 3 for details).
Closer examination of the confusion matrices for the 10- and 13-metagene classifiers revealed that two thirds of the errors resulted from placing ALL-BCR, AML-t(8;21), AML-M7, and AML-MLL samples into the AML-inv16 class. We believe this results from shared metagene signals, which can be seen in the heat map in Fig. 4B. A GSEA analysis of the metagene factors, described below, uncovered a biological interpretation for some of the errors. This also led us to explore the extent to which cross talk between the AML and ALL data in the model might be affecting our ability to predict the classes in the test set. Interestingly, we found that building separate 10-metagene, five-class classifiers for just the ALL subtypes and just the AML subtypes improved accuracy substantially: to 97% (130 samples) with 1.5% no calls (2 samples) and 1.5% errors (2 samples) for ALL, and to 92% (150 samples) with 3% no calls (4 samples) and 5% errors (9 samples) for AML. The all-gene SVM and 9-NN predictors also improved in accuracy, but the metagene-based classifier continued to make more correct calls and fewer no calls (SI Fig. 10).
These are remarkably good multiclass, cross-platform classification results. It was difficult to make direct comparisons with other approaches in the literature, because the specific data sets or data preparation were not always available. However, the metagene-based approach appears to outperform other leukemia cross-platform classification approaches: 93–96% accuracy on ALL subtypes and 68–78% on AML subtypes (21); ≈40% accuracy on AML subtypes (22).
Finally, we applied GSEA analysis to help interpret the metagenes characterizing the leukemia subtypes. Interestingly, many of the results agreed with the current understanding of these subclasses, and others posed new hypotheses. We present them as an illustration of the power of the metagene projection method to provide biological insights. The top two gene sets enriched in F4 (i.e., high in ALL T Cell) are (i) a set of E2F1 targets known to be activated in T Cell ALL (23) and (ii) a set of genes downregulated by ET-743 treatment, which is known to induce apoptosis in acute T cell leukemia Jurkat cells (24). Metagene F9, high in AML-MLL, shows enrichment for chromosome band 11q13, which is known to be frequently coamplified with MLL in AML patients (25).
F6 is highly expressed in t(8;21) and also upregulated in inv(16) subclasses of AML. The mechanism of leukemogenesis in both these AML subtypes is disruption of the core binding factor (CBF) transcriptional complex, composed of the RUNX1 and CBFB proteins. In t(8;21), RUNX1 is fused to the CBFA2T1 gene, and inv(16) causes a CBFB-MYH11 fusion gene. Both fusion genes disrupt the CBF complex, which is required for normal hematopoietic differentiation. Patients with t(8;21) and inv(16) also have similar clinical features: both subclasses are associated with a relatively good prognosis and particular benefit from consolidation chemotherapy with high-dose cytarabine. F6 therefore identifies patients harboring distinct cytogenetic abnormalities with a common molecular mechanism and clinical phenotype. Intriguingly, F9 also shows strong correlation with both AML-MLL and AML-inv(16). This leads us to speculate that these two AML subtypes share some common program.
In this example, we have shown that metagene projection is an effective approach to building multiclass classification models across different platforms and sources of data that are accurate, robust, and interpretable.
Lung Cancer: Cross-Platform Comparison, Contamination Detection, and Interpretation of Cell Line Models.
We next investigated whether metagene projection would enable us to evaluate consistency in a collection of cross-platform data sets, validate cell lines as good models for different tumor types, and, importantly, provide a method to computationally extract some of the expression signal of normal tissue contamination from tumor samples.
For our model set, we used a subset of data set A from ref. 26, BOS, consisting of 30 lung adenocarcinomas, 20 squamous tumors, and 17 normal lung samples. Our test set derived from seven independent data sets (refs. 27–32 and one unpublished set, see SI Table 4). Note that these data sets were acquired on four different microarray platforms by six different laboratories.
We first built a four-metagene model from the BOS model set as described above. Although the model set included three major subtypes, the data-driven NMF model selection procedure indicated that four factors was the smallest optimal solution greater than the number of known phenotypes (SI Fig. 11). After SVM model refinement, one outlier adenocarcinoma sample was removed from the model set, and the metagene factors were recalculated. Fig. 5A contains the H_{M} matrix of metagene expression levels. From the H_{M} matrix, we can see that metagenes F2, F3, and F4 characterize the adenocarcinoma, squamous, and normal samples, respectively, whereas the F1 metagene picks up an additional signature in a subset of the adenocarcinoma and squamous samples. Next, we projected all of the test data sets into metagene space (H_{T} in Fig. 5A) and found an unexpected result. The normal test samples NLSTA continued to be characterized by F4. However, although the adenocarcinoma and squamous samples still showed F2 and F3 metagene signatures, respectively, they also showed significant expression in the F4 “normal” metagene. This led us to speculate that these samples might have varying degrees of contamination by stroma or normal tissue, which we might be able to extract computationally.
To remove the normal signature, we set the F4 metagene coefficients in the H_{M} matrix to zero and multiplied the original W_{M} by this modified matrix to yield a matrix M̃ that reproduces the original data but without the contribution of F4. We then excluded the normal tissue samples from the model data set because they only had residual values, factored the resulting data matrix to extract the three remaining metagene factors, and projected all of the samples as was done before. The resulting expression profiles of the metagenes in the model and test sets are seen in Fig. 5B. Eliminating the contribution of the F4 metagene, we find the dominant signatures in the adenocarcinoma and squamous samples are F2 and F3, respectively, as in the model set, and F1 retains its role as the signature of the cell lines. Thus, we were able to numerically “modulate” a specific metagene to computationally reduce contamination in the tumor samples.
The most striking feature of the metagene projection of the test samples is that the adenocarcinoma and squamous cell lines do not project with the corresponding tumor classes. This has been reported in the literature (27). Using the GSEA approach we described above, we can gain some biological insight into the metagene, F1, which characterizes the cell lines. SI Table 5 shows the top 20 gene sets enriched in F1.
Metagene F1 is enriched in gene sets associated with rapamycin response (mTOR activation), protein production (genes downregulated by amino acid starvation), lack of differentiation, the mitochondria, oxidative phosphorylation, and BRCA1 signaling. We have observed some of these gene sets before as part of a group of gene sets enriched in poor-outcome lung adenocarcinoma patients in three different data sets (8). This leads us to speculate that F1 represents transcriptional programs associated with hyperactivation of AKT/mTOR, an associated mTOR-mediated increase of protein production and high proliferation, and a lack of differentiation.
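The “modulation” of a single metagene described here, zeroing one row of H and multiplying back by W, can be sketched as follows. The helper name `remove_metagene` and the small matrices are hypothetical; the subsequent refactoring and reprojection steps are not shown.

```python
import numpy as np

def remove_metagene(W, H, factor):
    """Rebuild the expression matrix without one metagene's contribution."""
    H_masked = H.copy()          # leave the caller's H untouched
    H_masked[factor, :] = 0.0    # zero the chosen metagene in every sample
    return W @ H_masked

# Hypothetical 3-gene, 2-metagene model; metagene index 1 plays the
# role of the "normal tissue" factor.
W = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])
H = np.array([[1.0, 2.0],
              [0.5, 0.0]])

M_tilde = remove_metagene(W, H, factor=1)
print(M_tilde)
```

Genes driven only by the removed metagene (the last row here) drop to zero, while genes shared between metagenes keep the part explained by the remaining factor.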
In this example, we have shown the power of the metagene projection to define a common space of transcriptional variation in which we can analyze and assess multiple data sets across different technology platforms and laboratories. Despite the diversity of platforms, sample sources, and different experimental conditions, most test samples project with their biological counterparts. Moreover, we have shown that metagene projection provides a method for computationally reducing sample contamination, which enables more coherent projection of tumor samples. Finally, the combination of metagene projection and GSEA analysis allows us to gain insights into more robust, invariant biological features of different phenotypes and tumor subtypes.
Discussion
Traditional approaches to microarray analysis focus on identifying marker genes, which are correlated with a phenotype of interest, and on using them to build classifiers for samples whose phenotype may be unknown or to gain some insight into the underlying biology of a cellular state. These strategies often fail when classifiers are applied to data from other laboratories or derived on different technology platforms or when used to try to assess the validity of a disease model.
Lower-dimensional projections and decompositions of DNA microarray data, such as principal component analysis, singular value decomposition, and NMF, have been used to analyze transcriptional states (3, 33–37). Primarily, these approaches were applied in the context of a single data set for clustering or visualization.
We previously introduced a metagene projection method to assess the validity of a Snf5 knockout mouse as a murine model for Snf5-deficient human rhabdoid tumors (38), and found that the murine Snf5 model samples were closely related to the human rhabdoid samples (from both model and test sets) and distinct from the controls. The model and test sets were obtained on different microarray platforms in addition to being cross-species. This approach combined our previous work, using NMF to identify a small number of gene combinations (metagenes) whose profiles best represent the most distinguishing features of the expression patterns of the subclasses in a data set, with our previously published gene expression data set derived from a collection of human pediatric brain tumors (rhabdoid, medulloblastoma, glioma, and normal cerebellum) (33). A corresponding projection map, the Moore–Penrose generalized pseudoinverse of one of the factor matrices, allowed us to analyze new data in the context of the space of metagenes arising from the original data set.
This article presents a refinement of that method, which is more sensitive, robust, and broadly applicable to cross-platform and cross-species analysis and classification (see SI Text). In addition, we have shown how the projection can be used to highlight the biologically invariant aspects and commonalities of the subclasses, assess the similarities and differences between suitably chosen sets of model and test samples, and, surprisingly, computationally remove contaminating signals from tumor data.
The method, as presented here, has a number of advantages over other approaches. Metagene projection, together with NMF, reduces dimensionality and summarizes the salient features of a data set with coherent patterns shared by multiple genes and samples. In contrast to approaches using principal component analysis or singular value decomposition, it yields a sparser representation of the original model data set optimized for the number of factors specified. NMF factors are nonnegative and more localized and therefore easier to interpret and analyze. We note here that Alter and Golub (39) applied the pseudoinverse to genomic data by using the singular value decomposition.
There is complementary work of Huang (40) and Bild (41), which is conceptually similar to ours in the sense of combining dimensionality reduction and classification models, but has distinct objectives. Their main goal is to provide an exquisitely specific predictor of pathway activation, which has been experimentally characterized by the overexpression of a single gene. In contrast, our goal is to model global transcriptional states, rather than specific pathways, and to use them to describe an entire range of biological behavior, e.g., different morphologies, lineages, etc. Thus, the specific methodologies and techniques we use are also quite different.
Classifiers built in metagene, rather than all-gene, space are more robust, reproducible, and generalizable across platforms and laboratories because the projection can reduce noise and technology-based variation more than simple normalization. In particular, we found this approach to be very sensitive in the complex, cross-platform, multiclass setting of the leukemia data sets. Others have studied cross-platform classification in lung cancer (42, 43). However, they use the test data explicitly to choose similarly correlated genes as features, rather than relying solely on the model set.
Most importantly, metagene models built on previously acquired or published data sets enable the use of prior knowledge to help characterize and analyze new data. This is seen in our work validating a mouse model for human rhabdoid tumors (38). We also used this approach to analyze samples from malaria-infected patients using signatures derived from publicly available yeast data (P.T., D.S., J.P.M., unpublished work). Thus, we see that this metagene projection method not only decreases noise by reducing the dimensionality of microarray data, but can also provide a powerful knowledge-based approach to the cross-platform, cross-species analysis of microarray data.
Methods
Data Set Preprocessing and Normalization.
For Affymetrix Hu6800 and U133 microarrays, we threshold at 20 and 100,000 units. Gene filtering excludes genes with <5-fold and 500 units of maximum difference for the first leukemia example, 8-fold/800 for the second leukemia example, and 3-fold/300 for the lung example. Within each sample, we rank the genes according to their expression levels and replace each value by 10,000 × (rank(gene) − 1)/(number of genes − 1).
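The thresholding and rank normalization of a single sample can be sketched in plain Python as below. The function names are hypothetical, ties are broken by input order (the text does not say how ties are handled), and the variation filter is omitted.

```python
def threshold(values, floor=20.0, ceiling=100000.0):
    """Clip expression values to the [floor, ceiling] range."""
    return [min(max(v, floor), ceiling) for v in values]

def rank_normalize(values, scale=10000.0):
    """Replace each value by scale * (rank - 1) / (n - 1), rank 1 = lowest.

    Assumes at least two genes; ties are broken by input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank_minus_1, i in enumerate(order):
        out[i] = scale * rank_minus_1 / (n - 1)
    return out

# One hypothetical sample with five genes.
sample = [5.0, 250.0, 1200.0, 90.0, 300000.0]
clipped = threshold(sample)
normalized = rank_normalize(clipped)
print(clipped)     # [20.0, 250.0, 1200.0, 90.0, 100000.0]
print(normalized)  # [0.0, 5000.0, 7500.0, 2500.0, 10000.0]
```

Because only ranks survive, every sample ends up on the same 0 to 10,000 scale regardless of the platform's intensity range.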
Metagene Factor Extraction.
We use NMF with 2,000 iterations and stopping criterion as described (3).
Metagene Model Selection.
We select k based either on the known number of phenotypes or by using the values determined by projection stability as described (3). Optimal solutions are peaks in the cophenetic coefficient as a function of k.
Data Set Refinement.
We train an SVM on H_{M} to predict each class, and we remove samples that are errors (known phenotypes) or no calls (discovered classes). In our experience, the number of outliers is quite small compared with the size of the classes if the number of metagenes is chosen as described above.
Calculating the Pseudoinverse of W_{M}.
We use “ginv” from R's MASS package.
Metagene Projection of Model and Test Set Samples.
To project the model set, we use the pseudoinverse of W_{M}. For each data set in the test set, we match the genes to the corresponding rows of W_{M} (i.e., genes in the model set). We calculate the pseudoinverse for that set of rows and apply it to obtain the corresponding columns of H_{T} for that specific data set. This procedure adapts the projection to the particular test data set and, by tolerating unmatched genes between model and test set, supports the projection of data sets from different platforms. If too many unmatched genes result in weak amplitudes in H_{T}, we rescale the columns of H_{T} so the sum of the squares of their row entries is equal to one. This postnormalization is optional.
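A minimal Python sketch of this test-set projection follows: it restricts W_{M} to the genes shared with the test set, takes the pseudoinverse of that row subset, and optionally rescales each column of H_{T} to unit sum of squares. The function name, the gene identifiers, and the matrices are hypothetical, and matching by exact identifier is an assumption; real cross-platform matching would need a probe-to-gene mapping.

```python
import numpy as np

def project_test_set(W, model_genes, T, test_genes, rescale=False):
    """Project test matrix T (genes x samples) using genes shared with the model."""
    test_index = {g: i for i, g in enumerate(test_genes)}
    shared = [g for g in model_genes if g in test_index]
    w_rows = [i for i, g in enumerate(model_genes) if g in test_index]
    t_rows = [test_index[g] for g in shared]
    # Pseudoinverse of only the matched rows of W.
    H_T = np.linalg.pinv(W[w_rows, :]) @ T[t_rows, :]
    if rescale:
        norms = np.linalg.norm(H_T, axis=0)          # per-sample amplitude
        H_T = H_T / np.where(norms == 0, 1.0, norms)
    return H_T, shared

model_genes = ["g1", "g2", "g3", "g4"]
W = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 1.0],
              [0.0, 3.0]])

# Test platform lacks g2, has an extra gene gX, and orders genes differently.
test_genes = ["g4", "g1", "gX", "g3"]
T = np.array([[9.0, 0.0],
              [1.0, 2.0],
              [7.0, 7.0],
              [3.0, 0.0]])

H_T, shared = project_test_set(W, model_genes, T, test_genes)
H_T_unit, _ = project_test_set(W, model_genes, T, test_genes, rescale=True)
print(shared)
print(np.round(H_T, 6))
```

Recomputing the pseudoinverse on the matched rows, rather than zero-filling missing genes, keeps the projection a least-squares fit over the genes that were actually measured.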
Clustering.
We use “hclust” (complete linkage) from R's stats package.
Classification and Prediction Confidence.
We use the “svm” function from R's e1071 package (one vs. all, radial basis function kernel). The predicted class is the one with the highest probability, and a predictive confidence 1 ≥ C_{p} ≥ 0 is computed by using a modification of the Brier skill score (44) applied to P_{1} > P_{2} > … > P_{k}, the sorted list of k output probabilities for a given sample. A sample with C_{p} < 0.3 is a no call. The KNN classifier in the leukemia example used 50 marker genes and nine nearest neighbors. For the SVM using all genes, we use a “linear” kernel.
Acknowledgments
We thank J. P. Brunet, T. Golub, E. Lander, and M. Meyerson for helpful conversations and for reviewing this manuscript.
Footnotes
 ^{§}To whom correspondence should be addressed. Email: mesirov{at}broad.mit.edu

Author contributions: P.T. and J.P.M. designed research; P.T., D.S., B.L.E., C.W.M.R., and J.P.M. performed research; M.A.G. contributed data; P.T., D.S., and J.P.M. analyzed data; and P.T., B.L.E., and J.P.M. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0701068104/DC1.
Abbreviations: GSEA, gene set enrichment analysis; NMF, nonnegative matrix factorization; SVM, support vector machine.

Freely available online through the PNAS open access option.
 © 2007 by The National Academy of Sciences of the USA
References
2. Lee DD, Seung HS
3. Brunet JP, Tamayo P, Golub TR, Mesirov JP
4. Ben-Israel A, Greville TNE
5. Cristianini N, Shawe-Taylor J
6. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.
8. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP
9. Ross ME, Zhou X, Song G, Shurtleff SA, Girtman K, Williams WK, Liu HC, Mahfouz R, Raimondi SC, Lenny N, et al.
10. Ross ME, Mahfouz R, Onciu M, Liu HC, Zhou X, Song G, Shurtleff SA, Pounds S, Cheng C, Ma J, et al.
11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.
12. Guyon IM, Gunn SR, Nikravesh M, Zadeh L
15. Chiaretti S, Li X, Gentleman R, Vitale A, Vignetti M, Mandelli F, Ritz J, Foa R
19. Bourquin JP, Subramanian A, Langebrake C, Reinhardt D, Bernard O, Ballerini P, Baruchel A, Cave H, Dastugue N, Hasle H, et al.
20. Fine BM, Stanulla M, Schrappe M, Ho M, Viehmann S, Harbott J, Boxer LM
21. Nilsson B, Andersson A, Johansson M, Fioretos T
23. Lemasson I, Thebault S, Sardet C, Devaux C, Mesnard JM
24. Gajate C, An F, Mollinedo F
26. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al.
27. Virtanen C, Ishikawa Y, Honjoh D, Kimura M, Shimane M, Miyoshi T, Nomura H, Jones MH
28. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WO, Weinstein JN, et al.
31. Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, et al.
34. Kim PM, Tidor B
35. Alter O, Brown PO, Botstein D
36. Moloshok TD, Klevecz RR, Grant JD, Manion FJ, Speier WFT, Ochs MF
38. Isakoff MS, Sansam CG, Tamayo P, Subramanian A, Evans JA, Fillmore CM, Wang X, Biegel JA, Pomeroy SL, Mesirov JP, Roberts CW
39. Alter O, Golub GH
42. Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E
43. Hayes DN, Monti S, Parmigiani G, Gilks CB, Naoki K, Bhattacharjee A, Socinski MA, Perou C, Meyerson M