Previous Article |
Table of Contents
| Next Article
Statistics
Robust singular value decomposition analysis of microarray data



*National Institute of Statistical Sciences, P.O. Box 14006, Research Triangle Park, NC 27709-4006;
School of Statistics, University of Minnesota, 313 Ford Hall, 224 Church Street NE, Minneapolis, MN 55455; and
GlaxoSmithKline, Research Triangle Park, NC 27709
Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved July 2, 2003 (received for review May 22, 2003)
| Abstract |
|---|
|
|
|---|
Microarray data present a number of challenges to statistical modeling. The size of the typical array (up to thousands of columns and perhaps hundreds of rows) defies easy graphical analyses. There may be severe distributional difficulties such as non-normal distributions, outliers (unusual data values), and numerous missing values. Common objectives are finding "patterns" in the data, in particular
One attractive way of clustering is a by-product of "ordination." Ordination involves finding suitable permutations of the rows (and perhaps of the columns) that lead to a steady progression going down the rows (and perhaps across the columns). A clustering is given by placing vertical (and perhaps horizontal) dividing lines in the array to break it up into rectangular blocks within which the values are homogeneous.
Conversely, not all clustering methods are hierarchical, but if we cluster the rows and elect to do so with any hierarchical clustering method, the dendrogram produces an ordering of the rows, although, since the layout of the dendrogram is not unique, neither is the ordering produced. Thus good ordination methods lead to good clustering whereas hierarchical clustering gives (nonunique) row ordinations.
| Methods |
|---|
|
|
|---|
![]() |
where ri is a parameter corresponding to the ith biological sample, cj corresponds to the jth gene, and eij is a "residual." This representation solves the ordination problem in that the rows can be ordered by their ri values and the columns by their cj values. Ordering the rows by r and the columns by c permutes the original data array to one in which we have high and low values in the corners and medium values in the middle, leading to an informative display.
Subsequently, grouping together those rows whose ri are similar will give clusters of biological samples. Grouping the columns with similar cj will give clusters of genes. If the residuals are small so that the ricj captures all of the important structure of the data matrix, then the ordination and subsequent clustering using the r or c values is essentially unique.
Standard practice is to remove "uninteresting structure" such as a grand mean, or even the row or column means from X before attempting the approximation. This is more of an implementation detail than a central aspect of the method.
The conventional method of getting this bilinear approximation is from the singular value decomposition (SVD) of X from Healy (1). It is well known that the leading term of the SVD provides the bilinear form with the best least-squares approximation to X. The SVD is often found by performing a principal component analysis on either X'X or XX'.
The conventional SVD, however, has some serious deficiencies. First, being a least-squares method, it is highly susceptible to outlier values in the array X. Such outliers are an accepted fact of life when dealing with microarray data, where a sprinkling of entries are found to be very large or small. Second, finding the SVD through a principal component analysis of X'X requires that all elements of X be observed. This goes counter to a second reality of microarray data, which is that missing values are a routine feature of the experimental data. Standard approaches to missing data are to "impute" values for the missing cells or eliminate whole rows or columns of the data matrix that are felt to be too incomplete. These methods are obviously at best unattractive necessities.
Alternating Least Squares. There is a standard remedy for the "missing information" deficiency, the Gabriel-Zamir alternating least-squares algorithm (2). This begins with a tentative estimate of the column factors cj, which are used to provide a matching scaling for the rows. Regarding
![]() |
as a regression of the ith row of X on the column factors identifies rj as the coefficient of a no-intercept regression. Fitting this regression row by row by using all nonempty cells then leads to an estimate of the row factors ri. Then switching roles, we take the row factors ri as given and use regression of all nonempty cells in exactly the same way to calculate fresh estimates of the column factors cj. This approach uses all of the observed data and does not require imputation of missing data. If the data set is complete, this alternating least-squares algorithm gives the first term of the conventional SVD.
Alternating Robust Fitting (ARF). The ALS method is effective in solving the missing information problem. But it does nothing about the sensitivity to outliers. Solving the outlier issue, however, can be done by a simple change in the regression method that lies at the heart of the Gabriel-Zamir algorithm. Instead of using ordinary least squares to carry out the alternating regressions, we can use any outlier-resistant regression method such as L1 (D.M.H., L.L., and S.S.Y., www.niss.org, technical report 122), weighted L1 [refer to Croux et al. (3) for details], least trimmed squares [refer to Ukkelberg and Borgen (4) for details], or an M-estimation method. In this article, we use the least trimmed squares method. The resulting algorithm then is to take the model
![]() |
Use any convenient initial values for the column factors cj (or optionally for the row factors) and then apply a robust no-intercept linear regression algorithm to alternately use the cj to refine the estimates of the ri and the ri to refine the estimates of the cj.
Using any robust regression criterion, each of these alternating regressions will lead to a reduction in the regression criterion, so the algorithm will converge.
Properties of the ARF Fit. Broad properties of the ARF bilinear fit follow at once. The method handles missing information routinely, without requiring a separate "fill-in" step. And it is impervious to a minority of outlier cells. Outliers will, of course, create a problem for the ARF, as with almost any conceivable method, if they constitute the majority of the elements of any row or column.
Clustering of the Rows and Columns. As already noted, sorting the rows by their ri values creates a natural ordination. This can be turned into a clustering of k groups of biological samples by finding "breakpoints" b(0) = 0 < b(1) < b(2) < b(k - 1) < b(k) = n and allocating to cluster h those genes that, in the reordering, have index b(h - 1) < i < = b(h). The breakpoints need to be chosen so that the biological samples within each cluster have ri values as similar as possible. This can be made operational by the criterion that the pooled sum of squared deviations of the ri broken down into the k clusters should be a minimum. Exact algorithms for finding breakpoints to attain this minimum are given by Venter and Steel (5) and Hawkins (6). Similarly, applying the optimal segmentation algorithm to the column factors cj clusters the genes into any specified number of clusters such that the genes within clusters have cj values as similar as possible.
| Relationship to Other Clustering Approaches |
|---|
|
|
|---|
![]() |
we see that the squared Euclidean distance between any two rows i and k can be written
![]() |
Now if the bilinear term ricj has captured all of the "structure" in the sample-gene association, and all that is left is statistically independent measurement noise (not necessarily small) as is the case when the SVD is used to get the bilinear approximation. Consider the three terms comprising
.
Theory therefore implies that a well-fitting bilinear approximation to the matrix X will give a better picture of the biological sample differences through its row factors ri than can be found directly by using the Euclidean distances between rows. A similar conclusion applies to using the correlation between pairs of row profiles.
There is yet another consideration favoring use of the ARF for clustering. If the matrix contains outliers, these outliers contaminate the Euclidean distances (and the correlation coefficients) between rows. However, it is a consequence of the robust fits used in the ARF that the outliers do not contaminate the ri or cj substantially.
| GeneLogic Data |
|---|
|
|
|---|
Fig. 1 shows the image of the unordered log transformed data matrix of set A after removing the row and column means, where the rows correspond to the cell lines and the columns correspond to the genes. In this image, we cannot see any clear patterns. The log-transformed and row and column means subtracted data are available at www.samsi.info/200304/dmml/web-internal/bio/data/data_rsvd.xls.
|
Our goal is to cluster similar genes together and similar cell lines together simultaneously. Graphically, we hope to form blocks of reds and greens.
We used the ARF to fit a bilinear approximation to the log-transformed expression matrix after removing the row and column means. Using the resulting bilinear approximation to order the rows and the columns of X leads to the display of Fig. 2A. Note that visually the ordination has been highly successful in rearranging the matrix so as to give blocks of high and low values in the corners and in-between values in the remainder of the array.
|
Next, we applied the segmentation algorithm to the row factors ri to segment the biological samples, and to the column factors cj to segment the genes. Tables 1 and 2 show the results of fitting various numbers of clusters. In an exploratory statistical analysis such as this, we do not need a rigorous answer to the question of the number of genuine clusters, but guidance comes from the variance explained by breaking the factors into two groups, three groups, four groups, etc. In Tables 1 and 2, the major explained variability is attained once five clusters are formed, and so we will use this as our working solution. Numbering the columns (samples) in their ri order, the optimal division into five segments breaks samples 1-18, 19-27, 28-63, 64-204, and 205-504. Similarly, based on Table 2, the optimal five-segment division of the genes is (using their cj ordering) genes 1-6, 7-33, 34-73, 74-204, and 205-224.
|
|
Looking at the names of the ordered samples (rows) shows clear separation of liver and kidney samples from the other seven tissues; samples 1-18 are mostly normal liver samples, samples, 19-27 are mostly malignant liver samples, and samples 28-63 are mostly kidney samples. Turning to the ordering of genes (columns), genes 1-6 (the first six genes in Fig. 2 A) in gene lists for set A are enriched for genes that are involved in steroid hormone metabolism (UGT1A, UGT2B, and HSD). This is biologically consistent since the liver and kidney are the two organs predominantly involved in metabolism. Two of the genes in this group (UGT1A and UGT2B) have duplicate probes on the microarrays, and the SVD algorithm correctly clusters them together. This finding is to be expected since the duplicate probes, coding for the same gene, should display very similar expression profiles.
This data set happened to contain no missing values, so it was possible to analyze it with the conventional squared-norm singular value decomposition and to carry out a clustering paralleling by using the robust SVD (rSVD). The results, however, were far inferior. The conventional SVD did a bad job in identifying the genes relating to androgen/estrogen metabolism.
We tested the validity of our rSVD results by comparing the predicted liver and kidney enrichment of genes with independently generate data on the same genes from an unrelated, public-domain gene expression database (Gene Express Atlas, maintained by the Genome Institute of Novartis Research Foundation, ref. 7). This database contains gene expression profiles from 91 human and mouse samples and is generated on an earlier version of Affymetrix microarrays (compared with that used in the GeneLogic database). Table 3 summarizes the results obtained for the queried genes. In all cases, the gene expression is significantly higher in liver and kidney samples compared with the median gene expression across all tissues. Thus the patterns of gene-sample clusters identified by rSVD are substantiated in an independent data set (see Figs. 4-6, which are published as supporting information on the PNAS web site).
|
Additionally, evidence from the literature helps explain some of the reasons the specific genes are found to be enriched in the liver samples. For example, mutations in the UGT1A gene are associated with a variety of liver-specific diseases such as Crigler-Najaar syndrome, Gilbert syndrome, familial transient neonatal hyperbilirubinemia, and cholelithiasis (8-10). Another gene found to be up-regulated in this cluster is hydroxysteroid dehydrogenase (HSD11). The gene product of HSD11 catalyzes the conversion of cortisol to cortisone and is an important regulator of glucocorticoid metabolism. The gene for hydroxysteroid (11 beta) B1 isoform was first isolated from rat liver (11). Intense immunoreactivity to HSD11B1 has been observed around the hepatic central vein, confirming that the protein expression is also high in liver (12). Studies with knockout mice demonstrate that HSD11B1 deficiency produces an improved metabolic profile characterized by increased lipid catabolism, increased hepatic insulin sensitivity, and reduced intracellular glucocorticoid concentrations (13). Another up-regulated gene, aldolase B, is a tetrameric glycolytic enzyme that catalyzes the reversible conversion of fructose-1,6-bisphosphate to glyceraldehyde 3-phosphate and dihydroxyacetone phosphate. Aldolases A, B, and C are distinct proteins exhibiting developmentally regulated expression of the different isozymes. Aldolase B expression is observed only in adult human liver, kidney, and intestine. Significantly, our segmentation algorithm detected the right isoform of aldolase (aldolase B) for enriched liver expression. Deficiency in ALDOB function is related to hereditary fructose intolerance. The preferential expression of ALDOB in liver has been used in 31P magnetic resonance imaging studies to follow the metabolism of fructose in the liver of patients with this disorder (14).
Similarly, we then examine results from set B. Set B shows a clear separation of white blood cells from other tissue types as shown in the ordered samples based on the first rSVD component (see Fig. 7, which is published as supporting information on the PNAS web site). The corresponding gene segmentation shows a subset of genes (records 211-224; see Tables 5 and 6, which are published as supporting information on the PNAS web site) that show preferential expression in white blood cells. This list is enriched for genes that are involved in apoptosis (programmed cell death).
We then investigate the tissue expression of the apoptosis-specific genes in the Gene Express Atlas database. As shown in Table 4, the majority of the genes identified by the rSVD algorithm indeed show very high levels of expression in white blood cells, again providing excellent validation of our results.
|
The enrichment of proapoptotic genes in white blood cells is consistent with the biology of white blood cells. Apoptosis plays an essential role in immune system homeostasis. The vertebrate immune system uses apoptosis to control cell number, delete lymphocytes with inoperative or autoreactive receptors from its repertoire, and reverse clonal expansion at the end of an immune response. The programmed removal of lymphocytes in response to cellular stress or injury or genetic errors serves to preserve genomic integrity and constitutes an important mechanism of tumor surveillance. Given the crucial role of apoptosis in such a diverse array of physiologic functions, aberrations of this process underlie a host of immune disorders. Genetic aberrations that render cells incapable of executing their suicide program promote tumorigenesis and underlie the observed resistance of lymphoid cancers to genotoxic anticancer agents (15). In addition to immune homeostasis, T and B cell lymphocytes and neutrophils undergo programmed cell death under a wide variety of physiological and drug-induced conditions such as immunosuppression by polycyclic aromatic hydrocarbons, perturbations of redox states, tissue repair, treatment of Crohn's disease, immune evasion in renal cell carcinoma, and tumor necrosis factor
-induced effects in aging, to name a few (16-21).
| Finding Additional Structure |
|---|
|
|
|---|
Doing so does indeed uncover additional biologically meaningful structure, as shown in Fig. 2B for set A. The segmentation algorithm suggests six segments for the cell lines: 1-76, 77-195, 196-333, 334-391, 392-462, and 463-504. The cell lines of the first segment are mostly prostate samples, and the last two segments are mostly colon samples. Corresponding to this cell line segmentation, we divide the genes into five segments, genes 1-11, 12-65, 66-176, 177-219, and 220-224 (see Tables 7 and 8, which are published as supporting information on the PNAS web site).
As we can see, the two components for set A give two substantially different orderings of genes and samples and represent different aspects of the gene expression data. This richness of interpretation cannot be achieved by one single ordering of genes and samples.
We could continue this process. Subtracting this second bilinear term and repeating the ARF on the residuals gives a third component and could be repeated for a fourth and so on. The number of components that we should study depends on the number of significant eigenvalues. For set A, the scree plot of the eigenvalues suggests that two components (see Fig. 3A) capture all of the structure of the array, so we stop at two components.
|
| Finding Outliers, Filling In Estimates of Missing Values, and Smoothing |
|---|
|
|
|---|
Enriching notation slightly, write rim and cjm for the row and column factors given in the mth bilinear pair fitted. Then for any missing cell ij, we could predict the missing value by
mrimcjm. Another possible use of the ARF fits is to replace the entire matrix by the rank-m approximation given by using this missing value fill-in for all cells for purposes of other statistical analyses or displays. The potential attraction of this approach is that it would largely remove the impact of outlier cells as well as avoiding gaps in the matrix.
| Discussion |
|---|
|
|
|---|
The microarray example illustrates the usefulness of this method, where the cell lines of the same origin are grouped together, and some genes found are confirmed by previous literature. The outlier detection points out some possible outliers. They may be experimental mistakes or specific gene actions that deserve further study.
This microarray example was chosen to verify that the techniques worked on a reasonably well-understood data set. In addition to this example, these methods have been used successfully on other public-domain and proprietary data sets.
| Acknowledgements |
|---|
| Footnotes |
|---|
Abbreviations: SVD, singular value decomposition; rSVD, robust SVD; ARF, alternating robust fitting.
To whom correspondence should be sent at present address: Aventis Pharmaceuticals, Bridgewater, NJ 08807. E-mail: li.liu{at}aventis.com.
| References |
|---|
|
|
|---|
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
L. Li, L. Ying, M. Naesens, W. Xiao, T. Sigdel, S. Hsieh, J. Martin, R. Chen, K. Liu, M. Mindrinos, et al. Interference of globin genes with biomarker discovery for allograft rejection in peripheral blood samples Physiol Genomics, January 17, 2008; 32(2): 190 - 197. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Fogel, S. S. Young, D. M. Hawkins, and N. Ledirac Inferential, robust non-negative matrix factorization analysis of microarray data Bioinformatics, January 1, 2007; 23(1): 44 - 49. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. W. Carter, S. Rupp, G. R. Fink, and T. Galitski Disentangling information flow in the Ras-cAMP signaling network Genome Res., April 1, 2006; 16(4): 520 - 526. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. A. Irizarry, Z. Wu, and H. A. Jaffee Comparison of Affymetrix GeneChip expression measures Bioinformatics, April 1, 2006; 22(7): 789 - 794. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Tuikkala, L. Elo, O. S. Nevalainen, and T. Aittokallio Improving missing value estimation in microarray data with gene ontology Bioinformatics, March 1, 2006; 22(5): 566 - 572. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||