Robust singular value decomposition analysis of microarray data

Edited by Peter J. Bickel, University of California, Berkeley, CA (received for review May 22, 2003)
Abstract
In microarray data there are a number of biological samples, each assessed for the level of gene expression for a typically large number of genes. There is a need to examine these data with statistical techniques to help discern possible patterns in the data. Our technique applies a combination of mathematical and statistical methods to progressively take the data set apart so that different aspects can be examined for both general patterns and very specific effects. Unfortunately, these data tables are often corrupted with extreme values (outliers), missing values, and non-normal distributions that preclude standard analysis. We develop a robust analysis method to address these problems. The benefits of this robust analysis will be both the understanding of large-scale shifts in gene effects and the isolation of particular sample-by-gene effects that might be either unusual interactions or the result of experimental flaws. Our method requires a single pass and does not resort to complex "cleaning" or imputation of the data table before analysis. We illustrate the method with a commercial data set.
Biologists are using DNA microarrays to monitor the level of gene expression of biological samples. Thousands of genes are typically monitored on a few to tens of samples. In the near future, it is expected that there will be data sets of hundreds of samples. Patterns of gene expression can be used to determine coregulated genes, suggest biomarkers of specific disease, and propose targets for drug intervention.
Microarray data present a number of challenges to statistical modeling. The size of the typical array (up to thousands of columns and perhaps hundreds of rows) defies easy graphical analyses. There may be severe distributional difficulties such as non-normal distributions, outliers (unusual data values), and numerous missing values. Common objectives are finding "patterns" in the data, in particular

clustering the biological samples (rows) into groups with similar expression profiles;

clustering the genes (columns) into groups where the level of gene expression is similar in the samples.
One attractive way of clustering is a byproduct of "ordination." Ordination involves finding suitable permutations of the rows (and perhaps of the columns) that lead to a steady progression going down the rows (and perhaps across the columns). A clustering is given by placing vertical (and perhaps horizontal) dividing lines in the array to break it up into rectangular blocks within which the values are homogeneous.
Conversely, clustering can yield an ordination. Not all clustering methods are hierarchical, but if we cluster the rows with any hierarchical clustering method, the dendrogram produces an ordering of the rows, although, because the layout of the dendrogram is not unique, neither is the ordering produced. Thus good ordination methods lead to good clustering, whereas hierarchical clustering gives (nonunique) row ordinations.
Methods
The classical method of ordination is through the singular value decomposition. Write the expression data as an n by p array X with rows representing the n biological samples and columns representing the p genes. Approximate X with the bilinear form

x_{ij} = r_{i}c_{j} + e_{ij},

where r_{i} is a parameter corresponding to the ith biological sample, c_{j} corresponds to the jth gene, and e_{ij} is a "residual." This representation solves the ordination problem in that the rows can be ordered by their r_{i} values and the columns by their c_{j} values. Ordering the rows by r and the columns by c permutes the original data array to one in which we have high and low values in the corners and medium values in the middle, leading to an informative display.
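As a small illustration (ours, not from the paper), the leading SVD term of a simulated complete matrix recovers such an ordination; the factor values and variable names below are invented for the example.

```python
import numpy as np

# Hidden row/column factors (illustrative values, ours)
r_true = np.array([-1.2, -0.5, 0.1, 0.4, 1.0, 1.7])
c_true = np.array([-2.0, -1.0, -0.3, 0.2, 0.8, 1.1, 1.9, 2.5])
rng = np.random.default_rng(0)
X = np.outer(r_true, c_true) + 0.01 * rng.normal(size=(6, 8))

# Leading term of the SVD: the best least-squares bilinear approximation
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = s[0] * U[:, 0]          # row factors r_i
c = Vt[0, :]                # column factors c_j

row_order = np.argsort(r)   # ordination of the biological samples
col_order = np.argsort(c)   # ordination of the genes
X_ordered = X[np.ix_(row_order, col_order)]
```

After the permutation, `X_ordered` has its extreme values in the corners and intermediate values in the middle, which is the informative display described above.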
Subsequently, grouping together those rows whose r_{i} are similar will give clusters of biological samples. Grouping the columns with similar c_{j} will give clusters of genes. If the residuals are small so that the r_{i}c_{j} captures all of the important structure of the data matrix, then the ordination and subsequent clustering using the r or c values is essentially unique.
Standard practice is to remove "uninteresting structure" such as a grand mean, or even the row or column means, from X before attempting the approximation. This is more of an implementation detail than a central aspect of the method.
The conventional way of obtaining this bilinear approximation is from the singular value decomposition (SVD) of X [Healy (1)]. It is well known that the leading term of the SVD provides the bilinear form with the best least-squares approximation to X. The SVD is often found by performing a principal component analysis on either X′X or XX′.
The conventional SVD, however, has some serious deficiencies. First, being a least-squares method, it is highly susceptible to outlier values in the array X. Such outliers are an accepted fact of life when dealing with microarray data, where a sprinkling of entries are found to be very large or small. Second, finding the SVD through a principal component analysis of X′X requires that all elements of X be observed. This goes counter to a second reality of microarray data, which is that missing values are a routine feature of the experimental data. Standard approaches to missing data are to "impute" values for the missing cells or eliminate whole rows or columns of the data matrix that are felt to be too incomplete. These methods are obviously at best unattractive necessities.
Alternating Least Squares. There is a standard remedy for the "missing information" deficiency, the Gabriel-Zamir alternating least-squares algorithm (2). This begins with a tentative estimate of the column factors c_{j}, which are used to provide a matching scaling for the rows. Regarding x_{ij} = r_{i}c_{j} + e_{ij} as a regression of the ith row of X on the column factors identifies r_{i} as the coefficient of a no-intercept regression. Fitting this regression row by row by using all nonempty cells then leads to an estimate of the row factors r_{i}. Then switching roles, we take the row factors r_{i} as given and use regression of all nonempty cells in exactly the same way to calculate fresh estimates of the column factors c_{j}. This approach uses all of the observed data and does not require imputation of missing data. If the data set is complete, this alternating least-squares algorithm gives the first term of the conventional SVD.
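A minimal sketch of this alternating scheme in Python (our own illustration; the function name and the simple initialization are ours, not the original implementation):

```python
import numpy as np

def als_rank1(X, n_iter=100):
    """Leading bilinear term r_i * c_j by alternating no-intercept
    least-squares regressions over the observed (non-NaN) cells only."""
    obs = ~np.isnan(X)
    Xz = np.where(obs, X, 0.0)          # placeholder zeros, masked out below
    r = np.ones(X.shape[0])             # crude starting row factors
    c = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):     # regress each column on r
            m = obs[:, j]
            c[j] = Xz[m, j] @ r[m] / (r[m] @ r[m])
        for i in range(X.shape[0]):     # then each row on c
            m = obs[i]
            r[i] = Xz[i, m] @ c[m] / (c[m] @ c[m])
    return r, c
```

Each half-step is a one-parameter no-intercept regression over the nonempty cells, so no imputation of the missing entries is ever needed.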
Alternating Robust Fitting (ARF). The ALS method is effective in solving the missing information problem, but it does nothing about the sensitivity to outliers. Solving the outlier issue, however, can be done by a simple change in the regression method that lies at the heart of the Gabriel-Zamir algorithm. Instead of using ordinary least squares to carry out the alternating regressions, we can use any outlier-resistant regression method such as L1 (D.M.H., L.L., and S.S.Y., www.niss.org, technical report 122), weighted L1 [refer to Croux et al. (3) for details], least trimmed squares [refer to Ukkelberg and Borgen (4) for details], or an M-estimation method. In this article, we use the least trimmed squares method. The resulting algorithm then is to take the model

x_{ij} = r_{i}c_{j} + e_{ij},

use any convenient initial values for the column factors c_{j} (or optionally for the row factors), and then apply a robust no-intercept linear regression algorithm to alternately use the c_{j} to refine the estimates of the r_{i} and the r_{i} to refine the estimates of the c_{j}.
Using any robust regression criterion, each of these alternating regressions will lead to a reduction in the regression criterion, so the algorithm will converge.
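As a concrete sketch of such an alternating robust scheme (our code, not the paper's implementation), the following swaps the least-squares regressions for no-intercept L1 regressions, each computed as a weighted median of ratios; the paper itself uses least trimmed squares, and all names here are ours:

```python
import numpy as np

def l1_slope(x, y):
    """No-intercept L1 regression slope: the weighted median of y/x with
    weights |x| minimizes sum |y - b*x|."""
    m = x != 0
    ratios, w = y[m] / x[m], np.abs(x[m])
    order = np.argsort(ratios)
    cw = np.cumsum(w[order])
    return ratios[order][np.searchsorted(cw, cw[-1] / 2.0)]

def arf_rank1(X, n_iter=30):
    """Alternating robust fitting: like ALS, but each alternating
    regression is robust, so a minority of outlier cells is resisted."""
    obs = ~np.isnan(X)
    r = np.ones(X.shape[0])
    c = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):     # robustly regress each column on r
            m = obs[:, j]
            c[j] = l1_slope(r[m], X[m, j])
        for i in range(X.shape[0]):     # then each row on c
            m = obs[i]
            r[i] = l1_slope(c[m], X[i, m])
    return r, c
```

On a rank-one matrix contaminated with a few gross outliers, the fit recovers the clean structure in the uncontaminated cells, while the outlier cells show up as very large residuals.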
Properties of the ARF Fit. Broad properties of the ARF bilinear fit follow at once. The method handles missing information routinely, without requiring a separate "fill-in" step. And it is impervious to a minority of outlier cells. Outliers will, of course, create a problem for the ARF, as with almost any conceivable method, if they constitute the majority of the elements of any row or column.
Clustering of the Rows and Columns. As already noted, sorting the rows by their r_{i} values creates a natural ordination. This can be turned into a clustering of k groups of biological samples by finding "breakpoints" b(0) = 0 < b(1) < b(2) < ... < b(k − 1) < b(k) = n and allocating to cluster h those samples that, in the reordering, have index b(h − 1) < i ≤ b(h). The breakpoints need to be chosen so that the biological samples within each cluster have r_{i} values as similar as possible. This can be made operational by the criterion that the pooled sum of squared deviations of the r_{i} broken down into the k clusters should be a minimum. Exact algorithms for finding breakpoints to attain this minimum are given by Venter and Steel (5) and Hawkins (6). Similarly, applying the optimal segmentation algorithm to the column factors c_{j} clusters the genes into any specified number of clusters such that the genes within clusters have c_{j} values as similar as possible.
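The pooled within-cluster sum-of-squares criterion can be minimized exactly by dynamic programming. The following sketch is ours (the cited exact algorithms differ in detail but attain the same optimum); it segments a vector of factor values into k contiguous groups after sorting:

```python
import numpy as np

def segment_1d(v, k):
    """Optimal split of the sorted factor values v into k contiguous
    segments minimizing the pooled within-segment sum of squares."""
    v = np.sort(np.asarray(v, dtype=float))
    n = len(v)
    s1 = np.concatenate(([0.0], np.cumsum(v)))       # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(v * v)))   # prefix sums of squares

    def sse(a, b):      # within-segment sum of squares of v[a:b]
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    cost = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for h in range(1, k + 1):
        for b in range(h, n + 1):
            for a in range(h - 1, b):
                c = cost[h - 1, a] + sse(a, b)
                if c < cost[h, b]:
                    cost[h, b], back[h, b] = c, a
    # recover breakpoints b(0) = 0 < b(1) < ... < b(k) = n
    bp = [n]
    for h in range(k, 0, -1):
        bp.append(int(back[h, bp[-1]]))
    return cost[k, n], bp[::-1]
```

Applying this to the r_{i} segments the samples, and applying it to the c_{j} segments the genes, exactly as described above.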
Relationship to Other Clustering Approaches
A common approach to clustering genes has been through (dis)similarity indices between the rows of X, for example, the Euclidean distance between rows as a dissimilarity measure or their correlation as a similarity measure. These measures can then be used in any convenient dissimilarity-based cluster method such as average linkage. If we look at this approach through the bilinear approximation, we see that the squared Euclidean distance between any two rows i and k can be written

d²(i, k) = Σ_{j}(x_{ij} − x_{kj})² = (r_{i} − r_{k})²Σ_{j}c_{j}² + Σ_{j}(e_{ij} − e_{kj})² + 2(r_{i} − r_{k})Σ_{j}c_{j}(e_{ij} − e_{kj}).

Now suppose the bilinear term r_{i}c_{j} has captured all of the "structure" in the sample-gene association, and all that is left is statistically independent measurement noise (not necessarily small), as is the case when the SVD is used to get the bilinear approximation. Consider the three terms comprising d²(i, k).

The last term is zero or near zero because of statistical independence.

The center term is made up simply of measurement noise. It cannot contribute usefully to the clustering, but in fact will have the effect of degrading the clustering.

The first term is the inter-row distance that our proposal uses for its clustering. It may be thought of as a way of filtering measurement noise out of the computation of inter-row distance.
Theory therefore implies that a well-fitting bilinear approximation to the matrix X will give a better picture of the biological sample differences through its row factors r_{i} than can be found directly by using the Euclidean distances between rows. A similar conclusion applies to using the correlation between pairs of row profiles.
There is yet another consideration favoring use of the ARF for clustering. If the matrix contains outliers, these outliers contaminate the Euclidean distances (and the correlation coefficients) between rows. However, it is a consequence of the robust fits used in the ARF that the outliers do not contaminate the r_{i} or c_{j} substantially.
GeneLogic Data
We will illustrate the bilinear fit by using the ARF, and the subsequent segmentation of the genes and biological specimens, with the following real data sets. The data are a subset of gene expression profiles of human tissues from the GeneExpress database, commercially available from GeneLogic (Gaithersburg, MD). Gene expression data are generated on oligonucleotide microarrays from Affymetrix (Santa Clara, CA). Two sample data sets are created for analysis, one containing both normal and malignant tissues (504 samples, covering eight tissue types: adipose tissue, breast, colon, kidney, liver, lung, ovary, and prostate; hereafter called set A) and the other containing only normal tissues (822 samples covering all of the above tissues plus white blood cells; hereafter called set B). A set of 224 genes is selected for both data sets based on their roles in metabolic and signaling pathways. The goal of the study is to determine whether SVD analysis can correctly cluster the genes and samples based on their biological function (drug-metabolizing genes predominantly expressed in liver samples, for example).
Fig. 1 shows the image of the unordered log-transformed data matrix of set A after removing the row and column means, where the rows correspond to the cell lines and the columns correspond to the genes. In this image, we cannot see any clear patterns. The log-transformed data, with row and column means subtracted, are available at www.samsi.info/200304/dmml/webinternal/bio/data/data_rsvd.xls.
Our goal is to cluster similar genes together and similar cell lines together simultaneously. Graphically, we hope to form blocks of reds and greens.
We used the ARF to fit a bilinear approximation to the log-transformed expression matrix after removing the row and column means. Using the resulting bilinear approximation to order the rows and the columns of X leads to the display of Fig. 2A. Note that visually the ordination has been highly successful in rearranging the matrix so as to give blocks of high and low values in the corners and in-between values in the remainder of the array.
Next, we applied the segmentation algorithm to the row factors r_{i} to segment the biological samples, and to the column factors c_{j} to segment the genes. Tables 1 and 2 show the results of fitting various numbers of clusters. In an exploratory statistical analysis such as this, we do not need a rigorous answer to the question of the number of genuine clusters, but guidance comes from the variance explained by breaking the factors into two groups, three groups, four groups, etc. In Tables 1 and 2, the major explained variability is attained once five clusters are formed, and so we will use this as our working solution. Numbering the rows (samples) in their r_{i} order, the optimal division into five segments breaks the samples into 1–18, 19–27, 28–63, 64–204, and 205–504. Similarly, based on Table 2, the optimal five-segment division of the genes is (using their c_{j} ordering) genes 1–6, 7–33, 34–73, 74–204, and 205–224.
Looking at the names of the ordered samples (rows) shows clear separation of liver and kidney samples from the other seven tissues; samples 1–18 are mostly normal liver samples, samples 19–27 are mostly malignant liver samples, and samples 28–63 are mostly kidney samples. Turning to the ordering of genes (columns), genes 1–6 (the first six genes in Fig. 2A) in the gene lists for set A are enriched for genes that are involved in steroid hormone metabolism (UGT1A, UGT2B, and HSD). This is biologically consistent, since the liver and kidney are the two organs predominantly involved in metabolism. Two of the genes in this group (UGT1A and UGT2B) have duplicate probes on the microarrays, and the SVD algorithm correctly clusters them together. This finding is to be expected, since the duplicate probes, coding for the same gene, should display very similar expression profiles.
This data set happened to contain no missing values, so it was possible to analyze it with the conventional squared-norm singular value decomposition and to carry out a clustering paralleling that obtained with the robust SVD (rSVD). The results, however, were far inferior. The conventional SVD did a bad job of identifying the genes relating to androgen/estrogen metabolism.
We tested the validity of our rSVD results by comparing the predicted liver and kidney enrichment of genes with independently generated data on the same genes from an unrelated, public-domain gene expression database (Gene Express Atlas, maintained by the Genomics Institute of the Novartis Research Foundation, ref. 7). This database contains gene expression profiles from 91 human and mouse samples and is generated on an earlier version of Affymetrix microarrays (compared with that used in the GeneLogic database). Table 3 summarizes the results obtained for the queried genes. In all cases, the gene expression is significantly higher in liver and kidney samples compared with the median gene expression across all tissues. Thus the patterns of gene-sample clusters identified by rSVD are substantiated in an independent data set (see Figs. 4–6, which are published as supporting information on the PNAS web site).
Additionally, evidence from the literature helps explain some of the reasons the specific genes are found to be enriched in the liver samples. For example, mutations in the UGT1A gene are associated with a variety of liver-specific diseases such as Crigler-Najjar syndrome, Gilbert syndrome, familial transient neonatal hyperbilirubinemia, and cholelithiasis (8–10). Another gene found to be upregulated in this cluster is hydroxysteroid dehydrogenase (HSD11). The gene product of HSD11 catalyzes the conversion of cortisol to cortisone and is an important regulator of glucocorticoid metabolism. The gene for the hydroxysteroid (11 beta) B1 isoform was first isolated from rat liver (11). Intense immunoreactivity to HSD11B1 has been observed around the hepatic central vein, confirming that the protein expression is also high in liver (12). Studies with knockout mice demonstrate that HSD11B1 deficiency produces an improved metabolic profile characterized by increased lipid catabolism, increased hepatic insulin sensitivity, and reduced intracellular glucocorticoid concentrations (13). Another upregulated gene, aldolase B, is a tetrameric glycolytic enzyme that catalyzes the reversible conversion of fructose 1,6-bisphosphate to glyceraldehyde 3-phosphate and dihydroxyacetone phosphate. Aldolases A, B, and C are distinct proteins exhibiting developmentally regulated expression of the different isozymes. Aldolase B expression is observed only in adult human liver, kidney, and intestine. Significantly, our segmentation algorithm detected the right isoform of aldolase (aldolase B) for enriched liver expression. Deficiency in ALDOB function is related to hereditary fructose intolerance. The preferential expression of ALDOB in liver has been used in 31P magnetic resonance imaging studies to follow the metabolism of fructose in the liver of patients with this disorder (14).
Similarly, we then examine results from set B. Set B shows a clear separation of white blood cells from other tissue types, as shown in the ordered samples based on the first rSVD component (see Fig. 7, which is published as supporting information on the PNAS web site). The corresponding gene segmentation shows a subset of genes (genes 211–224; see Tables 5 and 6, which are published as supporting information on the PNAS web site) that show preferential expression in white blood cells. This list is enriched for genes that are involved in apoptosis (programmed cell death).
We then investigate the tissue expression of the apoptosis-specific genes in the Gene Express Atlas database. As shown in Table 4, the majority of the genes identified by the rSVD algorithm indeed show very high levels of expression in white blood cells, again providing excellent validation of our results.
The enrichment of proapoptotic genes in white blood cells is consistent with the biology of white blood cells. Apoptosis plays an essential role in immune system homeostasis. The vertebrate immune system uses apoptosis to control cell number, delete lymphocytes with inoperative or autoreactive receptors from its repertoire, and reverse clonal expansion at the end of an immune response. The programmed removal of lymphocytes in response to cellular stress or injury or genetic errors serves to preserve genomic integrity and constitutes an important mechanism of tumor surveillance. Given the crucial role of apoptosis in such a diverse array of physiologic functions, aberrations of this process underlie a host of immune disorders. Genetic aberrations that render cells incapable of executing their suicide program promote tumorigenesis and underlie the observed resistance of lymphoid cancers to genotoxic anticancer agents (15). In addition to immune homeostasis, T and B cell lymphocytes and neutrophils undergo programmed cell death under a wide variety of physiological and drug-induced conditions such as immunosuppression by polycyclic aromatic hydrocarbons, perturbations of redox states, tissue repair, treatment of Crohn's disease, immune evasion in renal cell carcinoma, and tumor necrosis factor α-induced effects in aging, to name a few (16–21).
Finding Additional Structure
The bilinear fit produced by the ARF does not necessarily exhaust all structure in the matrix X. As with the conventional, nonrobust least-squares SVD, we can remove the first bilinear fit from X to get the initial residual matrix (x_{ij} − r_{i}c_{j}) and apply the ARF to this matrix to get a second pair of matching row and column factors, which may be segmented, just as were the leading pair.
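Schematically, the deflation step looks as follows (our own sketch; for brevity it deflates a complete matrix and fits the next pair by plain SVD, where the robust analysis would reapply the ARF to the residuals):

```python
import numpy as np

def next_component(X, r, c):
    """Remove the fitted bilinear term r_i * c_j and fit the next
    row/column factor pair to the residual matrix (plain SVD here for
    brevity; the robust analysis reapplies the ARF instead)."""
    R = X - np.outer(r, c)          # residual matrix x_ij - r_i c_j
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return s[0] * U[:, 0], Vt[0, :]
```

The second pair of factors can then be segmented exactly as the leading pair was.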
Doing so does indeed uncover additional biologically meaningful structure, as shown in Fig. 2B for set A. The segmentation algorithm suggests six segments for the cell lines: 1–76, 77–195, 196–333, 334–391, 392–462, and 463–504. The cell lines of the first segment are mostly prostate samples, and the last two segments are mostly colon samples. Corresponding to this cell line segmentation, we divide the genes into five segments, genes 1–11, 12–65, 66–176, 177–219, and 220–224 (see Tables 7 and 8, which are published as supporting information on the PNAS web site).
As we can see, the two components for set A give two substantially different orderings of genes and samples and represent different aspects of the gene expression data. This richness of interpretation cannot be achieved by one single ordering of genes and samples.
We could continue this process. Subtracting this second bilinear term and repeating the ARF on the residuals gives a third component, and this could be repeated for a fourth and so on. The number of components that we should study depends on the number of significant eigenvalues. For set A, the scree plot of the eigenvalues suggests that two components (see Fig. 3A) capture all of the structure of the array, so we stop at two components.
Finding Outliers, Filling In Estimates of Missing Values, and Smoothing
A strength of our method is that it does not require complete information and is not affected by a minority of outliers. Outliers can be identified automatically by looking at the final residuals after removal of the structural components. A simple outlier model might be that most of the residuals follow a normal distribution, but that some small number are "wild." A probability plot shows that, rather than this simple two-category model, the residuals follow a heavy-tailed distribution (Fig. 3B). If we so wish, we can flag those readings that seem particularly anomalous. For example, in normal data about 1.5 times the median absolute deviation (MAD) of the residuals gives a robust estimate of the standard deviation. Thus residuals more than six times the MAD (a cutoff equivalent to about four standard deviations) should be extremely rare. In the actual data, however, some 2% of residuals are beyond six times the MAD. This is a red-flag warning against the use of nonrobust methods (22).
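The flagging rule just described is easy to state in code (our sketch; the six-MAD multiplier is the cutoff used in the text, and the function name is ours):

```python
import numpy as np

def flag_outliers(residuals, mad_mult=6.0):
    """Flag residuals more than mad_mult MADs from the median; with
    sigma being roughly 1.5 * MAD for normal data, six MADs is about a
    four-standard-deviation cutoff."""
    res = np.asarray(residuals, dtype=float)
    dev = np.abs(res - np.median(res))
    mad = np.median(dev)
    return dev > mad_mult * mad
```

Because the MAD itself is resistant to the outliers being hunted, the cutoff is not inflated by the very cells it is meant to flag.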
Enriching notation slightly, write r_{im} and c_{jm} for the row and column factors given in the mth bilinear pair fitted. Then for any missing cell ij, we could predict the missing value by Σ_{m}r_{im}c_{jm}. Another possible use of the ARF fits is to replace the entire matrix by the rank-m approximation given by using this missing-value fill-in for all cells for purposes of other statistical analyses or displays. The potential attraction of this approach is that it would largely remove the impact of outlier cells as well as avoiding gaps in the matrix.
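A sketch of this rank-m fill-in (our code; the factor lists passed in are illustrative placeholders for the fitted pairs):

```python
import numpy as np

def rank_m_fill(row_factors, col_factors):
    """Sum of the fitted bilinear terms: predicts every cell ij by
    sum over m of r_im * c_jm, usable as a fill-in for missing cells.
    row_factors and col_factors are matching lists of the fitted
    length-n and length-p factor vectors."""
    return sum(np.outer(r, c) for r, c in zip(row_factors, col_factors))

# To fill only the missing cells of X while keeping observed values:
# X_filled = np.where(np.isnan(X), rank_m_fill(rs, cs), X)
```

Replacing the whole matrix by this approximation, rather than only the missing cells, additionally smooths away the outlier cells for downstream displays.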
Discussion
We have proposed an analysis based on a variant of the SVD that is largely impervious to outliers and missing information. This can be used for ordination and display of the microarray and also for segmentation.
The microarray example illustrates the usefulness of this method: the cell lines of the same origin are grouped together, and several of the genes identified are confirmed by the previous literature. The outlier detection points out some possible outliers, which may be experimental mistakes or specific gene actions that deserve further study.
This microarray example was chosen to verify that the techniques worked on a reasonably wellunderstood data set. In addition to this example, these methods have been used successfully on other publicdomain and proprietary data sets.
Acknowledgments
We thank Alan Karr and Jerry Sacks for helpful discussions.
Footnotes

↵† To whom correspondence should be sent at present address: Aventis Pharmaceuticals, Bridgewater, NJ 08807. Email: li.liu{at}aventis.com.

This paper was submitted directly (Track II) to the PNAS office.

Abbreviations: SVD, singular value decomposition; rSVD, robust SVD; ARF, alternating robust fitting.
 Received May 22, 2003.
 Accepted July 2, 2003.
 Copyright © 2003, The National Academy of Sciences
References

Healy, M. J. R. (1986) Matrices for Statisticians (Clarendon, Oxford), pp. 64–66.

Croux, C., Filzmoser, P., Pison, G. & Rousseeuw, P. J. (2002) Stat. Comput. 13, 23–36.

Hawkins, D. M. (2000) Comput. Stat. Data Anal. 37, 323–341.

Su, A. I., Cooke, M. P., Ching, K. A., Hakak, Y., Walker, J. R., Wiltshire, T., Orth, A. P., Vega, R. G., Sapinoso, L. M., Moqrich, A., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 4465–4470. pmid:11904358

Agarwal, A. K., Rogerson, F. M., Mune, T. & White, P. C. (1989) J. Biol. Chem. 264, 18939–18943. pmid:2808402

Morton, N. M., Holmes, M. C., Fievet, C., Staels, B., Tailleux, A., Mullins, J. J. & Seckl, J. R. (2001) J. Biol. Chem. 276, 41293–41300. pmid:11546766

Oberhaensli, R. D., Rajagopalan, B., Taylor, D. J., Radda, G. K., Collins, J. E., Leonard, J. V., Schwarz, H. & Herschkowitz, N. (1987) Lancet II, 931–934.

Novosad, J., Fiala, Z., Borska, L. & Krejsek, J. (2002) Acta Med. 45, 123–128.

Huber, P. J. (1981) Robust Statistics (Wiley, New York).