# Singular value decomposition for genome-wide expression data processing and modeling

## Abstract

We describe the use of singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized “eigengenes” × “eigenarrays” space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.

### Sign up for PNAS alerts.

Get alerts for new articles, or get an alert when an article is cited.

DNA microarray technology (1, 2) and genome sequencing have advanced to the point that it is now possible to monitor gene expression levels on a genomic scale (3). These new data promise to enhance fundamental understanding of life on the molecular level, from regulation of gene expression and gene function to cellular mechanisms, and may prove useful in medical diagnosis, treatment, and drug design. Analysis of these new data requires mathematical tools that are adaptable to the large quantities of data, while reducing the complexity of the data to make them comprehensible. Analysis so far has been limited to identification of genes and arrays with similar expression patterns by using clustering methods (4–9).

We describe the use of singular value decomposition (SVD) (10) in analyzing genome-wide expression data. SVD is also known as Karhunen–Loève expansion in pattern recognition (11) and as principal-component analysis in statistics (12). SVD is a linear transformation of the expression data from the genes × arrays space to the reduced “eigengenes” × “eigenarrays” space. In this space the data are diagonalized, such that each eigengene is expressed only in the corresponding eigenarray, with the corresponding “eigenexpression” level indicating their relative significance. The eigengenes and eigenarrays are unique, and therefore also data-driven, orthonormal superpositions of the genes and arrays, respectively.

We show that several significant eigengenes and the corresponding eigenarrays capture most of the expression information. Normalizing the data by filtering out the eigengenes (and the corresponding eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Such normalization may improve any further analysis of the expression data. Sorting the data according to the correlations of the genes (and arrays) with eigengenes (and eigenarrays) gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. These groups of genes (or arrays) are not defined by overall similarity in expression, but only by similarity in the expression of any chosen subset of eigengenes (or eigenarrays). Upon comparing two or more similar experiments, with a regulator being overactive or underactive in one but normally expressed in the others, the expression pattern of one of the significant eigengenes may be correlated with the expression patterns of this regulator and its targets. This eigengene, therefore, can be associated with the observed genome-wide effect of the regulator. The expression pattern of the corresponding eigenarray is correlated with the expression patterns observed in samples in which the regulator is overactive or underactive. This eigenarray, therefore, can be associated with these samples.

We conclude that SVD provides a useful mathematical framework for processing and modeling genome-wide expression data, in which both the mathematical variables and operations may be assigned biological meaning.

## Mathematical Framework: Singular Value Decomposition

The relative expression levels of

*N*genes of a model organism, which may constitute almost the entire genome of this organism, in a single sample, are probed simultaneously by a single microarray. A series of*M*arrays, which are almost identical physically, probe the genome-wide expression levels in*M*different samples—i.e., under*M*different experimental conditions. Let the matrix ê, of size*N*-genes ×*M*-arrays, tabulate the full expression data. Each element of ê satisfies 〈*n*|ê|*m*〉 ≡*e*_{nm}for all 1 ≤*n*≤*N*and 1 ≤*m*≤*M*, where*e*_{nm}is the relative expression level of the*n*th gene in the*m*th sample as measured by the*m*th array.§ The vector in the*n*th row of the matrix ê, 〈*g*_{n}| ≡ 〈*n*|ê, lists the relative expression of the*n*th gene across the different samples which correspond to the different arrays. The vector in the*m*th column of the matrix ê, |*a*_{m}〉 ≡ ê|*m*〉, lists the genome-wide relative expression measured by the*m*th array.SVD (10) is then linear transformation of the expression data from the In this space the data are represented by the diagonal nonnegative matrix ê, of size indicates the relative significance of the measures the complexity of the data from the distribution of the overall expression between the different eigengenes (and eigenarrays), where

*N*-genes ×*M*-arrays space to the reduced*L*-“eigenarrays” ×*L*-“eigengenes” space, where*L*= min{*M*,*N*} (see Fig. 7 in supplemental material at www.pnas.org),\[ \begin{equation*}\hat {e}=\hat {u}\hat {{\varepsilon}}\hat {v}^{T}.\end{equation*}\]

[1]

*L*-eigengenes ×*L*-eigenarrays, which satisfies 〈*k*|ɛ̂|*l*〉 ≡ ɛ_{l}δ_{kl}≥ 0 for all 1 ≤*k*,*l*≤*L*, such that the*l*th eigengene is expressed only in the corresponding*l*th eigenarray, with the corresponding “eigenexpression” level ɛ_{l}. Therefore, the expression of each eigengene (or eigenarray) is decoupled from that of all other eigengenes (or eigenarrays). The “fraction of eigenexpression,”\[ \begin{equation*}p_{l}={\varepsilon}^{2}_{l}/{ \,\substack{ ^{L} \\ {\sum} \\ _{k=1} }\, }\hspace{.167em}{\varepsilon}^{2}_{k},\end{equation*}\]

[2]

*l*th eigengene and eigenarray in terms of the fraction of the overall expression that they capture. Assume also that the eigenexpression levels are arranged in decreasing order of significance, such that ɛ_{1}≥ ɛ_{2}≥ … ≥ ɛ_{L}≥ 0. “Shannon entropy” of a dataset,\[ \begin{equation*}0{\leq}d=\frac{{\mathrm{-}}1}{{\mathrm{log}}(L)}\hspace{.167em}{ \,\substack{ ^{L} \\ {\sum} \\ _{k=1} }\, }p_{k}{\mathrm{log}}(p_{k}){\leq}1,\end{equation*}\]

[3]

*d*= 0 corresponds to an ordered and redundant dataset in which all expression is captured by a single eigengene (and eigenarray), and*d*= 1 corresponds to a disordered and random dataset where all eigengenes (and eigenarrays) are equally expressed.The transformation matrices û and v̂where Î is the identity matrix. Therefore, the expression of each eigengene (or eigenarray) is not only decoupled but also decorrelated from that of all other eigengenes (or eigenarrays). The eigengenes and eigenarrays are unique, except in degenerate subspaces, defined by subsets of equal eigenexpression levels, and except for a phase factor of ±1, such that each eigengene (or eigenarray) captures both parallel and antiparallel gene (or array) expression patterns. Therefore, SVD is data-driven, except in degenerate subspaces.

^{T}define the*N*-genes ×*L*-eigenarrays and the*L*-eigengenes ×*M*-arrays basis sets, respectively. The vector in the*l*th row of the matrix v̂^{T}, 〈γ_{l}| ≡ 〈*l*|v̂^{T}, lists the expression of the*l*th eigengene across the different arrays. The vector in the*l*th column of the matrix û, |α_{l}〉 ≡ û|*l*〉, lists the genome-wide expression in the*l*th eigenarray. The eigengenes and eigenarrays are orthonormal superpositions of the genes and arrays, such that the transformation matrices û and v̂ are both orthogonal\[ \begin{equation*}\hat {u}^{T}\hat {u}=\hat {v}^{T}\hat {v}=\hat {I},\end{equation*}\]

[4]

### SVD Calculation.

According to Eqs. 1 and 4, the

*M*-arrays ×*M*-arrays symmetric correlation matrix â = ê^{T}ê = v̂ɛ̂^{2}v̂^{T}is represented in the*L*-eigengenes ×*L*-eigengenes space by the diagonal matrix ɛ̂^{2}. The*N*-genes ×*N*-genes correlation matrix ĝ = êê^{T}= ûɛ̂^{2}û^{T}is represented in the*L*-eigenarrays ×*L*-eigenarrays space also by ɛ̂^{2}, where for*L*= min{*M*,*N*} =*M*, ĝ has a null subspace of at least*N*−*M*null eigenvalues. We, therefore, calculate the SVD of a dataset ê, with*M*≪*N*, by diagonalizing â, and then projecting the resulting v̂ and ɛ̂ onto ê to obtain û = êv̂ɛ̂^{−1}.### Pattern Inference.

The decorrelation of the eigengenes (and eigenarrays) suggests the possibility that some of the eigengenes (and the corresponding eigenarrays) represent independent regulatory programs or processes (and corresponding cellular states). We infer that an eigengene |γ

_{l}〉 represents a regulatory program or process from its expression pattern across all arrays, when this pattern is biologically interpretable. This inference may be supported by a corresponding coherent biological theme reflected in the functions of the genes, whose expression patterns correlate or anticorrelate with the pattern of this eigengene. With this we assume that the corresponding eigenarray |α_{l}〉 (which lists the amplitude of this eigengene pattern in the expression of each gene |*g*_{n}〉 relative to all other genes 〈*n*|α_{l}〉 = 〈*g*_{n}|γ_{l}〉/ɛ_{l}) represents the cellular state which corresponds to this process. We infer that the eigenarray |α_{l}〉 represents a cellular state from the arrays whose expression patterns correlate or anticorrelate with the pattern of this eigenarray. Upon sorting of the genes, this inference may be supported by the expression pattern of this eigenarray across all genes, when this pattern is biologically interpretable.### Data Normalization.

The decoupling of the eigengenes and eigenarrays allows filtering the data without eliminating genes or arrays from the dataset. We filter any of the eigengenes |γ

_{l}〉 (and the corresponding eigenarray |α_{l}〉) ê → ê − ɛ_{l}|α_{l}〉 〈γ_{l}|, by substituting zero for the eigenexpression level ɛ_{l}= 0 in the diagonal matrix ɛ̂ and reconstructing the data according to Eq. 1. We normalize the data by filtering out those eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts.### Degenerate Subspace Rotation.

The uniqueness of the eigengenes and eigenarrays does not hold in a degenerate subspace, defined by equal eigenexpression levels. We approximate significant similar eigenexpression levels ɛ

_{l}≈ ɛ_{l+1}≈ … ≈ ɛ_{m}with ɛ_{l}= … = ɛ_{m}= \( \begin{equation*}\sqrt{{{\Sigma}}_{{k=1}}^{{m}}{\hspace{.167em}{\varepsilon}}_{{k}}^{{2}}{/(m\hspace{.167em}-\hspace{.167em}l\hspace{.167em}+\hspace{.167em}1)}}\end{equation*}\). Therefore, Eqs.**1–4**remain valid upon rotation of the corresponding eigengenes {(|γ_{l}〉, … , |γ_{m}〉) → R̂(|γ_{l}〉, … , |γ_{m}〉)}, and eigenarrays {(|α_{l}〉, … , |α_{m}〉) → R̂(|α_{l}〉, … , |α_{m}〉)}, for all orthogonal R̂, R̂^{T}R̂ = Î. We choose a unique rotation R̂ by subjecting the rotated eigengenes to*m*−*l*constraints, such that these constrained eigengenes may be advantageous in interpreting and presenting the expression data.### Data Sorting.

Inferring that eigengenes (and eigenarrays) represent independent processes (and cellular states) allows sorting the data by similarity in the expression of any chosen subset of these eigengenes (and eigenarrays), rather than by overall similarity in expression. Given two eigengenes |γ

_{k}〉 and |γ_{l}〉 (or eigenarrays |α_{k}〉 and |α_{l}〉), we plot the correlation of |γ_{k}〉 with each gene |*g*_{n}〉, 〈γ_{k}|*g*_{n}〉/〈*g*_{n}|*g*_{n}〉 (or |α_{k}〉 with each array |*a*_{m}〉) along the*y*-axis, vs. that of |γ_{l}〉 (or |α_{l}〉) along the*x*-axis. In this plot, the distance of each gene (or array) from the origin is its amplitude of expression in the subspace spanned by |γ_{k}〉 and |γ_{l}〉 (or |α_{k}〉 and |α_{l}〉), relative to its overall expression*r*_{n}≡ 〈*g*_{n}|*g*_{n}〉^{−1}\( \begin{equation*}\sqrt{{{\vert}{\langle}{\gamma}_{k}{\vert}g_{n}{\rangle}{\vert}^{2}\hspace{.167em}+\hspace{.167em}{\vert}{\langle}{\gamma}_{l}{\vert}g_{n}{\rangle}{\vert}^{2}}}\end{equation*}\) (or*r*_{m}≡ 〈*a*_{m}|*a*_{m}〉^{−1}\( \begin{equation*}\sqrt{{{\vert}{\langle}{\alpha}_{k}{\vert}a_{m}{\rangle}{\vert}^{2}\;+\;{\vert}{\langle}{\alpha}_{l}{\vert}a_{m}{\rangle}{\vert}^{2}}}\end{equation*}\)). The angular distance of each gene (or array) from the*x*-axis is its phase in the transition from the expression pattern |γ_{l}〉 to |γ_{k}〉 and back to |γ_{l}〉 (or |α_{l}〉 to |α_{k}〉 and back to |α_{l}〉) tan φ_{n}≡ 〈γ_{k}|*g*_{n}〉/〈γ_{l}|*g*_{n}〉, (or tan φ_{m}≡ 〈α_{k}|*a*_{n}〉/〈α_{l}|*a*_{m}〉). We sort the genes (or arrays) according to φ_{n}(or φ_{m}).## Biological Data Analysis: Elutriation-Synchronized Cell Cycle

Spellman

*et al.*(3) monitored genome-wide mRNA levels, for 6,108 ORFs of the budding yeast*Saccharomyces cerevisiae*simultaneously, over approximately one cell cycle period,*T*≈ 390 min, in a yeast culture synchronized by elutriation, relative to a reference mRNA from an asynchronous yeast culture, at 30-min intervals. The elutriation dataset we analyze (see supplemental data and Mathematica notebook at www.pnas.org and at http://genome-www.stanford.edu/SVD/) tabulates the measured ratios of gene expression levels for the*N*= 5,981 genes, 784 of which were classified by Spellman*et al.*as cell cycle regulated, with no missing data in the*M*= 14 arrays.### Pattern Inference.

Consider the 14 eigengenes of the elutriation dataset. The first and most significant eigengene |γ

_{1}〉, which describes time invariant relative expression during the cell cycle (Fig. 8*a*at www.pnas.org), captures more than 90% of the overall relative expression in this experiment (Fig. 8*b*). The entropy of the dataset, therefore, is low*d*= 0.14 ≪ 1. This suggests that the underlying processes are manifested by weak perturbations of a steady state of expression. This also suggests that time-invariant additive constants due to uncontrolled experimental variables may be superimposed on the data. We infer that |γ_{1}〉 represents experimental additive constants superimposed on a steady gene expression state, and assume that |α_{1}〉 represents the corresponding steady cellular state. The second, third, and fourth eigengenes, which show oscillations during the cell cycle (Fig. 8*c*), capture about 3%, 1%, and 0.5% of the overall relative expression, respectively. The time variation of*|γ*fits a normalized sine function of period_{3}〉*T,*\( \begin{equation*}\sqrt{{\mathit{2/T}}}\end{equation*}\) sin(2π*t*/*T*). We infer that |γ_{3}〉 represents expression oscillation, which is consistent with gene expression oscillations during a cell cycle. The time variations of the second and fourth eigengenes fit a cosine function of period*T*with \( \begin{equation*}\sqrt{1/2}\end{equation*}\) the amplitude of a normalized cosine with this period, \( \begin{equation*}\sqrt{1/{\mathit{T}}}\end{equation*}\) cos 2π*t*/*T*. However, while |γ_{2}〉 shows decreasing expression on transition from*t*= 0 to 30 min, |γ_{4}〉 shows increasing expression. We infer that |γ_{2}〉 and |γ_{4}〉 represent initial transient increase and decrease in expression in response to the elutriation, respectively, superimposed on expression oscillation during the cell cycle.### Data Normalization.

We filter out the first eigengene and eigenarray of the elutriation dataset, ê → ê

_{C}= ê − ɛ_{1}|α_{1}〉 〈γ_{1}|, removing the steady state of expression. Each of the elements of the dataset ê_{C}, 〈*n*|ê_{C}|*m*〉 ≡*e*_{C,nm}, is the difference of the measured expression of the*n*th gene in the*m*th array from the steady-state levels of expression for these gene and array as calculated by SVD. Therefore,*e*_{C,nm}^{2}is the variance in the measured expression of the*n*th gene in the*m*th array. Let ê_{LV}tabulate the natural logarithm of the variances in elutriation expression, such that each element of ê_{LV}satisfies 〈*n*|ê_{LV}|*m*〉 ≡ log(*e*_{C,nm}^{2}) for all 1 ≤*n*≤*N*and 1 ≤*m*≤*M*, and consider the eigengenes of ê_{LV}(Fig. 9*a*in supplemental material at www.pnas.org). The first eigengene |γ_{1}〉_{LV}, which captures more than 80% of the overall information in this dataset (Fig. 9*b*), describes a weak initial transient increase superimposed on a time-invariant scale of expression variance. The initial transient increase in the scale of expression variance may be a response to the elutriation. The time-invariant scale of expression variance suggests that a steady scale of experimental as well as biological uncertainty is associated with the expression data. This also suggests that time-invariant multiplicative constants due to uncontrolled experimental variables may be superimposed on the data. We filter out |γ_{1}〉_{LV}, removing the steady scale of expression variance, ê_{LV}→ ê_{CLV}= ê_{LV}− ɛ_{1,LV}|α_{1}〉_{LV LV}〈γ_{1}|.The normalized elutriation dataset ê

_{N}, where each of its elements satisfies 〈*n*|ê_{N}|*m*〉 ≡ sign(*e*_{C,nm})\( \begin{equation*}\sqrt{exp{(e_{CLV,nm})}}\end{equation*}\), tabulates for each gene and array expression patterns that are approximately centered at the steady-state expression level (i.e., of approximately zero arithmetic means), with variances which are approximately normalized by the steady scale of expression variance (i.e., of approximately unit geometric means). The first and second eigengenes, |γ_{1}〉_{N}and |γ_{2}〉_{N}, of ê_{N}(Fig. 1*a*), which are of similar significance, capture together more than 40% of the overall normalized expression (Fig. 1*b*). The time variations of |γ_{1}〉_{N}and |γ_{2}〉_{N}fit normalized sine and cosine functions of period*T*and initial phase θ ≈ 2π/13, \( \begin{equation*}\sqrt{{2/T}}\end{equation*}\) sin(2π*t*/*T*− θ) and \( \begin{equation*}\sqrt{2/{\mathit{T}}}\end{equation*}\) cos(2π*t*/*T*− θ), respectively (Fig. 1*c*). We infer that |γ_{1}〉_{N}and |γ_{2}〉_{N}represent cell cycle expression oscillations, and assume that the corresponding eigenarrays |α_{1}〉_{N}and |α_{2}〉_{N}represent the corresponding cell cycle cellular states. Upon sorting of the genes (and arrays) according to |γ_{1}〉_{N}and |γ_{2}〉_{N}(and |α_{1}〉_{N}and |α_{2}〉_{N}), the initial phase θ ≈ 2π/13 can be interpreted as a delay of 30 min between the start of the experiment and that of the cell cycle stage G_{1}. The decay to zero in the time variation of |γ_{2}〉_{N}at*t*= 360 and 390 min can be interpreted as dephasing in time of the initially synchronized yeast culture.Figure 1

### Data Sorting.

Consider the normalized expression of the 14 elutriation arrays {|

*a*_{m}〉} in the subspace spanned by |α_{1}〉_{N}and |α_{2}〉_{N}, which is assumed to approximately represent all cell cycle cellular states (Fig. 2*a*). All arrays have at least 25% of their normalized expression in this subspace, with their distances from the origin satisfying 0.5 ≤*r*_{m}< 1, except for the eleventh array |*a*_{11}〉. This suggests that |α_{1}〉_{N}and |α_{2}〉_{N}are sufficient to approximate the elutriation array expression. The sorting of the arrays according to their phases {φ_{m}}, which describes the transition from the expression pattern |α_{2}〉_{N}to |α_{1}〉_{N}and back to |α_{2}〉_{N}, gives an array order which is similar to that of the cell cycle time points measured by the arrays, an order that describes the progress of the cell cycle expression from the M/G_{1}stage through G_{1}, S, S/G_{2}, and G_{2}/M and back to M/G_{1}.Figure 2

Because |α

_{1}〉_{N}is correlated with the arrays |*a*_{4}〉, |*a*_{5}〉, |*a*_{6}〉, and |*a*_{7}〉 and is anticorrelated with |*a*_{13}〉 and |*a*_{14}〉, we associate |α_{1}〉_{N}with the cell cycle cellular state of transition from G_{1}to S, and −|α_{1}〉_{N}with the transition from G_{2}/M to M/G_{1}. Similarly, |α_{2}〉_{N}is correlated with |*a*_{2}〉 and |*a*_{3}〉, and therefore we associate |α_{2}〉_{N}with the transition from M/G_{1}to G_{1}. Also, |α_{2}〉_{N}is anticorrelated with |*a*_{8}〉 and |*a*_{10}〉, and therefore we associate −|α_{2}〉_{N}with the transition from S to S/G_{2}. With these associations the phase of |*a*_{1}〉, φ_{1}= −θ ≈ −2π/13, corresponds to the 30-min delay between the start of the experiment and that of the cell cycle stage G_{1}, which is also present in the inferred cell cycle expression oscillations |γ_{1}〉_{N}and |γ_{2}〉_{N}.Consider also the expression of the 5,981 genes {|

*g*_{n}〉} in the subspace spanned by |γ_{1}〉_{N}and |γ_{2}〉_{N}, which is inferred to approximately represent all cell cycle expression oscillations (Fig. 10 in supplemental material at www.pnas.org). One may expect that genes that have almost all of their normalized expression in this subspace with*r*_{n}≈ 1 are cell cycle regulated, and that genes that have almost no expression in this subspace with*r*_{n}≈ 0, are not regulated by the cell cycle at all. Indeed, of the 784 genes that were classified by Spellman*et al.*(3) as cell cycle regulated, 641 have more than 25% of their normalized expression in this subspace (Fig. 2*b*). We sort all 5,981 genes according to their phases {φ_{n}}, to describe the transition from the expression pattern |γ_{2}〉_{N}to that of |γ_{1}〉_{N}and back to |γ_{2}〉_{N}, starting at φ_{1}≈ −2π/13. One may expect this to order the genes according to the stages in the cell cycle in which their expression patterns peak. However, for the 784 cell cycle regulated genes this sorting gives a classification of the genes into the five cell cycle stages, which is somewhat different than the classification by Spellman*et al.*This may be due to the poor quality of the elutriation expression data, as synchronization by elutriation was not very effective in this experiment. For the α factor-synchronized cell cycle expression there is much better agreement between the two classifications (Fig. 5*b*).With all 5,981 genes sorted, the gene variations of |α

_{1}〉_{N}and |α_{2}〉_{N}fit normalized sine and cosine functions of period*Z*≡*N*− 1 = 5,980 and initial phase θ ≈ 2π/13, −\( \begin{equation*}\sqrt{{2/Z}}\end{equation*}\) sin(2π*z*/*Z*− θ) and \( \begin{equation*}\sqrt{2/{\mathit{Z}}}\end{equation*}\) cos(2π*z*/*Z*− θ), respectively, where*z*≡*n*− 1 (Fig. 3*b*and*c*). The sorted and normalized elutriation expression fit approximately a traveling wave of expression, varying sinusoidally across both genes and arrays, such that the expression of the*n*th gene in the*m*th array satisfies 〈*n*|ê_{N}|*m*〉 ∝ −2 cos[2π(*t*/*T*−*z*/*Z*)]/\( \begin{equation*}\sqrt{{ZT}}\end{equation*}\) (Fig. 3*a*).Figure 3

## Biological Data Analysis: α Factor-Synchronized Cell Cycle and *CLB2* and *CLN3* Overactivations

Spellman

*et al.*(3) also monitored genome-wide mRNA levels, for 6,108 yeast ORFs simultaneously, over approximately two cell cycle periods, in a yeast culture synchronized by α factor, relative to a reference mRNA from an asynchronous yeast culture, at 7-min intervals for 119 min. They also measured, in two independent experiments, mRNA levels of yeast strain cultures with overactivated*CLB2*, which encodes a G_{2}/M cyclin, both at*t*= 40 min relative to their levels at the start of overactivation at*t*= 0. Two additional independent experiments measured mRNA levels of strain cultures with overactivated*CLN3*, which encodes a G_{1}/S cyclin, at*t*= 30 and 40 min relative to their levels at the start of overactivation at*t*= 0. The dataset for the α factor,*CLB2*, and*CLN3*experiments we analyze (see supplemental data and Mathematica notebook at www.pnas.org) tabulates the ratios of gene expression levels for the*N*= 4,579 genes, 638 of which were classified by Spellman*et al.*as cell cycle regulated, with no missing data in the*M*= 22 arrays.After data normalization and degenerate subspace rotation (see

*Appendix*in supplemental material at www.pnas.org), the time variations of |γ_{1}〉_{RN}and |γ_{2}〉_{RN}fit normalized sine and cosine functions of two 66-min periods during the cell cycle, from*t*= 7 to 119 min, and initial phase θ ≈ π/4, respectively (Fig. 4*c*). While |γ_{2}〉_{RN}describes steady-state expression in the*CLB2*- and*CLN3*-overactive arrays, |γ_{1}〉_{RN}describes underexpression in the*CLB2*-overactive arrays and overexpression in the*CLN3*-overactive arrays.Figure 4

Upon sorting the 4,579 genes in the subspace spanned by |γ

_{1}〉_{RN}and |γ_{2}〉_{RN}(Fig. 5*b*), |γ_{1}〉_{RN}is correlated with genes that peak late in the cell cycle stage G_{1}and early in S, among them*CLN3*, and we associate |γ_{1}〉_{RN}with the cell cycle expression oscillations that start at the transition from G_{1}to S and are dependent on*CLN3*, which encodes a G_{1}/S cyclin. Also, |γ_{1}〉_{RN}is anticorrelated with genes that peak late in G_{2}/M and early in M/G_{1}, among them*CLB2*, and therefore we associate −|γ_{1}〉_{RN}with the oscillations that start at the transition from G_{2}/M to M/G_{1}and are dependent on*CLB2*, which encodes a G_{2}/M cyclin. Similarly, |γ_{2}〉_{RN}is correlated with genes that peak late in M/G_{1}and early in G_{1}, anticorrelated with genes that peak late in S and early in S/G_{2}, and uncorrelated with*CLB2*and*CLN3.*We, therefore, associate |γ_{2}〉_{RN}with the oscillations that start at the transition from M/G_{1}to G_{1}(and appear to be*CLB2*- and*CLN3*-independent), and −|γ_{2}〉_{RN}with the oscillations that start at the transition from S to S/G_{2}(and appear to be*CLB2*- and*CLN3*-independent).Figure 5

Upon sorting the 22 arrays in the subspace spanned by |α

_{1}〉_{RN}and |α_{2}〉_{RN}(Fig. 5*a*), |α_{1}〉_{RN}is correlated with the arrays |*a*_{13}〉 and |*a*_{14}〉, as well as with |*a*_{21}〉 and |*a*_{22}〉, which measure the*CLN3*-overactive samples. We therefore associate |α_{1}〉_{RN}with the cell cycle cellular state of transition from G_{1}to S, which is simulated by*CLN3*overactivation. Also, |α_{1}〉_{RN}is anticorrelated with the arrays |*a*_{9}〉 and |*a*_{10}〉, as well as with |*a*_{19}〉 and |*a*_{20}〉, which measure the*CLB2*-overactive samples. We associate −|α_{1}〉_{RN}with the cellular transition from G_{2}/M to M/G_{1}, which is simulated by*CLB2*overactivation. Similarly, |α_{2}〉_{RN}appears to be correlated with |*a*_{2}〉, |*a*_{3}〉, |*a*_{11}〉, and |*a*_{12}〉, anticorrelated with |*a*_{6}〉, |*a*_{7}〉, |*a*_{16}〉, and |*a*_{17}〉, and uncorrelated with |*a*_{19}〉, |*a*_{20}〉, |*a*_{21}〉, or |*a*_{22}〉. We therefore associate |α_{2}〉_{RN}with the cellular transition from M/G_{1}to G_{1}(which appears to be*CLB2*- and*CLN3*-independent), and −|α_{2}〉_{RN}with the cellular transition from S to S/G_{2}(which also appears to be*CLB2*- and*CLN3*-independent).With all 4,579 genes sorted the gene variations of |α

_{1}〉_{RN}and |α_{2}〉_{RN}fit normalized sine and cosine functions of period*Z*≡*N*− 1 = 4,578 and initial phase π/8, respectively (Fig. 6*b*and*c*). The normalized and sorted cell cycle expression approximately fits a traveling wave, varying sinusoidally across both genes and arrays. The normalized and sorted expression in the*CLB2*- and*CLN3*-overactive arrays approximately fits standing waves, constant across the arrays and varying sinusoidally across genes only, which appear similar to −|α_{1}〉_{RN}and |α_{1}〉_{RN}, respectively (Fig. 6*a*).Figure 6

## Conclusions

We have shown that SVD provides a useful mathematical framework for processing and modeling genome-wide expression data, in which both the mathematical variables and operations may be assigned biological meaning.

## Abbreviation

- SVD
- singular value decomposition

## Note

§

In this report, m̂ denotes a matrix, |

*v*〉 denotes a column vector, and 〈*u*| denotes a row vector, such that m̂|*v*〉, 〈*u*|m̂, and 〈*u*|*v*〉 all denote inner products and |*v*〉〈*u*| denotes an outer product.## Acknowledgments

We thank S. Kim for insightful discussions, G. Sherlock for technical assistance and careful reading, and J. Doyle and P. Green for thoughtful reviews of this manuscript. This work was supported by a grant from the National Cancer Institute (National Institutes of Health, CA77097). O.A. is an Alfred P. Sloan and U.S. Department of Energy Postdoctoral Fellow in Computational Molecular Biology, and a National Human Genome Research Institute Individual Mentored Research Scientist Development Awardee in Genomic Research and Analysis (National Institutes of Health, 1 K01 HG00038-01). P.O.B. is an Associate Investigator of the Howard Hughes Medical Institute.

## References

1

S P Fodor, R P Rava, X C Huang, A C Pease, C P Holmes, C L Adams

*Nature (London)***364**, 555–556 (1993).2

M Schena, D Shalon, R W Davis, P O Brown

*Science***270**, 467–470 (1995).3

P T Spellman, G Sherlock, M Q Zhang, V R Iyer, K Anders, M B Eisen, P O Brown, D Botstein, B Futcher

*Mol Biol Cell***9**, 3273–3297 (1998).4

F P Roth, J D Hughes, P W Estep, G M Church

*Nat Biotechnol***16**, 939–945 (1998).5

M B Eisen, P T Spellman, P O Brown, D Botstein

*Proc Natl Acad Sci USA***95**, 14863–14868 (1998).6

U Alon, N Barkai, D A Notterman, K Gish, S Ybarra, D Mack, A J Levine

*Proc Natl Acad Sci USA***96**, 6745–6750 (1999).7

P Tamayo, D Slonim, J Mesirov, Q Zhu, S Kitareewan, E Dmitrovsky, E S Lander, T R Golub

*Proc Natl Acad Sci USA***96**, 2907–2912 (1999).8

S Tavazoie, J D Hughes, M J Campbell, R J Cho, G M Church

*Nat Genet***22**, 281–285 (1999).9

M P S Brown, W N Grundy, D Lin, N Cristianini, C W Sugnet, T S Furey, M Ares, D Haussler

*Proc Natl Acad Sci USA***97**, 262–267 (2000).10

G H Golub, C F Van Loan

*Matrix Computation*(Johns Hopkins Univ. Press, 3rd Ed., Baltimore, 1996).11

S G Mallat

*A Wavelet Tour of Signal Processing*(Academic, 2nd Ed., San Diego, 1999).12

T W Anderson

*Introduction to Multivariate Statistical Analysis*(Wiley, 2nd Ed., New York, 1984).## Information & Authors

### Information

#### Published in

#### Classifications

#### Copyright

Copyright © 2000, The National Academy of Sciences.

#### Submission history

**Accepted**: June 15, 2000

**Published online**: August 29, 2000

**Published in issue**: August 29, 2000

#### Acknowledgments

We thank S. Kim for insightful discussions, G. Sherlock for technical assistance and careful reading, and J. Doyle and P. Green for thoughtful reviews of this manuscript. This work was supported by a grant from the National Cancer Institute (National Institutes of Health, CA77097). O.A. is an Alfred P. Sloan and U.S. Department of Energy Postdoctoral Fellow in Computational Molecular Biology, and a National Human Genome Research Institute Individual Mentored Research Scientist Development Awardee in Genomic Research and Analysis (National Institutes of Health, 1 K01 HG00038-01). P.O.B. is an Associate Investigator of the Howard Hughes Medical Institute.

### Authors

## Metrics & Citations

### Metrics

#### Citation statements

#### Altmetrics

### Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. Simply select your manager software from the list below and click Download.

#### Cited by

Loading...

## View Options

### View options

#### PDF format

Download this article as a PDF file

DOWNLOAD PDF### Get Access

#### Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Personal login Institutional Login#### Recommend to a librarian

Recommend PNAS to a Librarian#### Purchase options

Purchase this article to get full access to it.