Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription

Contributed by Gene H. Golub, September 13, 2004
Abstract
We describe an integrative data-driven mathematical framework that formulates any number of genome-scale molecular biological data sets in terms of one chosen set of data samples, or of profiles extracted mathematically from data samples, designated the “basis” set. By using pseudoinverse projection, the molecular biological profiles of the data samples are least-squares-approximated as superpositions of the basis profiles. Reconstruction of the data in the basis simulates experimental observation of only the cellular states manifest in the data that correspond to those of the basis. Classification of the data samples according to their reconstruction in the basis, rather than their overall measured profiles, maps the cellular states of the data onto those of the basis and gives a global picture of the correlations and possibly also causal coordination of these two sets of states. We illustrate this framework with an integration of yeast genome-scale proteins' DNA-binding data with cell cycle mRNA expression time course data. Novel correlation between DNA replication initiation and RNA transcription during the yeast cell cycle, which might be due to a previously unknown mechanism of regulation, is predicted.
Recent advances in high-throughput technologies enable monitoring molecular biological signals, e.g., mRNA expression levels and proteins' DNA-binding occupancy levels, that correspond to activities of cellular systems, e.g., DNA replication, RNA transcription, and proteins' DNA-binding on a genomic scale. Integrative analysis of these global signals promises to give new insights into cellular mechanisms of regulation, i.e., global causal coordination of cellular activities. Integrative analysis of different types of large-scale molecular biological data requires mathematical tools that are able to formulate any number of large-scale data sets in terms of a common frame of reference, while reducing the complexity of the data to make them comprehensible (1, 2). These tools should provide data-driven models or mathematical frameworks for the description of the data, where the variables, i.e., the patterns that they uncover in the data, and operations, i.e., data reconstruction and classification in subspaces spanned by these patterns, may represent some biological reality.
Recently we showed that singular value decomposition (SVD) (3, 4) and generalized SVD (GSVD) (5) provide such data-driven frameworks for genome-scale molecular biological data. For example, the variables of SVD, “eigengenes” and corresponding “eigenarrays,” in the analyses of yeast Saccharomyces cerevisiae cell cycle time course mRNA expression data (6), and those of GSVD, “genelets” and corresponding “arraylets,” in the comparative analysis of yeast and human (7) cell cycle time course mRNA expression data, were shown to correlate with observed genome-scale effects of known cell cycle regulators and measured samples of the cell cycle stages that they regulate, respectively. Mathematical reconstruction of the yeast data in these subsets of eigengenes and corresponding eigenarrays, or genelets and corresponding arraylets, was shown to simulate approximately the experimental observation of the cell cycle progression alone, rather than the cell cycle progression together with concurrent biological processes and experimental artifacts. Mathematical classification of yeast genes and arrays according to their expression of these eigengenes and eigenarrays, or genelets and arraylets, rather than overall expression, mapped the data onto cell cycle stages and outlined the progression of the cell cycle along genes and in time, respectively.
Now we show that pseudoinverse projection (8) provides an integrative data-driven framework that formulates any number of genome-scale data sets in terms of a chosen set of data samples, or profiles extracted mathematically from data samples, which is designated the “basis” set. Pseudoinverse projection of a data set onto the basis set is a linear transformation of the data set from the open reading frames (ORFs) × data-samples space to the basis-profiles × data-samples space, where each of the data samples is least-squares-approximated as a linear superposition of the basis profiles. We show that mathematical reconstruction of the data in the basis may simulate experimental observation of only the cellular states manifest in the data that correspond to those of the basis. Mathematical classification of the data samples according to their reconstruction in the basis, rather than their overall molecular biological profiles, maps the cellular states of the data onto those of the basis and gives a global picture of the correlations and possibly also coordination of these two sets of cellular states. Novel correlations between data samples and basis profiles might be due to previously unknown mechanisms of regulation.
We illustrate this framework with an integration of yeast genome-scale proteins' DNA-binding occupancy data (9) of nine cell cycle transcription factors (10) and four DNA replication initiation proteins (11) with the cell cycle time course mRNA expression data, using as basis sets the eigenarrays and arraylets determined by SVD and GSVD, respectively.¶
Mathematical Methods: Pseudoinverse Projection
Let the basis matrix $\hat{b}$, of size N-ORFs or genomic sites × M-basis profiles, tabulate M genome-scale molecular biological profiles, measured from a set of M samples or extracted mathematically from a set of M or more measured samples. The vector in the nth row of the matrix $\hat{b}$, $\langle n|\hat{b}$, lists the signal of the nth ORF across the different samples, which correspond to the different arrays.∥ The vector in the mth column of the matrix $\hat{b}$, $|b_m\rangle \equiv \hat{b}|m\rangle$, lists the measured genome-scale signal levels of the mth basis sample. Let the data matrix $\hat{d}$, of size N-ORFs × L-data samples, tabulate a genome-scale molecular biological data set of a different type of data and for the same ORFs in the same genome, measured in L samples. The vector in the lth column of the matrix $\hat{d}$, $|d_l\rangle \equiv \hat{d}|l\rangle$, lists the measured genome-scale signal levels of the lth data sample.
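Moore–Penrose pseudoinverse projection (8) of the data matrix $\hat{d}$ onto the basis matrix $\hat{b}$ is then a linear transformation of the data $\hat{d}$ from the N-ORFs × L-data samples space to the M-basis profiles × L-data samples space (see Fig. 5 and Appendix, which are published as supporting information on the PNAS web site),

$$\hat{d} \rightarrow \hat{b}^{\dagger}\hat{d} \equiv \hat{c}, \tag{1}$$

where the matrix $\hat{b}^{\dagger}$, i.e., the pseudoinverse of $\hat{b}$, satisfies

$$\hat{b}\hat{b}^{\dagger}\hat{b} = \hat{b}, \quad \hat{b}^{\dagger}\hat{b}\hat{b}^{\dagger} = \hat{b}^{\dagger}, \quad (\hat{b}\hat{b}^{\dagger})^{T} = \hat{b}\hat{b}^{\dagger}, \quad (\hat{b}^{\dagger}\hat{b})^{T} = \hat{b}^{\dagger}\hat{b}, \tag{2}$$

such that the transformation matrices $\hat{b}\hat{b}^{\dagger}$ and $\hat{b}^{\dagger}\hat{b}$ are orthogonal projection matrices. The pseudoinverse of $\hat{b}$ is data-driven and unique. In this space the data matrix $\hat{d}$ is represented by the matrix $\hat{c}$, which tabulates the correlations between the M vectors that span the pseudoinverse $\hat{b}^{\dagger}$, $\{\langle m|\hat{b}^{\dagger}\}$, and the L profiles of the samples that span the data matrix $\hat{d}$, $\{|d_l\rangle\}$, such that $c_{ml} \equiv \langle m|\hat{c}|l\rangle = \langle m|\hat{b}^{\dagger}|d_l\rangle$ for all 1 ≤ m ≤ M and 1 ≤ l ≤ L.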
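Pseudoinverse Computation. We use the SVD of the basis matrix $\hat{b} = \hat{u}\hat{w}\hat{v}^{T}$, where $\hat{w}$ is a diagonal nonnegative matrix, $\hat{u}^{T}\hat{u} = \hat{v}^{T}\hat{v} = \hat{I}$, and $\hat{I}$ is the identity matrix, to compute the pseudoinverse $\hat{b}^{\dagger} = \hat{v}\hat{w}^{-1}\hat{u}^{T}$, such that Eq. 2 is satisfied, and $\hat{b}\hat{b}^{\dagger} = \hat{u}\hat{u}^{T}$ and $\hat{b}^{\dagger}\hat{b} = \hat{v}\hat{v}^{T}$ are orthogonal projection matrices. We then compute the pseudoinverse correlations $\hat{c} = \hat{b}^{\dagger}\hat{d}$. We also compute the canonical correlation of each data profile with the basis, $0 \le \sqrt{\langle d_l|\hat{b}\hat{b}^{\dagger}|d_l\rangle / \langle d_l|d_l\rangle} \le 1$.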
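The computation above can be sketched in a few lines of NumPy. This is a minimal illustration on random toy matrices, not the authors' Mathematica notebooks; the function names, the tolerance `rtol`, and the matrix sizes are assumptions made for the example. It forms the pseudoinverse from the thin SVD, the correlations matrix, and the canonical correlation of each data column with the basis.

```python
import numpy as np

def pseudoinverse_via_svd(b, rtol=1e-12):
    """Moore-Penrose pseudoinverse of an N x M matrix b from its thin SVD."""
    u, w, vt = np.linalg.svd(b, full_matrices=False)   # b = u @ diag(w) @ vt
    keep = w > rtol * w.max()                          # ignore numerically zero singular values
    return (vt[keep].T / w[keep]) @ u[:, keep].T       # v @ diag(1/w) @ u.T

def canonical_correlations(b, b_pinv, d):
    """Cosine of the angle between each data column and the span of the basis."""
    projected = b @ (b_pinv @ d)                       # orthogonal projection of d onto span(b)
    return np.linalg.norm(projected, axis=0) / np.linalg.norm(d, axis=0)

rng = np.random.default_rng(0)
b = rng.standard_normal((200, 9))    # toy basis: N = 200 rows, M = 9 profiles (assumed sizes)
d = rng.standard_normal((200, 13))   # toy data: same rows, L = 13 samples

b_pinv = pseudoinverse_via_svd(b)
c = b_pinv @ d                       # M x L matrix of pseudoinverse correlations
canon = canonical_correlations(b, b_pinv, d)
```

Discarding singular values below a relative tolerance is a numerical safeguard added here; for a full-rank basis it leaves the result identical to the formula in the text.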
Integrative Data Reconstruction. The pseudoinverse projection of $\hat{d}$ onto $\hat{b}$ allows reconstruction of the data, $\hat{d} \rightarrow \hat{b}\hat{b}^{\dagger}\hat{d}$, where each of the data samples is least-squares-approximated by a linear superposition of the basis profiles, $|d_l\rangle \rightarrow \hat{b}\hat{b}^{\dagger}|d_l\rangle = \sum_{m=1}^{M} c_{ml}|b_m\rangle$, without eliminating ORFs or samples. For reconstruction and visualization, we set the arithmetic mean of each ORF across the samples and that of each sample across the ORFs to zero, such that each ORF and sample in the reconstructed data set is centered at its sample- or ORF-invariant level, respectively.
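As a hedged sketch of this step, the reconstruction can be obtained from a least-squares solve, which is equivalent to applying the pseudoinverse, and the centering can be done by removing row and column means. The helper names and toy sizes are hypothetical, and applying the centering after the reconstruction is an assumption of this example, since the text does not fix the order.

```python
import numpy as np

def reconstruct_in_basis(b, d):
    """Least-squares reconstruction of every column of d as a superposition of the columns of b."""
    c, *_ = np.linalg.lstsq(b, d, rcond=None)   # coefficients; equals pinv(b) @ d
    return b @ c, c

def center_rows_and_columns(x):
    """Set the mean of every row (ORF) and every column (sample) to zero."""
    row_mean = x.mean(axis=1, keepdims=True)
    col_mean = x.mean(axis=0, keepdims=True)
    return x - row_mean - col_mean + x.mean()

rng = np.random.default_rng(1)
b = rng.standard_normal((150, 6))    # toy basis (assumed sizes)
d = rng.standard_normal((150, 13))   # toy data on the same rows

recon, c = reconstruct_in_basis(b, d)
centered = center_rows_and_columns(recon)
```

The residual `d - recon` is orthogonal to every basis profile, which is what makes the reconstruction a least-squares approximation.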
Integrative Data Classification. The reconstructed data samples are classified by similarity in the contributions of the basis profiles to their overall measured profiles rather than by their overall measured profiles alone.
Consider a basis that is determined by SVD analysis of a set of measured samples, and is spanned by M > 2 eigenarrays, $\{|b_m\rangle\}$, two of which, $|b_1\rangle$ and $|b_2\rangle$, span a subspace of interest. We plot the correlation of $|b_2\rangle$ with each reconstructed data sample $|d_l\rangle$, $c_{2l}\langle d_l|\hat{b}\hat{b}^{\dagger}|d_l\rangle^{-1/2}$, along the y-axis vs. that of $|b_1\rangle$ along the x-axis. In this plot, the distance of each sample from the origin is its amplitude in the subspace spanned by $|b_1\rangle$ and $|b_2\rangle$ relative to its overall reconstructed amplitude, $r_l = \sqrt{c_{1l}^2 + c_{2l}^2}\,\langle d_l|\hat{b}\hat{b}^{\dagger}|d_l\rangle^{-1/2}$. The angular distance of each sample from the x-axis is its phase in the transition from the profile $|b_1\rangle$ to $|b_2\rangle$ and back to $|b_1\rangle$, $\tan\varphi_l = c_{2l}/c_{1l}$. We sort the reconstructed samples according to their angular distances from the x-axis, $\varphi_l$.
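A minimal sketch of this classification follows, assuming the basis profiles are orthonormal (as eigenarrays are), so that the reconstructed amplitude of each sample is the root of the sum of its squared correlations. The function name and the toy correlations are hypothetical.

```python
import numpy as np

def subspace_amplitude_phase(c, i=0, j=1):
    """Relative amplitude and phase of each sample in the plane of basis profiles i and j.

    c is the M x L matrix of correlations with M orthonormal basis profiles.
    """
    total = np.sqrt((c ** 2).sum(axis=0))            # overall reconstructed amplitude per sample
    x, y = c[i] / total, c[j] / total                # normalized correlations with profiles i and j
    r = np.hypot(x, y)                               # distance from the origin
    phi = np.mod(np.arctan2(y, x), 2 * np.pi)        # angular distance from the x-axis
    return r, phi, np.argsort(phi)                   # last item: samples sorted by phase

# Toy correlations for three samples with known phases and a third, out-of-plane component.
angles = np.array([3.0, 0.5, 2.0])
c = np.vstack([np.cos(angles), np.sin(angles), np.ones_like(angles)])
r, phi, order = subspace_amplitude_phase(c)
```

Using `arctan2` rather than the ratio of the two correlations keeps the quadrant information, so phases cover the full cycle; a sample with zero reconstructed amplitude would need a separate guard, which is omitted here.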
Consider also a basis that is determined by GSVD comparative analysis of two sets of measured samples and is spanned by M > 2 arraylets of one of these sets, $\{|b_m\rangle\}$. We approximate this basis with that spanned by the two vectors and , where the vectors $|x\rangle$ and $|y\rangle$ least-squares-approximate the corresponding M-genelets subspace, $\{\langle\gamma_m|\}$, and maximize . We plot the projection of each data sample, $|d_l\rangle$, from the M-arraylets subspace onto , that is along the y-axis vs. that onto along the x-axis, normalized by its ideal amplitude, where the contribution of each arraylet to the overall projected sample adds up rather than cancels out, . In this plot, the distance of each sample from the origin, $r_l$, is the amplitude of its normalized projection. An amplitude of 1 indicates that the contributions of the arraylets add up, and an amplitude of 0 indicates that they cancel out. The angular distance of each sample from the x-axis, $\varphi_l$, is its phase in the transition from the profile to and back, going through the projections of all M arraylets in this subspace. Again, we sort the reconstructed samples according to $\varphi_l$.
Independently, we also parallel- and antiparallel-associate each data sample with most likely parallel and antiparallel cellular states, or none thereof, according to the annotations of the two groups of n ORFs each, with largest and smallest levels of biological signal in this sample among all N ORFs, respectively. The P value of a given association by annotation is calculated by using combinatorics and assuming hypergeometric probability distribution of the K annotations among the N ORFs, and of the subset of k ⊆ K annotations among the subset of n ⊂ N ORFs, $P(k; n, N, K) = \binom{N}{n}^{-1} \sum_{i=k}^{n} \binom{K}{i}\binom{N-K}{n-i}$, where $\binom{N}{n} = N!\,[n!\,(N-n)!]^{-1}$ is the binomial coefficient (12). We define the most likely association of a data sample with a cellular state as the association that corresponds to the smallest P value.
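The hypergeometric tail probability described above can be evaluated exactly with integer binomial coefficients. The sketch below uses only the Python standard library; the function name is hypothetical, and the upper summation limit is capped at min(n, K) because terms beyond it vanish.

```python
from math import comb

def enrichment_p_value(k, n, N, K):
    """Probability that at least k of the K annotated ORFs fall among n ORFs drawn from N."""
    upper = min(n, K)                                  # terms with i > K are zero
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)                      # exact integers until the final division

# Toy check: drawing 1 ORF out of 10, of which 3 are annotated.
p_small = enrichment_p_value(1, 1, 10, 3)

# Sizes quoted later in the text: 506 annotated ORFs among 2,928, subsets of 200.
p_all = enrichment_p_value(0, 200, 2928, 506)
```

Keeping the sum in exact integer arithmetic avoids the underflow that a floating-point product of factorials would hit for genome-sized N.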
Biological Results: Integrative Analysis of mRNA Expression and Proteins' DNA-Binding Data
Basis Sets. (i) SVD cell cycle mRNA expression basis. SVD analysis (3, 4) of relative mRNA expression levels of 4,579 ORFs in 22 yeast samples measured by Spellman et al. (6) determined two dominant orthogonal eigenarrays and corresponding eigengenes of similar significance that span the yeast cell cycle expression subspace (see Data Sets 1 and 2 and Mathematica Notebook 1, which are published as supporting information on the PNAS web site). The 22 samples correspond to 18 samples of a cell cycle time course of an α-factor-synchronized culture, and two samples each of strains with overexpressed CLN3 and CLB2, which encode G1 and G2/M cyclins, respectively. One eigenarray was shown to correlate and anticorrelate with the samples of overexpressed CLN3 and CLB2, respectively (Fig. 1a). The corresponding eigengene was shown to correlate with CLN3 and its targets, i.e., genes for which expression peaks at the transition from G1 to S, and anticorrelate with CLB2 and its respective targets, for which expression peaks at that from G2/M to M/G1 (Fig. 1b). Classification of the yeast arrays and genes in the subspaces spanned by these two eigenarrays and eigengenes gives a picture that resembles the traditional understanding of yeast cell cycle regulation (13): G1 cyclins, such as Cln3, and G2/M cyclins, such as Clb2, drive the cell cycle past either one of two antipodal checkpoints, from G1 to S and from G2/M to M/G1, respectively (Fig. 1c). The SVD cell cycle mRNA expression basis we use is spanned by the M = 9 most significant eigenarrays across the N = 4,579 ORFs, including the two eigenarrays that span the SVD cell cycle expression subspace.

(ii) GSVD cell cycle mRNA expression basis. GSVD comparative analysis (5) of mRNA expression of 4,523 yeast and 12,056 human ORFs in 18 samples each of time courses of α-factor-synchronized yeast culture (6), and double thymidine block-synchronized HeLa cell line culture measured by Whitfield et al. (7), determined six dominant genelets and corresponding six yeast and six human arraylets, at –π/3, 0, and π/3 initial phases, of similar significance in both data sets that span the yeast and human common cell cycle expression subspace (Data Sets 2, 3, and 4 and Mathematica Notebook 1, which are published as supporting information on the PNAS web site). The two 0-phase yeast arraylets were shown to correlate with cell cycle transition from G2/M to M/G1, in which the yeast culture is synchronized initially, and anticorrelate with that from G1 to S (Fig. 1d). The two 0-phase human arraylets were shown to anticorrelate with the transition from G2/M to M/G1, and to correlate with that from G1 to S, in which the human culture is synchronized initially. The two shared 0-phase genelets were shown to correlate with 0-phase oscillations of both yeast and human genes (Fig. 1e). Simultaneous classification of the yeast and human arrays and genes in the subspaces spanned by the six yeast and six human arraylets, and six shared genelets, respectively, gives a picture that resembles the traditional understanding of the biological similarity in the regulation of the yeast and human cell cycles (13), i.e., two antipodal checkpoints, at the transition from G1 to S and at that from G2/M to M/G1, that are regulated independently of other cell cycle events (Fig. 1f). The GSVD cell cycle mRNA expression basis we use is spanned by the six yeast arraylets across the 4,523 ORFs.
Data Sets. (i) Proteins' DNA-binding data. This data set tabulates the relative DNA-bound protein occupancy levels of the N = 2,928 ORFs with at least one valid data point in any one of L = 13 samples, which correspond to the nine yeast cell cycle transcription factors measured by Simon et al. (10) and four yeast replication initiation proteins measured by Wyrick et al. (11) (Data Set 5, which is published as supporting information on the PNAS web site). The relative binding occupancy level of the nth ORF in the lth sample is presumed valid when the P value calculated by either Simon et al. or Wyrick et al. that is associated with the measured relative binding occupancy signal is <0.1. We divide each ORF measurement by the arithmetic mean of the measurements for that ORF, thus converting the data to binding levels of each protein relative to those of all other proteins.

(ii) α-Factor mRNA expression data. This set tabulates the relative mRNA expression levels of the 4,636 ORFs with valid data in all of the 18 samples of a cell cycle time course of an α-factor-synchronized culture (6) (Data Set 6, which is published as supporting information on the PNAS web site). The relative expression level of the nth ORF in the lth sample is presumed valid when the ratio of the measured expression signal to that of the background is >1 for both the synchronized culture and the asynchronous reference.

(iii) CLB2 and CLN3 mRNA overexpression data. This set tabulates mRNA expression of 5,840 ORFs with valid data in four samples, two samples each of strains with overexpressed CLN3 and CLB2, which encode G1 and G2/M cyclins, respectively (6) (Data Set 7, which is published as supporting information on the PNAS web site).

(iv) CDC15 mRNA expression data. This set tabulates mRNA expression of 4,122 ORFs with valid data in all 24 samples of a cell cycle time course of a yeast CDC15 mutant culture synchronized by temperature change (6) (Data Set 8, which is published as supporting information on the PNAS web site).
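The validity filter and per-ORF normalization described for the binding data in (i) might be sketched as follows. The function name and toy numbers are hypothetical, and one detail is an assumption of this example rather than something the text states: invalid points are marked as missing and excluded from each ORF's mean.

```python
import numpy as np

def prepare_binding_data(ratios, p_values, p_max=0.1):
    """Mask invalid points, keep ORFs with at least one valid point, divide by the ORF mean."""
    valid = p_values < p_max                       # a point is presumed valid when its P value is < p_max
    keep = valid.any(axis=1)                       # ORFs with at least one valid data point
    masked = np.where(valid, ratios, np.nan)[keep]
    orf_mean = np.nanmean(masked, axis=1, keepdims=True)   # assumption: mean over valid points only
    return masked / orf_mean, keep

# Toy input: 3 ORFs x 2 samples of binding ratios and their P values.
ratios = np.array([[2.0, 4.0], [1.0, 3.0], [5.0, 5.0]])
p_values = np.array([[0.01, 0.05], [0.50, 0.02], [0.90, 0.90]])
relative, keep = prepare_binding_data(ratios, p_values)
```

Dividing by the ORF mean expresses each protein's occupancy at an ORF relative to the other proteins' occupancy at the same ORF, which is the conversion the text describes.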
Pseudoinverse Reconstruction of the Proteins' DNA-Binding Data in the mRNA Expression Bases. Of the 2,227 and 2,139 ORFs in the intersections of the 2,928 ORFs of the proteins' DNA-binding data set and the 4,579 and 4,523 ORFs of the SVD- and GSVD-cell cycle mRNA expression bases, 400 and 377 ORFs were microarray-classified, and 58 and 60 were traditionally classified as cell cycle-regulated, respectively. In these intersections, at least one canonical correlation of each binding profile with either the SVD or GSVD bases is >0.1 (see Fig. 6 and Mathematica Notebook 2, which are published as supporting information on the PNAS web site). We reconstruct the proteins' DNA-binding data in the SVD and GSVD bases by using pseudoinverse projections in these intersections (Fig. 2). With the ORFs sorted according to their SVD and GSVD cell cycle phases, the ORF variations of the SVD- and GSVD-reconstructed binding profiles approximately fit cosine functions of one period and of varying initial phases. With the nine transcription factors ordered Mbp1, Swi4, Swi6, Fkh1, Fkh2, Ndd1, Mcm1, Ace2, and Swi5, following Simon et al. (10), the SVD- and GSVD-pseudoinverse correlations approximately fit cosine functions of one period and of varying initial phases across the nine samples, and are approximately invariant across the four samples of the replication initiation proteins, Mcm3, Mcm4, Mcm7, and Orc1 (Fig. 3).
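Restricting both matrices to the ORFs they share, and aligning their rows, is a prerequisite for the projection. The sketch below shows one way to do this with NumPy; the ORF labels, sizes, and function name are hypothetical, the data are assumed to have no missing values, and ORF identifiers are assumed unique within each set.

```python
import numpy as np

def project_on_shared_orfs(basis_ids, b, data_ids, d):
    """Align basis and data on their common ORFs, then project the data onto the basis."""
    shared, ib, idd = np.intersect1d(basis_ids, data_ids, return_indices=True)
    c = np.linalg.pinv(b[ib]) @ d[idd]        # correlations computed in the intersection only
    return shared, c

# Toy identifiers (hypothetical labels), listed in different orders in the two sets.
basis_ids = np.array(["orf_a", "orf_b", "orf_c", "orf_d", "orf_e"])
data_ids = np.array(["orf_e", "orf_x", "orf_b", "orf_a", "orf_d"])

rng = np.random.default_rng(2)
b = rng.standard_normal((5, 2))   # rows follow basis_ids
d = rng.standard_normal((5, 3))   # rows follow data_ids

shared, c = project_on_shared_orfs(basis_ids, b, data_ids, d)
```

Because the pseudoinverse is recomputed on the restricted basis, the projection stays a least-squares fit over exactly the ORFs measured in both sets.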
The SVD- and GSVD-reconstructed transcription factors' data approximately fit traveling waves, cosinusoidally varying across the ORFs as well as the nine samples. Simon et al. (10) observed a similar traveling wave in the binding data of the nine transcription factors, ordered as above, across only 213 ORFs in the intersection of ORFs with a P value <0.001 for at least one data point in any one of the nine samples, and ORFs that were microarray-classified as cell cycle-regulated, sorted according to their cell cycle phases as calculated by Spellman et al. (6). These traveling waves are in agreement with current understanding of the cell cycle's progression of transcription along the genes and in time as it is regulated by DNA binding of the transcription factors at the promoter regions of the transcribed genes. Pseudoinverse reconstruction of the data in both the SVD and GSVD bases, therefore, simulates experimental observation of only proteins' DNA-binding cellular states that correspond to those of mRNA expression during the cell cycle. The SVD- and GSVD-reconstructed replication initiation proteins' data approximately fit standing waves, cosinusoidally varying across the ORFs and constant across the four samples, that are antiparallel to the reconstructed profiles of Mbp1, Swi4, and Swi6, and parallel to that of Mcm1.
Pseudoinverse Mapping of the Proteins' DNA-Binding Data onto the Cell Cycle mRNA Expression Subspaces. We map the SVD- and GSVD-reconstructed proteins' DNA-binding data onto the SVD- and GSVD-cell cycle mRNA expression subspaces, respectively, associating with each binding profile a cell cycle phase and amplitude (Fig. 4). Projected from the SVD basis, which is spanned by nine eigenarrays, onto the SVD-cell cycle subspace, which is spanned by two of these eigenarrays, all SVD-reconstructed samples have at least 25% of their binding profiles in this subspace, where their distances from the origin satisfy $0.5 \le r_l < 1$, except for Fkh2. Projected from the six-dimensional GSVD-cell cycle subspace, which is spanned by six arraylets, onto the two-dimensional subspace that approximates it, 50% or more of the contributions of the six arraylets to each GSVD-reconstructed sample add up, where the distance of each array from the origin satisfies $0.5 \le r_l < 1$.
Sorting the samples according to their SVD or GSVD phases gives an array order that is similar to that of Simon et al. (10) and describes the yeast cell cycle progression from the cellular state of Mbp1's binding through that of Swi5's. The SVD- and GSVD-mappings of the transcription factors' binding profiles onto the expression subspaces are also in agreement with the current understanding of the cell cycle program. Mapping the binding of Mbp1, Swi4, and Swi6 onto the cell cycle expression stage G1 corresponds to the biological coordination between the binding of these factors to the promoter regions of ORFs and the subsequent peak in transcription of these ORFs during G1. The mapping of Mbp1, Swi4, and Swi6 onto G1, which is antipodal to G2/M, also corresponds to their binding to promoter regions of ORFs that exhibit transcription minima or shutdown during G2/M and to their minimal or lack of binding at promoter regions of ORFs that have transcription peaks in G2/M. Similarly, the mapping of Mcm1 onto G2/M corresponds to its binding to the promoter regions of ORFs that are subsequently transcribed during the transition from G2/M to M/G1. The binding profiles of the replication initiation proteins are SVD- and GSVD-mapped onto the cell cycle stage that is antipodal to G1. This mapping is consistent with the reconstructed profiles of Mcm3, Mcm4, Mcm7, and Orc1 being antiparallel to those of Mbp1, Swi4, and Swi6 and parallel to that of Mcm1. Thus, DNA binding of Mcm3, Mcm4, Mcm7, and Orc1 adjacent to ORFs is shown to be correlated with minima or even shutdown of the transcription of these ORFs during the cell cycle stage G1, suggesting a previously unknown genome-scale coordination between DNA replication initiation and RNA transcription during the cell cycle in yeast.
Independently, we also parallel- and antiparallel-associate each binding profile with most likely parallel and antiparallel cell cycle stages, or none thereof (Table 1, which is published as supporting information on the PNAS web site), by calculating the P value for the distribution of the 506 and 77 ORFs that were microarray- and traditionally classified as cell cycle-regulated, respectively, among all 2,928 ORFs and among each of the subsets of 200 ORFs with largest and smallest levels of binding occupancy, respectively (Fig. 7, which is published as supporting information on the PNAS web site). At least one of the four P values for each profile, following either the microarray or traditional classification, for either parallel or antiparallel association, is <0.01. Most of the P values are ≪0.01. Almost all parallel and antiparallel associations of each profile are consistently antipodal, i.e., half of a cell-cycle period apart. Also, almost all associations following the microarray classification are consistent with the associations following the traditional classification. For example, following both the microarray and traditional classifications, the profile of Mcm1 is associated in parallel with G2/M and in antiparallel with G1. The SVD and GSVD mappings of all of the binding profiles onto the cell cycle transcription subspaces are consistent with these probabilistic associations by ORF annotations.
Pseudoinverse Integration of the mRNA Expression Data with the mRNA Expression Bases. We integrate the α-factor cell cycle, CLB2 and CLN3 overexpression, and CDC15 cell cycle mRNA expression data sets with the SVD- and GSVD-cell cycle mRNA expression bases by using pseudoinverse projections (see Figs. 8–18 and Tables 2 and 3, which are published as supporting information on the PNAS web site). The results are all consistent and in agreement with the current understanding of the cell cycle program.
Pseudoinverse Integration of the Replication Initiation Proteins' DNA-Binding Data with the Transcription Factors' DNA-Binding Basis. We integrate the replication initiation proteins' DNA-binding data with the transcription factors' DNA-binding data after reconstruction in either the SVD- or GSVD-cell cycle RNA transcription bases (see Figs. 19 and 20, which are published as supporting information on the PNAS web site). Again we find that the binding profiles of the replication initiation proteins, Mcm3, Mcm4, Mcm7, and Orc1, are anticorrelated with the profiles of Mbp1, Swi4, and Swi6 and correlated with the profile of Mcm1.
Discussion
We showed that pseudoinverse projection can be used for integrative analysis of different types of large-scale molecular biological data. One consistent picture emerges upon integrating genome-scale proteins' DNA-binding data with the SVD- and GSVD-cell cycle mRNA expression bases, which is in agreement with the current understanding of the yeast cell cycle program. This picture correlates the binding of replication initiation proteins with minima or shutdown of the transcription of adjacent ORFs during the cell cycle stage G1, under the assumption that the measured cell cycle mRNA expression levels are approximately proportional to cell cycle RNA transcription activity. It is known that replication initiation requires binding of Mcm3, Mcm4, Mcm7, and Orc1 at origins of replication across the yeast genome during G1 (14, 15) and that these replication initiation proteins are involved with transcriptional silencing at the yeast mating loci (16, 17). It was suggested recently that the transcription factor Mcm1 also binds origins of replication (18). Either one of at least two mechanisms of regulation may be underlying this novel genome-scale correlation between DNA replication initiation and RNA transcription during the yeast cell cycle: The transcription of genes may reduce the binding efficiency of adjacent origins, or the binding of replication initiation proteins to origins of replication may repress, or even shut down, the transcription of adjacent genes. Thus a data-driven mathematical model, where the mathematical variables and operations represent biological reality, has been used to predict a biological principle that is truly on a genome scale: The ORFs in either one of the basis or data sets were selected on the basis of data quality alone and were not limited to ORFs that are microarray- or traditionally classified as cell cycle-regulated, suggesting that the RNA transcription signatures of yeast cell cycle cellular states may span the whole yeast genome. This idea is in agreement with the recent observation that a genome-wide oscillation in transcription gates DNA replication and the cell cycle (19).
Possible additional applications of pseudoinverse projection include integrating additional data of different cellular programs, e.g., yeast meiosis or invasive growth, and of different types, e.g., DNA sequence motif abundance in ORFs' promoter regions, DNA copy number, mRNA expression, or proteins' DNA-binding levels, with the basis set of yeast cell cycle mRNA expression to elucidate the coordination of these programs in terms of their genomic signals.
Acknowledgments
We thank D. Botstein and P. O. Brown for introducing us to genomics; J. F. X. Diffley, P. Green, R. R. Klevecz, and J. J. Wyrick for thoughtful and thorough reviews of this manuscript; I. W. Dawes, V. R. Iyer, E. M. Marcotte, and K. Nasmyth for insightful discussions; and G. W. Brown, I. Haviv, I. Simon, and B. K. Tye for helpful comments. This work was supported by National Science Foundation Grant CCR-0430617 (to G.H.G.). O.A. is an Individual Mentored Research Scientist Development Awardee in Genomic Research and Analysis (5 K01 HG00038-01) of the National Human Genome Research Institute.
Footnotes

‡ To whom correspondence should be addressed. E-mail: orlyal@mail.utexas.edu.

Abbreviations: SVD, singular value decomposition; GSVD, generalized SVD.

¶ Alter, O., Golub, G. H., Brown, P. O. & Botstein, D., Miami Nature Biotechnology Winter Symposium: The Cell Cycle, Chromosomes and Cancer, Jan. 31–Feb. 4, 2004, Miami Beach, FL (www.med.miami.edu/mnbws/alter.pdf).

∥ In this article, $\hat{m}$ denotes a matrix, $|v\rangle$ denotes a column vector, and $\langle u|$ denotes a row vector, such that $\hat{m}|v\rangle$, $\langle u|\hat{m}$, and $\langle u|v\rangle$ all denote inner products and $|v\rangle\langle u|$ denotes an outer product.
 Copyright © 2004, The National Academy of Sciences
References
2. Lu, P., Nakorchevskiy, A. & Marcotte, E. M. (2003) Proc. Natl. Acad. Sci. USA 100, 10370–10375. pmid:12934019
3. Alter, O., Brown, P. O. & Botstein, D. (2000) Proc. Natl. Acad. Sci. USA 97, 10101–10106. pmid:10963673
4. Alter, O., Brown, P. O. & Botstein, D. (2001) in Microarrays: Optical Technologies and Informatics, eds. Bittner, M. L., Chen, Y., Dorsel, A. N. & Dougherty, E. R. (Int. Soc. Optical Eng., Bellingham, WA), Vol. 4266, pp. 171–186.
5. Alter, O., Brown, P. O. & Botstein, D. (2003) Proc. Natl. Acad. Sci. USA 100, 3351–3356. pmid:12631705
6. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998) Mol. Biol. Cell 9, 3273–3297. pmid:9843569
7. Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., Alexander, K. E., Matese, J. C., Perou, C. M., Hurt, M. M., Brown, P. O. & Botstein, D. (2002) Mol. Biol. Cell 13, 1977–2000. pmid:12058064
8. Golub, G. H. & Van Loan, C. F. (1996) Matrix Computations (Johns Hopkins Univ. Press, Baltimore), 3rd Ed.
11. Wyrick, J. J., Aparicio, J. G., Chen, T., Barnett, J. D., Jennings, E. G., Young, R. A., Bell, S. P. & Aparicio, O. M. (2001) Science 294, 2357–2360. pmid:11743203
13. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. & Watson, J. D. (1994) Molecular Biology of the Cell (Garland, New York), 3rd Ed.
18. Chang, V. K., Fitch, M. J., Donato, J. J., Christensen, T. W., Merchant, A. M. & Tye, B. K. (2003) J. Biol. Chem. 278, 6093–6100. pmid:12473677
19. Klevecz, R. R., Bolen, J., Forrest, G. & Murray, D. B. (2004) Proc. Natl. Acad. Sci. USA 101, 1200–1205. pmid:14734811