New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Singular value decomposition of genomescale mRNA lengths distribution reveals asymmetry in RNA gel electrophoresis band broadening

Contributed by Gene H. Golub, June 7, 2006
Abstract
We describe the singular value decomposition (SVD) of yeast genomescale mRNA lengths distribution data measured by DNA microarrays. SVD uncovers in the mRNA abundance levels data matrix of genes × arrays, i.e., electrophoretic gel migration lengths or mRNA lengths, mathematically unique decorrelated and decoupled “eigengenes.” The eigengenes are the eigenvectors of the arrays × arrays correlation matrix, with the corresponding series of eigenvalues proportional to the series of the “fractions of eigen abundance.” Each fraction of eigen abundance indicates the significance of the corresponding eigengene relative to all others. We show that the eigengenes fit “asymmetric Hermite functions,” a generalization of the eigenfunctions of the quantum harmonic oscillator and the integral transform which kernel is a generalized coherent state. The fractions of eigen abundance fit a geometric series as do the eigenvalues of the integral transform which kernel is a generalized coherent state. The “asymmetric generalized coherent state” models the measured data, where the profiles of mRNA abundance levels of most genes as well as the distribution of the peaks of these profiles fit asymmetric Gaussians. We hypothesize that the asymmetry in the distribution of the peaks of the profiles is due to two competing evolutionary forces. We show that the asymmetry in the profiles of the genes might be due to a previously unknown asymmetry in the gel electrophoresis thermal broadening of a moving, rather than a stationary, band of RNA molecules.
Advances in sequencing technology (1), including DNA and RNA gel electrophoresis (2–6), fueled the Human Genome Project, promoted the resulting sequencing of numerous complete genomes, and stimulated the emergence of DNA microarray hybridization technology. This highthroughput technology makes it possible to assay the hybridization of DNA or RNA molecules, extracted from a single sample, with several thousands of probes simultaneously (7, 8). Different types of molecular biological signals, such as abundance levels of DNA, RNA, and DNAbound proteins can now be measured on genomic scales (9, 10).
Recently, Hurowitz and Brown (11) described the use of DNA microarrays in the genomescale measurement of the distribution of the lengths of mRNA gene transcripts in yeast. Electrophoresis was used to separate the transcripts by migration length in an agarose gel, where each migration length corresponds to a transcript length. The gel was cut into slices, and the relative abundance levels of the different yeast transcripts in each slice was measured with a DNA microarray. We describe the singular value decomposition (SVD) (12) of the mRNA abundance levels data matrix of genes × arrays, i.e., gel slices, electrophoretic migration lengths, or mRNA lengths. SVD separates the measured profiles of the genes into mathematically unique decorrelated and decoupled “eigengenes,” which are the eigenvectors of the arrays × arrays correlation matrix. The corresponding series of eigenvalues are proportional to the series of “fractions of eigen abundance,” each of which indicates the significance of the corresponding eigengene relative to all eigengenes. Recently, we illustrated the possible correspondence between significant eigengenes uncovered in DNA microarray data and the independent biological and experimental processes that compose the data with an analysis of genomescale mRNA expression from yeast during its cell cycle program (ref. 13, and see also refs. 14 and 15).
Here we show that the eigengenes of the yeast mRNA lengths distribution data fit “asymmetric Hermite functions.” Hermite functions are the eigenfunctions of the quantum harmonic oscillator (17, 18) and the integral transform which kernel is a generalized coherent state (19, 20). We show that the corresponding fractions of eigen abundance fit a geometric series, as do the eigenvalues of the integral transform which kernel is a generalized coherent state. We show that, as follows from the uniqueness of SVD, the “asymmetric generalized coherent state” model fits the measured genomescale distribution of the lengths of mRNA gene transcripts in yeast, i.e., that (i) the profiles of mRNA abundance levels of most genes fit asymmetric Gaussians; and (ii) the distribution of the peaks of the profiles of the genes fits an asymmetric Gaussian that peaks approximately at the mRNA length of 1,000 ± 50 nucleotides. We hypothesize that the asymmetric Gaussian distribution of the peaks of the profiles of the genes is due to two competing evolutionary forces that balance at the peak of this distribution.
Recently, we predicted a previously unknown biological principle by modeling DNA microarray data. Integrating genomescale proteins’ DNAbinding data with cell cycle mRNA expression time course data from yeast, using pseudoinverse projection, we predicted a previously unknown correlation between DNA replication initiation and RNA transcription, which might be due to an undiscovered mechanism of regulation (21). Now we reveal a previously unknown physical principle by modeling DNA microarray data. We show that the asymmetry in the profiles of mRNA abundance levels of the genes across the arrays, i.e., gel slices, might be due to a previously unknown asymmetry in the thermal broadening of a moving band of mRNA molecules. The peak of the symmetric Gaussian profile of a stationary band shifts within the moving band, changing its profile into an asymmetric Gaussian. We conclude that the mathematical modeling of DNA microarray data might be used to uncover the physical as well as the biological principles that govern the activities of DNA and RNA.
SVD of GenomeScale mRNA Lengths Distribution in Yeast
Hurowitz and Brown (11) measured relative mRNA abundance levels for 9,867 putative genes of the yeast Saccharomyces cerevisiae, over a range of 60 mm of an agarose gel cut into 30 slices of 2 mm each, spanning the electrophoretic migration range of 42–100 mm and the corresponding mRNA lengths range of 4,400–300 nucleotides, after 2 h of separation by length in an electric field of 9 V/cm. The data set we analyze tabulates the ratios of mRNA abundance levels for the P = 6,776 genes with no missing data in any of the X = 30 arrays corresponding to the 30 slices of gel. Let the matrix d̂ of size Pgenes × Xarrays tabulate the genomescale mRNA lengths distribution of yeast (see www.bme.utexas.edu/research/orly/harmonic_oscillator). The vector in the pth row of the matrix d̂, 〈pd̂〉 lists the profile of relative abundance levels of the pth gene transcript across the different gel slices which correspond to the different arrays. ^{§} The vector in the xth column of the matrix d̂, d̂x〉, lists the relative abundance levels of the P gene transcripts as measured in the xth gel slice by the xth array.
We compute the singular value decomposition (SVD) of the data matrix d̂ = ûω̂v̂^{T} (12) (see www.bme.utexas.edu/research/orly/harmonic_oscillator). The nth row of the orthogonal transformation matrix v̂^{T} lists the nth “eigengene” 〈nv̂^{T} , which is unique up to a phase factor of ±1 (13) (Fig. 1 a). The diagonal nonnegative matrix ω̂ lists the “eigen abundance” levels {〈nω̂n〉} (Fig. 1 b). The significance of the nth eigengene is indicated by the nth “fraction of eigen abundance” Ω_{n} = 〈nω̂n〉^{2}/(Σ_{n=1} ^{N} 〈nω̂n〉^{2}), i.e., the abundance captured by the nth eigengene relative to that captured by all eigengenes. Note that the eigengenes are the eigenvectors of the Xarrays × Xarrays correlation matrix d̂^{T}d̂ with the corresponding series of eigenvalues proportional to the series of the fractions of eigen abundance.
Eigengenes Fit “Asymmetric Hermite Functions”
The nth Hermite function,
where H
_{n}(
We fit the nth eigengene with the (n − 1)th continuous “asymmetric Hermite function,” where the generalized Hooke’s constant is asymmetric with respect to the equilibrium x = 0 (Fig. 2), These asymmetric Hermite functions form only an approximately orthogonal basis. The (n − 1)th asymmetric Hermite function is, therefore, normalized after discretization by sampling at unit intervals in the range x ∈ [−11, 18], where the equilibrium x = 0 is set at the gel migration length of 78 mm, and then fit to the nth eigengene, for n = 1, … , 10. We find that the “asymmetric generalized Hooke’s constant” is approximately k = k _{1} ≈ 0.36 for x ≤ 0 and k = k _{2} ≈ k _{1}/4 ≈ 0.09 for x ≥ 0. The arithmetic mean of the correlations between the (n − 1)th asymmetric Hermite function and the nth eigengene for n = 1, … , 10 is ≈0.78. The inflection points of the eigengenes approximately sample the corresponding continuous “asymmetric parabolic potential” kx ^{2}/2, where k = k _{1} for x ≤ 0 and k = k _{2} = k _{1}/4 for x ≥ 0, at unit intervals (Figs. 1 a and 2).
Asymmetric Generalized Coherent State Model of the GenomeScale mRNA Lengths Distribution
Assume that the profile of mRNA abundance levels measured for the pth gene across the X slices of gel approximately fits a Gaussian of the variable x which peaks at x = p with the variance σ_{x}
^{2} = 1/2a > 0, exp[−a(x − p)^{2}]. Assume also that the distribution of the peaks across the P genes fits a Gaussian of the variable p which peaks at the equilibrium p = x = 0 with the variance σ_{p}
^{2} = 1/2b > 0, where σ_{p}
^{2} ≫ σ_{x}
^{2} such that a ≫ b. With these assumptions, the abundance level of the pth gene transcript as measured by the xth array is proportional to the generalized coherent state (19)
The correlation of the mRNA abundance levels measured by the xth and yth arrays across the P genes, in the limit of P → ∞, is then proportional to the generalized coherent state
where α = (a
^{2} + 2ab)/[2(a + b)] and β = a
^{2}/[2(a + b)]. For a ≫ b, α ≈ a/2 + b and β ≈ a/2. Using Eq. 2
, it can be shown that the Hermite functions h_{n}
(
where k = 2
Fractions of Eigen Abundance Fit a Geometric Series.
Fitting the fractions of eigen abundance {Ω_{n}} with the geometric series of Eq. 7 {cλ^{n}} for n = 2, … , 10, we find that λ ≈ 0.8. The correlation between the fractions of eigen abundance and the geometric series for n = 2, … , 10 is >0.99 (Fig. 1 b). From k = k _{1} ≈ 0.36 for x ≤ 0, k = k_{2} ≈ k _{1}/4 ≈ 0.09 for x ≥ 0 and λ ≈ 0.8 for all x, we calculate that a = a _{1} ≈ 1.6 for x ≤ p, and a = a _{2} ≈ a _{1}/4 ≈ 0.4 for x ≥ p, and that b = b _{1} ≈ 0.02 for x ≤ 0, and b = b _{2} ≈ b _{1}/4 ≈ 0.005 for x ≥ 0. Note that a _{1}/b _{1} ≈ a _{2}/b _{2} ≈ 80, such that a ≫ b.
Note also that, for the fractions of eigen abundance computed from the generalized coherent state of Eq. 5
after discretization to fit approximately the geometric series of Eq. 7
, it follows that λ ≈ 1 −
Profiles of mRNA Abundance Levels of Most Genes Fit Asymmetric Gaussians.
Hurowitz and Brown (11) observed that, for most genes, the profile of mRNA abundance levels of each gene, 〈pd̂, peaks at only one of the X gel slices. We fit the profiles of mRNA abundance levels measured for the yeast genes VMA7, ADE12 and HIS4, which were selected by Hurowitz and Brown as representative genes, with the asymmetric Gaussians exp[−a(x − p)^{2}], where a = a _{1} for x ≤ p and a = a _{2} ≈ a _{1}/4 for x ≥ p, sampled at unit intervals in the ranges x ∈ [−7, 22], [−16, 13], and [−23, 6], such that their peaks are set at the gel migration lengths of 88, 70, and 56 mm, which according to Hurowitz and Brown correspond approximately to the mRNA lengths of 675 ± 35, 1,500 ± 60, and 2,575 ± 100 nucleotides, respectively. We find that the arithmetic mean of the correlations between the measured profile for each of these three genes and the corresponding discretized Gaussian is ≈0.94 (Fig. 3).
Distribution of the Peaks of the Profiles of the Genes Fits an Asymmetric Gaussian.
From Eq. 5 , the arithmetic mean of the profiles of mRNA abundance levels of the P genes, in the limit of P → ∞, is approximately proportional to the distribution of the peaks x = p across the X gel slices for a ≫ b
We fit the arithmetic mean of the profiles of mRNA abundance levels of the P genes with the asymmetric Gaussian exp(−bx ^{2}), where b = b _{1} for x ≤ 0 and b = b _{2} = b _{1}/4 for x ≥ 0, sampled at unit intervals in the range x ∈ [−10, 19], where the equilibrium is set at the gel migration length of 80 mm. We find that the correlation is >0.99 (Fig. 4).
GenomeScale mRNA Lengths Distribution Fits an Asymmetric Generalized Coherent State.
We fit the mRNA lengths distribution data with the analytical continuous asymmetric generalized coherent state f(x, p) of Eq. 5 , where the variances σ_{x} ^{2} = 1/2a and σ_{p} ^{2} = 1/2b are asymmetric with respect to the peaks x = p of the profiles of the genes and the equilibrium p = x = 0, respectively (Fig. 4),
SVD of the asymmetric generalized coherent state is computed after discretization by sampling at unit intervals in the range x ∈ [−10, 19], where the equilibrium is set at the gel migration length of 80 mm (Fig. 7, which is published as supporting information on the PNAS web site). We find that the arithmetic mean of the correlations between the nth eigengene computed for the discretized asymmetric generalized coherent state of Eq. 9 and the nth eigengene computed for the measured data for n = 1, … , 10 is ≈0.73. The inflection points of the eigengenes computed for the discretized asymmetric generalized coherent state approximately sample the asymmetric parabolic potential kx ^{2}/2, where k = k _{1} for x ≤ 0 and k = k _{2} = k _{1}/4 for x ≥ 0 at unit intervals (Figs. 7a and 8, which are published as supporting information on the PNAS web site). We find that the correlation between the fractions of eigen abundance computed for the discretized asymmetric generalized coherent state and the geometric series {cλ^{n}} for n = 2, … , 10 is >0.99 (Fig. 7b).
Following Eq. 8 , we approximate the distribution of the peaks of the genes exp(−bp ^{2}) in the asymmetric coherent state of Eq. 5 with the arithmetic mean of the profiles of the genes SVD is computed again after discretization of exp[−a(x − p)^{2}] by sampling at unit intervals in the range x ∈ [−10, 19], where the equilibrium is set at the gel migration length of 80 mm (Fig. 9, which is published as supporting information on the PNAS web site). We find that the approximated asymmetric generalized coherent state of Eq. 10 fits the measured data better than the asymmetric generalized coherent state of Eqs. 5 and 9 : The arithmetic mean of the correlations between the nth eigengene computed for the approximated and discretized asymmetric generalized coherent state of Eq. 10 and the nth eigengene computed for the measured data for n = 1, … , 10 is ≈0.86 (Figs. 5 and 6). Note also the robustness of the eigengenes and the fractions of eigen abundance computed for the asymmetric generalized coherent state of Eq. 9 to the perturbation introduced by the approximation of Eq. 10 .
Why Does the Distribution of the Peaks of the Profiles of the Genes Fit an Asymmetric Gaussian?
Our hypothesis is that there are two competing evolutionary forces that determine the lengths of mRNA gene transcripts, and also the distribution of the peaks of the P profiles measured for the P genes. The first force acts to maximize the information content of the genes and their functional specificity, and therefore also their mRNA lengths. The second force acts to minimize the costs associated with the transcription as well as posttranscriptional processes, such as translation, and therefore also the lengths of mRNA gene transcripts. These forces balance at the peak of the distribution exp(−bp ^{2}), i.e., the equilibrium p = x = 0. We find the equilibrium at the gel migration length of 80 mm, which according to Hurowitz and Brown (11) corresponds approximately to the mRNA length of 1,000 ± 50 nucleotides. Both forces are linearly proportional to and oppositely directed to the displacement of the gel migration length of a transcript from this equilibrium gel migration length in the manner of the restoring force of the harmonic oscillator. Note that the gel migration length of a transcript is approximately linearly proportional to the logarithm of the mRNA length of the transcript (3, 11). The proportionality constants are b = b _{1} for the first force in the range of x ≤ 0, and b = b _{2} for the second force in the range of x ≥ 0. We find that b _{2} ≈ b _{1}/4, suggesting that the first force, which acts to increase mRNA lengths that are shorter than the equilibrium length of 1,000 nucleotides, is larger than the second force, which acts to decrease mRNA lengths that are longer than the equilibrium length.
Why Do the Profiles of mRNA Abundance Levels of Most Genes Fit Asymmetric Gaussians?
Asymmetry in the gel electrophoresis thermal broadening of a moving, rather than a stationary, band of RNA molecules, along the axis of the electric field, might be underlying this previously unknown asymmetry in the profiles of mRNA abundance levels of the genes. When the electric field is turned off, the thermal broadening of a band of RNA molecules in an agarose gel is symmetric along the axis of the field, such that the distribution of the RNA molecules is a Gaussian profile, exp[−a(x − p)^{2}], which peaks at the position where the RNA molecules were loaded onto the gel, x = p. When the electric field is turned on, the RNA molecules migrate in the gel along the axis of the electric field in addition to their thermal diffusion (3–6). The electrophoretic velocity depends on the lengths of the RNA molecules. Molecules which are of the same lengths will migrate with the same velocity and form a moving band. In the thermal broadening of a moving band of RNA molecules the peak of the band is moving toward the front of the band and away from its back. As a result, the width of the band narrows in the direction of migration, σ_{x,1} =
Discussion
We have shown, using SVD, that the genomescale distribution of yeast mRNA gene transcript lengths measured by DNA microarrays can be modeled mathematically as an asymmetric generalized coherent state, where the profiles of mRNA abundance levels of most genes as well as the distribution of the peaks of these profiles fit asymmetric Gaussians. We have hypothesized that the asymmetric Gaussian distribution of the peaks of the profiles of the genes is due to two competing evolutionary forces, which balance at the peak of this distribution, approximately at the mRNA length of 1,000 ± 50 nucleotides. SVD analyses of DNA microarray measured genomescale distributions of the lengths of unspliced premRNA, spliced mRNA exons and separately mRNA introns in different organisms (22), as well as different experimentally evolved strains of any one organism (23, 24), may be used to study the evolutionary forces which affect mRNA lengths.
We have shown that the asymmetry in the profiles of the genes might be due to a previously unknown asymmetry in the gel electrophoresis thermal broadening of a moving, rather than a stationary, band of RNA molecules. Previous simulations and measurements of DNA band broadening in gel electrophoresis have shown that the broadening of a moving band can be different from that of a stationary band, but have not suggested asymmetry (4–6). We conclude that the mathematical modeling of DNA microarray data might be used to uncover the physical as well as the biological principles which govern the activities of DNA and RNA.
Acknowledgments
We thank D. B. Oberman for recognizing that the eigengenes are Hermite functions, for recognizing that the asymmetric band broadening is due to a shift of the Gaussian peak within the moving band, and for many very insightful discussions. We also thank J. J. Collins, T. A. Duke, A. Goriely, J. Ross, and M. O. Vlad for thoughtful and thorough reviews of this manuscript and M. V. Berry, D. Botstein, P. O. Brown, G. M. Church, E. H. Hurowitz, V. R. Iyer, W. H. Press, G. W. Slater, and D. W. L. Sumners for helpful comments. This work was supported by National Science Foundation Grant CCR0430617 (to G.H.G.) and a National Human Genome Research Institute Individual Mentored Research Scientist Development Award in Genomic Research and Analysis (K01 HG00038) (to O.A.).
Footnotes
 ^{†}To whom correspondence may be addressed. Email: orlyal{at}mail.utexas.edu or golub{at}stanford.edu

Author contributions: O.A. and G.H.G. designed research; O.A. performed research; O.A. analyzed data; and O.A. and G.H.G. wrote the paper.

Conflict of interest statement: No conflicts declared.

↵ § In this manuscript, m̂ denotes a matrix, v〉 denotes a column vector, and 〈u denotes a row vector, such that m̂v〉, 〈um̂, and 〈uv〉 all denote inner products and v〉 〈u denotes an outer product.
 Abbreviation:
 SVD,
 singular value decomposition.

Freely available online through the PNAS open access option.
 © 2006 by The National Academy of Sciences of the USA
References

↵
 Church G. M. ,
 Gilbert W. ,
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Schena M. ,
 Shalon D. ,
 Davis R. W. ,
 Brown P. O. ,
 ↵
 ↵
 ↵

↵
 Golub G. H. ,
 Van Loan C. F. ,

↵
 Alter O. ,
 Brown P. O. ,
 Botstein D. ,

↵
 Yeung M. K. ,
 Tegner J. ,
 Collins J. J. ,

↵
 Vlad M. O. ,
 Arkin A. P. ,
 Ross J. ,

 Alter O. ,
 Golub G. H. ,

↵
 Schiff L. I. ,

↵
 Alter O. ,
 Yamamoto Y. ,
 ↵
 ↵

↵
 Alter O. ,
 Golub G. H. ,
 ↵

↵
 Lenski R. E. ,
 Travisano M. ,

↵
 Dunham M. J. ,
 Badrane H. ,
 Ferea T. ,
 Adams J. ,
 Brown P. O. ,
 Rosenzweig F. ,
 Botstein D. ,
Citation Manager Formats
More Articles of This Classification
Physical Sciences
Applied Mathematics
Biological Sciences
Genetics
Related Content
 No related articles found.
Cited by...
 Coordinated Metabolic Transitions During Drosophila Embryogenesis and the Onset of Aerobic Glycolysis
 A tensor higherorder singular value decomposition for integrative analysis of DNA microarray data from different studies
 Metagene projection for crossplatform, crossspecies characterization of global transcriptional states
 Discovery of principles of nature from mathematical modeling of DNA microarray data