BIOLOGICAL SCIENCES / EVOLUTION
Tissue-driven hypothesis of genomic evolution and sequence-expression correlations
*School of Life Sciences, Institutes of Biomedical Sciences, Center for Evolutionary Biology, Fudan University, Shanghai 200433, China; and
Department of Genetics, Development, and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, IA 50011
Communicated by Jiazhen Tan, Fudan University, Shanghai, China, December 18, 2006 (received for review September 30, 2006)
| Abstract |
|---|
expression divergence | tissue expression | mammalian genomics | gene duplications
We recognize that, without an explicit evolutionary model that provides a common ground for prediction and testing by coherent data analyses, it is difficult to reach a comprehensive understanding of these issues. In this article, we develop a stochastic model for genomic evolution under the principle of stabilizing selection and formulate the tissue-driven hypothesis by postulating that stabilizing selection on both expression and sequence divergence may be affected simultaneously by common factors of the tissues in which the genes are expressed. Facilitated by substantial multispecies microarray data (16), we test several genomic correlations predicted by the tissue-driven hypothesis. Finally, we discuss the evolutionary scenario of genomic correlations, demonstrating that accumulated tissue constraints may shape the correlated pattern of sequence and expression evolution.
| The Model |
|---|
|
|
where $\theta_e$ is the optimal value of the expression level and $w_{ti}$ is the coefficient of stabilizing selection on gene expression in tissue ti; a large $w_{ti}$ means a strong selection pressure, and vice versa (Fig. 1). Under the stabilizing model of Eq. 1, we have shown that the expression divergence follows an Ornstein-Uhlenbeck (OU) process (23). The stochastic OU process is characterized by the infinitesimal mean $-\beta_0(x - \theta_e)$ and variance $\sigma^2/2N_e$, where $\sigma^2$ is the mutational variance, $N_e$ is the effective population size, and $\beta_0 = w_{ti}\sigma^2$ measures the direct force against the deviation from the optimum. Given the initial expression value $x_0$, the OU model implies that $x(t)$ follows a normal distribution with the mean and variance given by

$$E[x \mid x_0] = \theta_e + (x_0 - \theta_e)\,e^{-\beta t}, \qquad V[x \mid x_0] = \frac{\sigma^2\,(1 - e^{-2\beta t})}{2\beta} \qquad [2]$$

respectively, where $\beta = 2N_e\beta_0$ is the decay rate of expression divergence.
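$\theta_e$), from Eq. 2 we have $E[x_1 \mid x_0] = E[x_2 \mid x_0] = \theta_e$. If gene expression diverged independently along each lineage, we have $E[x_1 x_2 \mid x_0] = E[x_1 \mid x_0]\,E[x_2 \mid x_0] = \theta_e^2$ and $E[x_1 x_2] = E[\theta_e^2]$, resulting in $\mathrm{Cov}(x_1, x_2) = \mathrm{Var}(\theta_e)$. In the same manner, one can show $V(x_1) = V(x_2) = \sigma^2(1 - e^{-2\beta t})/2\beta + \mathrm{Var}(\theta_e)$. Therefore, the expression distance for any gene pair $g$ in tissue ti, $E_{ti,g} = E[(x_1 - x_2)^2]$, is given by

$$E_{ti,g} = \frac{\sigma_g^2\,(1 - e^{-2\beta_g t})}{\beta_g} = \frac{1 - e^{-2\beta_g t}}{W_{ti,g}} \qquad [3]$$

where $\sigma_g^2$ is the mutational variance, $\beta_g$ is the decay rate of expression divergence of gene pair $g$, and $W_{ti,g} = \beta_g/\sigma_g^2$ is the strength of stabilizing selection on expression divergence. Thus, $E_{ti,g}$ is inversely related to $W_{ti,g}$. When $t \to \infty$, $E_{ti,g} = 1/W_{ti,g}$.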
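The two limits of this expression distance can be checked numerically. The following is a minimal sketch, assuming the closed form $E = \sigma^2(1 - e^{-2\beta t})/\beta$ that follows from combining the variance and covariance terms above; the function name `expression_distance` and all parameter values are illustrative and do not come from the article.

```python
import math


def expression_distance(beta: float, sigma2: float, t: float) -> float:
    """Expected squared expression difference E[(x1 - x2)^2] for one gene pair.

    Assumed form: sigma2 * (1 - exp(-2*beta*t)) / beta, which equals
    (1 - exp(-2*beta*t)) / W with W = beta / sigma2.
    """
    if beta <= 0 or sigma2 <= 0 or t < 0:
        raise ValueError("beta and sigma2 must be positive, t non-negative")
    # -expm1(-u) computes 1 - exp(-u) without losing precision for small u.
    return sigma2 * -math.expm1(-2.0 * beta * t) / beta


# Illustrative parameters: W = beta / sigma2 = 0.25, so the plateau is 1/W = 4.
short_time = expression_distance(beta=0.5, sigma2=2.0, t=1e-6)
long_time = expression_distance(beta=0.5, sigma2=2.0, t=1e9)
```

For short divergence times the value is close to $2\sigma^2 t$ (Brownian-like growth), and for long times it saturates at $\sigma^2/\beta = 1/W$, the steady state described in the text.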
Tissue-Dependent Evolutionary Rate of Protein Sequence.
Gu (24) studied the evolutionary rate of a protein sequence, based on the principle that stabilizing selection on protein function generates sequence conservation. In the case of a single protein function $y$ (such as enzyme activity or DNA-binding affinity, also called a molecular phenotype), the stabilizing selection on $y$ follows a simple Gaussian form (Fig. 2)

$$f(y) = \exp\left\{-\frac{(y - \theta_g)^2}{2\sigma_w^2}\right\} \qquad [4]$$

Thus, the coefficient of selection on $y$ is given by $s(y) = 1 - f(y) \approx (y - \theta_g)^2/2\sigma_w^2$. On the other hand, random (nonsynonymous) mutations in the coding region affect the molecular phenotype $y$ according to a distribution with mean $\theta_g$ and variance $\sigma_m^2$ (Fig. 2). Consequently, the mean selection coefficient is given by $\bar{s} = E[(y - \theta_g)^2]/2\sigma_w^2 = \sigma_m^2/2\sigma_w^2$, and the selection intensity is $S_g = 4N_e\bar{s} = 2N_e\sigma_m^2/\sigma_w^2$. In the general case of multiple ($K$) molecular phenotypes of protein function, Gu (24) has shown

$$S_g = 2N_e \sum_{i=1}^{K} \frac{\sigma_{m,i}^2}{\sigma_{w,i}^2} \qquad [5]$$

where the subscript $i$ assigns $\sigma_{m,i}^2$ and $\sigma_{w,i}^2$ specific to the $i$th molecular phenotype.
($\sigma_{w,i}^2$) may be tissue-dependent, which can be modeled as $\sigma_{w,i}^2 = a_i^2/Z_g$, $i = 1, \ldots, K$. Whereas each $a_i^2$ is a tissue-independent constant, the tissue factor $Z_g$ measures the accumulated tissue effect on fitness; a larger $Z_g$ means a greater tissue effect. For gene $g$ expressed in $L_g$ different tissues, we implement an additive-effect model $Z_g = \sum_{j=1}^{L_g} L_j$, in which $L_j$ is the contribution from tissue $j$. The mean selection intensity in Eq. 5 can be rewritten in terms of tissue dependency

$$S_g = S_{g,0}\,Z_g \qquad [6]$$

where $S_{g,0} = 2N_e \sum_{i=1}^{K} \sigma_{m,i}^2/a_i^2$ is the tissue-independent component. Hence, given the mutation rate $v$, the evolutionary rate of gene $g$ is given by
$$\lambda_g \approx v\left(1 - \frac{S_g}{2}\right) = v\left(1 - \frac{S_{g,0}\,Z_g}{2}\right) \qquad [7]$$

Eq. 7 links tissue effects to the evolutionary rate of the protein sequence: the evolutionary rate decreases when the accumulated tissue effect $Z_g$ is strong, and vice versa.
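The factorization of the selection intensity into a tissue-independent part and the accumulated tissue effect can be verified with a few lines of arithmetic. This is a sketch under the stated model only: it assumes the additive tissue effect and the substitution $\sigma_{w,i}^2 = a_i^2/Z_g$ described above. The function names and every numeric value are hypothetical.

```python
def accumulated_tissue_effect(tissue_contributions):
    """Additive-effect model: Z_g is the sum of per-tissue contributions."""
    return sum(tissue_contributions)


def selection_intensity(ne, sigma_m2, a2, z_g):
    """S_g = 2*Ne * sum_i sigma_m,i^2 / sigma_w,i^2, with sigma_w,i^2 = a_i^2 / Z_g.

    sigma_m2 and a2 hold one value per molecular phenotype (i = 1..K).
    """
    if len(sigma_m2) != len(a2):
        raise ValueError("sigma_m2 and a2 must have the same length K")
    return 2.0 * ne * sum(sm / (a / z_g) for sm, a in zip(sigma_m2, a2))


# Hypothetical gene expressed in two tissues, with K = 2 molecular phenotypes.
z_g = accumulated_tissue_effect([0.5, 1.5])
s_g = selection_intensity(ne=1000, sigma_m2=[1e-4, 2e-4], a2=[1.0, 2.0], z_g=z_g)

# Tissue-independent component S_g0 = 2*Ne * sum_i sigma_m,i^2 / a_i^2.
s_g0 = 2.0 * 1000 * (1e-4 / 1.0 + 2e-4 / 2.0)
```

With these numbers `s_g` equals `s_g0 * z_g`, so doubling the accumulated tissue effect doubles the selection intensity, which is the sense in which broader or more constrained expression implies stronger sequence conservation.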
Tissue-Driven Hypothesis and Predictions. The tissue-driven hypothesis of genomic evolution postulates that the tissue factor plays an important role of functional constraint on the rate of genomic evolution, because genes influence phenotypic characters by expression in specific tissues. The phenotypic consequences of genetic variations in regulatory and coding sequences are both affected by the common microenvironment of tissues. Below, we discuss several predicted genomic correlations that can be tested by the genomic data.
Tissue Expression Distance (Eti).
To measure the expression difference of a tissue between two species, we define $E_{ti}$ as the mean expression distance over $N$ orthologous genes in tissue ti, that is, $E_{ti} = \sum_g E_{ti,g}/N$, where $E_{ti,g}$ is given by Eq. 3. Under some moderate conditions, $E_{ti}$ can be approximated by

$$E_{ti} \approx \frac{1 - e^{-2\bar{\beta} t}}{W_{ti}} \qquad [8]$$

[see supporting information (SI) Appendix], where the mean tissue factor $W_{ti}$ is the (harmonic) average of the $W_{ti,g}$ values, $\bar{\beta}$ is the mean decay rate of expression divergence, and $t$ is the time of speciation. Eq. 8 indicates that the tissue expression distance increases with time $t$ and decreases with the mean tissue factor $W_{ti}$. When $\bar{\beta}$ is close to 0 (very weak stabilizing selection) or $t$ is short (closely related species), Eq. 8 reduces to $E_{ti} \approx 2\bar{\sigma}^2 t$, i.e., the Brownian model (12, 13), where $\bar{\sigma}^2$ is the mean mutational variance over genes. In the case of distantly related species, when the expression divergence approaches the steady state, the time-dependent term in Eq. 8 vanishes, resulting in $E_{ti} \approx 1/W_{ti}$.
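Why the mean tissue factor is a harmonic rather than an arithmetic average can be seen at the steady state, where each gene pair contributes a distance of $1/W_{ti,g}$. The sketch below assumes that steady-state form; the function name and the three $W$ values are made up for illustration.

```python
from statistics import harmonic_mean, mean


def steady_state_tissue_distance(w_values):
    """Mean over genes of the steady-state distances 1/W_g for one tissue."""
    return mean(1.0 / w for w in w_values)


# Hypothetical per-gene selection strengths in one tissue.
w_genes = [1.0, 2.0, 4.0]

tissue_distance = steady_state_tissue_distance(w_genes)
mean_tissue_factor = harmonic_mean(w_genes)
```

Here `tissue_distance` equals `1 / mean_tissue_factor`: averaging reciprocals over genes is the same as taking the reciprocal of the harmonic mean, so weakly constrained genes (small $W$) dominate the tissue-level distance.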
Tissue Expression and Sequence Distances: The $E_{ti}$-$D_{ti}$ Correlation.
Let $d_g$ be the evolutionary distance between an orthologous gene pair ($g$). For a set ($N_{ti}$) of genes that are expressed in tissue ti, the mean evolutionary distance is given by $D_{ti} = \sum_g d_g/N_{ti}$. Because $d_g = 2\lambda_g t$, where $\lambda_g$ is given by Eq. 7, we have shown that the mean selection intensity of tissue (ti)-expressed protein sequences can be written as $\bar{S}_{ti} \approx S_0(Z_{ti} + c)$ (see SI Appendix); $Z_{ti}$ is the mean of accumulated tissue (ti) factors over expressed genes, $S_0$ is the mean of tissue-independent components, and $c$ is a constant. Thus, we have

$$D_{ti} \approx D_0\left[1 - \frac{S_0\,(Z_{ti} + c)}{2}\right] \qquad [9]$$

where $D_0 = 2vt$.
According to the tissue-driven hypothesis, the two mean tissue factors $W_{ti}$ and $Z_{ti}$ should be positively correlated, because they represent the effects of the common microenvironment of tissue ti on expression divergence and protein sequence conservation, respectively. This argument predicts a positive correlation between tissue expression distance ($E_{ti}$) and tissue sequence distance ($D_{ti}$). In the special case when $Z_{ti} = W_{ti}$ and $E_{ti} \approx 1/W_{ti}$ (steady-state expression divergence), we obtain the following form

$$D_{ti} \approx D_0\left(a - \frac{b}{E_{ti}}\right) \qquad [10]$$

where $a = 1 - S_0 c/2$ and $b = S_0/2$.
Interspecies and Interduplicate Tissue Expression Divergence: The $E_{ti}$-$T_{dup}$ Correlation.
The tissue-driven hypothesis also predicts that tissue factors may affect the expression divergence between duplicate genes. Consider a pair of duplicate genes that have diverged for $\tau$ evolutionary time units. Under a similar stabilizing selection model, one can obtain the expression distance between duplicated genes, $T_{dup,ti,g}$, which takes virtually the same form as Eq. 3. To be clear, we use $Q_{ti,g}$ for the tissue factor of expression divergence between duplicate pair $g$. For a set ($N_{dup}$) of duplicate genes, let $T_{dup,ti} = \sum_g T_{dup,ti,g}/N_{dup}$ be the tissue (ti) duplicate distance. Similar to Eqs. 5 and 6, we have

$$T_{dup,ti} \approx \frac{1 - e^{-2\bar{\beta}\bar{\tau}}}{Q_{ti}} \qquad [11]$$

where $Q_{ti}$ is the mean tissue factor for the interduplicate expression divergence in tissue ti, $\bar{\beta}$ is the mean decay rate of expression divergence, and $\bar{\tau}$ is the mean evolutionary time of the duplicate gene set. Hence, positively correlated $W_{ti}$ and $Q_{ti}$ under the tissue-driven hypothesis lead to a testable prediction of a positive correlation between $E_{ti}$ and $T_{dup}$. In particular, a linear $E_{ti}$-$T_{dup}$ relationship is expected when $W_{ti} = Q_{ti}$.
Tissue Broadness and Preference.
One can rewrite the accumulated tissue effect on gene $g$ in Eq. 6 as $Z_g = L_g \times \bar{z}_g$, where $L_g$ is the number of tissues in which gene $g$ is expressed, and $\bar{z}_g$ is the average tissue factor for gene $g$. In fact, $\bar{z}_g$ measures the effect of tissue preference, or tissue types, on the expression divergence. In short, the accumulated tissue effect can be decomposed into two factors: tissue broadness ($L_g$) and tissue preference ($\bar{z}_g$). The protein sequence becomes more conserved if the gene is expressed in more tissues or in tissues with more stringent constraints.
Although many studies have shown the effect of tissue broadness (9, 17, 25), the effect of tissue preference has not been well investigated. We address this issue by grouping genes with the same tissue broadness ($L_g$). When $L_g$ is the same, the larger the $\bar{z}_g$ value, the greater the selection intensity $S_g$ and so the lower the evolutionary rate $\lambda_g$. This prediction can be tested under the tissue-driven hypothesis, which claims a positive correlation between $W_{ti,g}$ and $Z_{ti,g}$ (see below).
| Results |
|---|
Tissue Expression Divergence Between Human and Mouse. Based on 8,936 human-mouse orthologs, we estimated the tissue expression distance $E_{ti}$ for each of 29 tissues. Fig. 3 shows substantial variation of $E_{ti}$ among tissues. Indeed, there is a 2.4-fold difference from the lowest $E_{ln} = 0.085$ (lymph node, ln) to the highest $E_{pc} = 0.206$ (pancreas, pc).
Correlation ($E_{ti}$-$D_{ti}$) Between Tissue Expression and Sequence Divergence. For each tissue ti, we calculated the tissue protein distance ($D_{ti}$) between human and mouse. Similar to $E_{ti}$, the observed variation of $D_{ti}$ among tissues may indicate the tissue's role in protein evolution. Moreover, the tissue-driven hypothesis expects covariation between $E_{ti}$ and $D_{ti}$, because it postulates that the same tissue-specific developmental constraint may affect both tissue expression divergence and sequence divergence of expressed proteins. We indeed found a highly significant correlation between $E_{ti}$ and $D_{ti}$ based on 29 human-mouse tissues (Fig. 4). In the case of high expression (Fig. 4A), the (Pearson) coefficient of correlation is R = 0.55 (P < 0.001), whereas R = 0.66 (P < 0.001) in the case of normal expression (Fig. 4B). Use of the Spearman rank correlation results in very similar P values (<0.001). Hence, the significance of the $E_{ti}$-$D_{ti}$ correlation provides statistical evidence to support the tissue-driven hypothesis. In addition to the two cutoffs presented in Fig. 4 A and B, we have examined several other criteria for gene expression and found that the $E_{ti}$-$D_{ti}$ correlation is robust against the choice of cutoff (data not shown).
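For readers who want to repeat this kind of test on their own per-tissue values, the two correlation coefficients mentioned here can be computed as in the following sketch. It is plain Python rather than the authors' analysis code, ties are handled with average ranks, P values are not computed, and the example numbers are invented and unrelated to the article's data.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: monotone but nonlinear, so the rank correlation is exactly 1
# while the Pearson coefficient is slightly lower.
e_values = [1.0, 2.0, 3.0, 4.0]
d_values = [1.0, 4.0, 9.0, 16.0]
r_pearson = pearson(e_values, d_values)
r_spearman = spearman(e_values, d_values)
```

Reporting both coefficients, as done in the text, guards against a result that depends on a few extreme tissues: the rank-based measure is insensitive to the magnitude of outlying values.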
| Discussion |
|---|
Functional Constraint vs. Positive Selection.
A basic assumption of the tissue-driven hypothesis is that the genome evolves largely under functional constraints maintained by stabilizing selection at levels from cell physiology to development. In some evolutionary lineages, episodic adaptive selection may happen either in expression pattern or in protein function (9, 21, 26-29). For instance, hundreds of genes (≈2% of human genes) showed dramatic brain-specific expression shifts in the human lineage (3, 4, 26, 27). When the tissue-driven hypothesis is extended to include adaptive selection, we found that the predictions for both the $E_{ti}$-$D_{ti}$ and the $E_{ti}$-$T_{dup}$ correlations hold. We have examined the rapid-shift (S) model of expression divergence (12). In this case, one can show that the tissue expression distance in Eq. 8 can be modified as $E_{ti} = S_{hm} + (1 - e^{-2\bar{\beta}t})/W_{ti}$, and the duplicate tissue distance in Eq. 11 as $T_{dup} = S_{dup} + (1 - e^{-2\bar{\beta}\bar{\tau}})/Q_{ti}$, where $S_{hm}$ and $S_{dup}$ are the rapid-shift components between human-mouse genes and between duplicate genes, respectively. Except for extreme cases, $S_{hm}$ and $S_{dup}$ apparently do not affect the predicted genomic correlations.
Effect of Expression Level on Protein Sequence Evolution. It has been claimed (9, 17, 18) that highly expressed genes tend to evolve slowly. We have examined this confounding effect of tissue broadness and found that our main results are robust. For instance, the high significance of the $E_{ti}$-$D_{ti}$ correlation (Fig. 4) holds at various cutoffs, from normal to highly expressed genes. On the other hand, our model can be extended to take the effect of expression level into account, e.g., by assuming that the tissue factor $Z_g$ is expression level-dependent.
Tissue Expression Pattern in Primates and Mammals. The $E_{ti}$-$D_{ti}$ correlation between human and chimpanzee has been investigated by Khaitovich et al. (8), based on five tissues (brain, liver, heart, kidney, and testis). However, our reanalysis of the same data sets leads to a nonsignificant result (Spearman rank test, P > 0.2), as opposed to the original claim (8) (Pearson correlation, P < 0.05). It is known that the Pearson correlation can be too liberal when the sample size is small. Because the current study (29 human-mouse tissues) includes these five tissues, we did observe a roughly consistent ranking in $E_{ti}$ or $D_{ti}$, i.e., the lowest values in the cerebellum/brain and the highest values in the testis. Hence, one may speculate that the $E_{ti}$-$D_{ti}$ correlation holds in both primates and mammals, although more primate microarray data are needed.
Some Technical Issues. We have examined several technical issues that may affect our interpretations. First, our analysis is robust against the noise of microarrays, because the expression variation among biological replicates of microarrays is much smaller than the average expression difference between human and mouse (16). Moreover, using the corrected expression distance (13), a conservative measure of interspecies expression divergence, we obtained virtually the same results (data not shown). Second, the exclusion of young duplicates (5) (arising after the human-mouse split) has almost no effect on our results. Third, we have used several alternative options to determine the status of expression level in a tissue. In all cases, highly significant genomic correlations are observed.
Because of expression leakage or fluctuation, similar observed gene expression profiles do not necessarily mean a similar tissue function. The extent of this nonfunctional expression is subject to debate (30). It seems that expression leakage may be more frequent in tissues with relatively weak developmental constraints. Besides, the evolution of expression can be affected by many factors, such as trans-regulatory elements (31) or alternative splicing isoforms (32). Indeed, evolutionary genomics raises more questions than we can currently solve (30-33).
| Materials and Methods |
|---|
In 20% of cases, multiple tags in the microarray targeted a single gene. We solved this problem by assigning either the averaged or the highest expression value to each of these genes (16). These two treatments provided virtually the same results.
Estimation of Tissue Expression Distance (Eti).
Consider a set ($N$) of orthologous genes between species 1 (human) and species 2 (mouse). Let $x_{g1,ti}$ and $x_{g2,ti}$ be the (log2-transformed) expression levels of the $g$th orthologous gene in tissue ti in species 1 and species 2, respectively. Under the OU model, one can show that the tissue (ti) expression distance defined in Eq. 8 can be estimated as follows

$$\hat{E}_{ti} = \frac{1}{N}\sum_{g=1}^{N}\left(x_{g1,ti} - x_{g2,ti}\right)^2 \qquad [12]$$
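A minimal sketch of such an estimator is given below. It assumes that the estimate is the mean squared difference of log2-transformed expression levels over the ortholog set, which matches the definition of the expression distance as $E[(x_1 - x_2)^2]$; any normalization or correction applied in the original analysis is not reproduced. The function name and the input values are hypothetical.

```python
import math


def tissue_expression_distance(human_levels, mouse_levels):
    """Mean squared log2 expression difference over orthologous gene pairs.

    Both arguments hold raw, strictly positive expression levels for the same
    genes, in the same order, measured in one tissue.
    """
    if len(human_levels) != len(mouse_levels) or not human_levels:
        raise ValueError("need two non-empty sequences of equal length")
    squared_diffs = [
        (math.log2(h) - math.log2(m)) ** 2
        for h, m in zip(human_levels, mouse_levels)
    ]
    return sum(squared_diffs) / len(squared_diffs)


# Two hypothetical orthologs: a twofold difference and no difference.
example_distance = tissue_expression_distance([2.0, 8.0], [1.0, 8.0])
```

In this toy case the log2 differences are 1 and 0, so the distance is 0.5. The same function could be applied to the two members of each duplicate pair within one species to obtain an interduplicate distance.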
Estimation of Tissue Protein Distance ($D_{ti}$). We calculated $D_{ti}$ as the mean of the evolutionary distances of proteins that are expressed in tissue ti. For each gene, the evolutionary distance was estimated by the Poisson correction; other methods gave virtually the same results (data not shown). For each tissue ti, we inferred the status of gene expression as follows: (i) High expression: the AffyRatio of the gene is above the median expression among 79 human tissues (16). (ii) Normal expression: calculate the percentages of AD counts (adjusted by the background AD = 200) of the gene in all 29 tissues and then, in descending order, select the expressed tissues of the gene until the accumulated AD percentage reaches 97.5%. This approach may avoid some spurious high AD counts.
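The "normal expression" rule (ii) is a cumulative-share cutoff and can be sketched as follows. The text does not spell out how the background adjustment is applied, so this example assumes it means subtracting 200 from each AD count and flooring at zero; that choice, the function name, and the example counts are assumptions for illustration only.

```python
def expressed_tissues(ad_counts, background=200.0, cutoff=0.975):
    """Tissues in which a gene counts as expressed under the cumulative rule.

    ad_counts maps tissue name -> AD count for one gene. Counts are
    background-adjusted (assumed: subtract and floor at zero), converted to
    shares of the total, and tissues are taken in descending order until the
    accumulated share reaches the cutoff.
    """
    adjusted = {t: max(ad - background, 0.0) for t, ad in ad_counts.items()}
    total = sum(adjusted.values())
    if total == 0:
        return []  # nothing above background in any tissue
    selected, accumulated = [], 0.0
    ranked = sorted(adjusted.items(), key=lambda item: item[1], reverse=True)
    for tissue, value in ranked:
        if accumulated >= cutoff or value <= 0:
            break
        selected.append(tissue)
        accumulated += value / total
    return selected


# Hypothetical AD counts for one gene in four tissues.
counts = {"liver": 1000, "kidney": 400, "brain": 210, "testis": 150}
tissues = expressed_tissues(counts)
```

With these counts the adjusted values are 800, 200, 10, and 0; liver and kidney together account for about 99% of the total, so the weak brain signal is excluded. The number of selected tissues is the tissue broadness of the gene.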
Estimation of Tissue Duplicate Distance (Tdup) for Expression Divergence.
Consider a set ($N_{dup}$) of duplicate gene pairs. For the $j$th duplicate pair, the expression levels (AffyRatio) of the two duplicate genes in a given tissue (ti) are denoted by $x_j$ and $y_j$, respectively. Then, similar to the calculation of $E_{ti}$ in Eq. 12, we estimate $T_{dup}$ by the formula

$$\hat{T}_{dup} = \frac{1}{N_{dup}}\sum_{j=1}^{N_{dup}}\left(x_j - y_j\right)^2 \qquad [13]$$

A large $T_{dup}$ value reflects the plasticity of tissue-specific developmental constraint that allows more expression divergence between duplicate genes.
Estimation of Tissue Broadness and Preference.
The number ($L_g$) of tissues in which gene $g$ is expressed, or the tissue broadness, can be inferred as described above. For gene $g$ that is expressed in $L_g$ different tissues, let $E_j$ ($j = 1, \ldots, L_g$) be the $j$th tissue expression distance between human and mouse. Because a large $E_j$ means less tissue constraint on expression divergence, we propose an index that can be used to measure the effect of tissue preference as follows

$$t_g = \frac{1}{L_g}\sum_{j=1}^{L_g}\frac{1}{E_j} \qquad [14]$$

where the tissue expression distance $E_j$ is estimated by Eq. 12. In particular, when the expression divergence is close to the steady state, we have $E_j \approx 1/W_j$, so that $t_g$ is an estimate of the mean tissue factor $\bar{W}_g$, which is a proxy for the tissue preference $\bar{z}_g = Z_g/L_g$ under the tissue-driven hypothesis that predicts $W_j \propto Z_j$, creating a negative correlation between $t_g$ and the evolutionary distance of the protein sequence ($d_g$).
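One way to compute such an index is sketched here. It assumes that the index is the average of the reciprocal tissue expression distances over the tissues in which the gene is expressed, which is consistent with the steady-state relation $E_j \approx 1/W_j$ but is an assumption about the exact formula. The function name, tissue names, and distances are hypothetical.

```python
def tissue_preference_index(expressed_in, tissue_distance):
    """Average of 1/E_j over the tissues in which a gene is expressed.

    expressed_in: tissue names for one gene (its length is the broadness L_g).
    tissue_distance: maps tissue name -> tissue expression distance E_j.
    """
    if not expressed_in:
        raise ValueError("gene must be expressed in at least one tissue")
    return sum(1.0 / tissue_distance[t] for t in expressed_in) / len(expressed_in)


# Hypothetical tissue expression distances.
distances = {"brain": 0.1, "liver": 0.2, "testis": 0.25}

constrained_gene = tissue_preference_index(["brain", "liver"], distances)
relaxed_gene = tissue_preference_index(["liver", "testis"], distances)
```

Both genes have the same broadness (two tissues), but the gene expressed in the lower-distance, more constrained tissues receives the larger index, which under the hypothesis would go with a smaller protein distance.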
| Footnotes |
|---|
Abbreviations: OU, Ornstein-Uhlenbeck.
To whom correspondence should be addressed at: 536 Science II Hall, Iowa State University, Ames, IA 50011. E-mail: xgu{at}iastate.edu
Author contributions: X.G. designed research; X.G. and Z.S. performed research; X.G. contributed new reagents/analytic tools; X.G. and Z.S. analyzed data; and X.G. and Z.S. wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0610797104/DC1.
© 2007 by The National Academy of Sciences of the USA