BIOLOGICAL SCIENCES / EVOLUTION
Adaptive evolution in humans revealed by the negative correlation between the polymorphism and fixation phases of evolution



*Department of Biological Sciences, Graduate School of Science, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan;
Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, IL 60637; and
Department of Genome Sciences, University of Washington, 1705 NE Pacific Street, #K-357, HSB J-279, Seattle, WA 98195
Edited by Tomoko Ohta, National Institute of Genetics, Mishima, Japan, and approved January 4, 2007 (received for review July 5, 2006)
| Abstract |
|---|
10–13% of amino acid substitutions between humans and chimpanzee may be adaptive.
amino acid substitution | deleterious mutation | positive selection | selective constraint
To investigate the evolutionary dynamics and correlation between these two phases, we need to be able to classify mutations in coding regions into more categories than the traditional nonsynonymous and synonymous dichotomy. We follow the procedure developed by Tang et al. (refs. 5 and 6, see also ref. 7), who classified amino acid changes into 75 elementary types. An elementary amino acid change is one that can be reached by a 1-bp substitution in the codons (Phe to Leu, for example). The remaining 115 types are all composites of two or three elementary changes (Phe to Pro, for example). One can then ask how these 75 elementary changes differ in their likelihoods of becoming polymorphic, or fixed, and how the dynamics in these two phases are correlated.
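As an illustration of the elementary/composite distinction, the following minimal sketch enumerates amino acid pairs connected by a single nucleotide substitution. It assumes the standard genetic code and excludes changes to or from stop codons; the function and variable names are ours and are not taken from Tang et al.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def elementary_changes():
    """Return the unordered amino acid pairs linked by one 1-bp codon change."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                aa2 = CODON_TABLE[mutant]
                # Skip synonymous changes and changes involving a stop codon.
                if aa != aa2 and "*" not in (aa, aa2):
                    pairs.add(frozenset((aa, aa2)))
    return pairs


pairs = elementary_changes()
print(len(pairs))                  # number of elementary changes
print(frozenset("FL") in pairs)    # Phe <-> Leu
print(frozenset("FP") in pairs)    # Phe <-> Pro
```

Under these assumptions the enumeration yields 75 pairs, leaving 190 − 75 = 115 composite types; Phe/Leu is in the set and Phe/Pro is not.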
Here, we describe an analysis of human polymorphism in relation to human–chimpanzee divergence by using a variety of publicly available data sets. Human polymorphism data are often collected in a way that results in strong ascertainment biases, which we demonstrate favor nonsynonymous over synonymous SNPs. (The bias potentially makes the estimation of positive selection conservative.) The comparisons among different amino acid changes help to ameliorate this bias, thereby allowing meaningful inferences on the forces shaping patterns of protein evolution to be made.
| Results |
|---|
Because of these different strategies for genotyping SNPs among databases, we first studied patterns of polymorphism and divergence for nonsynonymous and synonymous changes in each database to investigate possible ascertainment bias introduced by variable SNP discovery strategies.
Contrasting Divergence and Polymorphism for Nonsynonymous (A) Vs. Synonymous (S) Changes. All SNPs are classified as either A (which stands for "amino acid altering") or S. We polarized them into ancestral vs. derived states on the basis of the chimpanzee sequence, and our analysis is based on derived allele frequencies. The A/S ratios across the frequency spectrum, together with those for fixed changes between human and chimpanzee, are shown in Fig. 1 and supporting information (SI) Fig. 4.
Perlegen Data.
Similar to many previous studies (1, 2, 14), the Perlegen data shows a significantly larger A/S ratio for the lowest frequency class (≤20%) (χ² = 47.2, P < 0.001; Fig. 1a). The highest frequency class has a reduced A/S ratio, but that is perhaps due to the small sample size in that class. A straightforward explanation for the excess in A/S in the lowest frequency class is the presence of slightly deleterious amino acid polymorphisms (15, 16). These amino acid variants can rise to a modest frequency before selection overcomes the effect of genetic drift. We used 20% as the frequency cutoff for common and rare SNPs. This cutoff value is reasonable, because deleterious mutations rarely reach the frequency of 20% unless |2Ns| < 10, where s denotes selective coefficient and N denotes effective population size (ref. 14, and see SI Fig. 5). Note that the results presented are insensitive to the choice of cutoff (see details in Discussion).
In Table 1, we summarize the observed and expected SNP patterns. The ratio between the first two A/S ratios is referred to as the polymorphism index (PI), defined as [Arare/Srare]/[Amutation/Smutation]. PI is the likelihood that a new amino acid substitution will become polymorphic, relative to that for a synonymous mutation. We also defined the fixation index (FI = [Adivergence/Sdivergence]/[Acommon/Scommon]) as the likelihood that an amino acid variant that exists at moderate to high frequency (>20%) will become fixed, relative to that for a synonymous variant at the same frequency. This is a modified McDonald and Kreitman test (MK test) (3), which includes all of the variants. In general, FI > 1 has been accepted as a likely indication of positive selection in action. Note that the ascertainment bias toward middle frequency variants, especially the nonsynonymous ones, would lead to the underestimation of adaptive evolution. As a result, our conclusion would be a conservative one.
… (χ² = 8.45, P < 0.05), the magnitude seems too small to be biologically meaningful. Furthermore, when data from multiple loci are combined, Shapiro et al. (2) have recently shown that the neutral expectation for the FI is not necessarily equal to 1. For example, following the procedure outlined by Shapiro et al. (2), we calculated the neutral value of the FI for the Perlegen data to be 1.128, which is indeed larger than the observed value. Hence, there is no evidence of positive selection by this analysis.
HapMap Data.
The HapMap SNP patterns are summarized in Table 1. The A/S ratios of new mutations and divergence are comparable with those of the Perlegen data. However, the A/S ratios of both rare (≤20%) and common (>20%) SNPs are larger than those in the Perlegen data. The expected neutral FI is 1.023, which is greater than the observed FI, 0.600. The observed PI is 0.636 (2.041/3.209), which is also substantially higher than the estimate from the Perlegen data.
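The two indices defined earlier, PI = [Arare/Srare]/[Amutation/Smutation] and FI = [Adivergence/Sdivergence]/[Acommon/Scommon], are simple ratios of A/S ratios. A minimal sketch follows; the function names are ours, the PI inputs are the HapMap values quoted above, and the FI inputs are hypothetical values chosen only to show the call.

```python
def polymorphism_index(as_rare, as_mutation):
    """PI: A/S ratio of rare SNPs relative to the A/S ratio of new mutations."""
    return as_rare / as_mutation


def fixation_index(as_divergence, as_common):
    """FI: A/S ratio of fixed differences relative to that of common SNPs."""
    return as_divergence / as_common


# HapMap values quoted in the text: rare A/S = 2.041, new-mutation A/S = 3.209.
pi_hapmap = polymorphism_index(2.041, 3.209)
print(round(pi_hapmap, 3))

# Hypothetical A/S ratios (not from the paper), for illustration only.
fi_example = fixation_index(0.75, 0.60)
print(fi_example > 1)  # FI > 1 is the pattern usually read as positive selection
```

The first call reproduces the PI of 0.636 reported for HapMap. An observed FI should be compared with the neutral expectation for the data set rather than with 1, as noted for the Perlegen data.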
As can be seen in Fig. 1b, the A/S ratios are very similar up to the highest level of polymorphism (>90%), but there is a steep drop in divergence that cannot be accounted for by demography, and no other data set shows this pattern. Nor can it be explained by the presence of slightly deleterious mutations. The most logical explanation is the strong bias in the inclusion of nonsynonymous polymorphisms over synonymous ones in the HapMap data.
Seattle + NIEHS Data. The SNP patterns are given in SI Table 3. The combined SeattleSNPs and NIEHS data sets show two distinctive features. First, the A/S ratio in the new mutation class is unusually low (2.3 vs. 3.2 in the larger data sets). Second, the A/S ratio for the highest frequency class is more than twice that for other common polymorphisms (1.333 vs. 0.589). This result may be due to the small number of genes, which were chosen for their possible implications in immunity-related diseases (17). Fay and Wu's H test (18) is significant for many genes in the Seattle + NIEHS data set, indicating the excess of very high-frequency variants. Whether this excess is an indication of hitchhiking with advantageous mutation is not the focus of this study. We therefore excluded the 80–100% frequency class of SNPs from the MK test. The divergence A/S ratio (0.747) is indeed significantly higher than that of the common polymorphism (0.589). More accurately, the observed FI (1.205, based on polymorphisms between 20–80%) is larger than expected (1.079).
Summary of the MK Test Contrasting A and S Changes. Although the combined SeattleSNPs and NIEHS data sets show some evidence of adaptive protein evolution by the MK test, the set of genes appears to be somewhat uncommon in their function (17). Indeed, the small number of genes chosen for specific purposes makes the extension of this result to the whole genome quite uncertain. Between the two large "genome-scale" data sets, ascertainment may have upwardly biased the polymorphic A/S ratio, wiping out any potential signal of positive selection detectable by the MK test. The following section is designed to see whether the signal of positive selection is indeed absent in the human–chimpanzee comparison or whether it has merely been obscured by ascertainment bias.
Contrasting Divergence and Polymorphism for the 75 Elementary Amino Acid Changes. The potential biases toward collecting common amino acid polymorphisms complicate inferences of adaptive evolution and attenuate the power of statistical methods designed for complete sequence data. Therefore, we compared different classes of amino acid changes. Among the 190 (20 × 19/2) possible amino acid changes, 75 are referred to as elementary amino acid changes, which differ by 1 bp in their codons (5). We assume that there is much less ascertainment bias among the 75 elementary changes than between A and S changes. The observations below indeed support this assumption. The justification of adapting the framework of the MK test to analyzing different classes of amino acid changes is given in SI Table 4.
Perlegen Data Set.
With large data sets, we can calculate FI for each elementary change (see Materials and Methods). Under strict neutrality, the ratio of polymorphism to divergence (P/D) should be the same across all 75 elementary changes, much like the conventional MK test between A and S changes (Table 2). Again, we used common SNPs (>20%) to calculate FIs. The P/D ratios among the 75 classes in Table 2 are highly heterogeneous (χ² = 186.4, P < 0.001), indicating variation in FI among classes. By sequentially removing each class in the descending order of FI, we found that the 41 classes with the lower FI values are homogeneous, with an average FI of 0.948.
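The heterogeneity test on P/D ratios can be sketched as a Pearson χ² statistic on a K × 2 table of (polymorphism, divergence) counts, one row per class. This is an assumed form of the test, since the exact procedure is not spelled out here, and the counts in the example are hypothetical. The statistic would be compared with a χ² distribution with K − 1 degrees of freedom.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a K x 2 table.

    Each row is (P, D): counts of common polymorphisms and fixed differences
    for one class of amino acid change.
    """
    row_totals = [p + d for p, d in table]
    col_totals = [sum(p for p, _ in table), sum(d for _, d in table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for (p, d), row_total in zip(table, row_totals):
        for observed, col_total in zip((p, d), col_totals):
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(table) - 1


# Hypothetical counts: identical P/D ratios give a statistic of zero.
print(chi_square_homogeneity([(10, 100), (20, 200), (30, 300)]))
# Hypothetical counts with different P/D ratios.
print(chi_square_homogeneity([(10, 20), (20, 10)]))
```

Sequential removal, as described above, would repeat this calculation after dropping the class with the highest FI until the remaining classes no longer differ significantly.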
… if amino acid 1 → amino acid 2 changes are more likely to be deleterious than other types of changes, then they are also more likely to be advantageous. Thus, although negative selection against amino acid 1 → amino acid 2 would often prevent the changes from becoming polymorphic, positive selection is also more effective in driving them to fixation when they do become polymorphic. The opposite dynamics of becoming polymorphic and becoming fixed also explain the lack of power in predicting FI by the conventional measures (see Fig. 2 for examples). Most measures attempt to predict the long-term evolutionary dynamics of amino acid changes, or evolutionary index (EI), in the terminology of Tang et al. (5). Because EI is proportional to PI × FI, a good predictor of PI would obviously be a bad one for FI and vice versa. Therefore, the best predictors from those attempts are probably the ones that do not do particularly well (nor particularly poorly) in either the polymorphism or fixation phase. This compromise applies to other measures of evolutionary dynamics such as percent accepted mutation (PAM) (24) and blocks substitution matrix (Blosum) (25). For similar reasons, physicochemical distance measures of amino acids have not been very successful in predicting EI (5), because most of these measures rely on some evolutionary indices as well.
Finally, if common polymorphisms are all neutral, then FIs among classes should not be statistically different, but Fig. 3 and Table 2 reveal a subset of elementary changes with unusually large FIs that stand apart from the homogeneous group. The dashed line of Fig. 3 separates those that have higher-than-expected FIs (above the dashed line) from the homogeneous group (below the dashed line). The latter all have an FI value of ~0.948, as opposed to the range of values from 1.38 to 4.1 in the former group with an average of 1.71. If we use the MK test on the former group as a whole (see Table 2), then the number of amino acid changes between species in excess of expectation is 3,160 (= 7,589 – [419 × 32,582/3,082]), which is 11.9% (3,160/[7,589 + 18,916]) of the total amino acid substitutions. This proportion is often interpreted to be due to positive selection, although considerable caution has to be exercised in this interpretation (see Discussion).
HapMap Data Set.
The main difference between HapMap and Perlegen appears to be a greater bias toward nonsynonymous polymorphisms in the former compared with the latter. By the χ² test, the top 30 classes have significantly higher FI values (mean = 0.958) than those of the remaining classes (mean = 0.537; see bottom of Table 2). When FI is plotted against PI, the HapMap data behave qualitatively like the Perlegen data (Fig. 3b). The difference is that the FI values stabilize at 0.537 for HapMap. Because this low FI value is true even for very high frequency SNPs (see Fig. 1b), we suggest that this may be close to the neutral FI value, which is much less than 1 due to ascertainment bias, as discussed earlier.
Given the ascertainment bias, it would not be possible to estimate adaptive evolution by the conventional MK test, which contrasts A and S changes. However, if we assume that the homogeneous group, which consists of the bottom 45 classes of amino acid changes, represents neutral variants, then they may be substituted for silent changes in the MK test (see bottom of Table 2). We may thus calculate the excess in amino acid substitutions among the top 30 classes as 3,149.3 (= 7,168 – [437 × 22,990/2,500]), which is 10.4% (3,149.3/[7,168 + 22,990]) of the total. (If we use the same procedure of contrasting low and high FI amino acid changes on the Perlegen data, then the percentage of excess is 12.8%.) The estimated proportion of adaptive changes is surprisingly similar between the two data sets despite their very different absolute values.
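The excess calculations quoted for both data sets share one form: observed divergence in the test classes minus the divergence expected from the polymorphism-to-divergence ratio of a reference class. The sketch below reworks that arithmetic with the counts given in the text; the function name and argument names are ours.

```python
def adaptive_excess(d_test, p_test, d_ref, p_ref):
    """Substitutions in the test classes in excess of the neutral expectation.

    d_test, p_test: divergence and common-polymorphism counts in the test classes.
    d_ref, p_ref:   the same counts in the reference (assumed neutral) class.
    """
    expected = p_test * d_ref / p_ref
    return d_test - expected


# Perlegen: high-FI classes against the reference counts given in the text.
perlegen = adaptive_excess(7589, 419, 32582, 3082)
perlegen_pct = 100 * perlegen / (7589 + 18916)

# HapMap: top 30 classes against the homogeneous bottom 45 classes.
hapmap = adaptive_excess(7168, 437, 22990, 2500)
hapmap_pct = 100 * hapmap / (7168 + 22990)

print(perlegen, perlegen_pct)
print(hapmap, hapmap_pct)
```

The results agree with the values in the text to within rounding: roughly 3,160 substitutions (11.9%) for Perlegen and 3,149.3 (10.4%) for HapMap.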
| Discussion |
|---|
A different and perhaps more important reason to classify amino acid changes into the 75 elementary types is to compare their evolutionary dynamics in the polymorphism and fixation phases of evolution. Because selection operating in these two phases is likely to be different (for example, negative selection is unlikely to play a major role in the fixation phase), the distinction should provide a better resolution for measuring selective pressure.
Another issue is the assignment of the ancestral vs. derived state for any human SNP. The parsimonious assignment using a chimpanzee sequence as the outgroup has an inherent error rate. In SI Fig. 8, we used the macaque sequence as a second outgroup and estimated the error rate in polarizing the ancestral vs. derived variant to be ~0.65%. In other words, of every 100 SNPs, <1 site is expected to have its derived state assigned incorrectly.
The finding of an L-shape negative correlation between FI and PI has a simple interpretation: Amino acid changes that experience stronger negative selection are also more likely to experience stronger positive selection. This finding would imply that dissimilar amino acids are not only more likely to be deleterious but also more likely to be advantageous. The latter implication, that advantageous amino acid changes tend not to be of the conservative kind, has been a point of contention in the literature (26, 27). Indeed, both the neutral theory and the neo-Darwinian view seem to suggest otherwise. For example, Kimura (28) stated a rule of molecular evolution as such: "Those mutant substitutions that are less disruptive to the existing structure and function of a molecule (conservative substitutions) occur more frequently in evolution than more disruptive ones." In the neo-Darwinian view, adaptive evolution is believed to take small incremental steps (29), and conservative amino acid changes certainly fit the bill well. The observation of Fig. 3 is thus somewhat surprising, because almost all adaptive changes have low PIs and, hence, are nonconservative (if we measure the conservativeness by PI). Strictly speaking, conservative vs. radical changes should be defined in physicochemical terms. However, because there are many such terms including molecular weight, volume, surface area, polarity, AWR, pI, aliphatic, aromatic and so on, their relative importance is usually determined by how well each is correlated with evolutionary rate. Thus, conservative measures cannot be decoupled from evolutionary dynamics.
Indeed, physicochemical distances are usually developed to predict the evolutionary dynamics of amino acid substitutions. Although both of the two commonly used similarity measures (Grantham's and Miyata's) have some power in predicting substitutions between species (5, 19, 30, 31), neither has any power in predicting the fixation probability (Fig. 2 and SI Fig. 6). Most similarity measures attempt to predict the likelihood of substitution between species (i.e., EI; see ref. 5), which is the product of two negatively correlated quantities, PI and FI. As explained earlier, no measure is likely to predict either PI or FI particularly well because of this negative correlation. In light of the observations of Fig. 3, we suggest that new measures be developed by fitting the predictions to PI and FI separately. PI alone is certainly a better measure than EI for the conservativeness of amino acid changes. The higher the PI, the more likely the amino acid changes are to become polymorphic, and the changes can be said to be "more conservative."
We estimated the proportion of adaptive amino acid changes between human and chimpanzee to be 10.4–12.8%. This range is close to estimates in previous studies (32) that used entirely different approaches. Our estimate is accurate only if the FI value of the homogeneous group of Fig. 3 represents the true neutral value. This value is close to 1 in Fig. 3a, but in Fig. 3b, it is much less than 1 (presumably due to ascertainment bias in HapMap data). The assumption is reasonable as long as SNPs >20% in human populations are neutral variants and unlikely to be deleterious (see SI Fig. 5 for justification). If they are advantageous, then our estimate of adaptive evolution would be conservative. Other general caveats against the adaptive interpretation of the MK test (3, 33) of course apply.
In summary, by dividing amino acid changes into the 75 elementary classes, an approach that is increasingly feasible with large genomic data sets, we gain insight into molecular evolutionary processes both within and between species. Coding regions in humans are found to be under both strong positive and negative selection by this type of analysis.
| Materials and Methods |
|---|
For the SeattleSNPs and NIEHS data, DNA and protein sequences of the human genes corresponding to the genotyped SNPs were provided by the databases. For Perlegen and HapMap data, the annotations of genotyped SNPs and the corresponding DNA and protein sequences of human genes were obtained from the Single Nucleotide Polymorphism Database of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/SNP) by using the reference numbers (rs numbers) of SNPs.
The cDNA and protein sequences of human and chimpanzee were obtained from the Ensembl database (www.ensembl.org/info/data/download.html). The putative chimpanzee orthologs of human protein sequences were determined by reciprocal BLAST top hits (34). The putative orthologous pairs of human and chimpanzee protein sequences were aligned by CLUSTALW (35), and the corresponding cDNA sequences were aligned according to these protein alignments by the tranalign program, which is included in the EMBOSS package (36).
The human gene from the Ensembl database and the gene from the SNP databases are considered to be identical when they have a 100% match over the entire coding sequence. A total of 5,008 human–chimpanzee orthologs were identified for the Perlegen data, 5,535 for the HapMap data, and 274 for the SeattleSNPs + NIEHS data.
All of the human SNPs we used were polarized into ancestral and derived alleles by parsimony, with reference to chimpanzee DNA sequences. SNPs that could not be polarized or SNPs that possessed more than three alleles were not used in this study. Furthermore, SNPs whose corresponding human gene does not have a chimpanzee ortholog were also not considered.
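Parsimony polarization against a single outgroup can be sketched as follows. This is a simplified illustration that handles only biallelic SNPs, which is our assumption rather than the filter stated above; the function names are hypothetical.

```python
def polarize(human_alleles, chimp_allele):
    """Return (ancestral, derived), or None if the SNP cannot be polarized.

    Assumption: only biallelic SNPs are handled, and the chimpanzee allele
    must match one of the two human alleles to be taken as ancestral.
    """
    alleles = set(human_alleles)
    if len(alleles) != 2 or chimp_allele not in alleles:
        return None
    (derived,) = alleles - {chimp_allele}
    return chimp_allele, derived


def derived_allele_frequency(sampled_alleles, chimp_allele):
    """Frequency of the derived allele among the sampled human chromosomes."""
    result = polarize(sampled_alleles, chimp_allele)
    if result is None:
        return None
    _, derived = result
    return sampled_alleles.count(derived) / len(sampled_alleles)


print(polarize("AG", "A"))                    # chimp matches one human allele
print(polarize("AG", "C"))                    # chimp matches neither allele
print(derived_allele_frequency("AAAG", "A"))  # one derived copy out of four
```

A SNP with a derived allele frequency above 0.2 would fall into the "common" class used for the FI calculations.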
The Expected Number of New Mutations. We used the same method as in Tang et al. (5) to obtain the expected number of new mutations. When one nucleotide substitution is allowed, one codon can change in nine different ways. Some of the changes may result in amino acid changes (A sites) and others may not (S sites; refs. 37 and 38). Once a nucleotide sequence is given, the ratio of S sites/A sites can be calculated. We included the transition/transversion ratio in this calculation, and we estimated this ratio as 2.4. This ratio was estimated from the fourfold degenerate sites of 5,535 human genes. The expected number of nonsynonymous changes was obtained by setting the expected number of synonymous changes equal to the observed.
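A rough sketch of this counting scheme is given below. It makes several assumptions that are ours and may differ from the procedure of Tang et al.: the standard genetic code is used, each transition is weighted 2.4 times each transversion, and mutations to stop codons are ignored. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def mutation_weights(codon, ts_weight=2.4):
    """Weighted (synonymous, nonsynonymous) totals over a codon's nine 1-bp neighbours."""
    syn = nonsyn = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == "*":
                continue  # assumption: nonsense mutations are not counted
            weight = ts_weight if (codon[pos], base) in TRANSITIONS else 1.0
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                syn += weight
            else:
                nonsyn += weight
    return syn, nonsyn


def expected_nonsynonymous(cds, observed_synonymous, ts_weight=2.4):
    """Scale the expected A count so that the expected S count equals the observed."""
    syn = nonsyn = 0.0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if CODON_TABLE.get(codon, "*") == "*":
            continue  # skip stop codons and codons with ambiguous bases
        s, a = mutation_weights(codon, ts_weight)
        syn += s
        nonsyn += a
    if syn == 0:
        raise ValueError("sequence has no synonymous mutational paths")
    return observed_synonymous * nonsyn / syn


print(mutation_weights("TTT", ts_weight=1.0))  # unweighted S and A counts for Phe (TTT)
print(expected_nonsynonymous("TTTGGC", observed_synonymous=10))
```

With equal weights, the codon TTT has one synonymous and eight nonsynonymous neighbours, which matches the nine possible changes mentioned above.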
We also used the concept of elementary amino acid changes suggested by Tang et al. (5). Nonsynonymous changes can be classified into 75 kinds of elementary changes, each caused by one nucleotide change in a codon. The expected number of amino acid changes for each class can be calculated by partitioning nonsynonymous changes according to the elementary changes (5).
| Acknowledgements |
|---|
| Footnotes |
|---|
Abbreviations: A, nonsynonymous substitutions; EI, evolutionary index; FI, fixation index; MK test, McDonald and Kreitman test; PI, polymorphism index; S, synonymous substitutions.
To whom correspondence should be addressed. E-mail: ciwu{at}uchicago.edu
Author contributions: H.T. and C.-I.W. designed research; J.G. performed research; J.G. and H.T. analyzed data; and J.G., J.M.A., and C.-I.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS direct submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0605565104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|