BIOLOGICAL SCIENCES / EVOLUTION
More genes underwent positive selection in chimpanzee evolution than in human evolution
Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109
Communicated by Morris Goodman, Wayne State University School of Medicine, Detroit, MI, February 26, 2007 (received for review December 21, 2006)
| Abstract |
|---|
molecular evolution | population size
An ω value (the nonsynonymous-to-synonymous substitution rate ratio) significantly >1 indicates the action of positive selection, whereas an ω significantly <1 indicates negative (or purifying) selection. Using this approach, two earlier studies (17, 18) pioneered the identification of human and chimp PSGs at the genomic scale, although no comparison was made between the numbers of human and chimp PSGs. In fact, the studies' results would be unsuitable for the comparison, owing to a number of deficiencies. First, both studies used the mouse as an outgroup, to distinguish between human-specific and chimp-specific nucleotide substitutions, because of the unavailability of genome sequences from any closer outgroups at that time. Because mouse is distantly related to human and chimp, this practice introduces errors. Second, one of the studies (17) was based on less reliable statistical methods and assumptions (19), whereas the other (18) used the draft chimp genome sequence (1) known to contain many more errors than the finished human genome sequence (20, 21). Because the majority of genes in a genome have ω < 1, and sequencing errors have an expected ω of 1, such errors inflate ω and increase the false detection of positive selection. In this work, we first design a protocol to rectify these problems and then use the protocol to identify and compare human and chimp PSGs. Our results show substantially more PSGs in chimpanzee evolution than in human evolution.

| Results and Discussion |
|---|
Second, we applied an improved branch-site likelihood method for identifying PSGs (25), which has been shown by computer simulation to produce good results even when some of the assumptions are violated (25). The method requires that the branches in a phylogenetic tree be separated into foreground and background branches a priori, where foreground branches are tested for the occurrence of positive selection. The method assumes that two classes of codons, either negatively selected (class 0) or neutral (class 1), exist in the background branches. This null model is compared with an alternative model in which a proportion of class 0 codons, and the same proportion of class 1 codons, become positively selected in the foreground branches. Positive selection in foreground branches is inferred for a gene if the likelihood of the observation of the gene sequences is significantly higher under the alternative model than under the null model. To further verify the suitability of the method in the present context, we conducted additional computer simulations specifically designed to mimic the evolution of human, chimp, and macaque genes [see supporting information (SI) Materials and Methods]. Our results showed that the false-positive rate is acceptable, except for extreme conditions when it slightly exceeds the nominal rate (see SI Tables 3 and 4).
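The model comparison described above amounts to a likelihood-ratio test. The following is a minimal sketch, assuming the log-likelihoods under the null and alternative models have already been obtained, and assuming a chi-square reference distribution with one degree of freedom (the text here does not state which reference distribution was used). The function name and the example values are hypothetical.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood-ratio test, assuming a chi-square(1) reference."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        # The alternative model fits no better than the null model.
        return 1.0
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods for one gene (illustrative values only).
p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-998.0)
print(f"P = {p:.4f}")
```

A gene would be called a putative PSG in the foreground lineage when this P value falls below the chosen significance level.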
Third, we used high-quality nucleotides from the 4x coverage chimp genome sequence to allow a fair comparison with the human sequence. Briefly, we assembled alignments of orthologous genes from human, chimp, and macaque, using publicly available genome sequences and annotations (see Materials and Methods). We then eliminated alignment gaps and those codons in which one or more chimp nucleotides did not meet our quality cutoff. Three different cutoffs, low (Q0), intermediate (Q10), and high (Q20), were used to generate three data sets. After removing alignments of <100 codons, we obtained our final data sets, containing 13,955, 13,924, and 13,888 genes for the Q0, Q10, and Q20 cutoffs, respectively (see SI Table 5). Even the smallest data set (Q20) has a total alignment length of 17,995,887 nucleotides, with a mean alignment length of 432 codons (standard deviation, 339 codons). All three data sets contain >50% of genes in a primate genome and cover >50% of all protein-coding regions in the genome. Using parsimony, we inferred the numbers of nucleotide substitutions in human and chimp lineages since their split. This inference is expected to be accurate because the three species studied here are closely related. We found that the ratio of the number of synonymous substitutions in the chimp lineage to that in the human lineage is r = 1.103 ± 0.009, 1.020 ± 0.008, and 0.985 ± 0.008 for the Q0, Q10, and Q20 data sets, respectively. Assuming identical mutation rates per year between human and chimp lineages, r is expected to be 1. If the mutation rate is 3% lower in humans than in chimps, as has been suggested (26), r is expected to be 1.03. Given these considerations, Q0 data, as used in an earlier study (18), are apparently unsuitable because the observed r is significantly higher than the expectation. To make our conclusion more conservative, we use Q20 rather than Q10 data. 
Two other independent assessments of the chimp genome sequence, one of which evaluated it against 172 kb of finished chimp sequence, also recommended the use of Q20 data for comparison with the human genome sequence (1, 20). Most importantly, the number of synonymous substitutions is already 1.5% lower in chimp than in human when the cutoff of Q20 is used, suggesting that the chimp sequencing errors become negligible at this quality level. The comparison between the 172 kb of draft and finished chimp sequences also showed that the use of cutoffs higher than Q20 is undesirable because many chimp-specific nucleotide changes tend to be lost (20). This is probably because polymorphic sites in the chimp individual that was sequenced, estimated to be 0.1% of all sites (1), tend to have lower qualities than homozygous sites. These polymorphic sites are excluded progressively as one increases the quality cutoff, which hampers a fair comparison with human because the human genome sequence contains polymorphic sites (1). Note that errors in the macaque genome sequence should not affect our analysis because the probability for a macaque error to occur at a nucleotide position where human and chimp differ is small. Even when such rare events occur, they should affect human and chimp equally and hence would not bias our results. Our human–chimp comparison should not be biased by indel errors because the detection of positive selection does not use indel information.
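The codon-level filtering described above can be sketched as follows. This is an illustration only: it assumes three equal-length, in-frame aligned sequences with "-" marking gaps, one chimp quality score per aligned position, and that "not meeting the cutoff" means a score below the cutoff. The function name and argument layout are hypothetical.

```python
def filter_codons(human, chimp, macaque, chimp_qual, cutoff=20, min_codons=100):
    """Drop gapped or low-quality codons; return None if too few codons remain."""
    kept = []
    usable = len(human) - len(human) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codons = (human[i:i + 3], chimp[i:i + 3], macaque[i:i + 3])
        if any("-" in codon for codon in codons):
            continue  # alignment gap in at least one species
        if min(chimp_qual[i:i + 3]) < cutoff:
            continue  # one or more chimp nucleotides below the quality cutoff
        kept.append(codons)
    if len(kept) < min_codons:
        return None  # alignment too short to analyze
    # Rebuild the three filtered sequences from the retained codons.
    return tuple("".join(codon[k] for codon in kept) for k in range(3))
```

Running it with cutoffs of 0, 10, and 20 would correspond to the Q0, Q10, and Q20 data sets, respectively.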
More PSGs in Chimp Evolution than in Human Evolution.
Applying the likelihood method and a P value of 5% for statistical significance (25), we identified 154 genes that were under positive selection in the human lineage (Table 1 and SI Table 6) and 233 in the chimp lineage (see SI Table 7). Thus, chimps have 51% more PSGs than humans have. As expected, the excess of chimp PSGs is even greater (157%) should the Q10 data be used (SI Table 5). The proportion of PSGs in the genome is 233/13,888 = 1.7% for the chimp lineage, significantly greater than that (154/13,888 = 1.1%) for the human lineage (P < 10⁻⁴, χ² test). Because 13,888 statistical tests were conducted for each lineage, it is necessary to control for multiple testing. Under Bonferroni correction, two human genes and 21 chimp genes remain statistically significant (see SI Table 8). With use of a false discovery rate of 5%, the same two human genes and 59 chimp genes are significant (SI Table 8). The proportion of PSGs in the chimp genome remains significantly greater than that in the human genome (P < 10⁻⁴, χ² test), even after the multiple-testing corrections (Table 1).
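The comparison of PSG proportions above can be checked with a 2 × 2 chi-square test. The sketch below uses the counts reported in the text and assumes a Pearson chi-square statistic without continuity correction, treating the two lineages as independent samples. The exact variant of the test is not specified in the text, so these choices are assumptions.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


genes = 13_888                     # genes tested per lineage (Q20 data set)
human_psg, chimp_psg = 154, 233    # PSG counts reported in the text
chi2, p = chi2_2x2(human_psg, genes - human_psg,
                   chimp_psg, genes - chimp_psg)
print(f"chi2 = {chi2:.2f}, P = {p:.1e}")
```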
χ² test). Note that this is a conservative estimate because we did not consider non-PSGs from the 4x sequence that may become PSGs in the 6x sequence. Such incidences are possible because potentially more nucleotides per gene can be analyzed in the 6x sequence, leading to improved statistical power in identifying PSGs. Additionally, 4x and 6x sequences may differ at polymorphic sites, which can affect the outcome of PSG identification when the number of substitutions is small. Because the analyses of the 4x and 6x sequences both indicate substantially more PSGs in chimps than in humans, and because the 6x assembly is preliminary and unpublished, our subsequent analyses use the PSGs identified from the Q20 data of the 4x assembly. An additional reason for using the 4x assembly is the finding of a number of cases in which the 4x assembly is apparently more accurate than the 6x assembly (see SI Materials and Methods).
We found that the mean ω of all genes is 0.259 ± 0.002 in the human lineage, significantly larger than that (0.245 ± 0.002) in the chimp lineage (P < 10⁻⁴; Table 1). For the common set of 13,508 non-PSGs between humans and chimps, the mean ω is also significantly larger in human (0.252 ± 0.002) than in chimp (0.238 ± 0.002) (P < 10⁻⁴; Table 1). Because the majority of non-PSGs are under negative selection, as reflected in their low ω values, the above results indicate stronger negative selection in chimps than in humans. Multiple-population genetic data indicate that the long-term effective population size of humans (in the last 1–2 million years) is several-fold smaller than that of chimps and than that of the human–chimp common ancestor (2, 27–34). A recent analysis of 1 million base pairs of Neanderthal nuclear DNA also suggested that the common ancestor of modern humans and Neanderthals had a small effective population size (35). It is thus probable that the effective population size is greater in the chimp lineage than in the human lineage for a large portion of the divergence time between the two lineages. Population genetic theories (36) predict that both positive and negative selection are more effective in large populations than in small populations. Our observation that chimps have more PSGs but fewer nonsynonymous substitutions in non-PSGs than humans is consistent with these predictions.
Computer simulations showed that the branch-site likelihood method cannot detect all PSGs. Rather, the detection rate increases as the ω of background branches increases (see SI Table 9). If the overall strength of positive selection is weaker in humans than in chimps because of smaller populations of humans than chimps, a higher average background ω is required for PSGs to be detectable in humans than in chimps. We found that in the macaque branch of the human–chimp–macaque tree, the mean ω for all genes is 0.226 ± 0.001. For human PSGs, the mean ω in the macaque branch is 0.294 ± 0.007, significantly greater than the mean ω in the macaque branch (0.278 ± 0.005) for chimp PSGs (P < 0.05). Hence, these observations are consistent with the simulation result and further support the notion that positive selection was weaker in the human lineage than in the chimp lineage. Theories also predict that recombination can increase the efficacy of selection (37). Indeed, PSGs tend to be located in high-recombination regions, although this effect is significant in chimps (P = 0.041) but not in humans (P = 0.32) (see SI Fig. 4), probably as a result of a difference in statistical power caused by the difference in the number of PSGs in the two species.
Similarities and Differences Between Human and Chimp PSGs. It has been claimed that genes of certain functional categories, such as olfaction and nuclear transport, were more frequently under positive selection in humans than in chimps, based on the ranking of all genes by their P values in the likelihood test of positive selection (17). Because genes with reduced negative selection also tend to have low P values (although unlikely to be as low as 0.05), such ranks potentially mix genes under positive selection with those under reduced negative selection. We took a more rigorous approach by limiting our analysis to the PSGs we detected. We found that seven genes are shared between the human and chimp PSGs (see SI Table 10), significantly greater than expected by chance (2.6; P < 0.02, binomial test), suggesting the presence of some common targets of positive selection in the two lineages. We classified all PSGs into biological process groups and molecular function groups, as defined in the PANTHER database (38). A randomization test indicated a significant difference in distribution of human and chimp nonoverlapping PSGs among biological process groups (Fig. 1A) and among molecular function groups (Fig. 1B). Those groups showing the greatest differences between the two species are listed in Fig. 1C. Interestingly, however, the majority of these groups (e.g., protein metabolism and modification, anion transport, phosphate transport, and lyase) do not correspond to the widely assumed adaptive phenotypic differences between humans and chimps (e.g., neurogenesis), suggesting the existence of yet-to-be-recognized adaptive phenotypic differences between the two species. We did not detect several previously reported PSGs that control brain size or cognitive functions (39–42) because previous identifications of these PSGs were based on a comparison of polymorphism and divergence data, whereas only divergence data are used here. 
As mentioned above, due to the paucity of chimp polymorphism data, any fair genome-wide comparison of human and chimp PSGs would have to be limited to divergence data at this time.
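The chance expectation of 2.6 shared PSGs quoted above can be reproduced with a short calculation. The sketch below assumes that a gene is a PSG in each lineage independently, with probabilities 154/13,888 and 233/13,888, and that the binomial test treats the 13,888 genes as independent trials. The text does not spell out how the binomial test was parameterized, so this setup is an assumption.

```python
from math import comb


def binomial_tail(k_obs: int, n: int, p: float) -> float:
    """P(X >= k_obs) for X ~ Binomial(n, p)."""
    below = sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(k_obs))
    return 1.0 - below


genes = 13_888
human_psg, chimp_psg, shared = 154, 233, 7

# Probability that a given gene is a PSG in both lineages by chance alone.
p_both = (human_psg / genes) * (chimp_psg / genes)
expected = genes * p_both
p_value = binomial_tail(shared, genes, p_both)
print(f"expected shared PSGs = {expected:.1f}, P = {p_value:.3f}")
```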
χ² test; and see SI Table 11). On examining the peak-expression tissue group for each gene (see SI Table 12), we again found no significant difference in the overall tissue distribution between human and chimp PSGs (Fig. 2). Notably, 14 (11%) human PSGs and 13 (6.7%) chimp PSGs have peak expressions in one or more parts of the brain, but the difference is not statistically significant (χ² = 1.74, P = 0.19). On the contrary, for the central nervous system outside of the brain, human (8) has fewer PSGs than chimp (14) (χ² = 0.09, P = 0.77). These findings are consistent with recent comparative genomic analyses (21, 43) and do not support more positive selection in humans than in chimps in regard to nervous system genes (44).
| Materials and Methods |
|---|
We applied the parsimony principle to identify human-specific and chimpanzee-specific substitutions, using the macaque as the outgroup. The numbers of synonymous (s) and nonsynonymous (n) nucleotide substitutions in the human and chimp lineages were counted. Using the modified Nei–Gojobori method (66) with a transition/transversion ratio of 2 (67), we estimated that the total number of nonsynonymous sites in the 13,888 genes of the Q20 data set was N = 12,783,034 and the total number of synonymous sites was S = 5,215,415, with their ratio being N/S = 2.45. Thus, for a set of genes, the mean nonsynonymous-to-synonymous rate ratio in a lineage can be computed by (n/s)/(N/S) = (n/s)/2.45 = 0.41n/s.
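The two steps above, assigning substitutions to a lineage by parsimony and converting counts into a mean rate ratio, can be sketched as follows. This is a simplified illustration at the single-nucleotide level: it does not separate synonymous from nonsynonymous changes (which requires the genetic code), and the function names are hypothetical. The default site totals are the N and S values reported in the text for the Q20 data set.

```python
def lineage_specific_changes(human: str, chimp: str, macaque: str) -> dict:
    """Assign human-chimp differences to a lineage, using macaque as the outgroup."""
    counts = {"human": 0, "chimp": 0, "ambiguous": 0}
    for h, c, m in zip(human, chimp, macaque):
        if h == c:
            continue  # no difference between human and chimp at this site
        if c == m:
            counts["human"] += 1      # chimp retains the outgroup state
        elif h == m:
            counts["chimp"] += 1      # human retains the outgroup state
        else:
            counts["ambiguous"] += 1  # all three differ; parsimony cannot decide
    return counts


def mean_rate_ratio(n_subs, s_subs, n_sites=12_783_034, s_sites=5_215_415):
    """Mean nonsynonymous-to-synonymous rate ratio: (n/s) / (N/S)."""
    return (n_subs / s_subs) / (n_sites / s_sites)


print(lineage_specific_changes("AAGT", "ACTT", "ACGG"))
```

With the reported site totals, `mean_rate_ratio(n, s)` is approximately 0.41 × n/s, matching the shortcut given in the text.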
Identification of PSGs. Using PAML (68), we applied the improved branch-site test of positive selection (test 2 in ref. 25) to identify putative cases of positive selection in the human lineage among the 13,888 genes (Q20 data). When we tested positive selection in the human lineage, the human branch was designated as the foreground branch and the chimp and macaque branches were designated as background branches. We tested positive selection in the chimp lineage similarly. Bonferroni correction (69) and a false discovery rate of 5% (70) were used to correct for multiple testing. We also analyzed the Q10 data set and identified 165 human and 424 chimp PSGs.
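The two multiple-testing corrections named above can be sketched in a few lines. The Bonferroni rule is standard. For the false discovery rate, the sketch assumes the Benjamini-Hochberg step-up procedure, which is one common way to control the rate at 5%; the text does not say which procedure was used, so this is an assumption.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that remain significant after Bonferroni correction."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests significant at false discovery rate alpha (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank whose P value passes its threshold
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


# Hypothetical per-gene P values (illustrative only).
example = [0.001, 0.02, 0.03, 0.5]
print(bonferroni(example), benjamini_hochberg(example))
```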
Use of the 6x Chimp Genome Assembly. Our analysis of chimp PSGs using the 6x chimp genome assembly is described in SI Materials and Methods.
Comparison Between Human and Chimp PSGs.
Using the PANTHER database (38), we classified the 13,888 genes into different groups of biological processes and molecular functions. Note that these groups are not mutually exclusive and that a gene may belong to more than one group. To examine the distributional difference between human and chimp PSGs across PANTHER groups, we defined the statistic
[Equation defining χ² not recoverable from the source.]
where x_i and y_i are the number of human and chimp PSGs, respectively, in PANTHER group i, and n is the total number of PANTHER groups. Because of the nonindependence of PANTHER groups, we used a randomization test to examine whether the observed χ² was significantly different from the random expectation. Briefly, we randomly divided the 373 unshared human and chimp PSGs into 147 human PSGs and 226 chimp PSGs and computed χ² by using the above formula. We repeated this procedure 10,000 times to obtain the null distribution of χ², to which the observed χ² is compared. Similar results were obtained when the seven shared PSGs were included.
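The randomization procedure can be sketched as a label-shuffling test. The statistic used below is an illustrative Pearson-style stand-in chosen for this sketch and is not the formula from the paper. The representation of each gene as a tuple of group names, and all function names, are likewise hypothetical.

```python
import random


def group_counts(genes, groups):
    """Count genes per group; a gene may belong to more than one group."""
    counts = {g: 0 for g in groups}
    for memberships in genes:
        for g in memberships:
            counts[g] += 1
    return counts


def illustrative_statistic(human_genes, chimp_genes, groups):
    """Stand-in statistic: Pearson-style sum over groups (an assumption)."""
    x = group_counts(human_genes, groups)
    y = group_counts(chimp_genes, groups)
    frac_human = len(human_genes) / (len(human_genes) + len(chimp_genes))
    stat = 0.0
    for g in groups:
        pooled = x[g] + y[g]
        if pooled == 0:
            continue  # group contains no PSG from either species
        exp_x = pooled * frac_human
        exp_y = pooled - exp_x
        stat += (x[g] - exp_x) ** 2 / exp_x + (y[g] - exp_y) ** 2 / exp_y
    return stat


def randomization_p(human_genes, chimp_genes, groups, n_perm=10_000, seed=0):
    """Fraction of random splits whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = illustrative_statistic(human_genes, chimp_genes, groups)
    pooled = list(human_genes) + list(chimp_genes)
    n_human = len(human_genes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random split into "human" and "chimp" sets
        stat = illustrative_statistic(pooled[:n_human], pooled[n_human:], groups)
        if stat >= observed:
            hits += 1
    return hits / n_perm
```

Shuffling whole genes rather than individual group counts keeps each gene's set of memberships intact, which is what lets the test tolerate overlapping groups.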
The microarray gene expression data in 79 human tissues, and the nucleotide sequences for 27,215 probe sets on the array, were obtained from ref. 71. The probe set sequences were used to perform BLAST searches against the human coding sequences annotated by Ensembl. Probe sets that matched to multiple genes were considered ambiguous and were discarded. A total of 26,195 probe sets were unambiguously matched to 16,605 distinct genes. Among these 16,605 genes, 12,099 genes, including 127 human PSGs and 195 chimp PSGs, can be found in our Q20 data set. For genes that matched to more than one probe set, the expression levels measured by different probe sets were averaged for each tissue replicate. Two replicates were available for each tissue, and these were averaged to determine the expression level of a gene in each tissue. Identification of tissue specificity can be obscured if multiple tissues with very similar expression profiles are used (72). We therefore consolidated multiple tissues representing similar areas into tissue groups and took the highest expression level from any tissue in a group as the single representative expression level score for the tissue group (21) (SI Table 12). Expression levels in pathogenic tissues were not considered. A gene was considered to be tissue-specific if the expression level in the highest tissue group was greater than or equal to twice the expression level in the second highest tissue group. The 3,299 genes meeting this criterion are said to be tissue-specific in the highest tissue. We also considered the peak expression tissue for every gene.
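The tissue-specificity rule above can be sketched for a single gene. The data layout (a mapping from tissue to replicate values, and from tissue group to member tissues) and the function names are assumptions made for illustration, and the example values are invented.

```python
from statistics import mean


def tissue_levels(replicates):
    """Average the replicate measurements for each tissue."""
    return {tissue: mean(values) for tissue, values in replicates.items()}


def group_levels(levels, tissue_groups):
    """Represent each tissue group by its highest-expressing member tissue."""
    return {group: max(levels[t] for t in members)
            for group, members in tissue_groups.items()}


def peak_and_specificity(levels_by_group, fold=2.0):
    """Return the peak tissue group and whether the gene is tissue-specific."""
    ranked = sorted(levels_by_group.items(), key=lambda kv: kv[1], reverse=True)
    (top_group, top), (_, second) = ranked[0], ranked[1]
    # Tissue-specific if the top group is at least `fold` times the runner-up.
    return top_group, top >= fold * second


# Hypothetical expression values for one gene (two replicates per tissue).
replicates = {"cortex": [100, 120], "cerebellum": [60, 80],
              "liver": [40, 50], "kidney": [20, 30]}
groups = {"brain": ["cortex", "cerebellum"], "liver": ["liver"], "kidney": ["kidney"]}
result = peak_and_specificity(group_levels(tissue_levels(replicates), groups))
print(result)
```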
Online Mendelian Inheritance in Man (www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) was used to identify all genes known to be involved in human Mendelian diseases. The chromosomal locations of all genes were obtained from Ensembl.
Recombination rate data for 1-megabase segments of human chromosomes were downloaded from University of California, Santa Cruz (http://genome.ucsc.edu/cgi-bin/hgTables). A recombination rate was assigned to each gene in the Q20 data set, based on the 1-megabase segment in which the midpoint of the gene lies. Of the 13,888 genes analyzed here, 13,714 are found in regions of known recombination rates. Among these 13,714 genes, 152 human and 228 chimp PSGs have available recombination rates. We then computed the mean recombination rate of the 152 human PSGs. To estimate the expected value of this mean, we randomly picked 152 genes from 13,714 genes and computed the mean. This procedure was repeated 10,000 times to estimate the probability that the observed mean is greater than the expected mean. The same procedure was applied to chimp PSGs, under the assumption that the recombination rate of a chimp gene is the same as for its human ortholog, which is probably correct for the majority of genes at the 1-megabase scale (73).
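The resampling procedure above can be sketched as follows. It assumes the P value is the fraction of random gene sets whose mean recombination rate is at least as large as the observed PSG mean; the function name, argument layout, and toy data are hypothetical.

```python
import random
from statistics import mean


def resampling_p(psg_rates, all_rates, n_draws=10_000, seed=0):
    """Compare the mean rate of PSGs with means of random gene sets of equal size."""
    rng = random.Random(seed)
    observed = mean(psg_rates)
    k = len(psg_rates)
    # Count random sets (drawn without replacement) that match or exceed the PSG mean.
    at_least = sum(mean(rng.sample(all_rates, k)) >= observed
                   for _ in range(n_draws))
    return observed, at_least / n_draws


# Toy data: most genes in low-recombination regions, PSGs in high ones.
all_rates = [1.0] * 90 + [5.0] * 10
psg_rates = [5.0] * 5
observed, p = resampling_p(psg_rates, all_rates, n_draws=1000)
print(observed, p)
```

For the analysis in the text, `all_rates` would hold the rates of the 13,714 genes with known recombination rates, and `psg_rates` those of the 152 human (or 228 chimp) PSGs.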
Performance of the Improved Branch-Site Likelihood Method. The performance of the improved branch-site likelihood method is described in SI Materials and Methods.
| Footnotes |
|---|
Abbreviations: PSG, positively selected gene.
*To whom correspondence should be addressed at: Department of Ecology and Evolutionary Biology, University of Michigan, 1075 Natural Science Building, 830 North University Avenue, Ann Arbor, MI 48109. E-mail: jianzhi{at}umich.edu
Author contributions: M.A.B., P.S., and J.Z. designed research; M.A.B., P.S., and J.Z. performed research; M.A.B., P.S., and J.Z. analyzed data; and M.A.B. and J.Z. wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0701705104/DC1.
© 2007 by The National Academy of Sciences of the USA