Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / POPULATION BIOLOGY
Extreme genomic variation in a natural population


*Departments of Genetics and Pathology, Stanford University Medical Center, Stanford, CA 94305-5324; and
Department of Computer Science, Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada M5S 3G4
Communicated by Robert H. Waterston, University of Washington, Seattle, WA, January 31, 2007 (received for review September 7, 2006)
| Abstract |
|---|
|
|
|---|
ciona | genome | heterozygosity | population size
For Ciona spp, a high rate of heterozygosity was reported in the initial publications of the C. savignyi (8) and C. intestinalis (3) genome sequences. However, genomewide quantification and characterization of variation were not conducted in either study. The assembly methodology used by the C. intestinalis project merged allelic sequences such that heterozygosity in the sequenced individual was severely underestimated, and analysis of variation in C. savignyi was confined to seven regions of the genome that totaled
200 kb. The results we report here, which are based on analyses of the entire genome, confirm and extend the initial estimates of heterozygosity in C. savignyi. In addition, we address a broader spectrum of variation that includes large structural polymorphism, mobile element activity, microinsertions and microdeletions, and single nucleotide polymorphism.
Our analyses of heterozygosity, population size, and selection were made possible by the unique nature of the initial genome assembly of the sequenced individual, in which allelic sequences from the sister chromosomes were forced to assemble into separate scaffolds (8). The initial assembly thus contained two copies of each genomic region, but no information as to which scaffolds were allelic. To quantify heterozygosity, we constructed a semiautomatic alignment pipeline that produced a tiling path of allelic scaffolds, and then ordered and aligned them while eliminating detectable contig-level assembly errors [see supporting information (SI) Text]. This process resulted in 374 pairwise alignments, the largest 100 of which contained 88% of the genome. The approximate size of the C. savignyi haploid genome ("haplome") estimated from the alignments is 174 Mb.
| Results |
|---|
|
|
|---|
|
10-fold higher than in humans (9), but comparable to the inversion rate seen between human and baboon, as estimated from alignments of the Encode regions. The size distribution is expectedly skewed, with a few large events accounting for the majority of inverted sequence and small inversions accounting for the majority of events (Fig. 1A).
|
The vast majority of indel events are of size 110 bp (microindels; Table 1). The relative size distribution of microindels is extraordinarily similar to that inferred from a comparison of the rat and mouse genomes (10) and to that seen within the human genome (11), with 43% being single-base indels and having an exponential fall-off of frequency with increasing size (Fig. 1D). The remarkable congruence of the size distributions underscores the conservation of the molecular processes that generate microindels in animal genomes. The genomewide average SNP heterozygosity is 4.5% (Table 1), a figure that is in agreement with the previous study of 220 kb of finished sequence from the sequenced individual (8). This extreme degree of SNP heterozygosity is precisely confirmed in finished BAC sequence from seven unrelated individuals, underscoring that the sequenced individual is not an outlier that happens to be unusually polymorphic. Nucleotide diversity in 4-fold degenerate (4D) sites is an unprecedented 8.0% (SI Table 2). Assuming that 4D sites reflect the neutral diversity in the population, the level of single-nucleotide polymorphism in C. savignyi is approximately two percentage points higher than the neutral divergence between human and Old World monkeys (12). The mutational spectrum of SNPs is not unusual, as the transition/transversion rate ratio of 2.45 (SI Table 3) is similar in other species that, like Ciona, do not exhibit a substantial bias against CpG dinucleotides (1). In summary, the C. savignyi genome bears extremely high rates of several types of polymorphism that exhibit standard characteristics.
Because of their high rates, SNPs and microindel polymorphisms lent themselves to high-resolution analysis of the distribution of heterozygosity within the C. savignyi genome (Fig. 2). First, the distributions of heterozygosity measured in nonoverlapping windows of 1 kb (SNPs) and 5 kb (microindels) are overdispersed compared with those from a randomized alignment (SI Fig. 5). Second, we observe a correlation between the frequency of SNPs and microindels across the genome (R2 = 0.25 in 1-kb windows), which is in contrast to comparisons between distinct species, in which a correlation between SNPs and microindels is not observed (10). Both phenomena could be caused by structural or functional heterogeneity across the genome or heterogeneity in the time to coalescence. We observe no correlation between SNP or microindel frequency and repeat density, gene density, or GC content, and therefore propose that these phenomena are a reflection of the time to coalescence between the two alleles in a given region, as has been observed for SNPs in humans (13).
|
, is a function of effective population size, Ne, and mutation rate, µ:
= 4 Neµ (5). Hence, extreme heterozygosity in C. savignyi could be caused by either a large population size or elevated mutation rates. Invoking roughly equally elevated mutation rates for multiple distinct types of polymorphism does not seem parsimonious, suggesting that a large population size may be the better explanation. We therefore estimated effective population size by using recombination rate estimates, relying on the relationship Ne =
/4c (
is the population recombination parameter, representing the frequency of recombination events among sampled individuals since their most recent common ancestor; c is the per-site, per-generation recombination rate) (5). We directly obtained a range of values for c by typing five large genomic regions in a genetic cross. The average value of c was of 5 x 108, but all five values were used in our analyses to capture variation in recombination rates across the genome (see SI Text).
was obtained with the aforementioned seven BAC sequences, which constitute a third allele, as follows: Using alignments of the BACs to the sequenced individual's allelic regions, we analyzed the lengths of SNP-based "haplotype blocks," which are defined as runs of SNPs in which two chromosomes share one allele and the third has another (Fig. 3A). A comparison between the observed block lengths and those calculated from an alignment in which the order of positions is randomized shows an excess of long blocks (Fig. 3A), which is caused by linkage disequilibrium (14). We then generated hypothetical block length distributions by simulation, using a range of values for Ne (Fig. 3B). We find that the C. savignyi haplotype block length distribution is most similar to simulated distributions generated at an Ne of 1.5 million individuals (Fig. 3C and SI Fig. 6).
|
= 4 Neµ using our estimates of Ne = 1.5 x 106 and
4D = 0.08. Alternatively, we calculate µ to be 7.6 x 109 per generation per base by setting
to the genomewide SNP heterozygosity (0.045). Both estimates of µ are well within the range previously reported for vertebrate and invertebrate species (15). We conclude that the extreme heterozygosity in C. savignyi is caused by a large effective population size and not an elevated mutation rate. Evidence for Strong Purifying Selection in C. savignyi. The neutral theory (5) predicts highly efficient natural selection as a consequence of a large effective population size, as the efficacy of selection is determined by the selection coefficient, s, of a mutation, and the effective population size. Mutations are strongly selected against only if they reduce fitness by s >> 1/4 Ne (16). Hence, as effective population size increases, purifying selection should be able to remove alleles with smaller selection coefficients. Because of the large population size, the variation present in C. savignyi should therefore reflect stronger purifying selection than that seen in organisms with smaller effective population sizes. Testing whether these important predictions of the neutral theory hold, we performed analyses to quantify the strength of purifying selection in the C. savignyi population.
We find a tantalizing footprint of efficient purifying selection in the length distribution of segments of perfect identity between the haplomes (Fig. 4A). In the absence of selection, mutations would be distributed such that the frequency of identical segments decays exponentially as a function of length. By contrast, the C. savignyi genome exhibits a "sawtooth" pattern with a periodicity of three, wherein there are more segments of length 3N + 2 than of length 3N or 3N + 1, even though there is the expected general trend of shorter segments being more common than longer ones. As third position changes in codons that bound N nonpolymorphic codons yield an identical segment of length 3N + 2, the sawtooth pattern is suggestive of purifying selection in coding regions. The pattern is lost just before 150 bp, the mean length of exons in C. intestinalis (3).
|
Second, we examined the physicochemical characteristics of the amino acid changes that are caused by 30,895 nonsynonymous SNPs (SI Table 6). For each of the possible changes, we calculated the ratio of observed-to-expected frequency and found this ratio to be strongly anticorrelated with physicochemical distance (20) between the encoded amino acids (R2 = 0.37; Fig. 4B and SI Table 7). This result shows that SNPs that generate more dissimilar amino acids are, on average, subject to stronger purifying selection than those that generate similar amino acid variants. To our knowledge, this type of analysis has not been done before, and so we lack a comparison with other species. Nonetheless, the fact that more than one-third of the variance in heterozygosity of nonsynonymous SNPs can be explained by Grantham's values (20), which combine two physicochemical properties and the atomic composition of the amino acids, is astounding. The remarkably low value of pA/pS and the strong anticorrelation between physicochemical distance and nonsynonymous SNP frequency underscores the effectiveness of purifying selection in the C. savignyi population.
A large effective population size should also result in highly efficient positive selection, provided that potentially advantageous variants segregate in the population. To address this possibility, we calculated pA/pS ratios across 28,489 high-confidence exons with known frame, where a pA/pS of >1 might indicate positive selection. This type of analysis is the only one we could perform, given that two alleles do not lend themselves to detection of selective sweeps or other signatures of positive selection. Only 17, or 0.06%, of analyzed exons had a pA/pS of >1, and similarity searches and manual examination of the relevant gene models allowed no conclusions as to a shared biological function. By contrast, >78% of exons had a pA/pS of <0.1. Notwithstanding our limited power to detect positive selection, this result further underscores the prevalence of purifying selection in the C. savignyi population.
| Discussion |
|---|
|
|
|---|
The strength of purifying selection acting on the extreme genomic variation in C. savignyi also underscores the apparent robustness of the cellular and developmental machinery of a species that contains such genomic variation, considering that a sufficient fraction of individuals in each generation must be viable and have the potential to reproduce. First, the "average" cellular machinery engaged in recombination, replication, and gene expression that is carried in the population must be able to function reliably upon the vast diversity of genomes; second, the population diversity in functional elements (regulatory elements, exons, etc.) that engage in physical interactions must not be so high that it results in too high a chance of deleterious synthetic (genetic) interactions. Given its wealth of easily ascertained polymorphism, C. savignyi provides an excellent natural laboratory for further exploration of the dynamics of variation in natural populations and the functional consequences of such variation.
| Methods |
|---|
|
|
|---|
Inversion Identification.
Nineteen of the inversions, which cover
1.2% of bases in the genome, were identified in our alignment pipeline, manually examined, and reoriented in the final alignment. These regions were excluded from subsequent automated inversion calling, as were 12 allelic sequence pairs that contained large palindromic low complexity regions. Additional smaller inversions were called automatically with SLAGAN (25). A subset of inversion breakpoints was experimentally verified in a previous study (8). We estimated a 2.1% humanbaboon inversion rate from 28 Mb of alignments of all ENCODE (26) target regions; 113 inversions were found, which contained a total of 588 kb.
Indel and SNP Identification. Indels were parsed directly from the haplome alignment by counting the number and size of alignment gaps in regions that did not contain or border assembly breaks. SNPs were identified only in aligned positions that passed a strict alignment quality filter (SI Text). Nucleotide diversity in 4D sites was calculated from alanine, glycine, proline, threonine, serine (TCN codons only), and valine codons by dividing the number of codons with a synonymous substitution by the total number of identical or synonymous instances of that amino acid.
Repeat Identification. Repeats were identified with RepeatMasker (http://repeatmasker.org) using a de novo repeat library constructed by the RECON (27) program and hand-curated to remove multicopy genes, tRNA, and rRNA elements (SI Text). The library is available at http://mendel.stanford.edu/SidowLab/Ciona.html.
Estimation of Per-Site Recombination Rate.
An estimate for the per-generation per-site recombination rate was obtained by typing 92 meioses of an outbred cross in five distinct regions of the C. savignyi genome totaling 4.6 Mb, or 2.6% of the genome. Markers were spaced approximately every 200 kb across the five regions, which ranged in length from 572 kb to 1.1 Mb. The average physical distance per map unit was
250 kb/cM and ranged from 130 to 550 kb/cM across the five regions.
Calculating the Population Recombination Parameter
.
Most available methods of calculating
were written for typical SNP data sets that are comprised of many alleles at discontinuous, short loci. All such methods we tried failed to produce an estimate of
from our data, which was comprised of only three alleles but contained complete sequence over a large region. We therefore estimated
by comparing the length distribution of observed and simulated haplotype blocks of concordant SNPs. The MS program (28) was used to simulate 1,000 independent replicates of three sequences with
= 0.045 at a succession of values of
corresponding to values of Ne ranging from 100,000 to 20 million. Similarity between the observed and simulated distributions was calculated as the sum of the absolute value of the difference in frequency at each block length. Block lengths of >20 were condensed into one category. Further details are available in SI Text.
Identification of Coding Variants. Coding regions and exons were identified by homology to the C. intestinalis v2.0 gene set (http://genome.jgi-psf.org/Cioin2 and SI Text). Counts of all aligned codons and amino acids are provided in SI Table 5 and SI Table 6. pA/pS was estimated from a concatenation of all annotated codons by using CODEML (29) with the F3 x 4 codon frequency model; pA = 0.0059 and pS = 0.0825. Physicochemical distance between amino acids was measured with the Grantham matrix (20), a composite metric of volume, polarity, and atomic composition. Expected frequencies of polymorphic amino acid pairs were normalized to account for codon frequency and the difference in the rate of transitions and transversions. A total of 4,678 amino acid polymorphisms that are the result of more than one nucleotide change were not included in this analysis, but are recorded in SI Table 6. For detection of potential positive selection on exons, 28,489 exons that exhibited more than three silent changes between the two alleles (from a total of 52,372 exons identified by homology to C. intestinalis) were analyzed for their pA/pS ratios.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: 4D, 4-fold degenerate.
To whom correspondence should be addressed. E-mail: arend{at}stanford.edu
Author contributions: K.S.S. and A.S. designed research; K.S.S., M.B., and M.M.H. performed research; M.B. and M.M.H. contributed new reagents/analytic tools; K.S.S. and A.S. analyzed data; and K.S.S. and A.S. wrote the paper.
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0700890104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
C. D. Brown, D. S. Johnson, and A. Sidow Functional Architecture and Evolution of Transcriptional Elements That Drive Gene Coexpression Science, September 14, 2007; 317(5844): 1557 - 1560. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||