New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Estimating the number of unseen variants in the human genome

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved January 7, 2009 (received for review August 8, 2008)
Abstract
The different genetic variation discovery projects (The SNP Consortium, the International HapMap Project, the 1000 Genomes Project, etc.) aim to identify as much as possible of the underlying genetic variation in various human populations. The question we address in this article is how many new variants are yet to be found. This is an instance of the species problem in ecology, where the goal is to estimate the number of species in a closed population. We use a parametric betabinomial model that allows us to calculate the expected number of new variants with a desired minimum frequency to be discovered in a new dataset of individuals of a specified size. The method can also be used to predict the number of individuals necessary to sequence in order to capture all (or a fraction of) the variation with a specified minimum frequency. We apply the method to three datasets: the ENCODE dataset, the SeattleSNPs dataset, and the National Institute of Environmental Health Sciences SNPs dataset. Consistent with previous descriptions, our results show that the African population is the most diverse in terms of the number of variants expected to exist, the Asian populations the least diverse, with the European population inbetween. In addition, our results show a clear distinction between the Chinese and the Japanese populations, with the Japanese population being the less diverse. To find all common variants (frequency at least 1%) the number of individuals that need to be sequenced is small (∼350) and does not differ much among the different populations; our data show that, subject to sequence accuracy, the 1000 Genomes Project is likely to find most of these common variants and a high proportion of the rarer ones (frequency between 0.1 and 1%). The data reveal a rule of diminishing returns: a small number of individuals (∼150) is sufficient to identify 80% of variants with a frequency of at least 0.1%, while a much larger number (> 3,000 individuals) is necessary to find all of those variants. Finally, our results also show a much higher diversity in environmental response genes compared with the average genome, especially in African populations.
A main goal of the various human genome projects is to discover genetic variants in human genomes. The HapMap project (http://www.hapmap.org/) has contributed much to our understanding of the underlying genetic variation in diverse human populations and has facilitated the discovery of many loci robustly associated with common human diseases, such as diabetes, obesity, breast cancer, and many others (1–5). The recently launched 1000 Genomes Project (http://www.1000genomes.org/), by sequencing 1,000 genomes, aims to discover much of the existent common variation (frequency at least 1%), including both singlenucleotide polymorphisms (SNPs) and the less explored copynumber variants (6).
In this article, we provide a systematic way to predict the number of new variants with a specified minimum frequency to be identified in future datasets of specified sizes. In particular, based on sequence data for a set of individuals, can we predict how many more new variants will be found if we were to sequence a new set of individuals of a given size? This question is related to the species problem in ecology, where one is interested in estimating the number of species in a closed population. A particular example is estimating the number of words Shakespeare knew but did not use. Specifically, if a new volume by Shakespeare were to be discovered, how many new words would we expect to see? Efron and Thisted (7) used a GammaPoisson model to address this question. We adapt the approach in Efron and Thisted to the problem of predicting the number of genetic variants yet to be identified in future studies. The method also allows calculation of the number of individuals required to be sequenced in order to detect all (or a fraction of) the variants with a given minimum frequency. In the following sections we develop the method and show applications to several available sequence datasets: ENCODE, SeattleSNPs, and National Institute on Environmental Health Sciences (NIEHS) SNPs. Although these datasets contain only SNPs, the method can be applied to counting other types of variants, including copynumber variants.
1. Methods
First, we introduce some notation. For our purposes, an individual shows variation at a particular position if the corresponding allele is different from the ancestral allele. We say that a position is variable (or is a variant) in a sample if there is at least one individual in the sample that shows variation at that position.
Suppose we have data on N_{ind} individuals; for example, each of the N_{ind} individuals has been sequenced in a genomic region, and hence for each position we know whether an individual shows variation or not.
We follow the notation in ref. 7. Suppose the total number of variable positions in the human genome is an unknown, fixed number, denoted here by N. Let f_{s} be the unobserved frequency of variable position s, and let x_{s} be the number of times variable position s has been observed in the N_{ind} individuals in our dataset. Then, x_{s} ∼ Bin(N_{ind},f_{s}). Of course, we can only observe those variable positions with x_{s} > 0. We assume that f_{s} ∼ Beta(a,b). The Beta prior is not only mathematically convenient, but a good approximation for the distribution of allele frequencies at biallelic markers under a neutral selection and mutationdrift equilibrium model, as Wright (8) showed.
Let n_{x} be the number of positions with exactly x individuals showing variation at a position. Hence, n_{1} is the number of variants that occur in only one individual, etc., and
As in ref. 7, we want to estimate Δ(t), i.e., the number of new variants to be found in the next t · N_{ind} individuals (if we were to perform a new sequencing study of that size). For t ≥ 0, we can write
where f(θ) = θ ^{a−1} · (1θ)^{b−1}/B(a,b) is the density function for the Beta distribution and
Inserting the form for f(θ) into Eq. 1 we obtain: Detailed derivations are shown in the supporting information (SI) Appendix. Note that Δ(0) = 0 and when t → ∞.
In addition, we can calculate Δ_{f}(t), i.e., the number of new variants with frequency at least f expected to be found in a new set of t · N_{ind} individuals, as follows: where CDF is the cumulative distribution function of the beta distribution.* As expected, Δ_{0}(t) = Δ(t). Also as t → ∞.
It is also possible to predict the number of individuals necessary to capture essentially all variants with frequency at least f. This amounts to finding the smallest value of t such that for some small, fixed ɛ. Next, we show how we estimate parameters a and b from the available data.
1.1. Estimation of the Parameters a and b of the Beta Distribution.
To estimate a and b, we use maximumlikelihood estimation. The probability that exactly x individuals show variation at a position is: for x ≥ 0. As we already mentioned, the zero class is not observed and hence we need to fit the zerotruncated betabinomial distribution. The probability distribution function for this truncated distribution becomes: for x ≥ 1. The likelihood function can then be written as: and the loglikelihood function is:
We maximize LL(a,b) to obtain the maximumlikelihood estimators (MLEs) for a and b. The maximization is carried out through the Newton–Raphson method.
Note that in the likelihood function above, we assume that markers are in linkage equilibrium. Dependence among nearby markers makes the effective number of markers in the region look smaller, but does not lead to systematic bias.
2. Applications
We now present results when applying this method to several sequence datasets: ENCODE, SeattleSNPs, and NIEHS SNPs datasets. Although these examples concern SNPs, the method is equally applicable to other types of variants, including copynumber variants, provided highquality data are available.
2.1. Encode Data Application.
We first applied the method to the sequencing data from the ENCODE project (http://www.hapmap.org/downloads/encode1.html.en). Ten 500kb regions of the genome were sequenced in 48 unrelated DNA samples: 16 Yoruba (YRI), 16 CEPH European (CEPH), 8 Han Chinese (CHB), and 8 Japanese (JPT).† These regions were chosen to be representative of the genome in general, including various chromosomes, recombination rates, gene density, and values of nontranscribed conservation with mouse.
For a large proportion of SNPs, i.e., 88%, the ancestral allele is known and can be found from dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/). For those SNPs where the ancestral allele is not known, we consider the ancestral allele to be the major allele at the locus. This is reasonable, because the major allele at a SNP is typically the ancestral allele in the absence of selection (9, 10).
To make results comparable across the 4 populations (YRI, CEPH, CHB, and JPT), we considered only 7 of the sequenced individuals for each dataset (only 7 individuals were sequenced in CHB). In Fig. 1, for each population we plot the expected number of new variants with a specified minimum frequency to be discovered in future datasets, of sizes t · N_{ind} with t ∈ [0.1,16].‡ Overall, the results are as expected, with the African population the most diverse and the Asian populations the least diverse. Interestingly, the Chinese population is predicted to be more diverse than the Japanese population; we observe the same diversity pattern in an independent dataset, the NIEHS SNPs data, in a following section. The MLEs for a and b are: (â_{YRI},
We also estimated the minimum number of individuals necessary to sequence in order to detect all (or at least 80% of) the variation with a specified minimum frequency in each of the four ENCODE populations (Table 1). It is interesting to note the large difference between the number of individuals necessary to find all variants with frequency at least 0.001, e.g., 3,521 for the CEPH population, and the number of individuals necessary to detect 80% of all variants, e.g., 154 for CEPH; finding 99% of the variants in CEPH requires 1,008 individuals. The reason is that, as we add individuals, new variants are identified. The discovery process is very efficient in the beginning, but after many individuals are sequenced, each additional individual contributes less and less to the pool of the newly discovered variants (Fig. 2). This saturation effect is also noticeable in Fig. 1, especially for variants with a frequency of at least 0.05 or 0.10. For rarer variants (frequency < 0.05) the asymptote is not yet reached for t ≤ 16.
2.2. SeattleSNPs Data Application (Inflammatory Response Genes).
We applied our method to a second dataset, the sequencing data available from the SeattleSNPs web site (http://pga.gs.washington.edu/). We used data on 24 Yoruban (YRI) and 23 CEPH unrelated samples (mostly different from the HAPMAP samples in the ENCODE dataset, with 4 YRI samples and 6 CEPH samples in common between the two datasets) sequenced at ∼1.6 Mb of reference sequence covering 76 genes related to inflammatory response (11) (the sequenced regions are different from the regions in the ENCODE dataset).
Given that the regions sequenced in this dataset were not overlapping with the regions in the ENCODE dataset, we used the ENCODE dataset to predict the total number of SNPs expected to exist in the SeattleSNPs dataset. To make the two datasets completely independent, we removed the 4 YRI samples and the 6 CEPH samples (shared between the two datasets) from the SeattleSNPs dataset, leaving us with 20 YRI samples and 17 CEPH samples. We ignored the small percentage of variations that were small diallelic insertiondeletions, and only considered SNPs. In total, we identified 6,494 SNPs in the YRI samples and 3,876 in the CEPH samples.
For YRI, given that 8 individuals were sequenced as part of the ENCODE project, the predicted number of SNPs in 20 individuals is simply the observed number of SNPs in the 8 sequenced individuals plus the estimated number of new variants to exist in 12 new individuals, i.e., Δ(1.5), which gives an estimated 12,973 SNPs. Similarly, for CEPH, given that 16 individuals were sequenced, the predicted number of SNPs in 17 individuals is the observed number of SNPs in 16 individuals plus Δ(0.06), resulting in an estimate of 10,225 SNPs. However, these do not represent the total number of SNPs found in the region; the observed SNPs were QC (qualitycontrol) filtered. Assuming that 90% of all SNPs passed quality checks [as this was the proportion of QC + SNPs in Phase I of the HapMap Project (12)], the predicted number of SNPs in 20 YRI individuals becomes 12,973/0.9 = 14,414 and in the 17 CEPH individuals, 10,225/0.9 = 11,361. The region sequenced in the ENCODE project is 5 Mb long; hence, for a 1.6 Mb region, we predict 4,612 SNPs in YRI and 3,635 SNPs in CEPH (Table 2).
Although such a crosssample, crossregion comparison is susceptible to many caveats, it does reveal an interesting observation. It appears that the numbers predicted based on the ENCODE data are smaller than the observed numbers in the SeattleSNPs data. An excess of rare variants in the SeattleSNPs data compared with the ENCODE data has been noticed before (13). One plausible explanation has to do with the nature of the regions sequenced in the two projects. For the ENCODE data, the regions were selected in a broader fashion, to encompass various features of the genome (both gene and nongene regions, various recombination rates, etc.), whereas the genes sequenced in the SeattleSNPs data are all related to inflammatory response. Smith et al. (2005) found that inflammatory and immunerelated genes have the lowest linkage disequilibrium and highest sequence diversity among 35 gene categories considered. The much larger difference between the predicted number and the observed number of SNPs in YRI (SeattleSNPs data) compared with CEPH may suggest that diversity in inflammatory related genes is even larger (when compared with the average diversity in the genome) for the YRI than for the CEPH.
2.3. NIEHS SNPs.
In a different study (http://egp.gs.washington.edu/), (15), 293 environmental response genes have been sequenced (for a total length of 5.95 Mb) in 27 individuals of African descent, 22 of European descent, 22 of Hispanic descent, and 24 of Asian descent (12 Chinese and 12 Japanese). We counted the number of SNPs found in each gene. In summary, we identified 26,114 SNPs in the African sample, 15,116 in the European sample, 16,644 in the Hispanic sample, and 14,407 in the Asian sample. We used the ENCODE data to predict 18,224 SNPs in the African sample, 14,147 in the European sample and 12,036 in the Asian sample (Table 2). As with the SeattleSNPs dataset, greater diversity in environmental response genes is noticeable in the NIEHS SNPs dataset compared with the predictions based on the ENCODE data, especially in the African sample. In this context, it is perhaps interesting to mention that the MLEs of a for the four populations were substantially smaller than the corresponding estimates in the ENCODE dataset: â_{African} = 0.036, â_{European} = 0.076, and â_{Asian} = 0.04 (unlike the MLEs for b, which are similar between the two datasets). This observation can be interpreted as saying that on average variants in the regions sequenced in the NIEHS SNPs dataset have lower frequencies than those in the ENCODE dataset. Also, the Japanese population shows less variation than the Chinese population, as we previously found with the ENCODE dataset (SI Appendix).
2.4. CrossValidation.
Validation of the method using different datasets has inherent problems, and discrepancies may simply be a result of different underlying characteristics for the datasets, such as sequence data from different genomic regions, different sequencing accuracies, etc. To avoid these pitfalls, we tried to assess the performance of the method, using the same data. To this end we performed crossvalidation for the African, European, and Asian populations in both the ENCODE and the NIEHS datasets. We randomly split each sample into a training set and a testing set of roughly equal sizes, a thousand times. For ENCODE, the average error in prediction is between 2 and 5% for the three populations, whereas for NIEHS, the average error is between 8 and 12% (see Tables 3 and 4).
3. Discussion
In this article, we proposed a method to predict the number of new, i.e., not yet seen, variants with a specified minimum frequency to be identified in a new dataset, comprising sequence data for a set of individuals. In addition, our method also allows prediction of the minimum number of individuals necessary to sequence in order to detect at least a specified fraction of all the variants. This is particularly relevant, because, in the context of genomewide association studies, once a genomic region has been identified to be associated with disease, subsequent sequencing studies are necessary to determine the causal variant.
We applied the approach to three sequence datasets: ENCODE, SeattleSNPs, and NIEHS SNPs datasets. The results are consistent with the expected diversity pattern, with the African population having the most number of SNPs, the Asian populations the least, and the European population being inbetween. Notably, both the ENCODE and NIEHS SNPs datasets support the hypothesis that the Japanese are less diverse (in terms of number of variants) than the Chinese. The number of individuals necessary to capture all common variation (frequency at least 1%) is small and the 1000 Genomes Project is likely to find most of them, subject to sequence accuracy. Even for the rarer variants (frequency at least 0.1%), a large proportion of them can be found with small samples (in the low hundreds), but to find all of them, thousands of individuals are necessary.
Our method is based on a parametric assumption, namely the beta distribution as an approximation of the frequency distribution for biallelic variants. As we mentioned previously, this approximation is reasonable following populationgenetics arguments and has been frequently used in literature (see for example ref. 16; see also the SI Appendix for goodnessoffit results for the ENCODE and NIEHS SNPs datasets). Simulation results (see SI Appendix) show that our predictions are accurate as long as the Beta model holds; the accuracy does decrease as t increases toward infinity and as the frequency threshold f decreases toward 0. As argued in ref. 7, as t → ∞ and f → 0, Δ_{f}(t) is going to be an underestimate of the true number of variants. This in turn implies that the estimated number of individuals necessary to sequence to capture a certain fraction of the variation with minimum frequency f is going to be an underestimate, especially when f → 0; the bias gets smaller as we restrict attention to more common variants.
The Beta assumption is inherently a simplification of reality. Deviations from this model, for example, due, to natural selection, will affect our predictions. In the regions sequenced in the NIEHS dataset, there is an excess of rare variants. As expected in this situation, our predictions represent underestimates of the real numbers. Indeed, crossvalidation within the NIEHS dataset shows that our method systematically underestimates the true number of variants by ∼8–12% on average. By contrast, crossvalidation within the smaller ENCODE dataset leads to a smaller average error (2–5%), with no systematic bias. It is important to emphasize that the sample sizes in our examples are fairly small (a few individuals for the ENCODE dataset), and that the prediction error is expected to become smaller as the original number of sequenced individuals becomes larger. In addition to crossvalidation, we also predicted the number of variants in the two datasets, SeattleSNPs and NIEHS SNPs, based on the ENCODE data. Overall, the predictions for the European and the Asian sample were ∼7–17% lower, whereas for the African sample the prediction was ∼40% lower. Although crosssample and crossregion predictions have many caveats, here the underestimation in our predictions is, in part, likely due to the expected higher diversity in environmental response genes compared with the average genome. Moreover, the much larger prediction error for the African population may be explained by African people being even more diverse in these regions, compared with European and Asian people, as a result of geographically localized patterns of selection. This is especially plausible because the crossvalidation error in the NIEHS dataset for the African sample was much smaller, i.e., ∼8%, and not different for the African population compared with the European or Asian populations.
Future extensions of this method include incorporation of selection effects, sequencing errors, and nonparametric approaches. The question is whether it is possible to remove the parametric (beta model) assumption on the frequency distribution and recalculate Δ(t) in a nonparametric fashion. It would then be interesting to assess how the nonparametric approach compares with the parametric approach we developed in this article.
Although we applied the approach to SNP data, the method applies equally well to counting other types of variants, including copynumber variants. This is particularly useful because currently much less is known about copynumber variants than about SNPs.
Acknowledgments
We thank two reviewers for comments that helped improve the manuscript. This work was supported by National Institutes of Health/National Institute of Mental Health R01 MH59532.
Footnotes
 ^{1}To whom correspondence should be addressed. Email: iionita{at}hsph.harvard.edu

Author contributions: I.I.L. and N.M.L. designed research; I.I.L. and N.M.L. performed research; I.I.L. and C.L. contributed new reagents/analytic tools; I.I.L., C.L., and N.M.L. analyzed data; and I.I.L. and N.M.L. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0807815106/DCSupplemental.

↵* Detailed derivations are shown in SI Appendix.

↵† For the YRI population, data on only 8 individuals were available, and similarly for the CHB population, data on only 7 individuals were available.

↵‡ We report these predicted numbers in a table in the SI Appendix.
 © 2009 by The National Academy of Sciences of the USA
References
 ↵
 ↵
 Duerr RH,
 et al.
 ↵
 Herbert A,
 et al.
 ↵
 ↵
 ↵
 Sebat J,
 et al.
 ↵
 Efron B,
 Thisted R
 ↵
 Wright S
 ↵
 ↵
 ↵
 ↵
 ↵
 Wallace C,
 Dobson RJ,
 Munroe PB,
 Caulfield MJ

 Smith AV,
 Thomas DJ,
 Munro HM,
 Abecasis GR
 ↵
 Nickerson DA,
 Rieder MJ,
 Crawford DC,
 Carlson CC,
 Livingston RJ
 ↵
 Coram M,
 Tang H
Citation Manager Formats
Sign up for Article Alerts
Article Classifications
 Physical Sciences
 Statistics
 Biological Sciences
 Genetics