Fagundes et al. 10.1073/pnas.0708280104.
Fig. 3. Empirical distributions of the estimated relative probabilities of the AFREG (solid line), MREBIG (dashed line), and ASEG (dotted line) models when they are the true models. To check the efficiency of the model choice procedure for the three best models, we simulated 1,000 data sets under the AFREG, MREBIG, and ASEG model, by drawing random parameter values from the priors. The relative probabilities of these three models were then estimated under our model selection procedure, as explained in Materials and Methods. The area under the curve on the right of the vertical line represents the fraction of times the true model is recovered (relative probability >0.5) by our estimation procedure, which is 79.3% for the AFREG, 80.1% for the MREBIG model, and 50.3% for the ASEG model.
Fig. 4. Empirical distributions of the estimated relative probabilities of the AFREG model when the AFREG (solid line), MREBIG (dashed line), and ASEG (dotted line) models are the true models. Here, we simulated 1,000 data sets under the AFREG, MREBIG, and ASEG models by drawing random parameter values from the priors. The density estimates of the three models at the AFREG posterior probability = 0.781 (vertical line) were used to compute the probability that AFREG is the correct model given our observation that PAFREG = 0.781. This probability is equal to 0.817.
Fig. 5. Posterior (thick line) and prior (thin line) distributions of the estimated parameters of the AFREG model. The duration of the bottlenecks are not reported as they were considered here as nuisance parameters. Parameter labels (x axis) are described in Table 1.
Fig. 6. Distribution of Tajima's D obtained from 5,000 simulations by using the median estimates of each model: Africa (solid line), Asia (dashed line), and America (dotted line). The AFREG and ASEG models lead to distributions shifted toward more negative values in Africa than in Asia, matching well with current observations for both nuclear and uniparentally inherited markers. In contrast, the MREBIG model leads to more similar distributions between Africa and Asia and a clear shift toward positive values for uniparentally inherited markers.
Fig. 7. Graphical presentation of the eight alternative models of human evolution tested in our study and their associated sets of parameters. In parameter acronyms, "N" represents population sizes, "T" represents the timing of some events, "M" represents migration rates, and "Db" is for the duration of a bottleneck period. (A) African replacement models with instantaneous (AFRIG) or exponential growth (AFREG). (B) African origin with assimilation models with instantaneous (ASIG) or exponential growth (ASEG). (C) Multiregional evolution models with a single population size for archaic and modern populations (MRE1S), two populations sizes related to archaic and modern populations (MRE2S), a bottleneck in Africa and instantaneous growth for modern populations (MREBIG), and with a bottleneck in Africa and exponential growth for modern populations (MREBEG). See Introduction and Materials and Methods for further justification of these models. Values for these parameters are shown in SI Table 2.
Fig. 8. Comparison of posterior probabilities of the number of components in a mixture model, as initially described in Richardson and Green (6). The result of the regression procedure based on summary statistics and described above is almost as discriminative as the RJ-MCMC method using all of the data and performs better than the rejection method described in Pritchard et al. (4).
SI Text
Sample Information. We sampled one individual in 12 different Native American populations. All individuals came from Central and South American populations that belong to the Amerind linguistic phyla (1). The populations sampled and their linguistic classifications (following ref. 1) are: Aché (Equatorial, Kariri-Tupi), Arara (MacroCarib, Carib), Bribri (Chibchan, Talamanca), Guatuso (Chibchan, Rama), Guaymi (Chibchan, Guaymi), Kuben-Kran-Kegn (MacroGe, Cayapo), Lengua (MacroPanoan, Lengua), Quechua (Andean, Quechua), Tiryio (MacroCarib, Carib), Waiwai (MacroCarib, Carib), Xavante (MacroGe, Ge-Kaingang), and Zoró (Equatorial, Kariri-Tupi). This sampling scheme was used to replicate the sampling strategy of Yu et al. (2), designed to get a general picture of the genetic diversity at the continental level with a limited sample size. To our knowledge, this is the largest multilocus sequence data set available for Native Americans in which the same panel of individuals has been studied for all loci.
Both African and Asian samples in (2) include one individual taken in 10 different populations. Sampled populations in Africa comprise Biaka Pygmy, Mbuti Pygmy, Ghana, Kikuyu, !Kung, Luo, Youruba, River, South African Bantu, and Zulu. Sampled Asian populations are Cambodia, Northern China, Southern China, Han Taiwanese, Japanese, Mongolian, Vietnamese, and Yakut, as well as two Indian samples (Bengal and Punjab). However, because the Indian subcontinent has been recently colonized by Indo-Europeans, to which West-Asians are genetically most similar (e.g., 3) we decided to exclude these Indian individuals from the analysis to avoid incorporating additional parameters in our scenarios. For the same reason, we did not include 10 European individuals also sequenced for the same 50 loci (2), because we wanted to keep a relatively simple simulation scheme and avoid estimating additional parameters necessary for describing the settlement of Europe.
Primers. New amplification primers designed by us are:
T1469-F 5'CTTGAACATTTGCTTCTTGGA
T1469-R 5'CTGGTGTCTGTGTCTGTCTCC
T151-F 5' AACCCCATGTTTTCCTGGGAA
T151-R 5'GATATGCTCTTACTTGGCTGCAC
T812-F 5'CTGTAAATGTGTGTACTTTTTCAACGTG
T812-R 5'GAGGAGCCTGCCACCAACA
T1386-F 5'TTTGGTGAAAAGGGCACCCTAGAT
T1386-R 5'AATGAGAAAACCAAACCTCAAGAAG
T864-F 5'TCTCTTATCAAACTAGCTAAATTT
T864-R 5'GCTGATTGGGGAGCCTCTATTTTCT
Model Selection Procedure. The ABC method readily lends itself to model selection. The simulations give samples from the joint distribution P(F, S) of parameters F, and summary statistics S. The summary statistics, considered alone irrespective of their associated parameter values, are sampled from their marginal
distribution, P(S), or marginal likelihood. For a particular data set with summary statistics, S*, given the marginal distributions,
and
, with priors
and
, we can compute a posterior probability for, say, model M1 as
.
Pritchard et al. (4) suggested that
could be approximated by the proportion of simulated points in which
. Beaumont (5) has suggested that an alternative approach is to sample the model indicator (i.e.,
for models M1, ..., Mm) from the prior, p(M), and treat it as a categorical random variable, Y, in the ABC simulations. We can then apply categorical regression to estimate P(Y = y | S = S*), where y = 1...m is the indicator for model My. Specifically, we estimate the coefficients b in
using the VGAM package implemented in R. We make the regression estimate locally around S* in the same way as in the standard
regression approach - i.e., only the points within the tolerance interval d are retained, and these are weighted by an Epanechnikov kernel that has a maximum when S = S* and is 0 when
. A fitted object is created using the vglm() function, and
is computed using the predict()R function. The full R function calmod()that was written to perform this is available at http://www.rubic.reading.ac.uk/~mab. This method estimates the posterior probability of the model directly, avoiding the need to approximate multivariate
densities. A comparison of this method with standard rejection, using data from population genetic simulations of populations
diverging with drift and migration, suggested that the regression approach for model selection in ABC was substantially more
efficient than the use of rejection (5).
We illustrate the model selection procedure by applying it to the enzyme data analyzed using a Gaussian mixture model by Reversible
Jump MCMC (RJMCMC) in Richardson and Green (6). The aim is to compute the posterior distribution of the number of components,
k, in the mixture. In their model
, where wi is the mixture proportion and has a Dirichlet (1,...,1) prior, and mi and si2 are respectively the mean and variance of a Gaussian distribution for the ith component. The priors for mi and si2 are described in detail in Richardson and Green (6). We simulated 100,000 data sets of 245 observations, as reported in Richardson
and Green (6) using the priors specified in the text and in the legend of Table 1 in their paper. We measured the following
eight summary statistics: (1) the Kolmogorov-Smirnov 1-sample statistic for fit against a Gaussian distribution with mean
and variance equal to the sample mean and variance in the simulated data; (2) the square of this value; (3-4) the first and
fourth quantile of the 245 observations; (5) the sample standard deviation; the quantiles at points i/20, i = 1.19 were computed, and then the difference between the ith and (i-1)th quantile (i = 2...19) was computed; and then (6-8) the first and fourth quintile, and the sample standard deviation of this distribution
were computed. We make no claim that these summary statistics are in any sense optimal and were arrived at by some trial and
error. The justification for the complicated procedure to obtain the last three summary statistics is the intuition that variation
in inter-quantile distances should be related to "lumpiness" in distributions. Clearly there is a choice in the number of
quantiles used, but no effort was made to explore this further.
The posterior probabilities for k, the number of components in the mixture, are given below in SI Fig. 8. For comparison the original RJMCMC results, which use all of the information in the data, are given. It can be seen that the ABC regression results are very similar to but somewhat broader than the RJMCMC results. Indeed, they conform to the commonly observed characteristic of ABC when compared with methods that use full-likelihood of somewhat less certainty in the parameter estimates, but no strong bias or marked discrepancy. It can also be seen that the regression-based posterior probabilities are more accurate than those based simply on rejection, in accordance with the earlier observation in Beaumont (5).
1. Greenberg JH (1987) Language in the Americas (Stanford Univ Press, Stanford, CA).
2. Yu N, Chen FC, Ota S, Jorde LB, Pamilo P, Patthy L, Ramsay M, Jenkins T, Shyue SK, Li WH (2002) Genetics 161:269-274.
3. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Science 298:2381-2385.
4. Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW (1999) Mol Biol Evol 16:1791-1798.
5. Beaumont MA (2007) in Simulations, Genetics and Human Prehistory - A Focus on Islands, eds Matsumura S, Forster P, Renfrew C (McDonald Institute Monographs, Uni of Cambridge, Cambridge, UK).
6. Richardson S, Green PJ (1997) J R Stat Soc B 59:731-792.