# Inference of complex population histories using whole-genome sequences from multiple populations

^{a}Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637;^{b}Department of Human Genetics, University of Chicago, Chicago, IL 60637;^{c}Department of Statistics, University of California, Berkeley, CA 94720;^{d}Chan Zuckerberg Biohub, San Francisco, CA 94158;^{e}Computational Biology Graduate Group, University of California, Berkeley, CA 94720;^{f}Computer Science Division, University of California, Berkeley, CA 94720

See allHide authors and affiliations

Edited by Elizabeth A. Thompson, University of Washington, Seattle, WA, and approved July 10, 2019 (received for review March 26, 2019)

## Significance

An increasing number of population genomic studies now try to infer complex models of population history using a number of whole-genome sequences sampled from multiple populations. A key technical challenge to this effort is to compute model likelihoods, which involves integrating out latent variables (genealogical histories) that live in extremely high dimensions. This is a notoriously difficult computational problem, especially when the sample size is greater than a handful and the underlying population genetic model is complex. Here, we present an efficient, flexible statistical method that can scale to larger sample sizes and more populations than previously possible. Aside from demographic inference, our method can be used in other statistical inference problems in evolutionary biology and human genetics.

## Abstract

There has been much interest in analyzing genome-scale DNA sequence data to infer population histories, but inference methods developed hitherto are limited in model complexity and computational scalability. Here we present an efficient, flexible statistical method, diCal2, that can use whole-genome sequence data from multiple populations to infer complex demographic models involving population size changes, population splits, admixture, and migration. Applying our method to data from Australian, East Asian, European, and Papuan populations, we find that the population ancestral to Australians and Papuans started separating from East Asians and Europeans about 100,000 y ago, and that the separation of East Asians and Europeans started about 50,000 y ago, with pervasive gene flow between all pairs of populations.

Whole-genome sequences are now routinely available for population genetic analyses, and inference methods that can take better advantage of genome-scale data have received considerable attention in recent years. In particular, there has been much interest in methods that can use the genomic data of individuals from multiple populations to infer complex models of population history. In addition to being of historical interest, population demography is important to study because it influences patterns of genetic variation, and understanding the intricate interplay between demography and other evolutionary forces such as natural selection is a major aim in population genetics.

Inferring these demographic histories is computationally and statistically challenging. One class of methods (1⇓⇓⇓⇓⇓⇓–8) based on the sample frequency spectrum (SFS) is computationally efficient but ignores linkage information, and the minimax rate of convergence for such estimators is poor (9). Also, their utility is limited by the fact that the number of model parameters that can theoretically be identified using the SFS alone is bounded by the sample size (10). Although no identifiability results currently exist for the general case, methods (13⇓⇓⇓⇓⇓⇓⇓–21) that take linkage structure into account are empirically more statistically efficient and can be used to infer models with many parameters even when the sample size is small.* This is of practical importance, since an increasing number of studies now seek to infer complex demographic models involving multiple populations using only a small number of individuals sampled from each population (e.g., refs. 22⇓⇓–25). A popular demographic inference method of this kind is PSMC (pairwise sequentially Markovian coalescent) (13), which uses a pair of sequences to infer piecewise-constant population size histories. Its extension, MSMC (multiple sequentially Markovian coalescent) (18, 19), can use sequences sampled from a pair of populations to infer a genetic separation history in addition to population size changes. A more recent method called SMC++ (20) can scale to hundreds of individuals, but it is able to analyze individuals from only a pair of populations that have diverged without subsequent gene flow.

Parallel to these developments, an inference method called diCal (Demographic Inference using Composite Approximate Likelihood) (16) was introduced to infer piecewise-constant effective population size histories using multiple sequences, thereby providing improved inference about the recent past. The key mathematical component of diCal is the conditional sampling distribution (CSD)

Here, we present our method diCal2, a scalable inference tool for population genomic analysis under general demographic models, which extends diCal in several ways. diCal2 has been successfully applied in recent empirical studies of human demographic history using whole-genome sequencing data (22, 24, 25). In the present paper, we provide a detailed description of the method, carry out a simulation study to benchmark its performance, and discuss its strengths and weaknesses.

To handle gene flow between populations, diCal2 builds on previous theoretical work (26) which introduced a CSD for subdivided populations with unchanging continuous migration; that earlier work did not address parameter estimation, which is the focus of this article. In contrast to MSMC, which does not explicitly model population structure, we consider fully parametric demographic models, including subdivided population structure with migration, that are easier to interpret. Our method also enables inference under demographic models more general than the 2-population clean-split model currently implemented in SMC++. Specifically, our method is flexible enough to model 1) an arbitrary number of populations specified by the user, 2) an arbitrary pattern of population splits and mergers, 3) more general population size changes (e.g., piecewise-exponential), 4) arbitrary migration patterns with time-varying continuous migration rates or pulse admixture events, and 5) an arbitrary poly-allelic mutation model at each site (including diallelic or tetraallelic). These are significant improvements on the previous version of diCal (16), which could only be used to infer piecewise-constant population size changes in a single population.

In addition to these features, we introduce major computational improvements which enable the use of whole-genome data. The mathematical details of our method and the computational extensions are provided in *Materials and Methods* and *SI Appendix*, *SI Text*. Below, we briefly highlight the key technical advances.

In PSMC and the earlier version of diCal, the demographic epochs and HMM discretization intervals are both fixed, and the latter forms a strict refinement of the former. In contrast, discretization intervals and demographic epochs are decoupled in our improved version of diCal. For example, population size change points or population split times can vary freely and do not need to coincide with discretization interval boundaries. This flexibility allows for more accurate parameter estimation, especially for population split times.

Moreover, the CSDs for different haplotypes can be combined in various ways to devise a composite likelihood that can be used in a maximum likelihood framework for parameter estimation. Our implementation of the expectation–maximization (EM) algorithm allows any composite likelihood that is composed of sums and products of CSDs, which includes the product of approximate conditionals (PAC) used by Li and Stephens (27) to detect recombination hotspots.

For substantial computational speedup, we implement a previously described “locus-skipping” algorithm (28), which analytically and exactly integrates over contiguous stretches of nonsegregating loci. However, locus skipping is less computationally efficient with missing data, and thus, to efficiently incorporate missing alleles, we also implement an alternative speedup by grouping loci together into larger blocks. A similar speedup was used in PSMC, but it treats the whole block as a single diallelic site. In contrast, the blocks in our method are viewed as full nonrecombining haplotypes.

For complex demographic models, the likelihood function may have local optima. To address this issue, we implement a flexible genetic algorithm and combine it with the EM procedure to enable more efficient navigation of high-dimensional parameter space.

Lastly, we also present an application of our method to data from the Simons Genome Diversity Project (SGDP) (23) to investigate the population history of Australians, East Asians, Europeans, and Papuans. There has been some debate whether the population ancestral to Australians and Papuans (which we call Australo-Papuans, following ref. 29) split off prior to the divergence of East Asians and Europeans (e.g., ref. 29), or whether East Asians and Australo-Papuans first split from Europeans (e.g., ref. 30). We find substantial evidence in favor of the former hypothesis, but find that there has been pervasive gene flow between all of these populations since their divergence.

## Results

To demonstrate the flexibility, accuracy, and efficiency of our method, we performed an extensive simulation study under a variety of biologically relevant demographic scenarios. DNA sequence data were simulated using the software scrm (31). We simulated 100 datasets for each demographic scenario and set the haplotype length to 250 Mbp for each dataset. We used

### Recent Exponential Growth.

The first model we investigated involves recent exponential population growth. To investigate the performance of our method, we simulated data consisting of 10 haplotypes under the demographic model depicted in Fig. 1*A*. We fixed

We used the leave-one-out composite likelihood (LCL) in our EM procedure combined with a genetic algorithm to estimate all 5 parameters of the demographic model. For the genetic algorithm, we chose 50 random starting points that were each optimized for 5 EM iterations. Then we chose the 5 best parameter values (“parents”) and replaced each of them with an average of 3 “offspring” parameter sets to obtain the next “generation.” These were then optimized for 5 EM iterations. We repeated this procedure for 4 more “generations,” and reported the parameters that achieved the overall maximal likelihood value. We found that the results are robust to the choice of composite likelihood scheme.

Violin plots representing the accuracy of the inferences are shown in Fig. 2*A* and *SI Appendix*, Fig. S1. Analysis of the simulated data shows that, in these scenarios, all parameters are estimated with little variability. However, the results indicate that the estimate of the exponential growth rate is biased upward. This bias is somewhat counterbalanced by a slight downward bias of the time when growth starts, and the population size before growth starts. In fact, the estimates lead to very accurate contemporary population sizes. We note that it is possible to empirically correct for biases in applications via simulation. Furthermore, using more sequence data for each individual reduces the variability of the estimates. We stress that our method accurately estimates recent exponential growth rates using only 10 haplotypes. This is far less than the sample size (thousands to tens of thousands) required by SFS-based methods to get reasonable estimates; see Bhaskar et al. (5) and references therein.

Note that, in these and the following simulations, the ancestral population size

### Population Split.

We also investigated a model of a past population split, depicted in Fig. 1*B*. This model allows for a bottleneck before the populations split, and subsequent gene flow following the split. We first focused on the case with no gene flow, i.e., with migration probability

We simulated datasets with 2 haplotypes in each of the extant populations. We simulated 100 datasets each for

Fig. 2*B* and *SI Appendix*, Fig. S2 show the accuracy of the estimator. These empirical results demonstrate that our method is able to estimate the parameters in this clean-split model with high accuracy. Most parameters show little bias, and the empirical distributions are very narrow. Only the estimates of the extant population sizes

### Isolation with Migration.

We also investigated the demographic model shown in Fig. 1*B* allowing for positive gene flow after the ancestral population splits into 2. We set the migration probability to

Fig. 2*C* and *SI Appendix*, Fig. S3 show the accuracy of the estimator. In both scenarios, we used the pairwise composite likelihood. Again, most parameter estimates show little bias or variability, the exceptions being

### Three-Population Model.

Our method can handle models with more than 2 extant populations each with several haplotypes. To test the accuracy of our method in this setting, we simulated data under the model depicted in Fig. 1*C* relating 3 extant populations. Under this model, an ancestral population of size *D* shows the accuracy of our method. Again, the empirical distribution of the estimates shows little bias or variability.

### Application to SGDP Data.

We used our method to investigate the pattern of population splits between Australians, East Asians, Europeans, and Papuans. There has been some debate about the relative ordering of population splits; specifically, there has been competing evidence about whether East Asians and Europeans split most recently (e.g., ref. 29) or whether Australo-Papuans and East Asians split most recently (e.g., ref. 30). To date these splits, we used Australian, French, Han, and Papuan individuals from the SGDP (23) and fit models for each of the 6 possible pairs of these populations, allowing for recent population size changes and pulse admixture. The model is depicted in Fig. 3, and additional details are given in *Materials and Methods*. The estimates of the divergence time

We also found evidence of pervasive recent gene flow. In particular, we found pulse admixture proportions of 15 to 26% between each pair of populations, all occurring 5 to 20 ka. We note that our model cannot capture all of the intricacies of human demographic history: There has likely been continuous gene flow between all populations punctuated by a few mass migrations. Our gene flow estimates likely attempt to capture both of these modes simultaneously along with indirect gene flow through intervening populations. While it is unlikely that about a quarter of any population was replaced by a geographically distant population, our results suggest that, since their divergences tens of thousands of years ago, these populations have exchanged a considerable number of migrants.

Additional details about the data, data processing, and parameter settings for our method are presented in *Materials and Methods*. All parameter estimates, bootstrap results, and measures of goodness-of-fit evaluated using cross-coalescence rate curves (CCRs) may be found in *SI Appendix*, Fig. S4 and Table S1.

There is also a separate debate about the number of out-of-Africa events (23, 29, 32), with some studies suggesting that a second, earlier wave left traces of ancestry specifically in Australo-Papuans. We caution against interpreting our results in this context, since directly testing this hypothesis would require explicitly including African populations in the analysis. Moreover, our estimates of *SI Appendix*, Fig. S4 and similar curves in refs. 23 and 29 show, the populations involved already exhibited a substantial degree of structure 100 ka. Representing this complex population history by a simple model with a single pulse admixture event after splitting as in our analysis, or by a single estimate for the divergence time as in refs. 23 and 29, is certainly an oversimplification that omits relevant details. Lastly, we do not explicitly model the excess traces of Denisovan ancestry that are found in Papuans (33), which may cause differences in the estimated divergence times.

## Discussion

The results described above demonstrate that our method can efficiently and accurately estimate demographic parameters in biologically relevant scenarios. Our method has recently been used to study the history of Native American peoples (22, 24, 25), due to the flexible framework underlying the method, enabling the consideration of a wide range of population histories.

A limitation of our method is that it relies on the haplotype structure of the sample, and thus requires phased data. Phasing errors can lead to biased parameter estimation; see supplementary information, section S7, specifically table S13, of ref. 22 for a simulation study that explores the bias when estimating divergence times. Moreover, note that the simulations presented above were performed using homogeneous recombination and mutation rates along the genome. The EM procedure underlying our method could be modified to accommodate heterogeneous rates without severely impacting the runtime. If the correct rates are specified, we do not expect the accuracy of inference to be adversely affected. If the correct rates are not known, it would also be possible to adjust the method for joint inference, or to use rates obtained from alternative approaches (34).

Aside from demographic inference, we note that our method can be used in other population genetic problems of interest, such as model selection (see, for example, supplementary information 18.4 of ref. 24). Furthermore, the posterior decoding of latent variables in our CSD can be used in detecting admixture tracts (35), estimating fine-scale recombination rates in admixed individuals, distinguishing ancestral and introgressed polymorphism, and detecting incomplete lineage sorting. Also, applying our CSD in methods for phasing genotypes, imputing missing sequence data, and detecting identity-by-descent tracts (36) would make it possible to properly account for demography or infer it simultaneously, thus potentially improving accuracy. Lastly, it is straightforward to incorporate temporal samples (ancient DNA sequences) into our method (24), which leads to further interesting applications.

## Materials and Methods

Here, we briefly describe our method, a composite likelihood framework to estimate demographic parameters using EM. Further details are provided in *SI Appendix*, *SI Text*. We also describe our analysis of the SGDP data.

### Demographic Inference Using diCal2.

A central building block of our method diCal2 for demographic inference is the CSD *SI Appendix*, section 1. The CSDs presented in Steinrücken et al. (26) and Sheehan et al. (16) can be obtained as special cases of the model presented here.

The CSD can then be used to define composite likelihood functions, which, in turn, enable us to perform maximum composite likelihood inference of the demographic parameters, Θ. We can use any such composite likelihood function that is composed of sums and products of CSDs, for example, the PAC framework which has been used successfully by Li and Stephens (27) to infer recombination hotspots.

To find the parameter values that maximize this composite likelihood, we employ the composite likelihood in the standard EM framework (39). While, in principle, all parameters of the model can be inferred, we focus on the demographic parameters, Θ. Since it is not possible to derive a closed form solution for the maximum in the maximization step in general, we employ numerical optimization schemes, like the Nelder–Mead simplex algorithm (40), to efficiently determine the requisite maximum. We provide mathematical details for the implementation of the EM algorithm in *SI Appendix*, section 3.

In *SI Appendix*, section 4, we provide details on the implementation of the “locus-skipping” algorithm, and the alternative speedup that groups loci into larger blocks. Furthermore, in *SI Appendix*, section 5, we provide mathematical details of the modifications to the trunk genealogy to increase accuracy. Finally, we describe, in *SI Appendix*, section 6, how to employ a discretization for the HMM computations that differs from the partition induced by the demography and remains fixed throughout the optimization procedure.

### Runtime.

The runtime of our implementation of the EM algorithm is linear in the number of haplotypes times the number of CSDs in the composite likelihood and quadratic in the number of populations involved. The E step depends linearly on the length of the haplotypes, whereas the M step is independent of this quantity. The exact complexity and runtime of parameter estimation depends on the composite likelihood used, the details of the genetic algorithm, and the number of parameters to estimate. The analyses of the simulated data presented in this section were performed on a cluster of AMD Opteron processors. The raw sequential runtime of analyzing a single dataset averaged 100 to 120 CPU hours, but, by taking advantage of the independence structure of the composite likelihood and the genetic algorithm, we were able to decrease the runtime to an average of 15 to 20 wall clock hours, using up to 16 cores in parallel. The one exception was the 3-population model, where the parallelized version took, on average, 70 wall clock hours, due to the more complex demographic model and the increased number of haplotypes.

### SGDP Analysis.

For the analysis of the SGDP data, we used the following individuals: B_Australian-3, B_Australian-4, S_French-1, S_French-2, B_French-3, S_Han-1, S_Han-2, B_Han-3, S_Papuan-1, S_Papuan-3, and B_Papuan-15. The data were phased using Shapeit (41) with read-based phasing (phased data provided by I. Mathieson, Department of Genetics, University of Pennsylvania, Philadelphia, PA), and all sites in Heng Li’s 75-bp universal mask (23) were treated as missing. As in our simulations, we used a mutation rate and recombination rate of *SI Appendix*, section 4), and discretized time with 8 log-uniformly spaced break points between 1.5 ka and 5 Ma. Each genome-wide analysis of a pair of populations with combined sample size 10 to 12 took ∼90 to 145 wall clock hours on AMD Opteron processors, using up to 10 cores in parallel and less then 30 GB of memory.

As seen in the simulations, the raw inferred parameters may be biased. To address this issue and infer confidence intervals, we performed a parametric bootstrap using msprime (42). For each pair of populations, we simulated 10 full genome datasets, and reran our method on each of these datasets. Our reported estimates are “debiased” estimates, obtained by subtracting the estimated bias from our raw estimates. We then used the bootstraps to estimate a SD for each parameter, and reported confidence intervals based on a normal approximation (i.e., the debiased estimate *SI Appendix*, Table S1. We also note that the bootstrap results were obtained using the same grouping of loci (2.5-kbp bins), and thus show that this procedure does not severely impact accuracy.

To assess goodness of fit, we used MSMC to infer CCRs on the real data, and then on data simulated under our debiased estimates, presented in *SI Appendix*, Fig. S4. For each pair of populations, we used a single diploid from each population (B_Australian-3, S_French-1, S_Han-1, and S_Papuan-1), using all of the autosomes and again treating all sites in Heng Li’s 75-bp universal mask as missing. We simulated 5 replicates of each pair of populations to assess the variability in the MSMC CCRs. The CCRs are qualitatively similar between the real and simulated data, and the fit is quite good for a model with only 9 parameters. As discussed above, introducing additional size changes and migration rates would likely improve the fit.

### Software Availability.

The algorithms described here are implemented in a new version of the software package diCal2, which is available for download at https://sourceforge.net/projects/dical2.

## Acknowledgments

We thank Sara Mathieson and Geno Guerra for helpful discussions and for testing our software. Furthermore, we thank Iain Mathieson for helpful discussions and providing the phased SGDP data. This research is supported, in part, by National Institutes of Health Grant R01-GM094402 and a Packard Fellowship for Science and Engineering. Y.S.S. is a Chan Zuckerberg Biohub Investigator.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: yss{at}berkeley.edu.

Author contributions: M.S. and Y.S.S. designed research; M.S. and J.K. developed algorithms and software; M.S., J.K., J.P.S., and Y.S.S. performed research; M.S. and J.P.S. analyzed data; and M.S., J.K., J.P.S., and Y.S.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

↵*Whether the distribution of pairwise coalescence times uniquely determines the demographic model has been answered recently (11, 12).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1905060116/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- R. Nielsen

- ↵
- ↵
- S. Lukić,
- J. Hey

- ↵
- ↵
- A. Bhaskar,
- Y. R. Wang,
- Y. S. Song

- ↵
- ↵
- J. Jouganous,
- W. Long,
- A. P. Ragsdale,
- S. Gravel

- ↵
- ↵
- J. Terhorst,
- Y. S. Song

- ↵
- ↵
- ↵
- L. Cowen

- Y. Kim,
- F. Koehler,
- A. Moitra,
- E. Mossel,
- G. Ramnarayan

- ↵
- ↵
- ↵
- ↵
- S. Sheehan,
- K. Harris,
- Y. S. Song

- ↵
- ↵
- ↵
- K. Wang,
- I. Mathieson,
- J. O’Connell,
- S. Schiffels

- ↵
- ↵
- ↵
- M. Raghavan et al.

- ↵
- ↵
- ↵
- J. V. Moreno-Mayar et al.

- ↵
- ↵
- N. Li,
- M. Stephens

- ↵
- ↵
- ↵
- J. D. Wall

- ↵
- ↵
- L. Pagani et al.

- ↵
- ↵
- J. P. Spence,
- Y. S. Song

- ↵
- ↵
- ↵
- ↵
- ↵
- A. P. Dempster,
- N. M. Laird,
- D. B. Rubin

- ↵
- ↵
- O. Delaneau,
- B. Howie,
- A. Cox,
- J. F. Zagury,
- J. Marchini

- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Population Biology