Effects of the population pedigree on genetic signatures of historical demographic events
Edited by John C. Avise, University of California, Irvine, CA, and approved April 19, 2016 (received for review February 13, 2016)

Abstract
Genetic variation among loci in the genomes of diploid biparental organisms is the result of mutation and genetic transmission through the genealogy, or population pedigree, of the species. We explore the consequences of this for patterns of variation at unlinked loci for two kinds of demographic events: the occurrence of a very large family or a strong selective sweep that occurred in the recent past. The results indicate that only rather extreme versions of such events can be expected to structure population pedigrees in such a way that unlinked loci will show deviations from the standard predictions of population genetics, which average over population pedigrees. The results also suggest that large samples of individuals and loci increase the chance of picking up signatures of these events, and that very large families may have a unique signature in terms of sample distributions of mutant alleles.
The degree to which a sample may be considered representative of a population is a fundamental question in any application of statistics. In the complicated world of evolutionary and population genetics, where it is sometimes not even clear which aspects of ancestry or data should be modeled as random processes, questions of this sort assume greater significance still, and simple mistakes can have drastic effects on inference. These issues are brought to the fore in the field of phylogeography, which was first developed by Avise and colleagues in the 1980s after the introduction of genotyping technologies into evolutionary biology and which takes as its starting point the fact that hierarchical patterns of genetic variation contain information about the locations of populations and species in the past, as well as their relative population sizes and other factors of biological interest (1).
The core debate about randomness in the subsequent development of phylogeography was about whether individual gene genealogies should be treated as outcomes of highly variable random processes, which need to be modeled, or as simple observations from which conclusions about the past may be drawn more or less directly (2–6). There will be cases in which the size and shape of a single gene genealogy contain substantial information about population-level or intraspecific ancestry but, as noted in a recent review (7), this debate has come down on the side of modeling. The reasons for this are that gene genealogies are in fact the results of random processes, likely at the population level but certainly at the level of Mendelian genetic transmission, and that it is not known a priori whether a given set of data comes from one of those cases in which gene genealogies are individually informative (8–10). Although this particular issue may be considered settled, debates about the proper application of random models in phylogeography continue to arise (11, 12).
We consider an additional question about the application of random models that has received comparatively little attention either in phylogeography or population genetics. Namely, what is the extent to which genealogies in the family sense—also known as organismal pedigrees (13) or population pedigrees (14)—constrain gene genealogies and thus genetic variation? Two points distinguish this question from the initial core debate about randomness in phylogeography.
First, whereas in phylogeography the focus has been on the undesirable effects of making inferences conditional on a single gene genealogy estimated from data, here it is on the validity of inferences based on standard population-genetic models that average over population pedigrees when in fact there is only one. It turns out that in relatively large well-mixed populations with constant demography over time, the predictions of standard models are generally quite accurate even though they involve this conceptual error (13, 14). The second point is that the variation we are interested in here is variation among loci for a set of sampled individuals. Even though the population pedigree may itself be the outcome of a random process, all loci in the genome share the same pedigree. The population pedigree should thus be considered a given, fixed quantity because peculiarities of genetic variation among loci in the genome may be due to peculiarities in the pedigree.
Work on the effects of population pedigrees began in 1990 with Ball et al. (13), who made the fundamental observation that standard-model predictions for a single well-mixed population fit the distributions of pairwise measures of diversity among independent loci on a given pedigree surprisingly well. Follow-up work on subdivided populations came to similar conclusions but also illustrated that sampling small numbers of transmission pathways through a pedigree can give results quite different from corresponding standard-model predictions (15) and that pedigrees can substantially affect the probabilities of gene-tree topologies in isolation-by-distance migration models (16). These works used simulations to generate pedigrees and to model genetic transmission within each pedigree.
Chang (17) explored two key aspects of ancestry within population pedigrees analytically, proving for a population of N individuals that (i) the most recent common ancestor of all present-day individuals in the pedigree sense (i.e., an individual through which all present-day individuals are cousins) will typically be observed at
Subsequent work using both analysis and simulations has emphasized the rapid approach to equilibrium of shared ancestry in pedigrees. Reproductive values of individuals across the population (20), which are proportional to the probabilities that a genetic lineage sampled randomly today traces back to each individual in a given past generation, reach a stationary distribution on this same
Pedigrees are, of course, a mainstay of medical genetics, where they allow powerful inferences about the genetics of human disease (23). These are not population pedigrees, which cover entire populations or species for all times, but partial recent pedigrees of sampled individuals. Pedigree analyses of this sort are being applied to a growing number of natural populations, ones for which patterns of reproductive relationship are known, to disentangle the genetics of complex traits and understand patterns and consequences of inbreeding (24). Observed partial pedigrees have also been used to make inferences about recent historical demography—for example, the French settlement of Quebec (25)—directly from pedigree shape without genetics.
Population pedigrees have less frequently made their way into the models of population genetics. Beyond the examples above (13–16, 21, 22), they have been invoked to study the length distribution of admixture tracts in a descendant population (26) as well as to describe the ways in which ancestors in the pedigree sense are numerous, whereas the genetic ancestors among them are comparatively few (27, 28).
Here, we use simulations to assess the potential for two kinds of demographic events to alter the shape of population pedigrees so dramatically that they have marked signatures on genetic variation across the genome, specifically among independently segregating loci without intralocus recombination. We begin by emphasizing the assumptions of standard population-genetic models, which determine how they should be applied, and the resulting conceptual error involved in using standard models to explain variation across the genome in diploid biparental organisms. The first kind of demographic event we consider is the case of a very large family at some generation in the past. The second is the introduction and sweep through the population of a strongly advantageous mutant allele. In both cases, we ask whether data from unlinked loci will deviate from standard predictions for the same demographies without these special events. We restrict our attention to well-mixed populations. This provides a baseline set of results against which subsequent work (e.g., on geographically structured populations) may be compared.
Two Conceptually Different Random Experiments
One of the most familiar results of population genetics is the probability there will be j copies of an allele in the next generation given there are currently i copies of it in a population of N individuals,
The classic experiments of Buri (32), in which the entire evolutionary process was repeated independently a large number of times, provide the appropriate sort of data. In one experiment, Buri recorded allele frequencies of a selectively neutral mutation (
(A) Data from series I (table 13 of ref. 32) for generations 1–19. In each generation, the proportion for each allele frequency is the fraction of the total 107 populations that showed that particular frequency. Generation 0 is not depicted but would have allele frequency equal to 16 and proportion equal to 1. (B) Corresponding theoretical prediction using Eq. 1 iteratively, but with the effective population size
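The iterative use of Eq. 1 mentioned in this caption can be sketched as follows. This is a minimal illustration, assuming Eq. 1 is the standard binomial Wright–Fisher transition probability over the gene copies in the population. The function names and the choice of 32 gene copies (16 diploid individuals, starting from 16 copies of the allele) are assumptions made for the example. The theoretical curve in the figure uses an effective population size that is not reproduced here.

```python
import numpy as np
from math import comb


def wright_fisher_matrix(n_copies):
    """Return P with P[i, j] = Pr(j copies next generation | i copies now).

    Assumes binomial sampling of n_copies gene copies with success
    probability i / n_copies (the standard Wright-Fisher model).
    """
    P = np.zeros((n_copies + 1, n_copies + 1))
    for i in range(n_copies + 1):
        p = i / n_copies
        for j in range(n_copies + 1):
            P[i, j] = comb(n_copies, j) * p**j * (1 - p) ** (n_copies - j)
    return P


def iterate_frequencies(n_copies, start_count, n_generations):
    """Distribution of allele counts in generations 0..n_generations."""
    P = wright_fisher_matrix(n_copies)
    dist = np.zeros(n_copies + 1)
    dist[start_count] = 1.0          # generation 0: all populations at start_count
    history = [dist]
    for _ in range(n_generations):
        dist = dist @ P              # one generation of genetic drift
        history.append(dist)
    return np.array(history)


# Hypothetical parameters: 32 gene copies, 16 initial copies, 19 generations.
history = iterate_frequencies(n_copies=32, start_count=16, n_generations=19)
print(history[19][[0, 32]])          # probability of loss and of fixation
```

Each row of `history` is the predicted proportion of replicate populations at each allele count, which is the quantity an experiment with many independent replicate populations estimates directly.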
Now consider another standard population-genetic prediction, in this case for the distribution of the number (
Thus, Eq. 2 is an equilibrium result that captures the balance between genetic drift and mutation. It predicts what would be observed if two sequences at a locus were sampled at random from such a population. For most organisms, it is not feasible to perform long-term experiments analogous to those of Buri (32) to create multiple replicate populations for comparison with Eq. 2 or other similar predictions. Instead, these predictions are applied to datasets of multiple loci genotyped in the same set of individuals sampled from a single population (or species). Although this type of application is conceptually wrong because the loci share the pedigree, standard-model predictions match simulated pedigree-coalescent data surprisingly well for large, well-mixed populations (13, 14).
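For reference, the balance between drift and mutation can be written out explicitly. The derivation below assumes that Eq. 2 is the standard infinite-sites result for a pair of sequences from a well-mixed population of constant size. The symbols are the usual ones and are assumptions here: T is the pairwise coalescence time in units of 2N_e generations, exponentially distributed with mean 1, and θ = 4N_e u, with u the mutation rate per locus per generation. Given T = t, the number of differences K is Poisson with mean θt.

```latex
\begin{align}
P(K = k) &= \int_0^\infty \frac{(\theta t)^k e^{-\theta t}}{k!}\, e^{-t}\, dt \\
         &= \frac{\theta^k}{k!} \cdot \frac{k!}{(1+\theta)^{k+1}}
          = \frac{1}{1+\theta}\left(\frac{\theta}{1+\theta}\right)^{k},
          \qquad k = 0, 1, 2, \ldots
\end{align}
```

Under these assumptions the distribution is geometric with mean θ, so the expected proportion of loci showing no differences is 1/(1+θ).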
An example of this standard type of application is given in table 3 of ref. 35, one of the first major SNP-typing studies in humans, which reports the numbers of loci showing zero, one, two, three, or four SNP differences between pairs of sequences at 12,027 loci ranging in length from 400 to 700 bp. Fig. 2 plots these data alongside the corresponding predictions from Eq. 2. The coalescent model in Fig. 2 and the more sophisticated one in table 3 of ref. 35, which takes variation in the lengths of loci and the mutational opportunity among loci into account, can both be rejected using a χ² test. However, it is not clear that this is due to the pedigree, because humans deviate from the assumptions of standard models in other ways (e.g., growth and population structure).
This standard type of application is assumed to be appropriate for loci that are far enough apart in the genome (on different chromosomes in the extreme case) that they assort essentially independently into gametes. Whether or not they assort independently, Eq. 2 is not the correct prediction because Eq. 2 involves the implicit assumption that the loci do not share the same pedigree. Loci on different chromosomes are independent, but only conditional on the population pedigree. They might collectively show patterns of times to common ancestry or genetic variation that depend on the specific features of the pedigree.
In fact, the population pedigree completely determines the probabilities of coalescence in any given generation. Fig. 3 shows a four-generation piece of the Spanish Habsburg royal family from a study of inbreeding in the demise of this ruling family line (36). Two alleles, one sampled from Mary of Portugal and one sampled from Philip II, would have zero chance of coalescing in the previous two generations, then a substantial probability of coalescing in past generation 3. Thus, the probability of coalescence is not constant over time, as assumed in standard models, and it may not be clear whether it should ever be equal to the familiar result
A small portion of the human population pedigree, from Alvarez et al. (36). Spanish Habsburg King Charles II, who is not shown but would be three generations below, is inferred to have had an inbreeding coefficient of
Simulations for a variety of models of reproduction show that standard predictions, such as the exponential distribution of
In what follows, we consider the effects of extreme pedigrees on distributions of time to coalescence, pairwise SNP differences, and frequencies of mutations in a sample. The results are from simulations of population pedigrees and coalescence of alleles from sampled individuals within pedigrees. In large part, our findings provide further support of the robustness of standard models that average over pedigrees but also suggest that some demographic events might leave signatures detectable in large samples of loci and individuals.
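The point that a fixed pedigree determines the probability of coalescence in each generation can be illustrated with an exact calculation on a small, made-up pedigree. The sketch below is not the simulation code used for the results. The data layout (`parents[g][i]` gives the mother and father, in generation g+1, of individual i in generation g), the function name, and the first-cousin example are all assumptions chosen for illustration. Each lineage is tracked as an (individual, gene copy) pair, so two lineages in the same individual coalesce only if they occupy the same copy.

```python
from collections import defaultdict


def coalescence_probabilities(parents, ind_a, ind_b, n_generations):
    """Exact per-generation coalescence probabilities for one gene copy
    sampled from each of two individuals, conditional on the pedigree.

    Copy 0 of an individual was inherited from its mother, copy 1 from
    its father; within the parent, either copy is transmitted with
    probability 1/2 (Mendelian segregation).
    """
    state = defaultdict(float)
    for copy_a in (0, 1):
        for copy_b in (0, 1):
            state[(ind_a, copy_a, ind_b, copy_b)] += 0.25
    probs = []
    for g in range(n_generations):
        new_state = defaultdict(float)
        coalesced = 0.0
        for (ia, ca, ib, cb), p in state.items():
            parent_a = parents[g][ia][ca]
            parent_b = parents[g][ib][cb]
            for new_a in (0, 1):
                for new_b in (0, 1):
                    q = p * 0.25
                    if parent_a == parent_b and new_a == new_b:
                        coalesced += q      # same copy in the same ancestor
                    else:
                        new_state[(parent_a, new_a, parent_b, new_b)] += q
        probs.append(coalesced)
        state = new_state
    return probs


# Hypothetical pedigree: individuals 0 and 1 in generation 0 are first
# cousins whose mothers (individuals 0 and 2 in generation 1) are sisters.
parents = [
    [(0, 1), (2, 3)],                  # generation 0 -> parents in generation 1
    [(0, 1), (2, 3), (0, 1), (4, 5)],  # generation 1 -> parents in generation 2
]
print(coalescence_probabilities(parents, 0, 1, 2))  # -> [0.0, 0.0625]
```

In this toy pedigree no coalescence is possible one generation back, and the probability two generations back is 1/16, so the per-generation probability is set by the relationships in the pedigree rather than by a constant rate.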
Pedigree Effects of a Large Family
An extensive recent study of human Y-chromosome variation (37) identified a number of descent clusters present at unusually high frequencies in Asia and inferred that these represent the genetic heritages of a corresponding number of highly reproductively successful men. It was surmised that one of these men was Genghis Khan, who had previously been suggested as the source of a particular Y-chromosome haplotype found at
We consider a hypothetical, extreme scenario based on these inferences about Genghis Khan, in which a single man has a very large number of children at generation 28 in the past. Details of our simulations are given in Materials and Methods. We present results for distributions of pairwise coalescence times among autosomal loci in a pair of individuals, assuming independent assortment but conditional on a single shared population pedigree. We also present results for pairwise SNP differences and site frequencies, for which we use
Fig. 4A shows the probabilities of pairwise coalescence, or the proportion of loci expected to coalesce, in each of the past 40 generations assuming “Genghis Khan’s” children comprise 8% of the population. There is very little coalescence in the most recent generations, 1–20, due to the strong population growth assumed, but there would still be little coalescence during this time in a population of constant size (here
Simulated distributions of coalescence times conditional on a population pedigree for the case of a large family described in the text, in which the children comprise 8% of the population in generation 27. Each panel is based on a single population pedigree and single pair of sampled individuals. (A) Only the most recent generations. (B) The whole range of coalescence times on the coalescent time scale of the ancestral population (
Looking at the same scenario over the much longer time frame relevant to coalescence, in Fig. 4B, this extra mass of coalescence probability has no discernible effect on recent coalescence (leftmost bin in Fig. 4B) now corresponding to coalescence within the recent
The situation changes when the children make up 68% of the population. Fig. 5A shows a dramatic effect even on the overall distribution of coalescence times. In this case the increase in the chance of coalescence is
Simulated distributions of pairwise coalescence conditional on a population pedigree in which the children in generation 27 comprise 68% of the population. Each panel is based on a single population pedigree and single pair of individuals sampled. (A) A plot of coalescence times, analogous to Fig. 4B. (B) The distribution of genetic variation among loci with or without the demographic event of such a very large family. Proportions are estimated based on
We also investigated the possibility there would be greater power to detect the pedigree effects of a large family using site-frequency data. We again simulated ancestries of very many loci starting from the same set of individuals sampled without replacement from the current generation, only now we sampled 1,000 individuals and followed 1,000 genetic lineages, creating pseudodata for each locus then counting the number of copies of each mutant in the sample. Fig. 6 shows these “unfolded” site-frequency distributions (42) for the case in which the children comprise 8% of the population (Fig. 6A) and in which they comprise 68% of the population (Fig. 6B).
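The final counting step, tallying the number of copies of each mutant in the sample, can be sketched as below. The input format is an assumption made for the example: a sites-by-individuals array of 0s and 1s, with 1 marking the derived (mutant) allele. The function name is hypothetical.

```python
import numpy as np


def unfolded_sfs(derived):
    """Unfolded site-frequency spectrum from a 0/1 array.

    derived[s, k] is 1 if sampled sequence k carries the mutant at site s.
    Returns counts for frequency classes 1..n-1, where n is the sample
    size; sites with 0 or n mutant copies are not polymorphic in the
    sample and are dropped.
    """
    derived = np.asarray(derived, dtype=int)
    n = derived.shape[1]
    copies = derived.sum(axis=1)                 # mutant copies per site
    counts = np.bincount(copies, minlength=n + 1)
    return counts[1:n]


# Toy data: 4 sites in a sample of 4 sequences.
example = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
])
print(unfolded_sfs(example))  # -> [2 1 1]
```

With a sample of 1,000 lineages the returned array has 999 entries, one for each possible number of mutant copies at a polymorphic site.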
Unfolded site-frequency distributions when the children of the large family comprise 8% (A) versus 68% (B) of the population. In both panels, the lines in red display results for the assumed background demography with growth but no large family and are identical in both panels, and the lines in blue show results when there is a large family. The lines in blue in B are based on 100,000 replicate loci; the others are based on 10,000 loci.
When the children make up 8% of the population, there seems to be no discernible effect on site frequencies, but a striking pattern is observed when the children make up 68% of the population. Differences appear in two parts of the distribution. First, there is a deficit of polymorphic sites at which the mutant is found in about 50–200 copies in the sample. The explanation for this is that many potential branches in the gene genealogy that would have had between roughly 50 and 200 descendants in the sample will be collapsed to zero when bunches of lineages coalesce in “Genghis Khan.” Without a large family, these branches would have positive lengths and mutations on them would produce polymorphisms in these site-frequency classes. In the simulations for Fig. 6B, an average of 934 lineages remained by generation 27 in the past, so each of the two clusters of coalescent events in “Genghis Khan” involves an average of
The second effect on the site-frequency distribution is an increase in the number of high-frequency derived mutations. Similar patterns have been ascribed to positive selection (43), but U-shaped distributions of allele frequencies are observed within local populations subject to migration (29) and are not unexpected when multiple-merger coalescent events can occur (44). We do not have a quantitative explanation of this pattern in Fig. 6B, but, roughly speaking, it is due to the fact that both of the large clusters may be on one side of the root of the gene genealogy. As described in Materials and Methods, we verified the overall pattern of site frequencies for this case using a modified set of standard coalescent simulations.
Pedigree Effects of a Selective Sweep
We also investigated the potential of a strong selective sweep to structure the population pedigree in such a way that a genome-wide deviation from the predictions of the standard neutral model would be observed. Whereas the genetic effects of selective sweeps are known to be dramatic for loci linked to a locus under selection (45–47), it is generally understood that unlinked loci are not affected by sweeps. In fact, there is some small effect of a selective sweep even on unlinked loci, which may be attributed to a transient increase of the variance of offspring numbers during a selective sweep (48, 49). To investigate this effect of a sweep as mediated by the population pedigree, we simulated very strong selective sweeps beginning at generation 50 in the past in a population of constant size
The pedigree effects of a sweep may be likened to those of a large family, with the family now defined in genetic terms and where the event unfolds over a larger number of generations. Another conceptually similar phenomenon is cultural inheritance of fertility, or correlation in offspring numbers, across generations, evidence for which has been inferred from the shapes of human mitochondrial gene genealogies (50).
Fig. 7 shows probabilities of pairwise coalescence, or proportion of loci expected to coalesce, in each of the past 56 generations assuming a selection coefficient of
Distributions of pairwise coalescence times conditional on the population pedigree for the case of a selective sweep, with either
Not surprisingly, the effects of lesser sweeps are very subtle. Fig. 7B shows the effect of a sweep with
In contrast to the large-family simulations that included population growth, and therefore showed little coalescence in the first
Fig. 8 provides a more detailed view of the pedigree effects of strong selective sweeps. Ten replicate populations, each with a sweep beginning in generation 50 in the past, were simulated. The probabilities of both coalescence for a pair of lineages and the frequency of the advantageous allele were computed for every generation in the pedigree. These two quantities are shown in Fig. 8 with thicker and thinner lines, respectively, and using different colors for each of the 10 replicates. Fig. 8A shows that sweeps with
Distributions of pairwise coalescence times and trajectories of selective sweeps for 10 different replicate populations. As in Fig. 7,
A greater level of variation in the timing of the ten sweeps is visible in Fig. 8B, with s = 1, than in Fig. 8A, with s = 10. Fig. 8B also shows that differences in the timing of the increase in coalescence probability track differences in the timing of sweeps (distinguished by color). Variation in the timing of a sweep is attributable to the time it takes the favored allele to escape the effects of genetic drift when it is in low copy number in the population. Especially in Fig. 8A, it can be seen that coalescence tends to happen earlier in the sweep, when the favored allele is in low frequency (51). Finally, there is greater variation in the additional density of coalescence events among sweeps in Fig. 8A (
Conclusions
We have explored two ways in which demographic events within populations may alter the structure of organismal genealogies, or population pedigrees, so as to produce unexpected patterns of variation across genomes. Our simulations of the effects of recent very large families and strong selective sweeps on variation among unlinked loci have primarily yielded negative results. Standard population-genetic predictions that average over pedigrees, such as
Finally, the genetic signatures of recent demographic events that we have uncovered apply marginally to single sites—they do not take linkage and recombination into account—and we note that the pedigree effects of such events might be relatively strong for multilocus measures such as the length distribution of blocks of identity by descent (54, 55).
Materials and Methods
Simulations of Population Pedigrees and Coalescence.
We simulated pedigrees according to the diploid, two-sex version of the Wright–Fisher model of random mating. That is, each individual in the next generation (forward in time) has a mother and a father chosen uniformly at random from the female and male adults of the current generation. Given a pedigree, neutral genetic loci are transmitted according to Mendel’s laws. Importantly, multiple loci are independent conditional on the pedigree. For each simulated population pedigree, a sample of individuals is taken at random without replacement from the current population, which is generation 0 in the model. A single genetic lineage is followed backward in time from each sampled individual according to Mendel’s law of independent segregation (i.e., going with 50% chance to the mother or the father in each generation). When two lineages trace back to the same individual, they coalesce with probability 1/2 and remain distinct in that individual with probability 1/2. For each pedigree and sample, we simulated large numbers of loci that were assumed also to follow Mendel’s law of independent assortment. The programs used in this research may be downloaded from wakeleylab.oeb.harvard.edu/resources.
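The procedure described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' program: it uses a constant population size with equal numbers of females and males, stores individuals as integer indices (females first), and the function names and parameter values are hypothetical. Each lineage is tracked as an (individual, gene copy) pair, which reproduces both the 50% choice of mother or father and the probability 1/2 of coalescing when two lineages meet in one individual.

```python
import numpy as np


def simulate_pedigree(n_individuals, n_generations, rng):
    """Two-sex Wright-Fisher pedigree with constant size.

    mothers[g, i] and fathers[g, i] index the parents, in generation g+1
    back in time, of individual i in generation g. Females are indices
    0..n/2-1 and males are n/2..n-1.
    """
    half = n_individuals // 2
    shape = (n_generations, n_individuals)
    mothers = rng.integers(0, half, size=shape)
    fathers = rng.integers(half, n_individuals, size=shape)
    return mothers, fathers


def pairwise_coalescence_times(mothers, fathers, ind_a, ind_b, n_loci, rng):
    """Coalescence times (in generations) at n_loci independent loci for
    one pair of sampled individuals on a fixed pedigree; -1 means the
    pair did not coalesce within the simulated pedigree."""
    n_generations = mothers.shape[0]
    ind = np.tile([ind_a, ind_b], (n_loci, 1))       # current individual
    copy = rng.integers(0, 2, size=(n_loci, 2))      # 0 maternal, 1 paternal
    times = np.full(n_loci, -1)
    active = np.ones(n_loci, dtype=bool)
    for g in range(n_generations):
        # Move each lineage to the parent that transmitted its gene copy,
        # then pick at random which of that parent's two copies it is.
        ind = np.where(copy == 0, mothers[g][ind], fathers[g][ind])
        copy = rng.integers(0, 2, size=(n_loci, 2))
        merged = active & (ind[:, 0] == ind[:, 1]) & (copy[:, 0] == copy[:, 1])
        times[merged] = g + 1
        active &= ~merged
    return times


rng = np.random.default_rng(2016)
mothers, fathers = simulate_pedigree(n_individuals=200, n_generations=400, rng=rng)
times = pairwise_coalescence_times(mothers, fathers, 0, 1, n_loci=5000, rng=rng)
print((times > 0).mean())   # fraction of loci that coalesced on this pedigree
```

Because every locus is traced through the same `mothers` and `fathers` arrays, the loci are independent only conditional on the pedigree, which is the property the simulations are designed to capture.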
Pedigree-Coalescent Simulations with a Large Family.
We set the generation in which there was a large family to be generation 28 in the past using the fact that Genghis Khan lived about 800 y ago and a current estimate of 29 y as the average length of one human generation (56). We assume that the children of our “Genghis Khan” comprised either 8% or 68% of the population, and that for the next 27 generations the population grew at rate 0.3 per generation (52), which is similar to estimates of growth for descent clusters in Balaresque et al. (37). The results we present do not depend strongly on this growth because, either way, generation 28 in the past is very recent compared with average coalescence time. We assume an ancestral population size of
In simulating genetic data, we assumed that all mutations are selectively neutral and that each mutation produces a unique polymorphic site (34). For each gene genealogy we placed a Poisson number of mutations randomly on the branches in the standard way to create pseudodata (57), with the modification that our gene genealogies are not necessarily simple bifurcating trees. For a gene genealogy with total length t generations, the number of mutations would be Poisson(
Pedigree-Coalescent Simulations of Strong Selective Sweeps.
We assumed that a selectively favored allele A was introduced as a mutant in a single copy in the population in generation 50 in the past, into a background of wild-type alleles a. The relative fitnesses of the three diploid genotypes were
Acknowledgments
We thank Noah Rosenberg for helpful comments and discussion.
Footnotes
- ¹To whom correspondence should be addressed. Email: wakeley@fas.harvard.edu.
Author contributions: J.W., L.K., and P.R.W. designed research; J.W., L.K., and P.R.W. performed research; J.W., L.K., and P.R.W. analyzed data; and J.W. and P.R.W. wrote the paper.
The authors declare no conflict of interest.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “In the Light of Evolution X: Comparative Phylogeography,” held January 8–9, 2016, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/ILE_X_Comparative_Phylogeography.
This article is a PNAS Direct Submission.
References
- ↵.
- Avise JC
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Ewens WJ
- ↵.
- Hudson RR
- ↵
- ↵
- ↵
- ↵
- ↵.
- Wakeley J,
- King L,
- Low BS,
- Ramachandran S
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Fisher RA
- ↵
- ↵.
- Barton NH,
- Etheridge AM
- ↵.
- Thompson EA
- ↵.
- Pemberton JM
- ↵.
- Moreau C, et al.
- ↵.
- Liang M,
- Nielsen R
- ↵
- ↵
- ↵.
- Wright S
- ↵
- ↵.
- Ewens WJ
- ↵
- ↵.
- Sjödin P,
- Kaj I,
- Krone S,
- Lascoux M,
- Nordborg M
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Lynch M
- ↵.
- Akashi H
- ↵.
- Fay JC,
- Wu C-I
- ↵
- ↵
- ↵.
- Kaplan NL,
- Hudson RR,
- Langley CH
- ↵.
- Kim Y,
- Stephan W
- ↵
- ↵.
- Barton NH
- ↵
- ↵
- ↵.
- Keinan A,
- Clark AG
- ↵
- ↵
- ↵
- ↵
- ↵.
- Hudson RR