# The probability of monophyly of a sample of gene lineages on a species tree

Edited by John C. Avise, University of California, Irvine, CA, and approved April 18, 2016 (received for review February 5, 2016)

## Abstract

Monophyletic groups—groups that consist of all of the descendants of a most recent common ancestor—arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

Mathematical computations under coalescent models have been central in developing a modern view of the descent of gene lineages along the branches of species phylogenies. Since early in the development of coalescent theory and phylogeography, coalescent formulas and related simulations have contributed to a probabilistic understanding of the shapes of multispecies gene trees (1–3), enabling novel predictions about gene tree shapes under evolutionary hypotheses (4, 5), new ways of testing hypotheses about gene tree discordances (6, 7), and new algorithms for problems of species tree inference (8, 9) and species delimitation (10, 11). A “multispecies coalescent” model, in which coalescent processes on separate species tree branches merge back in time as species reach a common ancestor (12), has become a key tool for theoretical predictions, simulation design, and evaluation of inference methods, and as a null model for data analysis.

A fundamental concept in genealogical studies is that of monophyly. In a genealogy, a group that is monophyletic consists of all of the descendants of its most recent common ancestor (MRCA): every lineage in the group—and no lineage outside it—descends from this ancestor. Backward in time, a monophyletic group has all of its lineages coalesce with each other before any coalesces with a lineage from outside the group.
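The definition can be made concrete with a small check on a rooted gene tree. The sketch below assumes a tree encoded as nested Python tuples with string leaves; this encoding, the function names, and the example labels are illustrative choices and do not come from the paper. A group is reported as monophyletic when some node of the tree has exactly that group as its set of descendant leaves.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a node (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some node of the tree has exactly `group` as its descendants."""
    group = set(group)
    leaves = leaf_set(tree)
    if leaves == group:
        return True
    # A leaf, or a subtree missing part of the group, cannot contain the clade.
    if isinstance(tree, str) or not group <= leaves:
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical gene tree with lineages sampled from three species a, b, c.
gene_tree = ((("a1", "a2"), "b1"), ("b2", "c1"))
species_a_is_clade = is_monophyletic(gene_tree, {"a1", "a2"})
species_b_is_clade = is_monophyletic(gene_tree, {"b1", "b2"})
```

In this toy tree the two lineages of species a form a clade, whereas the two lineages of species b do not, because b1 joins the a lineages before it joins b2.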

The phylogenetic and phylogeographic importance of monophyly traces to the fact that monophyly enables a natural definition of a genealogical unit. Such a unit can describe a distinctive set of organisms that differs from other groups of organisms in ways that are evolutionarily meaningful. Species can be delimited by characters present in every member of a species and absent outside the species, and that therefore can reflect monophyly (13, 14). In conservation biology, monophyly can be used as a prioritization criterion because groups with many monophyletic loci are likely to possess unique evolutionary features (15). Reciprocal monophyly, in which a set of lineages is divided into two groups that are simultaneously monophyletic, is often used in a genealogical approach to species divergence (16, 17). The proportion of loci that are reciprocally monophyletic is informative about the time since species divergence and can assist in representing the level of differentiation between groups (4, 18).

Many empirical investigations of genealogical phenomena have made use of conceptual and statistical properties of monophyly (19). Comparisons of observed monophyly levels to model predictions have been used to provide information about species divergence times (20, 21). Model-based monophyly computations have been used alongside DNA sequence differences between and within proposed clades to argue for the existence of the clades (22), and tests involving reciprocal monophyly have been used to explain differing phylogeographic patterns across species (23). Comparisons of observed levels of monophyly with the level expected by chance alone (24) have assisted in establishing the distinctiveness of taxonomic groups (25, 26). Loci that conflict with expected monophyly levels have provided signatures of genic roles in species divergences (27–29).

For lineages from two species under a model of population divergence, Rosenberg (4) computed probabilities of four different genealogical shapes: reciprocal monophyly of both species, monophyly of only one of the species, monophyly of only the other species, and monophyly of neither species. The computation permitted arbitrary species divergence times and sample sizes, generalizing earlier small-sample computations (1–3, 30, 31), and illustrated the transition from the species divergence, when monophyly is unlikely for both species, to long after divergence, when reciprocal monophyly becomes extremely likely. Between these extremes, the species can pass through a period during which monophyly of one species but not the other is the most probable state.

Although this two-species computation has contributed to various insights about empirical monophyly patterns (21–23, 32–34), many scenarios deal with more than two species. Because multispecies monophyly probability computations have been unavailable, except in limited cases with up to four species (4, 35–38), multispecies studies have been forced to rely on two-species models, restricting attention to species pairs (25, 34, 39) or pooling disparate lineages and disregarding their taxonomic distinctiveness (23, 26).

Here, we derive an extension to the two-species monophyly probability computation, examining arbitrarily many species related by an evolutionary tree. Furthermore, we eliminate the past restriction (4) that the lineages whose monophyly is examined all derive from the same population. This generalization is analogous to the assumption that in computing the probability of a binary evolutionary character (40–42), one or both character states can appear in multiple species. Our approach uses a pruning algorithm, generalizing the two-species formula in a conceptually similar manner to other recursive coalescent computations on arbitrary trees (9, 40–44).

Like the work of Degnan and Salter (5), which considered probability distributions for gene tree topologies under the multispecies coalescent model, our work generalizes a coalescent computation known only for small trees (4, 35) to arbitrary species trees. We study the dependence of the monophyly probability on the model parameters, providing an understanding of factors that contribute to monophyly in species trees of arbitrary size. Finally, we explore the utility of monophyly probabilities in an application to genomewide data from maize and teosinte.

## Results

### Model and Notation.

#### Overview.

Consider a rooted binary species tree T with ℓ leaves and specified topology and branch lengths. For each of the ℓ species represented by leaves of T, a number of sampled lineages is specified. Given a specified partition of the lineages into two subsets, we consider a condition describing whether one, the other, both, or neither of the two subsets of lineages is monophyletic. Our goal is to provide a recursive computation of the probability that the condition is obtained under the multispecies coalescent model. Notation appears in Table S1.

#### Lineage classes.

The initial sampled lineages are partitioned into class *S* (subset) for lineages within a chosen subset, and class *C* (complement) for all lineages not included in *S*. Coalescence between an *S* lineage and a *C* lineage produces an *M* (mixed) lineage. Any coalescence involving an *M* lineage also produces an *M* lineage. Coalescences between two *S* or two *C* lineages produce *S* and *C* lineages, respectively (Table 1).
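The class rule just described is small enough to state as a function. The sketch below is only an illustration of that rule; the function name `coalesce` and the single-letter string encoding of the classes are assumptions made here.

```python
from functools import reduce


def coalesce(class_a, class_b):
    """Class of the lineage produced when two lineages coalesce.

    Two S lineages give S, two C lineages give C, and every other
    pairing (S with C, or anything with M) gives a mixed lineage M.
    """
    for cls in (class_a, class_b):
        if cls not in ("S", "C", "M"):
            raise ValueError(f"unknown lineage class: {cls!r}")
    if class_a == class_b and class_a in ("S", "C"):
        return class_a
    return "M"


# Coalescing a whole sample in some order: any sample containing both
# S and C lineages must end in a single M lineage.
final_class = reduce(coalesce, ["S", "S", "C", "C"])
```

Because the rule is symmetric and an *M* lineage absorbs everything it meets, the class of the final lineage does not depend on the order of coalescences; only the monophyly event does.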

Letting the number of *S* and *C* lineages present initially in the *i*th leaf be

#### Monophyly events.

A monophyly event *S* and *C*. We can choose to label a class “monophyletic” or “not monophyletic,” or assign no label at all, so that nine monophyly events are possible, six of which are relevant for our purposes (Table 2). All lineages in a monophyletic class must coalesce within the class to a single lineage before any coalesces outside the class. If multiple classes are labeled monophyletic, then each class must be separately monophyletic.

#### Species-merging events.

We orient the species tree vertically, “up” toward the root and “down” toward the leaves. From a coalescent backward-in-time perspective, at every internal node of the species tree—representing a species-merging event—lineages enter from two branches directly below the node. We label one of these branches “left” and the other “right,” based on an arbitrarily labeled diagram of species tree T. These labels are used only for bookkeeping; the labeling does not affect subsequent calculations. Lineages entering from the left and right branches are called “left inputs” and “right inputs,” respectively. Each node *x* of T is associated with exactly one branch, leading from node *x* to its immediate predecessor on T. We refer to this branch with the shared label *x*.

For an internal branch *x* in T, the number of class-*S* left inputs is *C*, *M*); the number of class-*S* right inputs is *C*, *M*). The total number of class-*S* inputs of *x* is *C*, *M*). The number of lineages that exit branch *x*, entering a branch farther up the species tree, is the set of outputs of branch *x*:

We combine the input and output values into two three-entry vectors: the “input states” *x* corresponding to its left and right incoming branches by *L*s and *R*s, which, read from left to right, give the steps needed to reach them from *x*. For example, *x* to the right (

The time interval associated with node *x* is *x*. Branch lengths are measured in coalescent time units of *N* generations, where *N* represents the haploid population size along the branch and is assumed to be constant. Thus, larger population sizes correspond to shorter lengths of time in coalescent units. Coalescences between inputs during time *x*. The root branch of T has infinite length.

The outputs of any nonroot branch are exactly the left or the right inputs of another branch farther up the tree; the outputs of the root are the outputs of the species tree. The root has only one output lineage: *x* are the outputs of *x* corresponds to leaf *i*, we let

We define *x* and *x*, ignoring the rest of the species tree.

#### Coalescence sequences.

A coalescence sequence is a sequence of coalescences that reduces a set of lineages to another set of lineages. As an example, consider four lineages—labeled A, B, C, and D—that coalesce to a single lineage. One sequence has A and C coalesce first, followed by B and D, then the lineages resulting from the AC and BD coalescences. This sequence could be described as (A, C), (B, D), (AC, BD). If the first two coalescences happened in opposite order, the sequence would be (B, D), (A, C), (AC, BD).

#### Combinatorial functions.

The probability $g_{n,j}(T)$ that *n* lineages coalesce to *j* lineages in time *T* is given by equation 6.1 of ref. 45. It is nonzero only when $n \geq j \geq 1$.
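For readers who want to evaluate this probability numerically, the following sketch implements the standard closed form for the number of ancestors of *n* lineages after time *T*, with time in coalescent units in which each pair of lineages coalesces at rate 1. The function name `g` and its argument order are choices made here, and the floating-point sum is adequate only for modest *n*.

```python
import math


def g(n, j, t):
    """Probability that n lineages have exactly j ancestors after time t.

    Time is in coalescent units (each pair of lineages coalesces at rate 1).
    Returns 0 outside the range n >= j >= 1.
    """
    if not n >= j >= 1:
        return 0.0
    total = 0.0
    for k in range(j, n + 1):
        rising_j = math.prod(range(j, j + k - 1))       # j(j+1)...(j+k-2)
        falling_n = math.prod(range(n - k + 1, n + 1))  # n(n-1)...(n-k+1)
        rising_n = math.prod(range(n, n + k))           # n(n+1)...(n+k-1)
        coeff = ((2 * k - 1) * (-1) ** (k - j) * rising_j * falling_n
                 / (math.factorial(j) * math.factorial(k - j) * rising_n))
        total += math.exp(-k * (k - 1) * t / 2) * coeff
    return total


# Two lineages coalesce within time t with probability 1 - exp(-t).
p_two_coalesce = g(2, 1, 0.7)
# The probabilities over j form a distribution for any fixed n and t.
total_mass = sum(g(5, j, 0.3) for j in range(1, 6))
```

The two-lineage case gives a quick sanity check, since the waiting time to the single coalescence is exponential with rate 1.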

Following equation 4 of ref. 4, the number of coalescence sequences that reduce *n* lineages to *k* lineages is $I_{n,k} = \frac{n!\,(n-1)!}{2^{n-k}\,k!\,(k-1)!}$.

Finally, the binomial coefficient

### The Central Recursion.

#### Overview.

We develop a recursion for the probability of a particular output state *x* given the initialized species subtree *x* into a product of the probabilities of the output states of **1**, which we represent by *F*, is the probability that the inputs coalesce to the specified outputs during time *x* as **1**, we can write the central recursion of our analysis:*S* across all of the leaves subtended by *C*). Each of the two summations is a nested triple sum, proceeding componentwise over the three entries in the vectors

#### Bounds of summation.

The sums in Eq. **2** traverse all possible inputs of branch *x*. We use summation bounds that only require information contained in the initialized species subtree

For the upper bounds, because coalescence does not create new *S* and *C* lineages (Table 1), the numbers of *S* and *C* lineages never exceed the numbers of *S* and *C* leaves in the gene tree, respectively. Thus, for branch *x*, an upper bound for the possible number of inputs of class *S* or *C* from one side (*L* or *R*) is *S* and *C*.

We use Eq. **2** to calculate probabilities only for *S* lineage and a *C* lineage. Because the leaves possess no *M* lineages and because only the unique coalescence between an *S* and a *C* lineage creates an *M* lineage (Table 1), the number of *M* lineages never exceeds 1.

#### Probability of the outputs of a node given the inputs.

Separating the function *F* from Eq. **2** into a term for the probability that the correct number of outputs is produced from the inputs and a combinatorial term *F* takes the form*S* is of interest, we have:*S* lineages in the species tree T at the species merging event corresponding to node *x*. For cases 1 and 3,

Function *F* (Eq. **3**) describes the probability of an output state and monophyly event given an input state and the initialized species tree. Its *g* term records the probability that the correct number of coalescences occur during the time **4**) records the fraction of those sequences that produce the correct output and preserve the monophyly event

The cases in Eq. **4** represent distinct scenarios for the types of input and output lineages present (Fig. 2 *A–G*). In case 1 (Fig. 2 *A–E*), no coalescence violates *S* (case 1e) or *C* lineages (cases 1b, 1c, 1d), and the only change from input to output is a reduction in *S* or *C* lineages.

In cases 2 and 3, both *S* and *C* lineages are present, and we enumerate the ways to obtain the desired output state from the input state in accord with the monophyly event. To obtain

Case 2 describes the only possible way an *S* lineage and a *C* lineage can coalesce with each other under *F*). All extant *S* lineages at the time of node *x* (*C* lineage when *k* class-*C* lineages remain from the *C* lineages present in both species at node *x*. This coalescence results in a single *M* lineage and *C*, which can coalesce in any order to a single class-*M* lineage and *C* lineages.

The number of ways that *k* lineages is *S* lineage can coalesce with one of *k* lineages of class *C* is *k*. Finally, *k* lineages—one *M* lineage and *C* lineages—can coalesce to *k*, which ranges from just enough *C* lineages (*S* lineage coalesces with one *C* lineage and then no other coalescence occurs—to the total number *C* lineages, when all of the *S* lineages coalesce before any of the *C* lineages coalesce. The denominator of ratio *M*, reduces the formula to the two-species equation 11 from ref. 4 (*Supporting Information*).

Case 3 describes any situation with *S* and *C* lineages present and no interclass coalescence (Fig. 2*G*). At node *x*, the *S* lineages coalesce to *S* lineages, and the *C* lineages to *C* lineages. Group *S* has not yet coalesced with the other sampled lineages and does not do so within this species tree branch; its monophyly is not necessarily determined on the branch. The number of ways

Any pairing of an input state and an output state that does not belong in cases 1–3 of Eq. **4** must violate

#### Reciprocal monophyly.

Monophyly events *C* and *M* lineages cannot coexist. Thus, cases 1c and 1d of Eq. **4** move to “otherwise” for *S* lineages have coalesced to a single *S* lineage and all *C* lineages have coalesced to a single *C* lineage, whereas *S* lineages coalesce. For **4**; for *C* lineages must be all *C* lineages in the tree at the time of node *x* (as we did for *S* lineages for case 2 of Eq. **4**; *C* lineages coalesce to a single lineage before the interclass coalescence. Setting **4**, and noting that *H*), applicable when **5** or cases 1c and 1d of Eq. **4**,

#### Completing the calculation.

Having obtained a recursion that propagates monophyly probabilities through a species tree, we apply Eq. **2** at the root to complete the calculation of the probability of a monophyly event on **6**,*S* and *C* switched. These recursive computations reduce to the known values for the two-species case (*Supporting Information*).
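Once the probabilities of monophyly of *S*, monophyly of *C*, and reciprocal monophyly are available, the probabilities of the remaining events follow from elementary probability. The sketch below applies these identities; they are standard inclusion-exclusion relations rather than a quotation of the paper's numbered equations, and the function name and dictionary keys are chosen here. The example values are for two *S* and two *C* lineages in a single population, obtained by counting among the 18 equally likely coalescence sequences.

```python
from fractions import Fraction


def all_event_probabilities(p_s, p_c, p_both):
    """Derive six monophyly event probabilities from three inputs.

    p_s    : probability that S is monophyletic
    p_c    : probability that C is monophyletic
    p_both : probability of reciprocal monophyly (S and C both monophyletic)
    """
    return {
        "S": p_s,
        "C": p_c,
        "S and C": p_both,
        "S only": p_s - p_both,
        "C only": p_c - p_both,
        "neither": 1 - p_s - p_c + p_both,
    }


# Two S and two C lineages in one population: S is monophyletic in 4 of the
# 18 sequences, C in 4 of 18, and both in 2 of 18.
events = all_event_probabilities(Fraction(4, 18), Fraction(4, 18), Fraction(2, 18))
```

The four mutually exclusive outcomes (both, *S* only, *C* only, neither) sum to 1, which gives a simple consistency check on any set of computed values.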

### Effect of Species Tree Height *T*.

To illustrate the features of monophyly probabilities, we now examine how they depend on the model parameters. First, we vary the tree height *T* while preserving relative branch length proportions, studying the limiting cases *T* = 0 and *T* → ∞.

#### *T* = 0.

At *T* = 0, all *S* lineages and all *C* lineages enter the root. Using Eq. **7**, the probability of monophyly of *S* is then a function *f* of only the total numbers *s* and *c* of *S* and *C* lineages. The function *f* decreases with increasing *s* or *c*, as adding any lineage increases the chance of a monophyly-violating interclass coalescence.
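The single-population case can be computed exactly with a short recursion over coalescence events: each pair of lineages is equally likely to coalesce next, a coalescence within *S* or within *C* continues the process, and a coalescence between an *S* lineage and a *C* lineage while more than one *S* lineage remains violates monophyly of *S*. The sketch below is an independent illustration under these assumptions, not the paper's implementation. As a cross-check it compares the recursion with the known closed form for the probability that a specified proper subset of lineages forms a clade under the coalescent, $2(s+c)/[s(s+1)]\,\binom{s+c}{s}^{-1}$ for $c \geq 1$.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def mono_prob(s, c):
    """P(s class-S lineages coalesce to one before any S-C coalescence)."""
    if s <= 1:
        return Fraction(1)
    within_s = comb(s, 2) * mono_prob(s - 1, c)
    within_c = comb(c, 2) * mono_prob(s, c - 1) if c >= 2 else Fraction(0)
    # The s * c interclass pairs contribute nothing: they violate monophyly.
    return (within_s + within_c) / comb(s + c, 2)


def closed_form(s, c):
    """Clade probability for a subset of size s among s + c lineages (c >= 1)."""
    return Fraction(2 * (s + c), s * (s + 1)) / comb(s + c, s)


table = {(s, c): mono_prob(s, c) for s in range(1, 7) for c in range(1, 7)}
```

For example, `mono_prob(2, 2)` is 2/9 and `mono_prob(6, 6)` is 1/1617, and the values fall as either sample size grows.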

#### T → ∞.

As *S*, *S* lineages in the tree.

For large *T*, the monophyly probability depends on properties of *S* lineages must encounter *C* lineages only above its root. If *C* lineages, then complete coalescence in each branch implies monophyly of *S* lineages, and the monophyly probability is 1. If *C* lineages and is at a leaf, *k*, then the limiting probability is

If *C* lineages but is not a leaf, however, then complete coalescence in every branch implies that some proper subset of *S* lineages must coalesce with *C* lineages before all of the *S* lineages can coalesce with each other. In this case, the limiting monophyly probability is 0.

#### Finite, nonzero *T*.

The extreme cases assist in understanding the behavior of the probability of *T*. We enumerate the possible situations based on

If *C* lineages, then decreasing the tree height decreases the probability of monophyly by decreasing the time during which *S* lineages are able to coalesce with only themselves, eventually approaching a minimum *T* increases the monophyly probability toward 1 as

If *C* lineages and is a leaf, then decreasing the tree height decreases the monophyly probability by decreasing the time before more *C* lineages are added to the population that contains the *S* lineages. Shrinking the tree also increases the expected number of additional *C* lineages introduced at species merging events, further decreasing the monophyly probability. The minimal probability of monophyly therefore occurs at *T*.

If *C* lineages and is not a leaf, then the minimal probability of monophyly, approached as *T* is not guaranteed, and different initial sample sizes on the same species tree can generate different behavior.

### Effect of Relative Branch Lengths.

Next, to investigate the behavior of the monophyly probability as *T* increases, we devise a simple three-species, two-parameter scenario, subdividing the tree height *T* by a parameter *r*. We calculate the monophyly probability as a function of *r* and *T*.

Fig. 3 shows the species tree and its resulting monophyly probabilities for four representative initial conditions. For each lineage class, *S* and *C*, the four cases place one or more lineage pairs into the three species, using different placements across the four cases. The cases include scenarios in which at least one species contains both *S* and *C* lineages (B, D, E), in which one (C) or both (B, D, E) lineage classes span multiple species, and in which the species containing *S* lineages are not monophyletic in the species tree (B, C).

The four cases (Fig. 3 *B–E*) illustrate differences in the pattern of increase or decrease in the monophyly probability with changes in *r* at fixed tree height *T* (*Supporting Information*). In most cases with fixed *r*, the probability decreases to 0 with increasing *T*, although in some boundary cases with *T* (see above on *T*, monotonically decreasing, or not monotonic at all.

### Effect of Pooling.

Our next scenario simulates the difference between separating and pooling distinct species when computing monophyly probabilities, recalling that tests with more than two species have until now required the pooling of multiple clades (23, 26).

We consider four species trees with equal height and 12 lineages (Fig. 4). Six class-*C* lineages appear in one species descended from the root. The other six—the *S* lineage class—are evenly divided between one, two, three, or six other leaves. If we interpret the seven-leaf tree in Fig. 4*D* to be the “true” species tree, then the other trees represent pooling schemes, the two-leaf tree (Fig. 4*A*) being the only one possible to analyze using previous results.

Fig. 4 *E–J* displays the probabilities of all possible monophyly events for each tree. For each event, pooling does not affect the extreme cases *T* = 0 and *T* → ∞. At intermediate *T*, the monophyly probability for the *S* lineages decreases as pooling is reduced from the case in which the six class-*S* lineages are treated as belonging to a single species to the case in which each lineage is in its own species (Fig. 4*E*); the monophyly probability for *C* remains largely unchanged (Fig. 4*F*). As pooling is reduced, the probability of monophyly of only *S* and not *C* decreases (Fig. 4*G*), and that of only *C* and not *S* increases (Fig. 4*H*). The reciprocal monophyly probability decreases (Fig. 4*I*) and the probability of no monophyly increases (Fig. 4*J*).

In this scenario, the *S* and *C* lineages meet only at the species tree root, and the monophyly probabilities are determined by the numbers of lineages that reach the root. Coalescence is faster with more nonisolated lineages; pooling species together results in more coalescence events and fewer *S* lineages entering the root, increasing the probability of monophyly of both *S* and *C* lineages as well as the reciprocal monophyly probability (Fig. 4 *E, F,* and *I*). Decreasing the number of *S* lineages at the root decreases the number of coalescences needed to produce *S* lineages does not change the number of coalescences necessary to produce *E* and *F*). The probability for
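The dependence on the numbers of lineages reaching the root can be illustrated for the fully pooled, two-leaf case. The sketch below assumes a two-leaf species tree whose leaf branches both have length `t` in coalescent units, with all *S* lineages in one leaf and all *C* lineages in the other. It sums over the numbers of lineages that survive to the root, weighting the single-population monophyly probability by the probability of each pair of survivor counts. The function names are chosen here, and the code is not the *Monophyler* implementation.

```python
import math
from functools import lru_cache


def g(n, j, t):
    """Probability that n lineages have j ancestors after time t."""
    if not n >= j >= 1:
        return 0.0
    total = 0.0
    for k in range(j, n + 1):
        rising_j = math.prod(range(j, j + k - 1))
        falling_n = math.prod(range(n - k + 1, n + 1))
        rising_n = math.prod(range(n, n + k))
        coeff = ((2 * k - 1) * (-1) ** (k - j) * rising_j * falling_n
                 / (math.factorial(j) * math.factorial(k - j) * rising_n))
        total += math.exp(-k * (k - 1) * t / 2) * coeff
    return total


@lru_cache(maxsize=None)
def f(s, c):
    """Monophyly probability of S for s and c lineages in one population."""
    if s <= 1:
        return 1.0
    within_s = math.comb(s, 2) * f(s - 1, c)
    within_c = math.comb(c, 2) * f(s, c - 1) if c >= 2 else 0.0
    return (within_s + within_c) / math.comb(s + c, 2)


def two_leaf_monophyly(s, c, t):
    """P(S monophyletic) when S and C occupy separate leaves of length t."""
    return sum(g(s, a, t) * g(c, b, t) * f(a, b)
               for a in range(1, s + 1) for b in range(1, c + 1))


# Six S and six C lineages, as in the two-leaf pooling scheme.
curve = [two_leaf_monophyly(6, 6, t) for t in (0.0, 0.1, 1.0, 5.0, 50.0)]
```

At `t = 0` the result equals the single-population value, and it rises toward 1 as longer branches leave fewer lineages to enter the root.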

As can be seen from the increase with pooling in the monophyly probability for the *S* lineages (Fig. 4*E*), the correct monophyly probability for clades that have been pooled tends to be lower than that obtained under a model where the pooled clades are treated as a single clade. The monophyly probability will likely be overestimated if populations are pooled.

### Application to Data.

To illustrate the empirical use of Eq. **7** and to test if our theoretical results reasonably replicate patterns in real data, we perform an analysis of monophyly frequencies using *Zea mays* maize and teosinte genomic data (46).

Hufford et al. (47) analyzed 75 individuals from the data of Chia et al. (46), considering four groups: teosinte varieties var. *parviglumis* (“parviglumis”) and var. *mexicana* (“mexicana”) and domesticated maize landraces (“landraces”) and improved lines (“improved”). Modifying the estimated tree of individuals from figure 1 in Hufford et al. (47) to make a model “species” tree the leaves of which are the four groups (Fig. 5*A*), we compute theoretical monophyly probabilities for each of the groups via Eq. **7**. We also estimate the empirical frequency of monophyly for each group by randomly sampling individuals from each group, constructing multiple gene trees per sample from SNP blocks, and averaging frequencies of monophyly in the gene trees over the random samples. This procedure employs 100 unique random samples of eight individuals from the Hufford et al. subset, each containing two individuals from each of the four groups. Finally, we compare the observed and theoretical monophyly frequencies.

The monophyly frequencies appear in Fig. 5*B* and are summarized in Table S2. The theoretical frequencies predict the observations reasonably well. For each clade, especially parviglumis and mexicana, the mean observed monophyly frequency over 100 samples closely coincides with the theoretical monophyly probability (Fig. 5*B*). Although the theoretical probability is noticeably below the mean for the improved and landrace clades and above the mean for parviglumis and mexicana, it lies well inside the observed distributions.

Eq. **7** relies on a model with selectively neutral loci and constant population size; a deviation from theoretical probabilities could suggest a violation of one of the model assumptions. Domestication imposes strong selection and population bottlenecks (27, 48, 49), factors that violate our model in a manner that would increase monophyly frequencies. Excess empirical monophyly in the improved and landrace clades (Fig. 5*B*, Table S2) is thus compatible with domestication in the history of these domesticated groups.

## Discussion

Extending a past computation (4) from 2 to *n* species, we have obtained a general algorithm for the probability of any monophyly event of two lineage classes in a species tree of any size. In our generalization, unlike in previous calculations, no restriction exists on the class labeling of lineages, so that monophyly probabilities can be computed on samples aggregated across multiple species. We have uncovered behaviors absent in the two-species case, including nonmonotonicity of the monophyly probability in the tree height and positive limiting probabilities below 1. Both phenomena occur in scenarios newly possible to include in monophyly calculations, in which the lineage set whose monophyly is of interest spans multiple species, or in which lineages of at least one species span both classes.

We have used a pruning algorithm similar to other species tree computations (9, 40–44) that evaluate a quantity at a parent node in terms of corresponding values for daughter nodes. In previous applications of this idea, the states recorded at a node are generally simpler than our input and output states. For example, in evaluating the time to the MRCA (43), they are one-dimensional; our approach instead tracks lineage classes as three variables, accommodating complex transitions that occur at interclass coalescences.

Previous work on monophyly probabilities has been limited to small numbers of species (4, 35–38). This limitation has forced investigators either to group multiple species together into a single clade (23, 26), a choice that our tree-pooling experiment shows can overestimate monophyly probabilities, or to consider pairwise comparisons when multispecies analyses would be preferable (25, 34, 39). By identifying a bias that occurs when pooling distinct species in monophyly probability computations, our experiment suggests that pooling should be avoided when possible. Our results allow researchers to move beyond such simplifications by performing monophyly calculations in larger species groups.

One application of our results is to extend a test of a null hypothesis that an observed monophyletic pattern is due to chance alone (24). This test has been available only in situations with species-specific lineages and two-species trees; it can now be extended to arbitrary trees and non-species-specific lineages. The results also provide a step toward computations for monophyly events on three or more lineage groups considered jointly.

As an empirical demonstration, we analyzed data from maize and teosinte, calculating theoretical and observed monophyly frequencies in four groups. The empirical frequencies generally match the predictions; frequencies exceeding predicted values in the domesticated species may reflect the fact that domestication bottlenecks and strong selection can violate our model in a manner that increases the likelihood of monophyly.

We note that our *Z. mays* results should be viewed with caution. We assumed a model of instantaneous divergence events without incorporating the subsequent gene flow that likely occurred in this system (47). Furthermore, our model species tree contains uncertainty; however, we do not expect a bias in any specific direction to have resulted from its construction. Perhaps more seriously, we generated the model tree from the same study whose data we used for constructing gene trees. However, considerations of monophyly were irrelevant in producing the model tree, so that construction of the model did not guarantee the agreement we obtained between theoretical and observed monophyly.

The maize analysis illustrates how our framework can be used to study monophyly in multispecies genomic data. The formulas derived here allow for greater flexibility in studies of monophyly and its relationship to species trees, contributing to a more comprehensive toolkit for phylogeographic, systematic, and evolutionary studies.

## Materials and Methods

### Maize Species Tree.

We used maize HapMap V2 SNP data from www.panzea.org/#!genotypes/cctl (46) consisting of 55 million SNPs and small indels from 103 *Z. mays* inbred lines. To construct Fig. 5*A*, we determined relative branch lengths from figure 1 in Hufford et al. (47). We chose a tree height of 0.04, measured in units of *N* generations, where *N* is the haploid population size, noting that a ∼10,000-y domestication time (47) translates via conversion factors calculated from figure 7 in ref. 50 (top panel, *N* generations. We chose our root as the root of the Hufford et al. ingroup tree (second node from left in figure 1 of ref. 47, call it *x*), our Parviglumis/Domesticated node as the MRCA of all domesticated lineages and parviglumis lineages TIL01, TIL03, TIL11, and TIL14 (*L* is “down” rather than “left”), and our Landrace/Improved node as the MRCA of all domesticated lineages (

### Maize Samples.

We chose 100 samples of four lineage pairs, selecting randomly among 29 improved, 12 landrace, 8 parviglumis, and 2 mexicana individuals. We chose pairs within groups so that the Hufford et al. tree, a genome-wide tree of individuals, restricted to each eight-lineage sample would display the model species tree in Fig. 5*A*, irrespective of which lineage in a pair was chosen to represent its group (*Supporting Information*).

### Maize Gene Trees.

The maize genome has *hclust* UPGMA (unweighted pair group method with arithmetic mean) clustering function in the R *stats* package. SNPs with missing data for a lineage pair were excluded in distance calculations.
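UPGMA corresponds to average-linkage hierarchical clustering, which in R's `hclust` is selected with `method = "average"`. The sketch below is a pure-Python stand-in for that step and does not reproduce the authors' R pipeline. It assumes a simple mismatch-proportion distance that skips sites at which either lineage of a pair has missing data; the distance measure, the missing-data symbol `N`, the lineage labels, and the toy SNP strings are all hypothetical.

```python
from itertools import combinations


def snp_distance(x, y, missing="N"):
    """Proportion of differing sites among sites observed in both lineages."""
    pairs = [(a, b) for a, b in zip(x, y) if missing not in (a, b)]
    return sum(a != b for a, b in pairs) / len(pairs)


def upgma(labels, dist):
    """Build a UPGMA tree as nested tuples.

    labels : list of leaf names
    dist   : dict mapping frozenset({a, b}) to the distance between a and b
    """
    sizes = {label: 1 for label in labels}  # cluster -> number of leaves
    dist = dict(dist)
    while len(sizes) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            dist[frozenset((merged, other))] = (
                na * dist[frozenset((a, other))]
                + nb * dist[frozenset((b, other))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


# Hypothetical SNP block for two improved and two parviglumis lineages.
snps = {"imp1": "AACCGGTT", "imp2": "AACCGGTA",
        "par1": "TTCCGGAA", "par2": "TNCCGGAT"}
names = list(snps)
distances = {frozenset((a, b)): snp_distance(snps[a], snps[b])
             for a, b in combinations(names, 2)}
tree = upgma(names, distances)
```

In this toy block the two improved lineages join first and the two parviglumis lineages join next, so both groups are monophyletic in the resulting tree.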

### Software Implementation.

The *Monophyler* software package implementing Eqs. **7**, **8**, and **9** can be found at rosenberglab.stanford.edu/monophyler.html.

## Reduction to the Two-Species Case from Ref. 4

This appendix shows that our recursive Eq. **2** reduces properly to the two-taxon results of Rosenberg (4). Rosenberg (4) studied four monophyly events *S* and reciprocal monophyly with the settings from ref. 4, showing that we obtain the same results.

In our notation, the root node of a two-taxon tree is *x*, and the leaves are

For the probability of monophyly of *S*, applying the initial conditions in Eq. **2** yields:**S1**, and the limits of summation therefore agree with ref. 4.

For the combinatorial term **S1**, the only possibility is case 2 in Eq. **4**:*S*. Expression S2 therefore accords with the corresponding equation 11 of ref. 4. Note that the line above equation 10 of ref. 4 contains a known typographical error, with **S2**, did not produce an error in the numbered equation 10 of ref. 4.

The next step is to verify that the probability terms in Eq. **S1** for the left and right species tree leaves agree with ref. 4. For the left leaf, our definitions of probabilities for leaves force the summation to have only one nonzero term, with input probability 1. Because only input *S* lineages are present, case 1e for **4**) applies, so that

The recursion terminates at the leaves. Because node *x* is the root, **S1**. Thus, the probability of **S3** and **S4** and simplifying produces equation 15 of ref. 4, confirming agreement of our formulas with those of ref. 4.

## Probabilities in the Relative-Branch-Length Scenario

The four cases of Fig. 3, representing different distributions across species of lineages in lineage classes *S* and *C*, illustrate different effects of the tree height *T* and relative-branch-length parameter *r*. These differing effects can be explained by considering the way in which likely coalescence patterns for the sampled lineages differ as a function of the locations of those lineages.

In Fig. 3*B*, with *r*. With *T* fixed, *r* modulates the time during which the *r*, a monophyly-violating coalescence is more likely. Because the *r* is less important than *T* in predicting the monophyly probability. As *T* increases, because the minimal subtree with respect to *S* is the full species tree—containing *C* lineages but not occurring at a leaf—the probability approaches 0.

For Fig. 3*C*, *r* controls whether a coalescence violating *r* increases at fixed *T*, the *S* is the full species tree, for *T* differs with *r*: it increases monotonically at *r*.

In Fig. 3*D*, with *r* increases with *T* fixed, the time before the *r*. For *S* lies at the MRCA for the sister species pair, the probability approaches 0 as *T* only for

Finally, in Fig. 3*E*, we set *T*, increasing *r* increases the time during which the *r* and a monotonic increase is observed for *r* increases from 0 to 1.

## Pairs of Maize and Teosinte Lineages

Each eight-lineage subsample of maize and teosinte lineages contained the following:

- Mexicana: the only two mexicana individuals in the ref. 47 dataset.
- Parviglumis: one of four pairs of individuals: {TIL07, TIL09}, {TIL10, TIL17}, {TIL01, TIL11}, and {TIL03, TIL14}.
- Landrace: either {MR12, MR20}, any two individuals from {MR03, MR23, MR21, MR18, MR06}, or any two from {MR05, MR09, MR24, MR26, MR01}, for a total of 21 possible landrace pairs.
- Improved: one of the pairs {IL14H, P39}, {KY21, M162W}, {CML103, TX303}, {CML247, CML322}, any two individuals from {CAU178, OH78, MS71, B97, W22, W64A, CAUMO17, MO17, OH43, B73, CAUZHENG58, CAU478, CAU5003, CML333, CML52}, or any two from {NC350, NC358, CML69, KI11, CML228, KI3}, for a total of 124 possible improved pairs.
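The pair counts quoted above can be checked directly from the listed groups. The short sketch below recomputes them; the variable names are hypothetical, and the final product is only the number of distinct eight-lineage samples implied by these counts, a figure the text itself does not state.

```python
from math import comb

# Groups copied from the description of the subsamples.
landrace_fixed_pairs = [("MR12", "MR20")]
landrace_groups = [
    ["MR03", "MR23", "MR21", "MR18", "MR06"],
    ["MR05", "MR09", "MR24", "MR26", "MR01"],
]
improved_fixed_pairs = [
    ("IL14H", "P39"), ("KY21", "M162W"), ("CML103", "TX303"), ("CML247", "CML322"),
]
improved_groups = [
    ["CAU178", "OH78", "MS71", "B97", "W22", "W64A", "CAUMO17", "MO17",
     "OH43", "B73", "CAUZHENG58", "CAU478", "CAU5003", "CML333", "CML52"],
    ["NC350", "NC358", "CML69", "KI11", "CML228", "KI3"],
]

def count_pairs(fixed_pairs, groups):
    """Fixed pairs plus every unordered pair drawn within each group."""
    return len(fixed_pairs) + sum(comb(len(g), 2) for g in groups)

landrace_pairs = count_pairs(landrace_fixed_pairs, landrace_groups)
improved_pairs = count_pairs(improved_fixed_pairs, improved_groups)
parviglumis_pairs = 4  # four listed pairs
mexicana_pairs = 1     # the only two mexicana individuals

# Number of distinct eight-lineage samples implied by the counts above.
total_samples = mexicana_pairs * parviglumis_pairs * landrace_pairs * improved_pairs
print(landrace_pairs, improved_pairs, total_samples)  # 21 124 10416
```

The group sizes also agree with the individual counts given in the Maize Samples section: 2 + 5 + 5 = 12 landrace and 8 + 15 + 6 = 29 improved individuals.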

The outlier samples all contain the pair {CAUMO17, MO17} (improved) or the pair {TIL03, TIL14} (parviglumis), two recently coalescing pairs for which the model species tree least adequately reflects the original species tree of ref. 47. In the case of parviglumis, one of the four pairs produces substantially different results from the others, and for convenience, we regard it as an outlier.

## Numerical Implementation

In our numerical implementation, although a leaf has no input nodes, without loss of generality, we let all its inputs “enter” from the left. To reduce numerical challenges, we use a binomial coefficient representation that avoids large numerators and denominators:**4** using binomial coefficients as**4**, we have**5**,
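The general motivation for preferring a binomial-coefficient form can be seen in a small sketch. The code below is not the *Monophyler* implementation and does not reproduce Eq. **4**; it only illustrates how individual factorials can exceed floating-point range while the equivalent binomial coefficient, accumulated as a running product of small ratios, stays well scaled. The function names `binom_stable` and `factorial_overflows` are hypothetical.

```python
import math

def binom_stable(n, k):
    """Binomial coefficient as a float, built from a running product of ratios.

    After i steps the partial product equals C(n - k + i, i), so intermediate
    values never exceed the final result.
    """
    if k < 0 or k > n:
        return 0.0
    k = min(k, n - k)  # use symmetry to shorten the loop
    result = 1.0
    for i in range(1, k + 1):
        result *= (n - k + i) / i
    return result

def factorial_overflows(n):
    """True if n! cannot be represented as a double-precision float."""
    try:
        float(math.factorial(n))
        return False
    except OverflowError:
        return True

print(factorial_overflows(170), factorial_overflows(200))  # False True
print(binom_stable(200, 100))                               # finite, about 9.05e58
```

With 200 lineages, the factorials in a ratio such as 200!/(100! 100!) cannot each be held as a float, whereas the running product remains finite throughout.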

## Acknowledgments

We thank Jeff Ross-Ibarra for assistance with the maize data and John Rhodes and two reviewers for comments on a draft of the manuscript. We acknowledge support from NIH Grant R01 GM117590, NSF Grant DBI-1458059, a New Zealand Marsden grant, and a Stanford Graduate Fellowship.

## Footnotes

- ¹To whom correspondence should be addressed. Email: rsmehta@stanford.edu.

Author contributions: R.S.M., D.B., and N.A.R. designed research; R.S.M. and N.A.R. performed research; R.S.M. analyzed data; and R.S.M., D.B., and N.A.R. wrote the paper.

The authors declare no conflict of interest.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “In the Light of Evolution X: Comparative Phylogeography,” held January 8–9, 2016, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/ILE_X_Comparative_Phylogeography.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1601074113/-/DCSupplemental.

## References

- ↵
- Tajima F

- ↵
- Takahata N,
- Nei M

- ↵
- Karlin S,
- Nevo E

- Neigel J,
- Avise J

- ↵
- ↵
- ↵
- Wu CI

- ↵
- ↵
- ↵
- ↵
- ↵
- Yang Z,
- Rannala B

- ↵
- ↵
- ↵
- ↵
- ↵
- Hoch PC,
- Stephenson AC

- Baum DA,
- Shaw KL

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Ting CT,
- Tsaur SC,
- Wu CI

*Odysseus*. Proc Natl Acad Sci USA 97(10):5313–5316.

- ↵
- Dopman EB,
- Pérez L,
- Bogdanowicz SM,
- Harrison RG

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Knowles LL,
- Kubatko LS

- Degnan JH

- ↵
- ↵
- ↵
- ↵
- RoyChoudhury A,
- Felsenstein J,
- Thompson EA

- ↵
- ↵
- ↵
- Efromovich S,
- Kubatko LS

- ↵
- ↵
- ↵
- ↵
- ↵
- Wright SI, et al.

- ↵
- Innan H,
- Kim Y

- ↵
- Ross-Ibarra J,
- Tenaillon M,
- Gaut BS

*Zea*. Genetics 181(4):1399–1413.

- ↵
- Schnable PS, et al.

- ↵
- Remington DL, et al.

## Article Classifications

- Biological Sciences
- Evolution