# Quantifying the risk of hemiplasy in phylogenetic inference

See allHide authors and affiliations

Edited by David M. Hillis, The University of Texas at Austin, Austin, TX, and approved November 5, 2018 (received for review June 29, 2018)

## Significance

Convergent evolution provides key evidence for the action of natural selection. The process of convergence is often inferred because the same trait appears in multiple species that are not closely related. However, different parts of the genome can reveal different relationships among species, with some genes or regions uniting lineages that appear unrelated in the species tree. If changes in traits occur in these discordant regions, a false pattern of convergence can be produced (known as “hemiplasy”). Here, we provide a way to quantify the probability that hemiplasy occurs and contrast it with the probability of convergence. We find that hemiplasy is likely to explain many apparent cases of convergent evolution, even when the fraction of discordant regions is low.

## Abstract

Convergent evolution—the appearance of the same character state in apparently unrelated organisms—is often inferred when a trait is incongruent with the species tree. However, trait incongruence can also arise from changes that occur on discordant gene trees, a process referred to as hemiplasy. Hemiplasy is rarely taken into account in studies of convergent evolution, despite the fact that phylogenomic studies have revealed rampant discordance. Here, we study the relative probabilities of homoplasy (including convergence and reversal) and hemiplasy for an incongruent trait. We derive expressions for the probabilities of the two events, showing that they depend on many of the same parameters. We find that hemiplasy is as likely—or more likely—than homoplasy for a wide range of conditions, even when levels of discordance are low. We also present a method to calculate the ratio of these two probabilities (the “hemiplasy risk factor”) along the branches of a phylogeny of arbitrary length. Such calculations can be applied to any tree to identify when and where incongruent traits may be due to hemiplasy.

Convergent traits found in distantly related organisms are prime examples of the role of natural selection in evolution. They are often used as evidence for the importance of adaptation in shaping organismal form and function, although they may also reflect underlying constraints on developmental pathways (1). Understanding how often convergent evolution occurs, and the conditions under which it occurs, will allow us to better understand its causes. The identification of clear cases of convergence—or more broadly, homoplasy, which includes reversals to ancestral states (2)—also enables us to determine how often convergent phenotypes are underlain by convergent molecular changes (3, 4).

Homoplasy, whether due to convergence or reversal, is inferred when multiple independent evolutionary changes are required to explain the same character states observed among sampled lineages. Here, we consider cases of convergence where the same derived state has apparently evolved from the same ancestral state multiple times; this is sometimes referred to as “parallel” evolution (cf. ref. 5). To make strong inferences about homoplasy, we require both a phylogenetic tree describing the relationships among taxa and a model of trait evolution. If either the tree or the model is incorrect, this can lead to errors in inferences about the number of transitions that have occurred (6⇓⇓⇓–10). To reduce errors in species trees, genome-scale datasets have been used to generate topologies that have strong statistical support at almost all nodes (e.g., refs. 11 and 12). Although concerns remain about appropriate models of trait evolution in some cases (13), the “resolution” of species trees with phylogenomic datasets would appear to have removed the main source of error in inferring character-state transitions accurately.

However, genome-wide data have also highlighted the ubiquity of gene-tree discordance—when individual gene trees have different topologies than the species tree—even when statistical support for the species tree is high (e.g., refs. 14⇓⇓–17). Discordance can be due to many factors, both technical and biological. The technical causes of gene-tree discordance can include a paucity of informative sites at individual loci, misidentification of paralogs as orthologs, and a lack of fit between the sequence data and the substitution model used to infer the tree. The biological causes of gene-tree discordance are incomplete lineage sorting (ILS) and introgression/hybridization (18), although natural selection can sometimes result in discordance (e.g., ref. 19). In the presence of any of these biological processes, individual loci can have different histories from the species tree; discordance due to either introgression or ILS in the history of extant lineages does not go away over time, so studies of both ancient and recent divergences can be affected.

Discordance presents a problem for inferences of convergent evolution because it can produce patterns of homoplasy even when none has occurred (20). In a phenomenon dubbed “hemiplasy” (21), transitions that occur on branches of discordant trees that do not exist in the species tree will generate incongruent trait patterns (Fig. 1*A*). Incongruent traits—binary traits that cannot be explained by a single trait transition on a bifurcating species tree—are the basis for claims of convergent evolution and homoplasy. As all discordant trees have discordant branches (22, 23), hemiplasy is expected to explain a substantial number of observed incongruent trait patterns in all cases where biological discordance exists (24⇓⇓–27). Note that hemiplasy cannot be confused for cases of convergent evolution arising from different ancestral states (26), so we do not consider this type of convergence here.

While complications due to hemiplasy have begun to be appreciated, the relative importance of hemiplasy and homoplasy in any particular set of relationships is rarely quantified (24, 28). Here, we present a model that describes the relative probabilities of hemiplasy and homoplasy for an incongruent trait. Given even minimal amounts of gene-tree discordance, we find that hemiplasy is more likely, or at least as likely, as homoplasy for a wide range of conditions. Based on this model, we present a method for calculating the ratio of hemiplasy to homoplasy [the “hemiplasy risk factor” (HRF)] along a phylogeny. This method provides a general tool for understanding the risk of incorrect inferences of convergence, for use with phylogenies of any size.

## Results

### The Model.

Consider the evolution of a trait among three species with the phylogeny [A, (B, C)]. We assume that this trait, which can be either molecular or phenotypic, is binary, with 0 representing the ancestral state and 1 representing the derived state. On this phylogeny, we observe an incongruent trait pattern (e.g., allelic states of A = 1, B = 0, and C = 1; Fig. 1*A*) that cannot be explained by a single transition along the species tree. There are two possible explanations for this incongruence (Fig. 1*A*). One possibility is that the trait is hemiplastic: A single transition has occurred along the internal branch of a discordant gene tree in which B is sister to a clade consisting of A and C. Alternatively, the trait is homoplastic: Two trait transitions have occurred, due to either convergence (two changes from 0 → 1) or reversal (one change from 0 → 1 and one change from 1 → 0; *SI Appendix*, Fig. S1). In both homoplasy scenarios, some of the observed allelic states are not identical by descent, although they differ as to whether these are the derived states (convergence) or the ancestral states (reversal).

The goal of our model is to understand the biological parameters that affect the relative frequency of these two mechanisms for trait incongruence. To address this goal, we derive expressions for the probabilities of hemiplasy (*P*_{e}) and homoplasy (*P*_{o}), which both broadly depend on population size (*N* diploid individuals, assumed constant), the rate of mutation (μ, per 2*N* generations), and the timing of speciation events (in units of 2*N* generations). Assuming that all discordance is due to incomplete lineage sorting, the population size and timing of speciation events determine the genealogy τ, with possible topologies α = [A, (B, C)], β = [B, (A, C)], and γ = [C, (A, B)]. Topology α is concordant with the species tree, while both β and γ are discordant. Classic results from coalescent theory (29) provide the probability of each topology, along with the lengths of each branch in each tree. The topology of tree τ determines the probability of mutation, ν(λ_{i},τ), along every branch *i* (labeled as in Fig. 1*B*) with length λ_{i}. Because of its dependence on gene-tree topology, the function ν(λ_{i},τ) is different for each branch. For instance, the probability of mutation along the internal branch in discordant tree β is uniquely constrained by *t*_{3} and the time to coalescence of A and C. The expressions for each ν(λ_{i}, τ) are described in *SI Appendix*.

In the case of hemiplasy, two events are necessary to explain the allelic states specified above: the occurrence of discordant genealogy β and a mutation occurring on the internal branch of this gene tree (but not on any other branch). This probability is given by:*t*_{2} is the length of the single internal branch on the species tree (Fig. 1*A*), and λ_{4} is the length of the internal branch on genealogy β (Fig. 1*B*). The three terms here correspond to (from left to right): the probability of observing genealogy β, the probability of a mutation occurring on the internal branch of this genealogy, and the probability of no mutations occurring anywhere else.

In the case of homoplasy, two alternative sets of mutational events may have occurred: either two independent origins of the derived state (on the terminal branches leading to A and C) or a single mutation along the branch leading to the ancestor of all three species, coupled with a back-mutation on the branch leading to B. Because these events can occur on any one of the three possible topologies, the probability of homoplasy is:*p*(τ) is the probability of genealogy τ. The probability of homoplasy therefore corresponds to the sum of the probability of convergence averaged across all topologies and the probability of reversal averaged across all topologies (the first and second terms in the summation, respectively).

### The Relative Probability of Hemiplasy and Homoplasy.

Our model demonstrates that the probabilities of both hemiplasy and homoplasy are determined by many of the same factors. The frequencies of individual gene trees, the branch lengths in these trees, and the rate of mutation to the character of interest all interact to determine these probabilities. However, because the exact values for these parameters may rarely be known, the precise probabilities of each of these outcomes on their own may not be easily obtainable. Instead, here we present the ratio of probabilities of hemiplasy and homoplasy (*P*_{e}/*P*_{o}) for a range of parameter values to provide a quantitative sense for when each will be relatively more important. The ratio *P*_{e}/*P*_{o} represents the relative probabilities for a specific incongruent trait pattern; in the next section, we present a measure that averages over the two patterns that are possible with binary traits on a rooted three-taxon tree.

The amount of discordance (which decreases with increasing *t*_{2}) and the mutation rate tend to have opposite effects on the relative probabilities of hemiplasy and homoplasy (Fig. 2*A*). As expected, increasing levels of discordance lead to increasing relative probabilities of hemiplasy. Although homoplasy can occur on discordant trees, this outcome still requires two mutations; therefore, for a given mutation rate, more discordance will always lead to relatively more hemiplasy than homoplasy. Multiple empirical datasets include levels of discordance that imply *t*_{2} < 1 for individual internal branches (e.g., refs. 14, 16, 27, and 30⇓–32), suggesting that the probability of hemiplasy in such systems will be high. On the other hand, when there is little opportunity for discordance, there is also a greatly reduced probability for hemiplasy. By the time an internal branch of a species tree is 6*N* generations long, ∼99.9% of all gene-tree topologies will share this branch—that is, they will be concordant. In this area of parameter space, homoplasy becomes vastly more likely than hemiplasy as an explanation for incongruent trait patterns.

The mutation rate also has a strong effect on the relative probabilities of hemiplasy and homoplasy. As should be expected, when the rate of mutation to the character of interest is high, homoplasy becomes relatively more likely. Reducing mutation rates results in a steeper decrease in homoplasy relative to hemiplasy (because of the two mutations that are required), leading to a higher ratio of *P*_{e}/*P*_{o}. For low enough mutation rates, very few trees must be discordant in order for the probabilities of hemiplasy and homoplasy to be equal, so that *P*_{e}/*P*_{o} = 1. In this area of parameter space, there is an equal probability for any particular incongruent trait pattern to be caused by either hemiplasy or homoplasy. We can also think of this ratio as indicating that, given a collection of incongruent trait patterns of this type, we expect that approximately half will be due to hemiplasy and half to homoplasy. These results, shown for one particular pattern of incongruence (A = 1, B = 0, C = 1), are expected to be the same for the two possible incongruence patterns on the species tree in Fig. 1*A* (see the next section for cases in which these can be different).

While the amount of discordance and rate of mutation both have strong effects on *P*_{e}/*P*_{o}, the length of the branch subtending the clade of interest (*t*_{3}; Fig. 1*A*) and the lengths of the tip branches (*t*_{1}) have a much smaller effect. The length of *t*_{3} is only relevant to cases of homoplasy caused by reversal and has no effect on either the probability of hemiplasy or the probability of homoplasy caused by convergence. As such, it has very little effect on the relative probabilities of these two outcomes (*SI Appendix*, Fig. S2). In contrast, *t*_{1} affects the probability of homoplasy by both reversal and convergence while having no impact on the probability of hemiplasy, and consequently has a larger influence on *P*_{e}/*P*_{o} (Fig. 2*B*). All convergent events on the species tree considered here involve changes on one of the paired lineages (in this case, species C) and the unpaired lineage (species A), so *t*_{1} determines how much time there is for a change to occur on these lineages (together with *t*_{2}, which contributes to the probability of a mutation along the branch, leading to species A). Therefore, larger or smaller values of *t*_{1} allow for more or less time in which homoplasy can occur, respectively, shifting the balance of *P*_{e}/*P*_{o} slightly (compare Fig. 2 *A* and *B*).

A numerical example may help to highlight the utility of these calculations. One of the most well-studied cases of discordance caused by ILS occurs among human (H), chimpanzee (C), and gorilla (G; briefly reviewed in ref. 33). The genome sequences of these three species—plus an outgroup (orangutan)—show that 70% of nucleotides in protein-coding regions are congruent with the presumed species tree [i.e., (G, (H, C))], with 15% congruent with each of the (C, (H, G)) and (H, (C, G)) topologies (31). Multiple studies have also calculated the divergence times and effective population sizes of these species and their ancestral lineages (e.g., refs. 34 and 35), as well as per-generation nucleotide mutation rates (e.g., ref. 36), offering us the opportunity to estimate *P*_{e}/*P*_{o} for this clade. For incongruent nucleotides with states H = 1, C = 0, and G = 1, the expected *P*_{e}/*P*_{o} = 16.3 (*Methods*). This estimate means that hemiplasy is approximately sixteen times more likely than homoplasy to explain cases where human and gorilla share a derived allele to the exclusion of chimpanzee. Alternatively, we can view the value as indicating that hemiplasy is the cause of ∼94% of all sites with this incongruent site pattern. This value of *P*_{e}/*P*_{o} uses the average mutation rate per site in the genome (1.2 × 10^{−8} per generation), even though the rate can vary by orders of magnitude across sites. At hypermutable CpGs (1.2 × 10^{−7}; ref. 36), for instance, only two-thirds of observed incongruence is expected to be due to hemiplasy (i.e., *P*_{e}/*P*_{o} = 1.6). In contrast, at non-CpG sites (9.9 × 10^{−9}; ref. 36) ∼95% of all incongruence will be due to hemiplasy (i.e., *P*_{e}/*P*_{o} = 19.7).

### Calculating the HRF Along a Phylogeny.

The model and results presented thus far have focused on a rooted three-taxon tree, with speciation events that have occurred in the recent past. Importantly, however, the relevant calculations for the probabilities of hemiplasy and homoplasy are not restricted to such trees, but can be iteratively applied to trees of any size, of any height, and with nonultrametric branch lengths. These three-taxon probabilities can therefore be used within larger phylogenetic trees to highlight individual branches along which hemiplasy is likely to be responsible for observed incongruence.

To usefully summarize the relative probabilities of hemiplasy and homoplasy, we define the HRF as the ratio of *P*_{e} to *P*_{o} averaged across incongruent trait patterns:*P*_{e} and *P*_{o} separately for each incongruent pattern because in nonultrametric trees these may be quite different from one another (e.g., when the branches leading to species “B” and “C” are not the same length; Fig. 1*A*). While this means that the HRF no longer represents the probability of hemiplasy for a specific incongruent pattern, it does helpfully summarize the overall risk of hemiplasy relative to homoplasy.

The HRF is intended to highlight individual branches of phylogenetic trees along which hemiplasy may be responsible for observed incongruence, in opposition to the assumption that all incongruence is due to homoplasy. An HRF can be calculated for each internal branch of a rooted species tree, but not for tip branches. Note that given the definitions of hemiplasy and homoplasy, the HRF associated with a branch does not indicate that a character-state transition has occurred along this branch: For instance, under hemiplasy due to ILS, the relevant mutation has occurred on an earlier lineage (e.g., the one with length *t*_{3} in Fig. 1*A*) but has remained polymorphic through the relevant branch (e.g., the one with length *t*_{2} in Fig. 1*A*). Instead, HRFs identify branches of a tree where the processes leading to hemiplasy may be occurring, and around which homoplastic transitions may be incorrectly inferred. In such cases, standard ancestral state reconstruction methods will infer homoplastic substitutions on the branches neighboring the one with the high HRF—either the branches directly “above” and “below” it in the case of reversals or one branch below and one branch sister to it in the case of convergence (*SI Appendix*, Fig. S1).

We have implemented a package written in R (https://www.r-project.org/) for calculating and visualizing HRFs on larger phylogenetic trees (github.com/guerreror/pepo). The software walks through a phylogenetic tree starting from the tips, calculating HRFs on every trio of lineages (*Methods*). The input species tree must have branch lengths given in units of 2*N* generations (sometimes referred to as “coalescent units”), and a mutation rate must be specified. Multiple methods can output species trees in coalescent units (e.g., the MP-EST program; ref. 38), or—under the assumption that all gene-tree discordance is due to ILS—these lengths can be estimated for internal branches of a tree by taking the proportion of discordant trees [i.e., 1 minus the concordance factor (CF)] and solving for *t*_{2} = −log(3/2(1 − CF)). Because HRFs are intended to aid researchers during exploratory studies of the evolution of many different types of traits, the exact value of the mutation rate used should not be a key concern. For a reasonable estimate of the mutation rate, the HRFs calculated will instead represent the relative risk among branches of a larger tree along which hemiplasy may be occurring. The exact value of *P*_{e}/*P*_{o} can still be calculated for a specific trait of interest on a smaller portion of the tree using Eqs. **1** and **2**.

To provide an example of HRFs calculated on a larger tree, we use the phylogeny of wild tomatoes presented in Pease et al. (16). This dataset represents a rather extreme example of a recent rapid radiation, with none of the 2,745 gene trees (inferred from 100-kb windows) matching the exact topology of the species tree. After generating the necessary input species tree from a collection of gene trees (*Methods*), we calculated HRFs for all internal branches (Fig. 3). As can be seen, while there are multiple longer branches along which homoplasy should be a better explanation for incongruent patterns than hemiplasy, there are also many branches with equal or higher relative probabilities of hemiplasy. The branches with high HRFs are often the shortest branches, where the most discordance is to be expected. But note that two branches with equal discordance (i.e., equal CFs; ref. 37) can have very different HRFs. The calculation of HRFs depends not only on the length of the target branch in coalescent units—which solely determines the expected degree of discordance, and to a large extent the probability of hemiplasy—but also on the length of the surrounding branches, which help to determine the probability of homoplasy. HRFs therefore represent a unique and complementary tool for understanding trait evolution on trees.

## Discussion

Phylogenomic studies have revealed high levels of gene-tree discordance in many species trees (14⇓⇓–17). Such discordance can be concealed by statistical measures of support for the species tree, such as bootstrap support or posterior probabilities, which can be high even when discordance is rampant. Regardless of confidence in the species tree topology, underlying gene-tree discordance means that observed patterns of trait incongruence can be due to single transitions. The phenomenon of hemiplasy likely explains the distribution of multiple ecologically important traits (e.g., refs. 39 and 40), but can mislead standard methods for inferring the number of times a trait has involved and the timing of these transitions (25⇓–27). The goal of the work presented here is to quantify the risk of incorrectly inferring homoplasy when hemiplasy is occurring. We do this by finding explicit expressions for the probabilities of hemiplasy and homoplasy and use these expressions to develop a measure (the HRF) that highlights branches of a phylogeny along which hemiplasy may be occurring.

Our results are limited to the analytically tractable case involving three lineages and are therefore a simplification relative to larger trees. One important limitation of this approach when applying it to larger trees is that we are considering only two evolutionary events at a time: either two substitutions in the case of homoplasy or one substitution and one discordant gene tree in the case of hemiplasy. When an incongruent trait is limited to three lineages or clades (even if these clades contain multiple species), this approximation will be appropriate. Even when more than three lineages are involved, single hemiplastic substitutions can explain incongruent patterns. However, the probability of such events declines rapidly with the number of taxa involved (as the fraction of genealogies that are compatible with hemiplasy declines; *SI Appendix*, Fig. S3). Most likely, incongruent traits on rooted trees with more than five taxa will require more than two events. Such patterns may be explainable solely by homoplasy or hemiplasy, or even a mix of the two. The idiosyncrasy of larger trees and more complex patterns of incongruence likely means that simulation approaches will be necessary to address the risk of hemiplasy.

Our model is currently applicable to any type of binary trait, as these are the traits that are most often the focus of studies of convergence. While it should be straightforward to extend the model to any sort of discrete trait [e.g., gene family size (41)], even when limited to binary traits, a more important question may be what value of the mutation rate to use. We have defined the mutation rate in our model as the rate at which one state can change into the other. Although we have discussed a reversible process here (i.e., mutations from 0 → 1 and 1 → 0 are allowed), this is not a requirement of our model—it can accommodate other types of mutational processes as well. For instance, we could include markers for which back-mutations are thought to be impossible or rare (e.g., ref. 32) by simply removing the reversal term from Eq. **2**, or by using two mutation parameters with separate rates for 0 → 1 and 1 → 0 transitions (equivalent to different rates in the λ_{5} and λ_{2} terms of Eq. **2**, respectively). Similarly, while we have assumed that the rate of mutation is constant across the tree, separate mutation parameters for different branches can be included in future work. Fortunately, our model does not require us to specify whether the trait is molecular or phenotypic, nor any other details about the process. For studies of molecular traits—such as nucleotide substitutions—the appropriate mutation rate to use should be clear. For phenotypic traits, the mutation rate may be much higher or much lower than this per-nucleotide rate. For example, the loss of some traits may be possible by the inactivation of multiple genes (e.g., ref. 42); in such cases, the appropriate mutation rate should include the sum of nucleotide mutation probabilities at all of the sites in the genome at which inactivation can be accomplished, making it much higher than the single-site rate. In contrast, trait transitions requiring multiple nucleotide changes at a limited number of sites in the genome may have a total mutation rate that is equal to the square or cube of the per-nucleotide rate, making it even lower (and consequently making homoplasy even less likely). Although the HRF still requires a mutation rate be specified to be calculated, our hope is that it will highlight the relative risk among different branches of a larger tree regardless of the specific value used.

Our model makes two important assumptions. First, we have based all of our calculations on a coalescent model in which ILS is the only process that generates gene-tree discordance. However, at least among eukaryotic organisms, introgression appears to be a biological cause of discordance in a wide variety of systems (43). To a first approximation, the process causing discordance should not have a large effect on the results, as both ILS and introgression will be more likely to lead to discordant topologies when there are short internode branches (because introgression between sister lineages does not result in discordance). The HRF will therefore still point to the same lineages along which hemiplasy is high or low relative to homoplasy, the latter of which is not affected by introgression. In addition, our calculations assuming only ILS are likely to be underestimates of the probability of hemiplasy: If the cause of discordance is introgression, the internal branches of discordant gene trees do not have to be as short, making the probability of mutation along them higher. On the other hand, estimates of internal branch lengths on the species tree can be artificially lowered by gene-tree discordance due to technical errors. This occurs because estimated branch lengths must accommodate observed levels of discordance, with greater discordance, implying shorter branches. As errors in gene-tree inference are more likely farther in the past, it may be that estimates of the probability of hemiplasy are commensurately less reliable for branches deeper in the species tree.

Our second important assumption is that selection is not acting on the traits of interest. Directional selection on loci underlying a trait will cause them to be more concordant relative to the neutral expectations for the case in which ILS is the sole cause of discordance. Note, however, that the magnitude of this effect may be small, especially for traits controlled by multiple loci. Even when selection on individual loci is strong, it is only selection on the internal branch of the species tree that matters—lineage-specific selection cannot affect discordance. Moreover, if introgression is the cause of hemiplasy, then selection is irrelevant to the probability of discordance. While directional selection decreases discordance, balancing selection can increase the probability of discordance. Multiple examples of hemiplasy acting on balanced polymorphisms have already been identified (e.g., refs. 14, 39, and 44). As these examples are also associated with shorter internal branches of the respective species trees, the HRF will likely also have highlighted the pertinent branches, regardless of whether the assumptions of our model have been violated.

We have modeled the probability of hemiplasy and homoplasy for a single trait. If the trait under consideration is the allelic state at a specific nucleotide—as in the primate example used above—or a phenotype controlled by such a substitution, then one may observe multiple traits showing the same incongruent pattern for at least two reasons. First, the fact that different loci across a genome will share the same discordant topology means that the appearance of multiple traits with the same pattern of incongruence are expected under hemiplasy. For this case, the assumption of independence among the single traits should be approximately correct. Second, different sites located close together in the genome will show the same incongruent pattern because of linkage. Although such stretches may be short (on the order of exon length; refs. 28 and 45), two sites in the same gene will be more likely to show the same pattern of incongruence because they share the same gene-tree topology. While such clustering of hemiplastic sites may help to identify the cause of incongruence (because homoplastic sites are not expected to be affected by linkage), they also violate our assumption of independence among traits. Further work will need to be done to fully incorporate linkage, and the correlation it induces among multiple traits, into our model.

Recent work on the genetic basis of convergent traits has revealed that such traits are sometimes determined by convergent molecular changes (e.g., refs. 46⇓–48). The proportion of cases in which convergent phenotypes are underlain by convergent genotypes remains an open question (3, 4), and both additional phylogenetic and functional work will be needed to accurately estimate this proportion. The work here aims to aid phylogenetic studies of convergent evolution by quantifying the proportion of time the apparent convergence in traits (which would also appear to be molecular convergence) may instead be caused by underlying gene-tree discordance. Although the ecological conditions under which hemiplasy occurs may still be informative about the processes driving similarity in traits (e.g., ref. 49), properly distinguishing between hemiplasy and homoplasy is necessary for understanding the molecular basis for convergent evolution.

## Methods

### Calculating *P*_{e}/*P*_{o} for Primates.

To calculate *P*_{e}/*P*_{o} in the HC ancestor, we estimated branch lengths in coalescent units (6.8, 6.8, 7.6, 0.6, and 3.6 for H, C, G, HC, and HCG, respectively) that agree with published divergence times and effective population sizes (34, 35, 50). These lengths assume that H and C diverged 4.1 Mya, that HC and G diverged 5.5 Mya, and that generation time is 20 y throughout the clade. Additionally, we assume effective population sizes of 15,000; 15,000; 18,000; 19,000; 65,000; and 45,000 individuals for H, C, G, HC, and HCG, respectively. Given these assumptions, the value of *P*_{e}/*P*_{o} for incongruent cases {H = 0, C = 1, G = 1} and {H = 1, C = 0, G = 1} is the same, but in *Results*, we focus on the latter for clarity. For the genome-wide average estimate of the population mutation rate, we use μ = 1.2 × 10^{−4} per 30,000 generations, which implies that *N*_{e} = 15,000 individuals and a per-site mutation rate of 4.0 × 10^{−9} (one-third of the published estimated rate, since we are considering convergence between only two states); the estimates for CpG and non-CpG sites used the corresponding per-site rates (from ref. 36) and the same value of *N*_{e}. We assumed that the mutation rate was constant along the phylogeny.

### Calculating HRFs.

An HRF (Eq. **3**) can be calculated for any branch with a sibling, an ancestor, and two daughter lineages (compare branch 4 in Fig. 1). In other words, all branches of a phylogeny—except the root and leaves—have an HRF value. The *pepo* package walks up a phylogeny (in R, a *phylo* list as defined by the *ape* package; ref. 51) and calculates HRF for each internal branch, returning the values in a new data frame (compatible with *treeio*; ref. 52). The default HRF calculation in *pepo* allows reversals at the same rate as forward mutations, but these parameters can be specified by the user. This default method assumes that the ancestor of each focal branch has a length of at least 8*N* generations [i.e., *t*_{3} = max(4, *x*), where *x* is the observed ancestral branch length]. This setting, which can also be modified by users, is intended to reduce the effect of our assumption that A, B, and C have coalesced by the end of *t*_{3}.

We calculated HRFs along the tomato phylogeny reported by Pease et al. (16), specifically the “best coalescent-based phylogeny from 100 replicates of MP-EST using 100-kb genome window trees” in the supplement of that work. Because MP-EST assigns an arbitrary length of nine to leaves, we modified terminal branch lengths in two ways. For species where multiple individuals were collected, we collapsed monophyletic samples into a single branch representing the species (the ancestral branch of the replicates). For species with a single sample, we assigned a terminal length of one, an equally arbitrary value that is probably closer to the evolutionary history of this clade.

## Acknowledgments

We thank the editor and two reviewers for constructive feedback. This work was supported by National Science Foundation Grant DBI-1564611.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: rafguerr{at}indiana.edu.

Author contributions: R.F.G. and M.W.H. designed research, performed research, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1811268115/-/DCSupplemental.

- Copyright © 2018 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- ↵
- Wake DB,
- Wake MH,
- Specht CD

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Misof B, et al.

- ↵
- ↵
- Fontaine MC, et al.

- ↵
- Jarvis ED, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Copetti D, et al.

- ↵
- ↵
- ↵
- Wu M,
- Kostyun JL,
- Hahn MW,
- Moyle LC

- ↵
- Mendes FK,
- Livera A,
- Hahn MW

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Hobolth A,
- Dutheil JY,
- Hawks J,
- Schierup MH,
- Mailund T

- ↵
- ↵
- ↵
- ↵
- Lamichhaney S, et al.

- ↵
- ↵
- Hahn MW,
- De Bie T,
- Stajich JE,
- Nguyen C,
- Cristianini N

- ↵
- ↵
- ↵
- ↵
- Slatkin M,
- Pollack JL

- ↵
- ↵
- Projecto-Garcia J, et al.

- ↵
- Zhen Y,
- Aardema ML,
- Medina EM,
- Schumer M,
- Andolfatto P

- ↵
- Lee KM,
- Coop G

- ↵
- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Evolution