## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Maximum likelihood inference of reticulate evolutionary histories

Edited by David M. Hillis, The University of Texas at Austin, Austin, TX, and approved October 7, 2014 (received for review April 30, 2014)

## Significance

Phylogenetic trees play a central role in biology, modeling evolutionary histories of taxa ranging from genes, to genomes, and to species. Although trees will continue to be an essential modeling tool in evolution, phenomena such as hybridization, or gene flow more generally, result in evolutionary histories that are best modeled by phylogenetic networks. Inference of such networks is complicated by the presence of other evolutionary events, such as incomplete lineage sorting (ILS). Here, we provide a maximum likelihood method for inferring reticulate evolutionary histories while accounting for ILS. The method enables new evolutionary analyses under more complex evolutionary scenarios than existing methods can handle.

## Abstract

Hybridization plays an important role in the evolution of certain groups of organisms, adaptation to their environments, and diversification of their genomes. The evolutionary histories of such groups are reticulate, and methods for reconstructing them are still in their infancy and have limited applicability. We present a maximum likelihood method for inferring reticulate evolutionary histories while accounting simultaneously for incomplete lineage sorting. Additionally, we propose methods for assessing confidence in the amount of reticulation and the topology of the inferred evolutionary history. Our method obtains accurate estimates of reticulate evolutionary histories on simulated datasets. Furthermore, our method provides support for a hypothesis of a reticulate evolutionary history inferred from a set of house mouse (*Mus musculus*) genomes. As evidence of hybridization in eukaryotic groups accumulates, it is essential to have methods that infer reticulate evolutionary histories. The work we present here allows for such inference and provides a significant step toward putting phylogenetic networks on par with phylogenetic trees as a model of capturing evolutionary relationships.

Phylogenetic trees have long been a mainstay of biology, providing an interpretive model of the evolution of molecules and characters and a backdrop against which comparative genomics and phenomics are conducted. Nevertheless, some evolutionary events, most notably horizontal gene transfer in prokaryotes and hybridization in eukaryotes, necessitate going beyond trees (1). These events result in reticulate evolutionary histories, which are best modeled by phylogenetic networks (2). The topology of a phylogenetic network is given by a rooted, directed, acyclic graph (rDAG) that is leaf-labeled by a set of taxa (Fig. 1; more details are provided in *Model* and *SI Appendix*). Reticulation events result in genomic regions with local genealogies that are incongruent with the speciation pattern. Several methods and heuristics use this incongruence as a signal for inferring reticulation events and reconstructing phylogenetic networks from local genealogies. These methods, which are surveyed elsewhere (2⇓–4), assume that reticulation events are the sole cause of all incongruence among the gene trees and seek phylogenetic networks to explain all of the incongruence. A serious limitation of these methods is that they would grossly overestimate the amount of reticulation in a dataset when other causes of incongruence are at play. Indeed, several recent studies (5⇓⇓⇓–9) have shown that detecting hybridization in practice can be complicated by the presence of incomplete lineage sorting (ILS) (Fig. 1).

Recently, a set of methods was devised to analyze data where reticulation and ILS might both be simultaneously at play (10⇓⇓⇓⇓–15). However, these methods are all applicable to simple scenarios of species evolution and mostly assume a known hypothesis about the topology of the phylogenetic network. As reported (16, 17), we devised methods for computing the likelihood of a phylogenetic network, given a set of gene tree topologies. Still, these methods did not allow for inference of phylogenetic networks (they assume a given phylogenetic network topology and compute its likelihood). To the best of our knowledge, the first method to conduct a search of the phylogenetic network space in search of optimal phylogenies is described in a study by our group (18). However, this method is based on the maximum parsimony criterion: It seeks a phylogenetic network that minimizes the number of “extra lineages” resulting from embedding the set of gene tree topologies within its branches.

Progress with phylogenetic network inference notwithstanding, methods of inferring reticulate evolutionary histories while accounting for ILS are still considered to be in their infancy and inapplicable broadly (9). This inapplicability stems mainly from two major issues: the lack of a phylogenetic network inference method and the lack of a method to assess the confidence in the inference. Here, we develop methods that resolve both issues and carry phylogenetic networks into the realm of practical phylogenomic applications. For the inference, we propose operations for traversing the phylogenetic network space, as well as methods for assessing the complexity of a network. For measuring branch support of inferred networks, we use the bootstrap method. Furthermore, we derive, for the first time to our knowledge, the distribution (density) of gene trees with branch lengths, given a phylogenetic network, and use it in inference. Our methods provided very good results on simulated datasets. We also applied our methods to a dataset of thousands of loci from five house mouse (*Mus musculus*) genomes. The analysis yielded a well-supported evolutionary history with two hybridization events.

## Model

We seek to infer a phylogenetic network Ψ that models the (potentially reticulate) evolutionary history of a set *b* of Ψ has a length *b* in generations and *b*. A consequence of this setting is that the phylogenetic network does not have to be ultrametric. Furthermore, whereas the model does not require or necessitate a constant population size across all branches of the network, the population size and number of generations of each branch are dependent, given the branch’s length. In other words, the values of neither of these two parameters can be uniquely determined, given the length of a branch in our model (e.g., doubling both keeps the branch length unchanged). As is common in the literature in this area, we use a single composite parameter Ψ to denote the phylogenetic network topology and its branch lengths.

Tracing the evolution of a lineage from a leaf of the network back toward the root follows the multispecies coalescent model on trees, yet with one major difference: As a lineage encounters a reticulation node, it tracks one of the two parents of that node according to an inheritance probability. Because the probabilities of inheritance vary from one hybridization event to another in the network, and because different loci may provide different hybridization signals in the population (Fig. 1), the inheritance probabilities are given by a *m* is the number of independent loci (given the species phylogeny) in the dataset being analyzed and the entries of Γ satisfy three conditions for every *i*) *ii*) *b* incident into a tree node, and (*iii*) *b* and *b* incident into node *v* in Ψ, the entry *i* tracks branch *b* when “entering” the population represented by node *v*. It is important to note here that the topology and branch lengths of Ψ, as well as the matrix Γ, are to be inferred from the data; details are given below and in *SI Appendix*.

### Likelihood Formulation Based on Sequence Data.

Consider *m* independent loci along with a set *i*. The number of sequences in each *i*, and this number can vary from one locus to another. Under the independence assumption, the likelihood of an evolutionary history Ψ and inheritance probabilities Γ is given by*g*, and *g*, where *g* represents a gene genealogy (topology and branch lengths). It is important to note here that for computing the probability *u* is the per-site mutation rate, the conversion from units of the expected number of nucleotide substitutions per site to coalescent units can be done by multiplying every gene tree branch length by

### Likelihood Formulation Based on Estimated Genealogies.

Although the likelihood formulation given by Eq. **1** uses all of the information in the data, inference of the species phylogeny from estimated genealogies can significantly speed up the inference. In this case, the likelihood formulation becomes*i* and

Inference of high-quality species phylogenies based on Eq. **2** requires accurate estimates of the individual gene genealogies. Because the methods are aimed at data from closely related species and potentially multiple individuals from populations, the signal in the sequence data might be too low for estimating accurate gene genealogies. Although inference from sequences (Eq. **1**) accounts naturally for this issue, it is important to account for it explicitly when conducting inference from estimates of gene genealogies. Assume that for each locus *i*, the uncertainty in estimation is accounted for by having a collection of gene genealogies *i* based on *p* bootstrap replicates. In this case, we have*p* is given by the pmf or pdf, depending on whether the individual genealogy estimates are given by their topologies alone or by their topologies and branch lengths, respectively. The likelihood model is now given by Eq. **2**, with *p* from Eq. **3** being used instead of the pmf or pdf for individual binary genealogies. We demonstrate the performance of this formulation in *Results*.

### Maximum Likelihood Inference.

Under maximum likelihood (ML), the inference problem amounts to computing the pair **1** or based on estimated gene genealogies using Eq. **2**. Inference based on Eq. **1** requires computing the integral over all possible gene genealogies. Bryant et al. (21) provided an efficient algorithm for computing this integral when each independent locus is given by a biallelic marker. To enable ML inference based on Eq. **1**, the algorithm of Bryant et al. (21) needs to be extended along three axes: allowing for sites with more than two states, allowing for the species history to have reticulations, and allowing for each marker to consist of more than a single site. Although extensions along all three axes are technically achievable, inference of even three-taxon networks with a single reticulation from a few sites is computationally prohibitive (*Discussion*). We therefore focus on inference based on Eq. **2** in this work. Using this formulation, the pmf *Results*, we derive the pdf of gene genealogies (with branch lengths), given a phylogenetic network.

Given all of these tools, the inference problem is still very hard computationally, because the optimal Ψ and Γ need to be computed. It is standard in the case of species tree inference to use heuristics that walk the tree space in search of optimal solution candidates. It makes sense, therefore, to devise techniques for walking the phylogenetic network space in search of optimal phylogenetic networks while optimizing branch lengths and the Γ matrix. However, extra caution must be taken when searching the network space. In the case of trees, all rooted, binary trees on a given number of taxa are essentially different models with the same number of parameters. In the case of networks, on the other hand, an arbitrarily large number of reticulation nodes can be added during the search, resulting in more complex models that, by definition, could fit the data at least as well as simpler models. Because the goal is to estimate the true amount of reticulation, rather than only fitting the data, we address this challenge in two ways. First, we devise a search heuristic that searches the phylogenetic network space in layers. Second, we explore the use of cross-validation as a method to ameliorate overfitting the data, which adds to the array of other methods (e.g., information criteria) that have already been used (12, 16). Finally, to assess the fit of the inferred phylogenetic network to the data, we devise a parametric bootstrap approach that allows us to quantify branch support for the phylogenetic network. We give details for all of these methods below and in *SI Appendix*.

## Results

### Probability Density of a Gene Tree.

Given a phylogenetic network Ψ and a gene genealogy *j* (topology and branch lengths in both cases), we denote by *b* under the coalescent history *h* and the time of node *y* (a formal definition is provided in *SI Appendix*). We denote by *i*th element of the vector. Furthermore, we denote by *b* and by *b* under *h*. Then, we have*x* in the phylogenetic network Ψ. A full derivation of the formula and a more efficient algorithm for computing it along the lines of Yu et al. (17), which avoid explicit summations over the possible coalescent histories, are given in *SI Appendix*.

### Searching the Space of Phylogenetic Networks.

Letting *n* taxa, we denote by *n* leaves and *k* reticulation nodes. In particular, *k*: one operation that allows the search to ascend a layer from *SI Appendix*). It is worth mentioning that although the space of all phylogenetic tree topologies on *n* taxa is finite, the space of all phylogenetic network topologies on *n* taxa is, in theory, infinite, because *k* are unbounded. For example, consider the case of only two taxa. There is a unique, rooted tree in this case. However, because multiple hybridization events could happen between the same two sister taxa at different times, any number of horizontal edges can be added between these two taxa. Nevertheless, whether such repetitive hybridization scenarios are identifiable from typical genomic datasets is a different question.

A heuristic for estimating the optimal branch lengths for a fixed species tree topology, given gene tree topologies, that is based on repeated application of Brent’s method (22) was introduced by Wu (20). We use a similar heuristic for estimating the phylogenetic network branch lengths and inheritance probabilities (full details are given in *SI Appendix*). Coupling topological transformations and parameter estimation heuristics with the likelihood formulation above enables searching the space in a hill-climbing manner to infer an ML phylogenetic network. Given the existence of local optima within each layer, multiple, independent runs can be made.

### Controlling for Model Complexity.

Because networks in *K*-fold cross-validation, whereby the input set of gene trees is partitioned into *K* subsets of equal sizes, the parameters of the model are inferred from *k* to be the correct estimate of the number of reticulation nodes and

Finally, to assess the support of the phylogenetic networks we infer, we propose using parametric bootstrapping. Having inferred a network Ψ from the data *b* in Ψ as the number of networks (out of the *i*) either or both are reticulation edges or both are not and (*ii*) both define the same clusters (the cluster defined by a branch is the set of all taxa under that branch in the network).

### Performance on Simulated Data.

We implemented all of the methods described above in the publicly available, open-source software package PhyloNet (23) and studied the performance of the methods on several simulated datasets. In the simulation study whose results are reported in Fig. 2, we used phylogenetic network *i*) true gene tree topologies, (*ii*) estimated gene tree topologies, (*iii*) true gene tree topologies and branch lengths, and (*iv*) estimated gene tree topologies and branch lengths. The results of (*i*) and (*ii*) are shown in Fig. 2*B*, whereas the results of (*iii*) and (*iv*) are shown in Fig. 2*C*. For each setting of the number of loci and sequence length, we generated 30 datasets and conducted inferences on all of them.

Whereas the hybridization in the model network involves B and the most recent common ancestor (MRCA) of C and D, the length of the branch between the hybridization event and the divergence of C and D from their MRCA can have a big effect on distinguishing between the true hybridization scenarios and the two given by *A*. Therefore, for every dataset, we recorded whether the method inferred one of the three networks shown in Fig. 2*A*, as opposed to any other network with a single reticulation.

Several trends can be observed in Fig. 2*A*. First, using the true gene tree topologies with branch lengths results in more accurate inferences than using gene tree topologies alone. This finding is not surprising, because the former type of data contains more information than the latter. In particular, when using 80 or 160 loci, the inferred network from the true gene trees with branch lengths is always the true network. On the other hand, when using only the gene tree topologies for 160 loci, in five of the 30 cases, the inference returned one of the two alternative networks *SI Appendix*). However, we expect that sampling more individuals would result in more significant improvements on larger or more complex datasets. In terms of the inferred inheritance probabilities, the true gene trees resulted in very accurate estimates, whereas estimated gene trees with branch lengths resulted, in general, in more accurate estimates of the probabilities. Finally, we found that cross-validation generally does better than information criteria at determining the number of hybridization events (including on the biological dataset, as discussed below). More extensive simulation results under scenarios that are easier for inference than the ones we discussed here are contained in *SI Appendix*.

### Analysis of a Multilocus House Mouse Dataset.

We also used our method to analyze a multilocus dataset of house mouse (*M. musculus*) genomes, obtained from the studies of Staubach et al. (7), Didion et al. (24), and Yang et al. (25) (more details are provided in *SI Appendix*). Staubach et al. (7) found substantial genome-wide evidence of subspecific introgression in all four populations, amounting to 3% of the genome in the two *M. m. domesticus* populations (one from France and the other from Germany), 4% in an *M. m. musculus* population from Kazakhstan, and 18% in an *M. m. musculus* population from the Czech Republic. However, it is important to note that the HAPMIX method (26), which was used by Staubach et al. (7), does not explicitly account for ILS.

Our study included all of the samples in the study of Staubach et al. (7). Furthermore, our study included additional samples from an *M. m. musculus* population from China (25) that were not used in the study of Staubach et al. (7). In this analysis, we used estimated gene tree topologies alone. The reason for doing so is that the genomic sequences are obtained from very closely related individuals (these individuals are five individuals from the same species), and very little variation exists in the data to estimate branch lengths with any accuracy. Furthermore, this low variation does not allow for proper bootstrap analysis of gene trees for the individual loci. The powerful signal in this dataset comes from the very large number of loci. In our analysis, we found a significant improvement in a phylogenetic network with a single reticulation over no reticulations and a significant improvement in a phylogenetic network with two reticulations over a single reticulation. However, when we continued the search for the optimal network with three reticulations, we found that the improvement gained by considering a third reticulation event was insignificant based on the information criteria, and that there was no improvement at all based on cross-validation. We thus called the optimal phylogenetic network with two reticulations as our hypothesis for the evolutionary history of this set of genomes. This evolutionary history is shown in Fig. 3 (more details of the results and analyses are provided in *SI Appendix*). The phylogenetic network is not ultrametric, and it is worth emphasizing that the branch lengths are given in coalescent units. Thus, the lack of ultrametricity could be due to different population sizes or, to a lesser degree, different generation times.

Our analysis of house mouse genomes produces an evolutionary history that differs from that reported by Staubach et al. (7) not only in terms of the number of populations involved but also by accounting for the evolutionary history of the populations involved. We consider the percentages of the genome with introgressed origin reported by Staubach et al. (7) to be overestimates, because introgression involving an ancestral population that later split into more than one extant population would be multiply reported for each extant population in the case of the study by Staubach et al. (7). On the other hand, the same percentages would be underestimated in the case where admixed populations were used in place of the nonadmixed reference populations required by HAPMIX, as Staubach et al. (7) did by using putatively introgressed mouse samples to construct the reference populations. Notably, our methodology does not require the use of nonadmixed reference populations.

We hypothesize that the more recent introgression event in Fig. 3 is due to gene flow from secondary contact, where the ranges of the two *M. musculus* subspecies overlapped, roughly at the border between Germany and the Czech Republic. The biological interpretation of the more ancient introgression event is less clear. We conjecture that the event is related to gene flow during and after subspecific divergence. Further study may provide important clues to the mechanistic basis of the evolution of subspecies in *M. musculus* and the process of speciation itself.

It is important to note that although we used a very large number of loci, there was still uncertainty in the inferred origins of the two hybridization events (as shown in Fig. 3), a similar pattern to the one observed in the simulation results and discussed above. This uncertainty is a reflection of the weak signal in these data, coupled with the low inheritance probability and short branch length between the hybridization and the MRCA of *M. m. musculus* from China and *M. m. musculus* from Kazakhstan and *M. m. domesticus* from France and *M. m. domesticus* from Germany, which is an issue that we discussed above in the context of the simulated data. The samples used are very closely related, resulting in genomes with a very small number of segregating sites, and hence a weaker signal for inference. Nonetheless, the uncertainty is localized in the sense that the potential donors of the genetic material of each hybridization event revolve around a single ancestral node. Because all five populations under analysis are closely related, most of the reconstructed gene trees were not binary, due to identical sequences of multiple alleles. Because bootstrapping is not useful in these scenarios (every locus has a handful of sites, most of which are monomorphic), we used the nonbinary gene tree topologies for the loci and considered the set of all resolutions as the set of gene tree estimates to use in Eq. **3**.

## Discussion

We have devised methods that enable revisiting existing evolutionary analyses and conducting new ones when both hybridization and ILS are either suspected or observed. Programs implementing all of these methods are publicly available in the open-source software package PhyloNet (23). We illustrated the power of our method in extensive simulations and demonstrated its utility on a dataset of mouse genomes. In our model, we abstract the notion of hybridization such that each reticulation edge can be viewed as a “tunnel” through which genetic material can flow repetitively and at different, yet close, times. In other words, the interpretation of a reticulation edge is not that it is a single event of mating between two individuals from two populations or species; rather, it encompasses an ongoing gene flow within a time interval that can be abstracted with one edge and one inheritance probability. This abstraction is a major difference between our model and the more detailed population genetic models that account explicitly for rates of gene flow, such as the isolation-with-migration model. A major direction for future research is scaling up our methods to larger datasets. Currently, it takes a few seconds to a few minutes to evaluate the likelihood of a phylogenetic network with 10–20 taxa (17). This running time can vary significantly even among networks with the same numbers of taxa and reticulation events, because the shape of the gene tree and the configuration of the reticulation nodes in the network (their locations and interdependencies) are the crucial factors (27). However, optimizing the branch lengths and inheritance probabilities, coupled with the phylogenetic network search, is the bottleneck for computation. Furthermore, as our analyses, both on simulated and biological data, demonstrated, it might often be the case that several evolutionary histories have similar likelihoods. This observation calls for Bayesian approaches to inference of phylogenetic networks, whereby a distribution of networks, rather than a point estimate, is computed. In this case, modern Markov chain Monte Carlo techniques can replace the traditional hill-climbing technique we used here.

Although we discussed the model above with respect to a single population mutation rate (*θ*), it is generalizable in a straightforward manner to allow for different rates across the branches of the phylogenetic network if the branch-specific population size is known. Furthermore, a rate *i* to vary the mutation rates across loci (all gene tree branch lengths for locus *i* are multiplied by

A major assumption underlying our models and methods is free recombination between loci and no recombination within. This assumption is common to the majority of methods and tools that infer species phylogenies from multilocus data (even in the absence of hybridization). Relaxing this assumption requires introducing spatial dependence in the data, similar to a method we recently introduced (28). However, this extension only makes the model more complex and significantly increases the computational requirements of the inference methods. Currently, to use such inference methods, it is assumed that independent loci are sampled and that each locus is recombination-free. If a locus contains recombination, it can be partitioned into recombination-free regions, potentially at the expense of creating regions that are too short for reliable estimation of gene trees, further emphasizing the need to account carefully for uncertainty in gene tree estimates.

Although we focused on using gene trees, the ultimate goal is to enable inference directly from sequences (Eq. **1**), because such an approach uses the full signal in the data and bypasses the issue of uncertainty in gene tree estimates and the need to deal with it carefully. As discussed above, the SNP (single nucleotide polymorphism) and AFLP (amplified fragment length polymorphism) Package for Phylogenetic analysis (SNAPP) method of Bryant et al. (21) enables such an inference from biallelic data in the case of phylogenetic trees (when no hybridization is allowed), even though the authors presented a Bayesian approach based on the likelihood function, rather than an ML approach. Extending the algorithms of SNAPP to allow for an ML inference based on Eq. **1** is doable, yet the application of such an extension is computationally prohibitive even for the smallest phylogenetic network (three taxa and a single reticulation), as we have observed from preliminary work.

Finally, although we varied the number of individuals sampled per species in our simulations, more thorough investigations need be conducted of the data requirements (more taxa, more loci, or more alleles) to tease apart introgression signals from those signals arising from population effects. These investigations would inform the data collection and help focus the efforts aimed at ameliorating the computational requirements. For example, in the mouse dataset we considered here, the five genomes are very closely related, giving a very weak signal for estimating gene tree branch lengths with any reasonable accuracy. In this case, the large number of loci provided a powerful signal for the network inference. The simulations, on the other hand, show that with stronger signal within the individual markers, fewer loci would be needed for accurate inferences.

## Acknowledgments

L.N. was supported, in part, by National Science Foundation Grants DBI-1062463 and CCF-1302179, as well as by Grant R01LM009494 from the National Library of Medicine (NLM). K.J.L. was partially supported by a training fellowship from the Keck Center of the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics (Grant T15LM007093).

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: nakhleh{at}cs.rice.edu or yy9{at}cs.rice.edu.

Author contributions: Y.Y. and L.N. designed research; Y.Y., J.D., and L.N. performed research; Y.Y., J.D., and K.J.L. contributed new reagents/analytic tools; Y.Y., J.D., K.J.L., and L.N. analyzed data; and Y.Y., J.D., K.J.L., and L.N. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1407950111/-/DCSupplemental.

## References

- ↵
- ↵.
- Nakhleh L

- ↵.
- Huson D,
- Rupp R,
- Scornavacca C

- ↵
- ↵.
- Green RE, et al.

- ↵.
- Eriksson A,
- Manica A

- ↵
- ↵
- ↵
- ↵.
- Huber KT,
- Oxelman B,
- Lott M,
- Moulton V

- ↵
- ↵.
- Kubatko LS

- ↵
- ↵.
- Yu Y,
- Than C,
- Degnan JH,
- Nakhleh L

- ↵.
- Jones G,
- Sagitov S,
- Oxelman B

- ↵
- ↵.
- Yu Y,
- Ristic N,
- Nakhleh L

- ↵.
- Yu Y,
- Barnett RM,
- Nakhleh L

- ↵.
- Kubatko LS,
- Carstens BC,
- Knowles LL

- ↵
- ↵.
- Bryant D,
- Bouckaert R,
- Felsenstein J,
- Rosenberg NA,
- RoyChoudhury A

- ↵.
- Brent R

- ↵
- ↵
- ↵
- ↵
- ↵.
- Yu Y

- ↵

## Citation Manager Formats

## Sign up for Article Alerts

## Jump to section

## You May Also be Interested in

### More Articles of This Classification

### Biological Sciences

### Evolution

### Related Content

- No related articles found.