# Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees

^{a}Department of Genetics, University College London, London WC1E 6BT, United Kingdom; ^{b}Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138; ^{c}Key Laboratory of Random Complex Structures and Data Science (RCSDS), National Center for Mathematics and Interdisciplinary Sciences (NCMIS), Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China


Edited by David M. Hillis, The University of Texas at Austin, Austin, TX, and approved January 2, 2018 (received for review July 17, 2017)

## Significance

The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this overconfidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.

## Abstract

The Bayesian method is noted to produce spuriously high posterior probabilities for phylogenetic trees in analyses of large datasets, but the precise reasons for this overconfidence are unknown. In general, the performance of Bayesian selection of misspecified models is poorly understood, even though this is of great scientific interest since models are never true in real data analysis. Here we characterize the asymptotic behavior of Bayesian model selection and show that when the competing models are equally wrong, Bayesian model selection exhibits surprising and polarized behaviors in large datasets, supporting one model with full force while rejecting the others. If one model is slightly less wrong than the other, the less wrong model will eventually win when the amount of data increases, but the method may become overconfident before it becomes reliable. We suggest that this extreme behavior may be a major factor behind the spuriously high posterior probabilities for evolutionary trees. The philosophical implications of our results for the use of Bayesian model selection to evaluate opposing scientific hypotheses are yet to be explored, as are the behaviors of non-Bayesian methods in similar situations.

The Bayesian method was introduced into molecular phylogenetics in the 1990s (1–3) and has since become one of the most popular methods for statistical analysis in the field, in particular, for estimation of species phylogenies (4–7). It has been noted that the method often produces very high posterior probabilities for trees or clades (nodes in the tree). In the first-ever Bayesian phylogenetic calculation, a biologically reasonable tree for five species of great apes was produced from a dataset of 11 mitochondrial tRNA genes (739 sites), but the posterior probability for that tree, at 0.9999, was uncomfortably high (1). In the past two decades, the Bayesian method has been used to analyze thousands of datasets, with the computation made possible through Markov chain Monte Carlo (MCMC) (4, 5). It has become a common practice to report posterior clade probabilities only if they are

In the star-tree paradox, large datasets were simulated using the star tree and then analyzed to calculate the posterior probabilities for the three binary trees (Fig. 1). Most biologists would want the posterior probabilities for the binary trees to converge to

Bayesian model selection is known to be consistent (16). When the data size

Here we study the asymptotic behavior of Bayesian model selection in a general setting where multiple misspecified models are compared. We are interested in how the posterior probabilities for models behave when the data size increases. Do the dynamics depend on whether there are any free parameters in the models? If one model is less wrong than another (in a certain sense appropriately defined), will the less wrong model always win? We present the proofs and mathematical analyses in *General Theory for Equally Wrong Models with No Free Parameters (d=0)* and *General Theory for Equally Right or Equally Wrong Models with Free Parameters (d>0)*. In the main text, we summarize our results and illustrate them using three canonical simple problems. Our analysis suggests that the problem exposed by the star-tree paradox is actually far more troubling than discussed previously (11–15).

## Results

### Problem Description.

We consider independent and identically distributed (i.i.d.) models only. The data

The dynamics depend on how well the models fit the data. Let

### Characterization of Bayesian Model Selection.

The asymptotic behavior of *SI Text* and summarized in Fig. 2. We identify three types of asymptotic behaviors: type 1 (“balanced”), type 2 (“volatile”), and type 3 (“polarized”), as defined below. We also refer to three types of inference problems that give rise to those behaviors.

Type 1 (balanced) is for the posterior model probability

Type 2 (volatile) is for

Type 3 (polarized) is for

It is remarkable that the asymptotic behavior is determined by whether or not the compared models are distinct and not by whether they are both right or both wrong or by whether the compared models have unknown parameters. For example, cases

### Problem 1. Fair-Coin Paradox (Equally Wrong Models with No Free Parameter).

Consider a coin-tossing experiment in which the coin is fair with the probability of heads

As the models involve no free parameters, the likelihood **3** or 0.272, 0.0876, 0.0277, and 0.0088 exactly by the binomial distribution. Thus, in large datasets, moderate posterior probabilities will be rare, and either *A*, *i* shows the distribution of
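The polarization in the fair-coin problem can be checked with a short exact calculation. The sketch below is illustrative only: the two point models, with heads probabilities 0.4 and 0.6, and the equal model priors are assumptions made here because the model definitions are not fully legible in this copy of the text. Under those assumed values, the probability that the posterior for model 1 lies strictly between 0.01 and 0.99 should come out close to the 0.272 quoted above when n = 1,000, and it keeps shrinking as n grows.

```python
import math


def posterior_model_1(heads, n, p1=0.4, p2=0.6, prior1=0.5):
    """Posterior probability of model 1 (P(heads) = p1) versus model 2 (P(heads) = p2).

    p1, p2 and prior1 are assumed illustrative values, not taken from the text.
    """
    log_lik_1 = heads * math.log(p1) + (n - heads) * math.log(1.0 - p1)
    log_lik_2 = heads * math.log(p2) + (n - heads) * math.log(1.0 - p2)
    log_odds = math.log(prior1 / (1.0 - prior1)) + log_lik_1 - log_lik_2
    # Numerically stable logistic transform of the log posterior odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def log_binomial_pmf(heads, n, p):
    """Log of the binomial probability of observing `heads` heads in n tosses."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(heads + 1)
                  - math.lgamma(n - heads + 1))
    return log_choose + heads * math.log(p) + (n - heads) * math.log(1.0 - p)


def prob_moderate_posterior(n, low=0.01, high=0.99, true_p=0.5):
    """Exact probability, under the true (fair) coin, that low < P1 < high."""
    total = 0.0
    for heads in range(n + 1):
        if low < posterior_model_1(heads, n) < high:
            total += math.exp(log_binomial_pmf(heads, n, true_p))
    return total


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, round(prob_moderate_posterior(n), 4))
```

Because neither model has free parameters, the posterior depends on the data only through the number of heads, so the sum over the binomial distribution is exact and no simulation is needed.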

Fig. 3 *A*, *ii* shows the comparison of

### Problem 2. Fair-Balance Paradox (Equally Right Models or Equally Wrong and Indistinct Models).

The true model is

We assign a uniform prior on the two models (*Analysis of Problem 2 (Two Equally Right Models or Equally Wrong but Indistinct Models)*).

Fig. 3*B* shows the density of

### Problem 3. Fair-Balance Paradox (Equally Wrong and Distinct Models).

The true model is **1**, if*Analysis of Problem 3 (Two Equally Wrong and Distinct Models, Gaussian with Incorrect Variances)*). This is a type-3 problem (Fig. 2, **S15** in *Analysis of Problem 3 (Two Equally Wrong and Distinct Models, Gaussian with Incorrect Variances)*.

We use **5** holds and the two models are equally wrong, to generate independent variables *C*, *i* shows the estimated density of *A*, *i*), even though in problem 1 the models do not involve any unknown parameters while here they do.

Fig. 3 *C*, *ii* shows the density of

### Star-Tree Paradox and Bayesian Phylogenetics.

In Bayesian phylogenetics (1, 2), each model has two components: the phylogenetic tree describing the relationships among the species and the evolutionary model describing sequence evolution along the branches on the tree (19). Each tree

Here we consider three simple cases involving three or four species (Fig. 1). We use the general theory described above to predict the asymptotic behavior of posterior probabilities for trees and use computer simulation to verify the predictions.

Case A (Fig. 4 *A* and *A*´) involves equally right models. We use the rooted star tree *A*) to generate datasets to compare the three binary trees. The Jukes–Cantor (JC) substitution model (22) is used both to generate and to analyze the data, which assumes that the rate of change between any two nucleotides is the same. The molecular clock (rate constancy over time) is assumed as well, so that the parameters in each binary tree are the two ages of nodes (

The best-fitting parameter values are

Case B (Fig. 4 *B* and *B*´) involves equally wrong models that are indistinct. This is similar to case A except that the JC+Γ model (22, 23) is used to generate data, with different sites in the sequence evolving at variable rates according to the gamma distribution with shape parameter **1**) that are indistinct. The posterior tree probabilities have a nondegenerate distribution. This is the type-2 volatile behavior for equally wrong and indistinct models (Fig. 2,

Case C (Fig. 4 *C* and *C*´) involves equally wrong and distinct models. Like case B, the simulation model is JC+Γ with *B*). The true tree is the unrooted star tree *B*, with *B*). As **1**). As this is a type-3 problem (Fig. 2,

We note that most phylogenetic analyses involve unrooted trees as the clock assumption is violated except for closely related species. Furthermore, because of the violation of the evolutionary model, all trees (or the joint tree-process models) represent wrong statistical models. Thus, among the three cases considered in Fig. 4, case C is the most relevant to analysis of real data, when Bayesian model selection exhibits type-3 polarized behavior. Previous analyses of the star-tree paradox (12, 14, 15) have deplored the volatile behavior of the Bayesian phylogenetic method, but those studies examined case A only, so the real situation is worse than previously realized.
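The Jukes–Cantor model used in cases A to C above treats all nucleotide changes as equally likely, which gives its transition probabilities a simple closed form. The following is a minimal sketch of that standard formula, not the simulation or inference code used in the study; the function names are hypothetical, and `distance` is assumed to be the branch length in expected substitutions per site.

```python
import math


def jc69_transition_probs(distance):
    """Return (p_same, p_diff) under JC69 for a branch of the given length.

    p_same: probability that a site shows the same nucleotide at both ends.
    p_diff: probability of ending at one specific different nucleotide.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


def jc69_matrix(distance):
    """4 x 4 transition probability matrix (rows and columns ordered T, C, A, G)."""
    p_same, p_diff = jc69_transition_probs(distance)
    return [[p_same if i == j else p_diff for j in range(4)] for i in range(4)]


if __name__ == "__main__":
    for row in jc69_matrix(0.1):
        print([round(p, 4) for p in row])
```

Each row sums to 1, a zero-length branch gives the identity matrix, and very long branches approach 0.25 in every cell, which is the saturation regime in which sites carry little information about the tree.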

A practically important scenario is where all binary trees are wrong because of violation of the evolutionary model but the true tree is less wrong than the other trees. We present such a case in Table S2, in which the data are simulated under JC+Γ (with

## Discussion

### High Posterior Probabilities for Phylogenetic Trees.

This work has been motivated by the phylogeny problem and in particular by the empirical observation of spuriously high posterior probabilities for phylogenetic trees (9–14). We note that certain biological processes such as deep coalescence (24, 25), gene duplication followed by gene loss (26), and horizontal gene transfer (24, 26) may cause different genes or genomic regions to have different histories. However, as discussed in the Introduction, posterior probabilities for many trees or clades observed in real data analyses are decidedly spurious even if the true tree is unknown.

One explanation for the spuriously high posterior probabilities for phylogenetic trees is the failure of current evolutionary models to accommodate interdependence among sites in the sequence, leading to an exaggeration of the amount of information in the data. Interacting sites may carry much less information than independent sites. This explanation predicts the problem to be more serious in coding genes than in noncoding regions of the genome as noncoding sites may be evolving largely independently due to lack of functional constraints. However, empirical evidence points to the opposite, with noncoding regions having higher substitution rates and higher information content (if they are not saturated with substitutions), generating more extreme posteriors for trees.

Our results suggest that the problem may lie deeper and may be a consequence of the polarized nature of Bayesian model selection when all models under comparison are misspecified. As the assumptions about the process of sequence evolution are unrealistic, the likelihood model is wrong whatever the tree, although the true tree may be expected to be less wrong than the other trees. As the different trees constitute opposing models that are nearly equally wrong, the inference problem is one of type 3 (Fig. 2,

### Bayesian Selection of Opposing Misspecified Models.

We have provided a characterization of model selection problems according to the asymptotic behavior of the Bayesian method as the data size *General Theory for Equally Wrong Models with No Free Parameters ( d=0) and General Theory for Equally Right or Equally Wrong Models with Free Parameters (d>0*)]. While all of the problems considered here involve comparison of two equally right or equally wrong models, three different asymptotic behaviors are identified, which we label as type 1, type 2, and type 3. The type-1 behavior is for the posterior model probability

*A*and

*A*´, the estimates of

With type-3 behavior, *A*, *ii* and *C*, *ii* and Table S2). While the less wrong model eventually wins in the limit of infinite data, Bayesian model selection is overconfident in large but finite datasets, supporting the more wrong model with high posterior too often.

Note that the question of how the posterior model probability should behave when large datasets are used to compare two equally wrong models is somewhat philosophical and may not have a simple answer. One position is to accept whatever behavior the Bayesian method exhibits. This may be legitimate given that Bayesian theory is the correct probability framework for summarizing evidence in the prior and likelihood. The polarized behavior in type-3 problems may then be seen as a consequence of “user error” (for not including the true model in the comparison), exacerbated by the large data size. In this regard we note that the posterior predictive distribution (27, 28) can be used to assess the general adequacy of any model or the compatibility between the prior and the likelihood, and indeed this has been widely used to assess the goodness of fit of models in phylogenetics (29, 30). Nevertheless, a number of sophisticated and parameter-rich models have been developed for Bayesian phylogenetic analysis over three decades of active research (31), and furthermore extreme sensitivity to the assumed model is not a desirable property of an inference method. Seven decades ago, Egon S. Pearson (ref. 32, p. 142) wrote that “Hitherto the user has been accustomed to accept the function of probability theory laid down by the mathematicians; but it would be good if he could take a larger share in formulating himself what are the practical requirements that the theory should satisfy in application.” This stipulation may be relevant even today.

Two heuristic approaches have been suggested to remedy the high posterior model probabilities in the context of phylogenies. The first one is to assign nonzero probabilities to multifurcating trees (such as the star tree of Fig. 1) in the prior (11). This is equivalent to assigning some prior probability to the model

### Non-Bayesian Methods.

The phylogeny problem was described by Jerzy Neyman (ref. 33, p. 1) as “a source of novel statistical problems.” In the frequentist framework, the test of phylogeny, or test of nonnested models in general, offers challenging inference problems. Note that in many model selection problems, the model itself is not the focus of interest. For example, when an experiment is conducted to evaluate the effect of a new fertilizer, the sensitivity of the inference to the assumed normal distribution with homogeneous variance may be of concern, but the focus is not on the normal distribution itself. In phylogenetics, the phylogeny (which is a model) is of primary interest, far more important than the branch lengths (which are parameters in the model). The test of phylogeny is thus more akin to significance/hypothesis testing than to model selection. Model-selection criteria such as the Akaike information criterion (34) or the Bayesian information criterion (35) simply rank the trees by their likelihood (maximized over branch lengths) and will not be useful for attaching a measure of significance or confidence to the estimated tree. The phylogeny problem (or the problem of comparing nonnested models in general) falls outside the Fisher–Neyman–Pearson framework of hypothesis testing, which involves two nested models, one of which is true (36, 37).
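The remark that information criteria merely rank trees by likelihood follows from their definitions once every tree has the same number of parameters, since the penalty term is then identical across trees. The sketch below uses the standard formulas; the log-likelihood values, the parameter count of 5, and the 1,000 sites are hypothetical numbers chosen only for illustration.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def rank(scores, largest_first=False):
    """Return model names ordered by score."""
    return sorted(scores, key=scores.get, reverse=largest_first)


# Hypothetical maximized log-likelihoods for three binary trees.
log_liks = {"tree1": -2301.4, "tree2": -2299.8, "tree3": -2303.1}
n_params = 5   # assumed: same number of branch lengths in every tree
n_sites = 1000  # assumed alignment length

by_lik = rank(log_liks, largest_first=True)
by_aic = rank({t: aic(l, n_params) for t, l in log_liks.items()})
by_bic = rank({t: bic(l, n_params, n_sites) for t, l in log_liks.items()})

if __name__ == "__main__":
    print(by_lik, by_aic, by_bic)
```

All three orderings coincide, so the criteria identify a best tree but say nothing about how strongly the data favor it over the alternatives.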

In principle Cox’s likelihood-ratio test (38), which conducts multiple tests with each model used as the null, can be used to compare nonnested models. For type-3 problems (Fig. 2,

The most commonly used method for attaching a measure of confidence in the maximum-likelihood tree is the bootstrap (39), which samples sites (alignment columns) to generate bootstrap pseudodatasets and calculates the bootstrap support value for a clade (a node on the species tree) as the proportion of the pseudodatasets in which that node is found in the inferred ML tree. This application of bootstrap for model comparison appears to have important differences from the conventional bootstrap for calculating the standard errors and confidence intervals for a parameter estimate (40); a straightforward interpretation of the bootstrap support values for trees remains elusive (31, 41–43). At any rate, the asymptotic behavior of bootstrap support values under the different scenarios of Fig. 2 merits further research. For the fair-coin example of problem 1 (Fig. 2,
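The site-resampling procedure described above can be sketched in a few lines. This is a schematic illustration rather than the procedure of any particular phylogenetics package: `bootstrap_support` and `closest_pair_clade` are hypothetical names, and the toy inference function simply groups the two most similar sequences, standing in for a real maximum-likelihood tree search.

```python
import random


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Estimate bootstrap support for clades by resampling alignment columns.

    alignment: list of equal-length sequences, one per species.
    infer_clades: callable mapping an alignment to a set of clades
        (assumed to wrap whatever tree-inference method is in use).
    Returns a dict mapping each clade to the proportion of pseudodatasets
    in which it was inferred.
    """
    rng = random.Random(seed)
    n_sites = len(alignment[0])
    counts = {}
    for _ in range(n_replicates):
        # Sample site indices with replacement to build one pseudodataset.
        columns = [rng.randrange(n_sites) for _ in range(n_sites)]
        pseudo = ["".join(seq[c] for c in columns) for seq in alignment]
        for clade in infer_clades(pseudo):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: count / n_replicates for clade, count in counts.items()}


def closest_pair_clade(alignment):
    """Toy stand-in for tree inference: group the two most similar sequences."""
    best_pair, best_diff = None, None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            diff = sum(a != b for a, b in zip(alignment[i], alignment[j]))
            if best_diff is None or diff < best_diff:
                best_pair, best_diff = (i, j), diff
    return {frozenset(best_pair)}


if __name__ == "__main__":
    toy_alignment = ["ACGTACGTAC", "ACGTACGTTC", "TCGAACTTAC"]
    print(bootstrap_support(toy_alignment, closest_pair_clade, n_replicates=200))
```

Passing the inference step in as a callable keeps the resampling logic separate from the choice of tree-building method, so the same loop applies whether the clades come from maximum likelihood or any other estimator.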

## Materials and Methods

### Star-Tree Simulations.

For Fig. 4 *A*, *A*´, *B*, and *B*´, the true tree is *A*. The data of counts of five site patterns (*C* and *C*´, the true tree is *B*. Sequence alignments were simulated using EVOLVER and analyzed using MrBayes (4).

## Acknowledgments

We thank Philip Dawid and Wally Gilks for stimulating discussions and Jeff Thorne and an anonymous reviewer for constructive comments. Z.Y. was supported by a Biotechnological and Biological Sciences Research Council grant (BB/P006493/1) and in part by the Radcliffe Institute for Advanced Study at Harvard University. T.Z. was supported by Natural Science Foundation of China grants (31671370, 31301093, 11201224, and 11301294) and a grant from the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2015080).

## Footnotes

^{1}To whom correspondence should be addressed. Email: z.yang{at}ucl.ac.uk.

Author contributions: Z.Y. designed research; Z.Y. and T.Z. performed research; Z.Y. and T.Z. analyzed data; and Z.Y. and T.Z. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1712673115/-/DCSupplemental.

Published under the PNAS license.

## References

- Chen MH, Kuo L, Lewis P
- Suzuki Y, Glazko G, Nei M
- Bandyopadhyay PS, Forster M
- Dawid A
- Berk R
- Munro H
- Jukes T, Cantor C
- Xu B, Yang Z
- Yang Z
- Gupta SS, Yackel J
- Neyman J
- Petrov BN, Csaki F
- Akaike H
- Lehmann E
- Cox D
- Efron B, Tibshirani R
- Efron B, Halloran E, Holmes S. *Proc Natl Acad Sci USA* 93:7085–7090, and correction (1996) 93:13429–13434.


## Article Classifications

- Biological Sciences
- Evolution

- Physical Sciences
- Statistics