# Simultaneous Bayesian inference of phylogeny and molecular coevolution

^{a}Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland;^{b}Department of Integrative Biology, University of California, Berkeley, CA 94720;^{c}Swiss Institutes of Bioinformatics, Quartier Sorge, 1015 Lausanne, Switzerland;^{d}Department of Biological and Environmental Sciences, University of Gothenburg, 413 19 Gothenburg, Sweden;^{e}Global Gothenburg Biodiversity Centre, University of Gothenburg, 413 19 Gothenburg, Sweden

See allHide authors and affiliations

Edited by David M. Hillis, The University of Texas at Austin, Austin, TX, and approved January 23, 2019 (received for review August 10, 2018)

## Significance

Phylogenetic methods inferring molecular coevolution have recently gained traction for their capacity to predict protein surface interactions and to clarify their function within metabolic pathways. These methods rely on phylogenies inferred under the assumption that coevolution does not exist (i.e., sites evolve independently). However, violations of this assumption lead to considerable inaccuracies in the inferred phylogeny, which in turn can negatively affect the estimation of coevolution. We tackle this problem by developing a Bayesian method to simultaneously infer phylogenetic relationships and predict coevolution from nucleotide sequences. The main novelty of our method is its ability to account for the interdependencies between molecular coevolution and phylogeny, thus relaxing a long-standing assumption in the study of molecular evolution.

## Abstract

Patterns of molecular coevolution can reveal structural and functional constraints within or among organic molecules. These patterns are better understood when considering the underlying evolutionary process, which enables us to disentangle the signal of the dependent evolution of sites (coevolution) from the effects of shared ancestry of genes. Conversely, disregarding the dependent evolution of sites when studying the history of genes negatively impacts the accuracy of the inferred phylogenetic trees. Although molecular coevolution and phylogenetic history are interdependent, analyses of the two processes are conducted separately, a choice dictated by computational convenience, but at the expense of accuracy. We present a Bayesian method and associated software to infer how many and which sites of an alignment evolve according to an independent or a pairwise dependent evolutionary process, and to simultaneously estimate the phylogenetic relationships among sequences. We validate our method on synthetic datasets and challenge our predictions of coevolution on the 16S rRNA molecule by comparing them with its known molecular structure. Finally, we assess the accuracy of phylogenetic trees inferred under the assumption of independence among sites using synthetic datasets, the 16S rRNA molecule and 10 additional alignments of protein-coding genes of eukaryotes. Our results demonstrate that inferring phylogenetic trees while accounting for dependent site evolution significantly impacts the estimates of the phylogeny and the evolutionary process.

Molecular coevolution is the evolutionary process by which interactions between distant sites of a molecule, or sites of different molecules, are maintained such as to preserve advantageous functional or structural properties. For instance, coevolving fragments within protein sequences are involved in folding constraints and informative of folding intermediates, peptide assembly, or key mutations with known roles in genetic diseases (1, 2). The ever-growing availability of molecular sequences (nucleotides and amino acids) provides us with an unprecedented amount of data that hold a strong potential to reveal genes and gene regions evolving under a constrained process (3, 4).

There exist several methods to infer coevolution from sequence data alone (based on matching patterns between sites, as reviewed in refs. 5 and 6). However, these methods do not exploit a key component in modeling the underlying evolutionary processes: the phylogenetic tree describing the relationships between molecular sequences. Incorporating the phylogenetic signal in the analysis of coevolution is crucial because it enables us to distinguish between truly coevolving patterns and similar patterns induced by the shared history of sequences (5, 7). To this end, several methods have been developed to infer coevolution while accounting for phylogenetic relationships (7), but only a few of these explicitly model the process of coevolution along a given phylogenetic tree (8⇓⇓–11).

All phylogeny-aware methods to detect coevolution rely on the assumption that the phylogenetic relationships between sequences are known and can be treated as “observed data.” Typically, phylogenetic trees are themselves inferred from molecular data (12), but their inference is based on a fundamental assumption that each site evolves independently of all of the others (13). This assumption, which is evidently violated in the presence of coevolution, has benefits in terms of computational tractability, because the likelihood of an alignment given a phylogenetic tree is the product of the individual likelihood of each site. This simplification of the evolutionary mechanism in the presence of nonindependent sites has been shown to decrease the accuracy of the inferred phylogenetic trees (14, 15). However, datasets with strong functional or structural constraints are often analyzed within phylogenetic frameworks that assume independence among sites. For instance, the small ribosomal subunit (16S) is frequently used to estimate the earliest evolutionary relationships between the major lineages of the tree of life (16, 17), neglecting its numerous structural constraints and evidence of coevolution (18).

The presence of coevolutionary patterns across many nucleotide and amino acid sequences extends far beyond the 16S gene and is supported by a large body of evidence (9, 19). Ignoring the interdependencies between the phylogenetic history of the sequences and the constrained processes governing the evolution of nucleotides or amino acids can severely hamper our ability to infer correct phylogenetic trees (14, 15) and accurately detect coevolution (5, 7). However, the inference of these interdependent processes is still conducted separately for mathematical convenience, at the expense of assuming independence among sites which was originally described as “not necessarily biologically valid” by Felsenstein in 1983, at the dawn of likelihood-based molecular phylogenetics (13). To address this issue, we present a Bayesian framework to analyze a nucleotide alignment and jointly estimate (*i*) the number of pairs of sites that coevolve and their position in the sequence, thus differentiating from a background model of independent evolution, (*ii*) the parameters of the independent and dependent models of substitution, and (*iii*) the phylogenetic tree describing the relationships between sequences. Our method is called CoevRJ and the software implementing it is available at https://bitbucket.org/XavMeyer/coevrj.

We evaluate the performance of CoevRJ in reconstructing phylogenetic trees and inferring coevolution based on an extensive range of simulated datasets, an alignment of the highly coevolving 16S rRNA and 10 empirical eukaryote datasets of protein-coding genes. We show that CoevRJ provides an accurate identification of the coevolving sites, as validated by simulated and empirical data. We assess the effects of coevolution on phylogenetic estimates by comparing our results with those obtained under the assumption of independent evolution and demonstrate the importance of accounting for dependence among sites when inferring phylogenetic trees on datasets subject to coevolution.

## Results

### CoevRJ: A Bayesian Framework to Jointly Estimate Phylogeny and Molecular (Co)evolution.

The CoevRJ method simultaneously infers the phylogenetic relationships between molecular sequences as well as the number and position along the sequence of the sites (if any) that evolved in a dependent fashion. The method estimates the posterior probability of the many scenarios of evolution considered, as well as their parameters values using the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (20) (Fig. 1).

Our approach, further described in *Materials and Methods*, includes a mixture of two evolutionary models that can accommodate many possible scenarios of dependent and independent evolution among sites within a gene. Sites are not a priori assigned to either category; rather, their mode of evolution is estimated from the data. Independent sites are considered to evolve under a general time-reversible substitution model with rate heterogeneity among sites modeled by a Gamma distribution [GTR+Γ model (21)]. The mean rate of substitution is set to 1 and the shape of the Gamma distribution is modeled by a single parameter α and discretized into a finite set of rate multipliers to incorporate varying substitution rates across all sites of the alignment.

Dependent sites are modeled with an adapted version of the Coev model (11), under which coevolving pairs of sites evolve in a dependent fashion, such that the nucleotides at both sites remain within a predefined set of nucleotide pairs, defined as the “coevolving profile.” A substitution in one site of a coevolving pair is expected to trigger a subsequent substitution at the other site such that the nucleotides combination remains in the profile. A coevolving profile contains between two and four nucleotide pairs encompassing the possible cosubstitutions. For instance, pairs (AA, CC, TT, GG) or the Watson–Crick base pairs (AT, CG) may form a coevolving profile. However, the Watson–Crick base pairs augmented with the wobbling pair GT cannot form a profile since the wobbling pair differs by a single substitution from the two other pairs (11). The coevolution process is modeled as a reversible continuous-time Markov chain with rate parameters distinguishing among single site substitutions (*i*) leading to the profile (rate d), (*ii*) breaking the profile (rate s), and (*iii*) allowing the pairs of sites to evolve within out-of-profile pairs (rate r).

The combination of GTR+Γ and Coev forms a mixture of models parameterized by the number of coevolving pairs of sites and their position within the molecular alignment. This mixture of models ranges from sites being fully independent and evolving under a pure GTR+Γ model to all sites being involved in nonoverlapping coevolving pairs. Independent and coevolving sites, regardless of the configuration of the mixture, are assumed to evolve on the same phylogeny (including topology and branch lengths), and therefore both contribute to its estimation. The branch lengths are, however, decoupled between models by scaling their respective rate matrices to yield an expected value of one substitution per branch length unit and by applying a unique branch length (or rate) multiplier ν for sites under coevolution.

Estimating all these parameters represents a significant computational challenge that we overcome by making some simplifying assumptions. We assume the coevolving pairs to follow a homogeneous (co)evolutionary process and share the same substitution rates (

### CoevRJ Accurately Estimates Pairs of Sites and the Phylogeny.

To assess the performance of CoevRJ, we generated a total of 250 alignments of nucleotides each with 1,000 sites and 50 taxa, with a proportion of coevolving sites ranging from 0% (independent evolution) to 50% of the sites. CoevRJ correctly recovered the number of pairs of coevolving sites and their position (if any) under each scenario, with accuracy equal to 0.99 or higher (*SI Appendix*, Table S1). For datasets simulated under independent evolution, CoevRJ identified at most one pair over the 1,000 simulated sites. For datasets simulated with coevolving pairs, CoevRJ accurately identified most of the coevolving pairs of sites with a sensitivity of 98%. Overall, the models inferred with CoevRJ were consistent with the amount of coevolution simulated: Both the number of coevolving sites and their position were accurately identified as well as the sites that were not coevolving.

We then measured the ability of CoevRJ to recover the simulated amount of rate heterogeneity (i.e., the α parameter of the GTR+Γ model) and the total number of substitutions, measured as the total branch length of the reconstructed phylogenetic tree. To estimate the effects of ignoring coevolution between sites, we reanalyzed the datasets under a standard GTR+Γ model where all sites are considered to evolve independently [as implemented in MrBayes (23)]. The performances of CoevRJ and GTR+Γ, assessed by the relative errors with respect to the simulated parameters, were equivalent in the absence of coevolution, reflecting the fact that CoevRJ correctly reduced to a model where all sites are independent (Fig. 2 *A* and *B*). However, as the proportions of coevolving sites in the alignment increased, the accuracy of the estimated rate heterogeneity and branch lengths decreased substantially under the GTR+Γ model, while it remained essentially unchanged under CoevRJ (Fig. 2 *A* and *B*).

Finally, the accuracy of the inferred phylogenetic tree with increasing levels of coevolution was consistently improved when using CoevRJ rather than the standard GTR+Γ model. In presence of coevolution, the tree topologies inferred with CoevRJ were more accurate than those inferred under the GTR+Γ model (*SI Appendix*, Fig. S1). The divergence between the trees estimated by the GTR+Γ model and CoevRJ increased significantly with the proportion of coevolution (*SI Appendix*, Fig. S1) as a small but significant increase in misidentified bipartitions (internal nodes) affected the phylogenetic trees inferred under the independent sites model (Fig. 2*C*).

### CoevRJ Identifies Alternative Hypotheses for the “Tree of Life.”

To gain insight on the effect of accounting for coevolution on an empirical dataset, we analyzed an alignment of the 16S rRNA that includes sequences for 146 taxa spanning the three domains of life, Bacteria, Archea, and Eukaryotes (18). The 16S rRNA is subject to many structural constraints and is therefore used as a benchmark for the method’s ability to predict coevolution. Using this dataset, pairs of sites predicted as coevolving can be assessed by comparing them to the known 3D structure of the small ribosomal subunit for several species [*E. coli* (24), *Drosophila melanogaster* (25) and *Homo sapiens* (26)]. Coincidentally, since 16S is shared by all prokaryotes and eukaryotes and is a slow-evolving gene, it is also often used to infer phylogenetic relationships, especially focusing on the earliest nodes in the tree of life (16, 17).

CoevRJ predicted 256 pairs of nucleotides (19.5% of the alignment positions) as coevolving with a posterior probability greater than 0.95. Of these, 94% of the pairs of sites were located very closely on the 3D structure, for example less than 6.5 Å for *E. coli* (Fig. 3 and *Materials and Methods*). The majority of these pairs (71%) were inferred as having a profile consistent with Watson–Crick base pairs (AT, GC). Among the 16 pairs having probability greater than 0.95 and not supported by the structure of *E. coli*, 14 were inferred with profiles diverging from a pure Watson–Crick profile, suggesting that they could be involved in functional constraints (*SI Appendix*, Table S2). Additionally, 12 of them are known to bind with other small ribosomal subunits not represented in our dataset (summarized in *SI Appendix*, Table S2 from ref. 28). Sites involved in such bonds could be coevolving with residues on other sequences and may thus present evolutionary patterns departing strongly from the one of independent evolution, which may lead CoevRJ to infer them as coevolving within the 16S rRNA. Finally, 108 pairs were predicted as significantly coevolving but with probability lower than 0.95 (27% with sites closely located on the structure of *E. coli*). Empirically validating these predictions is difficult as they could result from coevolution affecting only a portion of the phylogeny and would require the 3D structures for many species. However, the interpretation of the coevolving pairs based on the 3D structure of *E. coli* was confirmed when using the *D. melanogaster* and *H. sapiens* structures for validation (*SI Appendix*, Fig. S2).

The phylogenetic tree inferred by CoevRJ from the 16S RNA dataset significantly differed from the one obtained under the assumption of independence among sites. Our analyses identified 51 internal nodes not shared between the two topologies (normalized Robinson–Foulds distance (29) of 0.18; Fig. 4 and *SI Appendix*, Figs. S3–S9). In addition to topological differences, the two models inferred substantially different branch lengths. For instance, the branch separating the Archaea from the Eukaryotes had an estimated branch length that was 25% longer with CoevRJ than with the GTR+Γ model, while branches deep in the bacterial clades were generally inferred as shorter with CoevRJ (*SI Appendix*, Figs. S10–S12). These discrepancies have a strong impact on the estimates of divergence times based on these phylogenies. Indeed, we found strongly differing ultrametric trees when using the phylogeny estimated under the CoevRJ or the GTR+Γ model (Fig. 4), and conflicting estimates of the divergence times were observed for varying settings of the underlying molecular clock (*SI Appendix*, Fig. S13).

### A Wider Perspective on the Failure to Account for Dependence Among Sites.

We tested the CoevRJ approach on a diverse range of 10 protein-coding genes of eukaryotes from the Selectome database (30) having significantly different alignment size and gene annotation (*SI Appendix*, Table S3). These alignments were specifically selected out of 8,000 protein-coding genes for their significant signals of coevolution as predicted by the Coev method (11). For each dataset, we inferred the parameters with both CoevRJ and GTR+Γ and computed the discrepancies measured between the methods for the inferred rate heterogeneity, branch lengths, and tree topology.

While the proportion and intensity of coevolution detected within these datasets varied (Fig. 5*A* and *SI Appendix*, Fig. S14), not accounting for dependence among sites led to decrease in the estimate of the rate heterogeneity between sites (Fig. 5*B*). Additionally, the estimates of branch lengths differed substantially between CoevRJ and GTR+Γ. The total branch length inferred on these datasets was inconsistent between both methods without showing a bias (Fig. 5*C*). Notably, the difference in total branch length did not come from a global factor equally affecting all branches but from many changes of varying amplitude on different branches (Fig. 5*D*).

The tree topologies inferred under CoevRJ and GTR+Γ differed for all genes (Fig. 5*E*). The amount of differences among topologies obtained under the different methods ranged from 0.06 to more than 0.3 (normalized Robinson–Foulds distance). The differences measured on the substitution rates, branch lengths, and tree topologies suggested that accounting for dependence among sites significantly impacted the parameter estimates without presenting a consistent bias. The diversity of these discrepancies suggests that failing to account for the dependencies between sites led to unpredictable effects on the inferred evolutionary histories.

## Discussion

We presented a method to analyze an alignment, while moving beyond the unrealistic assumption that all sites evolve as independent units. CoevRJ jointly infers the posterior probability of the phylogenetic tree, the pairs of coevolving sites, and the underlying parameters of the evolutionary models. CoevRJ therefore enables us to capture the reciprocal effects of the shared evolutionary history of molecular sequences and the pairs of sites that coevolve. The joint analysis of these two processes has remained an unsolved challenge in previous approaches (6, 31). Our results show that CoevRJ can accurately estimate both the phylogenetic tree and the parameters of the underlying (co)evolutionary process.

Modeling the evolutionary process enables CoevRJ to extract more information from the data than just the position of the pairs of coevolving sites. The posterior probability of each coevolving pair is informative of the strength of the nucleotide pairing along with the inferred distribution of profiles that determine the nature of the pairings. Estimating these parameters within a Bayesian framework results in intuitive posterior probabilities and enables us to use Bayes factors to properly define thresholds for the significance of predicted coevolving pairs (*Materials and Methods*). Given these advantages, the power of CoevRJ to predict coevolution compared favorably to existing methods on the 16S rRNA dataset (*SI Appendix*, Fig. S15).

Our findings join previous studies showing that the accuracy of standard phylogenetic inference is negatively impacted when dependence among sites is present in the data (14, 15). Phylogenetic trees inferred from synthetic datasets with CoevRJ show that our method can correct these inaccuracies. Similarly, phylogenetic trees inferred on the eukaryote datasets strongly differed from the ones inferred with a model assuming that sites evolve independently. The extent of the divergences were not predictable with respect to the nature and magnitude of the coevolutionary predictions. Such inaccuracies in the phylogeny could impact analyses using these phylogenetic trees. For instance, phylogenies inferred on the 16S rRNA datasets suggested that conflicting conclusions would be reached when aiming to date the tree of life (Fig. 4).

The machinery developed for CoevRJ can be extended to other mixtures of models for RNA, DNA, or amino acid sequences. For instance, a targeted study of the secondary structure of RNA could be conducted by replacing the Coev model by models accounting for substitution between Watson–Crick pairs and wobbling pairs (e.g., refs. 8 and 32). However, further improvements to the performance of this machinery are required to relax the most limiting assumptions on the current mixture of models to better integrate the richness and complexity of molecular evolution. For instance, the assumption of rate homogeneity of the coevolving pairs could be relaxed by adding a Gamma model of rate heterogeneity, as for the independent sites, or by considering a mixture of coevolutionary processes with different substitution rates (*n*-tuple) could better capture the potential underlying molecular structures. Finally, the study of coevolution in protein-coding genes would deeply benefit from the integration of codons or amino acid models, which is currently computationally prohibitive.

Extending CoevRJ to amino acid models would enable further investigation of the effect of dependent sites on dating of the tree of life, which is frequently achieved by analyzing the RNA and the proteins in the small ribosomal subunit (16, 17). Solving this computational challenge would also facilitate the study of protein–protein interactions (4) jointly with the underlying evolutionary process. While several methodological challenges remain, our approach paves the way to a new generation of more realistic models of molecular evolution. Improving our understanding of the shared history of genes and species requires that we integrate more complexity in evolutionary models (33), and our method demonstrates that such additional complexity is counterbalanced by a significant improvement of the inferred phylogenetic relationships and (co)evolutionary processes.

## Materials and Methods

### Evolutionary Models.

We designed a set of models in which nucleotides can either evolve independently of the others or according to a coevolutionary process whereby pairs of sites evolve in a mutually dependent fashion. The proportions of both types of sites in an alignment, as well as the specific assignment of each site to either model, is assumed to be unknown and estimated from the data.

Our Bayesian model allows us to jointly infer the following parameters (which are described in detail in the paragraphs below):

*i*) the phylogenetic tree (topology and branch lengths) describing the relationships between genes;*ii*) the number of pairs of coevolving sites in the alignment;*iii*) the assignment of each individual site to either a coevolving pair or to the set of independently evolving sites;*iv*) the parameters of the substitution model describing independent site evolution; and*v*) the parameters of the substitution model for coevolving pairs of sites.

#### Independent substitution model.

Independent sites are modeled as evolving under the GTR+Γ model (21). This model accounts for rate heterogeneity using a discrete Gamma distribution model with four different rate multipliers known as the GTR+Γ. We identify the instantaneous rates quantifying the rate of change from one nucleotide to another

The likelihood of each independent site contained in the set

#### Dependent substitution model.

Dependent sites are assumed to follow a pairwise coevolution model adapted from ref. 11. Under this model, a pair of sites defined by two positions

Since the Coev model does not allow double substitutions, evolutionary changes within the coevolving profile require at least two substitutions, for instance AT → GT → CG. We model this process with two parameters describing (*i*) the rate at which a coevolving pair is replaced by a noncoevolving one (e.g., GT), thus exiting the coevolving profile, and (*ii*) the rate at which the pair of sites return to the coevolving profile (e.g.*,* CG). We indicate the two rates with s and d, respectively. In a coevolving pair, we expect the rate d to be much larger than the rate s, that is, it should be rare to leave a coevolving profile (low s) and it should be highly probable to regain a coevolving pair (high d). Therefore, the ratio

The matrix of instantaneous substitution rates

Branch lengths were not inferred by the original Coev method. Therefore, substitution rates were assumed to be consistent under the site-independent and site-dependent hypotheses. In other words, a substitution on two independent sites was equivalent to two substitutions on a pair of sites such that the total number of substitutions was preserved. This was achieved by doubling the length of branches under the Coev model.

Here we relax that assumption by inferring from the data the factor describing the difference in units of branch length between independent and coevolving substitutions. Branch lengths are therefore decoupled for sites under dependent and independent evolution by a rate modifier ν. This modification leads the probability transition matrix for coevolving pairs to be computed as

In summary, a set of dependent sites

#### The set of models.

Given sequences of N sites, we consider the set of models defined by the number of coevolving pairs k such that

### Bayesian Framework.

We implemented the models in a Bayesian framework named CoevRJ that estimates all of the free parameters

#### Proposals.

To explore this complex parameter and model space, we designed several proposals. Two types of parameters were differentiated: those that do or do not depend on k.

#### Proposals for the phylogenetic tree and GTR+Γ model parameters.

These proposals aim to update the branch lengths t, the tree topology τ, and the parameters

#### Proposals for the Coev model parameters (k > 0 ).

The parameters

Negative correlation between these parameters may happen when the model transitions from a purely independent sites model (

Therefore, we account for this negative correlation by proposing a move with a multivariate normal distribution

Furthermore, given that the

#### Proposals affecting *k* and the positions of pairs.

We have three different types of proposals that operate on the parameters k and *i*) add and remove pairs, (*ii*) change sites included in a pair, and (*iii*) change the coevolving profile in a pair.

#### Proposals moving through the model space.

The first two proposals are transdimensional moves because they change the number of parameters in the model and their acceptance probabilities are given by the RJMCMC algorithm (20). Our set of models defines a sequence of models

The probability of making a jump from a model *A* is defined as

When proposing a move to a model having a different number of parameters, vectors of random numbers *SI Appendix* for details of the CoevRJ implementation).

In this proposal, a pair of positions is drawn by first selecting a position

The opposite move going from **2** defining the acceptance probability of both moves simplifies to**3**, the determinant of the Jacobian, is equal to 1.

For the general case where **3**. The probability of a backward move (

#### Proposals sampling the pair space (k > 0 ).

Two proposals aim to provide a proper mixing of the pairs of positions under coevolution without affecting the number of pairs k and are subject to the standard Metropolis–Hastings acceptance ratio (37). The first proposal breaks an existing pair and exchanges one of its positions with a site considered as independent. A pair

The independent site z is drawn from the probability distribution

#### Proposals exploring the profile space θ *(when* k > 0 ).

The last proposal samples the possible profiles for any given coevolving pairs by drawing a new profile ϕ for a pair

#### Priors on standard parameters.

We assume a uniform prior on the tree topology τ and an exponential prior on each branch length with rate

#### Priors on Coev model parameters.

The branch-length scaling parameter ν for the coevolving pairs has a Gamma prior distribution with shape and scale parameters equal to 2. This distribution has its mode located at 2, which expresses our prior belief that two dependent sites should evolve at the same pace as two independent sites. Parameters *SI Appendix*, Table S4). This parameterization reflects our belief that pairs of sites under coevolution should differ from an independent evolutionary process (i.e.,

For the 16S rRNA dataset, we used results from the Coev model (11) to estimate informed prior distributions on parameters (*SI Appendix*, Table S4) is close to the one defined for the default prior that results in a stationary frequency for nucleotide pairs in the profile to be close to

Finally, we defined a uniform prior distribution on the profile ϕ based on the set of profiles observed in the alignment for a given pair of positions. This empirical prior reduces the space of parameters to explore and plays a key role in making analysis tractable with CoevRJ. However, this simplification implies that pairs of sites cannot (co)evolve under an unobserved profile (e.g., invariant sites are evolving under an independent process by default).

While more informative priors on the distribution of profiles could be used for specific type of molecular data (e.g., favoring Watson–Crick profiles for RNA sequences), we chose to use this conservative and vague prior since it accommodates the wide range of scenarios we considered including coevolution within DNA sequences. Under this prior, the marginal probability of the profiles inferred on the 16S rRNA dataset showed that the method had the power to infer the profile even without more informative priors. Indeed, 71% of the estimated coevolving pairs followed a pure Watson–Crick profile (AT, CG) and 17% of the pairs evolved under a profile containing canonical pairs coupled with other ones (AT, XX or CG, XX or AT, CG, XX). These results suggest that our priors do not prevent us from inferring common coevolving profiles.

#### Priors on the number of pairs.

We used a hierarchical prior on the number of pairs k by assigning a Poisson distribution with rate λ on the number of pairs and by an exponential distribution with parameter *SI Appendix*, Fig. S16). The parameter values used for the hyperprior maintained k close to the true number of simulated coevolving pairs.

#### Priors on the configuration of pairs.

The last prior defines the probability of observing a given set of coevolving pairs of positions. We assumed a uniform prior reflecting that pairs of sites are considered a priori to be equally likely to be under coevolution. Under this assumption, the prior is given by the number of possible configurations of k pairs for a sequence of N positions. Considering that we do not differentiate the order of positions in a pair, the total number of configurations for a given k is defined as

### Significance of Pairs Inferred as Coevolving.

To confidently consider pairs to be coevolving, we designed a method defining a threshold

We define this threshold after an MCMC run with CoevRJ by using the inferred posterior probability

This prior probability **4** by using the threshold for strong significance

### Experimental Setting.

Experiments on all of the datasets consisted of a comparison of results obtained with CoevRJ to the ones obtained with the GTR+Γ model assuming site independence as implemented in MrBayes (23). Prior distributions in MrBayes were set as the ones for CoevRJ with

Runs were considered as having converged when the distribution of tree topologies stabilized [i.e., when the average SD of the splits frequencies (41) was measured to reach 0.05 using three independent runs for each dataset]. In addition, the parameter traces were examined to ensure proper convergence. The burn-in phase of each run was discarded and the remaining samples were used to estimate the posterior distribution. Comparisons between tree distributions were conducted by (*i*) computing the majority-rule consensus tree from the tree distributions obtained under each model and (*ii*) computing the normalized Robinson–Foulds distance (29) between the consensus trees.

#### Simulation of datasets.

We simulated five categories of nucleotides sequences with varying amount of coevolving sites (0, 5, 10, 20, and 50%). We simulated 50 replicates for each of these categories. For each replicate, we simulated a random phylogenetic tree with 50 tips using the R package APE (42) with branch length drawn from an exponential distribution (

Sites evolving independently were simulated with the Evolver simulator (43) using a GTR+Γ model with arbitrary rates. The shape parameter α of the Gamma distribution was drawn from a distribution Gamma

Pairs of coevolving sites were simulated using the Coev simulator (34). Each pair was attributed a random profile composed of two nucleotide pairs (e.g., AA, CC). The parameters for the Coev simulator were set to

#### Proximity of nucleotides on the 3D structure of the 16S rRNA dataset.

To validate CoevRJ predictions of coevolution on the 16S rRNA dataset, we compared these results with pairs of nucleotides located closely on the 3D structure of the molecule (and thus potentially bonding). These pairs were identified as nucleotides having their two closest atoms at a distance of less than 6.5 Å. This threshold is consistent with the one used in ref. 18 (8 Å) and is representative of the average resolution of the Protein Data Bank (PDB) structure considered (24⇓–26). The results obtained with this threshold are robust when considering other possible values in the range of 4 Å to 8 Å (*SI Appendix*, Fig. S3).

#### Molecular dating of the 16S rRNA dataset.

We analyzed the consensus trees obtained with both CoevRJ and the pure GTR+Γ model using the penalized likelihood framework (27) as implemented in the R package APE (42). We used multiple relaxed molecular clock models (i.e., correlated and relaxed) to ensure that our observations were not due to the use of a specific model. Furthermore, each model was used under four different λ values, changing the strength with which the rate is constrained along branches. This parameter took the value

### Data Availability.

The simulated and empirical molecular sequences used for the analyses can be found on the CoevRJ git repository (https://bitbucket.org/XavMeyer/coevrj). The PDB structures used to validate the findings can be accessed on the RSCB PDB (https://www.rcsb.org) with the identifiers 4GD2 (*E. coli*, ref. 24), 5VYC (*H. sapiens*, ref. 25), and 4V6W (*D. melanogaster*, ref. 26).

## Acknowledgments

We thank Michael May, John Huelsenbeck, and two anonymous reviewers whose insightful comments helped improve and clarify this manuscript, and the Vital-IT facilities of the Swiss Institute of Bioinformatics for the use of their HPC infrastructure. This work was supported by Swiss National Science Foundation Grant P2GEP2_178032 (to X.M.), Swedish Research Council Grant 2015-04748 and the Swedish Foundation for Strategic Research (D.S.), and Swiss National Science Foundation Grant 4075-40_167276 and the University of Lausanne (N.S.).

## Footnotes

↵

^{1}X.M. and L.D. contributed equally to this work.- ↵
^{2}To whom correspondence should be addressed. Email: xav.meyer{at}gmail.com. ↵

^{3}D.S. and N.S. contributed equally to this work.

Author contributions: X.M., L.D., D.S., and N.S. designed research; X.M. performed research; X.M. analyzed data; X.M. conceived and implemented the computational approach; L.D. and N.S. conceived the initial study design; and X.M., L.D., D.S., and N.S. wrote the paper.

The authors declare no conflicts of interest.

This article is a PNAS Direct Submission.

Data deposition: The simulated and empirical molecular sequences used for the analyses can be found on the CoevRJ git repository (https://bitbucket.org/XavMeyer/coevrj).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1813836116/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- Dib L,
- Salamin N,
- Gfeller D

- ↵
- Douam F, et al.

- ↵
- ↵
- Szurmant H,
- Weigt M

- ↵
- ↵
- Cocco S,
- Feinauer C,
- Figliuzzi M,
- Monasson R,
- Weigt M

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Hug LA, et al.

- ↵
- ↵
- Uguzzoni G, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- Dunkle JA, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Figliuzzi M,
- Barrat-Charlaix P,
- Weigt M

- ↵
- ↵
- ↵
- Dib L, et al.

- ↵
- Meyer X,
- Chopard B,
- Salamin N

- ↵
- ↵
- ↵
- Gelman A

- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Evolution