## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Power law tails in phylogenetic systems

Edited by Michael Levitt, Stanford University, Stanford, CA, and approved November 29, 2017 (received for review July 7, 2017)

## Significance

Covariance analysis of protein sequence alignments can predict structure and function from sequence alignments alone. Current methodologies typically assume that sequences are independent, notwithstanding their phylogenetic relationships. This corruption constrains the alignments for which covariance analysis can be used. It is critically important to control for phylogeny and understand how phylogeny contaminates signal. This paper presents a mathematical analysis that argues that there is a distinctive signature of phylogeny in the covariance matrix, allowing us to identify modes that are corrupted by phylogeny. This signature is present in large protein sequence alignments, explaining recent covariance analyses, and provides an important step toward decoupling phylogenetic effects from biologically meaningful interactions.

## Abstract

Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters—the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.

Approaches to biological sequence analysis typically assume that mutations at different sites are independent of each other, although this approximation is clearly limited. Indeed, covariation between sequence distal positions is important for predicting RNA secondary structure (1), where Watson–Crick base-pairing rules create strong covariance signals that can be detected by straightforward methods. In contrast, for proteins, the signal is less strong, and for many years it was unclear whether any remnant of molecular phenotypes such as protein structure is imprinted on covarying sequence positions (2⇓–4).

Recently, with the growth of protein sequence databases (5) and the introduction of sophisticated analyses (6⇓–8), it has become clear that covariance analysis of protein sequences can yield exciting biological insights in a wide range of contexts (9⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓⇓–27). In general a set of homologous protein sequences is constrained by protein structure and function, and with sufficient data it is possible to tease out the nature of these constraints and make biologically relevant predictions (12, 13, 16, 28⇓⇓⇓–32).

An important consideration that limits our ability to infer sets of covarying residues is sequence phylogeny, i.e., the relatedness structure of the data samples (33⇓–35). If some population subgroups are more closely related, then part of the covariation observed in the data will be of purely phylogenetic origin, unrelated to molecular phenotypes such as structure or function (36⇓⇓⇓⇓–41). In population and medical genetics features such as geographical population structure are known to affect the degree of covariance observed between sequences. (42⇓–44).

This raises the question of whether, given n aligned protein sequences of length p, it is possible to distinguish covariance due to phylogeny from that caused by molecular phenotypes (36⇓⇓⇓⇓–41). Here, we analyze a simple theoretical model of molecular evolution and use the tools of random matrix theory (RMT) to develop a theory for the covariance when both phylogeny and structural constraints are present. We show that phylogenetic covariance is distinguished by a power law tail of large eigenvalues, which is essentially independent of phylogenetic details, depending only on the average branch length

Thus motivated, we turn to data and find that the eigenvalue distributions of covariance matrices from large protein sequence alignments (MSAs) have power law tails. This suggests a strategy for cleaning the covariance matrix that at least partly controls for confounding phylogenetic effects: removing the power law tail representing those modes that are most strongly corrupted by phylogeny. For several protein families, we show that contact prediction accuracy improves by excluding those eigenvectors that correspond to the largest eigenvalues. It is interesting to note that the commonly used method of inverting the sample covariance matrix similarly down-weights the largest eigenvalues and up-weights the smallest ones. Our analysis therefore gives an alternative rationalization for why direct coupling analysis (DCA) has proved so successful at inferring true contacts in proteins from sequence data alone. More generally, this eigenvalue power law will occur in any dataset where the samples have a similar hierarchical relationship.

## Results

Molecular phenotypes cause covariance between sequence positions (columns) of the MSA matrix X, while phylogeny causes covariance between sequences (rows) of X. Covariance from either source will appear in both the residue covariance matrix *A* and *B* shows

What happens if there are structural interactions between specific residue pairs in the simulation? In Fig. 1 *C* and *D* we compare the true interactions (gray) with the top 200 scoring pairs from covariance matrices for sequences simulated without (Fig. 1*C*) and with (Fig. 1*D*) phylogeny. Without phylogenetic corruption, *D*). To address this, we first analyze the covariance signal produced by sequences for which the only source of covariance is phylogeny and then ask whether we can distinguish this signal when both phylogenetic and structural correlations are present.

### Phylogenetic Covariance.

To understand the signature of phylogenetic covariance, we consider a Markov model where mutations occur at random and different sites evolve independently. The process starts with a random sequence of length p, drawn from a q-letter alphabet, which undergoes a series of mutation and duplication events dictated by a user-imposed phylogeny with b branching events. This generates an alignment of *Supporting Information*). These relationships imply that the eigenvalues follow the power law**1** for α implies

### Finite-Sampling Effects.

We have thus seen that phylogeny produces a striking signature in the covariance matrix. However, because the number of MSA sequences is limited, this signature will be affected by finite sampling—the sample covariance matrix will contain large entries purely by chance. We use RMT to develop a quantitative characterization of the effect on the corresponding eigenvalue distribution. Consider n independent sequences of length p, with amino acids drawn uniformly at random. The probability distribution of the sample eigenvalues follows the Marčenko–Pastur (MP) distribution*A*). As n increases, Eq. **4** implies that this distribution sharpens around **4** generalizes to describe the eigenvalue distribution of the sample covariance matrix C for any true covariance matrix Σ, such as those caused by phylogeny. We start with the Stieltjes transform,**4**. For phylogeny, the eigenvalues of Σ are drawn from a discrete distribution, so **3**. Eq. **6** describes how finite sampling smoothens out this discrete distribution.

Fig. 2*B* shows how the eigenvalue distribution changes if the sequences follow our simplest phylogeny, where **2**) has eigenvalues **6**, which becomes*B*, which shows excellent quantitative agreement with the simulation, unlike the MP distribution shown in blue. As the number of branching events increases we simply use our exact formulas (*Supporting Information*) for the true eigenvalue distributions in Eq. **6** to compute the expected distribution.

### Analysis of Inhomogeneous Phylogenies.

Real phylogenetic trees are inhomogeneous, with branches of different lengths. Our framework naturally extends to this setting. Fig. 2 *C* and *D* shows the eigenvalue distributions of trees drawn from different distributions; Fig. 2*C* has three branching events with branch lengths drawn from a Poisson distribution, while Fig. 2*D* has seven branching events with branch lengths drawn from a half-normal distribution. Note that the eigenvalue distribution broadens as the number of branching events b increases, reflecting that the maximum true eigenvalue is

For inhomogeneous phylogenies we discovered that analytical solutions follow a simple rule. Consider a phylogeny with branch lengths drawn from a distribution with mean *C* and *D* show that this prediction fits the simulated data closely. To derive the result, we consider a phylogeny with **6** we find*Supporting Information*). This result about inhomogeneous phylogenies is important as it extends our analysis methods to more realistic phylogenies, implying that the power law tail of large eigenvalues described above is general.

### Phenotypic Covariance.

The eigenvalue spectrum for phenotypic covariance depends on how phenotype couples the residues to each other. While this will differ for different phenotypes, recent work has focused on using covariance analysis to predict contacts in tertiary protein structure (11, 13, 14, 16⇓–18). If we consider interactions drawn from a protein contact map, what covariance is caused? For an alphabet with *A*, the eigenvalue distribution of the sample covariance matrix is described well by the MP distribution (Fig. 3*B*). This empirical observation suggests that the eigenvalues of the true covariance matrix are all of similar size, suggesting that structural interactions do not lead to an eigenvalue power law. While real proteins will also have other phenotypic interactions, this model provides a relevant starting point.

### Phylogenetic vs. Structural Covariance.

Crucially, this model suggests that there are strikingly different signatures between the covariance matrix expected from phylogeny and that expected for interactions caused by residue contacts. If only structural interactions are present, the limiting behavior of the maximum eigenvalue saturates logarithmically as a function of the number of interactions (Fig. 4*A*). In contrast, Fig. 4*B* shows that the maximum eigenvalue caused by phylogeny increases exponentially as the sequences undergo more duplication events. Moreover, Fig. 4*C* shows a log–log plot of the eigenvalues as a function of rank for our simulations with just phenotypic interactions (Fig. 3); the data are well fitted by a line of slope zero reflecting the absence of the power law.

To probe these signatures further, we use simulations with a controlled mix of phylogeny and structural interactions. Fig 4*D* shows that the spectra for simulations with just phylogeny and simulations with both phylogeny and 200 random structural interactions obey the same power law. In both cases the upper power law tail follows *D* drops off before the yellow curve. Importantly, these two spectra diverge only outside the power law regime, implying that phylogeny dominates those modes that follow the power law.

These simulations therefore suggest that interactions between residues affect the smallest eigenvalues, while phylogeny affects the largest eigenvalues, giving a potential mechanism for distinguishing the effects of phylogeny. Intuitively, this could arise because interactions between residues make it less likely that mutations at those sites will be accepted, reducing the effective mutation rate of these residues and hence affecting eigenvectors with low eigenvalues. In Fig 5 we simulate sets of sequences with both phylogeny and structural interactions from two different protein contact maps and obtain similar results to those in Fig 4*D*. In contrast to Fig. 3, we find that the eigenvalue distributions of the resulting sequence alignments are not MP, but are well fitted by our analytic approach. The red curves in Fig. 5 *A* and *B* are each found using the phylogenetic parameters from the power law fits in Fig. 5 *C* and *D*, respectively.

### Eigenvalue Spectra of Protein Sequence Data.

Given the vastly different signatures in the eigenvalue distributions expected from phylogeny and structural interactions, it is of great interest to see whether such signatures arise in protein sequence data. To probe this, we choose three representative protein families for which covariance analysis has been shown to yield accurate contact predictions. In Fig. 6 *A–C*, *Top* we show that the eigenvalue distributions follow a power law in each case, as predicted by our theory. Furthermore, as for the simulated data, Fig. 6 *A–C*, *Middle* shows that the phylogenetic parameters extracted from the power fitted in each case provide a closer fit (red curves) to the eigenvalue distribution than to the MP distribution (blue curves).

### Cleaning Protein Spectra.

The analysis of simulated data suggests that the effects of phylogeny can be diminished by removing large modes of the covariance matrix and enforcing the constraint that the remaining eigenvalues are all of the same size. Namely, instead of the full covariance matrix from the sequence alignments, we propose truncating the highest modes,*A–C*, *Top* is used to estimate the phylogenetic parameters required for the analytical solution in Fig. 6 *A–C*, *Middle* (red curve). The point at which the eigenvalues deviate from the power law fit in Fig. 6 *A–C*, *Top* (purple dashed line) is used to determine which modes are dominated by phylogeny and should be truncated from the outer product expansion of the sample covariance matrix. Fig. 6 *A–C*, *Bottom* shows how well different truncations do at contact prediction; the purple dashed line reflects the threshold found from the power law fit and is near optimal in all cases. This phenomenology is entirely consistent with the notion that the modes corresponding to the large eigenvalues reflect the phylogenetic relatedness of the aligned sequences.

## Discussion

This paper was motivated by recent advances (9⇓⇓⇓⇓–14, 16, 21) in predicting protein structure and function from the covariation of sequences, a strategy that has been successful for predicting RNA secondary structure for some time (1, 35). A major confounding effect in both situations is the effect of phylogeny, which introduces correlations between residues (30, 36, 38). The correlations due to structure/function and phylogeny must be disentangled for accurate prediction.

The primary accomplishment of this paper is to identify a feature of the eigenvalue distribution of protein covariance matrices (the power lawtail) that distinguishes covariance due to phylogeny from that caused by structural interactions. The presence of power law tails in the data from diverse protein families allows us to develop an initial approach to deconvolving structural interactions from the covariance that results from sequence phylogeny alone. Our finding that the largest modes of the covariance matrix are dominated by phylogeny suggests an alternative rationalization for the matrix inversion step that enabled features of protein structure and function to be predicted from covariance analysis of large protein sequence alignments. Furthermore the resulting cleaned covariance matrix can be used as input for other inference approaches (9⇓⇓–12, 18, 19, 21).

A further result is a general understanding of how phylogenetic effects impact sequence covariation in different regions of parameter space. Depending on the sequence length p and the average branch length m, there is a parameter regime where the covariance matrix does not feature a power law tail of large eigenvalues, and hence a different approach to disentangling phenotypic interactions from phylogenetic correlations is required. Specifically, as the eigenvalues of the true covariance matrix for phylogenetic interactions are **1** for α, this is equivalent to

We have focused on the eigenvalue distribution; however, information about the phylogeny will also be imprinted in the eigenvectors of the covariance matrix. In the phylogenetic regime, the eigenvectors will have structure that reflects the relationship between the different sequences (43, 44), providing additional information about which modes should be removed for better inference of phenotypic interactions. Understanding the extent to which the effects of phylogeny and structural/functional interactions can be disentangled is an important direction for future research. Is it possible to separate the effects of phylogeny from those of interaction in parameter regimes with no power law tail? Under what circumstances can we accurately infer the strength of interactions? The approach outlined here provides a mathematical framework that future work can exploit to definitively answer these questions.

## Acknowledgments

We thank M. P. Brenner and A. W. Murray for comments on a draft of this paper. This work was supported by a Next Generation fellowship (to L.J.C.), a Marie Curie Career Integration Grant [Evo-Couplings, Grant 631609], and an Engineering and Physical Sciences Research Council PhD studentship (to C.Q.).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: ljc37{at}cam.ac.uk.

Author contributions: C.Q. and L.J.C. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1711913115/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- Durbin R,
- Eddy SR,
- Krogh A,
- Mitchison G

- ↵
- ↵
- ↵
- Lockless SW,
- Ranganathan R

- ↵
- ↵
- ↵
- Lapedes AS,
- Giraud BG,
- Liu L,
- Stormo GD

*Statistics in Molecular Biology and Genetics*, IMS Lecture Notes - Monograph Series, ed Seillier-Moiseiwitsch F (Institute of Mathematical Statistics, Hayward, CA), Vol 33, pp 236–256. - ↵
- Bialek W,
- Ranganathan R

- ↵
- Burger L,
- van Nimwegen E

- ↵
- ↵
- Weigt M,
- White RA,
- Szurmant H,
- Hoch JA,
- Hwa T

- ↵
- ↵
- ↵
- Dahirel V, et al.

- ↵
- Morcos F, et al.

- ↵
- ↵
- Sułkowska JI,
- Morcos F,
- Weigt M,
- Hwa T,
- Onuchic JN

- ↵
- ↵
- Ekeberg M,
- Lövkvist C,
- Lan Y,
- Weigt M,
- Aurell E

- ↵
- ↵
- ↵
- De Leonardis E, et al.

- ↵
- ↵
- Barton JP,
- Kardar M,
- Chakraborty AK

- ↵
- ↵
- Sung YM,
- Wilkins AD,
- Rodriguez GJ,
- Wensel TG,
- Lichtarge O

- ↵
- Bitbol AF,
- Dwyer RS,
- Colwell LJ,
- Wingreen NS

- ↵
- Shakhnovich EI,
- Gutin AM

- ↵
- ↵
- ↵
- ↵
- Jacquin H,
- Gilson A,
- Shakhnovich E,
- Cocco S,
- Monasson R

- ↵
- ↵
- ↵
- ↵
- ↵
- Wollenberg KR,
- Atchley WR

- ↵
- ↵
- ↵
- Obermayer B,
- Levine E

- ↵
- Barton JP,
- Chakraborty AK,
- Cocco S,
- Jacquin H,
- Monasson R

- ↵
- ↵
- ↵
- ↵
- Marčenko VA,
- Pastur LA

- ↵
- Rao NR,
- Edelman A

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Physics

## Jump to section

## You May Also be Interested in

*Ikaria wariootia*represents one of the oldest organisms with anterior and posterior differentiation.