Power law tails in phylogenetic systems

Significance Covariance analysis of protein sequence alignments can predict structure and function from sequence alignments alone. Current methodologies typically assume that sequences are independent, notwithstanding their phylogenetic relationships. This corruption constrains the alignments for which covariance analysis can be used. It is critically important to control for phylogeny and understand how phylogeny contaminates signal. This paper presents a mathematical analysis that argues that there is a distinctive signature of phylogeny in the covariance matrix, allowing us to identify modes that are corrupted by phylogeny. This signature is present in large protein sequence alignments, explaining recent covariance analyses, and provides an important step toward decoupling phylogenetic effects from biologically meaningful interactions. Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters—the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.

Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters-the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
power law | sequence covariance | phylogeny | protein | structure prediction A pproaches to biological sequence analysis typically assume that mutations at different sites are independent of each other, although this approximation is clearly limited. Indeed, covariation between sequence distal positions is important for predicting RNA secondary structure (1), where Watson-Crick base-pairing rules create strong covariance signals that can be detected by straightforward methods. In contrast, for proteins, the signal is less strong, and for many years it was unclear whether any remnant of molecular phenotypes such as protein structure is imprinted on covarying sequence positions (2)(3)(4).
An important consideration that limits our ability to infer sets of covarying residues is sequence phylogeny, i.e., the relatedness structure of the data samples (33)(34)(35). If some population subgroups are more closely related, then part of the covariation observed in the data will be of purely phylogenetic origin, unrelated to molecular phenotypes such as structure or function (36)(37)(38)(39)(40)(41). In population and medical genetics features such as geographical population structure are known to affect the degree of covariance observed between sequences. (42)(43)(44).
This raises the question of whether, given n aligned protein sequences of length p, it is possible to distinguish covariance due to phylogeny from that caused by molecular phenotypes (36)(37)(38)(39)(40)(41). Here, we analyze a simple theoretical model of molecular evolution and use the tools of random matrix theory (RMT) to develop a theory for the covariance when both phylogeny and structural constraints are present. We show that phylogenetic covariance is distinguished by a power law tail of large eigenvalues, which is essentially independent of phylogenetic details, depending only on the average branch length m/p and the number b of branching events or generations.
Thus motivated, we turn to data and find that the eigenvalue distributions of covariance matrices from large protein sequence alignments (MSAs) have power law tails. This suggests a strategy for cleaning the covariance matrix that at least partly controls for confounding phylogenetic effects: removing the power law tail representing those modes that are most strongly corrupted by phylogeny. For several protein families, we show that contact prediction accuracy improves by excluding those eigenvectors that correspond to the largest eigenvalues. It is interesting to note that the commonly used method of inverting the sample covariance matrix similarly down-weights the largest eigenvalues and up-weights the smallest ones. Our analysis therefore gives an alternative rationalization for why direct coupling analysis (DCA) has proved so successful at inferring true contacts in proteins from sequence data alone. More generally, this eigenvalue power law will occur in any dataset where the samples have a similar hierarchical relationship.

Results
Molecular phenotypes cause covariance between sequence positions (columns) of the MSA matrix X , while phylogeny causes covariance between sequences (rows) of X . Covariance from either source will appear in both the residue covariance matrix CR = X T X /n and the sequence covariance matrix CS = XX T /p. This is because CR and CS contain the same information; they have the same nonzero eigenvalues, and their eigenvectors VR and VS are related by VR = X T VS and VS = XVR. Analyses of protein sequence data typically attribute the detected covariance signal to interactions between sequence positions. This can be misleading: Fig. 1 A and B shows CR and CS for a simulated dataset where phylogeny is the only source of covariance. Note that CR contains isolated high-scoring residue Significance Covariance analysis of protein sequence alignments can predict structure and function from sequence alignments alone. Current methodologies typically assume that sequences are independent, notwithstanding their phylogenetic relationships. This corruption constrains the alignments for which covariance analysis can be used. It is critically important to control for phylogeny and understand how phylogeny contaminates signal. This paper presents a mathematical analysis that argues that there is a distinctive signature of phylogeny in the covariance matrix, allowing us to identify modes that are corrupted by phylogeny. This signature is present in large protein sequence alignments, explaining recent covariance analyses, and provides an important step toward decoupling phylogenetic effects from biologically meaningful interactions.  pairs caused by phylogeny, which could be erroneously interpreted to be caused by molecular phenotypes.
What happens if there are structural interactions between specific residue pairs in the simulation? In whereas with phylogeny this reduces to 54/200. The essential question is to find a way to disentangle phylogenetic and phenotypic (e.g., structural) covariance from matrices that contain a superposition of both (e.g., Fig.  1D). To address this, we first analyze the covariance signal produced by sequences for which the only source of covariance is phylogeny and then ask whether we can distinguish this signal when both phylogenetic and structural correlations are present.
Phylogenetic Covariance. To understand the signature of phylogenetic covariance, we consider a Markov model where mutations occur at random and different sites evolve independently. The process starts with a random sequence of length p, drawn from a q-letter alphabet, which undergoes a series of mutation and duplication events dictated by a user-imposed phylogeny with b branching events. This generates an alignment of n = 2 b simulated sequences. Population structure changes the eigenvalue spectrum of the resulting covariance matrix. To see this, consider the simplest phylogeny, consisting of a single branching event with equal-length branches. The true covariance matrix ΣS , i.e., the covariance matrix of the distribution the samples are drawn from, follows by calculating the covariance between the resulting sequences xi and xj . Since this is a stationary Markov process, the covariance between two sequences separated by 2m mutations, which we denote α(m), is E(x(2m)x(0)), which yields where the last equality specializes to a binary alphabet. A phylogeny with a single branching event has the true covariance matrix As the mutation rate m → ∞, note that α → 0. This means that ΣS → I, i.e., the sequences are uncorrelated, and phylogenetic influence is negligible. More generally, as the number of branching events or generations b increases, we find that ΣS is composed of nested squares that correspond to each branching event. This yields b + 1 distinct eigenvalues λi , with P (λ = λi ) = pi ∝ 2 i−b , except for the two largest eigenvalues, which have pi ∝ 2 −b (Supporting Information). These relationships imply that the eigenvalues follow the power law where r is the rank, and β ∝ log 2α is a function of m/p. Under the influence of phylogeny, the maximum eigenvalue increases exponentially with the number of branching events b. Note that there is a precise threshold at 2α = 1, which given Eq. 1 for α implies 2qm/p(q −1) = ln (2), above which this power law behavior occurs.
Finite-Sampling Effects. We have thus seen that phylogeny produces a striking signature in the covariance matrix. However, because the number of MSA sequences is limited, this signature will be affected by finite sampling-the sample covariance matrix will contain large entries purely by chance. We use RMT to develop a quantitative characterization of the effect on the corresponding eigenvalue distribution. Consider n independent sequences of length p, with amino acids drawn uniformly at random. The probability distribution of the sample eigenvalues follows the Marčenko-Pastur (MP) distribution where c = n/p (45). Our simulations confirm that the histogram of eigenvalues of sequences simulated without phylogeny or structural interactions is well described by this analytical formula ( Fig. 2A). As n increases, Eq. 4 implies that this distribution sharpens around λ = 1. RMT further predicts how Eq. 4 generalizes to describe the eigenvalue distribution of the sample covariance matrix C for any true covariance matrix Σ, such as those caused by phylogeny. We start with the Stieltjes transform, where F (λ) is the cumulative distribution function of f (λ), the limiting eigenvalue distribution of C . Marčenko and Pastur (45) used the method of characteristics to relate G(z , c) to T (λ), the cumulative eigenvalue distribution of the true covariance matrix Σ, yielding [

6]
This equation describes the effects of finite sampling. If the true eigenvalues cluster at/near unity, this will result in the MP distribution of Eq. 4. For phylogeny, the eigenvalues of Σ are drawn from a discrete distribution, so dT (λ) = i piδ(λ−λi)dλ (46), where pi = P(λ = λi ) follows the power law of Eq. 3. Eq. 6 describes how finite sampling smoothens out this discrete distribution. eigenvalues λ± = 1 ± α. Alignments of n0 = 2 11 sequences were simulated with m = 10 mutations per branch. The shape of the eigenvalue distribution differs significantly from that of the MP distribution (blue curve). RMT allows us to predict this spectrum using Eq. 6, which becomes The inverse Stieltjes transform, given by the positive imaginary part of G(z , c), analytically describes the expected eigenvalue distribution of CS . This is used to plot the red curve in Fig. 2B, which shows excellent quantitative agreement with the simulation, unlike the MP distribution shown in blue. As the number of branching events increases we simply use our exact formulas (Supporting Information) for the true eigenvalue distributions in Eq. 6 to compute the expected distribution.
Analysis of Inhomogeneous Phylogenies. Real phylogenetic trees are inhomogeneous, with branches of different lengths. Our framework naturally extends to this setting. Fig. 2 C and D shows the eigenvalue distributions of trees drawn from different distributions; Fig. 2C has three branching events with branch lengths drawn from a Poisson distribution, while Fig. 2D has seven branching events with branch lengths drawn from a halfnormal distribution. Note that the eigenvalue distribution broadens as the number of branching events b increases, reflecting that the maximum true eigenvalue is ∝ α b . For inhomogeneous phylogenies we discovered that analytical solutions follow a simple rule. Consider a phylogeny with branch lengths drawn from a distribution with mean m and bounded variance; the eigenvalue distribution is then well approximated by the eigenvalue distribution for the tree with all branch lengths equal to m and the same number of branching events. The red curves in Fig. 2 C and D show that this prediction fits the simulated data closely. To derive the result, we consider a phylogeny with b = 1 and branch lengths m1, m2 drawn from a Poisson distribution with mean m = µ, so that ρi := P(m1 + m2 = i) = (2µ) i e −2µ /i!. If αi = exp(−qi/p(q − 1)), then the eigenvalues of the true covariance matrix are λ = 1 ± αi . Applying Eq. 6 we find Examining the summands, we note that In the limit of large p the dependence on the tree parameters ρi and αi simplifies, so that This approximation, valid for large p, allows us to write Hence the Stieltjes transform for the inhomogeneous tree is equal to the Stieltjes transform for a homogeneous tree with m = µ the mean of the distribution the branch lengths are drawn from. This result can be generalized for any arbitrary distribution and phylogenetic tree topology (Supporting Information). This result about inhomogeneous phylogenies is important as it extends our analysis methods to more realistic phylogenies, implying that the power law tail of large eigenvalues described above is general.
Phenotypic Covariance. The eigenvalue spectrum for phenotypic covariance depends on how phenotype couples the residues to each other. While this will differ for different phenotypes, recent work has focused on using covariance analysis to predict contacts in tertiary protein structure (11,13,14,(16)(17)(18). If we consider interactions drawn from a protein contact map, what covariance is caused? For an alphabet with q = 2, the correlation between two residues that interact with strength j is given by tanh(j ), which saturates as j increases so that the resulting correlation does not exceed unity. With multiple interactions and a larger alphabet, the situation is more complex; however, we can use simulations to characterize the sample covariance matrix and corresponding eigenvalue distribution. We first simulate sequences without phylogeny, using a simple Markov model with nonzero residue couplings at locations dictated by protein contact maps. These couplings were chosen uniformly from the interval [−5, 5]. With the 784 interactions of Fig. 3A, the eigenvalue distribution of the sample covariance matrix is described well by the MP distribution (Fig. 3B). This empirical observation suggests that the eigenvalues of the true covariance matrix are all of similar size, suggesting that structural interactions do not lead to an eigenvalue power law. While real proteins will also have other phenotypic interactions, this model provides a relevant starting point. Phylogenetic vs. Structural Covariance. Crucially, this model suggests that there are strikingly different signatures between the covariance matrix expected from phylogeny and that expected for interactions caused by residue contacts. If only structural interactions are present, the limiting behavior of the maximum eigenvalue saturates logarithmically as a function of the number of interactions (Fig. 4A). In contrast, Fig. 4B shows that the maximum eigenvalue caused by phylogeny increases exponentially as the sequences undergo more duplication events. Moreover, Fig.  4C shows a log-log plot of the eigenvalues as a function of rank for our simulations with just phenotypic interactions (Fig. 3); the data are well fitted by a line of slope zero reflecting the absence of the power law.
To probe these signatures further, we use simulations with a controlled mix of phylogeny and structural interactions. Fig  4D shows that the spectra for simulations with just phylogeny and simulations with both phylogeny and 200 random structural interactions obey the same power law. In both cases the upper power law tail follows β = log(2α)/ log(2) (red line). With interactions, the lower extent of the power law is diminished; the blue curve in Fig. 4D drops off before the yellow curve. Importantly, these two spectra diverge only outside the power law regime, implying that phylogeny dominates those modes that follow the power law.
These simulations therefore suggest that interactions between residues affect the smallest eigenvalues, while phylogeny affects the largest eigenvalues, giving a potential mechanism for distinguishing the effects of phylogeny. Intuitively, this could arise because interactions between residues make it less likely that mutations at those sites will be accepted, reducing the effective mutation rate of these residues and hence affecting eigenvectors with low eigenvalues. In Fig 5 we simulate sets of sequences with both phylogeny and structural interactions from two different protein contact maps and obtain similar results to those in Fig 4D. In contrast to Fig. 3, we find that the eigenvalue distributions of the resulting sequence alignments are not MP, but are well fitted by our analytic approach. The red curves in Fig. 5 A and B are each found using the phylogenetic parameters from the power law fits in Fig. 5 C and D, respectively. Eigenvalue Spectra of Protein Sequence Data. Given the vastly different signatures in the eigenvalue distributions expected from phylogeny and structural interactions, it is of great interest to see whether such signatures arise in protein sequence data. To probe this, we choose three representative protein families for which covariance analysis has been shown to yield accurate contact predictions. In Fig. 6 A-C, Top we show that the eigenvalue distributions follow a power law in each case, as predicted by our theory. Furthermore, as for the simulated data, Fig. 6 A-C, Middle shows that the phylogenetic parameters extracted from the power fitted in each case provide a closer fit (red curves) to the eigenvalue distribution than to the MP distribution (blue curves).
Cleaning Protein Spectra. The analysis of simulated data suggests that the effects of phylogeny can be diminished by removing large modes of the covariance matrix and enforcing the constraint that the remaining eigenvalues are all of the same size. Namely, instead of the full covariance matrix from the sequence alignments, we propose truncating the highest modes, where r = p(q − 1). Fig. 6 shows the results of this approach for contact prediction. For each protein, the slope of the power law fit in Fig. 6 A-C, Top is used to estimate the phylogenetic parameters required for the analytical solution in Fig. 6 A-C, Middle (red curve). The point at which the eigenvalues deviate from the power law fit in Fig. 6 A-C, Top (purple dashed line) is used to determine which modes are dominated by phylogeny and should be truncated from the outer product expansion of the sample covariance matrix. Fig. 6 A-C, Bottom shows how well different truncations do at contact prediction; the purple dashed line reflects the threshold found from the power law fit and is near optimal in all cases. This phenomenology is entirely consistent with the notion that the modes corresponding to the large eigenvalues reflect the phylogenetic relatedness of the aligned sequences.

Discussion
This paper was motivated by recent advances (9-14, 16, 21) in predicting protein structure and function from the covariation of sequences, a strategy that has been successful for predicting RNA secondary structure for some time (1,35). A major confounding effect in both situations is the effect of phylogeny, which introduces correlations between residues (30,36,38). The correlations due to structure/function and phylogeny must be disentangled for accurate prediction. The primary accomplishment of this paper is to identify a feature of the eigenvalue distribution of protein covariance matrices (the power law tail) that distinguishes covariance due to phylogeny from that caused by structural interactions. The presence of power law tails in the data from diverse protein families allows us to develop an initial approach to deconvolving structural interactions from the covariance that results from sequence phylogeny alone. Our finding that the largest modes of the covariance matrix are dominated by phylogeny suggests an alternative rationalization for the matrix inversion step that enabled features of protein structure and function to be predicted from covariance analysis of large protein sequence alignments. Furthermore the resulting cleaned covariance matrix can be used as input for other inference approaches (9-12, 18, 19, 21).
A further result is a general understanding of how phylogenetic effects impact sequence covariation in different regions of parameter space. Depending on the sequence length p and the average branch length m, there is a parameter regime where the covariance matrix does not feature a power law tail of large eigenvalues, and hence a different approach to disentangling phenotypic interactions from phylogenetic correlations is required. Specifically, as the eigenvalues of the true covariance matrix for phylogenetic interactions are ≈(2α) k , we expect large eigenvalues when 2α > 1. Given Eq. 1 for α, this is equivalent to 2q/(q − 1)m/p < ln (2).
We have focused on the eigenvalue distribution; however, information about the phylogeny will also be imprinted in the eigenvectors of the covariance matrix. In the phylogenetic regime, the eigenvectors will have structure that reflects the relationship between the different sequences (43,44), providing additional information about which modes should be removed