From residue coevolution to protein conformational ensembles and functional dynamics
- aInstitute of Structural and Molecular Biology, University College London, London WC1H 0AJ, United Kingdom;
- bStructural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain;
- cDepartment of Chemistry, University College London, London WC1H 0AJ, United Kingdom
See allHide authors and affiliations
Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved September 23, 2015 (received for review May 2, 2015)

Significance
Evolutionary-related protein sequences have been selected to preserve a common function and fold. Residues in contact in this conserved structure are coupled by evolution and show correlated mutational patterns. The exponential growth of sequenced genomes makes it possible to detect these coevolutionary coupled pairs and to infer three-dimensional folds from predicted contacts. But how far can we push the prediction of native folds? Can we predict the conformational heterogeneity of a protein directly from sequences? We address these questions developing an accurate contact prediction algorithm and a protein coarse-grained model, and exploring conformational landscapes congruent with coevolution. We find that both structural and dynamical properties can be already recovered using evolutionary information only.
Abstract
The analysis of evolutionary amino acid correlations has recently attracted a surge of renewed interest, also due to their successful use in de novo protein native structure prediction. However, many aspects of protein function, such as substrate binding and product release in enzymatic activity, can be fully understood only in terms of an equilibrium ensemble of alternative structures, rather than a single static structure. In this paper we combine coevolutionary data and molecular dynamics simulations to study protein conformational heterogeneity. To that end, we adapt the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develop a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble. By means of exhaustive sampling simulations, we analyze the set of conformations that are consistent with the observed residue correlations for a set of representative protein domains, showing that (i) the most representative structure is consistent with the experimental fold and (ii) the various regions of the sequence display different stability, related to multiple biologically relevant conformations and to the cooperativity of the coevolving pairs. Moreover, we show that the proposed protocol is able to reproduce the essential features of a protein folding mechanism as well as to account for regions involved in conformational transitions through the correct sampling of the involved conformers.
Pairs of positions along a protein sequence can show strong correlations arising both from functional and structural constraints (1⇓⇓⇓⇓⇓⇓⇓–9). Earliest approaches for detecting interdependent residues and predicting 3D contacts in proteins (1⇓⇓–4, 8) analyzed alignments containing from tens to a few hundreds sequences. Given the small size of available sequences datasets, these works relied on an independent pair approximation: a “coevolutionary coupling” between two residues was estimated independently for each pair, ignoring the rest of the network of residues. The number of known protein sequences, however, has grown dramatically in the past few years (10). Such a large increase in the size of datasets has allowed to fit—either explicitly (11, 12) or implicitly (13, 14)—pairwise models for protein sequences that take into account the whole network of correlated residues simultaneously, and are able to disentangle correlated positions from “interacting” positions by identifying the parameters of the model with the coupling constants in an Ising-like Hamiltonian (15, 16). Despite their simplicity, these models have had remarkable success in the design of synthetic sequences preserving natural function (13, 14) and in the prediction of interacting pairs of residues from the knowledge of their sequence alone (12, 17⇓⇓⇓–21).
In this paper, we tackle the problem of sampling an ensemble of structures compatible with the observed coevolution between protein residues. We will follow a two-step procedure. The first step corresponds to an inverse problem: from a set of homologous sequences to the parameters of a model. Inverse problems are notoriously computationally hard. For large sets of variables, an exact evaluation of the normalizing constant of the variables’ joint distribution (the partition function, in the language of statistical mechanics) is impracticable. Previous works in the literature focused on efficiency, circumventing this problem by adopting different, approximated solutions (12, 17⇓–19, 22⇓⇓⇓–26), generically based on tractable approximations of the likelihood function. However, given the success and the number of potential applications of coevolutionary analysis, the study of reference and more quantitative approaches is necessary. In this regard, the Monte Carlo Markov chain (MCMC)-based, maximum-likelihood approach, albeit computationally demanding, is in principle exact given a sufficient sampling at each minimization step. In this work we adopt the Boltzmann learning algorithm (11, 27), whose accuracy in inferring the parameters of the pairwise model, at variance with all of the previous approaches in the literature, is not biased a priori by the choice of a particular approximation scheme.
The second step is a direct problem: after translating the probabilistic model for sequences into an energy potential for protein structures, we can explore the resulting energy landscape using molecular dynamics (MD). After extensive sampling, we can characterize the folding reaction and find the best candidate for the native fold as well as metastable intermediates and conformers that may have a functional role. Moreover, we can spot flexible regions, directly connecting coevolution to function and dynamics. With this goal in mind, we introduce a coarse-grained model particularly apt to translate predictions of contacts to a structural ensemble. Thanks to the great reduction in the number of degrees of freedom, coarse-grained models have been widely used to study many aspects of proteins (28⇓⇓⇓⇓⇓–34). Due to their simplicity, Cα models in particular have already been used to predict protein folds from coevolutionary data (35⇓–37). Here, in the same spirit as the model presented in ref. 35, where coevolutionary information is used with a
However, besides recovering a protein native fold, the main advantage of the proposed approach is its ability to capture the conformational heterogeneity and the thermodynamical features of the folding reaction as implied by coevolutionary information only. In contrast to more expensive approaches like all-atom MD or more refined coarse-grained potentials (39), we can afford an extensive equilibrium sampling of the conformational space. We illustrate this point by applying our approach to analyze two energy landscapes related to the folding of the Ras protein and the conformational dynamics of a tyrosine kinase. As expected, Ras folds cooperatively, and we find and characterize a folding intermediate. The protein kinase correctly samples an ensemble of active-like and inactive-like structures that are biologically relevant for its function and shows a flexibility pattern compatible with experimental observations.
Results
Contact Predictions via Boltzmann Learning.
Entropy maximization provides a simple procedure for building a probabilistic model that is consistent with a set of available measures. If we know the average values
To our knowledge, since the early work of Lapedes et al. (11), the numerical route of likelihood maximization via importance sampling, or Boltzmann learning (27), has not been explored, probably due to its computational complexity. As outlined in the introduction, this approach [see, e.g., Roudi et al. (41) for an extensive comparison between Boltzmann learning and various approximated schemes] has the advantage of having unbounded precision in retrieving the parameters of a maximum-entropy model. We tested Boltzmann learning on an assorted set of 18 proteins with varying length, from 63 to 216 residues, and different secondary structure composition (Table 1). For each alignment, we inferred a pairwise model by maximizing a regularized version of the log-likelihood of the sequences with respect to parameters
Sum up of protein domain features and main results
(A and B) Energy surfaces obtained projecting sequences from the ADH_zinc_N domain family (A) and sequences sampled from the model (B) over the first two principal components of the MSA (7), and taking the negative logarithm of the resulting distribution. High-probability regions in sequence space are in dark blue. The cluster structure of the alignment is clearly reproduced by the simulated trajectories. (C and D) Mean precision of the top-ranked predictions for different values of the scaled rank (rank/total number of contacts), and the mean true positive rate for different values of the false-positive rate (ROC curve). The color bands show the SD and the interval between the minimum and maximum values.
For each protein, from the estimated set of
A Coarse-Grained Model for Structure Prediction and Conformational Sampling.
To take full advantage of the accurate contact predictions resulting from the Boltzmann learning algorithm, we propose a coarse-grained model that combines an all-heavy-atom description of the protein backbone with a
Using this simplified model, we investigated the conformational space accessible to the protein domains in Table 1. The nine best-predicted structures obtained in the folding runs are shown in Fig. 2 superimposed to their respective native folds. Those structures correspond to the conformations with minimum distance RMSD of the α carbon atoms (
The nine best-predicted protein structures of the 18 simulated are shown in blue, overlaid to the native conformation in cyan (transparent). The proteins are disposed with increasing size from left to right and from top to bottom. Their Pfam identifier is shown below.
For domains of similar size, we found a clear correlation between the precision of the contact predictions and the final dRMSD values (e.g., domains Trans_Reg_C, CMD, and PAS have lower precision in contact prediction and higher dRMSD values than other domains with a similar number of amino acids; Table 1), confirming that the precision of contacts prediction is the main determinant of the quality of the final fold reconstruction. However, all 18 proteins fold within 3 Å to the native protein, except for the 2GJ3 structure (PAS domain), which performs worse (Table 1 and SI Appendix, Fig. S6). We found that the model is tolerant to both a conspicuous presence of nonnative contacts among the set of predicted contacts as well as to the absence of a large proportion of native contacts among the predicted ones.
To have a reference baseline to compare with, we also simulated the proteins using the same conditions but replacing the set of predicted contacts with the set of native contacts (see details in SI Appendix). The best dRMSDs obtained in this case are always below ∼2 Å (SI Appendix, Fig. S6, black curve). These values quantify the theoretical limit of the present model with perfectly predicted contacts. The quality of the predicted structures is as good as those predicted by the structure-based reference simulation in several cases (HTH_31, Sigma70_r2, RRM_1, Ras). Interestingly, the dRMSD of the minimum energy structure (dashed curve in SI Appendix, Fig. S6) does not deviate much from the absolute minimum dRMSD structure sampled in the trajectory. Indeed, for 17 of 18 proteins, the minimum energy structure is below 4 Å to the native structure, which indicates that the model and the contact predictions are sound and lead to a funnel-shaped energy landscape whose minimum corresponds to the crystallographic structure.
In the single case of the PAS domain, we observe a larger deviation, 7 vs. 3.6 Å for the dRMSD of the minimum energy structure and the absolute minimum, respectively (Table 1), which can be only partially explained by the presence of an associated cofactor in the experimental structure. Such a large difference indicates that though the minimum dRMSD structure still belongs to the native basin, it does not coincide with the minimum energy structure, which is the defining property of the native fold. Inspecting the set of predicted contacts, we found that five of them are in the dimeric interface of the corresponding homodimer (SI Appendix, Fig. S7). After removing these interdomain contacts, the minimum energy dRMSD of the monomeric structure decreases to 4.7 Å (SI Appendix, Fig. S6), showing that these few contacts were responsible for a large displacement from the native conformation. This result is also corroborated by the fact that when five randomly chosen false-positive contacts are removed, the dRMSD of the minimum energy structure does not improve, which we verified running 20 independent simulations (SI Appendix). As far as we know, the effect on fold reconstruction of strongly coupled pairs of residues at homodimeric (or homooligomeric) interfaces, when incorrectly classified as contacts in the monomeric structure, have not been described before. Here we show that this effect can be large, depending on the relative position of the pairs of residues at the dimer interfaces. Reasonably, similar but weaker effects are present for other cases, being that 10 over 18 proteins in our set were homodimeric in the corresponding PDB structure. The analysis demonstrates that the ability of PAS domains to form homodimers (44) is subjected to evolutionary pressure, and identifies a set of five pairs of positions involving the N-terminal helix (I40-F27; Y49-F27; A117-P35; L130-Q29; M132-A34 in the 2GJ3 PDB numbering) that emerge as crucial for the stabilization of the PAS homodimer across the protein family, as supported by their proximity in the crystal structure, strong coevolutionary coupling, and simulation.
Conformational Heterogeneity and Residue Coevolution.
To investigate the ability of coevolutionary couplings to also encode for dynamical and functional information, we analyzed the conformational space close to the native fold in the case of the catalytic domain (CD) of SRC tyrosine kinase and characterized the full folding reaction of the Ras domain from a thermodynamical point of view.
Protein kinases are known to undergo a large conformational rearrangement of a centrally located loop during activation, called activation loop or A-loop. In the case of the CD of the SRC tyrosine kinase, the A-loop spans more than 20 residues, and the structural rearrangement moves the backbone across ≈25 Å from the inactive conformation where the A-loop is folded, to the fully active structure where the A-loop is in an extended conformation. The A-loop is known to be flexible to the point of being invisible in many crystal structures. It is thus interesting to see if traces of this conformational transition can be observed by an exhaustive sampling of the native fold basin with our model. We performed a 500-ns-long parallel tempering (PT) simulation of the 250 residues CD starting from the inactive conformation with the A-loop in the so-called “half-closed conformation,” corresponding to the structure with PDB code 2SRC. We used the 250 best-predicted contacts calculated over 3,812 effective sequences from PFAM Pkinase_Tyr family, without adding any structure-based bias. Indeed, only 130 of the 250 contacts are native contacts (where a native contact is defined between CB atoms within 8 Å in the native structure). Consistent with the fact that no contacts have been predicted for the A-loop, we observe an extensive sampling of the conformational space available to the loop while the rest of the protein correctly maintains the native fold.
In Fig. 3 we compare the experimental and simulated structural flexibility of different positions along the chain. Even though we expect a sequence-dependent effect on structural order/disorder, we note that flexible regions (A-loop, 298–303 loop and αG-helix) are captured by our model with few exceptions (αC-helix, G-rich loop), suggesting that chain flexibility itself is partially encoded in the sequences of the kinase protein family (SI Appendix, Fig. S8). Indeed, these regions are known to be involved in the activation or in protein–protein interaction (45). We note that flexibility cannot be trivially deducted from the number of predicted contacts involving each residue, as shown in SI Appendix, Fig. S8, but is rather a consequence of the whole fold and network of interactions. Moreover, both the active and inactive conformations of the A-loop are repeatedly sampled within 5 Å
Comparison of the fluctuations of the SRC catalytic domain. The experimental X-ray structure (PDB ID code 2SRC) is shown (Left) with the normalized B-factor color and thickness coded. The same structure with the B-factor calculated from the simulation of the SRC catalytic domain (Right). For easier comparison, the scales have been normalized. Higher values correspond to larger fluctuations.
The Ras protein is a
Folding free energy of Ras protein at three different temperatures around the folding temperature
Discussion and Outlook
Among proteins sharing a common ancestor, secondary and tertiary structure is generally much more conserved than protein sequence (50). As a result, a protein family is free to explore a large space of possible sequences while at the same time preserving a common structural framework. As shown by previous works (3, 4, 8, 11, 12, 17⇓⇓⇓⇓⇓⇓⇓⇓–26), the information contained in sequence variability can be translated to a series of couplings between protein residues, restraining the set of conformations compatible with the observed protein sequences. The main objective of this paper is to study such an evolutionary-restrained space of conformations. We extracted the optimal couplings by forgoing unnecessary approximations and using a standard reference algorithm, Boltzmann learning (51), based on MCMC simulations of trajectories in sequence space. This scheme allowed us to verify, at variance with previous works, the simple but important fact that the fitted maximum-entropy models reproduce the observed correlations between residues in multiple sequence alignments. We developed a simplified physical model for protein dynamics, showing that the combination of accurate contact predictions and a coarse-grained yet biologically meaningful protein model allows for a full sampling of the structural ensemble associated to a protein family. Our simulations support the finding (19) that the structural landscape dictated by residue coevolution is dominated by the energy minimum corresponding to the native fold, which we recover with high accuracy for a set of unrelated protein domains. We show that the presence of conserved quaternary structure in the family, as a biologically relevant homodimer or homooligomer, can lead to misclassification of coevolving residues at the dimeric interface as contacts in the monomeric structure, compromising the quality of the folding reconstruction.
Furthermore, the main end of our protocol is to explore the connection between coevolutionary couplings at the residues level and protein dynamics—an issue that has been mentioned in the literature (18, 36, 52) but not directly addressed. We show in two significant cases that coevolutionary information can be used to reproduce and predict increasingly complex protein features: from identifying flexible regions, as in the case of SRC, to the sampling of conformers crucial for the kinase activation and function up to a full characterization of the folding reaction as in the case of the RAS protein. It is worth noting that in all these cases, the recovered information is not easily accessible by other approaches, such as elastic-network models or without explicit knowledge of the structures involved.
The approach we propose paves the way to further development in combining experimental and genomic data, going beyond the rigid structure paradigm and taking on the exploration of a protein energy landscape and its biologically relevant conformations.
Materials and Methods
Inference of Coevolutionary Couplings and Contact Predictions.
Full MSAs for the 18 families were downloaded from the Pfam database (10). The alignments were filtered, removing unaligned insertions and keeping the remaining aligned positions. Repeated sequences, sequences containing nonnatural amino acids, or those with a fraction of gaps greater than 0.2 were removed. The empirical frequencies for single and pairs of amino acids were computed from the final alignments as weighted averages to account for sampling biases (see SI Appendix for details). The parameters
Protein Coarse-Grained Model.
The protein chain is described by the heavy atoms of the backbone plus the
Acknowledgments
We thank the following for computing time: Partnership for Advanced Computing in Europe (PRACE) Research Infrastructure Resources MareNostrum at the Barcelona Supercomputing Center and the Institut Curie Paris at the Très Grand Centre de Calcul-CEA (FP7 RI-283493), PRACE-3IP Project FP7 RI-312763 Resources, and the HECBioSim Resource Archer. Support for this work was provided by Engineering and Physical Sciences Research Council Grant EP/M013898/1 (to F.L.G.) and the Spanish Ministry of Economy and Competitiveness (MINECO) Grant BIO2012-40205 (to A.V.).
Footnotes
↵1L.S. and S.M. contributed equally to this work.
- ↵2To whom correspondence may be addressed. Email: avalencia{at}cnio.es or f.l.gervasio{at}ucl.ac.uk.
Author contributions: L.S., S.M., A.V., and F.L.G. designed research; L.S. and S.M. performed research; L.S. and S.M. analyzed data; and L.S., S.M., A.V., and F.L.G. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1508584112/-/DCSupplemental.
Freely available online through the PNAS open access option.
References
- ↵.
- Altschuh D,
- Vernet T,
- Berti P,
- Moras D,
- Nagai K
- ↵.
- Korber BT,
- Farber RM,
- Wolpert DH,
- Lapedes AS
- ↵
- ↵.
- Shindyalov IN,
- Kolchanov NA,
- Sander C
- ↵.
- Taylor WR,
- Hatrick K
- ↵.
- Neher E
- ↵
- ↵
- ↵
- ↵
- ↵.
- Lapedes A,
- Giraud B,
- Jarzynski C
- ↵.
- Weigt M,
- White RA,
- Szurmant H,
- Hoch JA,
- Hwa T
- ↵
- ↵
- ↵.
- Bialek W,
- Ranganathan R
- ↵
- ↵.
- Schug A,
- Weigt M,
- Onuchic JN,
- Hwa T,
- Szurmant H
- ↵.
- Morcos F, et al.
- ↵
- ↵
- ↵.
- Hopf TA, et al.
- ↵
- ↵
- ↵.
- Jones DT,
- Buchan DW,
- Cozzetto D,
- Pontil M
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Irbäck A,
- Sjunnesson F,
- Wallin S
- ↵
- ↵
- ↵
- ↵
- ↵.
- Sułkowska JI,
- Morcos F,
- Weigt M,
- Hwa T,
- Onuchic JN
- ↵.
- Morcos F,
- Jana B,
- Hwa T,
- Onuchic JN
- ↵.
- Cheng RR,
- Morcos F,
- Levine H,
- Onuchic JN
- ↵
- ↵
- ↵
- ↵
- ↵.
- Moult J,
- Fidelis K,
- Kryshtafovych A,
- Schwede T,
- Tramontano A
- ↵
- ↵
- ↵
- ↵
- ↵.
- Rojas AM,
- Fuentes G,
- Rausell A,
- Valencia A
- ↵
- ↵
- ↵.
- Holm L,
- Rosenström P
- ↵.
- Hinton GE,
- Sejnowski TJ
- ↵.
- Dago AE, et al.
- ↵
Citation Manager Formats
Article Classifications
- Biological Sciences
- Biophysics and Computational Biology