Skip to main content

Main menu

  • Home
  • Articles
    • Current
    • Special Feature Articles - Most Recent
    • Special Features
    • Colloquia
    • Collected Articles
    • PNAS Classics
    • List of Issues
  • Front Matter
    • Front Matter Portal
    • Journal Club
  • News
    • For the Press
    • This Week In PNAS
    • PNAS in the News
  • Podcasts
  • Authors
    • Information for Authors
    • Editorial and Journal Policies
    • Submission Procedures
    • Fees and Licenses
  • Submit
  • Submit
  • About
    • Editorial Board
    • PNAS Staff
    • FAQ
    • Accessibility Statement
    • Rights and Permissions
    • Site Map
  • Contact
  • Journal Club
  • Subscribe
    • Subscription Rates
    • Subscriptions FAQ
    • Open Access
    • Recommend PNAS to Your Librarian

User menu

  • Log in
  • My Cart

Search

  • Advanced search
Home
Home
  • Log in
  • My Cart

Advanced Search

  • Home
  • Articles
    • Current
    • Special Feature Articles - Most Recent
    • Special Features
    • Colloquia
    • Collected Articles
    • PNAS Classics
    • List of Issues
  • Front Matter
    • Front Matter Portal
    • Journal Club
  • News
    • For the Press
    • This Week In PNAS
    • PNAS in the News
  • Podcasts
  • Authors
    • Information for Authors
    • Editorial and Journal Policies
    • Submission Procedures
    • Fees and Licenses
  • Submit
Research Article

From residue coevolution to protein conformational ensembles and functional dynamics

View ORCID ProfileLudovico Sutto, Simone Marsili, Alfonso Valencia, and Francesco Luigi Gervasio
  1. aInstitute of Structural and Molecular Biology, University College London, London WC1H 0AJ, United Kingdom;
  2. bStructural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain;
  3. cDepartment of Chemistry, University College London, London WC1H 0AJ, United Kingdom

See allHide authors and affiliations

PNAS November 3, 2015 112 (44) 13567-13572; first published October 20, 2015; https://doi.org/10.1073/pnas.1508584112
Ludovico Sutto
aInstitute of Structural and Molecular Biology, University College London, London WC1H 0AJ, United Kingdom;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Ludovico Sutto
Simone Marsili
bStructural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Alfonso Valencia
bStructural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: avalencia@cnio.es f.l.gervasio@ucl.ac.uk
Francesco Luigi Gervasio
aInstitute of Structural and Molecular Biology, University College London, London WC1H 0AJ, United Kingdom;
cDepartment of Chemistry, University College London, London WC1H 0AJ, United Kingdom
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • For correspondence: avalencia@cnio.es f.l.gervasio@ucl.ac.uk
  1. Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved September 23, 2015 (received for review May 2, 2015)

  • Article
  • Figures & SI
  • Info & Metrics
  • PDF
Loading

Significance

Evolutionary-related protein sequences have been selected to preserve a common function and fold. Residues in contact in this conserved structure are coupled by evolution and show correlated mutational patterns. The exponential growth of sequenced genomes makes it possible to detect these coevolutionary coupled pairs and to infer three-dimensional folds from predicted contacts. But how far can we push the prediction of native folds? Can we predict the conformational heterogeneity of a protein directly from sequences? We address these questions developing an accurate contact prediction algorithm and a protein coarse-grained model, and exploring conformational landscapes congruent with coevolution. We find that both structural and dynamical properties can be already recovered using evolutionary information only.

Abstract

The analysis of evolutionary amino acid correlations has recently attracted a surge of renewed interest, also due to their successful use in de novo protein native structure prediction. However, many aspects of protein function, such as substrate binding and product release in enzymatic activity, can be fully understood only in terms of an equilibrium ensemble of alternative structures, rather than a single static structure. In this paper we combine coevolutionary data and molecular dynamics simulations to study protein conformational heterogeneity. To that end, we adapt the Boltzmann-learning algorithm to the analysis of homologous protein sequences and develop a coarse-grained protein model specifically tailored to convert the resulting contact predictions to a protein structural ensemble. By means of exhaustive sampling simulations, we analyze the set of conformations that are consistent with the observed residue correlations for a set of representative protein domains, showing that (i) the most representative structure is consistent with the experimental fold and (ii) the various regions of the sequence display different stability, related to multiple biologically relevant conformations and to the cooperativity of the coevolving pairs. Moreover, we show that the proposed protocol is able to reproduce the essential features of a protein folding mechanism as well as to account for regions involved in conformational transitions through the correct sampling of the involved conformers.

  • coevolution
  • network inference
  • coarse-grained
  • protein folding
  • protein dynamics

Pairs of positions along a protein sequence can show strong correlations arising both from functional and structural constraints (1⇓⇓⇓⇓⇓⇓⇓–9). Earliest approaches for detecting interdependent residues and predicting 3D contacts in proteins (1⇓⇓–4, 8) analyzed alignments containing from tens to a few hundreds sequences. Given the small size of available sequences datasets, these works relied on an independent pair approximation: a “coevolutionary coupling” between two residues was estimated independently for each pair, ignoring the rest of the network of residues. The number of known protein sequences, however, has grown dramatically in the past few years (10). Such a large increase in the size of datasets has allowed to fit—either explicitly (11, 12) or implicitly (13, 14)—pairwise models for protein sequences that take into account the whole network of correlated residues simultaneously, and are able to disentangle correlated positions from “interacting” positions by identifying the parameters of the model with the coupling constants in an Ising-like Hamiltonian (15, 16). Despite their simplicity, these models have had remarkable success in the design of synthetic sequences preserving natural function (13, 14) and in the prediction of interacting pairs of residues from the knowledge of their sequence alone (12, 17⇓⇓⇓–21).

In this paper, we tackle the problem of sampling an ensemble of structures compatible with the observed coevolution between protein residues. We will follow a two-step procedure. The first step corresponds to an inverse problem: from a set of homologous sequences to the parameters of a model. Inverse problems are notoriously computationally hard. For large sets of variables, an exact evaluation of the normalizing constant of the variables’ joint distribution (the partition function, in the language of statistical mechanics) is impracticable. Previous works in the literature focused on efficiency, circumventing this problem by adopting different, approximated solutions (12, 17⇓–19, 22⇓⇓⇓–26), generically based on tractable approximations of the likelihood function. However, given the success and the number of potential applications of coevolutionary analysis, the study of reference and more quantitative approaches is necessary. In this regard, the Monte Carlo Markov chain (MCMC)-based, maximum-likelihood approach, albeit computationally demanding, is in principle exact given a sufficient sampling at each minimization step. In this work we adopt the Boltzmann learning algorithm (11, 27), whose accuracy in inferring the parameters of the pairwise model, at variance with all of the previous approaches in the literature, is not biased a priori by the choice of a particular approximation scheme.

The second step is a direct problem: after translating the probabilistic model for sequences into an energy potential for protein structures, we can explore the resulting energy landscape using molecular dynamics (MD). After extensive sampling, we can characterize the folding reaction and find the best candidate for the native fold as well as metastable intermediates and conformers that may have a functional role. Moreover, we can spot flexible regions, directly connecting coevolution to function and dynamics. With this goal in mind, we introduce a coarse-grained model particularly apt to translate predictions of contacts to a structural ensemble. Thanks to the great reduction in the number of degrees of freedom, coarse-grained models have been widely used to study many aspects of proteins (28⇓⇓⇓⇓⇓–34). Due to their simplicity, Cα models in particular have already been used to predict protein folds from coevolutionary data (35⇓–37). Here, in the same spirit as the model presented in ref. 35, where coevolutionary information is used with a Cα coarse-grained protein model, we present a higher-resolution coarse-grained model that combines the pairwise predictions with an adapted all-atom force field for the heavy backbone atoms, similar to the approach used in ref. 38. The predicted contacts are introduced as favorable interactions between Cβ atoms of a coarse-grained side-chain, whereas the protein backbone is modeled with all of the heavy atoms to capture the secondary structure conformation with high resolution. Indeed, we show through extensive molecular dynamics simulations on a set of 18 proteins that the final accuracy of structure prediction, measured as root-mean-square deviation (RMSD) from the native experimental structure, is determined solely by the accuracy of contact predictions.

However, besides recovering a protein native fold, the main advantage of the proposed approach is its ability to capture the conformational heterogeneity and the thermodynamical features of the folding reaction as implied by coevolutionary information only. In contrast to more expensive approaches like all-atom MD or more refined coarse-grained potentials (39), we can afford an extensive equilibrium sampling of the conformational space. We illustrate this point by applying our approach to analyze two energy landscapes related to the folding of the Ras protein and the conformational dynamics of a tyrosine kinase. As expected, Ras folds cooperatively, and we find and characterize a folding intermediate. The protein kinase correctly samples an ensemble of active-like and inactive-like structures that are biologically relevant for its function and shows a flexibility pattern compatible with experimental observations.

Results

Contact Predictions via Boltzmann Learning.

Entropy maximization provides a simple procedure for building a probabilistic model that is consistent with a set of available measures. If we know the average values {Fi} of a set of variables {xi}, the associated maximum entropy distribution is given by P∝exp(∑iλixi), with a Lagrange multiplier λi for each variable xi (40). Fixing the frequencies for single and pairs of amino acids in a multiple sequence alignment (MSA), P takes the form (11): P({a})=Z−1exp[∑ihi(ai)+∑i<jJi,j(ai,aj)] over protein sequences. The parameters hi(α) and Ji,j(α,β) are the Lagrange multipliers that fix the averages of the model, fi(α) and fi,j(α,β), to the empirical frequencies Fi(α) and Fi,j(α,β) computed from the MSA, where α and β denote two particular amino acids and i, j two particular residues along the protein sequence. Due to the Boltzmann-like form of the previous equation, the parameter Ji,j(α,β) can be interpreted as the direct interaction between amino acids α and β at positions i and j, after the contributions from the interaction with other positions through indirect pathways have been disentangled (11, 12, 22).

To our knowledge, since the early work of Lapedes et al. (11), the numerical route of likelihood maximization via importance sampling, or Boltzmann learning (27), has not been explored, probably due to its computational complexity. As outlined in the introduction, this approach [see, e.g., Roudi et al. (41) for an extensive comparison between Boltzmann learning and various approximated schemes] has the advantage of having unbounded precision in retrieving the parameters of a maximum-entropy model. We tested Boltzmann learning on an assorted set of 18 proteins with varying length, from 63 to 216 residues, and different secondary structure composition (Table 1). For each alignment, we inferred a pairwise model by maximizing a regularized version of the log-likelihood of the sequences with respect to parameters {h} and {J} (see Materials and Methods for details). In all 18 cases, we were able to reproduce the empirical frequencies {F} within a reasonable error, as we checked through extensive sampling from the final models (SI Appendix, Fig. S1). Depending on the size of the protein sequence, we obtained mean absolute relative errors 〈|Fi,j−fi,j|/Fi,j〉 between the model pair frequencies fi,j and the empirical Fi,j ranging from 1% to 3% (0.2–2% for Fi,j>0.01) for the different protein families. Being that our approach is based on importance sampling, the presence of many isolated modes in the distribution of sequences could lead to poor mixing of the Markov chain and, consequently, to a large error in the estimate of the gradient of the likelihood function. Indeed, clusters of sequences in multiple sequence alignments are common and are known to reflect potential functional subfamilies among the members of a single protein family (7). As a cross-check, we verified that the fitted models capture the organization of the original alignment in clusters of sequences (Fig. 1 and SI Appendix). We point out that “external” sources of variation—such as changes in functional requirements within the same protein family—should be explicitly taken into account in future, improved models to discriminate between intrinsic and extrinsic correlations in the sequence alignment.

View this table:
  • View inline
  • View popup
Table 1.

Sum up of protein domain features and main results

Fig. 1.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 1.

(A and B) Energy surfaces obtained projecting sequences from the ADH_zinc_N domain family (A) and sequences sampled from the model (B) over the first two principal components of the MSA (7), and taking the negative logarithm of the resulting distribution. High-probability regions in sequence space are in dark blue. The cluster structure of the alignment is clearly reproduced by the simulated trajectories. (C and D) Mean precision of the top-ranked predictions for different values of the scaled rank (rank/total number of contacts), and the mean true positive rate for different values of the false-positive rate (ROC curve). The color bands show the SD and the interval between the minimum and maximum values.

For each protein, from the estimated set of {J} we computed a matrix of coevolutionary couplings using a protocol first proposed by Ekeberg et al. (25). Assuming that a strong, direct coevolution is an evidence of physical contact between two residues, we finally ranked the pairs of residues using the value of coevolutionary coupling as a score. The accurate determination of couplings through the Boltzmann learning algorithm resulted in high-quality predictions. Fig. 1 summarizes the performance in terms of mean precision against the top-scoring predictions rank, and as mean true positive rate as a function of false positive rate. We defined a pair of residues to be in contact when the distance between their Cβ atoms (Cα in case of GLY) is smaller than 8 Å (42, 43), and included in the analysis all of the pairs with a sequence separation larger than four. For the top N predictions, where N denotes the number of amino acids in a protein structure, the algorithm obtained a mean precision of 0.67±0.03, with a maximum of 0.89 for the cNMP_binding domain (PDB ID code 3FHI). A comparison with predictions obtained through a more standard mean-field solution (18, 19) is included in SI Appendix.

A Coarse-Grained Model for Structure Prediction and Conformational Sampling.

To take full advantage of the accurate contact predictions resulting from the Boltzmann learning algorithm, we propose a coarse-grained model that combines an all-heavy-atom description of the protein backbone with a Cβ description of the side chains (Materials and Methods). Similar coarse-grained approaches have already been successfully applied to study the thermodynamics of model proteins (30). Because we expect the residues to be evolutionary coupled through their side chains, we set the predicted interactions to act between the Cβ atoms—i.e., the first atom to branch out of the main chain. More precisely, in our model we used the N best-predicted contacts, where N is the number of residues (Materials and Methods). Moreover, the inclusion of all of the heavy atoms of the main chain allows use of a transferable potential that acts on the actual degrees of freedom of the backbone enabling the correct reproduction of the experimental population of Ramachandran backbone angles. To complement long-range predictions, we estimated the helical propensity of each residue solely from the predicted contacts and translated it in a stabilizing hydrogen bond-mimicking interaction between residues i,i+4 (see SI Appendix for details). For all of the α and α/β proteins, all of the helices are correctly predicted with the exception of h1 for 3D7I, h5 for 2GJ3, and h2 for 5P21 structures (SI Appendix, Fig. S2).

Using this simplified model, we investigated the conformational space accessible to the protein domains in Table 1. The nine best-predicted structures obtained in the folding runs are shown in Fig. 2 superimposed to their respective native folds. Those structures correspond to the conformations with minimum distance RMSD of the α carbon atoms (Cα-dRMSD; Materials and Methods) to the native conformation. For all of them, not only is the global fold correctly predicted, but also most of the secondary structure elements are present and correctly packed.

Fig. 2.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 2.

The nine best-predicted protein structures of the 18 simulated are shown in blue, overlaid to the native conformation in cyan (transparent). The proteins are disposed with increasing size from left to right and from top to bottom. Their Pfam identifier is shown below.

For domains of similar size, we found a clear correlation between the precision of the contact predictions and the final dRMSD values (e.g., domains Trans_Reg_C, CMD, and PAS have lower precision in contact prediction and higher dRMSD values than other domains with a similar number of amino acids; Table 1), confirming that the precision of contacts prediction is the main determinant of the quality of the final fold reconstruction. However, all 18 proteins fold within 3 Å to the native protein, except for the 2GJ3 structure (PAS domain), which performs worse (Table 1 and SI Appendix, Fig. S6). We found that the model is tolerant to both a conspicuous presence of nonnative contacts among the set of predicted contacts as well as to the absence of a large proportion of native contacts among the predicted ones.

To have a reference baseline to compare with, we also simulated the proteins using the same conditions but replacing the set of predicted contacts with the set of native contacts (see details in SI Appendix). The best dRMSDs obtained in this case are always below ∼2 Å (SI Appendix, Fig. S6, black curve). These values quantify the theoretical limit of the present model with perfectly predicted contacts. The quality of the predicted structures is as good as those predicted by the structure-based reference simulation in several cases (HTH_31, Sigma70_r2, RRM_1, Ras). Interestingly, the dRMSD of the minimum energy structure (dashed curve in SI Appendix, Fig. S6) does not deviate much from the absolute minimum dRMSD structure sampled in the trajectory. Indeed, for 17 of 18 proteins, the minimum energy structure is below 4 Å to the native structure, which indicates that the model and the contact predictions are sound and lead to a funnel-shaped energy landscape whose minimum corresponds to the crystallographic structure.

In the single case of the PAS domain, we observe a larger deviation, 7 vs. 3.6 Å for the dRMSD of the minimum energy structure and the absolute minimum, respectively (Table 1), which can be only partially explained by the presence of an associated cofactor in the experimental structure. Such a large difference indicates that though the minimum dRMSD structure still belongs to the native basin, it does not coincide with the minimum energy structure, which is the defining property of the native fold. Inspecting the set of predicted contacts, we found that five of them are in the dimeric interface of the corresponding homodimer (SI Appendix, Fig. S7). After removing these interdomain contacts, the minimum energy dRMSD of the monomeric structure decreases to 4.7 Å (SI Appendix, Fig. S6), showing that these few contacts were responsible for a large displacement from the native conformation. This result is also corroborated by the fact that when five randomly chosen false-positive contacts are removed, the dRMSD of the minimum energy structure does not improve, which we verified running 20 independent simulations (SI Appendix). As far as we know, the effect on fold reconstruction of strongly coupled pairs of residues at homodimeric (or homooligomeric) interfaces, when incorrectly classified as contacts in the monomeric structure, have not been described before. Here we show that this effect can be large, depending on the relative position of the pairs of residues at the dimer interfaces. Reasonably, similar but weaker effects are present for other cases, being that 10 over 18 proteins in our set were homodimeric in the corresponding PDB structure. The analysis demonstrates that the ability of PAS domains to form homodimers (44) is subjected to evolutionary pressure, and identifies a set of five pairs of positions involving the N-terminal helix (I40-F27; Y49-F27; A117-P35; L130-Q29; M132-A34 in the 2GJ3 PDB numbering) that emerge as crucial for the stabilization of the PAS homodimer across the protein family, as supported by their proximity in the crystal structure, strong coevolutionary coupling, and simulation.

Conformational Heterogeneity and Residue Coevolution.

To investigate the ability of coevolutionary couplings to also encode for dynamical and functional information, we analyzed the conformational space close to the native fold in the case of the catalytic domain (CD) of SRC tyrosine kinase and characterized the full folding reaction of the Ras domain from a thermodynamical point of view.

Protein kinases are known to undergo a large conformational rearrangement of a centrally located loop during activation, called activation loop or A-loop. In the case of the CD of the SRC tyrosine kinase, the A-loop spans more than 20 residues, and the structural rearrangement moves the backbone across ≈25 Å from the inactive conformation where the A-loop is folded, to the fully active structure where the A-loop is in an extended conformation. The A-loop is known to be flexible to the point of being invisible in many crystal structures. It is thus interesting to see if traces of this conformational transition can be observed by an exhaustive sampling of the native fold basin with our model. We performed a 500-ns-long parallel tempering (PT) simulation of the 250 residues CD starting from the inactive conformation with the A-loop in the so-called “half-closed conformation,” corresponding to the structure with PDB code 2SRC. We used the 250 best-predicted contacts calculated over 3,812 effective sequences from PFAM Pkinase_Tyr family, without adding any structure-based bias. Indeed, only 130 of the 250 contacts are native contacts (where a native contact is defined between CB atoms within 8 Å in the native structure). Consistent with the fact that no contacts have been predicted for the A-loop, we observe an extensive sampling of the conformational space available to the loop while the rest of the protein correctly maintains the native fold.

In Fig. 3 we compare the experimental and simulated structural flexibility of different positions along the chain. Even though we expect a sequence-dependent effect on structural order/disorder, we note that flexible regions (A-loop, 298–303 loop and αG-helix) are captured by our model with few exceptions (αC-helix, G-rich loop), suggesting that chain flexibility itself is partially encoded in the sequences of the kinase protein family (SI Appendix, Fig. S8). Indeed, these regions are known to be involved in the activation or in protein–protein interaction (45). We note that flexibility cannot be trivially deducted from the number of predicted contacts involving each residue, as shown in SI Appendix, Fig. S8, but is rather a consequence of the whole fold and network of interactions. Moreover, both the active and inactive conformations of the A-loop are repeatedly sampled within 5 Å Cα-RMSD to the crystal structure (SI Appendix, Fig. S8). The ability to reach both endpoints of the complex conformational transition in the catalytic domain is very encouraging.

Fig. 3.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 3.

Comparison of the fluctuations of the SRC catalytic domain. The experimental X-ray structure (PDB ID code 2SRC) is shown (Left) with the normalized B-factor color and thickness coded. The same structure with the B-factor calculated from the simulation of the SRC catalytic domain (Right). For easier comparison, the scales have been normalized. Higher values correspond to larger fluctuations.

The Ras protein is a α/β globular protein and is a crucial mediator of cellular proliferation and differentiation involved in several signaling pathways (46, 47). We performed a 180-ns-long PT simulation to explore a wide range of temperatures and to have a solid sampling. The simulated folding transition is highly cooperative with a clear peak in the heat capacity at constant volume at the folding temperature Tf (SI Appendix, Fig. S9). In Fig. 4, we show the free energy as a function of dRMSD at three different temperatures: Tlow<Tf, T≈Tf, and Thigh>Tf. In the unfolded state (U), the β-strands β2 and β3 are unfolded and lead to an extended tail departing from a rather structured core around the partially formed α3 and α4 helices. The collapse of this unfolded tail and its correct positioning lead to an intermediate, partially folded state (I, dRMSD = 6.5 Å), where helices α4 and α3 are formed. Eventually, the native basin (N, dRMSD = 3.8 Å) is reached with the formation of helices α3 and α5 and their correct packing by crossing a free-energy barrier of 4 kJ/mol. These features are not simply encoded in the native fold, because a coarse-grained Cα structure-based model (48) is unable to capture them (SI Appendix). It is interesting to note that the existence of a folding intermediate has been reported at high pressure and in denaturing conditions (49). Though we cannot directly compare our result with the experimental structures and the different experimental conditions, it is promising to observe the emergence of typical folding features, such as intermediate folding states, from a genomic-derived coarse-grained model.

Fig. 4.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 4.

Folding free energy of Ras protein at three different temperatures around the folding temperature Tf=317K as a function of the Cα-dRMSD. Representative structures of the native (N), intermediate (I), and unfolded (U) states are shown above with the secondary structure elements labeled.

Discussion and Outlook

Among proteins sharing a common ancestor, secondary and tertiary structure is generally much more conserved than protein sequence (50). As a result, a protein family is free to explore a large space of possible sequences while at the same time preserving a common structural framework. As shown by previous works (3, 4, 8, 11, 12, 17⇓⇓⇓⇓⇓⇓⇓⇓–26), the information contained in sequence variability can be translated to a series of couplings between protein residues, restraining the set of conformations compatible with the observed protein sequences. The main objective of this paper is to study such an evolutionary-restrained space of conformations. We extracted the optimal couplings by forgoing unnecessary approximations and using a standard reference algorithm, Boltzmann learning (51), based on MCMC simulations of trajectories in sequence space. This scheme allowed us to verify, at variance with previous works, the simple but important fact that the fitted maximum-entropy models reproduce the observed correlations between residues in multiple sequence alignments. We developed a simplified physical model for protein dynamics, showing that the combination of accurate contact predictions and a coarse-grained yet biologically meaningful protein model allows for a full sampling of the structural ensemble associated to a protein family. Our simulations support the finding (19) that the structural landscape dictated by residue coevolution is dominated by the energy minimum corresponding to the native fold, which we recover with high accuracy for a set of unrelated protein domains. We show that the presence of conserved quaternary structure in the family, as a biologically relevant homodimer or homooligomer, can lead to misclassification of coevolving residues at the dimeric interface as contacts in the monomeric structure, compromising the quality of the folding reconstruction.

Furthermore, the main end of our protocol is to explore the connection between coevolutionary couplings at the residues level and protein dynamics—an issue that has been mentioned in the literature (18, 36, 52) but not directly addressed. We show in two significant cases that coevolutionary information can be used to reproduce and predict increasingly complex protein features: from identifying flexible regions, as in the case of SRC, to the sampling of conformers crucial for the kinase activation and function up to a full characterization of the folding reaction as in the case of the RAS protein. It is worth noting that in all these cases, the recovered information is not easily accessible by other approaches, such as elastic-network models or without explicit knowledge of the structures involved.

The approach we propose paves the way to further development in combining experimental and genomic data, going beyond the rigid structure paradigm and taking on the exploration of a protein energy landscape and its biologically relevant conformations.

Materials and Methods

Inference of Coevolutionary Couplings and Contact Predictions.

Full MSAs for the 18 families were downloaded from the Pfam database (10). The alignments were filtered, removing unaligned insertions and keeping the remaining aligned positions. Repeated sequences, sequences containing nonnatural amino acids, or those with a fraction of gaps greater than 0.2 were removed. The empirical frequencies for single and pairs of amino acids were computed from the final alignments as weighted averages to account for sampling biases (see SI Appendix for details). The parameters {Ji,j(α,β)} were obtained by finding the maximum of the (l2-regularized) likelihood of the parameters as discussed in SI Appendix. Numerical maximization of the likelihood requires the calculation of the model frequencies, which we estimated through multiple (20–64) parallel Metropolis Monte Carlo simulations, and a number of sweeps per gradient estimation ranging from 104 to 105 per simulation, depending on the system. The final coevolutionary couplings Ci,j between each pair of residues were calculated from the estimated coupling parameters {Ji,j(α,β)} using a protocol proposed by Ekeberg et al. (25) (see SI Appendix for details).

Protein Coarse-Grained Model.

The protein chain is described by the heavy atoms of the backbone plus the Cβ atoms of the side chains. See SI Appendix, Fig. S5, Inset for a schematic illustration of all of the nonbonded potentials and the protein coarse-graining. The complete potential is fully described in SI Appendix and comprises contributions from the AMBER99SB-ILDN force field (53) to account for a correct backbone geometry, a nonbonded term in the form of a 12-6 Lennard–Jones function between Cβ atoms with parameters (rββ = 0.55 nm, ε = 15 kJ/mol), which accounts for the coevolution predictions and an hydrogen-bond mimicking potential between O and N atoms in the form of a 12-6 Lennard–Jones function with parameters (rON = 0.3 nm, ε = 15 kJ/mol), derived from the sequence analysis, to stabilize the helices. The simulation protocols and the analysis methods are detailed in SI Appendix.

Acknowledgments

We thank the following for computing time: Partnership for Advanced Computing in Europe (PRACE) Research Infrastructure Resources MareNostrum at the Barcelona Supercomputing Center and the Institut Curie Paris at the Très Grand Centre de Calcul-CEA (FP7 RI-283493), PRACE-3IP Project FP7 RI-312763 Resources, and the HECBioSim Resource Archer. Support for this work was provided by Engineering and Physical Sciences Research Council Grant EP/M013898/1 (to F.L.G.) and the Spanish Ministry of Economy and Competitiveness (MINECO) Grant BIO2012-40205 (to A.V.).

Footnotes

  • ↵1L.S. and S.M. contributed equally to this work.

  • ↵2To whom correspondence may be addressed. Email: avalencia{at}cnio.es or f.l.gervasio{at}ucl.ac.uk.
  • Author contributions: L.S., S.M., A.V., and F.L.G. designed research; L.S. and S.M. performed research; L.S. and S.M. analyzed data; and L.S., S.M., A.V., and F.L.G. wrote the paper.

  • The authors declare no conflict of interest.

  • This article is a PNAS Direct Submission.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1508584112/-/DCSupplemental.

Freely available online through the PNAS open access option.

References

  1. ↵
    1. Altschuh D,
    2. Vernet T,
    3. Berti P,
    4. Moras D,
    5. Nagai K
    (1988) Coordinated amino acid changes in homologous protein families. Protein Eng 2(3):193–199
    .
    OpenUrlAbstract/FREE Full Text
  2. ↵
    1. Korber BT,
    2. Farber RM,
    3. Wolpert DH,
    4. Lapedes AS
    (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: An information theoretic analysis. Proc Natl Acad Sci USA 90(15):7176–7180
    .
    OpenUrlAbstract/FREE Full Text
  3. ↵
    1. Göbel U,
    2. Sander C,
    3. Schneider R,
    4. Valencia A
    (1994) Correlated mutations and residue contacts in proteins. Proteins 18(4):309–317
    .
    OpenUrlCrossRefPubMed
  4. ↵
    1. Shindyalov IN,
    2. Kolchanov NA,
    3. Sander C
    (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358
    .
    OpenUrlAbstract/FREE Full Text
  5. ↵
    1. Taylor WR,
    2. Hatrick K
    (1994) Compensating changes in protein multiple sequence alignments. Protein Eng 7(3):341–348
    .
    OpenUrlAbstract/FREE Full Text
  6. ↵
    1. Neher E
    (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci USA 91(1):98–102
    .
    OpenUrlAbstract/FREE Full Text
  7. ↵
    1. Casari G,
    2. Sander C,
    3. Valencia A
    (1995) A method to predict functional residues in proteins. Nat Struct Biol 2(2):171–178
    .
    OpenUrlCrossRefPubMed
  8. ↵
    1. Pazos F,
    2. Helmer-Citterich M,
    3. Ausiello G,
    4. Valencia A
    (1997) Correlated mutations contain information about protein–protein interaction. J Mol Biol 271(4):511–523
    .
    OpenUrlCrossRefPubMed
  9. ↵
    1. de Juan D,
    2. Pazos F,
    3. Valencia A
    (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249–261
    .
    OpenUrlCrossRefPubMed
  10. ↵
    1. Finn RD, et al.
    (2013) Pfam: The protein families database. Nucleic Acids Res 42(Database issue):D222–D230
    .
    OpenUrlPubMed
  11. ↵
    1. Lapedes A,
    2. Giraud B,
    3. Jarzynski C
    (2002) Using sequence alignments to predict protein structure and stability with high accuracy. arXiv:1207.2484
    .
  12. ↵
    1. Weigt M,
    2. White RA,
    3. Szurmant H,
    4. Hoch JA,
    5. Hwa T
    (2009) Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci USA 106(1):67–72
    .
    OpenUrlAbstract/FREE Full Text
  13. ↵
    1. Socolich M, et al.
    (2005) Evolutionary information for specifying a protein fold. Nature 437(7058):512–518
    .
    OpenUrlCrossRefPubMed
  14. ↵
    1. Russ WP,
    2. Lowery DM,
    3. Mishra P,
    4. Yaffe MB,
    5. Ranganathan R
    (2005) Natural-like function in artificial WW domains. Nature 437(7058):579–583
    .
    OpenUrlCrossRefPubMed
  15. ↵
    1. Bialek W,
    2. Ranganathan R
    (2007) Rediscovering the power of pairwise interactions. arXiv:0712.4397
    .
  16. ↵
    1. Schneidman E,
    2. Berry MJ 2nd,
    3. Segev R,
    4. Bialek W
    (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440(7087):1007–1012
    .
    OpenUrlCrossRefPubMed
  17. ↵
    1. Schug A,
    2. Weigt M,
    3. Onuchic JN,
    4. Hwa T,
    5. Szurmant H
    (2009) High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci USA 106(52):22124–22129
    .
    OpenUrlAbstract/FREE Full Text
  18. ↵
    1. Morcos F, et al.
    (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci USA 108(49):E1293–E1301
    .
    OpenUrlAbstract/FREE Full Text
  19. ↵
    1. Marks DS, et al.
    (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6(12):e28766
    .
    OpenUrlCrossRefPubMed
  20. ↵
    1. Ovchinnikov S,
    2. Kamisetty H,
    3. Baker D
    (2014) Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3:e02030
    .
    OpenUrlPubMed
  21. ↵
    1. Hopf TA, et al.
    (2014) Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3:e03430
    .
    OpenUrlAbstract/FREE Full Text
  22. ↵
    1. Burger L,
    2. van Nimwegen E
    (2010) Disentangling direct from indirect co-evolution of residues in protein alignments. PLOS Comput Biol 6(1):e1000633
    .
    OpenUrlCrossRefPubMed
  23. ↵
    1. Balakrishnan S,
    2. Kamisetty H,
    3. Carbonell JG,
    4. Lee SI,
    5. Langmead CJ
    (2011) Learning generative models for protein fold families. Proteins 79(4):1061–1078
    .
    OpenUrlCrossRefPubMed
  24. ↵
    1. Jones DT,
    2. Buchan DW,
    3. Cozzetto D,
    4. Pontil M
    (2012) PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28(2):184–190
    .
    OpenUrlAbstract/FREE Full Text
  25. ↵
    1. Ekeberg M,
    2. Lövkvist C,
    3. Lan Y,
    4. Weigt M,
    5. Aurell E
    (2013) Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlin Soft Matter Phys 87(1):012707
    .
    OpenUrlCrossRefPubMed
  26. ↵
    1. Lui S,
    2. Tiana G
    (2013) The network of stabilizing contacts in proteins studied by coevolutionary data. J Chem Phys 139(15):155103
    .
    OpenUrlCrossRefPubMed
  27. ↵
    1. Ackley DH,
    2. Hinton GE,
    3. Sejnowski TJ
    (1985) A learning algorithm for Boltzmann machines. Cogn Sci 9:147–169
    .
    OpenUrlCrossRef
  28. ↵
    1. Whitford PC, et al.
    (2009) An all-atom structure-based potential for proteins: Bridging minimal models with all-atom empirical forcefields. Proteins 75(2):430–441
    .
    OpenUrlCrossRefPubMed
  29. ↵
    1. Marrink SJ,
    2. Risselada HJ,
    3. Yefimov S,
    4. Tieleman DP,
    5. de Vries AH
    (2007) The MARTINI force field: Coarse grained model for biomolecular simulations. J Phys Chem B 111(27):7812–7824
    .
    OpenUrlCrossRefPubMed
  30. ↵
    1. Irbäck A,
    2. Sjunnesson F,
    3. Wallin S
    (2000) Three-helix-bundle protein in a Ramachandran model. Proc Natl Acad Sci USA 97(25):13614–13618
    .
    OpenUrlAbstract/FREE Full Text
  31. ↵
    1. Pasi M,
    2. Lavery R,
    3. Ceres N
    (2013) PaLaCe: A coarse-grain protein model for studying mechanical properties. J Chem Theory Comput 9:785–793
    .
    OpenUrlCrossRef
  32. ↵
    1. Kim YC,
    2. Hummer G
    (2008) Coarse-grained models for simulations of multiprotein complexes: Application to ubiquitin binding. J Mol Biol 375(5):1416–1433
    .
    OpenUrlCrossRefPubMed
  33. ↵
    1. Camilloni C,
    2. Sutto L
    (2009) Lymphotactin: How a protein can adopt two folds. J Chem Phys 131(24):245105
    .
    OpenUrlCrossRefPubMed
  34. ↵
    1. Saunders MG,
    2. Voth GA
    (2013) Coarse-graining methods for computational biology. Annu Rev Biophys 42:73–93
    .
    OpenUrlCrossRefPubMed
  35. ↵
    1. Sułkowska JI,
    2. Morcos F,
    3. Weigt M,
    4. Hwa T,
    5. Onuchic JN
    (2012) Genomics-aided structure prediction. Proc Natl Acad Sci USA 109(26):10340–10345
    .
    OpenUrlAbstract/FREE Full Text
  36. ↵
    1. Morcos F,
    2. Jana B,
    3. Hwa T,
    4. Onuchic JN
    (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci USA 110(51):20533–20538
    .
    OpenUrlAbstract/FREE Full Text
  37. ↵
    1. Cheng RR,
    2. Morcos F,
    3. Levine H,
    4. Onuchic JN
    (2014) Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci USA 111(5):E563–E571
    .
    OpenUrlAbstract/FREE Full Text
  38. ↵
    1. Sutto L,
    2. Mereu I,
    3. Gervasio FL
    (2011) A hybrid all-atom structure-based model for protein folding and large scale conformational transitions. J Chem Theory Comput 7:4208–4217
    .
    OpenUrlCrossRef
  39. ↵
    1. Han KF,
    2. Bystroff C,
    3. Baker D
    (1997) Three-dimensional structures and contexts associated with recurrent amino acid sequence patterns. Protein Sci 6(7):1587–1590
    .
    OpenUrlCrossRefPubMed
  40. ↵
    1. Jaynes ET
    (1957) Information theory and statistical mechanics. Phys Rev 106(4):620–630
    .
    OpenUrlCrossRef
  41. ↵
    1. Roudi Y,
    2. Tyrcha J,
    3. Hertz J
    (2009) Ising model for neural data: Model quality and approximate methods for extracting functional connectivity. Phys Rev E Stat Nonlin Soft Matter Phys 79(5 Pt 1):051915
    .
    OpenUrlCrossRefPubMed
  42. ↵
    1. Moult J,
    2. Fidelis K,
    3. Kryshtafovych A,
    4. Schwede T,
    5. Tramontano A
    (2014) Critical assessment of methods of protein structure prediction (CASP)--round x. Proteins 82(Suppl 2):1–6
    .
    OpenUrl
  43. ↵
    1. Monastyrskyy B,
    2. D’Andrea D,
    3. Fidelis K,
    4. Tramontano A,
    5. Kryshtafovych A
    (2014) Evaluation of residue-residue contact prediction in CASP10. Proteins 82(Suppl 2):138–153
    .
    OpenUrlCrossRefPubMed
  44. ↵
    1. Möglich A,
    2. Ayers RA,
    3. Moffat K
    (2009) Structure and signaling mechanism of Per-ARNT-Sim domains. Structure 17(10):1282–1294
    .
    OpenUrlCrossRefPubMed
  45. ↵
    1. Kalaivani R,
    2. Srinivasan N
    (2015) A Gaussian network model study suggests that structural fluctuations are higher for inactive states than active states of protein kinases. Mol Biosyst 11(4):1079–1095
    .
    OpenUrlCrossRefPubMed
  46. ↵
    1. Downward J
    (2003) Targeting RAS signalling pathways in cancer therapy. Nat Rev Cancer 3(1):11–22
    .
    OpenUrlCrossRefPubMed
  47. ↵
    1. Rojas AM,
    2. Fuentes G,
    3. Rausell A,
    4. Valencia A
    (2012) The Ras protein superfamily: Evolutionary tree and role of conserved amino acids. J Cell Biol 196(2):189–201
    .
    OpenUrlAbstract/FREE Full Text
  48. ↵
    1. Clementi C,
    2. Nymeyer H,
    3. Onuchic JN
    (2000) Topological and energetic factors: What determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? An investigation for small globular proteins. J Mol Biol 298(5):937–953
    .
    OpenUrlCrossRefPubMed
  49. ↵
    1. Kalbitzer HR,
    2. Spoerner M,
    3. Ganser P,
    4. Hozsa C,
    5. Kremer W
    (2009) Fundamental link between folding states and functional states of proteins. J Am Chem Soc 131(46):16714–16719
    .
    OpenUrlCrossRefPubMed
  50. ↵
    1. Holm L,
    2. Rosenström P
    (2010) Dali server: Conservation mapping in 3D. Nucleic Acids Res 38(Web Server issue), W545–549
    .
  51. ↵
    1. Hinton GE,
    2. Sejnowski TJ
    (1986) Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, eds Rumelhart DE, McClelland JL, and PDP Research Group (MIT Press, Cambridge, MA), Vol. 1, pp 282–317
    .
  52. ↵
    1. Dago AE, et al.
    (2012) Structural basis of histidine kinase autophosphorylation deduced by integrating genomics, molecular dynamics, and mutagenesis. Proc Natl Acad Sci USA 109(26):E1733–E1742
    .
    OpenUrlAbstract/FREE Full Text
  53. ↵
    1. Lindorff-Larsen K, et al.
    (2010) Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins 78(8):1950–1958
    .
    OpenUrlCrossRefPubMed
    1. Sillitoe I, et al.
    (2012) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41(Database issue):D490–D498
    .
    OpenUrlPubMed
PreviousNext
Back to top
Article Alerts
Email Article

Thank you for your interest in spreading the word on PNAS.

NOTE: We only request your email address so that the person you are recommending the page to knows that you wanted them to see it, and that it is not junk mail. We do not capture any email address.

Enter multiple addresses on separate lines or separate them with commas.
From residue coevolution to protein conformational ensembles and functional dynamics
(Your Name) has sent you a message from PNAS
(Your Name) thought you would like to see the PNAS web site.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Citation Tools
From residue coevolution to protein dynamics
Ludovico Sutto, Simone Marsili, Alfonso Valencia, Francesco Luigi Gervasio
Proceedings of the National Academy of Sciences Nov 2015, 112 (44) 13567-13572; DOI: 10.1073/pnas.1508584112

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
Request Permissions
Share
From residue coevolution to protein dynamics
Ludovico Sutto, Simone Marsili, Alfonso Valencia, Francesco Luigi Gervasio
Proceedings of the National Academy of Sciences Nov 2015, 112 (44) 13567-13572; DOI: 10.1073/pnas.1508584112
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
  • Tweet Widget
  • Facebook Like
  • Mendeley logo Mendeley

Article Classifications

  • Biological Sciences
  • Biophysics and Computational Biology
Proceedings of the National Academy of Sciences: 112 (44)
Table of Contents

Submit

Sign up for Article Alerts

Jump to section

  • Article
    • Abstract
    • Results
    • Discussion and Outlook
    • Materials and Methods
    • Acknowledgments
    • Footnotes
    • References
  • Figures & SI
  • Info & Metrics
  • PDF

You May Also be Interested in

Water from a faucet fills a glass.
News Feature: How “forever chemicals” might impair the immune system
Researchers are exploring whether these ubiquitous fluorinated molecules might worsen infections or hamper vaccine effectiveness.
Image credit: Shutterstock/Dmitry Naumov.
Reflection of clouds in the still waters of Mono Lake in California.
Inner Workings: Making headway with the mysteries of life’s origins
Recent experiments and simulations are starting to answer some fundamental questions about how life came to be.
Image credit: Shutterstock/Radoslaw Lecyk.
Cave in coastal Kenya with tree growing in the middle.
Journal Club: Small, sharp blades mark shift from Middle to Later Stone Age in coastal Kenya
Archaeologists have long tried to define the transition between the two time periods.
Image credit: Ceri Shipton.
Illustration of groups of people chatting
Exploring the length of human conversations
Adam Mastroianni and Daniel Gilbert explore why conversations almost never end when people want them to.
Listen
Past PodcastsSubscribe
Panda bear hanging in a tree
How horse manure helps giant pandas tolerate cold
A study finds that giant pandas roll in horse manure to increase their cold tolerance.
Image credit: Fuwen Wei.

Similar Articles

Site Logo
Powered by HighWire
  • Submit Manuscript
  • Twitter
  • Facebook
  • RSS Feeds
  • Email Alerts

Articles

  • Current Issue
  • Special Feature Articles – Most Recent
  • List of Issues

PNAS Portals

  • Anthropology
  • Chemistry
  • Classics
  • Front Matter
  • Physics
  • Sustainability Science
  • Teaching Resources

Information

  • Authors
  • Editorial Board
  • Reviewers
  • Subscribers
  • Librarians
  • Press
  • Cozzarelli Prize
  • Site Map
  • PNAS Updates
  • FAQs
  • Accessibility Statement
  • Rights & Permissions
  • About
  • Contact

Feedback    Privacy/Legal

Copyright © 2021 National Academy of Sciences. Online ISSN 1091-6490