Proposed mechanism for stability of proteins to evolutionary mutations

Edited by Hans Frauenfelder, Los Alamos National Laboratory, Los Alamos, NM, and approved July 7, 1998 (received for review April 2, 1998)
Abstract
It is shown that the sequence-ordering tendencies induced by design into different fast-folding, thermally stable native structures interfere. This interference results in a type of quasi-orthogonality between optimal native structures, which divides sequence space into fast-folding, thermally stable families surrounded by slow-folding, low-stability shells. A concrete example of this effect is provided by using a simple α-carbon-type model in which a complete correspondence is established between sequence and structure. It is speculated that gaps can occur in the space of protein-like sequences, separating the sequence families and resulting in a mechanism for stability and diversity of protein sequence information.
According to energy landscape principles (1–11), proteins are distinguished from nonfolding amino acid sequences by having a rugged but funnel-like configurational energy landscape. In the simplest possible picture, this landscape is locally rugged with barriers among many local minima, whereas globally the landscape has an overall energy gradient that guides the chain toward its native configuration. When this gradient is dominant, the landscape is a deep funnel that allows the protein to fold on physiological timescales.
To design sequences with funnel-like landscapes focused on a particular target structure, it is therefore necessary to stabilize the target energetically against the ensemble of misfolded configurations (12–20). However, when a sequence has been designed into a predetermined structure, there is no guarantee that one could not arrive at a new sequence with better properties by slightly altering this structure and redesigning the sequence. Thus, to obtain the optimal sequence–structure combinations, it is necessary to anneal sequence and structure together (17–21). This results in sequence–structure combinations that could be called the modes of design for a polymer with the 20-letter amino acid code, and ideally, proteins correspond to such combinations.
To be more precise, a mode of design corresponds to a compact native structure for which, once a sequence has been optimally designed into it, one cannot obtain a less frustrated sequence by changing a small part of the structure and redesigning the sequence. Thus, when a mutation is applied to a minimally frustrated sequence, it always increases frustration, although in most cases it does not substantially change the folded structure. This results in a picture of sequence space as being populated by families, each folding to a particular coarse grained structure and each surrounded by a shell of increasingly frustrated sequences.
One of the goals of this paper is to explain how this situation occurs. We show that to achieve minimal frustration, the modes are driven apart, or “orthogonalized,” very much like the orthogonalization of memories in a neural network (22–24). Specifically, because the fastest-folding, most stable sequences are those that minimize the energy of one highly connected compact structure against all the others, a minimally frustrated sequence placed into the folded structure of the wrong sequence family will have one of the worst possible energies. Hence, the sequences and structures of the minimally frustrated modes tend to be mutually dissimilar.
We demonstrate the emergence of this orthogonality property in a simple α-carbon-type model of proteins (20) (Fig. 1), in which we have previously established a complete correspondence between sequence and structure (Fig. 2) and have determined both the folding times and folding temperatures of the sequences. The model is convenient for illustrating how structure information is stored in proteins, and the simple hydropathic interaction rule (26–29) is already sufficient to produce two minimally frustrated sequence families.
We parameterize the level of fast folding and stability of a sequence by the degree of frustration minimization (6, 7, 31) as measured by the ratio of folding to glass temperatures (6–8), Λ(p) = T_{f}(p)/T_{g}(p) (equation 1). The negative of this parameter, −Λ(p), can be used to define a landscape in sequence space, and we show that this landscape has pronounced valleys or frustration minima, each containing a family of sequences and each family folding to a different coarse-grained compact structure. Once again, these are structures for which, once a sequence has been optimally designed into them, one cannot obtain a less frustrated sequence by altering a small part of the structure and redesigning the sequence. Each optimal target structure is associated with a family of sequences that fold to it, and each family is characterized by a different tendency for ordering the residues. Because the optimal structures are substantially different, the ordering tendencies of different families must oppose, or contrast with, each other in the way that residues are patterned within a sequence. This results in a matrix of similarity parameters x^{μν}(p), which defines the degree of sequence ordering toward one minimally frustrated sequence class (μ) and against another (ν). Large values of x^{μν}(p) correspond to occupying class μ, and large values of x^{νμ}(p) correspond to occupying class ν. For intermediate values of this parameter, cancellation occurs between the ordering tendencies μ and ν, in the sense that the sequences corresponding to such regions of sequence space are highly frustrated. This produces a frustration barrier, i.e., a region of frustrated sequences between each pair of minimally frustrated families. Any stepwise mutational path between one minimally frustrated sequence family and another (32) must then visit a region of slow-folding or nonfolding sequences. This property will be clearly demonstrated for the example presented in this paper.
In the case of real proteins, the sequences in these high-frustration regions are much less likely to meet physiological requirements on foldability (of course, real physiological requirements can be much more extensive than this; refs. 32–34). If the sequences in these regions do not meet the physiological criteria, then they cannot participate in biochemical processes, which means that they will be physiologically excluded. If the requirement is sufficiently strict, the region between two families will be completely excluded, which cuts sequence space into separate fast-folding, stable parts. This provides a mechanism for partitioning protein sequence information into evolutionarily stable (31, 33), biochemically useful (foldable) subsets.
Stability to Mutations.
Implicit in the concept of sequence design is the idea that proteins must exceed a certain level of fast folding and stability to function in biochemical processes. The frustration function Λ(p) (which measures this ability) separates sequences, independent of length, into two distinct regimes (6–8). In the frustrated regime, Λ̃ < 1, the energy gap ΔE between native and nonnative (misfolded) configurations cannot be distinguished from the characteristic energy barriers δE between misfolded structures (5). This means that below the collapse (coil-globule) temperature, the chain exists in a superposition of long-lived, misfolded traps. For a frustrated sequence, the misfolded structures are substantially different from the native state, and because the energy bias ΔE is weak, any small rearrangement of the sequence can drastically alter its native structure. In the low-frustration regime, Λ is substantially greater than 1, and the energy gap between native and misfolded states is larger than the characteristic energy barriers between them. Furthermore, the configuration space of the chain is energetically correlated with the native structure (2, 9), so it is much less likely for a random mutation to cause any significant damage to the energy funnel (5, 20).
According to this “evolutionary” selection principle, a threshold value Λ_{0} can be introduced to describe the physiological criteria that sequences must meet to be biochemically useful, such that for physiologically allowed proteins Λ(p) > Λ_{0} (equation 2). A crude calculation shows that Λ_{0} should be roughly 1.5 for single folding domain proteins (45, 46). Of course, this in itself does not stabilize any structure, because it does not eliminate the possibility of evolving from one native structure into another along a pathway on which every sequence meets the requirement. Stability appears when we consider that single folding domain proteins correspond to valleys (local minima) in the landscape −Λ(p). Because the folded structures corresponding to separate valleys are substantially different, the sequence-ordering tendencies induced by design into these structures must oppose each other, so that every stepwise mutational path between one sequence family and another must encounter a region where the sequence-ordering tendencies counteract and the criterion Λ(p) > Λ_{0} may not be met.
To be more precise, consider two native configurations ν_{0} and ν_{1} of sequences p_{0} and p_{1} with nearly equal degrees of frustration, Λ(p_{0}) ≈ Λ(p_{1}) (equation 3), but with low structural similarity, as expressed by equation (4) in terms of Q(ν_{0},ν_{1}), the number of cross chain contacts common to ν_{0} and ν_{1}. Furthermore, assume that p_{0} and p_{1} are the least frustrated sequences folding to ν_{0} and ν_{1}, respectively. We can then define a sequence similarity parameter x(p) to measure the degree of ordering toward p_{1} and away from p_{0}. For simplicity, we define x(p_{0}) = 0 and x(p_{1}) = 1. The similarity parameter allows us to prescribe a minimal frustration path p(x) in sequence space, such that p(x) is the least frustrated sequence having the similarity parameter value x [hence p(0) = p_{0} and p(1) = p_{1}]. For example, in the α-carbon model discussed below, there are two sequence-ordering tendencies, characterized by the sequences 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 and 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 (where 1 and 0 stand for H and P residues, respectively). In this situation, the similarity parameter x corresponds to the degree of clustering of the H residues. More generally, x(p) is a matrix x^{μν}(p), but for two families x^{01} = x, x^{10} = 1 − x, and x^{νν} = 1.
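As a rough illustration of the clustering parameter, the sketch below computes x for the two characteristic sequences quoted above. It follows the definition given later in the text, that x(p) is the fraction of possible H—H bonds. The normalization by n_H − 1 (the number of bonds when all H residues form one segment) and the function name `hh_bond_fraction` are assumptions made for this example, not details taken from the paper.

```python
def hh_bond_fraction(seq):
    """Fraction of possible sequential H-H bonds in a binary HP sequence.

    seq: list of 0/1 values, where 1 = H (hydrophobic) and 0 = P (polar).
    Assumption: the maximum number of H-H bonds is n_H - 1, reached when
    all H residues sit in a single contiguous segment.
    """
    n_h = sum(seq)
    if n_h < 2:
        return 0.0
    # Count nearest-neighbor (along the chain) H-H pairs.
    bonds = sum(1 for a, b in zip(seq, seq[1:]) if a == 1 and b == 1)
    return bonds / (n_h - 1)


# The two ordering tendencies quoted in the text.
p0 = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
p1 = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

x0 = hh_bond_fraction(p0)  # no adjacent H residues
x1 = hh_bond_fraction(p1)  # all H residues in one segment
print(x0, x1)
```

Under this normalization the two sequences sit at the endpoints x = 0 and x = 1, consistent with the convention x(p_{0}) = 0 and x(p_{1}) = 1.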
To complete this picture, we can interpret functions of x in terms of the minimal frustration path p(x), as in equations (5)–(7). It is now clear that if we attempt to evolve p_{0} into p_{1}, the optimal trajectory from the standpoint of equation (2) is along the minimal frustration path. Nevertheless, even along this path, a region of frustrated sequences will be encountered at some x = x_{m}, where the sequence-ordering tendencies completely counteract each other and hence lose the capacity to fold sequences efficiently. Thus, because p(x) is the best path, a gap will occur, completely separating the sequence families, when the requirement Λ_{0} exceeds Λ(x_{m}).
Protein Model.
In the following sections, we present a simple concrete example of this effect in the α-carbon model of proteins described in Figs. 1 and 2. The model is essentially a continuum version of the HP model (26); however, the residues also are allowed to approach each other more closely when they are nearest neighbors in sequence (i.e., along the chain) than through contacts across the chain. Thus, there are two types of interaction potentials present in the chain. The cross chain (nonlocal) interactions between hydrophobic residues are determined by a short-range Morse potential (similar to the van der Waals potential). All cross chain interactions between other pairs of species (HP, PP) are determined by the hard core of this potential. The chain-bonded (local) interactions are defined by a “square well” potential. The effect of this potential is similar to having hard beads tethered together by string. The string bond minimum approach radius is one-half the cross chain hard core radius, which results in two structural mechanisms for maximizing the number of favorable energetic connections in the core. The first mechanism is dominated by the local interactions, and the second by nonlocal interactions. These correspond to the core structures ν_{1} and ν_{0} shown in Fig. 2. More explicit details of the model are described in a recent article (20).
The basic effect of protein folding captured by this model is that, as the chain folds, it is forced to have a clearly defined inside (core) and outside (surface) determined by the two-fold identity of its residues. The hydrophobicity of small, single folding domain proteins is peaked around one-half, so that roughly one-half of the residues are forced into the core. Lower hydrophobicity results in nonfolding sequences, whereas higher hydrophobicity leads to aggregation. We thus use a fraction of hydrophobic residues consistent with these observations (30). This level of representation of proteins is similar in spirit to many other minimalist models (3, 5, 26, 31, 35–44).
An important feature of this model is that the ground-state core geometry and energy of a sequence are determined uniquely by the set of internally clamped sequential H-segments along its length (such as H—H … H—H—H … H … H) and not by permutations of the segments within a sequence. For example, the sequence H—H—H … H … H … H—H folds into exactly the same hydrophobic core geometry as H—H … H—H—H … H … H.† For sequences that fold to the same core geometry, this is roughly true for both the folding temperature T_{f} and the folding time τ_{f}. Because the P residue chain segments can always access a significant number of configurations when the H residues are clamped in the ground state, changing the length of these P-segments should contribute mainly to the very early stages of folding and has been seen only in the fastest-folding sequences. In testing different mutations of these fast-folding sequences, we find only a small spread in T_{f} and τ_{f} within sequence families. Hence, we take the folding parameters to be essentially invariant under permutations that do not break, create, or extend the length of the H-segments, and we calculate the folding temperature T_{f} and the folding time τ_{f} for only one sequence folding to each of the 15 core structures.
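The permutation invariance described above can be sketched by reducing a sequence to the multiset of its H-segment lengths. The function name `h_segment_signature` and the choice of a single P residue for each "…" loop are assumptions made for this example. The paper does not specify the loop lengths in these two sequences.

```python
from itertools import groupby


def h_segment_signature(seq):
    """Return the sorted lengths of the contiguous H-segments in a sequence.

    seq: list of 0/1 values, where 1 = H and 0 = P.
    Two sequences with the same signature contain the same set of
    H-segments, possibly in a different order along the chain.
    """
    lengths = [len(list(group)) for key, group in groupby(seq) if key == 1]
    return tuple(sorted(lengths, reverse=True))


# The two example sequences from the text, with each "…" loop written
# as a single P residue (an illustrative choice, not from the paper).
seq_a = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]   # HHH . H . H . HH
seq_b = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]   # HH . HHH . H . H

print(h_segment_signature(seq_a))
print(h_segment_signature(seq_b))
```

Sequences sharing a signature would be treated as one representative class, which is in the spirit of computing T_{f} and τ_{f} once per core structure. Whether a signature maps one-to-one onto the 15 core structures is not something this sketch establishes.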
For convenience we represent the degree of frustration minimization by the function λ(p) of equation (8), where τ_{0} = min[τ(p,T_{f})] is the minimum folding time for all 15 representative sequences, and τ(p,T_{f}) is the folding time measured at the folding temperature. The function in the denominator of this expression, T_{f} log[τ(p,T_{f})/τ_{0}], is roughly the difference between the typical energy barrier encountered in folding the sequence and the typical energy barrier for the fastest-folding sequence.‡ λ(x) therefore is strongly correlated with the ratio of folding to glass temperatures Λ(x), but the functional form of this equation causes the degree of frustration minimization λ(p) to vary between the limits 0 and 1 [rather than 0 and ∞ as with Λ(x)].
λ(x) is plotted in terms of the similarity parameter x in Fig. 3. As expected, all three functions λ(x), T_{f}(x), and τ(x,T_{f}) exhibit two regions of minimal frustration (large λ, T_{f}, and τ_{f}^{−1}) between 0 ≤ x ≤ ½ (small sequence clustering) and ≤ x ≤1 (large sequence clustering). The minimally frustrated sequences in these two frustration valley regions fold to the ν_{0} and ν_{1} structures. The valley regions are separated by a “barrier” or saddle region at x ≡ x_{m} = . The least frustrated sequences from this barrier region fold to the FRUST geometry 2 4 (Fig. 2).
Stable Modes of the Model.
The two minimally frustrated sequence families in this model fold to structures that favor either the local (chain-bonded) or nonlocal (cross chain) interactions. The first family occurs because the mutual cross chain exposure of H residues can be maximized by minimizing the number of sequential H—H bonds. According to the interaction rule, H residues can interact across the chain only when they are not nearest neighbors in sequence. Thus, sequences like 1 0 0 1 0 1 0 1 0 1 0 0 1 0 0 1 [where 1 (0) stands for H (P) residues] maximize the number of available energetic cross chain contacts between H residues. As discussed above, the similarity parameter x(p) is the fraction of possible H—H bonds. The ground-state core symmetry for these small-x sequences is the ν_{0} structure. This symmetry is stable even when some of the H residues are joined together into sequential segments. However, when three or more sequential H residues occur within a sequence, interference is introduced between the local and nonlocal interactions, and the ground-state symmetry is broken (see Fig. 2, FRUST 1 2).
As we increase x, so that H residues are steadily bonded together into segments, a new mode develops to maximize energetic connectivity. This second mode occurs because residues connected by nearest-neighbor (string) bonds can approach each other more closely along the chain (the string bond hard core radius is 0.4) than across the chain (cross chain hard core radius ∼ 0.75), and therefore the core is able to compact itself into a smaller globule to increase cross chain contacts. This second mechanism operates in sequences with a nearly homogeneous grouping of H residues, such as 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0, which fold to the ν_{1} core structure. Although this sequence family corresponds to a frustration minimum, it is very small, leading to a much lower sequence entropy (logarithm of the number of sequences) (ref. 19) (Fig. 4).
Again, because the two minimal frustration sequence families are dissimilar in the way that H residues are distributed in sequence, a substantial number of exchange mutations (two to three) are required to change a sequence folding to ν_{0} into a sequence folding to ν_{1}. If we take a stepwise mutational trajectory between ν_{0} and ν_{1} along the least frustrated path, we must pass through a region where the sequences fold ∼10 times slower, whereas if we do not take this path, the situation is much worse. If sequences are required to fold faster and be more stable than those at the cusp λ(x_{m}) in the frustration function, i.e., if λ_{0} exceeds λ(x_{m}), then all the sequences between the two families folding to ν_{0} and ν_{1} are excluded (Fig. 4). If these were real proteins, this would mean that the sequences could not continuously evolve from one structure into the other, i.e., we would always encounter a region of sequences that do not fold on physiological timescales.
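The exclusion argument can be sketched numerically: given a frustration profile along the least frustrated path and a threshold λ_{0}, the allowed sequences form one connected interval when λ_{0} lies below the barrier value and split into two separate families when λ_{0} exceeds it. The profile values below are invented for illustration and are not read from Fig. 3. The function name `allowed_intervals` is likewise hypothetical.

```python
def allowed_intervals(xs, lam, lam0):
    """Return maximal runs (x_start, x_end) of grid points with lam > lam0.

    xs:   increasing values of the similarity parameter x along the path
    lam:  degree of frustration minimization at each x
    lam0: physiological threshold; points with lam <= lam0 are excluded
    """
    runs = []
    start = end = None
    for x, value in zip(xs, lam):
        if value > lam0:
            if start is None:
                start = x
            end = x
        elif start is not None:
            runs.append((start, end))
            start = None
    if start is not None:
        runs.append((start, end))
    return runs


# Illustrative profile only: two valleys of low frustration (large lambda)
# separated by a barrier at x = 0.5. These numbers are made up.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
lam = [0.9, 0.6, 0.3, 0.7, 0.8]

print(allowed_intervals(xs, lam, lam0=0.2))  # threshold below the barrier
print(allowed_intervals(xs, lam, lam0=0.4))  # threshold above the barrier
```

With the lower threshold the whole path is allowed, so one family can evolve continuously into the other. With the higher threshold the barrier point drops out and the path separates into two disconnected allowed regions.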
DISCUSSION
The results of this model suggest that the sequence space of single folding domain proteins is split into mutually dissimilar, low-frustration families folding to mutually dissimilar native structures. The principle by which this situation emerges is the design requirement of minimal frustration, which allows efficient folding of sequences into their functional (native) structures. Each family is characterized by a particular tendency for ordering the residues, which results naturally in a matrix of similarity parameters, x^{μ≠ν}, to describe the geometry of sequence space (ref. 48). Minimal frustration is expressed in the sense that one of these parameters can be large, whereas the rest are small; in other words, in a type of orthogonality (dissimilarity) property. At intermediate values of the parameters, the sequence-ordering tendencies of pairs of families counteract each other, resulting in saddle regions of frustrated sequences. If the physiological requirement on folding ability exceeds the folding ability of sequences in these frustrated regions, all the sequences within them will be excluded from biochemical processes, resulting in a mechanism for evolutionarily stable partitioning of sequence information into biochemically useful subsets.
Although we have focused on a highly simplified model, we have taken into account a fundamental ingredient of the protein self-interactions, namely the coupling between local and nonlocal interactions, which allows for two different mechanisms for maximizing energetic connectivity. Much more elaborate effects certainly exist in proteins because of the complex interactions between different amino acids. However, the fact that this model is capable of capturing a clear mechanism for evolutionary stability lends credibility to its comparison with proteins. Finally, it is important to point out that, although the minimal frustration path between sequence families is the optimal path from the standpoint of equation (2), real population dynamics will explore a much wider region of sequence space.
Acknowledgments
This work was supported through National Science Foundation Grants DBI 9616115 and MCB 9603839 and the Los Alamos CULAR initiative. We thank Bob Leary for very useful discussions during the completion of this work.
Footnotes

↵* To whom reprint requests should be addressed. E-mail: enelson@sdsc.edu.

This paper was submitted directly (Track II) to the Proceedings Office.

↵† The two example sequences have different arrangements of polar loop segments, and different backbone traces through the core, but the core nevertheless has exactly the same shape or geometry. Furthermore, different topologies of the chain can accommodate exactly the same geometry of hydrophobic core residues.

↵‡ The folding times in this model vary between ∼10^{6} and >10^{8} Monte Carlo steps and the folding temperatures between <0.1 and 0.95, where the inequalities indicate the limits on the capacity of our simulations.
 Received April 2, 1998.
 Copyright © 1998, The National Academy of Sciences
References
 ↵
 Onuchic J N,
 Schulten Z L,
 Wolynes P G
 ↵
 Leopold P E,
 Montal M,
 Onuchic J N
 ↵
Socci, N. D., Onuchic, J. N. & Wolynes, P. G. (1998) Proteins, in press.

 Onuchic J N,
 Socci N D,
 Schulten Z L,
 Wolynes P G
 ↵
 Nymeyer H,
 Socci N D,
 Onuchic J N
 ↵
 ↵
 Bryngelson J D,
 Wolynes P G
 ↵
 Bryngelson J D,
 Wolynes P G
 ↵

 Panchenko A,
 LutheyShulten Z,
 Wolynes PG
 ↵
 ↵
 Bowie J U,
 Luthy R,
 Eisenberg D

 Shakhnovich E I,
 Gutin A M
 ↵

 Pande V S,
 Grosberg A Y,
 Tanaka T
 ↵
 Wolynes P G
 ↵
 Nelson E D,
 Teneyck L F,
 Onuchic J N
 ↵
 Saito S,
 Sasai S,
 Yomo T
 ↵
 Hopfield J J

 Dotsenko V S
 ↵
 Friedrichs M S,
 Wolynes P G

 Goldstein R A,
 LutheySchulten Z,
 Wolynes P G
 ↵
 ↵
Hummer, G., Garde, S., Garcia, A., Pauliatis, M. & Pratt, L. (1998) Proc. Natl. Acad. Sci. USA95, in press.
 ↵
 ↵
 ↵
 ↵
 Frauenfelder H,
 Sligar S G,
 Wolynes P G
 ↵
 Anderson P W
 ↵

 Camacho C J,
 Thirumalai D

 Sali S,
 Shakhnovich E,
 Karplus M

 Boczko E M,
 Brooks C L
 ↵
 ↵
 Onuchic J N,
 Wolynes P G,
 Schulten Z L,
 Socci N D
 ↵

 Saven J G,
 Wolynes P G
 ↵