Associative memory Hamiltonians for structure prediction without homology: Alpha-helical proteins

Contributed by Peter G. Wolynes
Abstract
Energy landscape theory is used to obtain optimized energy functions for predicting protein structure without using homology information. At short sequence separation the energy functions are associative memory Hamiltonians constructed from a database of folding patterns in nonhomologous proteins, and at large separations they have the form of simple pair potentials. The lowest energy minima provide reasonably accurate tertiary structures even though no homologous proteins are included in the construction of the Hamiltonian. We also quantify the funnel-like nature of these energy functions by using free energy profiles obtained by the multiple histogram method.
When a protein folds in the test tube, the information contained in its one-dimensional sequence is transformed into the three-dimensional information of its native protein structure. It is not a surprise then that the theory of protein folding has many common themes with more abstract problems of the statistical mechanics of information processing (1). Beyond the analogies at a theoretical level, many approaches to the practical problem of protein structure prediction can profitably be viewed as connectionist schemes for learning the proper sequence–structure associations from the database of known protein structure–sequence pairs. The commonality of viewpoint between folding and machine learning is quite explicit for schemes to predict local secondary structures from sequence that use neural networks (2). Going further, using this philosophy we have developed a series of algorithms for predicting tertiary structure that are based on simulated annealing of “associative memory (AM) Hamiltonians” (3, 4). These models make very active use of statistical mechanical landscape theory and capture the notion of landscapes tunable from those with perfect funnels to nearly random rugged landscapes (5–7). We have shown that these methods work quite well when the database of input structures includes one (or more) homologs. When more than one homolog is present the predicted structures combine good elements of each homolog (8) and indeed give a more accurate structure than any of the inputs. How far can this ability to generalize be pushed? What if no protein with similar overall structure is yet known? Can even small fragments of correct structure in known examples be combined? In this paper we will describe the performance of optimized AM Hamiltonians that do not use homologs in their input. Thus these algorithms provide ab initio predictions of the three-dimensional protein structures.
We use the word ab initio not to mean starting from the underlying physicochemical forces alone, as some do, but rather as starting without knowledge of globally similar folds, the less pure but more practical meaning (9).
One crucial idea in understanding protein folding in the laboratory has been that proteins are not randomly chosen systems but are special heteropolymers: their free energy landscape is only minimally frustrated, so they fold into unique states rather than having alternate deep traps of wildly different structure. There may be exceptions to this general rule for many biomolecules: prions in nature, the Janus proteins synthesized in the laboratory (10) or, remarkably, some examples in the RNA world. Nevertheless we can use this idea in a practical way by ensuring that any energy function we use is minimally frustrated for those natural proteins that are known to fold to unique (average) low-resolution structures. To do this one must make the idea of minimal frustration a quantitative principle rather than merely a qualitative statement. This quantification involves knowing the phase diagram of the protein model and especially locating the folding and glass transitions (1). One way to do this assumes that in the vicinity of nonnative structures the landscape of natural proteins resembles that of a simple random heteropolymer. If so, the minimal frustration principle can be formulated as maximizing the energy gap between native structures and nonnative decoys, in units of the energy variance of the misfolded structures. We used this optimization procedure for AM Hamiltonians long ago (4), but it also has been used to find energy functions useful for sequence–structure alignment and to set scaling parameters in energy functions whose form is based on a priori reasoning from detailed molecular physics (11).
Unlike the situation for many machine learning problems formulated in an abstract framework, the structure space for proteins is not uniform and is quite varied: one must discriminate native folds not only from other fully collapsed structures but also from expanded ones with correct secondary structure, collapsed structures with good phase separation between hydrophobic and hydrophilic residues, etc. Different parts of the energy function determine the stability of each of these regions to varying extents. Thus implementing the minimal frustration principle involves an iterative scheme that constrains the statistics of the different classes of decoys and that self-consistently eliminates the deepest nonnative traps. We implemented such a scheme for AM Hamiltonians when homologs were included in the database (12). Our goal here is to report on the results of carrying out a similar scheme without using homologs. In addition to describing the energy function and detailed optimization scheme, we present results of simulated annealing using the resulting energy function, limiting ourselves here to a study of all alpha-helical proteins. We also show free energy profiles that quantify how funnel-like or rough are the energy landscapes that result from the optimization scheme.
Materials and Methods
AM, Contact, and Backbone Potentials.
To allow molecular dynamics simulation of the entire folding process, we use a coarse-grained description of the protein. Each amino acid residue is represented by three atoms: C_{α}, C_{β}, and O. The corresponding equations of motion for these atoms involve residue–residue or sequence-dependent interactions in addition to a backbone potential that maintains chain connectivity and correct peptide stereochemistry. Interactions between residues at short to medium sequence separations are described by AM potentials, and those between more distant pairs by a series of piecewise contact potentials whose forms are chosen to roughly mimic the behavior of long-range pair correlations for C_{β}s. The AM potential is based on correlations between a target's sequence and the sequence–structure patterns in a set μ of memory proteins. The pairs in the target and in the memory are first associated by using a sequence–structure threading algorithm (4, 8), and in the present ab initio folding study the memory proteins contain no protein homologous to the targets (see Appendix). Table 1 lists the highest-Q memory protein for each target. Thus only fragmentary patterns, local in sequence, are expected to be found by the threading procedure. The energy parameters γ encode similarity between residue pairs i and j in the target and the aligned pairs i′ and j′ in the memory proteins. We use a simplified, four-letter code {P_{i}} to represent the 20 naturally occurring amino acids. The AM potential encoding these sequence–structure patterns is given by V_{AM} = −Σ_{μ} Σ_{i<j} γ(P_{i}, P_{j}, P_{i′}, P_{j′}) Θ(r_{ij} − r_{i′j′}^{μ}), where the structural similarity is measured by Gaussian functions Θ(r_{ij} − r_{i′j′}^{μ}) = exp[−(r_{ij} − r_{i′j′}^{μ})²/2σ_{ij}²]. The parameters {γ} are learned by the optimization procedure. Between nonadjacent residues the r_{ij} distances are taken only between the C_{α} and C_{β} atoms on each residue, which gives rise to four interactions per residue pair. Different γ values are used for the two proximity classes: short range (j − i < 5) and medium range (5 ≤ j − i ≤ 12).
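The Gaussian memory term above can be sketched in a few lines. The γ and σ values below are hypothetical placeholders, not optimized parameters from the training procedure:

```python
import numpy as np

def am_energy(r_target, r_memory, gamma, sigma=1.0):
    """One associative-memory term: a Gaussian reward when the target
    pair distance r_ij matches the aligned memory pair distance r_i'j'.
    gamma is the (learned) interaction weight; sigma is the Gaussian width."""
    theta = np.exp(-(r_target - r_memory) ** 2 / (2.0 * sigma ** 2))
    return -gamma * theta

# A matching pair distance earns the full (negative) reward;
# a badly mismatched one contributes almost nothing.
print(am_energy(6.0, 6.0, 1.0))   # -1.0
print(am_energy(6.0, 12.0, 1.0))  # close to 0
```

The full potential sums such terms over all memory proteins μ and all aligned pairs, so a fragment of local structure seen in any memory can contribute a partial bias.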
The specific amino acids in each category are hydrophilic (Ala, Gly, Pro, Ser, Thr), hydrophobic (Cys, Ile, Leu, Met, Phe, Trp, Tyr, Val), acidic (Asn, Asp, Gln, Glu), and basic (Arg, His, Lys). Interactions between residue pairs distant in sequence (j − i ≥ 13) are described by pair potentials between pairs of C_{β} atoms, V_{long}(P_{i}, P_{j}, r_{ij}) = Σ_{k=1}^{3} c_{k}(N)γ_{k}(P_{i}, P_{j})U(r_{ij}, r_{k}), which are approximated by three smoothed square wells covering the regions 4.5 < r_{ij} < 8.0, 8.0 < r_{ij} < 10, and 10 < r_{ij} < 15 (units of Å). The precise form of U(r_{ij}, r_{k}) is U(r_{ij}, r_{k}) = ¼[1 + tanh(σ(r_{ij} − r_{k}^{min}))][1 + tanh(σ(r_{k}^{max} − r_{ij}))], where σ controls the sharpness of the potential boundaries and r_{k}^{min,max} are the endpoints of the intervals. The contact potential includes an additional scaling, c_{k}(N) = 1/(Na_{k} + b_{k}), to account for the variation in the number of contacts over the three contact wells. c_{k}(N) is found by fitting the number of contacts in each of the regions as a function of the sequence length of the target proteins; the parameters are given in Table 2.
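A minimal sketch of one smoothed well, assuming the tanh edge form given above; the sharpness σ = 3 Å⁻¹ is an illustrative choice, not a fitted parameter:

```python
import numpy as np

def smoothed_well(r, r_min, r_max, sigma=3.0):
    """Smoothed square well U(r; r_k): approximately 1 inside
    [r_min, r_max] and 0 outside, with tanh edges of sharpness sigma."""
    return 0.25 * (1.0 + np.tanh(sigma * (r - r_min))) \
                * (1.0 + np.tanh(sigma * (r_max - r)))

# Deep inside the 4.5-8.0 A well the weight is ~1; far outside it is ~0.
print(smoothed_well(6.0, 4.5, 8.0))   # ~1
print(smoothed_well(14.0, 4.5, 8.0))  # ~0
```

Because the edges are smooth rather than step functions, the contact potential contributes continuous forces suitable for molecular dynamics.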
The backbone potential, described in detail elsewhere (13, 14), has been updated to include a periodic torsion potential (V_{ΦΨ}) that provides a better fit to the backbone torsion angles observed in a recent Ramachandran map for nonglycine residues in well-resolved x-ray structures (15). The total potential used in the molecular dynamics simulations is V_{total} = V_{AM} + V_{contact} + V_{ΦΨ} + V_{χ} + V_{ex} + V_{harm}, where V_{χ} is a chirality potential that biases toward L-amino acid chirality, and V_{ex} are the excluded volume potentials applied to nonbonded carbon and oxygen atoms that approach within 3.5 Å for (j − i) < 5 and 4.5 Å for (j − i) ≥ 5. V_{harm} is the sum of three quadratic potentials that are used, along with a series of shake (16) constraints, to provide backbone rigidity, maintain the planarity of the peptide bond, and maintain the appropriate bond angles. The sequence-dependent potentials, V_{AM} and V_{contact}, are simultaneously optimized as described below. The energy parameters γ have been scaled so that the average value of the native-state energy per residue per interaction over the training set is 1. The weights of the backbone terms, listed in Table 2, have been chosen empirically.
Constrained SelfConsistent Optimization.
The simplest statistical mechanical treatment of the phase diagram depends on only a few average properties of native structures and globules. The glass transition temperature is set by the energetic variance of the misfolded ensemble, T_{g} ≃ √(ΔE²/2S_{c}), where S_{c} is the configurational entropy of the globule states. The collapse temperature depends on the mean energy of the globule states, T_{c} ≃ |〈E〉_{mg}|/S_{c}. The folding temperature T_{f} is given by the ratio of the energy difference between the native state and the globules to the entropy, T_{f} ≃ δE/S_{c}. More elaborate polymer theoretical estimates of compact globules suggest this is a good approximation (17). We find these collapsed structures do have some native-like components: the conformations in the molten globule ensemble have an average Q value of 0.2 and a radius of gyration R_{g} of 1.2. The unconstrained maximization of T_{f}/T_{g} is equivalent to maximizing the ratio R = δE/ΔE.
In the AM Hamiltonians the potentials are a sum of terms ξ_{i}, each representing a basic form of interaction. In the present study, the ξ_{i}s depend on amino acid class and on the proximity of two amino acids in the sequence, as described above. If the interactions are weighted by linear parameters γ_{i}, the energy gap and variance (and the corresponding temperatures) can be expressed simply as δE = Aγ and ΔE² = γBγ. A and γ are vectors of dimensionality equal to the number of interaction types, and B is a matrix given by B_{ij} = 〈ξ_{i}ξ_{j}〉_{mg} − 〈ξ_{i}〉_{mg}〈ξ_{j}〉_{mg}. These averages depend on the frequencies at which any given interaction occurs in the molten globule and native configurations. Maximizing the energy ratio amounts to varying the interaction weights γ_{i} and leads to an optimization problem that can be solved by straightforward linear algebra: γ = B^{−1}A, up to a scalar multiple.
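The linear-algebra step can be illustrated with toy data. The gap vector A and the decoy interaction counts below are random placeholders standing in for statistics gathered from real native and molten globule ensembles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: A_i plays the role of the per-interaction energy gap,
# and B is the covariance of interaction counts over decoy structures.
n = 4
A = rng.normal(size=n)
X = rng.normal(size=(50, n))      # 50 decoys, n interaction types
B = np.cov(X, rowvar=False)

# Maximizing delta_E / sqrt(gamma B gamma) gives gamma proportional
# to B^{-1} A; solve the linear system rather than forming B^{-1}.
gamma = np.linalg.solve(B, A)

# Stationarity condition: B gamma = A (up to an overall scale).
print(np.allclose(B @ gamma, A))  # True
```

Solving the system directly (rather than inverting B) is the standard numerically stable route; the overall scale of γ is fixed separately by the native-energy normalization described earlier.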
The mean energy of the molten globule distribution (and the corresponding collapse temperature) is a linear function of the interaction weights, 〈E〉_{mg} = A′γ.
For off-lattice models, unlike many lattice model studies of sequence design and folding kinetics, which often concentrate on fully collapsed structures alone, efficient folding via molecular dynamics simulation requires a more complete statistical mechanical treatment of the phase diagram. These better approximations must account for the existence of partially ordered ensembles of states with varying degrees of collapse and secondary structure formation (4, 12). So that such states do not become competitive with the native state energy, the contributions to the mean energy of the globules from interactions in each proximity class are constrained. In models with interactions of different ranges there also can be different glass transition temperatures associated with structures on different length scales. This behavior is predicted by the generalized random energy model of Derrida (18), which has been used for random heteropolymers with contact potentials (17). It is necessary to constrain the variances, as well as the means, of the subensembles; otherwise overly large interactions in the short sequence range could lead, for example, to dynamical freezing of short-range structure at a temperature that is high relative to the T_{g} for fixing structures involving the more distant in sequence interactions (19). The mean energy and variance of the molten globule distribution, expressed in terms of contributions from the short-, intermediate-, and long-range interactions, are fixed by imposing linear constraints in the optimization functional O(γ) = Aγ − Σ_{k} λ_{k}(A′_{k}γ − c_{k}) − Σ_{k} λ′_{k}(γB′_{k}γ − c′_{k}), where the sums run over the proximity classes, the first two terms in the sums correspond to the secondary and supersecondary interactions, λ_{k} and λ′_{k} are the Lagrange multipliers, and c_{k}, c′_{k} the constraint values. Maximizing this functional is equivalent to maximizing the folding temperature (energy gap) while fixing the collapse and glass transition temperatures of the subensembles.
In writing the functional we have ignored correlations between the various interaction classes; indeed these correlations are so small that they are hard to determine statistically by sampling. The constraints are chosen so that the energy of any molten globule configuration is evenly distributed among the length scales (20). Because the globules have flickering, native-like elements, the glass transition temperature T_{g} is estimated from the variance B′, which contains only the nonnative part of the energy of any molten globule configuration. Projecting out native-like contacts is consistent with the assumptions of the random energy model estimate of T_{g}. Constrained optimization leads to the simple variational equation γ = 〈B′〉^{−1}(〈A〉 − Σ_{k} λ_{k}〈A′_{k}〉), where 〈 〉 indicates an average over a set of training proteins. The ensemble of compact misfolded structures for each training protein is generated initially from translations of the training sequences along a database of unrelated structures; subsequent iterations are generated by molecular dynamics. Because the misfolded structures are partially ordered and have a tendency to satisfy any especially large interaction energy term, the variational equation is solved iteratively to obtain the interaction weights γ_{(n)}. In this self-consistent optimization the low-energy misfolded structures are generated through molecular dynamics simulations by using the γ_{(n−1)} values from the previous round. The ensembles are generated in constant temperature simulations, and the structures are censored to have Q < 0.4. Each round of optimization combines the interaction parameters from the previous optimization by a simple average, γ′_{n} = ɛγ_{n−1} + (1 − ɛ)γ_{n}. This is analogous to conjugate gradient optimization.
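The damped parameter update between rounds is a one-liner; the ε = 0.5 below is an illustrative mixing fraction, not the value used in the study:

```python
def mix(gamma_prev, gamma_new, eps=0.5):
    """Damped update between self-consistent optimization rounds:
    gamma'_n = eps * gamma_{n-1} + (1 - eps) * gamma_n."""
    return [eps * a + (1.0 - eps) * b for a, b in zip(gamma_prev, gamma_new)]

print(mix([1.0, 0.0], [0.0, 1.0]))  # [0.5, 0.5]
```

Mixing the old and new weights keeps any single round of decoy generation from swinging the parameters too far, which helps the iteration settle rather than oscillate.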
At each step n in the constrained self-consistent optimization, the energy gaps and variances of the molten globule states obtained in constant temperature molecular dynamics simulations with the γ_{n−1} iterate energy parameters were evaluated. The optimized interaction weights γ are the solutions of the matrix linear algebra equation (12) for the training set and the current set of misfolded states. The AM Hamiltonian with two proximity classes has 512 interaction weights, and the contact potential with three proximity classes has an additional 30. For a given set of training proteins and a given misfolded ensemble, some of these interactions may be sampled only rarely because of practical limitations on the size of the training set and the set of misfolded states. To avoid attaching an erroneously large weight to such noisy interactions, we have filtered out the modes of the B′ matrix with very small variance: B′^{−1} ≈ M(λ*)^{−1}M^{T}. Here M is the matrix of eigenvectors of the unfiltered B′ matrix, and (λ*)^{−1} is the diagonal matrix obtained from the diagonal matrix of eigenvalues by inverting each eigenvalue and zeroing out those below a cutoff. By setting the inverses of small eigenvalues equal to zero, we ignore these modes.
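The eigenvalue filtering amounts to a truncated pseudo-inverse. A minimal sketch, with an arbitrary illustrative cutoff and a toy 3 × 3 covariance whose last mode is barely sampled:

```python
import numpy as np

def filtered_inverse(Bp, cutoff=1e-6):
    """Invert a symmetric matrix B' through its eigenmodes, zeroing
    the inverse of any eigenvalue below cutoff (rarely sampled modes)."""
    w, M = np.linalg.eigh(Bp)
    inv_w = np.where(w > cutoff, 1.0 / np.maximum(w, cutoff), 0.0)
    return M @ np.diag(inv_w) @ M.T

Bp = np.array([[2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 1e-9]])   # last interaction barely sampled
Binv = filtered_inverse(Bp)
print(Binv[2, 2])  # ~0 -- the noisy mode is ignored
```

Without the cutoff, the 10⁻⁹ eigenvalue would be inverted into a weight of order 10⁹, exactly the erroneously large parameter the filtering is designed to prevent.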
Results
The interaction weights γ were optimized by using a set of well-resolved Protein Data Bank structures from the class of alpha-helical proteins. Ten training proteins were chosen from the most populated fold architectures (topologies) in the CATH (21) database: non-bundle [1r69 (434 repressor), 1utg (uteroglobin), 3icb (calbindin), 4cpv (parvalbumin), 1ccr (cytochrome c), 1mba (globin), 1rgp (G protein GTPase-activation domain)] and bundle [2mhr (four-helix bundle), 256ba (four-helix bundle), 2fha (ferritin)]. The 10 training proteins were aligned to a set of 36 alpha-helical proteins (see Appendix) by using the threading algorithm of Koretke et al. (8). Individual memory sets were modified so that each one contained no proteins homologous to the training protein used. The alignments of the training proteins to the corresponding sets of memory proteins constitute the AM Hamiltonian.
The gaps and variances generated with the last round of iteration are shown in Fig. 1. Consistent with the imposed constraints, the energy gap between the folded state and the mean of the molten globule distribution is divided roughly equally between the short-range and long-range interactions.
Structure Prediction Without Homologs via Molecular Dynamics
Starting from extended configurations with randomized φ, ψ angles, the 10 training set proteins and three proteins that were not part of the training set were annealed by molecular dynamics using the energy parameters corresponding to Fig. 1. With the standard annealing protocol, each run takes approximately 6 h on an SGI Origin 200 workstation. The results seem to be well equilibrated, so shorter runs may do as well. We measure the progress of the molecular dynamics trajectories by means of two order parameters, Q and Q_{cut}. Q is the fraction of native C_{α} pair distances reproduced, Q = [2/((N − 1)(N − 2))] Σ_{j>i+1} exp[−(r_{ij} − r_{ij}^{N})²/2σ_{ij}²], and Q_{cut} is the same fraction counting only pairs within some cutoff distance r_{c}. Here r_{ij} refers to C_{α} distances, r_{ij}^{N} to the corresponding native distances, and we have chosen r_{c} = 8.0 Å. The advantage of Q_{cut} is its similarity to typical contact order parameters, such as have been used in lattice studies. The off-lattice Q, on the other hand, includes pairs that are separated by large distances and is sensitive to domain rotations and other distortions.
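As a sketch, Q and Q_cut can be computed from pair-distance vectors as follows; the three-pair example and the uniform σ = 1 Å width are hypothetical simplifications (in practice σ_ij grows with sequence separation):

```python
import numpy as np

def q_value(r, r_native, sigma=1.0, r_cut=None):
    """Gaussian-weighted fraction of native-like pair distances.
    r and r_native are condensed vectors of C-alpha pair distances;
    with r_cut set, only native pairs inside the cutoff count (Q_cut)."""
    mask = np.ones_like(r_native, dtype=bool) if r_cut is None \
        else (r_native < r_cut)
    sim = np.exp(-(r[mask] - r_native[mask]) ** 2 / (2.0 * sigma ** 2))
    return sim.mean()

r_nat = np.array([5.0, 7.0, 12.0])
print(q_value(r_nat, r_nat))            # 1.0 for the native structure itself
print(q_value(r_nat + 10.0, r_nat))     # near 0 when every pair is wrong
print(q_value(r_nat, r_nat, r_cut=8.0)) # 1.0, scored on the 2 close pairs
```

Because each pair contributes a Gaussian rather than a hard hit-or-miss count, Q varies smoothly along a trajectory and can serve as a reaction coordinate.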
The Q of the best structure in each of the runs performed for all 10 training proteins is given in Fig. 2. The data in the figure indicate that simulated annealing using the energy function is more successful on shorter proteins than on longer ones. Fig. 3 presents sample folding and unfolding trajectories for 434 repressor, of length 63, and myoglobin, of length 146. The folding trajectory of myoglobin is typical of the longer training proteins, in that collapse to nearly the final value of Q occurs at a relatively high temperature. Moreover, the size of the fluctuations in Q is somewhat smaller than in the case of the repressor. Both of these observations suggest that the collapse characteristics of the energy function are not yet optimal for longer proteins. Even so, comparison of the folding and unfolding trajectories shows that the potential yields more native-like structures with energies comparable to those of the best structures in the folding run. Although it has lower overall quality, the Q_{best} structure still has a Q of over 0.4 for up to half of the molecule.
For the test set we chose three alpha-helical targets from the CASP3 structure prediction experiment (22) that were rated as moderately difficult: 1bg8(a), 1jwe, and 1bqv. Again, no homologous proteins were used in the memory set; furthermore, these proteins are not homologs of any of those used in training. As shown in Table 3, our method gives substantially correct structures for these proteins. The superpositions of predicted and native structures in Fig. 4 indicate that the correct topology has been achieved over the majority of the protein in all cases. The Q_{best} structures appearing in Table 3 are typically sampled near the folding temperature, and further annealing degrades the overall structure, as shown in the folding trajectories in Fig. 3. The structures observed at low temperatures typically have energies lower than that of the x-ray structure.
At any given temperature the structure of the energy landscape may be explored more quantitatively by examining the total free energy and energy as functions of Q_{cut}. Using a multiple-histogram sampling technique (23), we computed the free energy profiles for our optimized model. For comparison we also calculated the free energy surface for a Gō-like model with the same backbone, which stabilizes only native contacts (24). The Gō model possesses a nearly ideal, funneled landscape (14). The results of this analysis at a temperature near the folding temperature are shown in Fig. 5. In both cases, the total energy is funneled toward the native state, Q_{cut} = 1. As expected, the Gō model is smoothly funneled over the entire range. The energy function optimized for structure prediction yields a more caldera-like landscape. For the training protein 1r69 the landscape is funneled only to about Q_{cut} = 0.6, after which the energy profile levels off. For 1bg8(a), the deviation from a perfect funnel is more dramatic, with the energy actually increasing somewhat as higher-Q states are sampled.
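The profile construction can be sketched at a single temperature as F(Q) = −T ln P(Q); the toy Beta-distributed ensemble below stands in for real simulation data, and the multiple-histogram combination of runs at several temperatures used in the paper is not reproduced here:

```python
import numpy as np

def free_energy_profile(q_samples, temperature, bins=10):
    """Estimate F(Q) = -T ln P(Q) from sampled Q values at one
    temperature (a single-histogram version of the profile)."""
    hist, edges = np.histogram(q_samples, bins=bins, range=(0.0, 1.0),
                               density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        F = -temperature * np.log(hist)   # unvisited bins get F = +inf
    return centers, F

rng = np.random.default_rng(1)
q = rng.beta(8, 4, size=5000)             # toy ensemble peaked near Q ~ 0.7
centers, F = free_energy_profile(q, temperature=1.0)
print(centers[np.argmin(F)])              # the minimum sits near the peak of P(Q)
```

Combining histograms from several temperatures (as in the multiple-histogram method) extends the reliable range of such a profile beyond the Q values well sampled at any one temperature.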
Conclusion
The results of this paper show that the present self-consistently optimized AM Hamiltonian allows the ab initio structure prediction of small alpha-helical proteins via molecular dynamics with simulated annealing. Preliminary investigation of a similar treatment for α/β and all-β proteins already yields results of similar quality. The ability to predict correct overall structures without using homology information was achieved by introducing a finer division of amino acid classes beyond simple hydrophobicity, as well as constraints to control the energetic balance between short-, intermediate-, and long-range interactions. The degradation in the quality of prediction by simulated annealing with increasing sequence length might be addressed simply by splitting the training set and optimizing separately for longer proteins; it may also reflect an important role of nonadditive forces in collapse. We see that the AM Hamiltonian framework provides an approach that allows the harmonious marriage of threading and ab initio strategies for protein structure prediction.
Acknowledgments
We thank Jose Onuchic for helpful remarks about the manuscript. Some of the computations used here were carried out at the National Center for Supercomputing Applications in Urbana, IL. This research was supported in part by National Institutes of Health Grant PHS 2R01GM44557.
Appendix
The 10 alpha-helical training proteins (1r69, 1utg, 3icb, 256ba, 4cpv, 1ccr, 2mhr, 1mba, 2fha, 1rgp) varied in length; the shortest, 1r69, has 63 amino acids. They were selected to represent the various classes of well-resolved x-ray structures appearing in the PDB Select 95 list (25). After removing structures determined by NMR, those with resolution worse than 2.0 Å, and those with length greater than 200, pairwise alignments of the remaining proteins were conducted, and the list was iteratively processed to eliminate any protein with a Q greater than 0.5 to any other protein in the list. This resulted in a list of 38 proteins from which the memory proteins for the AM Hamiltonian potentials were selected. The selection process eliminated any memory protein with structural overlap Q > 0.4 to any of the training proteins. The 38 memory proteins were: 1a17, 1a28a, 1aa7b, 1aep, 1ah7, 1ail, 1ak0, 1au1a, 1avsb, 1axda, 1b4fh, 1baj, 1beo, 1bgf, 1bjaa, 1bl0a, 1c3d, 1cf7, 1cola, 1e2aa, 1hiws, 1hula, 1huw, 1jhga, 1kxu, 1lbd, 1lis, 1lki, 1nsgb, 1pbv, 1rcb, 1szt, 1tx4a, 1vin, 256ba, 2a0b, 2abk, 5icb.
Footnotes

¶ To whom reprint requests should be addressed.

Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.230432197.

Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.230432197
Abbreviation: AM, associative memory.
 Accepted September 11, 2000.
 Copyright © 2000, The National Academy of Sciences
References
Peliti, L. & Wolynes, P. G.
Bohr, H. & Brunak, S.
Friedrichs, M. S. & Wolynes, P. G.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes, P. G.
Leopold, P. E., Montal, M. & Onuchic, J. N.
Onuchic, J. N., Luthey-Schulten, Z. & Wolynes, P. G.
Simons, K. T., Bonneau, R., Ruczinski, I. & Baker, D.
Lee, J., Liwo, A., Ripoll, D. R., Pillardy, J. & Scheraga, H. A.
Koretke, K. K., Luthey-Schulten, Z. & Wolynes, P. G.
Derrida, B.
Orengo, C. A., Bray, J. E., Hubbard, T., LoConte, L. & Sillitoe, I.