New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Rapid equilibrium sampling initiated from nonequilibrium data

Communicated by Hans C. Andersen, Stanford University, Stanford, CA, August 12, 2009 (received for review January 27, 2009)
Abstract
Simulating the conformational dynamics of biomolecules is extremely difficult due to the rugged nature of their free energy landscapes and multiple longlived, or metastable, states. Generalized ensemble (GE) algorithms, which have become popular in recent years, attempt to facilitate crossing between states at low temperatures by inducing a random walk in temperature space. Enthalpic barriers may be crossed more easily at high temperatures; however, entropic barriers will become more significant. This poses a problem because the dominant barriers to conformational change are entropic for many biological systems, such as the short RNA hairpin studied here. We present a new efficient algorithm for conformational sampling, called the adaptive seeding method (ASM), which uses nonequilibrium GE simulations to identify the metastable states, and seeds short simulations at constant temperature from each of them to quantitatively determine their equilibrium populations. Thus, the ASM takes advantage of the broad sampling possible with GE algorithms but generally crosses entropic barriers more efficiently during the seeding simulations at low temperature. We show that only local equilibrium is necessary for ASM, so very short seeding simulations may be used. Moreover, the ASM may be used to recover equilibrium properties from existing datasets that failed to converge, and is well suited to running on modern computer clusters.
The functions of biological macromolecules are in large part determined by their structure and dynamics. As such, many experimental techniques have been developed and applied to probe these properties, each of which has its strengths and weaknesses. Computational methods such as molecular dynamics (MD) and Monte Carlo (MC) simulations have the potential to complement such experiments by modeling the evolution of entire systems with atomic resolution. However, it is extremely difficult to obtain equilibrium sampling of even moderately sized systems in atomic simulations because of the rugged nature of the free energy landscapes that must be explored. Without adequate sampling, it is impossible to validate the parameters, or force fields, that determine the interactions between atoms or to address phenomena that occur on biologically relevant timescales.
Many methods have been developed in an attempt to address the sampling problem. Generalized ensemble (GE) algorithms, such as the replica exchange method (REM), or parallel tempering (1, 2) and simulated tempering (ST) (3, 4), are popular approaches for studying biomolecular folding (5–15). They attempt to overcome the sampling problem by inducing a random walk in temperature space while maintaining canonical sampling at each temperature. At high temperatures, energetic barriers may be crossed easily; at low temperatures, the system is generally constrained to local minima. However, recent studies have shown that GE simulations do not yield converged equilibrium sampling much faster than standard constant temperature MD if the phenomena of interest are nonArrhenius (11, 12, 15–20).
For example, Zuckerman and Lyman (17) used the Arrhenius equation to argue that the maximum efficiency gain of GE simulations is no more than an order of magnitude at physiological temperatures, and Zheng et al. (18, 19) used a kinetic network model to show that there is an optimal temperature for nonArrhenius folding kinetics, and any time spent above this temperature will decrease the efficiency of GE simulations. This lack of improvement is the result of the interplay between energy and entropy. Though high temperatures may facilitate the crossing of energetic barriers, entropic barriers will be more difficult to cross (12).
Thus, GE simulations will provide little improvement when the dominant barriers are entropic. Hansmann and coworkers (21, 22) have made some effort to improve the effectiveness of GE algorithms by optimizing the temperature spacing. However, these methods assume that diffusion in temperature space is the ratelimiting process, so crossing entropic barriers in the conformational space—which is the true ratelimiting process—is still a problem. A number of other methods also exist, such as umbrella sampling and milestoning (23). However, these methods require that the dominant reaction coordinate is known a priori, and this information is often unavailable.
The sampling problem is exacerbated by the practice of viewing global equilibration of individual trajectories as a requirement for considering a simulation to have reached equilibrium. Global equilibration is most naturally obtained by running a simulation much longer than the longest relaxation time of the system, so that all degrees of freedom are equilibrated and many uncorrelated samples are generated from each metastable state. For example, the reversible folding metric holds that a simulation has reached equilibrium if there are multiple folding and unfolding events (24). Though this criterion is sufficient, it may not be necessary. Instead of requiring global equilibration of individual trajectories, we suggest that local equilibration may be sufficient. Local equilibration may be achieved by using multiple simulations, each of which visits only a subset of the metastable states with their correct Boltzmann probabilities but that together cover the entire accessible space. Local equilibration may require significantly less wallclock time because shorter simulations (all of which may be run in parallel) are required. The main difficulty is to analyze multiple simulations appropriately.
Markov state models (MSMs) are a powerful tool that can be used to extract equilibrium properties from a dataset that satisfies the local equilibration criterion. MSMs partition phase space into metastable states such that intrastate transitions are fast, but interstate transitions are slow (25–29). Such separation of timescales ensures that the model is Markovian, that is, the probability of being in a given state at time t + Δt, where Δt is called the lag time, depends only on the state at time t. The key point is to build a model with a lag time that is shorter than the timescale of the process of interest with few enough states that it may be understood easily. Usually, MSMs are used to study kinetics, but here we only derive thermodynamic information from them. In an MSM, the time evolution of a vector representing the population of each state may be calculated by repeatedly leftmultiplying by the transition probability matrix where P(nΔt) is a vector of state populations at time nΔt, and T is the columnstochastic transition probability matrix. The first left eigenvector of the transition matrix T corresponds to the equilibrium distribution (26). This can be an advantage and a useful opportunity, because obtaining kinetics from MSMs is challenging, but obtaining only the equilibrium thermodynamic properties might be a less demanding goal as less information is required. Indeed, we find that the populations of the dominant states are invariant with respect to the lag time so that very short simulations can be used.
Here, we introduce the adaptive seeding method (ASM) and show that it rapidly yields converged thermodynamics, even when faced with entropic barriers, by exploiting many simulations at local equilibrium. This is achieved by (i) using nonequilibrium GE simulations to obtain broad sampling, (ii) building an MSM to identify all of the metastable states, as shown in Fig. 1, (iii) starting new constant temperature simulations at the temperature of interest from each metastable state in a process called seeding, and (iv) using MSMs to extract the correct equilibrium populations from the seeding simulations. Seeding short simulations from the known equilibrium distribution of alanine dipeptide has been shown to yield good models for its kinetics (30). A key advance in our new method is that it does not require that the initial sampling has reached equilibrium. We note that many nonequilibrium GE datasets have been generated due to the difficulty in reaching equilibrium, and that there is growing interest in recovering equilibrium properties from such datasets (31). Thus, one strength of the ASM is that steps 2–4 may be used to recover the correct equilibrium thermodynamic properties from a nonequilibrium dataset. Furthermore, this procedure may be iterated and combined with adaptive sampling (32) to most efficiently use one's computational resources, i.e., using the fewest and shortest trajectories necessary to achieve a good model, because minimizing wallclock time is an important consideration for computer simulations.
To test the ASM, we apply it to a small biomolecular system with long timescale dynamics: an 8nucleotide RNA hairpin (5′GCUUUUGC3′) known as the UUUU tetraloop. Hairpins are a fundamental RNA secondary structure motif (33) and perform many biologically relevant functions, but our understanding of their folding is still incomplete (34–38). The folding of this hairpin is diffusion controlled (37, 39, 40), so despite its small size, the folding time is on the microsecond time scale, as measured by laser temperature jump experiments (35). Thus, capturing a single folding event with a single MD simulation with explicit solvent would likely take more than a year on a typical CPU. ASM, however, is able to reach converged equilibrium sampling within a week using many short parallel simulations, as judged by agreement on the populations of metastable states between distinct sets of simulations started from very different initial configurations. ASM is also found to yield converged thermodynamic properties with at least 6 times less sampling than GE simulations for this system. Finally, the fact that the most highly populated metastable state has a wellformed 2basepair stem, as in the NMR structure, provides some validation of the force field. Because there is no analytical solution for the equilibrium distribution of our RNA hairpin system, we also studied a 2D potential where the equilibrium populations can be computed analytically. Using this model, we confirm that ASM is much more efficient than ST, and also provide some guidelines for choosing the optimal number and length of the seeding simulations.
Results and Discussion
Comparison of ASM to ST.
Here we compare the findings of our long ASM procedure with an equivalent amount of ST sampling, as depicted schematically in Fig. 2. We ran 2 distinct sets of simulations starting from a nearnative and randomcoil configuration, respectively, as shown in SI Appendix, Fig. S1. Thus, we are able to judge the convergence of our findings by comparing these 2 datasets.
The first step was to run an independent set of 1,000 18ns ST simulations starting from each initial configuration to obtain broad sampling. During an initial equilibration phase (first 9 ns), the weights were updated using the simulated tempering equal acceptance ratio (STEAR) method (11, 41) described in SI Appendix. This procedure was found to give nearly equal sampling of each temperature and converged weights for each dataset (SI Appendix). During the subsequent 9ns production phase, the weights were held constant. These 2 sets of ST simulations do not reach converged sampling because of their short length, but they should be able to reach all of the metastable states.
To identify the metastable states, we built an independent MSM from the production phase of each dataset. First, all of the conformations from every temperature were divided into a large number of small sets of very structurally similar, and therefore likely kinetically similar, conformations called microstates using a hierarchical Kmedoids clustering algorithm as described in SI Appendix. We then used spectral clustering (42) (PerronCluster Cluster Analysis [PCCA]; refs. 26, 43, and 44) refined with simulated annealing to lump microstates that can interconvert rapidly into larger states called metastable states, whereas conformations separated by large, free energy barriers are grouped into different states, as depicted in Fig. 1. This algorithm was developed by Chodera et al. (26) and is also described in SI Appendix. This procedure yielded 6 states for each dataset.
To obtain equilibrium sampling we then seeded simulations from each metastable state. Specifically, 100 random conformations were chosen from each metastable state and used as starting points for 10ns constant temperature MD simulations at 300 K. The equilibrium distribution was extracted by building a new MSM. A common state definition is necessary to compare different datasets, so this MSM was built using all of the seeding data. Populations with error bars for each independent dataset were then determined under the same state definition using a Bayesian method developed by Noe (45). Fig. 3A shows that the populations from each seeding dataset, as well as the combined data, are in strong agreement and are therefore converged to the equilibrium distribution. Populations for an equivalent amount of folded and coil ST data (19 ns) were also calculated by considering only those conformations at 300 K. These 2 ST datasets have not converged yet. In particular, there is a relatively obvious difference in the populations of states 2 and 4 (about 10% and 7% respectively).
MSMs are usually used to study kinetics (25). To get a reasonable number of states and ensure that the model is Markovian, a relatively long lag time must be used, although it generally ought to be shorter than the timescale of the process of interest. Furthermore, to ensure accurate kinetics, each simulation must be at least a few times longer than the lag time so that multiple crossings of each barrier may be observed. For example, Chodera et al. (26) show that a 20state MSM for the folding of the helical Fs peptide (which occurs on a timescale of tens of nanoseconds) requires a lag time of 5 ns. Thus, obtaining accurate kinetics for the UUUU tetraloop, which folds on a microsecond timescale, should likely require orders of magnitude longer simulations than for the Fs peptide. However, obtaining accurate thermodynamics may require significantly less sampling. In particular, short lag times where the system is not Markovian may still be sufficient to estimate thermodynamic properties. In fact, Fig. 4 shows that the equilibrium populations of each state are identical within statistical error regardless of the lag time. Similar observations have been made by Hummer and coworkers (29), who found that the free energy profile for a water dewetting transition can be predicted using a very short lag time at which the kinetics are not reproduced well. In addition to the error due to nonMarkovian effects, the statistical error due to insufficient sampling of transition events will also be smaller for thermodynamic properties. In a model with N states there are only N thermodynamic parameters to determine, whereas getting accurate kinetics requires determining all N^{2} pairwise transition probabilities. Sampling all possible transitions overdetermines the free energy differences between states. Thus, obtaining accurate thermodynamics may require significantly less sampling.
Minimizing the Simulation Length.
To push the limits of the ASM, we repeated the previous procedure using drastically less data (see short ASM in Fig. 2). Ten times less data were used for equilibration, 6 times less ST data were used to identify the states, and the seeding simulations were half as long. To maximize our use of this minimal data, we combined the folded and coil ST data to identify the metastable states used for seeding. Fig. 3B shows the populations obtained from this procedure compared with a reference distribution from our long ASM runs and an equivalent amount of ST data started from both folded and coil states. All of these populations were determined using the previous state definition. The populations from these short ASM runs were found to be in agreement with the previously determined equilibrium distribution, whereas the ST data deviated significantly from equilibrium. To determine the limits of ASM, we also performed the same analysis using fewer and shorter trajectories. First we held the seeding trajectory length constant at 5 ns and varied the number of trajectories initiated from each state, finding that as few as 70 trajectories from each state gave reasonable agreement with the reference distribution. We also held the number of seeding trajectories started from each state constant at 100 and varied the trajectory length, finding that as little as 2nslong seeding simulations gave reasonable agreement with the reference. Thus, our ASM method reaches equilibrium at least 6 times faster than ST. These findings suggest that the ASM is significantly more efficient than GE simulations for sampling conformational changes that are diffusion controlled, as in hairpin folding.
To address any concerns about the validity of our reference distribution, we also studied a simple model where the equilibrium populations can be computed analytically. The model is based on a discretestate system introduced by Zwanzig (46) as a simple model for protein folding (see SI Appendix for details). There are 4 metastable states in the system (folded, unfolded, and 2 intermediate states), among which the folded state is favored energetically, and the unfolded state is favored entropically (see SI Appendix, Fig. S5). This is an attractive system for testing ASM because it has nonArrhenius folding kinetics (i.e., the folding rate decreases with temperature; see SI Appendix, Fig. S7).
We compared the efficiency with which ST and ASM reach the equilibrium state populations as a function of length and number of trajectories. As shown in SI Appendix, Fig. S12, ASM converges to the correct distribution with 4–7 times shorter simulations than ST. We suggest that seeding simulations longer than the slowest intramacrostate equilibration time should always be sufficient for convergence. In practice, however, much shorter simulations may be used, as discussed previously. When using shorter simulations, one should test that independent sets of simulations started from different configurations converge to the same distribution, and that the equilibrium distribution is invariant with respect to the lag time. We also found that using more than 200 trajectories does not increase the efficiency of ST, whereas ASM continues to scale favorably with the number of trajectories—up to 600 trajectories in this example. The optimal number of simulations to run depends on one's tolerance for statistical error. Currently, an equal number of simulations are seeded from each state. In the future, however, adaptive sampling (30) could be used to start an optimal number of simulations from each metastable state to further optimize the efficiency of this method.
There are many factors contributing to the improved efficiency of ASM. By using short GE simulations to identify the metastable states, the ASM is able to exploit the ability of GE simulations to rapidly cross energetic barriers, and avoid the penalty incurred at high temperatures for entropic barriers by using seeding simulations at low temperatures. Furthermore, only short seeding simulations are necessary, because the use of MSMs requires only local, rather than global, equilibration. Global equilibration metrics, such as reversible folding, require that each simulation is long enough to cross every barrier multiple times. Local equilibration, however, may be obtained with many short simulations run in parallel because each run need only be long enough to cross a single barrier. By using MSMs to identify the metastable states, we can initiate seeding simulations from uncorrelated conformations within every metastable state and thereby ensure every barrier is crossed.
The ASM also has limitations. For example, the initial sampling has to be broad enough to identify all of the metastable states. Failure to do so will quickly become apparent because some states will be populated in one dataset but not in another. This situation may be remedied by iterating the ASM—that is, seeding ST simulations from each state to obtain broader sampling, building an MSM to identify the metastable states, and performing new constanttemperature seeding simulations. In addition, seeding simulations at physiological temperatures are only able to cross barriers on the order of a few kT. However, this should be sufficient for most biological systems. Finally, we note that the random selection of initial configurations from each state may lead to some error if the seeding simulations are not long enough. In the future, this method might be improved by choosing initial configurations from an equilibrium distribution prepared within each metastable state (47).
Examining the States.
Fig. 3B shows that the short folded ST data spent a disproportionate amount of time in state 2, whereas the coil ST data spent a disproportionate amount of time in state 4. Based on this finding, we hypothesized that state 2 is the native state, and that state 4 is a random coil state. To test this hypothesis, we extracted representative structures for each state. The representative structure for each state is the configuration with the greatest density of nearby conformations (mathematically this is the conformation with the minimal rmsd to every other conformation in the state).
Fig. 5 shows the representative structures for each state. In fact, state 2 is the folded state, having a wellformed 2basepair stem. Our ability to identify this state without including any knowledge of the native state and the fact that it is the most populated state lends credibility to the force field used, AMBER99 (48). Furthermore, state 4 is a random coil. The other states represent various collapsed nonnative states. For example, state 1 has nativelike base stacking interactions but no clear base pairing between the 2 sides of the stem. State 3 has interactions between bases 1 and 8 as well as 2 and 7, but they are stacking interactions instead of base pairing interactions. These findings are consistent with both experimental and computational work showing that small RNA hairpins have folding intermediates with contacting end residues but without wellformed stems (34, 36). Fully validating the force field will require longer simulations to get accurate kinetic predictions and more extensive comparisons with experimental observables.
Conclusions
We have introduced the adaptive seeding method (ASM) and shown that it samples significantly more efficiently than GE simulations, which have found widespread use in studying biological systems (7, 12, 14, 49), for a 2D simple potential and RNA hairpins. The ASM takes advantage of the broad sampling possible with GE methods but can more effectively cross entropic barriers using constant temperature simulations. Moreover, by requiring local equilibration rather than global equilibration, only relatively short simulations are necessary, and these simulations may be run in parallel, rendering the calculation particularly well suited to modern computing clusters. MSMs are then used to extract global equilibrium populations from these short simulations. Besides serving as an efficient sampling algorithm, the ASM may also be used to recover equilibrium properties from nonequilibrium datasets. Thus, the ASM holds great promise for validating force fields and bridging the gap between experimental and computational timescales.
In the future, we plan to apply the adaptive seeding method to larger systems. We also hope to explore alternative sampling methods for identifying initial states. For example, coarsegrained simulations could be used to identify the dominant states of a system and seed allatom MD simulations that would elucidate the atomic details of the free energy surface. Alternatively, implicit solvent simulations run at low viscosity could be used to rapidly identify the dominant states and seed explicit solvent simulations to provide more accuracy. Finally, adaptive sampling (32) with longer simulations may be used to obtain accurate kinetics from MSMs.
Materials and Methods
Two distinct sets of ST simulations were run: one started from a folded state and the other from a random coil. An independent MSM was then built for each dataset to identify the dominant metastable states. We use MSMBuilder (50) to build an MSM. At first, conformations were first split into a large number of microstates using a hierarchical Kmedoids clustering algorithm with the allheavyatom rmsd as the distance metric (e.g., we generated 1,597 microstates for long ASM seeding runs). Kinetically related microstates were then lumped together using PCCA (26, 43, 44). One hundred random conformations were then chosen from each state and used as starting points for constant temperature 300K MD simulations, still maintaining 2 distinct sets of simulations. New MSMs were built from these constant temperature datasets. A Bayesian method (46) was used to calculate the populations of each state with error bars, and the models were compared based on these values. The original ST simulations were also extended to match the sampling of the constant temperature simulations. State populations with error bars for these long ST runs were computed using bootstrapping and compared to the populations from the constant temperature simulations. More details are available in SI Appendix.
Acknowledgments
We thank Dr. John Chodera, Dr. Vince Voelz, and Daniel Ensign for their helpful insights on sampling and MSM building. X.H. acknowledges the support of Professor Michael Levitt. Support was received from the National Institutes of Health Roadmap Grant U54 GM072970 (to X.H.), the National Science Foundation Graduate Research Fellowship Program (to G.R.B.), and a Stanford Graduate Fellowship (to S.B.). This work was also funded by National Institutes of Health Grants R01GM062868 and P01GM066275. Computer resources were provided by National Science Foundation Award CNS0619926 and Folding@Home volunteers. MSMBuilder is available on SIMTK (https://simtk.org/home/msmbuilder).
Footnotes
 ^{1}To whom correspondence should be addressed. Email: pande{at}stanford.edu

Author contributions: X.H., G.R.B., S.B., and V.S.P. designed research; X.H., G.R.B., and S.B. performed research; X.H., G.R.B., and S.B. contributed new reagents/analytic tools; X.H., G.R.B., and S.B. analyzed data; and X.H., G.R.B., S.B., and V.S.P. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0909088106/DCSupplemental.
References
 ↵
 ↵
 Sugita Y,
 Okamoto Y
 ↵
 ↵
 ↵
 Zhou R,
 Berne BJ,
 Germain R
 ↵
 ↵
 Nymeyer H,
 Garcia AE
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Periole X,
 Mark AE
 ↵
 ↵
 ↵
 ↵
 Zheng W,
 et al.
 ↵
 ↵
 Nadler W,
 Hansmann UH
 ↵
 Nadler W,
 Hansmann UH
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Swope WC,
 Pitera JW,
 Suits F
 ↵
 ↵
 ↵
 Ytreberg FM,
 Zuckerman DM
 ↵
 ↵
 ↵
 ↵
 ↵
 Ma H,
 et al.
 ↵
 Ansari A,
 Kuznetsov SV,
 Shen Y
 ↵
 ↵
 ↵
 Wolstenholme GEW,
 O'Connor M
 Levitt M
 ↵
 ↵
 Ciarlet PG
 Schütte C,
 Huisinga W
 ↵
 ↵
 ↵
 ↵
 Zwanzig R
 ↵
 ↵
 ↵
 ↵
Citation Manager Formats
Article Classifications
 Physical Sciences
 Chemistry
 Biological Sciences
 Biophysics and Computational Biology
Sign up for Article Alerts
Jump to section
You May Also be Interested in
More Articles of This Classification
Physical Sciences
Chemistry
Biological Sciences
Related Content
Cited by...
 Simulation of spontaneous G protein activation reveals a new intermediate driving GDP unbinding
 Concestor kinase activation mechanism uncovers the cyclin dependence of CDK family kinases
 Protein folded states are kinetic hubs
 Predicting the reaction coordinates of millisecond lightinduced conformational changes in photoactive yellow protein