## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# A self-learning algorithm for biased molecular dynamics

Contributed by Michele Parrinello, August 12, 2010 (sent for review June 21, 2010)

## Abstract

A new self-learning algorithm for accelerated dynamics, reconnaissance metadynamics, is proposed that is able to work with a very large number of collective coordinates. Acceleration of the dynamics is achieved by constructing a bias potential in terms of a patchwork of one-dimensional, locally valid collective coordinates. These collective coordinates are obtained from trajectory analyses so that they adapt to any new features encountered during the simulation. We show how this methodology can be used to enhance sampling in real chemical systems citing examples both from the physics of clusters and from the biological sciences.

Many chemical systems, notably those in condensed matter and in biology, are characterized by the presence of multiple low energy states that are separated by large barriers. The presence of these barriers prevents exploration of all of configuration space during the relatively short time scales accessible in molecular dynamics (MD) simulations. Typically this means that only those configurations in a small, locally ergodic region in the vicinity of the input structure are visited.

A large number of methods have been put forward to overcome this difficulty, many of which either use some form of enhanced sampling (1–6) or focus on the transition from one local minimum to another (7–10). Other methods recognize that a small number of degrees of freedom (collective variables) accurately describe the interesting transitions and so either raise the temperature of these degrees of freedom (11–13) or introduce a bias to enhance the sampling along them (14–22). The major differences between the various approaches in this last class is the way in which the bias is generated, a particularly useful technique being to use a bias that is dependent on the previously visited trajectory. This is the basis of the metadynamics method (23, 24) that has been introduced by our group and applied to a large variety of chemical problems (25).

For many chemical systems the interesting, reactive processes take place in a relatively low-dimensional space (26, 27). However, it is often not immediately obvious how to identify a set of collective variables (CVs) that span this “reactive” subspace. Furthermore, in methods such as metadynamics, the presence of barriers in the transverse degrees of freedom leads to incomplete sampling. The obvious solution therefore is to use very large numbers of collective coordinates. However, although this is theoretically possible with methods such as metadynamics, it is impractical because the volume of bias that one must add to the free energy surface, and hence the length of the simulation, increases exponentially with dimensionality. One suggestion for resolving this issue is to run multiple short, 1D metadynamics simulations in parallel with different collective coordinates and to allow swaps between the different realizations based on a Monte Carlo criterion (28). Here we propose an alternative solution based on the realization that, if the free energy surface (FES) is to be flattened, the majority of the bias will have to be added at or near the minima in the surface. Identifying the locations of these minima is straightforward as, during a dynamical trajectory, the system should spend the majority of its time trapped in the vicinity of one or more of them (29). Therefore by using a form of “smart” bias that targets these low free energy regions specifically, we can force the system away from them and into unexplored areas of configuration space. We call the self-learning algorithm that we have developed based on these ideas Reconnaissance Metadynamics and have implemented it in the plugin for molecular dynamics PLUMED (24). In what follows we demonstrate this algorithm on two different systems—a cluster of seven Lennard–Jones atoms and a short protein.

## Background

Before introducing our method, a brief survey of established techniques for dealing with complex energy surfaces in terms of very large numbers of collective coordinates is in order. Zhu et al. (30, 31) have introduced a method that uses a variable transformation to reduce barriers and thereby increase sampling, which works with a large number of collective coordinates. Problematically though this method does not work if there are hydrogen bonds present. In contrast, methods that work by thermostating the CVs at a higher temperature (11–13) suffer no such problems and have been used successfully with large numbers of CVs (32). However, in these methods, unlike metadynamics, there is nothing that prevents the system from revisiting configurations, which could prove problematic for examinations of glassy landscapes with many local minima of equal likelihood.

For many free energy methods explicitly including a large number of collective coordinates is not feasible. However, one can use collective coordinates that describe a collective motion that involves many degrees of freedom. For example, one can use the principal components of the covariance matrix of a large set of collective coordinates, the values of which have been calculated over a short MD trajectory (33, 34). Alternatively, one can take nonlinear combinations by defining a path in the high-dimensionality CV space (35). The distance along and the distance from this path span a low-dimensional, nonlinear space, and metadynamics simulations using these two CVs have been shown to work well. However, a great deal of insight is required in choosing an initial path from which the CVs are generated as the simulation can provide meaningful insight only for the region of configuration space in the immediate vicinity of this path.

Entropy plays an important role in free energy surfaces as it can wash out potential energy minima and make it such that finite temperature equilibrium states do not correspond to minima in the potential energy surface. Nevertheless, for systems with deep minima in the potential energy surface, one can use algorithms that locate all the minima on the 0-K (potential) energy surface and assume that the properties at every point on this surface are the same as those of the nearest local minima safe in the knowledge that the entropic effects are small. These algorithms allow one to divide up configuration space and map every point to its appropriate local minimum using a minimization algorithm (36). Furthermore, the slow modes in the vicinity of each basin provide good local collective coordinates as in the vicinity of basins the system is very nearly harmonic. From a patchwork of such descriptors one could conceive of ways to obtain globally-non-linear CVs. Recently, Kushima et al. (37) have developed a self-learning, metadynamics-based algorithm based on these ideas that works by minimizing the energy and adding bias functions at the position of the minimum found so that subsequent minimizations will identify new minima.

Ideas described above for the exploration of potential energy surfaces cannot be straightforwardly transferred to the study of free energy surface because of the difficulties associated with the calculation of derivatives at finite temperature. Nonetheless, Gaussian mixture umbrella sampling (GAMUS) (29) is a method that has some similarities to the 0-K method developed by Kushima et al. In GAMUS the kinetic traps that are preventing free diffusion are located by fitting the probability distribution of visited configurations with a Gaussian mixture (GM) model. The resulting set of bespoke Gaussians are then used to update an adaptive bias that encourages the system to visit unexplored regions of configuration space. This adaptive approach accelerates the filling of basins and thus provides a considerable speed up over conventional metadynamics when one is using three or four collective variables. Nevertheless, the filling time will still increase exponentially with the number of CVs.

## Reconnaissance Metadynamics Algorithm

Reconnaissance metadynamics combines a number of ideas in order to be effective with very large numbers of CVs. The bias potential is constructed in terms of a patchwork of basins each of which corresponds to a low free energy region in the underlying FES. These features/basins are recognized dynamically by periodically analyzing the trajectory with a sophisticated clustering strategy. The region of configuration space in the vicinity of each basin is then described using a one-dimensional CV that is tuned using information collected during the clustering. Consequentially, even when the overall number of collective coordinates (*d*) is large, depressions are compensated for rapidly because the bias is added in a locally valid, low-dimensional space.

Cluster analyses are performed at regular intervals using a set of stored configurations for the CVs that are accumulated from the trajectory. During this analysis it is essential that some form of dimensionality reduction be performed as otherwise the fitting will be intractable. In addition, one must recognize that, because this is a dynamical trajectory, the system may well have hopped between different basins on the free energy surface (see Fig. 1). Therefore, because we would ideally like to treat each basin separately, principal component analysis (PCA) is not an option. Furthermore, Gaussian mixture expectation maximization algorithms (29), although able to separate the basins, will become unstable when the number of collective coordinates is large. Thankfully, however, combinations of these two algorithms exist (38–40) that allow us to cluster the data while simultaneously reducing the dimensionality.

This clustering strategy provides us with a set of Gaussian centers (** μ**) and covariance matrices (

**C**) for the various basins in the free energy surface. Some of these will have very low weights or will be very similar to previously encountered basins and can thus be safely discarded. Those remaining provide useful information on the local topology of the FES but cannot be used to predict the actual depth of the basin or its shape away from the center. Consequentially, a flexible biassing strategy must be used in the vicinity of the minimum as addition of a single Gaussian will not necessarily compensate for the depression in free energy. This failure to compensate basins fully can lead to the formation of spurious low energy features in the region surrounding the basin center (see Fig. 1

*C*), which is a problem that becomes more severe as the dimensionality is increased. To resolve these issues we assume that the basin is spherically symmetric in the metric induced by

**C**and, in the spirit of metadynamics, construct an adaptive bias composed of small Gaussian hills of height

*w*

_{i}and width Δ

*r*, along a single, radial collective coordinate

*r*(

**s**) (see Eq.

**1**). [1][2]In the above

**s**is a vector that denotes the position in the full,

*d*-dimensional CV space. The form of Eq.

**1**means that the bias associated with each hill acts in a hyper-ellipsoidal-crust-shaped region in the full,

*d*-dimensional CV space. Each of these crusts has a shape much like a layer in an onion, and so the integral of the bias added increases with

*r*(

**s**). Consequentially, the filling time for each basin no longer depends exponentially on the number of collective coordinates. Furthermore, the hills have a shape that is consistent with the underlying anisotropy of the free energy surface because of the use of the covariance matrix in Eq.

**2**.

Obviously the distance from a basin center is only a good collective coordinate when we are close to that center so each basin must have a size *S*. This size assigns the region of configurational space in which it is reasonable to add hills that will force this system away from a particular basin. Setting an initial value for this size is straightforward as we know from our fitting that the basin’s shape is well described by a multivariate Gaussian. Therefore, we choose an initial size equal to because, as shown in *SI Text*, when the angular dependency of a *d*-dimensional Gaussian is integrated out the resulting distribution of *r* is approximately equal to a 1D Gaussian with standard deviation centered at .

If the size of basins is fixed, then the problems described in Fig. 1 are encountered once more. Hence, in reconnaissance metadynamics the basin’s size is allowed to expand during the course of the simulation so as to ensure that spurious minima, which would otherwise appear at the edge of basins, are dealt with automatically. Periodically we check whether or not the system is inside the hypercrust at a basin’s rim [*S* < *r*(**s**) < *S* + Δ*r*] and also that it is not within the sphere of influence of any other basin. If these conditions are satisfied, we then decide whether or not to expand using a probabilistic criterion, which ensures that it becomes more difficult to expand basins as the simulation progresses. Based on the loose analogy with free particle diffusion outlined in *SI Text*, this probability for expansion is given by , where *S* is the current size of the basin, Δ*t* is the time between our checks on whether or not to expand, and *D* is a user-defined parameter.

The algorithm is summarized in the flow chart in Fig. 2. Further details on the components of the algorithm can be found in *Materials and Methods* and in *SI Text*.

## Results

### 2D Surface.

To illustrate the operation of the algorithm we first show how it can be used to accelerate the (Langevin) dynamics on the model, 2D potential energy surface illustrated in Fig. 3.

At low temperatures a particle rolling about on the surface shown in Fig. 3 will remain trapped in one of the deep basins. This is precisely what is observed during the first part of the simulation, when no metadynamics is performed, as the first application of our clustering algorithm demonstrates. On addition of bias, the system quickly escapes this first basin and falls into other basins, the locations of which are identified during subsequent applications of our unsupervised learning protocol. This process continues throughout the simulation so, once basins are identified, the history-dependent bias compensates for them quickly and hence the system rapidly explores the entirety of the energy surface.

Fig. 3 shows that the reconnaissance metadynamics algorithm, when properly applied, finds only basins that correspond to the true features in the free energy surface. In addition Fig. 3*D* shows how effective the adaptive bias is in dealing with regions where small basins are encompassed in larger depressions. It clearly shows that initially small hills are used to deal with the subbasins, whereas later, much larger hills are used to compensate for the superbasin.

### Lennard–Jones 7.

Small clusters of rare-gas atoms have a remarkably complex behavior despite their rather limited number of degrees of freedom. A particularly well-studied example is the two-dimensional, seven-atom, Lennard–Jones cluster (41, 42) for which four minima and 19 saddle points in the potential energy surface have been identified (43) (see Fig. 4). At moderate temperatures (*k*_{B}*T* = 0.1*ϵ*) this system spends the majority of its time oscillating around the minimum energy structure, in which one of the atoms is surrounded symmetrically by the six other atoms. Infrequently, however, the system will also undergo isomerizations in which the central atom of the hexagon is exchanged with one of the atoms on the surface (43).

To test the reconnaissance metadynamics algorithm we examined this system using the coordination numbers of all the atoms (seven collective coordinates). In doing this we neglect the interchange symmetry to see if we can reproduce this symmetry in the positions of the various basins we find. Fig. 4 gives the results of this calculation and denotes the position of each of the basins found by a circle whose area reflects the final bias at the basin center. The three lowest lying minima for this cluster have one atom that has a much higher coordination number than the others, and so the basins identified along the trajectory have been grouped, based on the index of the atom with the highest coordination number, onto seven slices. On each of these planes the positions and sizes of the basins are consistent, which suggests that the algorithm finds the correct symmetry and explores configuration space correctly.

In reconnaissance metadynamics there is no straightforward connection between the bias and the free energy because each CV is valid only in a local region, and hence the overall bias is not a function of a global order parameter. Nonetheless, the bias greatly enhances the exploration of phase space and so free energies could be obtained using an umbrella sampling approach. However, even without this additional step, a reconnaissance metadynamics trajectory gives one a feel for the lie of the land, which can be used to obtain chemical insight. For example, the basin centers provide a set of landmark points that can be used to validate a low dimensionality description of the system (27). For this simple case we did this by calculating the value at the basin centers of many different candidate coordinates. We discovered that an optimal, in-plane separation of the various basins is attained when we use the second and third moments ( and , respectively) of the distribution of coordination numbers. These two CVs clearly project out permutation symmetry, and so we also performed a conventional well-tempered metadynamics simulation. Fig. 4 shows the free energy surface obtained in its eighth slice. A comparison of this surface with the results from the reconnaissance metadynamics shows how the basins found cluster around the minima in this FES.

### Polyalanine 12.

The protein folding problem is commonly tackled using computer simulation, and there exist model systems for which the entirety of the potential energy landscape has been mapped out (36) that represent a superb test of any methodology. For example, polyalanine 12, modeled with a distance dependent dielectric (*ϵ*_{ij} = *r*_{ij} in angstroms) that mimics some of the solvent effects, has been extensively studied (44). This protein has a funnel-shaped, energy landscape with an alpha-helical, global minima. We found that during a 1-μs, conventional MD simulation started from a random configuration, the protein did not fold (see *SI Text*). Hence, examining whether or not the protein will fold during a reconnaissance metadynamics simulation will provide a third test of our methodology.

For this reconnaissance metadynamics calculation we used the 24 backbone dihedral angles as the collective coordinates as these angles provide an excellent description of the protein structure. These variables are periodic, which had to be accounted for in the method by replacing the multivariate Gaussians with multivariate von Mises distributions (45). This distribution, if sufficiently concentrated about the mean, is equivalent to a Gaussian in which the difference between any point and the mean is shifted to the minimum image. Consequentially, we can continue to use the same algorithm for trajectory analysis as long as we take into account the periodicity when we calculate differences and averages. In addition, we can define a quantity (see Eq. **3**) that is equivalent to the distance from the center of the basin (Eq. **2**) but that takes into account the periodicity of the CVs (*P*_{i}).

[3]

Fig. 5 provides a representation of a portion of a typical reconnaissance metadynamics trajectory of the protein. Fig. 5 shows the values of all the backbone torsional angles and clearly demonstrates that a large volume of configurational space is explored during this relatively short simulation. Furthermore, unlike in conventional MD, after approximately 22 ns the protein folds into the global-minimum, alpha-helix configuration. To further demonstrate that the reconnaissance metadynamics is ensuring that large portions of configurational space, which would otherwise not be visited, are being explored, we calculated the inherent structures by minimizing the energy every 20 ps. The energies of the minimized structures are shown in the uppermost panel of Fig. 5 and demonstrate that the potential energy surface is very rough and that the alpha helix is considerably lower in energy than the other energy minima encountered.

## Conclusions

We have proposed an accelerated dynamics scheme, reconnaissance metadynamics, that uses a self-learning algorithm to construct a bias that accelerates the exploration of configuration space. This algorithm works by automatically identifying low free energy features and deploying a bias that efficiently and rapidly compensates for them. The great advantage of this algorithm is that there is no limit on the number of CVs on which the bias acts. This is possible only because we collapse all these CVs into a single collective coordinate that is valid only locally and patch together a number of these local descriptors in order to reflect the fact that the important degrees of freedom are not uniform throughout configuration space. As shown in Fig. 3 this approach provides us with a simple, compact, hierarchical description of the free energy surface, which could be used to construct bias potentials for umbrella sampling.

As shown for the two chemical systems on which we have demonstrated the algorithm, our ability to work with large numbers of collective coordinates means that one can employ generic configurational data, such as torsional angles and coordination numbers, as collective coordinates and thereby avoid all the usual difficulties associated with choosing a small set of collective coordinates. In fact, even when CVs, on which there are no large barriers to motion, such as the torsional angles on the terminal amino acid groups in ala12, are included, the algorithm will still function. For both systems our method produces trajectories that contain an extensive exploration of the low energy parts of configurational space and hence provides a feel for the lie of the land. Furthermore, even without a quantitative estimate of the free energy, considerable insight can be obtained from the trajectory. In the Lennard–Jones this allowed us to attain an effective two-dimensional description of the landscape. For more complex systems, nonlinear embedding could be used to automate this procedure.

## Materials and Methods

### Mixture of Probabilistic Principal Component Analyzers.

Throughout this work we use the annealing strategy outlined in ref. 40 and in *SI Text* to do clustering. This algorithm requires one to state at the outset the number of clusters that are being used to fit the data and the number of annealing steps. For the latter we initially set *σ*^{2}, which is the quantity treated like the temperature during the annealing, equal to the maximum eigenvalue of the covariance of the data and lowered *σ*^{2} until it was less than 1% of its initial value. To establish the correct number of clusters we run multiple fits to the data using different numbers of clusters and select the fit that gives the largest value for the Bayesian Information Criterion (BIC) (46), , where is the maximized likelihood for the model, *n*_{p} is the number of parameters in the model, and *M* is the number of trajectory snapshots used in the fitting.

### Selecting Unique Basins.

As already discussed, the GM algorithm provides us with the locations of a number of basins. Some of these will have very low weights in the fit and can therefore be safely ignored in the construction of the bias. Others, however, will provide information about the basins found during prior runs—in short, information that is redundant. Therefore, we introduce a criterion for the selection of basins that requires that *f*_{i}[1 - max(*ξ*_{ij})] > TOL, where TOL is some user-defined tolerance, *f*_{i} is the weight of the new basin in the fit, and *ξ*_{ij} is the similarity between the new basin *i* and the old basin *j*. To calculate *ξ*_{ij} we use Matusita’s measure, which can be calculated exactly for multivariate Gaussians (47). For an expanded basin of size *S* its original covariance is multiplied by a factor of *S*/*S*_{0} when calculating this function so that its expanded volume is appropriately taken into account.

### Lennard–Jones.

The parameters for the simulations of Lennard–Jones 7, in Lennard–Jones units, are as follows: The temperature was set equal to *k*_{B}*T* = 0.1*ϵ* using a Langevin thermostat, with a relaxation time of . The equations of motion were integrated using the velocity verlet algorithm with a timestep of for 5 × 10^{7} steps. During reconnaissance metadynamics the CVs were stored every 100 steps, while cluster analysis was done every 1 × 10^{5} steps. Only basins with a weight greater than 0.3 were considered and to these attempts to add hills of height 0.5 *k*_{B}*T* and a width of 1.5 were made every 1,000 steps. Basin expansion was attempted with the same frequency with the parameter *D* set equal to 0.03. The coordination numbers were computed using , where *r*_{ij} is the distance between atoms *i* and *j*.

### Polyalanine 12.

All simulations of polyalanine were run with a modified version of gromacs-4.0.3 (48), the amber96 forcefield (49) and a distance dependent dielectric. A timestep of 2 fs was used, all bonds were kept rigid using the LINCS algorithm, and the van der Waals and electrostatic interactions were calculated without any cutoff. The global thermostat of Bussi et al. (50) was used to maintain the system at a temperature of 300 K. The initial random configuration of the protein was generated by setting up the protein in a linear geometry, minimizing it, and then running 1 ns of normal MD at 300 K in order to equilibrate. During reconnaissance metadynamics the CVs were stored every 250 steps, whereas cluster analysis was done every 5 × 10^{5} steps. Only basins with a weight greater than 0.2 were considered, and attempts were made every 1,000 steps to add to these basins hills of height 0.4*k*_{B}*T* and width 1.5. Basin expansion was attempted with the same frequency with the parameter *D* set equal to 0.3. Minimizations to obtain inherent structures were done by first annealing for 1.2 ns with a decay rate of 0.996 ps^{-1} and subsequently performing a steepest descent minimization. Guidelines as to how to select parameters for reconnaissance metadynamics are provided in *SI Text*. For both this system and the Lennard–Jones cluster, we obtained similar results with a variety of different parameters sets.

## Acknowledgments

The authors thank Davide Branduardi, Jim Pfaendtner, Meher Prakash, and Massimilliano Bonomi for useful discussions. David Wales is also thanked for his advice on the simulation of the ala12 system.

## Footnotes

^{1}To whom correspondence may be addressed. E-mail: gareth.tribello{at}phys.chem.ethz.ch or parrinello{at}phys.chem.ethz.ch.Author contributions: G.A.T., M.C., and M.P. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1011511107/-/DCSupplemental.

## References

- ↵
- ↵
- ↵
- ↵
- Liu P,
- Kim B,
- Friesner RA,
- Berne BJ

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Bash P,
- Singh U,
- Brown F,
- Langridge R,
- Kollman P

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Laio A,
- Parrinello M

- ↵
- ↵
- ↵
- ↵
- Das P,
- Moll M,
- Stamati H,
- Kavraki LE,
- Clementi C

- ↵
- ↵
- ↵
- ↵
- ↵
- Abrams CF,
- Vanden-Eijnden E

- ↵
- ↵
- ↵
- ↵
- Wales DJ

- ↵
- ↵
- Tipping ME,
- Bishop CM

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

### More Articles of This Classification

### Physical Sciences

### Related Content

- No related articles found.

### Cited by...

- Intrinsic map dynamics exploration for uncharted effective free-energy landscapes
- Enhanced, targeted sampling of high-dimensional free-energy landscapes using variationally enhanced sampling, with an application to chignolin
- Locating landmarks on high-dimensional free energy surfaces
- Using sketch-map coordinates to analyze and bias molecular dynamics simulations
- Locating binding poses in protein-ligand systems using reconnaissance metadynamics
- Simplifying the representation of complex free-energy landscapes using sketch-map