## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Locating landmarks on high-dimensional free energy surfaces

Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved January 28, 2015 (received for review September 25, 2014)

## Significance

The problem of generating and navigating high-dimensional free energy surfaces is a significant challenge in the study of complex systems. The approach introduced represents an advance in this area, and its ability to generate and organize the key features of a high-dimensional free energy surface, i.e., its landmarks, with high efficiency impacts numerous problems in the materials and biomolecular sciences for which prediction of optimal structures is key. These include polypeptide and nucleic acid structure and crystal design and structure prediction. Moreover, as the algorithm targets the free energy surface, candidate structures can be ranked based on their relative free energies, which is not possible with algorithms that target only the bare potential energy surface.

## Abstract

Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed “landmarks”) on a high-dimensional free energy surface “on the fly” and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques.

- free energy surface
- stochastic optimization
- activation–relaxation
- machine learning
- network representation

Understanding the conformational equilibria of complex systems remains a significant challenge in the computational molecular sciences. Whether one is interested in predicting biomolecular structure, generating and thermodynamically ranking the polymorphs of molecular crystals, or studying the phase behavior of complex materials, the very large number of degrees of freedom renders such problems highly nontrivial. Often the most important conformational states in a system can be characterized in terms of a subset of collective degrees of freedom or “collective variables” (CVs), and the problem of mapping out the conformational equilibria amounts to generating the marginal probability distribution in these CVs, from which the associated free energy surface (FES) can be generated. Unfortunately, due to the existence of many minima on the potential energy surface separated by high barriers, transitions from basin to basin on this surface are rare, so that the FES cannot be generated on any reasonable timescale using standard molecular dynamics (MD) or Monte Carlo (MC) methods. Various enhanced sampling approaches have been devised to accelerate the exploration of such “rough” or “frustrated” energy landscapes to generate such FESs, either by elevating the temperature in the subspace of the CVs (1⇓–3) or by applying a bias potential on the FES (4⇓⇓–7) as in the popular metadynamics approach (4, 5). Recently, we have shown that these two classes of methods can be effectively combined (8), and others have shown that various types of free energy dynamics are possible (9).

Although it is often claimed that a low-dimensional manifold involving the selected CVs and embedded in the full phase space could capture the most important processes in complex systems (10, 11), the problem of identifying the relevant CVs that characterize this low-dimensional surface remains unsolved in general. In particular, the chosen CVs must describe a manifold of slow motions in the system. Although considerable effort has been devoted to this subject (11⇓–13), most of the methods proposed require prior knowledge of the system, which necessitates sufficient sampling of the global configurational distribution. Moreover, quantifying different conformational equilibria might require different numbers of CVs, indicating that assigning a constant dimensionality to the low-dimensional manifold is inadequate (14). One simple solution is to use a large number of CVs for the system, which may contain some redundancy yet are sufficient to incorporate the important slow modes. When this is done, one is necessarily faced with the problem of describing a high-dimensional free energy surface (HDFES). Despite the advent of robust enhanced sampling techniques capable of exploring HDFESs (8, 15, 16), significant difficulties remain in the study of such surfaces, including the characterization and representation of a function of many variables. In fact, the low free energy regions on an HDFES possess a “spider’s web” structure that occupies a relatively small fraction of the surface (16). Thus, enhanced sampling methods designed to sample the surface uniformly could spend too much time in uninteresting, relatively high free energy regions. Even if a well-sampled HDFES is available, fitting it to some model functional form (8) and/or visualizing the samples in a low-dimensional space (16) are nontrivial problems. Interestingly, the spider’s web structure indicates that the most interesting parts of the HDFES are the minima and the transition paths connecting them. As long as minima and saddles are located on an HDFES, relative free energy differences can be calculated and transition paths can be constructed. Therefore, the minima and saddle points are important “landmarks” on an HDFES, and locating these landmarks becomes a threshold for further analysis of the surface.

An obvious way to locate landmarks is to apply a searching algorithm on an analytically represented HDFES. However, difficulties in complete sampling and high-dimensional fitting become obstacles in such an approach. In this paper, we develop a strategy for locating landmarks on an HDFES “on the fly” by direct optimization. Various optimization approaches, such as the dimer method (17⇓–19), the activation–relaxation technology (ART) (20, 21), and discrete path sampling (22), have achieved considerable success in locating minima and saddles on potential energy surfaces. However, the problem of locating such landmarks on an HDFES has received significantly less attention. It is worth noting that, in general, the complexity of a frustrated system described by a particular potential energy function can be reduced by integrating out uninteresting degrees of freedom and obtaining the FES, a surface that encodes information about important conformational states. Because the FES is determined by the expectation of an indicator function, the optimization algorithm proposed here belongs to the general class of stochastic approximations, which is a family of optimization algorithms for objective functions that can be estimated only via noisy observations. Such algorithms have been successful in the field of electrical engineering, specifically in adaptive signal processing (23), are the basis for stochastic gradient descent methods that are widely used in machine learning (24), and have recently been used to optimize a biasing potential (25) in metadynamics. As our algorithm is designed to navigate HDFESs that are multiscale in nature, the algorithm falls within the general framework of multiscale modeling, specifically heterogenous multiscale modeling (HMM) (26) or the “equation-free” approach (27). We show that the optimization algorithm can be tailored to suit the features of the FES, which are generated on the fly using machine-learning methods; this constitutes a significant difference from algorithms that operate on potential energy surfaces. The use of ART leads to a global search algorithm on the HDFES. We term the resulting approach the stochastic activation–relaxation technique (START). Once the landmarks are located, we show that they may be subsequently used as inputs to an enhanced sampling calculation to obtain, quantitatively, their associated free energies, a procedure that is of considerably greater efficiency than attempting to converge the full HDFES from scratch via enhanced sampling (with no a priori knowledge of the location of the landmarks). Finally, we address the problem of representing the HDFES explored by START by introducing a graph-based approach in which all of the landmarks obtained are colored vertices and connections between them are the edges. We then show that the graph nodes can be analyzed and clustered based on suitably chosen attributes to reveal archetypical configurations of the system.

## Stochastic Approximation on an HDFES

Consider a system of *N* particles with coordinates *n* CVs given by functions *T* is the temperature of the system. The variables *δ*-functions as the limit of a product of Gaussians (2, 3), we obtain*C* and **1** with fixed **s**. The samples come from molecular dynamics or Monte Carlo simulations with a restraint on **s**. These estimators, denoted by

Starting from one point on an HDFES, neither minimum nor saddle optimization algorithms can guarantee a complete search of the minima/saddles on the surface. Here, we obtain a global search strategy by combining two protocols iteratively: A saddle optimization approach allows the system to escape a local minimum, and minimum optimization allows the system to relax from a saddle. In the latter, **s** is updated according to**2**. The vector **s** if **n** is close to the eigenvector of **H** with the minimum eigenvalue. The second equation evolves **n** to the required eigenvector. The parameter *γ* controls the “sensitivity” of **n** to a change in **H**, thus providing a means of damping the noise in **H**. This type of optimization algorithm most closely fits the HMM framework; i.e., Eqs. **2**, **3a**, and **3b** can be viewed as macroscopic solvers for the coarse-grained variables. Information needed for these solvers can come from any microscopic approach capable of delivering the required constrained averages that produce

Note that in Eqs. **2** and **3a**, the step size **F** or *k*th step of the optimization is chosen to decay as **s** will not evolve too rapidly when **s** and thus avoid a long equilibration phase in the restrained simulations. Note also that in traditional stochastic approximations, when *k* is large, **s** will advance more slowly than desired. By contrast, START with a constant step size and normalized force guarantees the efficiency of the optimization. When the step size is constant, however, **s** does not converge to a point; rather, the trajectory of **s** generates a cluster around a minimum/saddle (32) (see Fig. 2). Such a cluster formed during a minimum/saddle optimization can be extracted by a clustering algorithm (*Materials and Methods*), and landmarks will be covered by clusters. The exact location of each landmark is then determined by investigating the FES at the cluster. The local FES can be constructed from *M* points **A** and

A cluster can also form in a flat region (not a minimum or saddle) on the FES during an optimization. The reason for this is that in such a region, the mean force is small along several directions, and the noisy nature of **F** and **H** causes a minimum/saddle optimization trajectory to remain in the region for a considerable period, thereby generating a cluster, before it is able to diffuse out. We illustrate this phenomenon using the alanine dipeptide in vacuum. The Ramachandran map, i.e., the FES as a function of the dihedral angles Φ and Ψ, exhibits a very flat region (proved by studying the converged mean force in Fig. 1*A*), corresponding to M4 in Fig. 2, that does not contain any landmarks. Fig. 1*B* shows one minimum optimization trajectory passing through this region: It wanders in this region (left box) for a relatively long period before finally diffusing out to a minimum (right box), thus forming a cluster (our criterion for cluster formation is described in *Materials and Methods*). When this happens, the optimization halts prematurely, and this cluster is then fed into the minimum/saddle analysis described above. Fig. 1*C* shows the resulting cluster around M4 together with the fitted critical point (landmark). We thus see that the fitted critical point is separated from the cluster centroid; i.e., it can be located at the edge or even outside the cluster. Inspired by the idea of support vector machines (SVMs), an algorithm that we term the “maximum separation test” is designed to screen out these false landmarks by quantifying the degree to which a point is “at the edge,” “near the center,” or “outside” of a cluster. Let **w** be the normal vector of a hyperplane through a fitted critical point. The number of samples on one side of the hyperplane is a function *m*th point in the cluster, and *σ*, and we obtain the plane of maximum separation by minimizing *l* will be approximately unity if the critical point lies close to the center of the cluster, will decrease if the critical point shifts to the edge of the cluster, and will be approximately zero if the critical point lies outside the cluster. Fig. 1*D* shows the plane of maximum separation with all points located on one side. The *l* score in this case is very small (Fig. 2*D*), and the fitted critical point should be excluded as a false landmark. The consistency of this result with our observations in Fig. 1 *A* and *D* highlights the utility of the maximum separation test.

A flowchart of the START procedure is shown in Fig. 2*A*. The initial shooting move before each optimization, like the initial shifting of one atom in the activation–relaxation technique (20, 21), forces **s** to leave a landmark more quickly, thus increasing the overall efficiency. Instead of shooting along a random direction, other strategies, such as shooting along the direction of the Hessian eigenvector with the smallest eigenvalue, can be used (29). The clusters from the clustering algorithm are shown in Fig. 2*B* for an alanine dipeptide example. Fig. 2*D* plots the *l* scores of all of the clusters. Most of these clusters have large scores except for M4 and S7, indicating that these two clusters do not pass the maximum separation test and are, therefore, flat regions on the FES. After local mean force fitting and maximum separation tests, the locations of minima/saddles and the eigenvectors of the Hessians at these points are determined and are plotted in Fig. 2*C*. The locations of the landmarks match the shape of the FES from a 10-ns benchmark simulation using the unified free energy dynamics (UFED) method (8) (background FES of Fig. 2 *B* and *C*) and are also consistent with those in previous studies (33).

## Results

### Alanine Tripeptide.

The alanine dipeptide illustrated in Fig. 2 serves as a simple, illustrative example for the START algorithm. However, because the FES can also be generated via numerous enhanced sampling methods, and the locations of minima and saddles can be read off directly from the FES (8, 34), we consider next the alanine tripeptide in vacuum to provide an example of a nontrivial four-dimensional FES. It is instructive to test START on such a system. The two pairs of Ramanchandral angles are the most flexible CVs for this problem. To search as many minima and saddles as possible, 600 START iterations are used, and the total simulation time is nearly 150 ns. Local mean force fitting, followed by maximum separation testing, was applied to generate the critical points, eigenvalues, eigenvectors, and *l* scores. Eighteen minima (M1–M18) and 45 saddles (S1–S45) were selected (*SI Appendix*). Seventeen minima found in this study also show up in the previously reported UFED study (8), and one extra minimum, M14, is found. However, this minimum lies close to S40 and possesses a correspondingly small Hessian eigenvalue (*SI Appendix*), indicating that this minimum is shallow. With the information of all minima and saddles, it is possible to set up a network or graph connecting all metastable configurations. The network consisting of all possible conformational changes through various saddle points is exhibited in Fig. 3*A*. Blue and yellow vertices correspond to minima and saddles, respectively.

Fig. 3*B* shows the number of clusters obtained as a function of the number of iterations. The number of clusters visited increases rapidly at the beginning and more slowly thereafter because the probability of finding a previously visited cluster increases during the START search procedure. After roughly 300 iterations, START has identified nearly all of the minima and most of the saddles. The subsequent 300 iterations simply provide additional mean force samples, which improves the accuracy of the local mean force fitting.

### Met-Enkephalin.

Met-enkephalin (Tyr-Gly-Gly-Phe-Met) is a pentapeptide well known as an endogenous ligand of the opioid receptors and distributed throughout the central nervous system. For this small peptide, we used the full set of 10 Ramachandran angles as CVs. In terms of these CVs, the FES is nontrivial (15, 16). Thus, this system provides a very challenging 10-dimensional FES to test the ability of START. Our simulations used 6,905 START iterations (3.5 μs). After locally fitting the FES around a landmark to a quadratic form and testing all clusters by the maximum separation test, 1,081 minima and 1,431 saddles are located (*SI Appendix*). Saddles connecting to one minimum due the periodicity of the CVs are excluded here because they provide no information on conformational changes.

Similiar to Fig. 3*A*, a network representation with all minima and saddles as vertices can be constructed for this example. Viewing this network is clearly nontrivial. However, groups of landmarks may share structural similarities on a more coarse-grained level, and the 10 chosen CVs could exhibit structural redundancy. We therefore use the five α-carbon atoms to distinguish structural archetypes. The similarity between two different landmarks is based on the root-mean-square deviation (rmsd) of these α-carbon atoms. The network can be organized by grouping landmarks based on structural similarity, which here is achieved using the sketch-map method (16). The sketch map is a variation of multidimensional scaling that seeks a meaningful dimensional reduction of a high-dimensional configuration space while retaining the cluster structure in the low-dimensional representation (details in *Materials and Methods*). As shown in Fig. 4*A*, similar landmarks are clustered, and three regions indicate three major classes of structures. The ability to explore various folded and unfolded structures clearly proves that START is powerful as a global search tool on an HDFES. The locations of the landmarks generated by START are now used as inputs into an enhanced sampling calculation to obtain the free energies of these landmarks. Several enhanced sampling algorithms can be used to estimate the free energies of the landmarks. Here, we use driven adiabatic free energy dynamics (d-AFED)/temperature-accelerated MD (TAMD) (2, 3) to evaluate free energies of the landmarks because of d-AFED’s ability to sweep rapidly over the free energy landscape and quickly generate free energies at the START landmark locations. The same CVs as in START are used in the d-AFED run (details in *Materials and Methods*). The free energies of landmarks from a 500-ns d-AFED simulation are shown in Fig. 4*B*. In fact, the first 200 ns of the d-AFED simulation are already sufficient to select quantitatively the low free energy landmarks (<8 kcal/mol) to within 1 kcal/mol (Fig. 4*C*). Thus, it should be clear that having this information available a priori is considerably more efficient than searching for the landmarks from scratch via the enhanced sampling procedure.

## Conclusion and Perspective

A robust and efficient approach for searching minima and saddles (landmarks) on an HDFES, START, has been introduced. START provides a strategy for analyzing HDFESs and circumventing the difficulties of uniform sampling and global construction of the HDFES. We used the alanine tripeptide and met-enkaphlin examples to show that START is a powerful tool for the global exploration of an HDFES and for the fast and accurate pinning down of exact locations of landmarks. Although our examples were performed in the gas phase, it is important to note that, since the START algorithm targets the FES in selected CVs, these systems could just as well have been performed in explicit solvent, a feature that distinguishes START from methods that directly target the potential energy surface. If the main goal of a study is to explore the FES, including identifying important configurations such as folded structures in proteins and polymorphs of molecular crystals and computing free energy differences between them, the landmarks located by START are important inputs. The saddles located by START can also be used to construct a good initial guess for subsequent searching of transition paths connecting the two minima on the FES with methods such as the nudged-elastic band approach (17) or the string method (35). The relative free energies of all (or some) landmarks could be easily evaluated by other enhanced sampling methods once the landmarks are located. A network or graph containing all of the landmarks can be constructed as a representation of the HDFES. This network, which also specifies connections between saddles and minima as the edges in the graph, provides important information for studying mechanisms of conformational changes in the system. A related representation was proposed to set up a network for organic reactions (36). In the present scheme, our graph representation allows us to apply graph analysis methods on the HDFES in which the vertices are characterized by a set of key features or molecular structural similarity, so that important properties of the system can be subsequently revealed and elucidated. In a manner that is similar to building a stationary point database in discrete path sampling (22), landmarks located by START, together with their free energies and the graph representation, can also be used for further kinetic studies.

## Materials and Methods

Two test systems, here the alanine di- and tripeptides and met-enkaphalin in vacuum, are studied by the START algorithm. Samples for *κ* for the restrained simulations was 278.2 kcal⋅mol^{−1}⋅rad^{−2}. The simulation time for each restrained simulation was 1 ps for the alanine di- and tripeptides. In the met-enkaphalin example, the simulation time for each restrained simulation was 1 ps for optimizing to minima, whereas the simulation time was 10 ps for saddle optimizations. In all three cases, the step size **n** in GAD was chosen to be the same as this random shooting direction. In the saddle optimization, the parameter

The density-based spatial clustering for applications with noise (DBSCAN) (40) is suitable for analyzing and classifying the clusters generated in the two optimization steps of the START protocol. The DBSCAN algorithm begins with an arbitrary unvisited sample point on the FES and a count of the number of neighboring points in the sample whose Euclidean distance from the sample point is not larger than some specified value *r*. If the number of neighboring sample points is larger than some lower bound *p*, a new cluster is assigned. Otherwise, the point is considered “noise.” When a new cluster is assigned, DBSCAN combines all of the neighboring points (neighborhood) within the distance *r*, together with their neighborhood, provided the points are not noisy points. The cluster growth continues until no points remain (that are not designated as noise) that can be assigned to the cluster. Unlike clustering methods such as *k*-means and Gaussian mixture models, DBSCAN requires no initial guess of the number or shape of the clusters. In this study, the radius of the neighborhood *r* is 0.122 rad or 7.0° and the lower bound is *l* score, above which a cluster passes the test, is 0.2 in the alanine tripeptide example and 0.25 (for minima) and 0.11 (for saddles) in the met-enkaphalin example.

In the met-enkaphalin example, the objective function of the sketch map is *a* and *b* are 3.0 and 9.0 for *σ* is 0.5 Å in both cases. The optimization strategy follows that in ref. 16. In the d-AFED simulation, the temperature of **s** is 800 K, and the mass is 168.0 Å^{2}⋅rad^{−2} times the mass of the proton. The other simulation parameters are the same as those prescribed before. Independent samples are taken from the trajectories every 1 ps. These samples are convoluted with a homogeneous Gaussian kernel with a variance of 20° to generate a probability distribution of **s**, and landmark free energies can be evaluated following the method in ref. 3.

## Acknowledgments

M.E.T. and M.C. acknowledge support from the National Science Foundation Grant CHE-1301314. M.C. also acknowledges the Margaret and Herman Sokol Doctoral Fellowship in the Sciences.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: mark.tuckerman{at}nyu.edu.

Author contributions: M.C., T.-Q.Y., and M.E.T. designed research; M.C. and T.-Q.Y. performed research; M.C. and T.-Q.Y. analyzed data; and M.C. and M.E.T. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1418241112/-/DCSupplemental.

## References

- ↵
- ↵
- ↵
- ↵.
- Laio A,
- Parrinello M

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Das P,
- Moll M,
- Stamati H,
- Kavraki LE,
- Clementi C

- ↵.
- Tenenbaum JB,
- de Silva V,
- Langford JC

- ↵.
- Coifman RR, et al.

- ↵
- ↵.
- Tribello GA,
- Ceriotti M,
- Parrinello M

- ↵.
- Ceriotti M,
- Tribello GA,
- Parrinello M

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Widrow B,
- Stearns SD

- ↵.
- Bottou L

*Lecture Notes in Artificial Intelligence, LNAI 3176*, eds Bousquet O, von Luxburg U (Springer, Berlin), pp 146–168 - ↵
- ↵.
- E W,
- Engquist B

- ↵Gear WC, et al. (2003) Equation-free, coarse-grained multiscale computation: Enabling macroscopic simulators to perform system-level analysis..
*Commun Math Sci*1(4):715–762 - ↵.
- Maragliano L,
- Fischer A,
- Vanden-Eijnden E,
- Ciccotti G

*J Chem Phys*125(2):024106 - ↵Samanta A, Chen M, Yu T-Q, Tuckerman M, E W (2014) Sampling saddle points on a free energy surface..
*J Chem Phys*140(16):164109 - ↵
- ↵.
- Samanta A,
- E W

*J Chem Phys*136(12):124104 - ↵.
- Borkar VS

- ↵
- ↵
- ↵.
- E W,
- Ren W,
- Vanden-Eijnden E

- ↵
- ↵
- ↵
- ↵
- ↵.
- Ester M,
- Kriegel H-p,
- Jörg S,
- Xu X

*Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)*, eds Simoudis E, Han J, Fayyad U (AAAI Press, Menlo Park, CA), pp 226–231

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Physical Sciences
- Chemistry