# Neural networks-based variationally enhanced sampling

^{a}Department of Physics, ETH Zurich, 8092 Zurich, Switzerland; ^{b}Facoltà di Informatica, Istituto di Scienze Computazionali, Università della Svizzera italiana (USI), 6900 Lugano, Switzerland; ^{c}National Center for Computational Design and Discovery of Novel Materials (MARVEL), USI, 6900 Lugano, Switzerland; ^{d}Department of Chemistry and Applied Biosciences, ETH Zurich, 8092 Zurich, Switzerland; ^{e}Computational Science, Italian Institute of Technology, 16163 Genova, Italy


Contributed by Michele Parrinello, July 9, 2019 (sent for review May 8, 2019; reviewed by Giuseppe Carleo and Jim Pfaendtner)

## Significance

Atomistic-based simulations are one of the most widely used tools in contemporary science. However, in the presence of kinetic bottlenecks, their power is severely curtailed. In order to mitigate this problem, many enhanced sampling techniques have been proposed. Here, we show that by combining a variational approach with deep learning, much progress can be made in extending the scope of such simulations. Our development bridges the fields of enhanced sampling and machine learning and allows us to benefit from the rapidly growing advances in this area.

## Abstract

Sampling complex free-energy surfaces is one of the main challenges of modern atomistic simulation methods. The presence of kinetic bottlenecks in such surfaces often renders a direct approach useless. A popular strategy is to identify a small number of key collective variables and to introduce a bias potential that is able to favor their fluctuations in order to accelerate sampling. Here, we propose to use machine-learning techniques in conjunction with the recent variationally enhanced sampling method [O. Valsson, M. Parrinello, *Phys. Rev. Lett.* 113, 090601 (2014)] in order to determine such potential. This is achieved by expressing the bias as a neural network. The parameters are determined in a variational learning scheme aimed at minimizing an appropriate functional. This required the development of a more efficient minimization technique. The expressivity of neural networks allows representing rapidly varying free-energy surfaces, removes boundary artifacts, and allows several collective variables to be handled.

Machine learning (ML) is changing the way in which modern science is conducted. Atomistic-based computer simulations are no exception. Since the work of Behler and Parrinello (1), neural networks (NNs) (2, 3) and Gaussian processes (4) are now almost routinely used to generate accurate potentials. More recently, ML methods have been used to accelerate sampling, a crucial issue in molecular dynamics (MD) simulations, where standard methods allow only a very restricted range of time scales to be explored. An important family of enhanced sampling methods is based on the identification of suitable collective variables (CVs) that are connected to the slowest relaxation modes of the system (5). Sampling is then enhanced by constructing an external bias potential $V(\mathbf{s})$ that is a function of the chosen CVs and favors their fluctuations.

Here, we shall focus on a relatively new method, called Variationally Enhanced Sampling (VES) (17). In VES, the bias is determined by minimizing a functional $\Omega[V]$ of the bias potential itself, as described below.

Although different approaches have been suggested (23, 24), the way in which VES is normally used is to expand the bias in a linear combination of basis functions and to minimize $\Omega$ with respect to the expansion coefficients. Such linear expansions, however, become increasingly impractical as the number of CVs grows, and a truncated basis set can introduce artifacts, especially at the boundaries of the CV domain.

In this paper, we use the expressivity (25) of NNs to represent the bias potential and a stochastic steepest descent framework for the determination of the NN parameters. In so doing, we have developed a more efficient stochastic optimization scheme that can also be profitably applied to more conventional VES applications.

## Neural Networks-Based VES

Before illustrating our method, we recall some ideas of CV-based enhanced sampling methods and particularly VES.

### Collective Variables.

It is often possible to reduce the description of the system to a restricted number of CVs $\mathbf{s}(\mathbf{R})$, functions of the atomic coordinates $\mathbf{R}$, whose equilibrium probability distribution is

$$P(\mathbf{s})=\frac{1}{Z}\int d\mathbf{R}\; e^{-\beta U(\mathbf{R})}\,\delta\big(\mathbf{s}-\mathbf{s}(\mathbf{R})\big),$$

where $\beta$ is the inverse temperature, $U(\mathbf{R})$ the interatomic potential, and $Z$ the partition function of the system. The associated free-energy surface (FES) is defined, up to a constant, as $F(\mathbf{s})=-\frac{1}{\beta}\log P(\mathbf{s})$.
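The relation between the CV probability and the FES can be checked numerically. The sketch below (a toy one-dimensional double well and units where $k_\mathrm{B}T=1$ are assumptions) builds $P(s)\propto e^{-\beta F(s)}$ on a grid and inverts it, recovering $F(s)$ up to an additive constant:

```python
import math

beta = 1.0  # inverse temperature, in units where k_B*T = 1 (assumption)

# Assumed toy free-energy surface: a symmetric double well in one CV s.
def F(s):
    return (s**2 - 1.0)**2

# Discretize the CV and build the normalized probability P(s) ∝ exp(-beta*F(s)).
grid = [-2.0 + 4.0 * i / 400 for i in range(401)]
weights = [math.exp(-beta * F(s)) for s in grid]
Z = sum(weights)
P = [w / Z for w in weights]

# Inverting P recovers the FES up to an additive constant (here log(Z)/beta).
F_rec = [-math.log(p) / beta for p in P]
shift = F_rec[0] - F(grid[0])
max_err = max(abs(f - F(s) - shift) for f, s in zip(F_rec, grid))
```
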

### The Variational Principle.

In VES, a functional of the bias potential is introduced:

$$\Omega[V]=\frac{1}{\beta}\log\frac{\int d\mathbf{s}\; e^{-\beta\left[F(\mathbf{s})+V(\mathbf{s})\right]}}{\int d\mathbf{s}\; e^{-\beta F(\mathbf{s})}}+\int d\mathbf{s}\; p(\mathbf{s})\,V(\mathbf{s}),$$

where $p(\mathbf{s})$ is a preassigned target distribution. This functional is convex (17), and its minimum is reached, up to an immaterial constant, for

$$V(\mathbf{s})=-F(\mathbf{s})-\frac{1}{\beta}\log p(\mathbf{s}).$$

At the minimum, the distribution of the CVs in the biased ensemble coincides with the target distribution $p(\mathbf{s})$, so that the free energy can be read off the optimized bias.
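The variational principle can be illustrated on a discrete toy system (the free-energy values and the uniform target below are assumptions). Plain gradient descent on the functional drives the biased distribution to the target, and the converged bias mirrors the free energy up to a constant:

```python
import math

beta = 1.0
F = [0.0, 3.0, 0.5, 4.0]      # assumed free energies of 4 discrete CV bins
p = [0.25, 0.25, 0.25, 0.25]  # uniform target distribution

# Gradient descent on Omega: dOmega/dV_i = -pV_i + p_i, where pV is the
# distribution of the CV in the biased ensemble.
V = [0.0] * 4
for _ in range(5000):
    boltz = [math.exp(-beta * (f + v)) for f, v in zip(F, V)]
    Z = sum(boltz)
    pV = [b / Z for b in boltz]
    V = [v - 0.1 * (pi - pvi) for v, pvi, pi in zip(V, pV, p)]

# At the minimum, V = -F - (1/beta)*log(p) up to a constant, and pV = p.
residual = [v + f + math.log(pi) / beta for v, f, pi in zip(V, F, p)]
spread = max(residual) - min(residual)
```

The same fixed point is what the stochastic NN optimization targets; here it is reached exactly because the averages are computed analytically.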

### The Target Distribution.

In VES, an important role is played by the target distribution $p(\mathbf{s})$, which is the distribution that the CVs sample once the functional has been minimized. A particularly convenient choice is the well-tempered distribution $p(\mathbf{s})\propto e^{-\beta F(\mathbf{s})/\gamma}$, in which the bias factor $\gamma>1$ attenuates the free-energy barriers while preserving the positions of the metastable states. Since this target depends on the a priori unknown $F(\mathbf{s})$, it is updated self-consistently during the optimization.
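The effect of a well-tempered target can be sketched in a few lines (the bias factor and toy FES values are assumptions): a system sampling $p(\mathbf{s})\propto e^{-\beta F(\mathbf{s})/\gamma}$ effectively sees the FES scaled down by $\gamma$.

```python
import math

beta = 1.0
gamma = 5.0                  # assumed bias factor, gamma > 1
F = [0.0, 6.0, 1.0]          # toy FES values on 3 CV bins

# Well-tempered target distribution: p(s) ∝ exp(-beta*F(s)/gamma)
w = [math.exp(-beta * f / gamma) for f in F]
p = [x / sum(w) for x in w]

# The FES "seen" by a system sampling p is F/gamma: barriers are attenuated
# by the bias factor while the metastable states keep their locations.
F_eff = [-math.log(x) / beta for x in p]
barrier_ratio = (F_eff[1] - F_eff[0]) / (F[1] - F[0])
```
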

### NN Representation of the Bias.

The standard practice of VES has so far been to expand the bias linearly on a set of basis functions. Here, instead, we represent the bias by a deep feed-forward NN that takes the CVs as input and returns the bias potential $V(\mathbf{s};\mathbf{w})$ as output, where $\mathbf{w}$ denotes the set of network parameters.

We employ NNs since they are smooth interpolators. Indeed, the NN representation ensures that the bias is continuous and differentiable. The external force acting on the *i*th atom can then be recovered as

$$\mathbf{F}_i=-\nabla_{\mathbf{R}_i}V(\mathbf{s})=-\sum_j\frac{\partial V}{\partial s_j}\,\nabla_{\mathbf{R}_i}s_j(\mathbf{R}),$$

where the derivatives of the bias with respect to the CVs are efficiently computed by backpropagation.
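The chain rule above can be made concrete with stand-ins for the NN bias (the distance CV and the Gaussian bias below are assumptions, not the paper's setup); the analytic force agrees with a finite-difference derivative:

```python
import math

# Stand-ins (assumptions): the CV is the distance between atoms 0 and 1,
# and the bias is a single Gaussian hill in that distance.
def cv(R):
    return math.dist(R[0], R[1])

def bias(s):
    return 2.0 * math.exp(-((s - 1.0) ** 2))

def dbias_ds(s):
    return -4.0 * (s - 1.0) * math.exp(-((s - 1.0) ** 2))

def force_on_atom0(R):
    # F_0 = -(dV/ds) * grad_{R_0} s; for a distance CV, grad_{R_0} s is the
    # unit vector pointing from atom 1 to atom 0.
    s = cv(R)
    unit = [(a - b) / s for a, b in zip(R[0], R[1])]
    return [-dbias_ds(s) * u for u in unit]

# Finite-difference check of the x component of the force on atom 0.
R = [[1.3, 0.2, 0.0], [0.0, 0.0, 0.0]]
h = 1e-6
Rp = [[1.3 + h, 0.2, 0.0], [0.0, 0.0, 0.0]]
Rm = [[1.3 - h, 0.2, 0.0], [0.0, 0.0, 0.0]]
numeric = -(bias(cv(Rp)) - bias(cv(Rm))) / (2 * h)
analytic = force_on_atom0(R)[0]
```

In the actual implementation, the role of `dbias_ds` is played by backpropagation through the network.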

### The Optimization Scheme.

As in all NN applications, training plays a crucial role. The functional Ω can be regarded as a scalar loss function of the set of parameters $\mathbf{w}$. We evolve the parameters following the direction of its derivatives:

$$\frac{\partial\Omega}{\partial\mathbf{w}}=-\left\langle\frac{\partial V}{\partial\mathbf{w}}\right\rangle_{V}+\left\langle\frac{\partial V}{\partial\mathbf{w}}\right\rangle_{p},$$

where the first average is taken over the biased simulation and the second over the target distribution.

At every iteration, the sampled configurations are used to compute the first term of the gradient. The second term, which involves an average over the target distribution, can be computed numerically when the number of CVs is small or with a Monte Carlo scheme when the number of CVs is large. In doing so, the exponential growth of the computational cost with the number of CVs can be circumvented. This scheme allows the NN to be optimized on the fly, without the need to stop the simulation and fit the network to a given dataset. In this respect, the approach used here resembles a reinforcement learning algorithm. The bias potential is updated at regular intervals, using the configurations sampled since the previous update.
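A minimal sketch of this two-term stochastic gradient, assuming a three-state toy system with a one-hot linear bias in place of the NN: the first term is estimated from sampled configurations, the second by an exact sum over the target, and the updates drive the biased distribution toward the target.

```python
import math
import random

random.seed(0)
beta = 1.0
F = [0.0, 2.0, 4.0]       # assumed toy FES over 3 CV bins
p = [1.0 / 3] * 3         # uniform target distribution

# Linear "network": V(s; w) = w_s, one weight per bin (one-hot features),
# standing in for the NN parameters.
w = [0.0, 0.0, 0.0]
for _ in range(4000):
    # current biased distribution p_V ∝ exp(-beta*(F+V))
    boltz = [math.exp(-beta * (F[i] + w[i])) for i in range(3)]
    Z = sum(boltz)
    pV = [b / Z for b in boltz]
    # term 1: estimated from a batch of configurations sampled under the bias
    batch = random.choices(range(3), weights=pV, k=50)
    g1 = [batch.count(i) / len(batch) for i in range(3)]
    # term 2: average over the target distribution (an exact sum here)
    grad = [-g1[i] + p[i] for i in range(3)]
    w = [w[i] - 0.05 * grad[i] for i in range(3)]

# after training, the biased ensemble samples (approximately) the target
boltz = [math.exp(-beta * (F[i] + w[i])) for i in range(3)]
Z = sum(boltz)
pV = [b / Z for b in boltz]
```
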

Since the calculation of the gradients requires performing statistical averages, a stochastic optimization method is necessary. In standard VES applications, the method of choice has so far been that of Bach and Moulines (33), which allows the minimum to be reached with high accuracy, provided that the CVs are of good quality. Applying the same algorithm to NNs is rather complex. Therefore, we espouse the prevailing attitude in the NN community: we do not aim at reaching the minimum exactly, but rather at determining a bias under which the system samples the CV space according to the target distribution. To assess how close we are to this regime, we monitor at iteration step $n$ an approximate Kullback–Leibler (KL) divergence between the running estimates of the biased probability and of the target distribution.
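The running comparison can be sketched as follows (the bin count, decay constant, and the synthetic uniform data are assumptions; the real implementation compares the biased CV distribution with the target):

```python
import math
import random

def kl(p, q):
    # Kullback-Leibler divergence between two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

class RunningHist:
    """Histogram with exponentially decaying counts, so that the estimated
    distribution tracks only the recent part of the simulation."""
    def __init__(self, nbins, decay):
        self.counts = [1e-8] * nbins   # small floor avoids empty bins
        self.decay = decay
    def update(self, bin_index):
        self.counts = [c * self.decay for c in self.counts]
        self.counts[bin_index] += 1.0
    def dist(self):
        total = sum(self.counts)
        return [c / total for c in self.counts]

# Assumed usage: CV samples binned into 4 bins, drawn here from the target
# itself (uniform), so the monitored divergence should be close to zero.
random.seed(3)
hist = RunningHist(nbins=4, decay=0.999)
for _ in range(5000):
    hist.update(random.randrange(4))
divergence = kl(hist.dist(), [0.25] * 4)
```
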

The simulation is thus divided into 3 parts. In the first one, the ADAM optimizer (34) is used with a constant learning rate until the KL divergence falls below a preassigned threshold value ϵ. In the second, the learning rate is exponentially decreased, so that the bias tends toward a static potential. The longer the second part, the better the bias potential approximates the FES. In the third part, the bias is kept fixed, and the FES is refined by the reweighting procedure of Eq. **11**.
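The schedule logic can be sketched as a small function (the default parameter values are assumptions, not those of the paper):

```python
def learning_rate_schedule(kl_trace, lr0=1e-3, eps=0.5, decay=0.999):
    """Sketch of the 3-phase protocol:
    phase 1: constant learning rate until the running KL drops below eps;
    phase 2: exponential decay, paused whenever the KL rises above eps;
    phase 3: reached asymptotically, when the bias is effectively static."""
    lr = lr0
    decaying = False
    rates = []
    for kl_n in kl_trace:
        if not decaying and kl_n < eps:
            decaying = True          # threshold reached: enter phase 2
        if decaying and kl_n < eps:
            lr *= decay              # anneal toward a static bias
        rates.append(lr)
    return rates
```

Pausing the decay whenever the KL estimate rises above ϵ is what keeps the final bias able to promote transitions.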

## Results

### Wolfe–Quapp Potential.

We first focus on a toy model, namely the 2D Wolfe–Quapp potential, rotated as in ref. 24. This is shown in Fig. 2*A*. The reason behind this choice is that the dominant fluctuations that lead from one state to the other are in an oblique direction with respect to x and y. We choose on purpose to use only x as a CV in order to exemplify the case of a suboptimal CV. This is representative of what happens in practice when, more often than not, some slow degrees of freedom are not fully accounted for (24).

We use a 3-layer NN with [48,24,12] nodes, resulting in 1,585 variational parameters, which are updated every 500 steps. The KL divergence between the biased and the target distribution is computed with an exponentially decaying average (the time constant and the other simulation details are reported in *SI Appendix*).
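As a consistency check, the quoted figure of 1,585 parameters matches a fully connected network with a single input (the CV x), hidden layers of 48, 24, and 12 nodes, and one output:

```python
def mlp_param_count(layers):
    # weights (n_in * n_out) plus biases (n_out) for each dense layer
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

# one CV as input, hidden layers [48, 24, 12], one output (the bias value)
n_params = mlp_param_count([1, 48, 24, 12, 1])
```
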

In the upper right panel (Fig. 2*B*), we show the evolution of the CV as a function of the number of iterations. At the beginning, the bias changes very rapidly with large fluctuations that help to explore the configuration space. It should be noted that the combination of a suboptimal CV and a rapidly varying potential might lead the system to pathways different from the lowest energy one. For this reason, it is important to slow down the optimization in the second phase, where the learning rate is exponentially lowered until it is practically zero. When this limit is reached, the frequency of transitions becomes lower, reflecting the suboptimal character of the CV (24), as can be seen in Fig. 2*B*.

Although the bias itself is not yet fully converged, the main features of the FES are already captured by it (Fig. 2*C*). This is ensured by the fact that, while the learning rate is being decreased, the running estimate of the KL divergence must stay below the threshold ϵ; otherwise, the optimization proceeds with a constant learning rate. This means that the final potential is able to efficiently promote transitions between the metastable states. The successive static reweighting refines the FES and leads to a very accurate result (Fig. 2*D*), removing also the artifacts caused in the first part by the rapidly varying potential.
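The static reweighting step admits a compact numerical sketch (the toy FES, the perfectly flattening bias, and the sample size are assumptions): frames sampled under a fixed bias are reweighted by $e^{+\beta V(s)}$, recovering the unbiased free-energy differences.

```python
import math
import random

random.seed(1)
beta = 1.0
F = [0.0, 3.0, 1.0]           # assumed toy FES over 3 CV bins
V = [0.0, -3.0, -1.0]         # a converged bias that flattens the FES

# sample the biased (here exactly uniform) ensemble p_V ∝ exp(-beta*(F+V))
boltz = [math.exp(-beta * (F[i] + V[i])) for i in range(3)]
pV = [b / sum(boltz) for b in boltz]
samples = random.choices(range(3), weights=pV, k=200_000)

# static reweighting: each frame carries a weight exp(+beta*V(s))
wsum = [0.0, 0.0, 0.0]
for s in samples:
    wsum[s] += math.exp(beta * V[s])
P = [x / sum(wsum) for x in wsum]
F_rew = [-math.log(x) / beta for x in P]
dF_01 = F_rew[1] - F_rew[0]    # estimate of F[1] - F[0] = 3
```
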

### Alanine Dipeptide and Tetrapeptide.

As a second example, we consider the case of 2 small peptides, alanine dipeptide (Fig. 4) and alanine tetrapeptide (Fig. 5) in vacuum, which are often used as benchmarks for enhanced sampling methods. We will refer to them as Ala2 and Ala4, respectively. Their conformations can be described in terms of the Ramachandran dihedral angles, which we use as CVs.

We want to show here the usefulness of the flexibility provided by DEEP-VES in different systems. For this purpose, we use the same architecture and optimization scheme as in the previous example. We only decrease the decay time of the learning rate in the second phase, since the dihedral angles are known to be good CVs for these systems.

The simulation of Ala2 reached the KL divergence threshold after 3 ns, and from there on, the learning rate was exponentially decreased. The NN bias was no longer updated after 12 ns. At variance with the first example, when the learning is slowed down, and even when it is stopped, the transition rate is not affected: this is a signal that our set of CVs is good (*SI Appendix*). In Fig. 3, we show the free-energy profiles obtained from the NN bias following Eq. **4** and the one recovered with the reweighting procedure of Eq. **11**. We compute the root-mean-square error (RMSE) on the FES up to 20 kJ/mol, as in ref. 19. The errors are 1.5 and 0.45 kJ/mol, respectively, corresponding to 0.6 and 0.2 $k_\mathrm{B}T$.

In order to exemplify the ability of DEEP-VES to represent functions of several variables, we study the FES of Ala4 in terms of its 6 dihedral angles. In the lower-dimensional examples, the target-distribution term of the gradient in Eq. **10** is calculated numerically on a grid. This is clearly not possible in a 6D space; thus, we resort to a Monte Carlo estimation based on the Metropolis algorithm (35). Convergence is helped by the fact that the well-tempered target (Eq. **7**) is considerably smoother than the underlying FES. In this case, we cannot use Eq. **12** to monitor convergence. However, as the number of CVs increases, the probability that the space spanned by the CVs covers all of the relevant slow modes is very high. Thus, less attention needs to be devoted to the optimization schedule. The transition from the constant learning-rate phase to the static-bias region can be initiated after multiple recrossings between metastable states are observed. In order to verify the accuracy of these results, we compared the free-energy profiles obtained from the reweighting with the results from metadynamics and its Parallel Bias version (36) (*SI Appendix*).
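A Metropolis estimate of a target-distribution average can be sketched as follows (the 2D toy FES, the bias factor, and the chain length are assumptions; the actual application works in the 6D space of the dihedrals):

```python
import math
import random

random.seed(2)
beta, gamma = 1.0, 5.0

# Assumed toy FES in 2 CVs: a double well in x, harmonic in y.
def F(s):
    x, y = s
    return (x**2 - 1.0) ** 2 + 2.0 * y**2

def log_target(s):
    # well-tempered target: p(s) ∝ exp(-beta*F(s)/gamma)
    return -beta * F(s) / gamma

# Metropolis sampling of the target distribution.
s = [0.0, 0.0]
samples = []
for step in range(200_000):
    trial = [c + random.uniform(-0.5, 0.5) for c in s]
    if math.log(random.random()) < log_target(trial) - log_target(s):
        s = trial
    if step >= 10_000:                 # discard burn-in
        samples.append(s)

# expectations over the target, such as those entering the gradient of Omega
mean_x = sum(q[0] for q in samples) / len(samples)
mean_x2 = sum(q[0] ** 2 for q in samples) / len(samples)
```

Because the tempered target is much smoother than the FES itself, such chains mix quickly even when direct sampling of the unbiased distribution would be impractical.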

### Silicon Crystallization.

In the last example, the phase transition of silicon from liquid to solid at ambient pressure is studied. This is a complex phenomenon, characterized by a high barrier around 80

This can be easily addressed using the flexibility of NNs, which we use here to represent rapidly varying features of the FES and to avoid boundary artifacts. In Fig. 7, we show the free-energy profile learned by the NN as well as the one recovered by reweighting. Even in the presence of very sharp boundaries, the NN bias is able to learn a good representation of the FES.

### Choice of the Parameters.

After presenting the results, we would like to say a few words about the choice of parameters. The procedure requires setting 3 of them, namely the time scale for calculating the KL divergence, the KL threshold, and the decay time for the learning rate. Our experience is that if the chosen set of CVs is good, this choice has a limited impact on the simulation result. However, in the case of nonoptimal CVs, more attention is needed.

The KL divergence should be calculated on a time scale over which the system samples the target distribution, so the longer the relaxation time of the neglected variables, the larger this time scale should be. However, since it is only used to monitor convergence, it is possible to choose a conservative value without compromising speed. A similar argument applies to the decay constant of the learning rate, but, in the spirit of reaching a constant potential quickly, one tries to keep it low (on the order of thousands of iterations). Finally, we found the protocol to be robust with respect to the KL-divergence threshold ϵ, provided that it is chosen in a range of values around 0.5. In the case of an FES with larger dimensionality, it may be appropriate to increase this parameter in order to rapidly reach a good estimate of the bias.

## Conclusions

In this work, we have shown how the flexibility of NNs can be used to represent the bias potential, and the FES as well, in the context of the VES method. Using the same architecture and similar optimization parameters, we were able to deal with different physical systems and FES dimensionalities. This also includes the case in which some important degree of freedom is left out of the set of enhanced CVs. The case of alanine tetrapeptide with a 6D bias already shows the capability of DEEP-VES to deal with high-dimensional FESs. We plan to extend this to even higher-dimensional landscapes, where the power of NNs can be fully exploited.

Our work is an example of a variational learning scheme, in which the NN is optimized following a variational principle. Moreover, the target distribution allows an efficient sampling of the relevant configurational space, which is particularly important when sampling high-dimensional FESs.

In the process, we have developed a minimization procedure alternative to that of ref. 17, which globally tempers the bias potential based on a KL divergence between the current and the target probability distributions. We think that conventional VES can also benefit from our approach and that we have taken another step toward sampling complex free-energy landscapes. Future goals include developing a more efficient scheme to exploit the variational principle in the context of NNs, as well as learning on the fly not only the bias but also the CVs. This work also allows tapping into the immense ML and NN literature for the purpose of improving enhanced sampling.

## Materials and Methods

The VES-NN is implemented on a modified version of PLUMED2 (38), linked against LibTorch (PyTorch C++ library). All data and input files required to reproduce the results reported in this paper are available on PLUMED-NEST (www.plumed-nest.org), the public repository of the PLUMED consortium (39), as plumID:19.060 (40). We plan to release the code also in the open-source PLUMED package. We take care of the construction and the optimization of the NN with LibTorch. The gradients of the functional with respect to the parameters are computed inside PLUMED according to Eq. **10**. The first expectation value is computed by sampling, while the second is obtained by numerical integration over a grid or with Monte Carlo techniques. In the following, we report the setup for the VES-NN and the simulations for all of the examples reported in the paper.

### Wolfe–Quapp Potential.

The Wolfe–Quapp potential is a fourth-order polynomial:

$$U(x,y)=x^4+y^4-2x^2-4y^2+xy+0.3x+0.1y,$$

here rotated in the $xy$ plane as in ref. 24.
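The commonly used form of this polynomial, together with a generic coordinate rotation (the rotation angle below is a placeholder, not the value of ref. 24), can be sketched as:

```python
import math

def wolfe_quapp(x, y):
    # commonly used fourth-order Wolfe-Quapp polynomial (unrotated form)
    return x**4 + y**4 - 2.0 * x**2 - 4.0 * y**2 + x * y + 0.3 * x + 0.1 * y

def rotated(x, y, theta=0.3):
    # assumed: the rotated variant evaluates the polynomial in rotated
    # coordinates; the angle used in ref. 24 is not reproduced here
    c, s = math.cos(theta), math.sin(theta)
    return wolfe_quapp(c * x - s * y, s * x + c * y)
```
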

### Peptides.

For the alanine dipeptide (Ace-Ala-Nme) and tetrapeptide (Ace-Ala3-Nme) simulations, we use GROMACS (41) patched with PLUMED. The peptides are simulated in the canonical ensemble using the Amber99-SB force field (42) with a time step of 2 fs. The target temperature of 300 K is controlled with the velocity rescaling thermostat (43). For Ala2, we use the following parameters: decay constant equal to

### Silicon.

For the silicon simulations, we use Large-Scale Atomic/Molecular Massively Parallel Simulator (44) patched with PLUMED, employing the Stillinger and Weber potential (45). A 3 × 3 × 3 supercell (216 atoms) is simulated in the isothermal-isobaric ensemble with a timestep of 2 fs. The temperature of the thermostat (43) is set to 1,700 K with a relaxation time of 100 fs, while the values for the barostat (46) are 1 atm and 1 ps. The CV used is the number of cubic diamond atoms in the system, defined according to ref. 47. The decay time for the learning rate is

## Acknowledgments

This research was supported by the National Centre for Computational Design and Discovery of Novel Materials MARVEL, funded by the Swiss National Science Foundation, and European Union Grant ERC-2014-AdG-670227/VARMET. Calculations were carried out on the Euler cluster at ETH Zurich. We thank Michele Invernizzi for useful discussions and for carefully reading the paper.

## Footnotes

- ^{1}To whom correspondence may be addressed. Email: parrinello@phys.chem.ethz.ch.

Author contributions: L.B. and M.P. designed research; L.B. and Y.-Y.Z. performed research; L.B., Y.-Y.Z., and M.P. analyzed data; and L.B. and M.P. wrote the paper.

Reviewers: G.C., Flatiron Institute; and J.P., University of Washington.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1907975116/-/DCSupplemental.

Published under the PNAS license.

## References

- J. Behler
- L. Zhang, J. Han, H. Wang, R. Car, E. Weinan
- O. Valsson, P. Tiwary, M. Parrinello
- W. Chen, A. R. Tan, A. L. Ferguson
- M. Schöberl, N. Zabaras, P.-S. Koutsourelakis
- C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, V. S. Pande
- C. Wehmeyer, F. Noé
- D. Mendels, G. Piccini, M. Parrinello
- J. M. Lamim Ribeiro, P. Bravo, Y. Wang, P. Tiwary
- L. Mones, N. Bernstein, G. Csányi
- R. Galvelis, Y. Sugita
- H. Sidky, J. K. Whitmer
- L. Zhang, H. Wang, E. Weinan
- L. Bonati, M. Parrinello
- M. Invernizzi, O. Valsson, M. Parrinello
- P. M. Piaggi, O. Valsson, M. Parrinello
- P. M. Piaggi, M. Parrinello
- J. McCarty, O. Valsson, M. Parrinello
- M. Invernizzi, M. Parrinello
- I. Goodfellow, Y. Bengio, A. Courville
- A. Laio, M. Parrinello
- P. Shaffer, O. Valsson, M. Parrinello
- Y. A. LeCun, L. Bottou, G. B. Orr, K.-R. Müller
- G. Bakir, T. Hofmann, B. Schölkopf, A. Smola, B. Taskar
- Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang
- G. Carleo, M. Troyer
- F. Bach, E. Moulines
- D. P. Kingma, J. Ba
- L. Bonati, L. Zhang, M. Parrinello
- V. Hornak et al.
- S. Plimpton
- F. H. Stillinger, T. A. Weber
- P. M. Piaggi, M. Parrinello

## Article Classifications

- Physical Sciences
- Chemistry