# Combining experimental and simulation data of molecular processes via augmented Markov models

See allHide authors and affiliations

Edited by David Baker, University of Washington, Seattle, WA, and approved June 12, 2017 (received for review March 22, 2017)

## Significance

Structural biology is moving toward a paradigm characterized by data from a broad range of sources sensitive to multiple timescales and length scales. However, a major open problem remains: to devise an inference method that optimally combines all of this information into models amenable to human analysis. In this work, we make a significant step toward achieving this goal. We introduce a statistically rigorous method to merge information from molecular simulations and experimental data into augmented Markov models (AMMs). AMMs are easy-to-analyze atomistic descriptions of biomolecular structure and exchange kinetics. We show how AMMs may provide accurate descriptions of molecular dynamics probed by NMR spin relaxation and thereby provide a unique way to integrate analysis of experimental with simulation data.

## Abstract

Accurate mechanistic description of structural changes in biomolecules is an increasingly important topic in structural and chemical biology. Markov models have emerged as a powerful way to approximate the molecular kinetics of large biomolecules while keeping full structural resolution in a divide-and-conquer fashion. However, the accuracy of these models is limited by that of the force fields used to generate the underlying molecular dynamics (MD) simulation data. Whereas the quality of classical MD force fields has improved significantly in recent years, remaining errors in the Boltzmann weights are still on the order of a few

- molecular dynamics
- integrative structural biology
- maximum entropy
- Markov state models
- augmented Markov models

Atomistic molecular dynamics (MD) simulation is a popular tool to investigate mechanisms underlying biomolecular function, including ligand binding (1, 2) and allostery (3), whereas coarse-grained molecular models are often used when studying assembly and interactions of supramolecular systems (4, 5). With recent advances in massively paralleled computation, simulating thousands of short- to medium-length realizations of many biomolecular systems has become feasible (6, 7). Systematic analysis of such large sets of MD data in terms of Markov state models (MSMs) (8, 9) now enables the study of slow dynamic processes, including protein folding (10, 11), conformational transitions (12, 13), and quantitative comparison with experiments (14⇓–16) otherwise affordable only on special-purpose supercomputers (17).

Whereas these technologies are closing the gap as to which macromolecular systems and timescales can be directly simulated, it is becoming increasingly evident that systematic errors in empirical models—force fields—limit our ability to predict experimental data quantitatively (18). Although force fields are undergoing continuous improvement, high accuracy needs to be balanced with computational efficiency. A viable approach is to aim at force fields that are good enough such that they can be reweighted or biased by experimental data and thus make them quantitatively predictive (19, 20).

Key to this approach is a consistent framework to integrate simulation and experimental data. Popular approaches include using available experimental data to bias the empirical models during simulation in an ad hoc fashion (21), using Bayesian (22⇓⇓–25) or maximum entropy (26⇓–28) frameworks, or using a posteriori reweighting of the simulated ensembles (24, 29⇓–31). Although these approaches generally improve agreement with experimental data, they have several technical issues (32) and a number of intrinsic limitations: Biased simulation strategies make it difficult to reuse simulation data if new experimental data become available. Existing reweighting techniques come at the cost of losing dynamic information in the unbiased simulation data. Both approaches currently require that the simulation ensemble can be sampled directly and do not make use of MSMs to sample long timescales. Further, most existing approaches do not clearly distinguish between systematic (force-field) error and statistical (sampling) error, which may result in unexpected behavior or involve user-specified weighting factors. Several Markov state model estimators have been developed that are conditioned on auxiliary data, especially microscopic quantities such as the stationary distribution or functions of the transition probability matrix (33⇓⇓⇓–37). However, the direct augmentation of Markov state models with experimental data using a judicious treatment of force-field and sampling errors is still an open issue.

This work introduces augmented Markov models (AMMs). AMMs are MSMs that balance information from simulation and averaged experimental data during estimation. This balance is achieved by an information-theoretic treatment of systematic force-field errors and a probabilistic treatment of sampling and experimental errors. Using this approach, we show that AMMs more accurately recapitulate complementary stationary and dynamic experimental data than their MSM counterparts. AMMs are therefore a unique way to approach integrative structural biology, where information about the kinetics of conformational transitions is also accessible. This method moves the field closer to truly mechanistic data-driven models.

## Theory

### Augmented Markov Models.

MSMs describe the kinetics between a set of

Simulations rarely match the in situ conditions of experiments exactly; however, even if they could, we would still expect systematic errors in the Boltzmann weights of the simulation’s molecular configurations due to inaccuracies of the force field. As a result, quantities computed from simulation and measured by experiment will differ even in the limit of extensive MD sampling. Specifically, the experimental state probability

### Estimating AMMs from Simulation and Experimental Data.

Consider that we measure the expectation values of

Although we do not have direct access either to the true equilibrium distribution of the experimental ensemble

Now we can write an expression of the true experimental distribution *SI Appendix*, *Theory*):

Whereas Eq. **1** couples the experimental observations made in the ensemble *SI Appendix*, *Theory*). Furthermore, the likelihood can be combined with a prior on the transition counts to enable Bayesian inference (*SI Appendix*, *Theory*), which was used to compute AMM uncertainties in the present article.

We can maximize Eq. **2** with respect to the unknown parameters *SI Appendix*, *Theory*. The AMM estimator is available in the PyEMMA software, version 2.5 (www.pyemma.org) (40).

## Results

### Rescuing Biased Relaxation Dynamics in a Protein-Folding Model.

To illustrate the potential power of AMMs, we turn to an eight-state model of protein folding. The states represent different stages of folding of an idealized bundle of helices, a, b, and c, along multiple parallel folding pathways (Fig. 2*A* and *Materials and Methods*). For each state we assign a value to each of two observables: an average helicity, *SI Appendix*, Table S1). Consequently, given an equilibrium distribution of the states we can evaluate expectation values of these observables that may correspond to bulk circular dichroism and fluorescence quenching experiments, respectively. We generate a number of different variants of this model: a true reference model (*SI Appendix*, Table S1). As a result, the equilibrium distribution and kinetics of the biased models differ to an increasing degree from the true model.

For each model, we generated numerous simulation trajectories (*SI Appendix*, *SI Materials and Methods*). Indeed, the obtained distributions of observables *B* and *C*). Kinetic quantities such as the mean first passage times (mfpts) of unfolding and folding are also affected by the bias (Fig. 2 *D* and *E*). To recover from this bias, we combine the biased simulation data with an “experimental” measurement of the unbiased expectation values of *SI Appendix*, *Theory*).

Although we restrain only the mean helicity to the true expectation value, the AMMs accurately recover the equilibrium distributions of both the helicity and the fluorescence quenching observables (Fig. 2 *B* and *C*) and improve agreement with dynamic observables such as mfpts (Fig. 2 *D* and *E*). Thus, when the stationary distribution of the simulation model,

### Incorporating Experimental Data Increases the Similarity Between Protein Simulations in Different Force Fields.

Next, we estimate MSMs and AMMs of two previously published 1-ms trajectories of the native-state equilibrium fluctuations of ubiquitin, using the CHARMM22* (C22*) (42) and CHARMM-*h* (C-*h*) (43) force fields. Previous analyses of these simulations have shown that the stability of certain configurations is overestimated, either due to insufficient sampling or due to minor force-field inaccuracies (42, 43). Both force fields are among the state-of-the-art force fields for folded proteins and thus represent an interesting test case for AMMs. However, they are highly similar, differing only by a term to aid cooperativity of helix formation in C-*h* (44). Ubiquitin further serves as an ideal test system, as multiple sources of stationary and dynamic experimental data are readily available. These conditions put us in a a great position to thoroughly evaluate the presented methodology.

The simulations are embedded in a joint space of time-lagged independent components (TICs) (11, 45). TICs express slow order parameters in a system as linear combinations of a large number of input features or molecular descriptors. This space is then discretized into 256 nonoverlapping microstates, using *SI Appendix*, *SI Materials and Methods*). Transition counts were determined for each trajectory independently and used as input for estimation of MSMs and AMMs. The MSMs were tested for convergence and self-consistency before we proceeded to estimation of AMMs (*SI Appendix*, Fig. S1). The AMMs used observables from NMR spectroscopy as experimental restraints:

The two MD simulations have significant overlap in their densities in the TIC space. Yet there are regions that are sampled only in the C22* or only in the C-*h* simulation (Fig. 3*A*). This observation can be due to either insufficient sampling or the slightly different parameterizations of these force fields: The number of microstates visited in both trajectories is 188 or roughly 73% overlap in terms of states and 96% of the probability mass. We use the Jensen–Shannon divergence (*SI Appendix*, *SI Materials and Methods*). For the two MSMs we find *h* changes little (

### AMMs Reconcile Stationary and Dynamic Experimental Data.

The estimated AMMs show an improved agreement with the *SI Appendix*, Figs. S2 and S3) used in the estimation process compared with the MSMs estimated from simulation data alone. The improvement in RDCs for the C-*h* AMM may seem insignificant when judged by the Q factor only; however, a detailed inspection of the most strongly enforced experimental data restraints (largest *SI Appendix*, Fig. S4). The predicted ensemble averages of complementary data including cross-correlated relaxation (CCR) (51) and exact nuclear Overhauser enhancements (eNOEs) (52) show an on par agreement compared with their corresponding MSMs (*SI Appendix*, Table S2). This suggests that these experimental observables are largely insensitive to the changes in the equilibrium distributions in the AMMs relative to those in the MSMs and that we are not overrestraining when estimating the AMMs. Nevertheless the predictions of these experimental observables given by the MSMs and AMMs agree fairly well with their experimental values.

The slowest correlation times (*SI Appendix*, Table S2). Dielectric relaxation spectroscopy data of ubiquitin has identified a process with a timescale in the fast microsecond range (53), which indicates that the AMMs may be giving a more realistic kinetic description compared with the corresponding MSMs, although no kinetic data were used in the model generation.

This finding prompted us to analyze the molecular kinetics of the AMMs and MSMs more quantitatively by comparing predictions of NMR spin-relaxation data (14) with recently reported experimental measurements (54). These data are sensitive to the correlation times of conformational transitions, the structures of the metastable configurations involved in the transitions, and their relative populations (55). As the data are resolved at the single-spin level, they provide a direct means to validate the altered dynamics seen in the AMMs at atomic resolution. We compare with data recorded at 292 K and 308 K (54) (*SI Appendix*, Figs. S5 and S6), both fairly close to the simulation temperature of 300 K. It is apparent that the AMMs improve the overall agreement with these data compared with the MSMs. We quantify this impression by computing *E* and *F* and *SI Appendix*, *SI Materials and Methods*). AMMs have dramatically improved *h* (Fig. 4 *B* and *D* and *SI Appendix*, Fig. S5) and C22* (Fig. 4 *A* and *C* and *SI Appendix*, Fig. S6) force fields, although many *E* and *F*). Still, as these improvements correlate with improved agreement with stationary experimental observables (RDCs and

### Microscopic Differences in MSMs and AMMs.

Above we noted how integration of experimental data into the Markov model estimation procedure results in more similar models, compared with models trained on simulation data only. The latter observation opens the question of whether the underlying microscopic changes of the model also become more similar. To this end, we coarse grained the MSMs and AMMs into six metastable states, corresponding to density peaks in the equilibrium distribution (Fig. 3*A*). Five of these states consist of very similar molecular structures that overlap in the TIC projection between the two force fields. The F state is by far the most thermodynamically stable of these and closely resembles the configuration identified by X-ray and NMR structure determination. Metastable state A is very similar to F, but has a different backbone configuration in residues 58–62. The E state differs in its propensity to flip the configuration of residues 50–55. In states C and B one and two turns of the central

The probabilities of most of the metastable states change considerably in the AMMs relative to the corresponding MSMs (Fig. 3*C*). The combined population of the native-like states F and A increases in both cases, whereas the populations of all other metastable states are significantly reduced in the AMMs relative to the MSMs. This result mirrors the findings in *SI Appendix*, Fig. S4: The RDCs enforced the strongest in the AMMs are localized to residues 10–12, 30–40, and 45–55, which roughly correlates with the structural differences between states A and F and all of the others states. This illustrates how the same experimental restraints enforced in AMMs of different force fields roughly translate into consistent microscopic changes in the inferred AMMs, compared with their corresponding MSMs.

## Conclusion

We present a method to combine simulation data generated using an empirical force field with unknown systematic errors and averaged, noisy, experimental data into kinetic models of molecular dynamics: AMMs. We show how this approach can accurately recover thermodynamics and kinetics in a protein-folding model subject to varying degrees of systematic error. In addition, we find the method improves the accuracy of models derived from atomistic molecular dynamics simulations. In particular, when including experimental data via AMMs, both thermodynamics and kinetics extracted from simulation data using two different molecular mechanics force fields become more alike, and their agreement with complementary experimental data, both stationary and dynamic, improves.

The presented method makes it possible to preserve dynamic information following integration of stationary experimental data. This account shows how the integration of stationary experimental data in such a framework may also improve the prediction of complementary dynamic observables. These two results open up the possibility of establishing more elaborate models for integrative structural biology where full thermodynamic and kinetic descriptions may be obtained. Future developments may allow for the integration of dynamic observables, such as relaxation rates or correlation functions, into AMMs by adopting a maximum caliber functional (56) or Bayesian methods (33, 34) to account for the systematic errors in these quantities.

## Materials and Methods

### Protein-Folding Model System.

Seven eight-state Markov models were constructed for the different free-energy potentials *SI Appendix*, Table S1). For each one, transition probabilities were obtained using kinetic Monte Carlo by row normalization of a matrix with *A*, *k* is Boltzmann’s constant and *SI Appendix*, *SI Materials and Methods*.

### AMMs and MSMs of Ubiquitin.

Estimation and validation were carried as outlined in *Results* above; further details are given in *SI Appendix*, *SI Materials and Methods*. Uncertainties were estimated using a Bayesian scheme (*SI Appendix*, *Theory*), and AMMs and MSMs were compared using Bayes factors and the Jensen–Shannon divergence (*SI Appendix*, *SI Materials and Methods*).

## Acknowledgments

We thank D. E. Shaw Research for sharing the two ubiquitin simulations. We thank J. F. Rudzinski, T. Bereau, and G. Hummer for inspiring discussions. We gratefully acknowledge funding from Freie Universität Berlin and the European Commission (Dahlem Research School POINT-Marie Curie COFUND postdoctoral fellowship to S.O.), the European Research Council (ERC Starting Grant pcCell to F.N.), the Deutsche Forschungsgemeinschaft (NO 825/2-2, SFB 1114/A4, and SFB 1114/C3 to F.N.), the National Science Foundation (Grant CHE-1265929 to C.C.), and the Welch Foundation (Grant C-1570 to C.C.).

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: simon.olsson{at}fu-berlin.de or frank.noe{at}fu-berlin.de.

Author contributions: S.O. and F.N. designed research; S.O., H.W., F.P., C.C., and F.N. performed research; S.O., H.W., F.P., and C.C. contributed new reagents/analytic tools; S.O. analyzed data; and S.O. and F.N. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1704803114/-/DCSupplemental.

## References

- ↵
- ↵
- ↵.
- Bowman GR,
- Bolin ER,
- Hart KM,
- Maguire BC,
- Marqusee S

- ↵.
- van Eerden FJ, et al.

- ↵
- ↵.
- Pronk S, et al.

- ↵.
- Doerr S,
- Harvey MJ,
- Noé F,
- Fabritiis GD

- ↵.
- Bowman GR,
- Pande VS,
- Noé F

*An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation*, Advances in Experimental Medicine and Biology (Springer, Heidelberg), Vol 797. - ↵
- ↵.
- Noe F,
- Schutte C,
- Vanden-Eijnden E,
- Reich L,
- Weikl TR

- ↵
- ↵
- ↵
- ↵.
- Olsson S,
- Noé F

- ↵.
- Hart KM,
- Ho CMW,
- Dutta S,
- Gross ML,
- Bowman GR

- ↵.
- Noe F, et al.

- ↵.
- Shaw DE, et al.

- ↵
- ↵
- ↵.
- MacCallum JL,
- Perez A,
- Dill KA

- ↵
- ↵
- ↵
- ↵.
- Hummer G,
- Köfinger J

- ↵.
- Bonomi M,
- Camilloni C,
- Cavalli A,
- Vendruscolo M

- ↵
- ↵
- ↵
- ↵
- ↵.
- Olsson S,
- Strotz D,
- Vögeli B,
- Riek R,
- Cavalli A

- ↵.
- Leung HTA, et al.

- ↵.
- Olsson S,
- Cavalli A

- ↵.
- Rudzinski JF,
- Kremer K,
- Bereau T

- ↵
- ↵.
- Trendelkamp-Schroer B,
- Noé F

- ↵
- ↵.
- Trendelkamp-Schroer B,
- Noé F

- ↵
- ↵
- ↵
- ↵
- ↵.
- Piana S,
- Lindorff-Larsen K,
- Shaw DE

- ↵.
- Lindorff-Larsen K,
- Maragakis P,
- Piana S,
- Shaw DE

- ↵
- ↵
- ↵.
- Maltsev AS,
- Grishaev A,
- Roche J,
- Zasloff M,
- Bax A

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Smith CA, et al.

- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology