Weighting of experimental evidence in macromolecular structure determination
See allHide authors and affiliations

Edited by Axel T. Brunger, Stanford University, Stanford, CA, and approved December 15, 2005

↵*M.H. and W.R. contributed equally to this work. (received for review July 27, 2005)
Abstract
The determination of macromolecular structures requires weighting of experimental evidence relative to prior physical information. Although it can critically affect the quality of the calculated structures, experimental data are routinely weighted on an empirical basis. At present, crossvalidation is the most rigorous method to determine the best weight. We describe a general method to adaptively weight experimental data in the course of structure calculation. It is further shown that the necessity to define weights for the data can be completely alleviated. We demonstrate the method on a structure calculation from NMR data and find that the resulting structures are optimal in terms of accuracy and structural quality. Our method is devoid of the bias imposed by an empirical choice of the weight and has some advantages over estimating the weight by crossvalidation.
Experimental data are typically insufficient to determine a biomolecular structure in their own right but need to be complemented with prior physical information. Therefore, structure determination amounts to the search for conformations that have a low physical energy and that, at the same time, minimize a cost function E _{data} quantifying the disagreement between a structural model X and the data. This approach is implemented as minimization of a hybrid energy (1, 2) where the force field E _{phys} compensates a lack of data by imposing physical constraints on the structure. A target function of this form is widely used in macromolecular structure determination, notably from NMR data (3, 4) and from homologyderived restraints (5). The weight w _{data} controls the contribution of the data relative to the force field. Its value can be critical: If it is too large, the contribution of the force field might be too small to avoid overfitting; if the weight is too small, the data contribute too little to define the structure. The choice of the weight also concerns the question of how to judge structural quality. Overfitted structures reach a low R value (6, 7) but exhibit a poor stereochemistry or an unlikely fold.
Usually, experimental data are weighted empirically: w _{data} is set ad hoc and held constant during structure calculation. However, already when introducing the hybrid energy concept, Jack and Levitt (1) remarked that correct weighting of the data “is something of a problem.” They proposed to adjust the weight to equalize E _{phys} and w _{data} E _{data}; this adjustment was later refined, for example, in ref. 8. At present, the most rigorous quantitative method to determine the optimal weight is complete crossvalidation (6, 7). However, crossvalidation can become unstable and timeconsuming in the case of sparse and heterogeneous data with several independent weights.
In this work, we introduce an objective and unique way to weight experimental data. We show that a quantitative treatment does not necessitate heuristics like crossvalidation: everything we need is contained in the rules of probability theory.
Theory
Inferential Structure Determination.
We recently introduced a probabilistic approach to structure determination (9, 10), which permits the estimation of unknowns, such as theory parameters, in addition to the conformational degrees of freedom. We represent the unknown structure through a conditional probability p(X) = dP(XD, I)/dX that quantifies the likelihood of X being the true molecular structure in light of the data D and of relevant prior knowledge I. The posterior distribution p(X) spreads the uncertainty about the structure over the entire conformational space and peaks in regions where conformations are in accord with the data and the prior knowledge. Bayes’ theorem (11) states that the posterior distribution is proportional to the product of the likelihood function L(X) and the prior distribution π(X): p(X) ∝ L(X)π(X). The likelihood function derives from the probability of observing the measurements given the molecular structure, i.e., L(X) = P(DX, I). The conformational prior distribution π(X) = dP(XI)/dX represents general knowledge about the unknown structure of the target molecule. If the mean energy or likewise the temperature β^{−1} of the system is known, the Boltzmann distribution π(X) ∝ exp{−βE _{phys}(X)} is the least biasing prior distribution (12). The most probable conformations minimize the negative logarithm of the posterior distribution, and we can establish a formal analogy to hybrid energy minimization: −logπ corresponds to the force field E _{phys} because it describes a priori meaningful structures; −logL is similar to w _{data} E _{data} because it penalizes structures that do not fit the data.
We model the data as independent measurements and use a distance measure δ(y _{i}, y _{i}(X)) to evaluate the discrepancy between the ith observation y _{i} and its prediction y _{i}(X). Thus, the likelihood of the data is This likelihood function is of the leastsquares type with χ^{2}(X) = Σ_{i}[δ(y_{i} , y_{i} (X))]^{2} evaluating the average disagreement between backcalculated and observed data. The residual χ^{2} is minimal if the theory exactly matches the experiment; the factor Z(σ) normalizes the likelihood function with respect to the data (i.e., Z(σ) = ∏_{i}∫dy_{i}e ^{−(1/2σ2)}[δ(y _{i}, y _{i}(X))]^{2}). In case of Gaussian data, for example, we have δ(y_{i} , y_{i} (X)) = y_{i} − y_{i} (X) and obtain Z(σ) = (2πσ^{2}) ^{n} ^{/2}, χ ^{2} (X) = Σ_{i=1} ^{n} [y_{i} − y_{i} (X)]^{2}; σ is the standard deviation of the measurements.
Joint Posterior Distribution.
In general, the parameter σ evaluates to which extent the structure can be fit to the data and serves as a “unit” of the distance measure δ. It depends on both the quality of the data and the precision of the theory used to backcalculate the data. Therefore, σ can be viewed as an “error” that includes experimental noise as well as systematic contributions. In practice, this error is unknown, just like the coordinates. When evaluating the likelihood function for a conformation, we have to set σ and are facing the same dilemma as in the hybrid energy approach where w _{data} is unknown. Consequently, we need to consider the likelihood factor not only a function of the coordinates but also of the error. We symbolize this dependence explicitly through L _{joint}(X, σ) instead of L(X).
The essence of Bayesian inference is that probabilities can be attributed to any statement, not only to those concerning “random variables.” Probabilities express ignorance. If both X and σ are unknown, the bearing of the data on them is quantified by a joint posterior distribution p _{joint}(X, σ); p _{joint} is a probability distribution for the unknown coordinates and the unknown error. Formally, this distribution is obtained by replacing X with (X, σ) in Bayes’ theorem. To this end, we introduce a joint prior distribution π(X, σ) = π(σX)π(X). Because knowledge of the coordinates has no bearing on the error, we obtain as joint posterior distribution for all unknowns, i.e., for X and σ. Usually, concrete prior knowledge about the error is lacking. However, we know that σ is positive and that its value has no absolute meaning, because it depends on the units of the distance measure δ. Changes in the units of δ can be compensated by scaling the error appropriately. Therefore, π(σ) should be invariant under scaling, leading to π(σ) = σ^{−1} (13).
The joint posterior distribution p _{joint} summarizes our knowledge about the structure and the error, and all inferences on these unknowns can be derived from p _{joint}. Usually, one is primarily interested in the coordinates and not in the error. The statistically correct way to eliminate the error is to integrate over all possible values of σ. This socalled marginalization (11) projects the joint posterior distribution to conformational space: The marginal posterior distribution p _{marginal}(X) = ∫dσp _{joint}(X, σ) no longer involves an error parameter. We can either use p _{marginal}(X) or the integrated likelihood function (14) L _{marginal}(X) = ∫dσL _{joint}(X, σ)π(σ) to determine the structure. Often, marginalization integrals can be solved analytically. Complicated models, however, require the use of numerical integration techniques. Analytical calculation of L _{marginal}(X) avoids the problem of choosing an appropriate weight from the start. The error is then only introduced to devise the likelihood function and eliminated afterward by marginalization.
Results
To demonstrate the outlined formalism, we analyzed NMRderived distance measurements for the protein ubiquitin (15) (Protein Data Bank ID code 1d3z). We extracted from the deposited 2, 727 interproton distances 1, 444 nonredundant entries by retaining only the smallest distance in case multiple measurements were available for the same pair of protons. Because distances are nonnegative, we model deviations between n measured and calculated distances with a lognormal distribution (16): for this choice χ^{2}(X) = Σ_{i}log^{2}[d_{i} /d_{i} (X)] and Z(σ) = (2πσ^{2})^{n/2}. We used posterior simulation techniques (17) to calculate structures and to simultaneously estimate the error of the lognormal model. Structures were parameterized in torsion angles; nonbonded interactions were represented with a purely repulsive potential (18). Simulations of the posterior distributions were carried out with our software for inferential structure determination (isd; M.H. and W.R., unpublished results) using the random sampling strategies outlined in refs. 9, 10, and 19.
Impact of the Weight on Structural Quality.
We first calculated structures for fixed weights by simulating the conditional conformational posterior distribution p
_{joint}(X, σ = 1/
Probabilistic Interpretation.
Because the logarithm is a monotonically increasing function, maximization of the posterior probability can be achieved by minimizing its logarithm. Therefore, the negative logarithm of the joint posterior distribution, −logp _{joint}(X, σ) = −log[L _{joint}(X, σ)π(X)π(σ)], can be interpreted as a joint hybrid energy E _{joint}(X, σ) now depending not only on the coordinates but also on the error. If we insert L _{joint} and our choices for the prior distributions π(X) (Boltzmann factor) and π(σ) and neglect constants that neither depend on the structure nor on the error, we obtain as an extended joint hybrid energy. The equivalences E _{data}(X) = χ^{2}(X)/2 and w _{data} = 1/σ^{2} clarify the nature of the weight: It is not merely a fudge factor but quantifies the quality of the data and the reliability of the theoretical model used to predict them. If observed and backcalculated data are in good agreement, σ will be small and w _{data} large. In case of disagreement, σ will be big and w _{data} small. By choosing the error appropriately, we balance the contribution of experimental and prior information. The joint hybrid energy (Eq. 5 ) E _{joint}(X, σ) = E _{hybrid}(X) + log[Z(σ)/π(σ)] contains a term, log[Z(σ)/π(σ)], not included in the standard target function E_{hybrid} (Eq. 1 ). Only this additional term allows us to determine the error. To derive this “regularizer” for σ, the use of a probabilistic framework for structure determination is indispensible: Z(σ) originates in the normalization of P(DX, I), π(σ) is required by Bayes’ theorem: both terms are missing in purely optimizationbased approaches where normalization constants and prior probabilities are usually not incorporated.
Estimation of the Weight.
For Model 4, it holds that Z(σ)/π(σ) ∝ σ
^{n}
^{+1}. In the joint target function E
_{joint} (Eq. 5
), two contributions counterbalance each other: χ^{2}/σ^{2} decreases when σ increases, thus preferring large values for the error when E
_{hybrid} is minimized with respect to the error. The additional term log[Z(σ)/π(σ)] = (n + 1)logσ is monotonically increasing and favors small errors (see Fig. 2). Thus, only when log[Z(σ)/π(σ)] is added to the standard hybrid energy, one obtains a joint hybrid energy that exhibits a finite minimum in σ and can directly be used to determine the error or, likewise, the weight from the data. Minimization of the resulting joint hybrid energy E
_{joint}(X, σ) yields the most probable structure X
_{max} and the most probable error σ_{max}. In case of model 4, we have σ_{max} =
In our approach, we do not minimize E
_{joint}(X, σ) but rather draw random samples from the joint posterior distribution p
_{joint}(X, σ) using a Gibbs sampling scheme (24). In this scheme, samples of the coordinates and the error are drawn in an iterative fashion by alternately setting σ or X to the previous sample in the joint posterior distribution. When the error is fixed in p
_{joint}(X, σ), we again obtain a distribution that is proportional to exp{−E
_{hybrid}(X)} and that can be sampled using the hybrid Monte Carlo method. If we fix the coordinates to the most recent conformational sample, we obtain a probability distribution for the error that is proportional to σ^{−(n+1)}exp{−χ^{2}(X)/2σ^{2}}. By substituting σ with 1/
We thus can sample the weight or, likewise, the error using a random number generator for the gamma distribution. The resulting histogram is shown in Fig. 1 and demonstrates that the coordinates and the weight can be estimated simultaneously. The optimal weights sampled by our algorithm lie within the region where the what if quality scores reach their best values. Moreover, the weights scatter around a most likely value (≈40) leading to conformations that are closest to the xray structure when the weight is fixed during structure calculation. Thus, our algorithm adapts the weight to yield optimal structures in terms of accuracy (rmsd to the xray structure).
How is it possible to estimate the coordinates and the error simultaneously? Intuitively, the true molecular structure minimizes the deviations between observed and calculated data. If we knew the correct structure, the weight would just reflect the width of the distribution of deviations between observations and predictions. Assuming no systematic errors, the distribution of the discrepancies δ_{i} = log[d
_{i}
/d
_{i}
(X)] will ideally be a zerocentered Gaussian with standard deviation
This intuitive behavior of the weight is contained in our formulation. As outlined before, the posterior distribution of the weight given the structure is a gamma distribution (cf. Eq. 6
). Thus, we obtain 〈w
_{data}〉 = n/χ^{2}(X) as an estimate for the unknown weight; this estimate is identical to the intuitive estimate 1/
Elimination of the Weight.
As mentioned in Theory, it is sometimes possible to integrate out the error in the likelihood function. For Model 4, this calculation is straightforward: The marginal likelihood is proportional to ∫dσσ^{−(n+1)}exp{−χ^{2}(X)/2σ^{2}}; like in the derivation of Eq. 6
we can substitute σ with 1∕
Because the weight is not completely determined by the data but adopts different values for different structures, the marginal hybrid energy is less pronounced than the standard hybrid energy: The leastsquares residual is transformed logarithmically and weighted with the number of measurements. The marginal hybrid energy (Eq. 7 ) is less biased than the hybrid energy (Eq. 1 ) because it relies on the data only and does not assume knowledge of the weight.
There exists a close relation to the joint hybrid energy (Eq. 5
). We can turn the argument that led to our estimate 〈w
_{data}〉 around by stating that every conformation X requires its own weight n/χ^{2}(X). Therefore, one can eliminate the weight heuristically by considering E
_{joint}(X, σ =
Regarding the conformational degrees of freedom, the marginal posterior distribution p _{marginal}(X) and the joint posterior distribution p _{joint}(X, σ) contain the same information: Sampling p _{joint}(X, σ) can be viewed as numerically integrating out the error. A simulation of p _{marginal}(X) confirms the equivalence of the joint and the marginal posterior distribution also in practice (Fig. 4). For both simulations, we obtain a rmsd of 0.63 ± 0.06 Å to the crystal structure (the optimal value w _{data} = 40 resulted in a rmsd of 0.60 ± 0.05 Å). The similarity of the distribution of rmsd, radius of gyration, and χ^{2} indicates the equivalence of both simulations.
Comparison with CrossValidation.
Crossvalidation (7) is based on the idea that the weight should be chosen such that overfitting is prevented. The data are divided into two nonoverlapping sets: a working set A that is used for structure calculation and a test set T for which the “free” R value is calculated to assess a certain choice of the weight. Ideally, the optimal weight minimizes the free R value. We randomly divided the 1, 444 distances into 10 sets of approximately equal size and carried out a complete 10fold crossvalidation. We used a simulated annealing protocol (25) implemented in cns (26) to calculate structures for each working set. Because the cost function resulting from the lognormal distribution (4) is not implemented in cns, we used a Gaussian error distribution leading to harmonic terms. For comparison, we simulated the joint posterior distribution for a Gaussian likelihood, i.e., δ _{i} = d_{i} − d_{i} (X), using the isd software.
The result of a complete crossvalidation is shown in Fig. 5. By increasing the weight, one can reduce the residual of the working set to smaller and smaller values. In contrast, the fit of the test set becomes slightly worse if the weight is too large (Fig. 5 A). The standard R value, χ_{T} ^{2}/Σ_{i∈T} d _{i} ^{2}, is proportional to the residual χ_{T} ^{2} = Σ_{i∈T}[d_{i} − d_{i} (X)]^{2} of the test set. For NOESY data, Brünger et al. (7) used R _{1/6} = Σ_{i∈T}d _{i} ^{−1} − d _{i} ^{−1}(X)/Σ_{i∈T} d _{i} ^{−1}. Neither of the two free R value curves (Fig. 5 B) exhibits a pronounced minimum. Therefore, the choice of the weight by crossvalidation remains ambiguous. In contrast, the Bayesian result is very clearcut. It avoids overfitting by weighting the data as little as possible while maintaining structures of high quality. Fig. 5 C again shows that the Bayesian choice is also optimal in terms of accuracy. The distribution of the rmsd to the crystal structure obtained with the Bayesian calculation concentrates at low values and agrees well with the result obtained with crossvalidation. As can be seen from the diagram, accuracy changes only little for different weights. For reasons discussed above, we nevertheless obtain sharp posterior histograms for the weight, which demonstrates the high sensitivity of a Bayesian approach.
Further advantages of our approach over crossvalidation for the purpose of setting the weight are as follows. First, crossvalidation contains unspecified elements that still depend on the choice of the user and are not solely determined by the data. The data need to be partitioned, and the choice of a particular weight has to be assessed by an appropriate R value. Because NMR data are usually preprocessed, several R values have been proposed (27), and there exists no consensus as to which R value performs best. Second, crossvalidation will be impractical for sparse data because of the reduced size of the working set, whereas the Bayesian algorithm is still stable. All operations such as analytical marginalization or posterior sampling are also valid for data sets much smaller than the one used in the calculations shown here. For example, in ref. 9 we report on a structure determination with only 10% of the data used here for a protein of similar size. Finally, we expect the Bayesian approach to be significantly more efficient than crossvalidation. Currently, our implementation relies on posterior sampling techniques that require more computational resources than standard structure calculation methods. However, our findings directly apply to minimization approaches. Instead of minimizing E _{hybrid} with w _{data} fixed by some empirical rule, we propose to minimize E _{joint} or likewise E _{marginal} using, for example, simulated annealing. Minimization of E _{joint} or E _{marginal} is of the same complexity as a minimization of E _{hybrid}. Therefore, a single minimization of a probabilistically motivated target function can produce the same information as crossvalidation calculations that require a whole set of minimization runs.
Conclusions
By using Bayesian inference, we resolved the issue of weighting experimental data in macromolecular structure determination and demonstrated the method on a structure calculation from NMR data. Probability calculus provides definite rules to determine the unknown data weight: It can either be estimated from the data or eliminated analytically by marginalization. Both strategies are equivalent and also could be implemented in minimizationbased computer programs.
Our method is objective in the sense that both the weight and the structures are uniquely determined by the experimental data at hand and some additional, but required, assumptions such as the choice of the force field and of the theory used to backcalculate the data.
The probabilistic model describing the data also provides a clear interpretation of the weight as being related to the average discrepancy between observed and calculated data. Thus, the weight quantifies limitations both of the data and of the theory.
The formalism can be applied to a large class of structure calculation problems that are based on hybrid energy minimization. The data term translates into the likelihood function; its normalization constant is obtained by integrating exp{−w _{data} E _{data}(X)} over the data. It is important to incorporate this term into the hybrid energy. One can then use the joint target function (Eq. 5 ) to determine the coordinates and the optimal weight simultaneously. For example, the formalism is directly applicable to homology modeling and to structure calculation from restraints obtained from FRET or mutagenesis experiments.
However, the proposed formalism does not necessitate the application of posterior sampling techniques. Instead the negative logposterior distributions could be used for minimizationbased structure calculation. In analogy to the Gibbs sampling procedure, the joint hybrid energy E _{joint} could be minimized by periodically updating the weight with the estimate χ^{2}/n during a simulated annealing run. Alternatively, one could eliminate the weight analytically and minimize E _{marginal}.
A further advantage of our method is that multiple data sets can be treated in the same way. The Bayesian result is similar to that of a complete crossvalidation but more clearcut and stable. In addition to an estimate of the weight, we obtain its reliability. The estimated weight can be used to assess the quality of the data and may serve as a figure of merit similar to the free R value.
Acknowledgments
This work was supported by European Union Grants QLG2CT200001313 and QLG2CT200200988.
Footnotes
 ^{§}To whom correspondence should be addressed. Email: nilges{at}pasteur.fr

↵ ^{†}Present address: Max Planck Institute for Developmental Biology, Spemannstrasse 35 and Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany.

↵ ^{‡}Present address: Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, United Kingdom.

Author contributions: M.H., W.R., and M.N. designed research; M.H. and W.R. performed research; M.H. and W.R. contributed new reagents/analytic tools; M.H. and W.R. analyzed data; and M.H. and W.R. wrote the paper.

Conflict of interest statement: No conflicts declared.

This paper was submitted directly (Track II) to the PNAS office.
 Abbreviation:
 rmsd,
 rms difference
 © 2006 by The National Academy of Sciences of the USA
References
 ↵
 ↵
 ↵

↵
 Clore G. M. ,
 Gronenborn A. M.
 ↵
 ↵

↵
 Brünger A. T. ,
 Clore G. M. ,
 Gronenborn A. M. ,
 Saffrich R. ,
 Nilges M.

↵
 Adams P. D. ,
 Pannu N. S. ,
 Read R. J. ,
 Brünger A. T.

↵
 Rieping W. ,
 Habeck M. ,
 Nilges M.

↵
 Habeck M. ,
 Nilges M. ,
 Rieping W.

↵
 Jaynes E. T.
 ↵

↵
 Jeffreys H.

↵
 Berger J. O. ,
 Liseo B. ,
 Wolpert R. L.
 ↵
 ↵

↵
 Chen M. H. ,
 Shao Q. M. ,
 Ibrahim J. G.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵