New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Crossvalidated maximum likelihood enhances crystallographic simulated annealing refinement
Abstract
Recently, the target function for crystallographic refinement has been improved through a maximum likelihood analysis, which makes proper allowance for the effects of data quality, model errors, and incompleteness. The maximum likelihood target reduces the significance of false local minima during the refinement process, but it does not completely eliminate them, necessitating the use of stochastic optimization methods such as simulated annealing for poor initial models. It is shown that the combination of maximum likelihood with crossvalidation, which reduces overfitting, and simulated annealing by torsion angle molecular dynamics, which simplifies the conformational search problem, results in a major improvement of the radius of convergence of refinement and the accuracy of the refined structure. Torsion angle molecular dynamics and the maximum likelihood target function interact synergistically, the combination of both methods being significantly more powerful than each method individually. This is demonstrated in realistic test cases at two typical minimum Bragg spacings (d_{min} = 2.0 and 2.8 Å, respectively), illustrating the broad applicability of the combined method. In an application to the refinement of a new crystal structure, the combined method automatically corrected a mistraced loop in a poor initial model, moving the backbone by 4 Å.
The aim of macromolecular crystallography is to obtain an accurate atomic model based on the observed diffraction data. This model needs to be optimized to obtain the best agreement with the observed diffraction data. Several specialized optimization methods have been developed to refine macromolecules, including partial and full matrix leastsquares (1, 2), conjugate gradient minimization (3), and molecular dynamicsbased simulated annealing (4, 5). However, the refinement of macromolecular structures is often difficult for several reasons. First, the datatoparameter ratio is low, creating the danger of overfitting the diffraction data. Therefore, the apparent ratio of data to parameters is often increased by incorporation of chemical information, i.e., bond length and bond angle restraints obtained from ideal values seen in high resolution structures (1). Second, the initial model is often poor because of the typically limited quality of experimental phases, and there is, therefore, some discrepancy to the actual structure of the crystallized molecule. Third, local (false) minima exist in the target function. The more local minima there are and the deeper the minima are, the more likely refinement will fail. Fourth, model bias in the electron density maps complicates the process of manual refitting between cycles of automated refinement. This problem is exacerbated by overfitting the diffraction data in the refinement process.
Methods have been devised to address these difficulties. Crossvalidation, in the form of the free Rvalue, can be used to detect overfitting (6–8). The radius of convergence of refinement can be increased by the use of stochastic optimization methods such as molecular dynamicsbased simulated annealing (4). Introduction of torsion anglebased approaches (9, 10) reduce the number of degrees of freedom by making the reasonable assumption that bond lengths and angles are approximately constant. The implementation of this method within the framework of molecular dynamics has been shown to increase dramatically the radius of convergence of refinement and to decrease overfitting (10, 11).
As with any optimization problem, the efficiency of the refinement method depends on the complexity of the landscape of the target function. This landscape is also influenced by the way model incompleteness and errors are modelled. The most commonly used target function (E^{LSQ}) for macromolecular refinement employs the leastsquares residual for the diffraction data, 1 where F_{o} and F_{c} are the observed and calculated structurefactor amplitudes, k is a relative scale factor, w_{a} is a weight, and E_{restraints} are geometric (bond length, bond angle, and atomic repulsion) restraints. A decrease of this function can sometimes be due to accumulation of systematic errors in the model without improvement or even a worsening of the model (12). The underlying reason can be found in the fact that the leastsquares residual does not account for the effects of phase errors in the calculated structure factors, so it is poorly justified when the model is far away from the correct answer or incomplete (13). A more appropriate target for macromolecular refinement can be obtained through a maximum likelihood formulation (14–16), and such implementations have been recently described (13, 17, 18).
The goal of the maximum likelihood method is to determine the probability of making a set of measurements, given the model, and estimates of its errors and of errors in the measured intensities. The effects of model errors (misplaced atoms and missing atoms) on the calculated structure factors are first quantified with σ_{A} values, which correspond roughly to the fraction of each structure factor that is expected to be correct (13, 19). To achieve an improvement over the leastsquares residual (Eq. 1), crossvalidation (6, 8) was used for the computation of σ_{A}, necessitating its calculation with a randomly selected test set of diffraction data that was never included in the refinement process (13). The need for crossvalidation is demonstrated in Results. The crossvalidated σ_{A} values (σ_{A}^{cv}) are then used to compute the expected value of 〈F_{o}〉^{cv} and the corresponding variance (σ_{MLcv}^{2}) (13). 〈F_{o}〉^{cv} for the acentric case is given by 2 For the centric case, the expected value is 3 where ɛ is the expected intensity factor, σ_{Δ}^{2} = Σ_{N} − D^{2}Σ_{P}, Σ_{N} is the distribution parameter of the Wilson intensity distribution for F_{o} (20) and Σ_{P} is the distribution parameter of the Wilson intensity distribution for F_{c}, and Φ(a, b, x) is Kummer’s confluent hypergeometric function. The parameters Σ_{N}, Σ_{P}, and D are calculated using crossvalidated data. 〈F_{o}〉^{cv} and σ_{MLcv}^{2} can be readily incorporated into a maximum likelihood target function, 4 Fig. 1 shows the improvement of the target function by maximum likelihood; the false minimum is less pronounced compared with the leastsquares residual. Despite this major improvement, local minima still exist (Fig. 1), which causes a limited radius of convergence when gradient descent optimization methods are used. Here we show that torsion angle molecular dynamics and the crossvalidated maximum likelihood target E^{ML}, each of which independently improves the refinement of poor models, combine synergistically to significantly improve macromolecular crystallographic refinement.
METHODS
Refinement Protocol.
A maximum likelihood function using crossvalidated σ_{A} weighting (19) and experimental amplitudes has been described previously (13). Several improvements to the original method were implemented for the smoothing of the crossvalidated σ_{A} function, the determination of the optimal weight between experimental and geometric restraints (w_{a}), and automation of the refinement procedure.
The σ_{A} function is used to obtain an estimate of the effect of model errors and incompleteness. To make this estimate as unbiased as possible, crossvalidation is used by setting aside a randomly selected set of typically 10% of the diffraction data for the calculation of σ_{A}^{cv}, while the remaining 90% is used in the target function (“working set,” Eq. 4) (6, 8, 13). The relatively small number of reflections in the test set results in large fluctuations in the σ_{A}^{cv} values when calculated as a function of resolution. Therefore, during the optimization of likelihood as a function of σ_{A}^{cv}, a harmonic term is added to the target to restrain the resolutiondependent σ_{A}^{cv} values to the average of the two neighboring resolution bins. This restraint serves to smooth the σ_{A}^{cv} function.
The optimal weight (w_{a}) between the xray and geometric terms (Eqs. 1 and 4) was estimated by 5 after an unrestrained (i.e., with w_{a} set to 0 in Eqs. 1 and 4) molecular dynamics simulation (4). For the maximum likelihood target (Eq. 4), it became clear that the choice of w_{a} is very important, in contrast to leastsquares residual based refinements where changes of up to a factor of 2 in w_{a} make little difference (unpublished data). Therefore, an automated method was developed to adjust w_{a} during the course of refinement. At stages in the refinement protocol where the model may have significantly changed, the new weight (w_{a}) between the geometric and xray terms was recalculated (using Eq. 5). For the maximum likelihood target, the σ_{A} values were recalculated before recalculation of w_{a}. The points at which model changes were greatest were after initial minimization and simulated annealing (Fig. 2).
Details of this new highly automated protocol are shown in Fig. 2. Note that the protocol does not require any “preparation” stage, minimizing the possibility of user error. Simulated annealing refinements were repeated 10 times with different initial velocities. A modified version of the parameter set of Engh and Huber (21) was used for all refinements, where the nonbonded terms were modified to be similar to those implemented in prolsq using a quartic, repulsive nonbonded energy function (1, 22). Temperatures during simulated annealing were maintained at a constant value by coupling to a temperature bath (23).
Refinements were carried out with a developmental program that combines the maximum likelihood target with both simulated annealing using torsion angle molecular dynamics and conjugate gradient minimization (which will be made available upon request from A.T.B.).
Test Cases.
Test refinements were carried out using the crystal structure of amylase inhibitor (24). Unit cell dimensions in spacegroup P2_{1}2_{1}2_{1} are a = 61.76 Å, b = 40.73 Å, and c = 26.74 Å. Diffraction data were measured for the resolution range 17–2.0 Å, and all data with F < 2σ_{F} were discarded for comparison with the published refined structure (24). Refinements using the maximum likelihood target (Eq. 4) used all data from 17 to 2Å resolution. For this test case, use of the bulk solvent model with the leastsquares residual gave worse results than use of a truncated data range. This was probably due to the large coordinate error for many of the starting models and the relatively low solvent content (30%) of the crystal. Therefore, no bulk solvent model and a truncated data range of 8–2 Å was used in the leastsquares refinements described below.
Models with increasing coordinate error were generated by performing an unrestrained molecular dynamics simulation at 600 K starting from the published crystal structure (10). The resulting scrambled models were energy minimized to achieve starting models with good geometry and nonbonded interactions. Five models were obtained with rms backbone (C, CA, N) errors between 0.6 and 1.9 Å from the crystal structure.
Molecular Replacement Is Successful with the Models Scrambled by Molecular Dynamics.
To ensure that the resulting scrambled models represent realistic starting structures, molecular replacement was carried out using the experimental amylase inhibitor data. The molecular replacement protocol consisted of the direct rotation function (25) followed by translation searches (26) around the top 10 grid points of the rotation function computed using data from 15 to 3.5Å resolution. Briefly, a 10° grid search in Euler angle space using 5° steps was carried out around each of the top 10 grid points, resulting in 1,250 translation searches. The top translation function solutions were rigid body refined and scored according to the Patterson correlation coefficient (27). The correct solution was obtained in all five cases studied both with a fullatom and a polyalanine model. The signaltonoise ratio (ratio of the Patterson correlation coefficient for the correct and highest incorrect solution) ranged from 1.44 for the best model to 0.93 for the worst model. The correct solution was always the top translation search peak except in the case of the worst search model, which had a backbone coordinate rms deviation of 1.9 Å. However, torsion angle molecular dynamics simulated annealing refinement, using the maximum likelihood target, of the top two peaks for the worst model clearly indicated that the correct solution was the second peak, as assessed by both the Rvalue and free Rvalue. Thus, all five scrambled models contain sufficient information to solve the crystal structure through an extensive molecular replacement search.
The models after refinement were compared with the published amylase inhibitor crystal structure (24). The unweighted phase error was calculated for all data (17–2 Å or 17–2.8 Å). The rms error for backbone atoms (excluding residues 1–4, which are very flexible as indicated by their high Bvalues) was also calculated.
Application to a New Crystal Structure.
A partially refined model of the recently solved crystal structure of human hnRNP A1 (28) was used. This small RNA binding protein was crystallized in spacegroup P2_{1} with unit cell dimension a = 38.1 Å, b = 44.0 Å, c = 56.1 Å, and β = 94.8°. The initial free Rvalue for the model, against a lower resolution data set to 2.2 Å, was 46.9%, and the Rvalue was 38.6%. Conjugate gradient minimization and simulated annealing using torsion angle dynamics from a starting temperature of 10,000 K were carried out. The full resolution range of 35–2.2 Å, with no weak data excluded, was used in refinements using either the leastsquares or maximum likelihood targets. A bulk solvent correction (29) was applied (k_{sol} = 0.42 electron/Å^{3} and B_{sol} = 154.8 Å^{2}), resulting in a major improvement in both Rvalue and free Rvalue, in contrast to the amylase inhibitor test case. Refinements were repeated five times, with different initial velocities.
RESULTS
Test Case with All Atoms.
The models placed according to the molecular replacement search solutions were used as starting points for refinement against the amylase inhibitor diffraction data. Both conjugate gradient minimization and simulated annealing with torsion angle molecular dynamics were carried out to compare the performance of the maximum likelihood target against the leastsquares residual. Fig. 3 shows that the maximum likelihood target produces models with lower phase error. For conjugate gradient minimization, the average phase improvement, compared with the leastsquares residual, is approximately 5°. It can also be seen that torsion angle molecular dynamics using the leastsquares residual has a larger radius of convergence than minimization with the maximum likelihood target, and that the convergence of torsion angle molecular dynamics is further improved by the use of the maximum likelihood target (Fig. 3). For the most scrambled model, an average phase improvement, compared with the leastsquares residual, of more than 15° is obtained. The resulting structures are very close to the published crystal structure (Fig. 4). The large spread in structures observed for the leastsquares residual refinement is due to limited radius of convergence and, in this particular case, should not be interpreted as conformational variability in the structure.
Local minima cause gradient descent methods to become trapped, whereas simulated annealingbased methods can partially overcome this problem. This clearly also applies to the maximum likelihood target (Fig. 3). Extensive minimization consisting of 10 macrocycles of conjugate gradient minimization, each of 200 steps followed by an update of σ_{A} and its derived quantities, was carried out for the model with 1.53 Å rms initial backbone coordinate error (a total of 2,000 steps of minimization, with the estimates of σ_{A}^{cv} being updated every 200 steps). The free Rvalue and phase error were decreased (Table 1). However, the resulting structures are much poorer than those obtainable by simulated annealing (Fig. 3).
Crossvalidation is essential in the calculation of the maximum likelihood target (8, 13). Its importance was demonstrated by using all reflections in the calculation of σ_{A}^{cv}. Repeated conjugate gradient refinements were performed as above, interspersed by σ_{A}^{cv} and w_{a} updates, for the model with 1.53 Å rms initial backbone coordinate error. The refinement without crossvalidation gave much poorer results, as indicated by the high free Rvalue and phase error (Table 1). In addition, the difference between the Rvalue and free Rvalue is significantly larger (8.1%) compared with the refinement where crossvalidation was used (6.6%). Therefore, crossvalidation is essential to obtain good results with this maximum likelihood target.
Test Case with Low Resolution Data.
The molecular replacement solutions for the five scrambled structures were used also in test refinements against lower resolution diffraction data that were created from the observed amylase inhibitor amplitudes by the application of a Bfactor of 45 Å^{2} to both amplitudes (F) and σ_{F} values. Refinements were carried out as before, starting with the 3.5Å molecular replacement solution, but at the reduced resolution range of 17–2.8 Å for the maximum likelihood target and 8–2.8 Å for the leastsquares target against this lower resolution data set. As for the higher resolution refinements, the maximum likelihood target produces models with lower phase error in most cases (Fig. 5). For smaller initial coordinate error, the results for either target are comparable. However, at higher initial error, maximum likelihood only gives superior results when used in combination with simulated annealing.
Test Case with PolyAlanine Models.
To thoroughly test the limits of the methods, both molecular replacement and refinement were repeated using polyalanineonly models. Molecular replacement was successful in all five cases (with signaltonoise ratios ranging from 1.57 to 1.07). After refinement against the amylase inhibitor diffraction data, the final phase errors show that the combination of simulated annealing and the maximum likelihood target are again consistently better except for the worst initial model (Fig. 6). In this latter case all methods fail to converge. However, torsion angle molecular dynamics simulated annealing in combination with the maximum likelihood target is able to successfully refine models with initial rms deviation errors of between 1.3 Å and 1.5 Å (Fig. 6). The leastsquares residual is unable to improve these models.
Application to a New Crystal Structure.
A partially refined model of the recently solved crystal structure of human hnRNP A1 (28) was refined at 2.2Å resolution using simulated annealing with both the leastsquares residual and the maximum likelihood target. This structure had proved difficult to refine, partially because of some large initial errors in the model. Refinement of the structure was only successful using a combination of simulated annealing, torsion angle molecular dynamics and multiple refinements followed by averaging (L. M. Rice, Y. Shamoo, and A.T.B., unpublished work). A model from the early stages of refinement, therefore, provided a good test case for the new maximum likelihood target.
Conjugate gradient minimization refinement with the maximum likelihood target and the leastsquares residual target both reduced the free Rvalue to 44.4%. The maximum likelihood target, however, reduces the amount of model bias to the incorrect initial model (Fig. 7b). This is apparent in the smaller difference between the free Rvalue and Rvalue (6%) compared with the leastsquares residual target (9%).
Torsion angle molecular dynamics simulated annealing refinements with the maximum likelihood target and the leastsquares residual target both reduced the free Rvalue compared with conjugate gradient minimization, with average values over five refinements of 42.4% and 42.9%, respectively. However, the structure with the lowest free Rvalue was produced by torsion angle dynamics with the maximum likelihood target (free Rvalue = 40.9%, Rvalue = 36.2%). In this structure, a mistraced loop was automatically corrected by a backbone movement of 4 Å (Fig. 7d), with a resulting rms deviation from the final model for C^{α} atoms in this loop region of 0.98 Å. In contrast, refinements with the leastsquares residual made no significant change to this loop, with the resulting rms deviation for C^{α} atoms being 2.5 Å. In addition, the lowest free Rvalue from the leastsquares target (free Rvalue = 41.8%, Rvalue = 33.7%) shows a high level of model bias in the electron density map (Fig. 7a). The other torsion angle refinements with the maximum likelihood target, despite only small movements in this loop region, show significantly less model bias than the leastsquares refinements (Fig. 7c).
DISCUSSION
The crossvalidated maximum likelihood target (Eq. 4) is most powerful when used in combination with torsion angle simulated annealing. Although this new target also improves conjugate gradient minimization based refinement, the radius of convergence is still smaller than that of torsion angle simulated annealing. Even with the leastsquares residual (Eq. 1), simulated annealing is superior to minimization with maximum likelihood. However, the addition of the maximum likelihood target significantly improves the radius of convergence of torsion angle simulated annealing and reduces overfitting. In this series of tests, refinement of structures with initial rms backbone errors up to 1.8 Å converge to the correct solution. These examples are typical for weak molecular replacement solutions where the initial phases are so poor that the refinement method has to be relied upon to fully solve the structure.
A fundamental problem with the leastsquares residual (Eq. 1) is the danger of overfitting the diffraction data. Systematic errors that reduce the difference between F_{o} and F_{c} can be introduced into the model. This is of particular concern in macromolecular crystallography, where repeated cycles of manual rebuilding and refinement are performed. A significant advantage of the maximum likelihood formulation is that the amount of overfitting is reduced (Table 1). In general the maximum likelihood target produces higher Rvalues, but more accurate models and lower or equal free Rvalues, compared with the leastsquares target. This reduced overfitting serves to reduce modelbias in electron density maps, of crucial importance during the manual rebuilding stages of refinement (Fig. 7).
The crossvalidated maximum likelihood target tested here is shown to be superior to the standard leastsquares residual. Final phase errors are reduced and the convergence of macromolecular refinement is improved. The direct incorporation of model errors using the σ_{A} function into the maximum likelihood target makes it most powerful for high initial phase errors, when the model is often incomplete and has significant deviations from the crystal structure. This is the case at both medium (2.0 Å) and lower (2.8 Å) resolution. The maximum likelihood target allows all diffraction data to be used without the need for artificial resolution or F_{o}/σ_{F} truncations and without the need for bulk solvent correction at the early stages of refinement. This and the larger radius of convergence make the combination of crossvalidated maximum likelihood and torsion angle simulated annealing an important new tool for macromolecular crystallography.
Acknowledgments
We thank Luke Rice for stimulating discussions and critical reading of this manuscript. We also thank Yousif Shamoo and Thomas A. Steitz for access to coordinates and diffraction data for human hnRNP A1. This work was supported by a grant from the National Science Foundation (ASC 93–181159) to A.T.B., a grant from the Natural Sciences and Engineering Research Council of Canada to N.S.P., and grants from the Alberta Heritage Foundation for Medical Research and the Medical Research Council of Canada (MT11000) to R.J.R.
Footnotes

↵¶ To whom reprint requests should be addressed. email: brunger{at}laplace.csb.yale.edu.

Paul B. Sigler, Yale University, New Haven, CT
 Received January 10, 1997.
 Accepted March 10, 1997.
 Copyright © 1997, The National Academy of Sciences of the USA
References
 ↵
 ↵
 ↵
 ↵
 Brünger A T,
 Kuriyan J,
 Karplus M
 ↵
 ↵

 Brünger A T
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Bailey S,
 Dodson E
 Bricogne G,
 Irwin J
 ↵
 Bailey S,
 Dodson E
 Murshudov G N,
 Dodson E J,
 Vagin A A
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Pflugrath J W,
 Wiegand G,
 Huber R,
 Vértesey L
 ↵
 ↵
 ↵
 ↵
 Shamoo Y,
 Krueger U,
 Rice L M,
 Williams K R,
 Steitz T A
 ↵
 ↵
 ↵