New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Conformationfamily Monte Carlo: A new method for crystal structure prediction

Contributed by Harold A. Scheraga
Abstract
A new global optimization method, Conformationfamily Monte Carlo, has been developed recently for searching the conformational space of macromolecules. In the present paper, we adapted this method for prediction of crystal structures of organic molecules without assuming any symmetry constraints except the number of molecules in the unit cell. This method maintains a database of low energy structures that are clustered into families. The structures in this database are improved iteratively by a Metropolistype Monte Carlo procedure together with energy minimization, in which the search is biased toward the regions of the lowest energy families. The Conformationfamily Monte Carlo method is applied to a set of nine rigid and flexible organic molecules by using two popular force fields, AMBER and W99. The method performed well for the rigid molecules and reasonably well for the molecules with torsional degrees of freedom.
Crystal structure prediction is one of the most challenging and important problems in theoretical and applied crystal chemistry. It plays an extremely important role in fields in which the rational design of new organic solids is involved (e.g., pharmaceuticals, explosives, pigments, photosensitive and optoelectronic materials, etc.). The significance of crystal structure prediction for solving the problem of polymorphism has been discussed in detail (1, 2). Despite much effort by many scientific groups over the past 20 years, the problem of crystal structure prediction is far from being solved (3). Generally, “crystal structure prediction” is understood as a search for the most thermodynamically and kinetically favorable crystal structures for a given molecular composition without using any experimental information (in many cases, however, experimental data are included implicitly in the force field or taken into consideration by conducting the search in the most common space groups). Unfortunately, no theoretical methods capable of taking into account the kinetic factors (conditions of nucleation and growth, nature of solvent, etc.) have been developed. Therefore, crystal structure prediction is based solely on thermodynamic considerations and the assumption that the structure observed experimentally corresponds to the global minimum of the free energy. But, free energy is not a function of geometrical coordinates of a single crystal structure; therefore, the traditional approach to crystal structure prediction assumes that the free energy of a crystal can be approximated by its potential energy (which can be computed easily) with the lowest minima corresponding to the structures observed experimentally.
There are two main obstacles making crystal structure prediction based on potential energy calculations very difficult. First of all, to calculate the potential energy of a crystal, which is considered as a sum of pairwise atom–atom interactions (4), a highly accurate interatomic potential is required. Several different potentials have been proposed (5–8). Second, a search for the global minimum on the potential energy surface has to be carried out to predict the crystal structure for a given molecule. The search has to be conducted in a multidimensional space with the number of dimensions increasing rapidly with the number of molecules in the unit cell and the complexity of the molecules. Therefore, a reliable and efficient search method is required to solve the problem. Several methods that can be used for crystal structure prediction with varying degrees of confidence have been developed, and these have been described in comprehensive reviews (1, 3). Most of the methods are based on the use of statistical information (most common space groups, symmetry elements, etc.) derived from the Cambridge Structural Database (CSD; ref. 9). Such an approach significantly reduces the dimensionality of the problem and, as a result, enables simpler methods (for example, systematic or random search) to be used (10, 11).
Thus far, only two global optimization methods have been used for crystal structure prediction. One of these, which does not rely on any statistical information about crystal packing, was developed by Karfunkel and Gdanitz (12). It is based on Monte Carlo simulated annealing with partial energy minimization carried out in every Monte Carlo step. Another global optimization method used for crystal structure prediction is the selfconsistent basintodeformedbasin mapping method (13). It is based on deforming and smoothing the original potential energy surface, thereby greatly reducing the number of minima and simplifying the conformational search. This method also does not use statistical information. The selfconsistent basintodeformedbasin mapping method has been applied to the crystal structure prediction of five small rigid organic molecules and to evaluating potentials (13).
A new, highly efficient global optimization method, Conformationfamily Monte Carlo (CFMC), has been developed recently (14). Initially it was used to search the conformational space of proteins and identify their low energy conformations. In this paper we present a version of this method applied to crystal structure prediction. The efficiency of the method is demonstrated by predicting crystal structures of different degrees of complexity. Global optimizations are carried out for four rigid and five flexible H, C, N, and Ocontaining molecules by using two different force fields, AMBER (5) and W99 (8). For rigid molecules, we found that the CFMC method is more efficient than the selfconsistent basintodeformedbasin mapping method.
Methods
CFMC.
The CFMC method can be considered as an extension of the Monte Carloplusminimization method (15, 16). The most important difference between the original Monte Carloplusminimization and CFMC methods is that the latter does not use a single conformation for a Monte Carlo step; instead, it uses the whole family of conformations (and consequently only the moves between families are accepted or rejected), and the database of the families and structures encountered during the calculations is maintained throughout the simulation. The CFMC method already has been used successfully for predicting protein structures in a unitedresidue representation (14).
The central element of the CFMC method is the structure family database, which is an ensemble of structures clustered into families. To control the computational expense, the number of families and structures within each family are restricted to N_{f} families and N_{c} structures, respectively.
The structure family database for a CFMC run is initialized by successively generating N_{f} random structures and minimizing them. Each random structure is generated in two steps. First, the unit cell lengths (a, b, and c) are chosen at random between maximum and minimum values. The maximum length is calculated as 1.5n_{mol}r_{mol}, where n_{mol} is the number of molecules in the unit cell and r_{mol} is the molecular radius. The minimum length is half the maximum length. All the initial unit cell angles are set to 90°. Each molecule then is placed at a randomly selected position in the unit cell with fractional coordinates equal to [i/n_{mol}, j/n_{mol}, and k/n_{mol}], where i, j, and k are random integer numbers from 0 to n_{mol} − 1. It has been shown (17) that these positions are most likely to be occupied in a typical unit cell. The orientations of each molecule are chosen at random. For each randomly generated structure, the energy is evaluated before minimizing it; if this initial energy is higher than 10^{8} kcal/mol (which indicates numerous clashes between atoms), the structure is rejected, and another random structure is generated. The set of minimized structures and their families constitute the initial structure family database.
Any two structures described by sets of structural parameters x_{1} and x_{2} belong to the same family of structures if the sets x_{1} and x_{2} are different descriptions of the same crystal structure and can be transformed one into the other. From a crystallographic point of view, the kinds of structures described above are equivalent (both can be transformed to the same standard representation), but from a numerical point of view they represent two completely different points in the phasespace of our variables, and therefore treating them as different structures makes the search much wider. The method used for structure comparison is discussed in the last subsection.
In each iteration of a CFMC run a new structure C′ is generated from a structure C already present in the structure family database. This structure C is chosen from a specific family F, denoted the generative family. As the CFMC simulation progresses, this generative family can change, as described below. For the first CFMC iteration, the generative family is set to the lowest energy family in the initial structure family database. An iteration of CFMC consists of four steps, as follows.
Step 1.
A structure C is chosen from the generative family F with a probability proportional to its Boltzmann weight.
Step 2.
This structure C is modified to yield a new structure C′. (The methods for modifying structures are described in the next subsection.) If the new structure is identical to another structure C′′ in the database (within numerical error), the lower energy structure of the pair is stored in the database, and the algorithm returns to step 1.
Step 3a.
If the new structure C′ does not belong to any family in the database, a new family F′ is created, the sole member of which is C′ (of course, structures may be added to this new family in subsequent CFMC iterations, as shown in step 3b). If the number of families in the database exceeds the limit N_{f}, the family with the highest energy is eliminated. The algorithm then jumps to step 4.
Step 3b.
If the new structure C′ belongs to a family F′ in the database, the structure is added to this family. If the number of structures in the family exceeds the limit N_{c}, the structure with the highest energy is eliminated. The algorithm then jumps to step 4.
Step 4.
If the new family F′ found in step 3 is not identical to the original generative family F of step 1, a Metropolis criterion is applied to determine whether to make F^{′} the generative family. F′ becomes the new generative family if it has a lower energy than F or if its Boltzmann factor exp(−βΔE) [with ΔE = (E − E_{min})/(E_{max} − E_{min}), where E_{max} and E_{min} are maximum and minimum energies in a family (when choosing from a family) or in the whole database (when evaluating moves)] is greater than a randomly generated number in the interval (0,1), where β 1/kT as usual (β was equal to 0.01 in the present work) and ΔE is the energy difference between families F and F′. (The energy of a family is defined as that of its lowest energy structure.) If this Metropolis criterion is not met, then F remains the generative family. At this point, the algorithm returns to step 1, and a new CFMC iteration begins.
Methods for Producing New Structures.
In the second step of a CFMC iteration, the structure C is modified to yield a new structure C′. There are two general classes of moves used in the CFMC method: internal (or local) moves, intended to generate structures geometrically close to the starting structure C, and external (global) moves, intended to search the variable space for new families (and usually producing structures geometrically distant from the starting structure C). Within each class, there are three kinds of moves: perturbation, search, and averaging. There are 10 moves in the CFMC algorithm, five internal and five external.
Perturbations are used for searching the space of the molecular translations, rotations, and unit cell parameters. In the internal move 1 and the external move 1 all translations (positions) of molecules are perturbed randomly, whereas in the internal move 2 (and the external move 2) all the Eulerian angles of rotation of all molecules are perturbed randomly. The above internal moves differ form the corresponding external moves only by the range of perturbations. The external move 3 randomly perturbs all possible degrees of freedom (i.e., unit cell parameters, rotations, and translations for all molecules). The internal move 3 searches the rotations of all molecules in a systematic way by generating a threedimensional grid for rotational degrees of freedom of each molecule one by one.
The external move 4 and the internal move 4 are both averaging. This is an entirely different kind of move for which two different structures are necessary. For external averaging, these structures are chosen from different families, whereas for internal averaging, they are chosen from the same family. From these two structures, an averaged (or interpolated) structure then is calculated by using a randomly chosen “mixing ratio” x (in the range from 0 to 1). Thus, every variable ν of the averaged structure is calculated according to the formula: 1 where νand ν correspond to the two structures.
The external and internal moves 5 change the internal degrees of freedom of flexible molecules (and they are not used for rigid molecules). All torsional angles in all molecules are perturbed in these moves. A torsional angle is perturbed randomly when there is only one low energy conformation for that angle; otherwise the angle is perturbed with a step equal to the angular distance between low energy conformations to cover all distinct conformations.
Energy minimization was carried out for all newly generated structures. However, in most cases, random perturbations create structures with numerous atomic clashes, and a simple local minimization is very ineffective. This is because molecules in a crystal are tightly packed, and there is almost no room for any movement. The other problem that must be addressed is that some structures produced may be relatively large and loosely packed. The intermolecular interactions in such a loosely packed cell are relatively weak (i.e., the gradient components corresponding to variables that determine the spacial extension of a crystal are very small); thus, a local minimizer is usually found to choose to change unit cell angles instead of its lengths and molecular positions. This leads to extremely distorted structures (with one or more unit cell angles being unusually low), which are unphysical and cause serious numerical (accuracy) problems. This problem has been described previously in the literature (12).
To avoid these problems the local minimization is carried out in three steps. In the first and second steps, all angular variables are fixed. In the first step, the clashes are removed in the central unit cell only, i.e., the other unit cells are not taken into account. Each molecule (except molecule 1) is translated along the vector pointing from the first molecule to the current molecule until all clashes are removed. Then it is moved back along the same vector until the first intermolecular contacts are established. Second, the surrounding unit cells are added, and the clashes between molecules in different unit cells are relieved by adjusting unit cell parameters. These two steps are very similar to the procedure used by Gdanitz (18). Finally, in the third step local minimization with respect to all variables is carried out by using the SUMSL algorithm (19).
Calculation of Potential Energy.
The algorithm described by Gibson and Scheraga (20) was used for energy and gradient calculations. The potential energy is assumed to be a sum of pairwise interatomic interactions and includes three terms: electrostatic, nonbonded, and torsional: 2 Electrostatic interatomic interactions were modeled by the Coulomb formula, 3 where q_{i} and q_{j} are point charges positioned on the atom sites. The electrostatic energy was calculated by using the Ewald summation (21) without including a dipole moment correction term.
The energy of nonbonded interactions was calculated with the LennardJones “6–12” and Buckingham “6exp” potential functions. The atom–atom contributions were summed in a special way to avoid small discontinuities of energy caused by the fact that nonbonded energy terms vanish only at infinity; these terms were smoothed (“feathered”) to zero at a large but finite distance by using a cubic spline and a cutoff (20), chosen so as to ensure that the energy and its first derivative are continuous everywhere. The 6–12 nonbonded atom–atom potential used in the present work is described by 4 where a, b, c, d, and r_{1} and r_{2} are constants calculated for each pair of atoms (see ref. 20 for more details).
In the “6exp1” potential function, the total energy of an atom–atom interaction may go to minus infinity at short distances because of the much slower changes in the 6exp part of the potential compared with the electrostatic part. To avoid this problem, a cubic approximation of the nonbonded energy was used for distances shorter than r_{1}, the equilibrium interatomic distance for a given pair of atoms minus 1 Å; the parameters a_{0} and b_{0} were chosen so as to ensure that the energy and its first derivatives were continuous. The 6exp potential is described by 5 The torsional energy is calculated with a thirdorder Fourier expansion. 6 where ω is a torsional angle; k_{1}, k_{2}, and k_{3} are torsional parameters obtained by fitting the torsional energy to the difference between ab initio and molecular mechanic (sum of nonbonded and electrostatic) profiles.
The basis vectors of the unit cell a, b, and c were chosen so that the direction of a coincides with the x axis, the vector b lies in the (x,y) plane, and the lattice vectors form a righthanded system. No crystal symmetry was assumed. During energy minimization, the torsional angles around bonds and all translation vectors t_{m} of molecules, Euler angles φ_{m}, θ_{m}, ψ_{m}, and components a_{x}, b_{x}, b_{y}, c_{x}, c_{y}, and c_{z} of the lattice vectors were allowed to vary independently.
Structure Comparison.
Our procedure for structure comparison consists of three steps. In the first step, the total lattice energies and the volumes of the unit cells of two structures i and j are compared. If the deviations are greater than some preset values, the structures are considered to be different. Otherwise, the second step consisting of comparison of unit cell parameters a, b, c, α, β, and γ is carried out. If the deviations in structural parameters are small (lower than some preset threshold), the structures are identical.
To distinguish between different structures and structures that are different representations of the same structure, the following procedure is used. For each structure, all interatomic distances within a cutoff radius r_{d} are calculated and sorted according to their values, and the shortest 1,000 of these are stored. The cutoff radius is chosen as d/2 < r_{d} < 3d/2, where d is the largest component of the unit cell vectors. This set of distances is used in the final (third) step of structure comparison. If all of the differences (r − r) are lower than some threshold value, the structures i and j are different representations of the same structure; otherwise they are different structures.
Results and Discussion
A compilation of the molecules considered in the present paper together with their CSD reference codes is presented in Fig. 1. These molecules can be divided into two groups: (i) rigid molecules without internal rotations and (ii) flexible molecules with different numbers of internal rotations. The molecule containing an aliphatic ring was considered as rigid. The molecular parameters (bond lengths and bond and torsional angles) were taken from xray diffraction or neutron diffraction data (neutron structures were preferred when available). If more than one CSD entry was found for a given molecule, we selected the structure with the lowest xray diffraction discrepancy factor (R). Positions of the hydrogen atoms were adjusted to give the average experimental bond lengths of 1.083 Å for C–H, 1.009 Å for N–H, and 0.983 Å for O–H, as obtained by neutron diffraction (22). The atomic charges were obtained by fitting (23) to the molecular electrostatic potential obtained by ab initio calculations (ref. 24; HF 631G* in the case of AMBER and HF 631G** in the case of the W99 force field) for the experimental geometry. Additional lonepair electron sites were included for nitrogens in heterocycles and in nonplanar amino groups as suggested in ref. 8.
Local Minimization of Experimental Structures.
To obtain the reference structures for global search, all experimental crystal structures were locally minimized with the AMBER and W99 force fields. The parameters of the minimized experimental structures as well as the initial experimental parameters are given in Table 1.
On the average, deviations of the parameters of the minimized structures from their experimental values are much larger for the five flexible molecules than for the four rigid ones. This is expected, because even quite small deviations in torsional angles may cause significant changes in molecular conformation. This was the case for FORAMO01 (AMBER and W99) for which the largest average deviations of the unit cell parameters of 8.6% for AMBER and 6.2% for W99 were obtained. The experimental values of the torsional angles were reproduced quite well for all flexible molecules except FORAMO01 and LIQVUC. Deviations of the torsional angles obtained for LIQVUC influenced the quality of the minimized crystal structure insignificantly because the largest deviations were obtained for rotations of “spherical” methyl groups. Average deviations of the unit cell parameters from their experimental values for the rigid molecules are quite small and similar for both force fields (1.92% for AMBER and 2.23% for W99). Larger deviations were observed for IMIDAZ06 crystal structure minimized with the AMBER (4.36%) and W99 (5.46%) potentials. The symmetry of the experimental structures was preserved after energy minimization for all rigid molecules as well as for DNITBZ03, NITPOL03 and FORAMO01. For BENZON10 and LIQVUC, local minimization of the lattice energy without any symmetry constraints led to the space group P2_{1} with two independent molecules in the unit cell, suggesting defects in the force fields for these two molecules.
Global Optimization.
Global optimization runs were carried out for all the molecules shown in Fig. 1 using 5,000 local minimizations in each global optimization. The number of molecules in the unit cell, Z, was chosen as the number of molecules in the experimental crystal structure, and no symmetry constraints were used. In the case of chiral molecules, two global optimization runs have to be carried out: one with the L (or R) enantiomer and the other one with the racemic mixture. The only chiral molecule (according to the xray experimental data) in our list is PHYPHM, which forms an optically active crystal with the symmetry P2_{1}2_{1}2_{1}. The results of global minimizations with the W99 and the AMBER force fields are summarized in Table 2.
The first four molecules in Table 2 are rigid. In its current form, the method seems to perform well for these molecules; the energy of the global minimum found for each molecule is always lower or equal to the energy of the corresponding minimized experimental structure. In all cases except one (PRMDIN with the AMBER force field), the minimized experimental structure was found with both force fields even when it was not one of the lowest minima, as in the case of IMAZOL06 (with the W99 force field) where the minimized experimental structure has a high rank of 24. The only molecule for which the reference structure was not found is PRMDIN, but its rank is ≈82 with AMBER, which is out of the search range. Both force fields performed reasonably well for the rigid molecules except IMAZOL06 with the W99 and PRMDIN with the AMBER force field.
The results with the set of flexible molecules are not as good as with the rigid molecules. This set of flexible molecules consists of molecules of different levels of flexibility varying from two to four torsional angles per molecule. The simplest of all was DNITBZ03 (two torsional angles and Z = 2) for which the search method as well as both force fields performed very well in finding the reference structure as the global minimum. The second molecule with two torsional degrees of freedom was NITPOL03, but this crystal is more complicated because of the possibility of forming different networks of hydrogen bonds. The minimized experimental structure had the rank of 8 and 18 for the AMBER and the W99 force fields, respectively, and was never found by the global search. The two molecules with three torsional degrees of freedom were FORAMO01 and BENZON10. In the case of BENZON10, the search method failed to produce structures lower or equal in energy to the minimized experimental structure. The crystal structure of FORAMO01 probably was the most difficult to predict among all structures considered in this paper. The number of possible hydrogen bond networks for this molecule is very large, and the difference in stability between them is usually small. Thus, the quality of the atom–atom potentials used should be crucial in this case. The minimized experimental structure of FORAMO01 has an extremely high rank for both force fields. Taking into account large structural deviations obtained for the reference structure relative to the experimental one (Table 1), we conclude that both force fields are clearly inadequate for this molecule, making the prediction almost impossible. For the last two molecules in the set, the search method had problems locating low energy structures. For LIQVUC, the reference structure was not found either for the AMBER or W99 force field. With W99, the reference structure had an energy slightly higher (0.04 kcal/mol) than the energy of the lowest minimum found. The a, b, and c unit cell parameters and the molecular torsional angles are similar for the reference and for our lowest energy structures, although the crystal structures are different. These results suggest that the presence of additional torsional degrees of freedom requires more extensive search than that used in this work.
The global optimization results show that the method performs well for rigid molecules and simple flexible molecules, although for molecules with a larger number of torsional degrees of freedom, the search does not cover the conformational space as well. Most likely the reason is an insufficient number of local minimizations during the global search. The CFMC method applied to a small protein such as Protein A (ref. 12; 43 principal degrees of freedom) usually required 50,000 local minimizations to converge. There is no reason to believe that the global optimization of crystals is a less demanding task. However, the energy calculation for a crystal is quite expensive, and this has limited us thus far to the 5,000 local minimizations per run. Reduction of the cost of a single local energy minimization would probably improve the performance of the algorithm.
Acknowledgments
This research was supported by National Institutes of Health Grant GM14312, National Science Foundation Grant MCB0003722, and Fogarty Foundation Grant R03 TW1064. This work was carried out by using computational resources provided in part by (i) the Cornell Theory Center, which receives funding from Cornell University, New York State, the National Center for Research Resources at the National Institutes of Health (P41RR04293), and members of the Theory Center's Corporate Partnership Program, and (ii) with our own array of 55 dualprocessor PC computers.
Abbreviations
 CFMC,
 Conformationfamily Monte Carlo;
 CSD,
 Cambridge Structural Database
 Accepted September 12, 2001.
 Copyright © 2001, The National Academy of Sciences
References
 ↵
 Verwer P,
 Leusen F J J
 ↵
 Gavezzotti A,
 Filippini G
 ↵
 ↵
 Pertsin A J,
 Kitaigorodsky A I
 ↵

 Ewig C S,
 Thacher T S,
 Hagler AT

 Mooij W T M,
 van Duijneveldt F B,
 van Duijneveldtvan de Rijdt J G C M,
 van Eijck B P
 ↵
 ↵
 ↵
 Eijck B P,
 van Spek A L,
 Mooij W T M,
 Kroon J
 ↵
 ↵
 ↵
 Pillardy J,
 Wawak R J,
 Arnautova Y A,
 Czaplewski C,
 Scheraga H A
 ↵
 ↵
 Li Z,
 Scheraga H A
 ↵
 Li Z,
 Scheraga H A
 ↵
 Motherwell W D S
 ↵
 Gdanitz R J
 ↵
 ↵
 Gibson K D,
 Scheraga H A
 ↵
 Ewald P
 ↵
 Desiraju G R,
 Steiner T
 ↵
 ↵
Citation Manager Formats
Sign up for Article Alerts
Jump to section
You May Also be Interested in
More Articles of This Classification
Physical Sciences
Chemistry
Related Content
 No related articles found.
Cited by...
 No citing articles found.