# Protein–protein docking by fast generalized Fourier transforms on 5D rotational manifolds

^{a}Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794;^{b}Moscow Institute of Physics and Technology, Moscow Region 141700, Russia;^{c}Department of Biomedical Engineering, Boston University, Boston, MA 02215;^{d}Innopolis University, Innopolis 420500, Russia;^{e}Institute of Computer Aided Design of the Russian Academy of Sciences, Moscow 123056, Russia;^{f}Inria Nancy, 54600 Villers-les-Nancy, France;^{g}Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794;^{h}Institute for Advanced Computational Sciences, Stony Brook University, Stony Brook, NY 11794

See allHide authors and affiliations

Edited by Michael Levitt, Stanford University School of Medicine, Stanford, CA, and approved May 13, 2016 (received for review March 8, 2016)

## Significance

Expressing the interaction energy as sum of correlation functions, fast Fourier transform (FFT) based methods speed the calculation, enabling the sampling of billions of putative protein–protein complex conformations. However, such acceleration is currently achieved only on a 3D subspace of the full 6D rotational/translational space, and the remaining dimensions must be sampled using conventional slow calculations. Here we present an algorithm that employs FFT-based sampling on the 5D rotational space, and only the 1D translations are sampled conventionally. The accuracy of the results is the same as those of earlier methods, but the calculation is an order of magnitude faster. Also, it is inexpensive computationally to add more correlation function terms to the scoring function compared with classical approaches.

## Abstract

Energy evaluation using fast Fourier transforms (FFTs) enables sampling billions of putative complex structures and hence revolutionized rigid protein–protein docking. However, in current methods, efficient acceleration is achieved only in either the translational or the rotational subspace. Developing an efficient and accurate docking method that expands FFT-based sampling to five rotational coordinates is an extensively studied but still unsolved problem. The algorithm presented here retains the accuracy of earlier methods but yields at least 10-fold speedup. The improvement is due to two innovations. First, the search space is treated as the product manifold

Determining putative protein–protein interactions using genome-wide proteomics studies is a major step toward elucidating the molecular basis of cellular functions. Understanding the atomic details of these interactions, however, requires further biochemical and structural information. Although the most complete structural characterization is provided by X-ray crystallography, solving the structures of protein–protein complexes is frequently very difficult. Thus, it is desirable to develop computational docking methods that, starting from the coordinates of two unbound component molecules defined as receptor and ligand, respectively, are capable of providing a model of acceptable accuracy for the bound receptor–ligand complex (1⇓⇓–4). In view of the large number of putative protein–protein interactions, the computational efficiency of docking is also a concern.

Most global docking methods start with rigid body search that assumes only moderate conformational change upon the association, accounted for by using a smooth scoring function that allows for some level of steric overlaps (3). Rigid docking was revolutionized by the fast Fourier transform (FFT) correlation approach, introduced in 1992 by Katchalski-Katzir et al. (5). The major requirement of the method is to express the interaction energy in each receptor–ligand orientation as a sum of *P* correlation functions, i.e., in the form

Most FFT-based methods (6⇓–8, 10⇓–12) define **1**, can be expressed in terms of the Fourier transforms *E* can be calculated over the entire translational space using *P* forward and one inverse FFT. If *N* denotes the size of the grid in each direction, then the efficiency of this approach is

Despite the usefulness of the above algorithm, using FFTs only in translational space has three major limitations. First, FFTs on a new grid must be computed for each rotational increment of the rotating molecule; thus acceleration applies only to half of the degrees of freedom (Fig. 1). Second, each term in the scoring function requires a separate FFT calculation. Thus, accounting for electrostatics, desolvation, and, particularly, pairwise interactions substantially increases the required computational efforts. Third, experimental techniques such as NMR Nuclear Overhauser effect measurements and chemical cross-linking yield information on approximate distances between interacting residues across the interface, and this information can be used to perform the docking subject to pairwise distance restraints. Unfortunately, each pairwise distance restraint requires a new correlation function term. Because the required computational effort is proportional to *P*, the number of correlation functions in the energy expression, the increasing complexity reduces the numerical advantage of the FFT approach.

In principle, the above problems can be avoided by applying the transforms first, and then moving the proteins in the Fourier space without the need for recomputing the transforms. However, it is difficult to carry out rotations in the translational Fourier space, and, thus, to perform rotations efficiently, it is natural to use spherical coordinates. This approach was applied to crystallography in the early 1970s by Tony Crowther, who realized that the rotation function can be computed more quickly using the FFT, expressing the Patterson maps as spherical harmonics (13). A few groups also used this idea for the development of docking algorithms (14, 15). Most notable is the Hex method of Ritchie and Kemp (14), which represents protein shapes using Fourier series expansions of spherical harmonic and Gauss–Laguerre polynomials. This representation allows rotational searches to be accelerated by angular FFTs, and it enables translations to be calculated analytically in the Fourier basis (15). A similar approach has been developed by Chacon’s group (16, 17), in which translations are calculated numerically. However, both approaches were found to have lower accuracy than traditional Cartesian FFT sampling (15). This may be attributed to three main factors. Firstly, the energy functions used were less detailed than in some of the Cartesian approaches. In particular, we used only van der Waals and electrostatic terms (15). Secondly, because the computational cost of the polar Fourier translation matrices grows as

In this paper, we describe a fast manifold Fourier transform (FMFT) algorithm that eliminates the above shortcomings and, on the average, results in a 10-fold decrease in computing time while retaining the accuracy of the traditional Cartesian FFT-based docking. As will be further emphasized, even more important is that, using FMFT, the computational efforts required are essentially independent of the number of correlation function terms in the scoring function, thus enabling the efficient use of more accurate but also more complex energy expressions, as well as accounting for any number of pairwise distance restraints. Developing the method, we took advantage of the generalization of the Cartesian FFT approach to the rotational group manifold

As already mentioned, a general shortcoming of using Fourier decomposition in spherical spaces is the relatively slow convergence of the series of spherical basis functions. Thus, using a large number of terms reduces computational efficiency, whereas truncating the series limits the accuracy of the energy values calculated by the method. Therefore, a key factor explaining the success of our manifold FFT docking method is that we select the centers of highly populated clusters of low-energy docked structures rather than simply low-energy conformations as predictions of the complex. Such clusters occur in low-energy regions around the local minima in the conformational space. The size of each cluster represents the width of the corresponding energy well, and hence provides some information on entropic contributions to the free energy. Model selection based on cluster size has been used in our very successful docking server ClusPro and, in a substantial fraction of docking problems, enabled the identification of the docked structure closest to the native complex (20). We note that a similar clustering step is implemented in the protein structure prediction program Rosetta (21). For a somewhat more formal justification of the cluster-based approach to model selection, we argue that, using FFT, we globally and systematically sample the energy landscape of the interacting protein pair on a grid, and hence we can calculate an approximate partition function of the form *j*th pose, and we sum over all poses. For the *k*th low-energy cluster, the partition function is given by *k*th cluster is given by *j* in the low-energy clusters. This simplification implies that *k*th cluster. Therefore, we select the centers of highly populated clusters of docked structures, rather than low-energy conformations, as predictions of the complex. Although neglecting the energy differences within the low-energy clusters seems to be arbitrary, the success of the ClusPro server demonstrates that the approximation is valid in a large fraction of cases. The significance of model selection based on cluster size rather than energy values is that it does not require very accurate energy evaluation, and hence, in FMFT, it is sufficient to use a limited number of spherical basis functions in the Fourier space, increasing numerical efficiency without noticeable loss of docking accuracy.

The high efficiency of the FMFT algorithm enables solving very demanding docking problems, way beyond what was considered feasible in the past. After demonstrating that the accuracy of FMFT is comparable to that of the traditional Cartesian FFT-based docking, we present here a few applications that require a large number of docking calculations. Such problems include docking ensembles of models obtained by NMR or homology modeling, and exploring a large number of putative peptide conformations in peptide–protein docking. As will be described, an additional and very favorable property of the FMFT algorithm is that the required computational efforts are almost completely independent of the number *P* of the correlation function terms in the energy expression given by Eq. **1**, and hence the method can be efficiently used with scoring functions of arbitrary complexity. In contrast, in the traditional FFT approach, the efforts are proportional to *P*, and hence it is difficult to perform docking subject to pairwise distance restraints, as each restraint gives rise to an additional term in the scoring function. Using FMFT, we demonstrate that this problem can be solved effectively without significant increase in running times (Fig. S2).

## Results and Discussion

### FFT-Based Docking on 5D Rotational Manifolds.

Here we demonstrate that, by taking advantage of the special geometry of the space characterizing molecular movement upon protein–protein association, it is possible to construct an extremely efficient FFT-based docking algorithm. We present the basic idea of this algorithm as the generalization of the translational FFT method described in the Introduction. Because we plan to work in the rotational space, we change the Cartesian coordinates to polar coordinates *N* is the order of expansion used. Eq. **4** looks like a Fourier transform, but

Consider again the derivation of the convolution theorem (Eq. **1**) but, this time, on the manifold *z* axis,*d* functions, related to Jacobi polynomials (19). Eqs. **6** and **7** show that the rotational operator in the rotational group **2**), apart from the asymmetry of the middle angle *β*, which requires special treatment. Describing the translation of the ligand along the *z* axis in the Fourier space is far from simple, and requires updating a set of coefficients. However, it is only 1 degree of freedom (as opposed to 3 degrees in the Cartesian space), and hence it can be accomplished relatively efficiently (24). Now we apply the translation operator and the rotation operator (Eq. **7**) to the integral in Eq. **5**. Based on the orthonormality of the generalized Fourier basis functions, interchanging the order of integration and summation yields**8** is similar to Eq. **3** in Cartesian coordinates, with the difference that, instead of a 3D inverse Fourier transform, we have a generalized FMFT, which involves the Wigner *d* functions **3**, for each rotation of the ligand, we have to calculate the Fourier transforms *P* components of the ligand energy function separately, form the product with the transform *p*th component of the receptor energy function, sum all terms, and take the inverse transform. In contrast, according to Eq. **8**, we calculate the sum of initial precalculated generalized Fourier coefficients in the internal loop only once, and perform all rotations in Fourier space rather than calculating an FFT for each rotation. This allows us to calculate multiple energy terms using a single FMFT for each translation. Thus, as already emphasized, the computational efforts are essentially independent of the number *P* of the correlation function terms in the energy expression. Because inverse manifold Fourier transforms can be efficiently calculated by methods due to ref. 19, this approach provides substantial computational advantage, particularly if *P* is high.

### Execution Times.

Execution times of the FMFT sampling algorithm were measured by docking unbound structures of component proteins in 51 enzyme–inhibitor pairs from the established Protein Docking Benchmark (25) (Table S1). The times were compared with those required for docking the same proteins using PIPER, a protein docking program based on the Cartesian FFT approach (8). The FFTW (Fastest Fourier Transform in the West) library (26) was used for FFT calculations. All runs were performed using the standard PIPER scoring function, consisting of eight correlation function terms. Execution times were measured on one or several Intel Xeon E5-2680 processors. Using the FMFT algorithm, the average execution time was 15.39 min. In comparison, the average execution time for the same set of proteins using PIPER was 232.15 min, indicating that FMFT speeds up the calculations ∼15-fold. Using parallel versions of the algorithms on 16 CPU cores, the average execution times measured were 2.67 min and 20.19 min for FMFT and PIPER, respectively, which shows about a 7.5-fold speedup.

### Application 1: Constructing Enzyme–Inhibitor Complexes.

The quality of FMFT and PIPER results was determined by docking the same 51 enzyme–inhibitor pairs that we have used for comparing execution times (Table S2). In both cases, the scoring function was the same one normally used in PIPER for docking enzyme–inhibitor pairs, and it consisted of attractive and repulsive van der Waals, Coulombic electrostatics, generalized Born, and knowledge-based Decoys As the Reference State terms, the latter representing nonpolar solvation (27). The docking procedure for these cases was the one normally used by PIPER (20). First, the conformational space was sampled using either the FMFT or the PIPER protocol. After docking, the 1,000 lowest energy poses were retained and clustered using interface C_{α} rms deviation (RMSD) as the distance metrics with a fixed 9 Å clustering radius. The clusters were ranked according to cluster populations (i.e., number of poses in the cluster), and the centers of up to 30 largest clusters were reported as putative models of the complex (Table S2).

Fig. 2 *A*–*C* shows the results of docking. The number of hits shown in Fig. 2*A* is the number of near-native poses, defined as having less than 10 Å *B* and *C* shows the properties of models obtained by clustering low-energy poses using pairwise IRMSD as a distance metric. A large number of low-energy poses typically yields a well-populated and thus highly ranked near-native cluster, reported as one of the final models. Based on all these results, FMFT and PIPER show comparable docking performance, both in terms of the number of near-native structures (Fig. 2*A*), the ranks of the clusters that define the final near-native models (Fig. 2*B*), and the IRMSD (Fig. 2*C*) of these models.

### Application 2: Docking Interacting Protein Domains.

We further compared FMFT and PIPER by docking interacting domains extracted from proteins that are defined as “Other” type in the Protein Docking Benchmark (25) (Tables S3 and S4). This problem is generally more challenging than docking inhibitors to enzymes because the Other category includes complexes with highly variable properties. Restricting consideration to individual domains eliminates the additional problem that the domains in multidomain proteins may shift relative to each other, affecting the docking results. Thirty cases representing domain–domain binding were selected from the Others section of the Protein Docking Benchmark (Table S4). Nineteen cases from this set represent binding of single-domain proteins (or single domains taken from larger proteins), and thus full protein structures were used for docking. In another 11 cases, receptor and/or ligand are composed of several domains, so reduced representations of protein structures were prepared: Only the binding domains were retained for docking, and the rest of the structure was cleaved. Residue ranges for binding domains were assigned according to structural classification of proteins (SCOP) domain classification (28). To prevent possible association at intraprotein domain–domain binding interfaces exposed by the cleavage, additional repulsion grids were used in the docking procedure. These were constructed by taking the backbone atoms of the original structure lying within 10 Å (but not closer than 5 Å) of the binding domain and placing repulsive spheres with 0.5-Å radius at the positions of those atoms. The 5-Å lower bound to the distance range specifying the thickness of this “repulsive padding” was introduced to ensure that additional repulsion doesn’t affect binding to the relevant portion of protein surface. During the docking process, such repulsive padding grid was correlated with the standard repulsive van der Waals grid of the binding partner. The docking procedure overall was the same as that used for enzyme–inhibitor targets, except that 1,500 low-energy poses were used for clustering, generated from three docking runs (500 poses from each) performed with differently weighted components of the scoring function (20). Similarly to the results obtained for enzyme–inhibitor complexes, FMFT and PIPER show comparable performance (Fig. 2 *D*–*F* and Table S3). Although PIPER generates large numbers (>200) of near-native structures for more complexes than FMFT, the number of complexes with very few (<10) such near-native structures is substantially smaller using FMFT than using PIPER. Thus, FMFT shows better performance for the more difficult-to-dock complexes (Fig. 2*D*). In addition, using PIPER, the number of models that are not ranked in the top 10 is much higher than using FMFT (Fig. 2*E*). Based on these results, FMFT performs as well as PIPER.

### Application 3: Accounting for Pairwise Distance Restraints.

An important consideration for selecting a docking method is the maximum complexity of the scoring function that still allows for solving problems with reasonable execution times. As mentioned, all FFT-based approaches require the use of scoring functions that can be written as sums of correlation functions. This is not a major limitation, because such functions may include many commonly used physics-based energy terms, such as steric repulsion, van der Waals interaction, and Coulombic electrostatics. It has also been shown that some energy terms that are not inherently correlation-based, such as the widely used pairwise interaction potentials, can be efficiently approximated by a sum of several correlation functions (27). Altogether, this makes the number of correlations a crucial parameter, because this number effectively defines the complexity of the scoring function in the particular sampling run.

One important task, especially demanding in terms of scoring function complexity, is incorporating pairwise distance restraints, based on known interactions between residue pairs, into the docking procedure. Such restraints can be derived in a variety of experiments, including NMR, cross-linking, and mutagenesis assays (29). The restraints can be implemented as short-distance attractive terms in the scoring function, but each will add a correlation function term. As emphasized, in Cartesian FFT, the number of transforms required is proportional to the number *P* of correlation functions (Eq. **3**), whereas, in FMFT, the number of transforms is independent of *P*. To demonstrate this difference, we determined the structure of the glucose-specific enzyme IIA (E2A)-histidine-containing phosphocarrier protein (HPr) complex [Protein Data Bank (PDB) entry 1GGR] (30) from the structures of its constituents in their unbound form (PDB entries 1F3G and 1POH) and 20 ambiguous interaction restraints (AIRs) based on NMR titration data (29). The docking procedure was the one used for docking enzyme–inhibitor pairs, but with 20 additional correlations terms in the scoring function due to the restraints (29). Each restraint is specified as a residue in one of the proteins, and a set of residues on the partner protein that are in contact with the first residue, where “contact” means ≤3 Å distance between any two atoms of the residue pair. To represent these restraints, receptor and ligand correlation components were constructed by placing 3-Å radius attractive spheres on the atoms of the particular residue on the first protein and the attracted point “charges” on the atoms of the interacting residues on the partner protein. Docking was performed using both FMFT and PIPER. Incorporation of restraints increased the population of the near-native cluster from 201 to 410, which became the most populated cluster and thus provided the putative model of the complex (Fig. 3 and Table S5) without any significant change in the IRMSD of the cluster center (5.25 Å for the unrestrained case versus 5.15 Å for the restrained). Adding the restraints increased the number of correlation function terms in the scoring function from 8 to 28. For PIPER, this resulted in a proportional increase in execution time (from 96.15 min to 373.80 min). In contrast, running FMFT, the execution time barely changed, from 12.32 min to 15.30 min. This result demonstrates that FMFT can be used with very complex scoring functions (Fig. S2).

### Application 4: Docking Ensembles of NMR Models.

Multiple docking runs may be required when one or both component proteins are given as ensembles of structures, obtained by NMR experiments or by extracting snapshots from molecular dynamics simulations. Because accounting for multiple structures may substantially improve docking results, the high efficiency of the FMFT method is particularly useful. As an example, we considered calculating the complex formed by the *Escherichia coli* Colicin E9 DNase domain and its cognate immunity protein IM9. Four different X-ray structures of the unbound E9 DNase domain (chains B, C, D, and E of PDB entry 1FSJ) were docked in a pairwise manner to 20 NMR models of the IM9 protein (PDB entry 1E0H), thus performing 80 docking calculations. Unstructured termini of the receptor were masked and didn’t contribute to the calculated energy scores. The 50 lowest energy poses were extracted from each of the 80 docking runs and merged, yielding a total of 4,000 poses that were then clustered as usual in PIPER. Fig. 4 *A*–*C* shows the docking results. In short, merging the 50 lowest energy poses from each docking run, followed by clustering, provided a 2.94-Å IRMSD model of the complex ranked fifth, where the 1IBX structure of the native complex was used for comparison when evaluating the accuracy of results. To emphasize the advantage of ensemble docking, we also docked a single pair of structures, chain B of 1FSJ and the first NMR model of the ligand from the PDB file 1E0H. The standard docking protocol was used, and we retained the 1,000 lowest energy poses for clustering. Docking the single pair, the best near-native model obtained was ranked 13, and had the IRMSD value of 3.45 Å. Thus, in the case of structural uncertainty of the component proteins, ensemble docking can substantially improve the results, and, in this type of application, the higher speed of FMFT is a major advantage. Computational efficiency will be particularly important for genome-wide analyses of protein–protein interactions, but we think that, for such applications, it will be necessary to better understand the docking of homology models (see *Application 5: Identification of Binding Sites by Docking Homology Models*), because, generally, structures are not available for a substantial fraction of proteins.

### Application 5: Identification of Binding Sites by Docking Homology Models.

It has been shown that protein–protein interaction sites can be found by determining the highly populated interfaces in the ensemble of structures generated by global docking (31, 32). We implemented this approach by clustering the “interfacial” atoms in the low-energy docked poses. Although this method usually requires structures of the component proteins, we extended the approach to proteins with yet undetermined structures by docking multiple homology models. The extended method was applied to determining the interface in the Nef–Fyn(R96I)SH3 complex (PDB entry 1EFN). Ten models of the receptor (SH3 domain) and 2 models of the ligand (HIV-1 Nef protein) were constructed using the MODELLER program (33) and based on homologous templates with 30–60% sequence identity (see Table S6 for the list of templates used). All possible receptor–ligand model pairs were docked using the approach developed for Other type of complexes (20). From each of the 20 docking runs, we selected the 1,500/20 = 75 lowest energy poses that were merged and clustered using RMSD as the distance metrics. The structures at the centers of these clusters were used to define interface atoms as atoms located within 5 Å of any atom of the partner protein. These interfacial atoms were then subjected to bottom-up hierarchial clustering using the Euclidian distance as the metrics. Clustering was terminated, i.e., neighboring clusters were not merged, if the minimal distance between a pair of their atoms was larger than the value of a separation parameter. The resulting clusters were ranked according to cluster population (i.e., the number of atoms in each cluster), and the largest cluster was considered to be the most probable prediction of the protein–protein interaction site. For comparison, we also predicted the interaction site by docking a single pair of homology models based on the templates with the highest sequence identity. In this case, a slightly larger value of the clustering separation parameter was used (1.35 Å rather than 1.30 Å). This change was due to the fact that a single docking run provided fewer interfacial atoms for hierarchical clustering, resulting in clusters that were too small. Therefore, the value of the cutoff parameter was increased to ensure that the relative population of the largest cluster was comparable to that obtained by merging the results from 20 docking runs. As shown in Fig. 4 *D* and *E*, docking of multiple homology models of the component proteins increased the accuracy of binding site prediction, compared with the result of using the maximum sequence identity models alone.

### Application 6: Docking Flexible Peptides.

The difficulty in docking short linear peptides is that their structure in solution is generally unknown and may be ill-defined. One possible solution is to dock a variety of peptide conformations, thus requiring multiple docking runs. We have recently developed an algorithm based on the use of structural templates extracted from the PDB with sequences that matched the known sequence motif in the peptide. These templates were docked individually using the FMFT algorithm. From each run, a number of low-energy poses were retained, the pooled peptide structures were clustered, and the highly populated cluster centers were reported as final models as in all applications of our docking algorithm.

Here we demonstrate this algorithm by docking the ace-PQQATDD peptide to the tumor necrosis factor receptor-associated factor 2 (TRAF2). For this peptide, the PXQ motif sequence known from the literature was extended to length 7 (PXQXXDD) and used to extract 316 structural templates from the PDB database. These templates were then used to model the target peptide. The models were aligned and clustered using the backbone RMSD as the distance measure, with 0.5 Å as the fixed clustering radius. Peptide structures corresponding to the centers of the 25 most-populated clusters were docked to the unbound receptor structure (chain A of PDB entry 1CA4).

The 250 lowest energy poses were retained from each docking run. The poses were merged and clustered using backbone RMSD as a distance measure with 3.5 Å as the fixed clustering radius. Cluster centers were ranked according to cluster populations and reported as final models. Docking results were evaluated using the backbone RMSD from the structure of the peptide in the native complex (chain A of PDB entry 1CZY). A near-native model of the protein–peptide complex was ranked fourth and had the backbone RMSD of 3.3 Å from the conformation in the X-ray structure (Table S7 and Fig. 5*A*). Note that docking only the most frequently occurring structural template provides less accurate models, as demonstrated in Fig. 5 *B* and *C*.

## Conclusions

Extending the classical 3D Cartesian implementation of the FFT correlation approach to perform rotations in Fourier space without the need for recalculating the transforms has been a long-outstanding and extensively studied problem. The main difficulty in developing such methods is that, to achieve numerical efficiency, one can use only a moderate number of spherical basis functions to span the search space, and this may reduce the accuracy of energy evaluation. However, because we base model selection on the population of low-energy clusters rather than on energy values, minor deviations in energy generally do not affect the accuracy of final models. Here we present an elegant manifold FFT implementation of 5D search that is more than 10-fold faster than the traditional 3D approach. A major advantage of the method is that adding correlation function terms in the scoring function is computationally inexpensive, and hence the method works efficiently with very complex energy evaluation models, possibly including pairwise distance restraints that are difficult to deal with in traditional FFT-based docking. The improved efficiency implies that we can solve new classes of docking problems, including the docking of large ensembles of proteins rather than just a single protein pair, docking homology models, and flexible peptides that may have a large number of potential conformations. We note that the beta version of a code implementing the FMFT algorithm can be downloaded from https://bitbucket.org/abcgroup_midas/fmft_dock/, thus providing an opportunity for testing and using the method. In addition, we are in the process of adding FMFT as a new option to the server.

## Materials and Methods

This section summarizes the implementation of the FMFT approach. For the mathematical details of the algorithm, see *SI Materials and Methods*.

The procedure starts with receptor- and ligand-associated components of each correlation term of the energy function being represented as sets of coefficients **4**. Here *N* governs the order at which the series is truncated. These coefficients, together with the translation range to be sampled (i.e., minimal and maximal distances between protein centers, calculated from the geometrical properties of the proteins), are submitted as input parameters to the program performing the FMFT-based sampling. To improve efficiency, two stages of FMFT sampling are being executed: The first one, performed with a maximal coefficient order

The actual sampling stage can be described as follows: After loading the input parameters, the program starts to iterate the allowed translation range in steps of 1 Å. For each translation step, the *K* (on the order of 1,000 for a typical sampling run) lowest energy samples are retained for each translation step. After the entire translation range is processed, the low-energy samples from individual translation steps are merged and resorted by energy value to select the *K* lowest energy samples that are presented as the final results.

It is important to note here that the sampling of the

## SI Materials and Methods

### FFT-Based Sampling on Manifolds.

#### The FMFT basis set.

As stated by the Peter–Weyl theorem, a complete set of Wigner *D* functions *β* and *γ* with *α* set to zero], the basis can be specified as the following collection of Wigner function products:*D* functions have the same index *m* accounts for the lacking degree of freedom associated with the *α* angle.

A function **[S1]**:

We refer to this expansion as the manifold Fourier expansion, and note that it specifies an inverse Fourier transform on the

#### Correlations on ( S O ( 3 ) ∖ S 1 ) × S O ( 3 ) .

Consider the problem of matching a pair of 3D patterns, each defined as a function on *A*) that need to be rotated to achieve superimposition.

If the patterns are defined as real-valued density functions *f* and *g*,

In this expression, *f* and *g* in two different spherical coordinate systems *B*).

Important for this work is that correlation function **[S3]** is defined on the **[S2]**, which allows significant acceleration of the search for the optimal

As a first step, we recognize that, because both *f* and *g* are defined on *r*, *f* and *g* to be band-limited functions in the selected basis set, which means that **[S4]** become approximate when finite *N* is used.

Next, we substitute **[S4]** into **[S3]** and take advantage of the linearity of rotation operators

Next, we capitalize on the rotational properties of spherical harmonics. More precisely, we use the fact that a rotated spherical harmonic can be expressed as linear combination of spherical harmonics of the same order,**[S5]** and changing the order of summation, we obtain the following relation:

At this point, we perform a change of coordinates, which we discussed previously: *B*). This makes each of the *z* is the distance between the origins of coordinate systems. For the case of Gaussian-type orbitals, an analytical expression is available for this type of transformation (24),

Substituting **[S9]** into **[S7]** and using the symmetry of translation matrix elements (24),

Finally, taking into account the orthonormality of basis functions and changing the order of summation yields

As already mentioned, this expression is an inverse manifold Fourier transform. This result can be rewritten in a more conventional Fourier-like form by using the following expression for Wigner rotation matrices (34):

Subsituting into **[S12]**, using

A certain asymmetry associated with the *γ* angle arises from complex conjugation and can be dealt with by simply substituting *γ* with

This substitution does not have any significant consequence, due to the periodicity associated with rotations by the *γ* angle (i.e., if we substitute *γ* with

Before we finish the discussion of correlations on *f* is a linear combination of these basis elements, we can view it as a shifted version of some other function *h*, that is, *h* is being rotated around the same point as *g*. This allows us to rewrite expression **[S5]** using just one coordinate system but explicitly introducing a translation operator instead,**[S5]**, provides a more convenient way for thinking about problems in which the initial position of the two patterns relative to each other is not given explicitly, but, instead, multiple different configurations need to be sampled. One example of such a problem is the problem of protein docking, where both proteins are rotated around their centers of mass and different initial distances between centers of mass need to be sampled.

### Application to Protein Docking.

#### Development of the docking protocol.

Consider the case where *f* and *g* describe the matching physical characteristics of a pair of interacting proteins, such that their overlap integral over the 3D space gives a component of receptor–ligand interaction energy. For example, the electrostatic potential of one protein and the charge distribution of the other give the electrostatic energy of the system. We will denote a pair of such characteristics of receptor and ligand as *f* and *g*; that is, from now on, we consider **[S15]** gives the energy scores for all orientations corresponding to a fixed distance between receptor and ligand centers of mass,

To develop a protocol for sampling of the full 6D space of mutual receptor–ligand orientations, one needs to complement the FMFT based sampling of five rotational degrees of freedom with explicit sampling of the only translational degree of freedom that is associated with the distance between receptor and ligand centers of mass. In fact, we do not have to do any additional manipulation to achieve this goal, because expression **[S17]** already contains a translation matrix that specifies the distance between the rotation centers, or, in our specific case of protein docking, between protein centers of mass. By simply changing the value of parameter *z*, we are able to sample configurations corresponding to different receptor–ligand distances,

Note here that the summation over different correlation terms is performed in the inner loop, which allows us to execute the manifold Fourier transform only once for all correlation components of the scoring function, which makes the total execution time much less sensitive to the number of correlations used in the scoring function.

In practice, the algorithm is organized as follows: For each translation distance of interest, we calculate the product of expansion coefficients and translation matrix elements,*Implementing the FMFT* explains the actual procedure being used to compute Fourier transform on the

#### Implementing the FMFT.

Before the discussion of implementation details, a few words need to be said about the sampling of *β* and *m*, *β* and *d* functions are precalculated for these values and stored in memory for future reuse.

The actual FMFT procedure is implemented in two steps.

First, we evaluate the expression*β* and *d* functions, thus performing an inverse 2D Discrete Wigner transform. Although a straightforward approach to this problem has a complexity of **[S23]**, the same result can be achieved in two

The second step is application of *z* between the centers of receptor and ligand. Because the complexity of each 3D FFT is

In summary, our inverse FMFT algorithm consists of *d* functions for calculation of the inverse Wigner transform. It should also be noted that, although we deal with multiplication by Wigner *d* functions (i.e., with inverse discrete Wigner transform) by using a simple

#### Symmetries of Fourier arrays.

It is important to mention that receptor and ligand properties that are being correlated are described by real functions, and, consequently, the values of the energy function should be real as well. This means that an array of real values is to be expected as a result of the FFT procedure, which can only be true if, for fixed *u* and *v* indices, the complex

In practice, this property allows us to fill only the elements corresponding to nonnegative values of the *m* index and to use specialized complex-to-real FFT algorithms (36), reducing the number of calculations and the amount of memory used by almost twofold.

## Acknowledgments

This work was supported by National Science Foundation (NSF) Grants AF 1527292 and DBI 1458509, NIH, National Institute of General Medical Sciences, Grants R35 GM118078 and R01 GM093147, Russian Scientific Foundation Grant 14-11-00877, and US Israel Binational Science Foundation Grant 2009418.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: midas{at}laufercenter.org.

Author contributions: D.P., A.K., Y.K., D.W.R., S.V., and D.K. designed research; D.P., A.K., B.S.Z., K.A.P., B.X., S.E.M., and D.K. performed research; D.P. and A.K. analyzed data; and D.P., A.K., D.W.R., S.V., and D.K. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603929113/-/DCSupplemental.

## References

- ↵
- ↵
- ↵
- ↵.
- Smith GR,
- Sternberg MJE

- ↵.
- Katchalski-Katzir E, et al.

- ↵
- ↵
- ↵
- ↵
- ↵.
- Mandell JG, et al.

- ↵
- ↵
- ↵.
- Rossmann MG

- Crowther R

- ↵
- ↵.
- Ritchie DW,
- Kozakov D,
- Vajda S

- ↵
- ↵.
- Garzon JI, et al.

- ↵.
- Ritchie DW,
- Venkatraman V

- ↵
- ↵
- ↵
- ↵.
- Zare RN

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Garrett DS,
- Seok YJ,
- Peterkofsky A,
- Clore GM,
- Gronenborn AM

*Escherichia coli*phosphotransferase system. Biochemistry 36(15):4393–4398. - ↵.
- Hwang H,
- Vreven T,
- Weng Z

- ↵
- ↵.
- Eswar N, et al.

- ↵.
- Biedenharn L,
- Louck J

- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology