DeepTracer for fast de novo cryo-EM protein structure modeling and special studies on CoV-related complexes

Significance
Electron cryomicroscopy (cryo-EM), a technology recognized with a 2017 Nobel Prize, provides direct 3D maps of macromolecules and reveals the shape and interactions of protein complexes such as SARS-CoV-2 viral proteins and human cell receptors. This understanding can be combined with detailed structural information gathered using other technologies to form the basis for modeling the course of diseases and for designing therapeutic drugs. However, ab initio modeling of protein complex structure remains a challenging problem. Here, we present DeepTracer, a fully automated and robust tool that determines the all-atom structure of a protein complex based solely on its cryo-EM map and amino acid sequence, with improved accuracy and efficiency compared to previous methods. We also provide a web service for global access.


S1 Pre-Processing
The goal of the pre-processing steps is to prepare the cryo-EM maps for the neural network. These steps are crucial as cryo-EM maps can differ significantly in terms of shape, quality, resolution, and more. Therefore, we need to process the maps and convert them into a consistent format such that the neural network can understand connections across different maps. The pre-processing steps that achieve this are examined in Sections S1.1, S1.2, and S1.3.

S1.1 Data Grid Resampling
The first pre-processing step is to standardize the voxel size of all cryo-EM maps. The grid storing the volume data of the map has an associated voxel size, which determines the size of a single grid element, or voxel, in Angstrom. Without standardizing this voxel size to a fixed value, the neural network could not draw conclusions about how far two voxels are from each other, making it difficult to predict the location of any amino acids. Therefore, this step ensures that each map has a voxel size of exactly 0.5Å. The value 0.5 was chosen based on several rounds of testing as a trade-off between prediction precision and memory usage of the resulting grids.
To set the voxel size of a cryo-EM map to 0.5Å, we cannot simply change the metadata of the map. Instead, we have to resample the volume data onto a new grid in which each voxel represents 0.5Å. An example of a resampling process from an origin grid to a grid with half the voxel size is shown in Figure S1. Here, the shape of the volume remains the same, but eight times the number of voxels are required to represent it, because halving the voxel size doubles the voxel count along each of the three axes. To realize the resampling step, DeepTracer utilizes UCSF Chimera [1]. First, it creates a new grid of the same size in Angstrom as the original map, but with a voxel size of 0.5Å. Then, it uses Chimera's resampling command to resample the original map onto the newly created grid.
Figure S1: Data grid resampling. Visualization of the resampling process from an origin grid onto a grid with half the voxel size.
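The effect of resampling on grid dimensions can be illustrated with a minimal NumPy sketch. This is not the Chimera-based procedure used by DeepTracer: it substitutes plain nearest-neighbour lookup for Chimera's resampling command, and the function name `resample_nearest` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np

TARGET_VOXEL_SIZE = 0.5  # Angstrom, the fixed voxel size stated in the text


def resample_nearest(density, voxel_size, target=TARGET_VOXEL_SIZE):
    """Resample a 3D grid to a new voxel size by nearest-neighbour lookup.

    Assumption: nearest-neighbour lookup stands in for the interpolation
    performed by Chimera; the physical extent of the map is preserved.
    """
    ratio = target / voxel_size
    # Same extent in Angstrom, expressed in voxels of the new size.
    new_shape = [int(round(n / ratio)) for n in density.shape]
    # For every new voxel centre, find the original voxel that contains it.
    index = [
        np.minimum((np.arange(m) + 0.5) * ratio, n - 1).astype(int)
        for m, n in zip(new_shape, density.shape)
    ]
    return density[np.ix_(*index)]


grid = np.random.default_rng(0).random((4, 5, 6))  # toy map, 1.0 Angstrom voxels
resampled = resample_nearest(grid, voxel_size=1.0)
print(grid.shape, "->", resampled.shape)
print("voxel count ratio:", resampled.size // grid.size)
```

Halving the voxel size doubles each axis, which is where the factor of eight in the voxel count comes from.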

S1.2 Density Value Normalization
The absolute value of a voxel in itself contains little information. We have to rely on the density values of other voxels to draw conclusions about the protein structure. Consequently, we can normalize the density values without risking information loss. The normalization process ensures that the range of density values is identical for all maps. In the case of experimental maps, this range can initially differ substantially, with some maps containing values from -0.1 to 0.1 and others ranging from -10 to 20.
To normalize values, one would usually divide each value by the overall maximum. However, this is problematic for some cryo-EM maps because they contain outlier density values that are much higher than all other values. If we were to divide by such an outlier, all other density values would end up close to zero. Therefore, we divided all values by the 95th percentile of the density values instead. Afterwards, we set the few values greater than 1 to 1. Additionally, we set all values below 0 to 0, as they contain no valuable information for our use case. This leaves all density values in the range from 0 to 1. An example of the density value histograms before and after the normalization step can be seen in Figure S2.
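The normalization described above can be sketched in a few lines of NumPy. The function name `normalize_density` is hypothetical, and the sketch assumes that the 95th percentile is computed over all voxels of the map, a detail the text does not specify.

```python
import numpy as np


def normalize_density(density, percentile=95.0):
    """Scale density values by a high percentile and clip them to [0, 1].

    Assumption: the percentile is taken over every voxel in the map.
    """
    scale = np.percentile(density, percentile)
    if scale <= 0:
        raise ValueError("percentile must be positive to normalize the map")
    # Values above the percentile become 1, negative values become 0.
    return np.clip(density / scale, 0.0, 1.0)


# Toy example: values from -1 to 19, standing in for a map's density values.
demo = np.linspace(-1.0, 19.0, 21)
normalized = normalize_density(demo)
print(normalized.min(), normalized.max())
```

Dividing by a percentile rather than the maximum keeps a handful of extreme voxels from compressing the rest of the histogram toward zero.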

S1.3 Grid Division
The input layer of the model takes a cryo-EM map to make a prediction. We have to make sure that the dimensions of the volume data grid of the map are identical to those of the input layer of the model to avoid mismatch errors. However, the dimensions of the grid vary from map to map, so the grid must be modified to match the dimensions of the input layer. Unfortunately, we cannot simply scale the map to fit the input layer, as this would change the size of each voxel in Angstrom, which has to remain 0.5 as mentioned in Section S1.1. Therefore, we divided the volume data grid into multiple sub grids of size 64³, each matching the input layer of the deep learning model. The number 64 was chosen as it creates a relatively small input layer that is still broad enough for the deep learning model to detect larger patterns, such as secondary structure elements. Dividing the grid, however, can degrade predictions in areas close to the border of sub grids, as relevant information from neighboring voxels might be cut off. Therefore, we introduced a core grid of size 50³ in the center of each sub grid. Although each sub grid has a size of 64³, we only used the predictions from the inner core grid. Consequently, when dividing the grid, we overlapped the 64³ sub grids such that the core grids of all sub grids cover the entire original grid without overlapping each other. An example of such a division for a two-dimensional grid can be seen in Figure S3.
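The division scheme can be sketched as follows. This is a minimal illustration rather than the DeepTracer implementation: the function names are hypothetical, and the zero padding at the map borders is an assumption, since the text does not say how voxels outside the original grid are filled. With sub grids of 64 and cores of 50, each core is surrounded by a margin of (64 - 50) / 2 = 7 voxels.

```python
import itertools
import math

import numpy as np

SUB = 64                     # edge length of a sub grid (model input layer)
CORE = 50                    # edge length of the core whose predictions are kept
MARGIN = (SUB - CORE) // 2   # 7 voxels of context on each side of the core


def split_into_subgrids(grid):
    """Cut a 3D grid into overlapping 64^3 sub grids with tiling 50^3 cores."""
    tiles = [math.ceil(n / CORE) for n in grid.shape]
    # Assumption: pad with zeros so that every core has a full margin.
    pad = [(MARGIN, t * CORE - n + MARGIN) for t, n in zip(tiles, grid.shape)]
    padded = np.pad(grid, pad, mode="constant")
    subgrids = {}
    for idx in itertools.product(*(range(t) for t in tiles)):
        window = tuple(slice(i * CORE, i * CORE + SUB) for i in idx)
        subgrids[idx] = padded[window]
    return subgrids


def merge_cores(subgrids, shape):
    """Reassemble a grid of the given shape from the cores of the sub grids."""
    tiles = [math.ceil(n / CORE) for n in shape]
    out = np.zeros([t * CORE for t in tiles], dtype=float)
    core = slice(MARGIN, MARGIN + CORE)
    for idx, sub in subgrids.items():
        target = tuple(slice(i * CORE, (i + 1) * CORE) for i in idx)
        out[target] = sub[core, core, core]
    # Crop away the padding that was added beyond the original extent.
    return out[tuple(slice(0, n) for n in shape)]


grid = np.random.default_rng(0).random((70, 30, 101))
subgrids = split_into_subgrids(grid)
restored = merge_cores(subgrids, grid.shape)
print(len(subgrids), "sub grids; round trip exact:", np.array_equal(grid, restored))
```

In practice each sub grid would be passed through the model before merging; here the sub grids are merged unchanged, which shows that the cores tile the original grid exactly once.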

S2 Comparison with MAINMAST and Rosetta
In addition to Phenix, MAINMAST and Rosetta are two further established cryo-EM prediction methods. We conducted a brief analysis of their performance compared to DeepTracer based on a test set of nine cryo-EM maps taken from previous publications [2,3]. Note that the cryo-EM maps were cropped such that they captured only a single protein chain. This cropping was necessary as both methods can only perform single-chain predictions. To evaluate the predictions, we utilized Phenix's chain comparison tool. The results of this analysis can be seen in Table S1. DeepTracer outperforms Rosetta in all four metrics, with particularly large improvements in the percentage of matched residues and in false-positive predictions. Compared to MAINMAST, DeepTracer performed worse in three of the four metrics. However, the predictions of DeepTracer were much more complete, with an average matching percentage of 93.4% compared to only 36.4% for MAINMAST. In other words, MAINMAST correctly predicted only about one third of the protein structure.

Here, we can see that some amino acids are hard to discern. This comes from the fact that there are 20 different amino acids that the neural network has to predict and that some amino acids can look very similar in an experimental cryo-EM map. (E) Solved structure (PDB-3J9S) of cryo-EM map. (F) Atoms prediction segment next to solved structure.

Figure S5: Atom mask. Portion of the atom mask containing backbone atoms for part of a helix from the PDB-6NQ1 structure. The gray labels indicate carbon alpha atoms, the blue labels carbon atoms, and the yellow labels nitrogen atoms.

Figure S6: Probability density function for connection confidence. Normalized probability density function used to calculate the confidence score from the Euclidean distance and average backbone confidence between two Cα atoms. This score is used to express a distance between two atoms for the traveling salesman algorithm tracing the backbone.