Learning the shape of protein microenvironments with a holographic convolutional neural network

Significance Proteins are the machinery of life facilitating the key processes that drive living organisms. Recent advances have increased the number of experimentally resolved or computationally predicted tertiary structures of proteins. However, we still lack an understanding of how tertiary structure determines the function of a protein. M. Pun et al. introduce a physically motivated machine learning approach to learn interpretable models for protein structures, reflecting the underlying biophysics. This model accurately predicts the impact of mutations on protein stability and binding of protein complexes. The flexibility and efficiency of this approach also show promise for building generative models to design novel protein structures with desired stability and binding reactivity.

Proteins are the machinery of life.They facilitate the key processes that drive living organisms and only rely on twenty types of amino acids to do so.The physical arrangement of a protein's amino acids dictates how it folds and interacts with its environment.While this compositional nature gives rise to the diversity of existing proteins, it also makes determining protein function from structure a complex problem.
With the growing amount of data and computational advances, machine learning has come to the forefront of protein science, especially in predicting structure from sequence [1][2][3][4][5].However, the problem of how protein function is determined from its sequence or structure still remains a major challenge.
Techniques from natural language processing are used to determine functional motifs in protein sequences by allowing residues far away in sequence to form information units about function [6][7][8][9][10][11][12].However, since protein function is closely related to protein structure, models trained to map protein sequences to function must account for the complex sequence-structure map.Despite AlphaFold's remarkable success at predicting protein folding, it still struggles to determine the effect of mutations on the stability and function of a protein [13].Nonetheless, it is suggested that AlphaFold has learned an effective physical potential to fold proteins, and therefore, it could be used to characterize the effect of mutations or general protein function [14].Given the availability of high-resolution tertiary structures, obtained either experimentally or computationally, the information on atomic coordinates in a 3D structure of a protein can be used to learn a direct map to function.Indeed, models aware of the geometry of protein 3D structure that attempt to solve the inverse folding problem, i.e., designing a sequence that folds into a desired structure, can be used to reliably infer the functional effect of mutations in a protein sequence [15], or even engineer diverse sequences that have a desired function [16].
Here, we introduce holographic convolutional neural network (H-CNN) to learn physically grounded structure-to-function maps for proteins.H-CNN learns local representations of protein structures by encoding the atoms within protein micro-environments in a spherical Fourier space as holograms, and processes these holographic projections via a 3D rotationally equivariant convolutional neural network (Fig. 1) [25][26][27].The resulting model respects rotational symmetry of protein structures and characterizes effective inter-amino-acid potentials in protein micro-environments.
We train H-CNN on protein structures available in the Protein Data Bank (PDB) [28], and perform the supervised task of predicting the identity of an amino acid from its surrounding atomic neighborhood with a high accuracy and computational efficiency.The amino acids that H-CNN infers to be interchangeable have similar physicochemical properties, and the pattern is consistent with substitution patterns in evolutionary data.The H-CNN model encodes a more complete set of geometric features of protein structures compared to the other geometryaware models of proteins [12,15,16].Therefore, it can predict protein function, including binding and stability effects of mutations, only based on local atomic compo-sitions in a protein structure.Our results showcase that principled geometry-aware machine learning can lead to powerful and robust models that provide insight into the biophysics of protein interactions and function, with a potential for protein design.

Model
We define the micro-environment surrounding an amino acid as the identity and the 3D coordinates of atoms within a radius of 10 Å of the focal amino acid's α-carbon; this neighborhood excludes atoms from the focal amino acid.
A common approach to encode such atomic neighborhoods for computational analysis is to voxelize the coordinates, which is a form of binning in 3D [30,31].However, this approach distorts the information, since the voxel boundaries are arbitrary-too large voxels average over many atoms, and too small voxels lead to very sparse data.
The other obstacle for modeling such data is more fundamental and related to the rotational symmetries in encoding a protein structure neighborhood.A given neighborhood can occur in different orientations within or across proteins, and a machine learning algorithm should account for such rotational symmetry.One approach known as data augmentation, mainly used in image processing, trains an algorithm on many examples of an image in different orientations and locations.Data augmentation is computationally costly in 3D, and it is likely to result in a model of amino acid interactions that depends on the neighborhood's orientation, which is a non-physical outcome.Another approach is to orient the amino acid neighborhoods based on a prior choice (e.g., along the backbone of the protein) [30,31].However, this choice is somewhat arbitrary, and the specified orientation of the protein backbone could inform the model about the identity of the focal amino acid.
To overcome these obstacles, we introduce holographic convolutional neural networks (H-CNNs) for protein micro-environments.First, we encode the point clouds of different atoms in an amino acid neighborhood using 3D Zernike polynomials as spherical basis functions (Fig. 1 and Methods).3D Zernike polynomials can be used to expand any function in three dimensions along angular and radial bases and can uniquely represent the properties of the encoded object in a spherical Fourier space, given enough terms in the Fourier series.Conveniently, the angular component of the Zernike polynomials are spherical harmonics, which form an equivariant basis under rotation in 3D.Rotational equivariance is the property that if the input (i.e., atomic coordinates of an amino acid's neighborhood) is rotated, then the output is transformed according to a linear operation determined by the rotation (Fig. S1).As a result, these Zernike pro-jections enable us to encode the atomic point clouds from a protein structure without having to align the neighborhoods.Zernike projections in spherical Fourier space can be understood as a superposition of spherical holograms of an input point cloud, and thus, we term this operation as holographic encoding of protein micro-environments; see Fig. 1 and Fig. S2 and Methods for details.
The holograms encoding protein neighborhoods are input to a type of convolutional neural network (CNN).This network is trained on the supervised task of predicting the identity of a focal amino acid from the surrounding atoms in the protein's tertiary structure.Conventional CNNs average over spatial translations and can learn features in the data that may be in different locations (i.e., they respect translational symmetry).For the analysis of protein neighborhoods we need to infer models that are insensitive to the orientation of the data (i.e., that respect 3D rotational symmetry of the point clouds in a protein neighborhood).Recent work in machine learning has expanded CNNs to respect physical symmetries beyond translations [25][26][27].For 3D rotations, generalized convolutions use spherical harmonics, which arise from the irreducible representations of the 3D rotation group SO(3) [32].For our analysis, we use Clebsch-Gordon neural networks [26], in which the linear and the nonlinear operations of the network layers have the property of rotational equivariance (Methods); see Fig. S2 for detailed information on network architecture and Fig. S3 and Table S1 for details on hyper-parameter tuning and training of the network.
Taken together, the H-CNN shown in Figs. 1, S2, takes as input holograms that encode the spatial composition of different atoms (Carbon, Nitrogen, Oxygen, Sulfur, Hydrogen) and physical properties such as charge and solvent accessible surface area (SASA).The input is processed by a 3D rotationally equivariant CNN to learn statistical representations for protein neighborhoods.We train this H-CNN as a classifier on protein neighborhoods, collected from tertiary structures from the PDB, and use the trained network to quantify the preferences for different amino acids in a given structural neighborhood; see Methods for details on data pre-processing.

H-CNN reveals physico-chemical properties of amino acids, consistent with evolutionary variation
H-CNN predicts the identity of an amino acid from its surrounding micro-environment with 68% accuracy (Fig. 2).The accuracy of H-CNN is comparable to stateof-the-art approaches with conventional CNNs that voxelize and orient the data along the backbone of a central amino acid, while using a smaller atomic microenvironement for performing this classification task [30,31]; see Table S2 for a detailed comparison of models.Notably, restricting the training of H-CNN to the subspace of models that are rotationally equivariant leads to a substantial speedup in the training of H-CNN com- A neighborhood within a radius of 10 Å around a focal amino acid (masked in orange) in a protein structure is separated into its constituent atomic and chemical channels.The information in these channels is encoded in a rotationally equivariant form, using 3D Zernike polynomials, which defines holograms in spherical Fourier space.These holographic encodings are processed by a rotationally equivariant convolutional neural network (Clebsch-Gordan net [26]).The invariant features of the network layers are then collected and processed through fully connected feed-forward layers to determine the preferences, i.e., statistical weights (pseudo-energies) and probabilities, for different amino acids residing at the center of the input neighborhood.The set of predicted probability vectors across all 20 amino acids defines an amino acid profile.The network is trained by learning the categorical classification task with a softmax cross-entropy loss on a one-hot label of the neighborhood determined by the true central residue in the protein structure.A more detailed network architecture is presented in Fig. S2.
pared to the conventional techniques [30,31].Moreover, H-CNN is more accurate than other symmetry-aware approaches for molecular modeling, while using an order of magnitude fewer parameters [33,34]; see Methods and Table S2 for a detailed comparison of models.
H-CNN predicts the conformationally unique amino acids of Glycine and Proline with over 90% accuracy.Meanwhile, amino acids with typical side-chains cluster based on their sizes and the physico-chemical properties of the side-chains including aromatic, hydrophobic, and charged groupings (Fig. 2A).The inferred amino acid preferences cluster well according to the input amino acid type (true label) in the low-dimensional UMAP representation [35], and amino acids with similar physicochemical properties cluster in nearby regions in the UMAP (Fig. 2B,C).
H-CNN predictions reflect amino acid preferences seen in evolutionary data, even though the network is not trained on multiple sequence alignments (MSAs) of protein homologs.Specifically, the interchangeability of amino acids that H-CNN predicts is 71% correlated with the substitution patterns in evolutionary data, represented by the BLOSUM62 matrix (Fig. 2D).In addition, the amino acid preferences predicted by H-CNN at each site are consistent with evolutionary preferences inferred from the covariation of residues in multiple sequence alignments of protein families [29,36,37]; see Fig 2E,F and Methods for details.
Ablation studies further reveal that the H-CNN's processing of information corresponds to physical intuition.Removing information about SASA or charge from the input data results in roughly a 10% drop in classification accuracy; see Fig. S4 and the discussion on ablation studies in the Methods.Information from SASA mostly impacts the network's ability to predict hydrophobic amino acids, with some hydrophilic amino acids (R,K,E) also impacted.When charge is removed, the network demonstrates worse predictions on charged and polar amino acids most notably R, C, N, and E.
Since H-CNN is trained to predict the most natural amino acid given its neighborhood, it should also be able to recognize an unnatural protein configuration.To test this hypothesis, we characterize the response of the H-CNN predictions to physical distortions in native atomic micro-environments.We introduce distortions through local in silico shear perturbation of the protein backbone at a given site i by an angle δ, resulting in a transformation of the backbone angles by 3A and Methods); this perturbation is local with minimum downstream effects [28].We measure the distortion of the protein structure due to shear by calculating the change in the root-mean-square deviation in the pairwise distances of all atoms of the perturbed protein structure relative to that of the wild-type (RMS∆D ab , for all pairs of atoms (a, b)); Fig. 3B.
We measure the response of the protein to shear perturbation by analyzing the change in the logits produced by H-CNN, which we term pseudo-energies E α i (Fig. 1).Specifically, for a given distorted structure, we re-evaluate the pseudo-energy of each amino acid in the protein, and define the total H-CNN predicted energy by summing over the pseudo-energies of all the amino acids in a protein (Fig. 3C).The change in the predicted energy of a protein due to distortion (relative to the wild-type) ∆E is a measure of H-CNN's response to a given perturbation.A positive ∆E indicates an unfavorable change in the protein structure.
inference of evo-couplings predicted amino acid  [29]) to characterize the probability of a given amino acid, given the rest of the sequence (left); see Methods for details.To compare evolutionary and H-CNN predictions for site-specific amino acid profiles, the profile overlap is computed as the centered cosine similarity between the predicted probability profiles (right); see Methods.(F) The profile overlaps are strongly peaked around one, implying perfect overlap in data (purple); the average profile overlap across 11,221 sites from a total of 67 protein families is ρ = 0.67.The H-CNN predictions are notably different for the shuffled data, for which the profile overlap peaks near zero (cyan), with an average of 0.002.
In protein G, the change in the predicted energy ∆E as a function of distortion in the structure RMS∆D ab due to shearing at different sites reveals two trends (Fig. 3D).First, the protein network energy appears to respond locally quadratically to perturbations.Second, perturbations generally result in higher protein network energy, corresponding to a less favorable protein microenvironment.Taken together, by training on a classification task and by constraining the network to respect the relevant rotational symmetry, H-CNN has learned an effective physical potential for protein micro-environments in which the native structure is generally more favorable and robust to local perturbations (i.e., it is at the energy minimum).This observation of a minimum energy extends beyond the wild-type sequence when biophysically similar amino acids are substituted in the energy sum (Methods and Fig. S5A).Notably this pattern appears not to be just an artifact of the structure since the minimum disappears when random amino acids are used to calculate the network energy (Fig. S5B).

H-CNN predicts effect of mutations on protein stability
Characterizing amino acid preferences in a protein neighborhood is closely related to the problem of finding the impact of mutations on protein function.Here, we test the accuracy of H-CNN in predicting the stability effect of mutations in 40 different variants of the T4 lysozyme protein.Each of these variants is one amino acid away from the wild-type, with variations spanning 23 residues of the protein.Notably, the tertiary structure of the wild-type T4 lysozyme protein as well as the 40 mutants are available through different studies [30,[38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55]; see Table S3 for details on these mutants.
H-CNN predicts that the wild-type amino acids are the most favorable in the wild-type structure, while the mutant amino acids are generally more favorable in the mutant structures, regardless of their stabilizing effects (Fig. 4A).These variant-specific preferences are not surprising since the folded protein structure can re-lax to accommodate for amino acid changes, resulting in a structural neighborhood that is more consistent with the statistics of the micro-environments around the mutated amino acid than that of the wild-type.However, the confidence that H-CNN has in associating an amino acid with a given structural neighborhood can change depending on the stability effect of the mutation.The log-ratio of the H-CNN inferred probability for the mutant amino acid in the mutant structure versus that of the wild type amino acid in the wild type structure, ∆ log P = log P mut /P wt can provide an approximation to the ∆∆G associated with the stability of a mutation (Methods).
The inferred H-CNN predicted log-probability ratio is generally negative for destabilizing mutations, and non-negative for neutral/weakly beneficial mutations (Fig. 4B).Previously, a structure-based CNN model with voxelized protein structures has shown a similar qualitative result [30].Further quantitative analysis shows that the log-probability ratio is 67% correlated with the A ... experimentally evaluated ∆∆G values for these variants (Fig 4C ).Moreover, the receiver-operating-characteristic (ROC) curves in Fig. S6A show that the log-ratio of amino acid probabilities can reliably discriminate between destabilizing and neutral mutations, with an area under the curve (AUC) of 0.90.The availability of tertiary structures for a large number of variants is a unique feature of this dataset, and in most cases such structural resolution is not accessible.To overcome this limitation and predict the stability effect of mutations by relying on the wild-type structure alone, we used PyRosetta to relax the wild-type T4 lysozyme structure around a specified amino acid change [28] (Methods).We find that the log-probability ratios ∆ log P estimated based on these in silico relaxed mutant structures are mostly negative (non-negative) for destabilizing (neutral) mutations (Fig. S7) and are correlated with the stability effect of mutations ∆∆G (Fig. S7).However, structural relaxation can add noise to the data, causing the protein micro-environments to deviate from the natural structures that H-CNN is trained on.Thus, using the in silico relaxed structures slightly reduces the discrimination power of our model between deleterious and near-neutral mutations (AUC = 0.83); see Fig. S6A.
In contrast, the preferences estimated based on the wild-type structure only can discriminate between destabilizing and neutral mutations very well, even though most mutations are inferred to be deleterious with respect to the wild-type (AUC = 0.93 in Fig. S6).In other words, by using the wild-type structure only, our model can predict the relative stability effect of mutations correctly but not the sign of ∆∆G (Fig. S6, S7).Indeed, our inferred log-probability ratios based on the wild-type structure show a substantial correlation of 64% (Pearson correlation) with the stability effect of a much larger set of 310 single point mutants [56], for which protein structures are not available (Fig 4D).When no experimentally determined structure is available, computationally resolved protein structures from AlphaFold can also be used to predict the stability effect of mutations.The H-CNN predictions using the template-free AlphaFold2 predicted structure of T4 lysozyme wild-type sequence display substantial discrimination ability between destabilizing and near-neutral mutations (Fig. S6) and are correlated with the mutants' ∆∆G values (Figs.S6, S7).

H-CNN predicts fitness effect of mutations for binding of SARS-CoV2 to the ACE2 receptor
Recent deep mutational scanning (DMS) experiments measured the effect of thousands of mutations in the receptor-binding domain (RBD) of SARS-CoV-2 on the folding of the RBD (through expression measurements) and its binding to the human Angiotensin-Converting Enzyme 2 (ACE2) receptor [57,58].
H-CNN can be used to predict the effect of mutations on RBD, either in isolation or bound to the ACE2 receptor.The former can be interpreted as the effect of mutations on the stability of RBD, which is measured by the expression of the folded domain in the experiments [57][58][59], while the latter can be used to characterize amino acid preferences for binding at the RBD-ACE2 interface.Fig. 5A,B shows that the H-CNN predictions are correlated with the stability and binding measurements in the DMS experiments from ref. [58]; site specific effects are depicted in Figs.S8, S9.
The average effect of mutations on expression and binding can define three categories of sites and/or mutations (Fig. 5C): (i) sites that are intolerant to mutations (due to destabilizing effects) and show a substantially reduced expression of mutants (blue), (ii) sites that are tolerant of mutations for expression but not binding (green), and (iii) sites that are tolerant of mutations for both expression and binding (pink).Using the isolated structure of RBD, H-CNN can well classify mutations according to their stability effect (AUC = 0.8; Fig. 5D).Similarly, with the structure of the RBD-ACE2 complex, H-CNN can classify mutations according to their tolerance for binding (AUC = 0.74; Fig. 5D).
Expectantly, the sites that are tolerant of mutations for expression but not binding (green category from the DMS data in Fig. 5C) are located at the interface of the RBD-ACE2 complex, and H-CNN correctly predicts this composition (Fig. 5E, Fig. S10).The overall impact of mutations on binding for these sites is shown in Fig. 5E.
Identifying candidate sites that can tolerate mutations and can potentially improve binding is important for designing targeted mutagenesis experiments.Instead of agnostically scanning single point and (a few) double mutations over all sites, these predictions can inform experiments to preferentially scan combinations of viable mutations on a smaller set of candidate sites.In previous work, evolutionary information was used to design such targeted mutagenesis for the HA and NA proteins of influenza [60,61].A principled structure-based model could substantially improve the design of these experiments.

Discussion
The success of AlphaFold has demonstrated the power of machine learning in predicting protein structure from sequence [3].The challenge now is to leverage the experimentally and computationally determined protein structures to better understand and predict protein function.Our H-CNN model is a computationally powerful method to represent protein tertiary structures, and characterizes local biophysical interactions in protein micro-environments.Our model is physically motivated in that it respects rotational symmetry of protein structure data, allowing for significantly faster training time compared to previous approaches [30,31].
Similar to recent language models, H-CNN also demonstrates strong cross-task generalization by predicting quantitative effects of amino acid substitutions on function (i.e., zero-shot predictions), including protein stability or binding of protein complexes.Generally, massive language models trained on large and diverse protein sequence databases are shown to generalize well to predict mutational effects in proteins without any supervision [6,7,10,62,63].State-of-the-art methods include ESM-1b for zero-shot predictions [10] and MSA transformers that use evolutionary information from MSAs of protein families to predict the effect of mutations [62].The benchmark for these methods is the large set of DMS experiments, for which most zero-shot sequencebased predictions show an average accuracy of about 50% in predicting the rank order of the mutational effects [63].Our structure-based H-CNN method shows a comparable accuracy in predicting the mutational effect in DMS experiments of the RBD protein in SARS-CoV-2, yet with much fewer parameters; a more systematic analysis would be necessary to compare these different approaches.Nonetheless, it would be interesting to see how the features extracted by H-CNN can complement the sequence-based language models to potentially improve zero-shot predictions for mutational effects in proteins.
Recent work has shown that combining structural data with evolutionary information from MSAs in deep learning models can be powerful in predicting mutational effects in proteins [64].We have shown that H-CNN recapitulates the functional information reflected in evolutionary data, further reinforcing the idea that physically guided structure-based machine learning models could be sufficient in predicting protein function, without a need for MSAs.Importantly, our MSA-independent approach enables us to apply H-CNN to protein structures with no available homologs, including the de novo protein structures.
The H-CNN learned representations of amino acid neighborhoods could be used as input to a supervised algorithm to learn a more accurate model for mutational effects in proteins; a similar approach has been used to model the stability effect of mutations in ref. [65].Moreover, the all-atom representation of protein structures used to train H-CNN allows for generalizability, e.g. using the inferred model to analyze non-amino acid molecules or extending the model and accommodate other elements to study protein-drug or protein-DNA interactions.
Solving the inverse protein folding problem by designing a sequence that folds into a desired structure is a key step in protein design.Recent deep learning methods, including protein MPNN [16] and transformer-based ESM-IF1 [12,15], have shown promise in designing viable sequences with a desired fold for de novo proteins.H-CNN's ability to learn an effective potential in protein micro-environments merits investigation as to whether similar techniques can be used to solve the inverse folding problem for de novo proteins.
The learned representation of protein microenvironments with H-CNN enables us to characterize the preferences of different amino acid compositions in a structural neighborhood.
Additionally, these rotationally equivariant representations could be used as building blocks of larger protein structure units, e.g. to characterize how different molecular features on a protein surface could determine its interactions with other proteins.A study in this direction could shed light on the structure-to-function map of the protein universe.

II. ACKNOWLEDGMENT
This work has been supported by the National Institutes of Health MIRA award (R35 GM142795), the CAREER award from the National Science Foundation (grant No: 2045054), the Royalty Research Fund from the University of Washington (no.A153352), the Microsoft Azure award from the eScience institute at the University of Washington.This work is also supported, in part, through the Department of Physics, and the College of Arts and Sciences at the University of Washington.
Processing of the protein structure data.We minimally process the protein structures from the Protein Data Bank (PDB) [1].We use PyRosetta [2] to infer coordinates of hydrogen atoms in the protein structure.We also use PyRosetta to assign solvent accessible surface area (SASA) and partial charge to all atoms.
Constructing the training, the validation, and the test sets for protein neighborhood classifications.We classified amino acid neighborhoods extracted from the tertiary protein structures from the PDB.We used ProteinNet's splittings for training and validation sets of the CASP12 competition to avoid redundancy due to similarities in homologous protein domains [3].Since PDB identifiers in ProteinNet were only provided for the training and the validation sets, we used ProteinNet's training set as the base of both our training and validation and ProteinNet's validation set as our testing set.Specifically, to define our training and validation sets, we make a [80%, 20%] split in the ProteinNet's training data of proteins with X-ray crystallography structures of 2.5 Å resolution or better.Furthermore, in anticipation of validating our model on experimental stabilities of T4 lysozyme and SARS-CoV2 receptor-binding domain (RBD), we removed all structures that belonged to the same UniProt family as the domains from each of these proteins.We used all of ProteinNet's validation structures as our testing set.
This splitting resulted in 10,957 training structures, 2,730 validation structures, and 212 testing structures, each of which contained 2,810,503, 682,689, and 4,472 neighborhoods, respectively.
Data processing to assess the stability effect of mutations in the T4 lysozyme protein.Predictions for the stability effect of mutations in the T4 lysozyme protein were made using the PDB structure 2LZM [4] as the wild-type.The PDB identifiers for structures of all the T4 lysozyme variants used in our analyses (shown in Figs. 4 and S7) along with the reported ∆∆G and the pH values of the stability measurements are reported in Table S3.
Experimental measurements of ∆∆G values in variants for which we do not have a matching protein structure (shown in Fig. S6) are taken from the dataset FireProtDB [5].
Relaxation of mutant T4 lysozyme structures for the in silico model in Figs.4B,C and S7, was done using PyRosetta FastRelax with the score function ref2015 cart [2].Backbone flexible positions were restricted to the mutated residue and the neighboring residues in the sequence.Sidechain flexible positions were restricted to the neighbors with alpha carbons within 10 Å of the mutated residue's alpha carbon.
Data processing to assess the fitness effect of mutations in the RBD of SARS-CoV-2.Deep mutational scanning experiments from ref. [7] were used to obtain the SARS-CoV-2 RBD expression and binding to the ACE2 receptor.Predictions of the RBD-ACE2 binding strength, shown in Fig. 5B, were made using the co-crystallized protein structure with the PDB identifier 6M0J [8].The predictions for the RBD stability, shown in Fig. 5A, were made using the 6M0J structure with the ACE2 chain computationally removed.This computational removal was performed by selecting the RBD structure in PyMol [9] and exporting a PDB file from said selection.
Data processing to assess the effect of shearing in Protein G. To analyze the effect of shearing in protein G, shown in Figs. 3 and S5, we used the native structure of protein G with the PDB identifier 1PGA [10].Sheared structures were produced with PyRosetta by manually editing the backbone angles φ and ψ of each residue (Fig. 3A) via Pose.setphi and Pose.set psi [2].

II. ROTATIONALLY EQUIVARIANT STRUCTURE-TO-FUNCTION MAP FOR PROTEINS
The tertiary protein structure is a result of a folding process whereby amino acid sidechains interact with each other and the backbone structure to form the three-dimensional (3D) shape of the protein.The favorability of a certain amino acid occurring at a given site in a protein is determined by numerous factors: the presence of this amino acid during translation, the ability of the polypeptide chain to fold with that amino acid present, the amino acid's interactions with the local structural surroundings of the protein, and the amino acid's interactions with the surrounding environment.Our work focuses on revealing the third factor's contribution.Specifically, we aim to learn how a micro-environment in a protein's tertiary structure (i.e., atoms in a given structural neighborhood) constrain the usage of different amino acids in the neighborhood.
We define a structural neighborhood surrounding a focal amino acid by the coordinates (and identities) of all the N atoms from the surrounding amino acids in the structure, within a given radius of the focal residue.Mathematically, we want to find a map f : R N ×3 → [R + ≤1 ] 19 between the 3D coordinates of the N neighboring atoms and the vector of probabilities p ∈ [R + ≤1 ] 19 associated with preferences for different types of amino acids that can be placed at the center of the neighborhood; note that the amino acid probability vector has 19 independent entries due to normalization.
Assuming that the protein structure is a crystal (at least on average), the local interactions between the amino acid at the focal site of interest and the surrounding ones in the protein structure have rotational symmetry, in that they only depend on the pairwise distances of atoms and not on their absolute orientation in space.This symmetry implies that the interactions are unchanged under global rotations since a global rotation preserves pairwise distances and relative orientations.Therefore, any model of amino acid preferences based on local structural information should respect 3D rotational symmetry.Thus, our goal becomes learning a map f , as described above, that is also invariant to global rotations acting on the inputs.A rotation acting on any Cartesian vector can be expressed as an Euler angle matrix R euler (Fig. S1).Since our inputs x for coordinates of the atoms in a structural neighborhood are Cartesian vectors x ∈ R N ×3 , a global rotation R would act on these vectors via a Euler angle matrix, which we represent by D cart.(R).Therefore, the rotationally invariant map should have the property f In the rest of this section, we describe the mathematical foundation to construct 3D rotationally equivariant models for protein structure data.In the following section (Section II A), we specifically focus on our approach in encoding the structural data and training neural networks to learn 3D rotationally equivariant models for protein neighborhoods.The methodology discussed in Sections II B, II C is self-contained, and if a reader wishes they can choose to skip Section II A on the mathematical foundation of our approach.

A. Mathematical foundation for 3D rotationally equivariant transformations
The simplest approach to train a neural network model that is invariant to global rotations of proteins is to use invariant descriptors of the tertiary structure as inputs to the neural network.For example, one can use dot products between all Cartesian vectors describing the coordinates of atoms within a structural neighborhood.However, recent work has shown that neural networks that are allowed to work with full geometric data as opposed to simple invariants have higher expressivity and perform better on learning tasks [11].
Networks that build more complex invariants rely on representation theory and the idea of equivariance to build more expressive models.Equivariance is a generalization of invariance.Rather than requiring no change when inputs are transformed, equivariance allows the outputs of a map to change but requires them to change in a well-defined way.Under a transformation of the inputs corresponding to a rotation R, an equivariant map f satisfies the relation where D input (R) is the operator corresponding to the rotation in the input space, and D out (R) is the operator corresponding to the rotation in the output space.Note that if D out (R) is the identity matrix for all rotations, then the map is invariant.
The key to building an equivariant map is to know the operator D out for the transformation of interest.The prevailing method in the field of equivariant machine learning is to modularly build a neural network out of equivariant units where each unit has an input and output space where these operators are known.The input and output space are often the same and tend to be the Fourier space defined by the symmetry that is respected.
Rotations in three dimensions are non-commutative which means that the output of two consecutive rotations can depend on the order by which they act on the input vector.As a result, the Fourier space for rotations is more complicated thaN that of a commutative transformation like translations.Here, we will first use translations as an illustrative example for equivariant operations in Fourier space and then generalize to rotations.
Translational equivariance.The Fourier transform of a signal on a real line f (x) can be expressed as F (k) = ∞ −∞ e −ikx f (x)dx.We define a translated signal f a (x) = f (x − a) which is the original signal translated by a distance a.The Fourier transform of this translated signal follows Thus, the Fourier transform of a translated signal transforms as Fa a −→ e −ika F (k), implying that the Fourier transform is equivariant under translation with an output operator that is simply a phase shift D out (a) = e −ika .These operators D out (a) form a representation of the group of 1D translations [12].
In three dimensions, translations act on vectors via x → x + a and the representation is given by the operators operator D out (a) = e −ik•a .Rotational equivariance.Similar to translation, we can use Fourier space to define an equivariant transformation for rotations.We consider rotations about a given reference point that defines an origin for a spherical coordinate system in which points in 3D can be expressed by the resulting spherical coordinates (r, θ, φ).Since the radius r (i.e., the distance of a point from to the reference) does not change under rotations about the origin, we will ignore the radial component for now and consider a signal over the sphere of radius r, f (θ, φ) : S 2 (r) → R. The Fourier transform F of the signal on the sphere follows, where Y m (θ, φ) is the spherical harmonic of degree and order m defined as where is a non-negative integer (0 ≤ ), and m is an integer within the interval − ≤ m ≤ .P m (cos θ) is the Legendre polynomial of degree and order m, which, together with the complex exponential e imφ , define sinusoidal functions over the angles θ and φ.In quantum mechanics, spherical harmonics are used to represent the orbital angular momenta, e.g. for an electron in a hydrogen atom.In this context, the degree relates to the eigenvalue of the square of the angular momentum, and the order m is the eigenvalue of the angular momentum about the azimuthal axis.The operators that describe how spherical harmonics transform under rotations are called the Wigner D-matrices, denoted by D mm (R) [12]; this operator is a matrix because the rotation group in 3D is not commutative.Under a rotation R, spherical harmonics transform as Since spherical harmonics transform in this well-defined way, we can now examine how a rotation acts on the Fourier transform of a signal over the sphere.Suppose we have a signal f (n) defined on the elements n of the spherical shell S 2 (i.e., the set of angular coordinates for points on the sphere).The Fourier coefficients associated with the rotated where dΩ = sin 2 θ dθ dφ is the angular differential in the spherical coordinate system.Under a rotation R about the origin, the spherical Fourier coefficients of a real-space signal transform according to F m R −→ m =− D m m (−R) F m , which is simply a matrix product.
It should be noted that the Wigner D-matrices form an irreducible representation of SO(3), which is the group of all rotations around a fixed origin in three-dimensional Euclidean space.Therefore, it provides a natural basis to expand functions in SO(3) while respecting the rotational symmetry.
Nonlinear transforms respecting rotational equivariance.One key feature of neural networks is applying nonlinear activations, which enable a network to approximately model complex and non-linear phenomena.Commonly used nonlinearities include reLU, tanh, and softmax functions.However, these conventional non-linearities can break rotational equivariance in the Fourier space.To construct expressive rotationally equivariant neural networks we can use the Clebsch-Gordan tensor product, which is the natural nonlinear (or bilinear in the case of using two sets of Fourier coefficients) operation in the space of spherical harmonics [12].
Given two spherical tensors F 1 and Ĝ 2 , we can take a product between all of their components F 1m1 Ĝ 2m2 , which would allow us to express nonlinearities.Although these products do not behave well under rotations, linear combinations of them do transform in well-defined ways.The Clebsch-Gordan coefficients are the coefficients that decompose these products back into the space of spherical harmonics, where Wigner D-matrices define the rotation operations.Specifically, the product of spherical tensors F 1 and Ĝ 2 will yield spherical tensors of degree L such that where LM | 1 m 1 ; 2 m 2 is a Clebsch-Gordan coefficient and can be precomputed for all degrees of spherical tensors [12].Similar to spherical harmonics, Clebsch-Gordan products also appear in quantum mechanics, and they are used to express couplings between angular momenta.In following with recent work on group-equivariant machine learning [11,[13][14][15], we will use Clebsch-Gordan products to express nonlinearities in 3D rotationally equivariant neural networks for protein structures.

B. Rotationally equivariant encoding of protein micro-environments as spherical holograms
We define an amino acid's micro-environment as all the atoms associated with the surrounding amino acids within a 10 Å of the central residue's α-carbon.We use the position of the central residue's α-carbon as the reference point of the neighborhood N .Each atom i within this neighborhood, located at position r i with respect to the reference point, has three attributes associated with it: (i) the element type of the atom, e i ∈ {C, N, O, S, H}, (ii) the partial charge of the atom Q i , and (iii) the SASA A i of the atom.With these characteristics we define a feature vector v c i where c indexes the input channels {C, N, O, S, H, charge, SASA}.This vector takes on values given by where δ ij is a Kronecker delta function, and takes the value 1 if i = j and 0 otherwise.Thus, for c ∈ {C, N, O, S, H} the feature vector v c i takes value 1 if atom i is of type c and 0 otherwise.Q i and A i denote the partial charge and the SASA (in units of Å2 ) of atom i, respectively.These quantities are computed by PyRosetta for each protein structure [2].
Each channel defines a point cloud with the attributed values stored at the coordinates of the corresponding atoms.We express the point cloud of each channel in a spherical coordinate system with the origin set at the alpha carbon of the central amino acid in the neighborhood.We then define an atomic density as a sum of Dirac delta distributions parameterized by each atomic coordinate in the neighborhood ρ c (r) = i∈N v c i δ (3) (r − r i ).Here δ (3) (x) denotes a normalized Dirac delta probability density in 3D, which is zero everywhere except at the origin x = 0, where it is infinite.We use 3D Zernike transforms to encode the point clouds associated with each channel in a 3D rotationally equivariant fashion in the spherical Fourier space: where dΩ is the angular differential in the spherical coordinate system, Y m (θ, φ) is the spherical harmonics (eq.S4), and R n (r) is the radial function of the 3D Zernike transform, which can be expressed as, where 2 F 1 (•) is the ordinary hypergeometric function.The index n ≥ 0 is a non-negative integer, and the radial function R n (r) is nonzero only for even values of n − ≥ 0. Zernike polynomials form a complete orthonormal basis in 3D and, therefore, can be used to expand and retrieve any 3D shape, if large enough and n coefficients are used.
Approximations that restrict the series to finite n and are often sufficient for shape retrieval, and hence, desirable algorithmically.Importantly, expansion of an input signal by Zernike polynomials (eq.S8) is rotationally equivariant since a rotation R transforms the resulting coefficients through Wigner-D matrices as Ẑc (eq.S5).Zernike projections in the spherical Fourier space can be thought of as a superposition of spherical holograms of an input point cloud, and thus, we term this operation the holographic encoding of protein micro-environments.

C. Rotationally equivariant model of protein micro-environments with H-CNN
We use the coefficients of the Zernike expansion Ẑc nlm in eq.S8 (i.e., the holographic encoding of the data) as inputs to a rotationally equivariant convolutional neural network to classify amino acids based on their surrounding atomic micro-environments.We term this network holographic convolutional neural network (H-CNN).
The convolutional neural network that we use for our analysis is similar to the Clebsch-Gordan network (CGNet), described in ref. [13].This network is built out of three rotationally equivariant modular units: (i) linear transformation of spherical harmonics, (ii) Clebsch-Gordan nonlinearity, and (iii) spherical batch normalization.
Below we will describe the computations done in each of these modular equivariant units.Without a loss of generality, we denote the operations in each unit of the network in terms of a general equivariant signal F k lm where k is a catch-all index that represents all non-rotational indices (i.e., channel c and index n in eq.S8), and and m correspond to the angular indices (eq.S4).The output of each unit will be denoted by Ĝk lm .
Linear transformations in CG-nets.Linearities are ubiquitous in neural networks and contribute partially to their powerful expressiveness.Our linearity takes the form Ĝk where W ( ) kk denotes two dimensional matrices that mix different input channels k (e.g.combinations of radial index n and channel c in the input or general learned representations in intermediate layers) producing different output channels k.To preserve rotational symmetry, we will impose the restriction of only combining information (i.e., forming linear combinations of inputs) that belongs to the same irreducible representations (irreps) of SO(3) in eq.S10.In other words, we mix only inputs associated with the same degree and order m of spherical harmonics.These inputs are mixed across channels associated with different chemical or radial information in the first layer, or composite channels in the following layers.Generally, different weights are learned for each irrep but shared across coefficients of different order m.
Clebsch-Gordan nonlinearity.The last equivariant component of the network is a Clebsch-Gordan nonlinearity.Nonlinearities are ubiquitous in neural networks.The Clebsch-Gordan nonlinearity is the simplest equivariancerespecting nonlinearity, and it is expressive enough for most learning tasks [13,15].This Clebsch-Gordan nonlinearity is simply a product of spherical tensors that is decomposed back into the spherical harmonic basis via the Clebsch-Gordan coefficients, where Generally we have a choice for which combinations of values k 1 , k 2 , 1 , and 2 are used in this product.We focus on two choices for these indices.We define fully connected networks, for which k 1 , k 2 , 1 , and 2 are allowed to take on all possible values.We also define simply connected networks, for which we impose the condition that k 1 = k 2 and 1 = 2 .
The exact choice made in combining these indices affects the dimensionality of the network.The width of the Clebsch-Gordan nonlinearity output in a fully connected network with a maximum spherical degree L max and a width d h (i.e., k ∈ {1, ..., d h }) is given by Meanwhile, for a simply connected network with the same L max and d h , the width of the Clebsch-Gordan nonlinearity is Spherical batch normalization.Batch normalization is a common feature in neural networks that allows for smooth training of the network.Since the output of our nonlinear Clebsch-Gordan product is not bounded, batch normalization becomes extremely important to ensure that activations remain finite throughout training.We impose a batch normalization layer after each linear operation, producing normalized inputs to the nonlinear operation (Clebsch-Gordan product).We use the following batch normalization during training, where 1), * denotes complex conjugation, • batch denotes averaging over a batch, and is a small number used to ensure numerical stability, which we set to = 10 −3 .During training the network stores a moving average of the norm for each in each channel as where ξ serves as the momentum of our moving average, set to ξ = 0.99.This moving average is then used as the norm during testing in the following way: The different batch normalization used for training and testing prevents the batching of the validation data to affect the evaluation of the model accuracy during testing.

III. NETWORK AND TRAINING PROCEDURE FOR H-CNN
Network architecture.A complete H-CNN features an equivariant unit of Clebsch-Gordan layer followed by an invariant unit of dense layers as detailed in Fig. S2.A Clebsch-Gordan layer is comprised of an equivariant linearity, a spherical batch norm, a Clebsch-Gordan product, and a concatenation.Invariants from the inputs as well as all outputs of the each CG layer are collected for use in the invariant layers (Figs. 1, S2).
Training procedure.Networks were trained with all training data being read at run time using the keras fit function and a data generator made from a subclass of the keras.utils.sequeunceclass [16].Generally 40 workers were used for reading the data and a max queue size of 900 was found to be optimal for read/evaluation balance.Approximately one tenth of the validation data (51,200 neighborhoods) was evaluated after an approximate one tenth of the training data (256,000 neighborhoods) was seen.This was implemented by using tf.keras.Model.fitarguments steps per epoch=1000 and validation steps=100.Data was shuffled at the end of each evaluation interval.These evaluation intervals also served as checkpoints for both early stopping and weight saving.The best network was trained for 8.47 epochs in 4.54 hours on one NVIDIA A40 GPU hosted at the Hyak supercomputer at the University of Washington.
We used stochastic gradient descent and Adam optimizer [17] for training our networks.Adam generally showed better performance, and thus, all networks were trained with Adam during hyperparameter optimization.The main component of the loss was defined as the categorical cross entropy between the one-hot amino acid encoding and the softmax output of the network where p αi is the network-assigned probability of the true central amino acid α i in neighborhood i.Some hyperparameters were scheduled according to the behavior of the network's loss.The learning rate was put on a schedule of ReduceLROnPLateau to decrease by a factor of 0.1 whenever the validation loss did not improve over 10 epochs, with a minimum learning rate of 10 −9 .Ultimately, the networks were trained with early stopping when the validation loss did not improve by a difference of 0.01 over the course of 20 epochs.
Hyperparameter optimization.Hyperparameter optimization was performed in two steps: First, we scanned a larger parameter regime and trained 500 networks to identify a reasonable subregion of the parameter space (Fig. S3A).This sampling revealed networks with parameter ranges listed in Table S1.These parameters were then sampled more finely to get the best network models.The resulted training curves are shown in Fig. S3B,C.
It should be noted that we optimized over two classes of networks: (i) fully connected networks, and (ii) the simply-connected networks, as defined in Section II C. In simply connected networks, the nonlinear Clebsch-Gordan products are restricted to the square of Fourier coefficients with the same and channel index, in contrast to the fully connected networks that use bilinear products between all sets of Fourier coefficients (see equation (S11)).These restricted nonlinearities allowed us to use linear layers that had roughly one order of magnitude larger output dimensions in simply vs. fully connected networks (150 as opposed to 20); see Fig. S3E.We separately optimized the hyperparameters for these two network classes.
Model performance.For both the simply connected and the fully connected networks, we find the best hyperparameters that minimize the validation loss function in equation S17.The best simply-connected model had 62% classification accuracy and a minimal validation loss of 1.2 while the best fully-connected model had 68% classification accuracy and a minimal validation loss of 0.91.This difference in predictive power of the models appeared consistent across all training scenarios.
We compare the accuracy and the training efficiency of our model to the existing methods that classify amino acids based on all-atom representations of their local environments in Table S2.For this comparison, we used the metric of testing accuracy of the amino acid class since this was more commonly reported in the literature.Specifically, we compare H-CNN with two 3D CNN methods that used conventional convolutional neural networks on voxelized protein structures [18,19].Notably, these 3D CNN methods require local alignment of neighborhoods to the central amino acid.Also, each of these methods uses a cube of side length 20 Å, while spheres of radius 10 Å are used as inputs to H-CNN.Our method thus sees approximately half (π/6) of the volume that these other models see.Overall, our model has a comparable accuracy to the best 3D CNN method [19], but it is much more efficient in training.
We also compare our model to other methods that prioritize rotational symmetry, i.e., the Spherical CNN introduced in ref. [20] and the 3D Steerable CNN from ref. [21].Spherical CNN [20] is approximately rotationally invariant by performing 3D convolutions on a discretized grid over the 3D sphere.3D Steerable CNN [21] is a rotationally invariant method by applying spherical convolutions in a steerable basis.H-CNN performs better than both of these methods in the classification task with an order of magnitude fewer parameters (Table S2).
Alternative models.We studied the effect of our post-processing of atomic neighborhoods by performing ablation studies on charge and SASA.For each ablation, we reran the second stage of our hyperparameter optimization and only considered fully connected networks.The overall accuracy from removing SASA was 57% while the overall accuracy from removing charge was 56%, in contrast to the 68% accuracy of the complete model shown in Fig. 2. A similar 10% drop in accuracy associated with SASA and charge was previously reported in ref. [19].Fig. S4 shows the confusion matrices for each model compared to the best model's confusion matrix.

IV. COMPARING H-CNN WITH EVOLUTIONARY MODELS FOR AMINO ACID PREFERENCES
As shown in the main text, the interchangeability of amino acids predicted by H-CNN is consistent with the evolutionary substitution patterns reflected by the BLOSUM62 matrix (Fig. 2D).In addition to this coarse-grained measure, we also assessed the consistency of substitutions at the single-site level between the H-CNN predictions and evolutionary data.Evolutionary variation at a given site in a protein reflects both the substitutability of amino acids at that position (e.g.due to their common physico-chemical properties) and the epistatic interactions with the other covarying residues in the protein.
To isolate the site-specific effects of substitutions, we used the software EV-Couplings [22] to infer a model for amino acid preferences (and their couplings), using amino acid covariations in multiple sequence alignments (MSAs) of protein families.Specifically, EV-Couplings infers a maximum likelihood Potts model for interactions between residues in a protein sequence [23].From this model, we can evaluate the relative preference for occurrence of different amino acids at a given position in a protein sequence.
We used EV-Couplings to infer models of amino acid preferences for all of the protein domains in our test set.For each protein chain in our test data, we used EV-Couplings to compute the MSA for the protein associated with UniProt ID listed under the associated PDB entry.The MSA is computed using UniRef90 [24] that clusters homologous proteins such that sequences within each cluster have at least 90% identity to and 80% overlap with the longest sequence (seed).For inference of the Potts model, we used the default EV-Couplings settings for the rest of the parameters.Ultimately we were able to infer evolutionary models on 67 protein families corresponding to 67 chains in our testing set.These protein families amounted to 11,221 unique sites, for which we compare the EV-Couplings predictions on amino acid preferences with the H-CNN predictions in Fig. 2F.
To measure the similarity between the predictions of H-CNN and EV Couplings, we define the profile overlap ρ for amino acid preferences at a given position.Specifically, for two sets of amino acid probabilities P A , P B ∈ [R + ≤1 ] 19 , the vector overlap is defined as the normalized dot-product between the centered probability vectors, where • sites denotes the average of probability vector overs all sites in the data.Vector overlap takes values between -1 and 1, with 1 indicating a perfect alignment between the two (centered) vectors of amino acid preferences, and -1 indicating complete disagreement.

V. EFFECT OF SHEAR PERTURBATION ON PROTEIN MICRO-ENVIRONMENTS
We locally perturbed protein structures around their crystal conformation with the shearing deformation.As demonstrated in Fig. 3, shearing by an angle δ at a given residue of a protein is a perturbation to the backbone dihedral angles at two neighboring residues given by This perturbation was chosen because it can substantially change the local protein structure near the residue of interest while minimally affecting the far-away residues.Physical interactions in a protein should be dependent on the pairwise distances between atoms.We define D ab to be the pairwise distance in angstroms between a pair of atoms (a, b) in the protein structure.Then ∆D i,δ ab = D i,δ ab −D 0 ab measures the deviation of these pairwise distances from the pairwise distances in the native (unperturbed) structure D 0 ab , when site i is sheared by an angle δ (Fig. 3).We estimate the total shear-induced distortion in a protein structure as the root-mean-square of the deviations of pairwise distances RMS∆D k,δ ij ; here, the average is taken over all pairs of atoms that were at distances smaller than 10 Å in the unperturbed structure, i.e., {(a, b) | D 0 ab < 10 Å}.We characterize the model's response to the perturbation by the change in the protein network energy (Fig. 3), defined as the sum of pseudo-energies (logits) of amino acids over all sites in the protein, where α i denotes the protein's amino acid at site i.
As shown in Fig. 3D, the native protein structure appears to be at the energy minimum of the network and shear perturbations often result in less favorable protein micro-environments.To assess the robustness of this result, we evaluated the protein network energy in eq.S20 for alternative amino acids, while keeping the surrounding protein structure intact (i.e., using the wild-type neighborhood for model evaluation).Fig. S5 shows that if amino acid substitutions preserve the physico-chemical properties of the residue (i.e., substitutions according to the BLOSUM62 matrix), the undistorted structure remains close to the energy minimum, yet the landscape becomes shallower.However, for random amino acid substitutions, no energy minima is recovered.These results further suggest that H-CNN has learned an effective physical potential for protein micro-environments.
Parameterization of rotations and rotational equivariance.We use x-y-z Euler angles to parameterize rotations in 3D.This parametrization (A) can be written in a matrix form, and (B) it can be visualized on a 3D object.Starting from a given orientation in (B.i), the object's rotation can be broken down into three steps: First, rotation around the z-axis of angle α (B.ii).Second, rotation around the new y-axis by angle β (B.iii).And finally, rotation around the new z-axis by angle γ (B.iv).A map from the geometric descriptions of an object to physical observables can be invariant or equivariant under rotations.(C) A rotationally invariant map, such as fmass which gives the mass of a molecule, remains unchanged under the molecule's rotation.(D) A rotationally equivariant map, such as f dipol which gives the dipole moment of a molecule, transforms in a well-defined way with the molecule's rotation, denoted by the operator Din(R).Specifically, in this case, the transformation operator for the dipole moment under rotation Dout(R) acts linearly on the dipole moment in the original frame and is uniquely determined by the parameters of the rotation R. The Fourier transform becomes three dimensional when radial resolution is required thereby increasing the number of discrete n-values used nn.The multiplicity of coefficients is also determined by the number of point clouds being transformed nρ (e.g. the different atomic channels on the left).Since both the point clouds and the radial information are invariant under rotations, these two dimensions can be combined in index to give preference in organizing information according to the degree and order m of the spherical harmonics.(C) The Clebsch-Gordan block (CG block) is fully equivariant and is comprised of a linear layer, a spherical batch norm (SBN), Clebsch-Gordan product (CG prod), and a degree-wise concatenation (concat).(D) Throughout all layers of the network, all invariants ( = 0) are collected and fed into a series of dense layers which are simple linear combinations with training dropout.(E) These two main components in (C,D) comprise the bulk of the entire network.First the input Fourier coefficients are processed through successive CG layers.Invariants are collected from the original input and from the output of every CG layer.The real and imaginary components of these invariants are split and concatenated (complex concat) and then fed into a series of dense layers.This schematic reflects the dimensions of the optimal model discovered in this work.Four CG layers were used.We note that the maximal width of the network as determined by the output of the nonlinearity depends on due to the selection rules of the CG prod.We denote this width d 2 h Ω ,L where L is the maximal degree used in the network and are given in eqs.(S12) and (S13).Two dense layers were used in the optimal model.The dimensions of these layers are (4852 × 500) and (500 × 20).The output of these dense layers are termed pseudo-energies and act as logits in a softmax to estimate probabilities for each amino acid.We emphasize the only nonlinearities in the network are the Clebsch-Gordan products and the ultimate softmax.The difference in the confusion matrix for networks with ablated (A) SASA and (B) charge with respect to the network with the full information (Fig. 2A) is shown.Negative (red) values demonstrate less probability assigned to the input/predicted pair when the input is ablated, while positive (blue) values show increased probability mass under ablation.The removal of SASA (A) appears to decrease the network's ability to predict all amino acids as seen on the diagonal.The effect is most pronounced for hydrophobic amino acids in the upper left, with some hydrophilic amino acids (R, K, E) also displaying strong effects.Interestingly, most of the probability mass is on average redistributed to small amino acids A and S. When charge is removed (B), the network demonstrates worse predictions on charged and polar amino acids most notably R, C, N, and E.     3, but the change in network energy is evaluated for alternative amino acids at the shear position, while keeping the surrounding protein structure intact (i.e., using the wildtype neighborhood for model evaluation).The alternative amino acids are drawn according to the BLOSUM62 matrix in (A), and randomly in (B); the wild-type and the alternative sequence are shown each panel.When amino acids are substituted according to their BLOSUM62 scores, potential wells are generally recovered (A), while no energy minima is recovered for random substitutions (B).Specific sequences used are given by seq sub (substitutions according to the BLOSUM62 matrix), and seq rand (random substitutions) in each panel and are shown in comparison to the wildtype sequence seq wt .S10. H-CNN predictions for stability and binding of RBD.The H-CNN predicted mean pseudo-energy per site predicted from the RBD-ACE2 protein complex (measure of binding) is plotted as a function of the mean pseudo-energy per site, inferred from the isolated RBD structure (measure of stability).Points are colored according to their experimental values of expression and binding as illustrated in Fig. 5.Most points lie on the diagonal since the RBD pseudo-energy is determined from the RBD-ACE2 crystal structure with the ACE2 masked.However, sites at the RBD-ACE2 interface show deviations from the diagonal.H-CNN prediction for sites that are tolerant of mutations for stability but not binding are indicated by crosses (below the diagonal points with large stability values).These sites tend to be at the RBD-ACE2 interface and overlap with the green category from Fig. 5.

FIG. 1 .
FIG. 1. Schematic of Holographic Convolutional Neural network (H-CNN) for protein micro-environments.A neighborhood within a radius of 10 Å around a focal amino acid (masked in orange) in a protein structure is separated into its constituent atomic and chemical channels.The information in these channels is encoded in a rotationally equivariant form, using 3D Zernike polynomials, which defines holograms in spherical Fourier space.These holographic encodings are processed by a rotationally equivariant convolutional neural network (Clebsch-Gordan net[26]).The invariant features of the network layers are then collected and processed through fully connected feed-forward layers to determine the preferences, i.e., statistical weights (pseudo-energies) and probabilities, for different amino acids residing at the center of the input neighborhood.The set of predicted probability vectors across all 20 amino acids defines an amino acid profile.The network is trained by learning the categorical classification task with a softmax cross-entropy loss on a one-hot label of the neighborhood determined by the true central residue in the protein structure.A more detailed network architecture is presented in Fig.S2.

FIG. 2 .
FIG. 2. H-CNN predicts amino acid preferences in protein micro-environments.(A) The confusion matrix for amino acid predictions with H-CNN shows the mean H-CNN predicted probabilities of each of the twenty amino acids (output) conditioned on a specific central amino acid (input).Overall prediction accuracy is 68%.The hierarchical clustering for these predictions reflects known similarities in size and physico-chemical properties of amino acids.(B,C) Low dimensional projections of the prediction outputs (3D UMAPs) are shown.UMAPs are annotated by (B) the amino acid types, and (C) the physico-chemical clusters in (A), with panels showing a different view of the UMAP in each case.Neighborhoods are closely clustered by amino acid types (B), and are spatially arranged based on the physico-chemical properties of the side-chains (C); colors in (C) are consistent with (A).(D) Amino acid confusion in (A) correlates with the substitutability of amino acids in natural proteins as determined by the BLOSUM62 matrix; 71% Pearson correlation.(E) Schematic shows how evolutionary covariation of amino acids in multiple sequence alignments of protein families can be used to fit Potts models (EV-couplings[29]) to characterize the probability of a given amino acid, given the rest of the sequence (left); see Methods for details.To compare evolutionary and H-CNN predictions for site-specific amino acid profiles, the profile overlap is computed as the centered cosine similarity between the predicted probability profiles (right); see Methods.(F) The profile overlaps are strongly peaked around one, implying perfect overlap in data (purple); the average profile overlap across 11,221 sites from a total of 67 protein families is ρ = 0.67.The H-CNN predictions are notably different for the shuffled data, for which the profile overlap peaks near zero (cyan), with an average of 0.002.

FIG. 3 .
FIG. 3. Response of H-CNN predictions to physical distortions in protein structures.(A)The schematic shows shear perturbation in a protein backbone by an angle δ at site i as a rotation of side-chains around the backbone by the angles [φi, ψi−1] → [φi + δ, ψi−1 − δ][28].(B) Shearing changes the pairwise distance matrix D ab between all atoms in a protein structure.The total physical distortion is computed as the root-mean-square of changes in the pairwise distances that are less than 10 Å (i.e., residues within the same neighborhood), multiplied by the sign of the change in the angle ψ. (C) For a given perturbation, the network energy E is determined by the sum of pseudo-energies of the wild-type amino acid at all sites in the protein, and the change in this quantity by shearing ∆E measures the tolerance of a structure to a given perturbation.(D) Panels show the change in the network energy in response to the structural distortion by shear perturbation at all sites in protein G, with the amino acid type and the site number indicated above each panel.

FIG. 4 .
FIG.4.Predicting the stability effect of mutations in T4 lysozyme with H-CNN.(A) Heatmaps of H-CNN predicted log probability of different amino acids (columns) relative to that of the wild-type amino acid for 40 variants with single amino acid substitution from the wild-type (rows).For each variant (row), the position and the identity of the wild-type amino acid and the mutation are denoted between the two heatmaps as: wild-type, site number, mutation.The left panel shows the predictions using the wild-type protein structure (subscript (wt)), while the right panel shows the predictions using the structure of the specified mutant at each row (subscript (mut)).In each row the wild-type amino acid is indicated by an ×, and a dotted box shows the amino acid of the mutant.(B) The H-CNN predicted log-probability ratio ∆ log P = log P mut (mut) /P wt (wt)of the mutant amino acid on the mutant structure with respect to the wild-type amino acid on the wild-type structure is shown for all variants.The predicted ratios for destabilizing mutations are negative, while those for the neutral / beneficial mutations are positive.(C) The H-CNN predicted log-probability ratio ∆ log P shown against the experimentally evaluated ∆∆G for the stability effect of mutations in each protein structure; Pearson correlation of 67%.(D) The H-CNN predictions for the relative log-probabilities ∆ log Pwt using the wild-type structure only are shown against the experimentally measured ∆∆G values for 310 single point mutation variants of T4 lysozyme.Mean ∆∆G was used when multiple experiments reported values for the same variant.The colors show the density of points as calculated via Gaussian kernel density estimation.The predictions are accurate with correlations indicated in the panel.

FIG. 5 .
FIG. 5. Predicting the stability and binding of the RBD protein of SARS-CoV-2 with H-CNN.(A)The density plot shows H-CNN predictions for the RBD stability, using the isolated protein structure of RBD, against the mutational effects on the RBD expression from the DMS experiments; Spearman correlation r = 0.52.(B) The density plot shows H-CNN predictions for the RBD binding to the ACE2 receptor, using the co-crystallized RBD-ACE2 protein structure, against the DMS measurements for mutational effects on binding; shared color bar for (A) and (B).(C) The mean effect of mutations at each site on the RBD-ACE2 binding is shown against the mean effect on the RBD expression.The histograms show the corresponding distribution of effects across sites along each axis.The categories are shown: (i) sites that are intolerant to mutations due to destabilizing effect, i.e., low expression (blue), (ii) sites that are tolerant of mutations for expression but not binding (green), and (iii) sites that are tolerant of mutations for both expression and binding (pink).(D) Blue: true positive vs. false positive rate (ROC curve) for classification of amino acid mutations into stable (expr > −1) vs. unstable (expr < −1), based on the H-CNN predictions using the isolated RBD structure; AUC =0.8.Red: the ROC curve for mutation classification into bound (bind > −1) vs. unbound (bind < −1), based on the H-CNN predictions using the co-crystallized RBD-ACE2 structure; AUC= 0.74.(E) The effect of mutations on binding from the DMS experimental data for the green sites in (C) (top) and the corresponding H-CNN predictions from the RBD-ACE2 structure complex for sites identified by H-CNN in Fig.S10to be tolerant of mutations for stability but not binding (bottom) are shown throughout the structure.
Fig. S2.Schematics of rotationally equivariant data encoding and network architecture in H-CNN.(A) Any point cloud (left) can be written in the equivariant spherical harmonic Fourier basis by integrating the point cloud density against the spherical harmonics along with a choice of a radial function (middle), which depends on the spherical harmonic degree .The triangular structure (right) demonstrates the resulting Fourier coefficients f m , in which the rows reflect the different degrees of spherical harmonics 0 ≤ , which are non-negative integers.Individual cells within each row correspond to different orders m of the spherical harmonics, which are integers bounded by − ≤ m ≤ .(B) The Fourier transform becomes three dimensional when radial resolution is required thereby increasing the number of discrete n-values used nn.The multiplicity of coefficients is also determined by the number of point clouds being transformed nρ (e.g. the different atomic channels on the left).Since both the point clouds and the radial information are invariant under rotations, these two dimensions can be combined in index to give preference in organizing information according to the degree and order m of the spherical harmonics.(C) The Clebsch-Gordan block (CG block) is fully equivariant and is comprised of a linear layer, a spherical batch norm (SBN), Clebsch-Gordan product (CG prod), and a degree-wise concatenation (concat).(D) Throughout all layers of the network, all invariants ( = 0) are collected and fed into a series of dense layers which are simple linear combinations with training dropout.(E) These two main components in (C,D) comprise the bulk of the entire network.First the input Fourier coefficients are processed through successive CG layers.Invariants are collected from the original input and from the output of every CG layer.The real and imaginary components of these invariants are split and concatenated (complex concat) and then fed into a series of dense layers.This schematic reflects the dimensions of the optimal model discovered in this work.Four CG layers were used.We note that the maximal width of the network as determined by the output of the nonlinearity depends on due to the selection rules of the CG prod.We denote this width d 2 h Ω ,L where L is the maximal degree used in the network and are given in eqs.(S12) and (S13).Two dense layers were used in the optimal model.The dimensions of these layers are (4852 × 500) and (500 × 20).The output of these dense layers are termed pseudo-energies and act as logits in a softmax to estimate probabilities for each amino acid.We emphasize the only nonlinearities in the network are the Clebsch-Gordan products and the ultimate softmax.

Fig. S3 .Fig
Fig.S3.Hyperparameter optimization for H-CNN.Hyperparameter optimization was performed in two steps for both the simply connected and the fully connected networks: First, hyperparameters were scanned across a wide regime of hyperparameters via random sampling.From this scan, a narrow band of hyper-parameters were analyzed to determine the optimal network (A).Off the diagonal, each dot's color shows the validation loss seen during this first step of hyperparameter tuning for a network with the combination of hyperparameters shown on the two axes.On the diagonal, the density of networks is shown as a function of loss and the hyperparameter tested.Validation accuracy (B) and loss (C) are shown as a function of training epochs for all networks trained in the second stage of hyperparameter tuning.Red lines show the network with the best validation loss that is used throughout this work.
Fig.S5.Robustness of response to shear perturbation.Similar to Fig.3, but the change in network energy is evaluated for alternative amino acids at the shear position, while keeping the surrounding protein structure intact (i.e., using the wildtype neighborhood for model evaluation).The alternative amino acids are drawn according to the BLOSUM62 matrix in (A), and randomly in (B); the wild-type and the alternative sequence are shown each panel.When amino acids are substituted according to their BLOSUM62 scores, potential wells are generally recovered (A), while no energy minima is recovered for random substitutions (B).Specific sequences used are given by seq sub (substitutions according to the BLOSUM62 matrix), and seq rand (random substitutions) in each panel and are shown in comparison to the wildtype sequence seq wt .

Fig
Fig. S8.Experimental RBD expression of SARS-CoV-2 in DMS experiments and the H-CNN predictions.(A) Experimental expression of single-point mutation variants of SARS-CoV2 RBD relative to the wildtype (dots) as measured using yeast surface display experiments.Expression effect e is calculated as ∆ log 10 (MFI) where MFI is the mean fluorescent intensity.Mutations to G or P dominate the negative tail of expression effects.To better show the effect of all mutations, the negative portion of the colorbar is bounded by the expression of the most detrimental variant that is not a mutation to G or P. (B) The mean expression effect of mutations ē is listed next to each site.(C,D) The H-CNN predicted stability effect of mutations (D), and the mean predicted stability effect for each site Ē (C), evaluated based on the computationally isolated RBD structure, are shown.Again the negative portion of the colorbar is bounded from below by the most destabilizing prediction not involving mutations to G or P.

TABLE S1 .
Hyperparameter bounds and optimal values.Hyperparameters varied during model optimization are listed.The two different bounds used during the two steps of hyperparameter optimization are shown in the broad and fine sampling spaces.Different bounds for fully connected (fc) and simply connected (sc) networks are shown when appropriate.Finally, the optimal value is listed for both the optimal fully connected and simply connected networks.

TABLE S2 .
Comparison of structure-informed models for protein neighborhoods.H-CNN and existing methods trained on classifying all-atom neighborhoods are listed along with the summary quantities of the models.H-CNN demonstrates the power of respecting symmetry since it has fewer parameters and trains faster than 3D-CNNs despite being trained on at least the same amount of data.
The overall network's accuracy after removing SASA and Charge are 57% and 56%, respectively.