On the evolution of protein–adenine binding

Significance
How proteins evolved to recognize and bind their ligands is a key mystery in protein function evolution. To explore this mystery, we study how proteins bind adenine, an ancient fragment. We characterize physicochemical patterns of protein–adenine interactions and link these to proteins’ evolutionary origins. In conflict with previous findings, we see that all of adenine’s hydrogen donors and acceptors have been used to bind proteins, and that adenine binding is likely to have emerged multiple times in evolution. To identify adenine-binding sites of shared origin, we use “themes”: short amino acid segments suggested to constitute evolutionary building blocks. We detect specific themes that are engaged in adenine binding; the detection of these in a protein’s sequence might reveal its function.

noticed an additional interaction with adenine's N3 in 103 of the proteins in the cluster, and interactions of the rest of adenine's binding atoms with water molecules in its environment. The themes shared by proteins in this cluster are divided into two groups, according to the binding mode they represent: the first group contains themes representing the reverse motif (themes 39, 40, 46, 320, 332, 333, 1055, 1200, 1202, 1219, 1220, 1221, 1454, 1663, 1683, 1734, 1921, 1968, 2086, 2329), while the second group donates a backbone interaction with adenine's N3 (themes 331, 335, 365, 61, 981, 1919). In addition, we noticed less common themes, representing interaction with adenine's N7 (themes 461 and 561).
Cluster #6B
The cluster is characterized by the "Asp" motif (found in 9 of the 10 proteins in the cluster), but with an additional interaction of adenine's N3 with the protein (or with a water molecule, in three proteins), and in some cases, interactions of adenine's N6 in the Hoogsteen edge either with amino acid residues or with water molecules. The themes in the cluster can be divided into three groups: a theme involved in the "Asp" motif (theme 623), a theme involved in binding of adenine's N3 (theme 279), and themes involved in the binding of adenine's N6 in the Hoogsteen edge (themes 101, 498, 648, 831, 832, 962).
Cluster #6C
This cluster is mainly characterized by the variation of the direct motif, as described in cluster #5A, which is found in 15 of the 17 binding sites composing the cluster. Themes shared by proteins in the cluster indeed represent this binding mode (themes 100, 281, 977, 1363).
'Bridging nodes' between cluster #6C and cluster #6D
There are three nodes connecting the cluster of SAM-binding proteins with NAD-binding proteins. Two of the nodes represent NAD binding sites; the third node represents an ATP binding site.
For the two NAD-binding proteins, the theme connecting the proteins to the SAM-binding proteins creates a large scaffold for the entire binding of adenine (theme 944). It has some overlap with the themes connecting them to the cluster of NAD-binding proteins (theme 949). For the ATP-binding protein, the same theme (theme 944) has smaller coverage; hence it is shorter. However, it includes an interaction between a backbone amide group and adenine's N3, and the variation of the direct form of the adenine-binding motif as described in cluster #6D.
Cluster #6D
The cluster is composed of 81 nodes; the vast majority of them (66) bind adenine in the "Asp" motif, where N6 in the Watson-Crick edge hydrogen-bonds either to the carboxylate group of aspartate, to the amide group of asparagine, or to the hydroxyl group of serine. In addition, most of the binding sites in this cluster (59) have an additional interaction, where adenine's N3 hydrogen-bonds to the amide group of another residue, which is found 25-30 amino acids upstream of the amino acid binding N1. The themes shared by proteins in this cluster can be divided into three groups: themes that create the 'scaffold' of the binding (themes 186, 793, 1756), a theme that donates the interactions between the protein and the Watson-Crick edge of adenine (theme 318), and themes that donate the interaction with adenine's N3 (themes 31, 317, 631).
Cluster #7
This cluster, with 11 binding sites in total, is composed mainly of FAD binding sites, but also includes two ATP binding sites and two NAD binding sites. Most of the binding sites (8) bind adenine via the "Asp" motif, except for two binding sites using the reverse motif, and another binding site where the interaction with adenine's N6 in the Watson-Crick edge is mediated by a water molecule.
Seven of the binding sites have an additional interaction between adenine's N7 and a backbone amide group, and six of the binding sites have an additional interaction between adenine's N6 in the Hoogsteen edge and a water molecule. The themes shared by the binding sites in this cluster donate the "Asp" motif together with the interaction with adenine's N3 (themes 89, 92, 485, 488, 491, 492, 1020, 1023, 1025).
Cluster #8
The cluster is composed of 10 SAM binding sites; 9 of them bind adenine via the reverse motif, and in 7 of them there is an additional interaction between adenine's N6 in the Hoogsteen edge and a backbone carbonyl group (in another binding site this interaction is mediated by a water molecule). Themes shared by the binding sites in the cluster either donate both interactions, in a large binding-site scaffold (theme 516), or only form the additional interaction with adenine's N6 (themes 2300, 2301, 2344).
Cluster #9
This cluster is composed of 10 SAM-binding proteins, all of which belong to PFAM's 'SET' family. Their adenine-binding pattern is unique: it is very similar to the reverse motif, except here the interactions are between the backbone amide and carboxyl group of the amino acid in "position III" and adenine's N6 in the Hoogsteen edge and N7. The themes shared by proteins in this cluster form this motif (themes 1125, 1127, 1128, 1843, 1845, 2314, 2325). Other interactions that are found in this cluster are not represented by themes shared by proteins in the cluster.
Cluster #10
The cluster is composed of 21 FAD-binding proteins, all of which bind adenine by the reverse motif. In all of them there is an additional interaction involving adenine's N6 in the Hoogsteen edge, mostly with water molecules. In addition, in 17 of the binding sites there is an additional interaction between adenine's N7 and, in most cases, a water molecule.
The themes shared by proteins in the cluster form the reverse motif (themes 254, 255, 256, 257, 776, 777, 778, 889, 895, 987, 1173, 2166, 2169, 2170, 2179, 2397, 2398). It is noteworthy that many of these themes are quite long; some cover more than 100 amino acids.

Supplementary Methods
The ComBind methodology
2D representations of the ligands. 3D coordinates of the ligands are downloaded from the PDB, and their 2D representations are produced using the OpenBabel chemistry toolbox (4).
Identify the adenine fragment in a ligand. To identify the adenine fragment of a ligand, the distances (in Angstroms) between all the nitrogen atoms in a 2D representation of the ligand are calculated, and compared to the distances between adenine nitrogen atom pairs. When the algorithm identifies a pair of identical distances, it shifts the ligand to the adenine to match the two nitrogen atoms. Next, it uses the Hungarian algorithm (5) to find the ligand atoms that have the closest proximity to the adenine atoms, and calculates the RMSD between these atoms and adenine. The ligand is considered as containing adenine if this RMSD is small enough (less than 0.1Å). ComBind then uses the Kabsch (6) algorithm to calculate a rotation and translation that optimally superimposes the adenines on one another, and uses it to transform the bound proteins. ComBind can be used with any rigid fragment of ligands. The code can be found in http://bitbucket.org/ayanarun/combind/src/master/.
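The first step of the adenine search, comparing nitrogen-nitrogen distances in the ligand against those in adenine, can be illustrated with a minimal sketch. This is not ComBind's code: the function names, the tolerance `tol`, and the toy coordinates are assumptions made for illustration, and the later steps (shifting, Hungarian matching, RMSD test, Kabsch superposition) are not shown.

```python
from itertools import combinations
from math import dist

def pair_distances(points):
    """Map each unordered index pair to the distance between its two points."""
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def matching_nitrogen_pairs(ligand_n, adenine_n, tol=0.05):
    """Return (ligand pair, adenine pair) tuples whose N-N distances agree.

    `tol` (in Angstroms) is a hypothetical tolerance; the text only speaks
    of 'identical' distances.
    """
    lig = pair_distances(ligand_n)
    ade = pair_distances(adenine_n)
    return [(lp, ap)
            for lp, ld in lig.items()
            for ap, ad in ade.items()
            if abs(ld - ad) <= tol]

# Toy 2D coordinates, NOT real adenine geometry.
adenine_n = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
ligand_n = [(10.0, 10.0), (13.0, 10.0), (20.0, 20.0)]

# Each seed is a candidate pair of nitrogens on which to anchor the overlay.
seeds = matching_nitrogen_pairs(ligand_n, adenine_n)
```

Each seed pair would then serve as the anchor for shifting the ligand onto adenine before the atom matching and RMSD check described above.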
Identify hydrogen bonds between the adenine and its environment. The polar interactions between the adenine and its surroundings (e.g., water molecules and amino acids) are identified using Arpeggio (7). The ligand and the atoms that hydrogen-bond with the adenine are extracted to a PyMOL session, and this collection of atoms is referred to as the 'interaction site' of adenine.
Our focus on hydrogen bonds for characterizing binding patterns is motivated by the fact that such bonds, which are very common in proteins (8)(9)(10)(11), are prevalent in ligand-binding sites and are important for the specificity of protein-ligand interactions (12)(13)(14). The role of such bonds in specificity is due in part to the dependence of their free energy on their geometry (i.e., on bond length and angles) (10,15,16). Moreover, hydrogen bonds are easier to identify compared with other protein–adenine interactions, such as those involving π electrons (π–π and cation–π interactions). The latter interactions, although specific in nature (17), are weaker than canonical hydrogen bonds, less common, and their energy dependence on chemical and geometric characteristics is much more complicated (18). Generally, these contribute to the affinity of the binding, while hydrogen bonding adds to its specificity (19)(20)(21)(22). Aromatic residues are often found in ligand binding sites (23)(24)(25)(26). However, the interactions of these residues with the ligand also include non-specific hydrophobic and van der Waals components.

Composing the datasets
We collected all the proteins in the PDB that bind adenosine triphosphate (ATP) and its analogs (PDB ligands: ADP, AMP, ANP, ACP, DTP, AGS, DAT, APC, A12, AN2, ADX, M33), nicotinamide adenine dinucleotide (NAD) and its analogs (PDB ligands: 8NA, A3D, CNA, DND, NXX, NAP, NA0, NJP), flavin adenine dinucleotide (FAD) and its analogs (PDB ligands: 6FA, FAS, 5X8), S-adenosyl methionine (SAM) and its analogs (PDB ligands: SAH, SMM), and coenzyme A (CoA) and its analogs (PDB ligands: CAO, COS, COZ, 1VU, ACO, BCO, IVC, ACO). We selected analogs that did not change the functional part of the ligands, and where the adenine fragment remained unchanged. We removed redundancy from the dataset, selecting only proteins sharing at most 30% sequence identity. Clustering was performed using the sensitive cluster mode of MMseqs2 (27,28) at a length coverage of 70%. From the resulting clusters, one representative per cluster was chosen based on resolution, free-R value, and completeness; when possible, crystal structures were preferred over NMR structures. The resulting dataset included 985 entries: 751 of them (76%) were structures of very good quality (resolution under 2.5Å, free-R value under 0.25), in 113 entries (11%) the resolution was between 2.5 and 6.93Å, and in the remaining 121 structures (12%) the resolution was good (2.5Å or better) but the free-R value was between 0.25 and 0.3.
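The three quality groups and their reported shares can be reproduced with a short sketch. The function name and tier labels are hypothetical, the toy entries are invented, and the handling of a resolution of exactly 2.5Å (treated here as "good") is an assumption, since the text leaves that boundary ambiguous.

```python
from collections import Counter

def quality_tier(resolution, r_free):
    """Assign an entry to one of the three quality groups described above."""
    if resolution > 2.5:
        return "low_resolution"            # the 2.5-6.93 Angstrom group
    if r_free < 0.25:
        return "very_good"
    return "good_resolution_high_r_free"   # free-R between 0.25 and 0.3

# Toy (resolution, free-R) pairs, not taken from the dataset.
entries = [(1.8, 0.21), (2.2, 0.27), (3.4, 0.29), (2.5, 0.24)]
tiers = Counter(quality_tier(res, rf) for res, rf in entries)

# Counts as reported in the text; the shares are recomputed from them.
reported = {"very_good": 751,
            "low_resolution": 113,
            "good_resolution_high_r_free": 121}
total = sum(reported.values())
shares = {name: round(100 * n / total) for name, n in reported.items()}
```

The recomputed shares agree with the rounded percentages quoted for the 985 entries.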

Network of binding patterns
To compare the adenine interaction sites in two different complexes, the closest binding site atoms were detected using the Hungarian algorithm. After the matching atoms were detected, the RMSD between them was calculated using the Kabsch (6) algorithm. We considered two interaction sites as 'similar' if this RMSD was under 0.3Å, and the corresponding atoms included at least 60% of the atoms in each of the interaction sites. We performed this calculation for all vs. all adenine interaction sites in our ATP datasets, and used Cytoscape (29) to visualize the network (Figure 4). Each node in the network represents an ATP (or analog)-interaction site, where the adenine fragment has at least 3 hydrogen bonds with its environment. Two nodes are connected by an edge if their respective interaction sites have 'similar' geometry. The length of the edge corresponds to the overall similarity between the two interaction sites; the shorter the edge, the more similar the two binding sites. Nodes that were not part of the main connected component, forming small clusters, have been removed for a clearer view of the network.
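The edge criterion above (RMSD under 0.3Å and at least 60% of the atoms of each site matched) can be sketched as follows. Several assumptions are made here and are not from the original method: the two sites are taken to be already superimposed in the common adenine frame, so no further Kabsch fit is performed; a brute-force search over permutations stands in for the Hungarian algorithm and is only practical for small sites; and `pair_cutoff`, the distance beyond which a pairing is not counted as a corresponding atom, is a hypothetical parameter.

```python
from itertools import permutations
from math import dist, sqrt

def best_matching(site_a, site_b):
    """Optimal one-to-one atom pairing by brute force (Hungarian stand-in)."""
    swapped = len(site_a) > len(site_b)
    small, large = (site_b, site_a) if swapped else (site_a, site_b)
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(dist(small[i], large[j]) ** 2 for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    pairs = list(enumerate(best))
    # Always return pairs as (index in site_a, index in site_b).
    return [(j, i) for i, j in pairs] if swapped else pairs

def are_similar(site_a, site_b, rmsd_max=0.3, min_fraction=0.6, pair_cutoff=1.0):
    """Apply the RMSD and coverage thresholds to two interaction sites."""
    pairs = [(i, j) for i, j in best_matching(site_a, site_b)
             if dist(site_a[i], site_b[j]) <= pair_cutoff]
    if not pairs:
        return False
    rmsd = sqrt(sum(dist(site_a[i], site_b[j]) ** 2 for i, j in pairs) / len(pairs))
    coverage = min(len(pairs) / len(site_a), len(pairs) / len(site_b))
    return rmsd < rmsd_max and coverage >= min_fraction

# Toy coordinates in Angstroms.
site_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
site_b = [(0.1, 0.0, 0.0), (1.0, 0.1, 0.0), (0.0, 1.0, 0.1), (5.0, 5.0, 5.0)]
site_c = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (0.5, 1.0, 0.0)]

edge_ab = are_similar(site_a, site_b)  # small deviations, 3 of 4 atoms matched
edge_ac = are_similar(site_a, site_c)  # every atom displaced by 0.5 Angstrom
```

Running `are_similar` over all pairs of sites would yield the edge list that is then loaded into a network viewer.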

Composing the theme dataset for adenine-binding proteins
To generate the dataset of themes for adenine-binding proteins we took the following steps:
Alignments: We used HHSearch (30) to obtain hidden Markov model (HMM) alignments for all the chains in our adenine-binding dataset. We filtered the alignments, keeping only those with an E-value under 10⁻². For each chain we collected all the alignments to other proteins in the set.
Generate candidate segments for the themes: For each chain, we calculated variations of different minimal lengths: 30, 40, 50, 60, 70, 80 amino acids, and used a unified naming scheme.
Identify the connected components in the chain network: From the alignments generated in the first step we composed a network in which each node is a protein chain, and two nodes are connected by an edge if the respective protein chains are aligned to each other. Next, we separated the chains into connected components. We performed the next step on each connected component separately.

Search and join the variations: For each chain in the adenine-binding dataset, and for each set of variations in the chain, we searched for the connected component of that chain. Each chain was represented by a node in a graph. For each node, we listed all the variations found in it. Starting from a specific node C with a specific variation M_n1, and a specific range of residues (s, e), we considered only edges that connected C via alignments that matched the residues between (s, e). We restricted the edge to connect two nodes with an alignment of approximately the same length as (s, e), and for which 80% of the residues of the variation M_n1 were matched to some residues in the alignment.
Theme generation: Once we had the list of pairs of variations that were similar, we grouped them into themes. The themes are the connected components in another graph, where the nodes are the variations we described earlier, and the edges are the similarity relationships between them. We assigned each theme (or connected component) a number. A theme is thus a set of protein fragments, together with evidence of similarity among them. The full list of themes can be found at http://trachelsrv.cs.haifa.ac.il/rachel/for_aya/Adenine_related_themes.tar.gz.
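The last two steps above, keeping only alignments that cover at least 80% of a variation and then grouping similar variations into connected components, can be sketched with a union-find structure. This is an illustrative sketch, not the authors' pipeline: the function names, the variation labels, and the toy alignments are hypothetical, and the additional requirement that the alignment have approximately the same length as (s, e) is not modelled.

```python
def matched_fraction(variation, aligned_residues):
    """Fraction of the variation's residues (s..e, inclusive) that are aligned."""
    start, end = variation
    hits = sum(1 for r in aligned_residues if start <= r <= end)
    return hits / (end - start + 1)

def group_into_themes(variations, similar_pairs):
    """Connected components of the variation-similarity graph (union-find)."""
    parent = {v: v for v in variations}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in similar_pairs:
        parent[find(a)] = find(b)
    components = {}
    for v in variations:
        components.setdefault(find(v), []).append(v)
    return sorted(sorted(members) for members in components.values())

# Hypothetical variations: label -> residue range (s, e).
variations = {"chainA:10-49": (10, 49), "chainB:5-44": (5, 44),
              "chainC:100-139": (100, 139), "chainD:1-40": (1, 40)}
# Toy alignments: (query variation, target variation, matched query residues).
candidates = [
    ("chainA:10-49", "chainB:5-44", set(range(10, 45))),      # 35 of 40 matched
    ("chainB:5-44", "chainC:100-139", set(range(5, 45))),     # 40 of 40 matched
    ("chainC:100-139", "chainD:1-40", set(range(100, 120))),  # 20 of 40, rejected
]
similar = [(q, t) for q, t, residues in candidates
           if matched_fraction(variations[q], residues) >= 0.8]
themes = group_into_themes(variations, similar)
```

In this toy input the first three variations end up in one theme through a chain of pairwise similarities, while the fourth stays on its own because its only alignment covers half of the variation.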

Detecting themes in adenine interaction sites
To detect all the themes that bind adenine, we expanded the initial dataset of themes. First, we searched for the themes in the UniProt database (31): we used each theme as an HMM and used HMMER (32), with a threshold of E-value smaller than 10⁻⁵. We added the identified matches to the HMM representing each theme and used this new HMM. In order not to lose the proteins that were initially included in the HMM of the theme, we made sure to add them to the resulting HMM. We applied this expansion process twice for each theme. Next, we used HMMER (32) again on our adenine-binding dataset with each theme, and searched for the proteins in this dataset containing the theme.

Theme network
We listed all the themes found in the interaction site of each of the proteins in our dataset, according to the unified naming scheme described above. We compared all proteins in the dataset against each other to find the themes shared by pairs of proteins: we created a network where each node represented a protein interaction site, and two nodes were connected by an edge if the corresponding binding sites had a shared theme, that is, if the amino acids that hydrogen-bond to adenine in both proteins are part of the same theme (see Figure 5; only clusters with 10 or more nodes are shown). We used Cytoscape (29) and CytoStruct (33) to view the network.
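The all-against-all comparison described above amounts to intersecting theme sets for every pair of interaction sites. A minimal sketch, assuming each site has already been reduced to the set of themes that contain its adenine-bonding residues; the function name, site identifiers, and theme numbers below are toy values, not data from the paper.

```python
from itertools import combinations

def theme_network(site_themes):
    """Return {(site_a, site_b): shared themes} for every pair sharing a theme."""
    edges = {}
    for a, b in combinations(sorted(site_themes), 2):
        shared = site_themes[a] & site_themes[b]
        if shared:
            edges[(a, b)] = shared
    return edges

# Toy input: site identifier -> themes found among its adenine-bonding residues.
site_themes = {
    "site1": {1, 2},
    "site2": {2, 3},
    "site3": {3},
    "site4": {7},
}
edges = theme_network(site_themes)
```

The resulting edge dictionary can be written out as a simple edge list for a network viewer; keeping the shared themes on each edge makes it possible to trace which theme links two sites.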

Discover adenine-binding proteins in protein datasets
We used HMMER to search the entire PDB/UniProt databases for each of the themes identified as being shared by the proteins in our dataset of adenine-binding proteins (Figure 5) (34). When a theme was found in a database entry with an E-value smaller than 10⁻⁵, we listed this entry as a "suspected adenine-binding protein". When searching against the PDB, we used ComBind to check for adenine-containing ligands in this entry and created two lists. The first list contained proteins that had a theme related to adenine binding, and that also had adenine as part of their structure (possibly in the context of a larger ligand); the second list included proteins that contained a theme related to the binding, but with no adenine in their structure. We used BLAST (35) to search all the proteins in the second list against proteins from the first list, with the goal of checking whether PDB entries with no bound adenine share sequence similarity with proteins that do bind adenine. A protein was considered a probable candidate for binding adenine if it shared at least 80% sequence identity, with 80% coverage, with a protein that was known to bind adenine.
Table S1. Hydrogen-bond definitions, taken from a variety of commonly used tools.

Figure S3. Themes can form adenine-binding patterns in proteins. The protein is shown in wheat with the themes highlighted in colors. The adenine-containing ligand is shown using a bond-stick model, and the hydrogen bonds to specific amino acids of the themes and water molecules are shown in black dashed lines. (A) A theme representing the reverse motif with an additional interaction between adenine's N6 in the Hoogsteen edge and a carboxyl group at 'position XV/XVI' (here R102). Demonstrated using PDB 4hg0. (B) A theme representing a variation of the reverse motif in the Hoogsteen edge, as found in PDB 4qeo. (C) A relatively long theme creating a scaffold for adenine binding, as found in PDB 1ej0. The "Asp" motif is used here, with an additional interaction with adenine's N3. A water molecule forming hydrogen bonds with adenine's Hoogsteen edge is also conserved.
Figure S4. Different hydrogen-bond definitions and geometrical-similarity thresholds may lead to different network representations of protein-ATP complexes, without changing our main conclusions. A network representation of protein-ATP complexes (colored circles) connected based on the geometry of their interaction regions. The nodes are colored according to the PFAM family assignment of the binding protein; only families represented by more than 3 nodes are colored, the rest are in grey. The color scheme is the same as the one used in Figure 4A. (A) ComBind's results for the dataset of ATP-binding sites, when the distance threshold for a hydrogen bond is 3.9Å. (B) Same as A, with a distance threshold of 3.2Å. (C) A network representation of protein-ATP complexes, when the distance threshold for a hydrogen bond is set at 3.9Å. (D) Same as C, with a threshold of 3.2Å. (E) Using lax thresholds leads to a larger network, with less noticeable clusters. Two nodes are connected by an edge if the RMSD between the binding sites is under 0.4Å and at least 60% of the interacting atoms are located in close proximity. (F) Using strict similarity thresholds breaks the network into numerous connected components. Two nodes are connected by an edge if the RMSD between the binding sites is under 0.2Å and at least 70% of the interacting atoms are located in close proximity.
Figure S5. A theme is a set of similar segments that recur, or are 'reused', across protein space: these segments have approximately the same length and their sequences are similar. Here we observe the segments corresponding to an example theme, 'theme 1403', which appears in the adenine-binding sites of proteins in cluster #4 (see supplementary text; "Themes used in adenine binding"). The theme was constructed from nine segments taken from seven chains. We represent each sequence by a cartoon line whose length is proportional to the number of residues in the sequence, and the positions of the segments within them are shown as colored blocks. Similarly, we show the structures of these chains and the colored segments within them. For example, in chain 3LFR_A, there is a brown segment of 48 residues (approximately half of the chain's residues); in chain 2YZQ_A, there are two segments: the first is colored in cyan, and the second in blue. The bottom of the figure shows the MSA of these segments.
Figure S6. Adenine's binding mode with 1uw1 resembles the mode shared by proteins in cluster 6A in Figure 5. 1uw1 was designed by function-directed (adenine binding) in vitro evolution (41). The protein uses the reverse motif to bind adenine in the Watson-Crick edge; in addition, a water molecule hydrogen-bonds to adenine's Hoogsteen edge, while the protein backbone forms another hydrogen bond with adenine's N3.

resolution values, the bound ligand, the atoms that participate in the hydrogen bonds with any of adenine's nitrogen atoms, and ECOD's X- and F-group assignments of the binding atoms.
Dataset S2 (separate file). The list of UniProt accession codes that include one of the themes involved in adenine binding.