Pairing interacting protein sequences using masked language modeling

Edited by Herbert Levine, Northeastern University, Boston, MA; received August 12, 2023; accepted December 18, 2023
June 24, 2024
121 (27) e2311887121

Significance

Deep learning has brought major advances to the analysis of biological sequences. Self-supervised models, based on approaches from natural language processing and trained on large ensembles of protein sequences, efficiently learn the statistical dependencies in these data. This includes coevolution patterns between structurally or functionally coupled amino acids, which allows these models to capture structural contacts. We propose a method to pair interacting protein sequences which leverages the power of a protein language model trained on multiple sequence alignments. Our method performs well for small datasets that are challenging for existing methods. It can improve structure prediction of protein complexes by supervised methods, which remains more challenging than that of single-chain proteins.

Abstract

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.
Interacting proteins play key roles in cells, ensuring the specificity of signaling pathways and forming multi-protein complexes that act, e.g., as molecular motors or receptors. Predicting protein–protein interactions and the structure of protein complexes are important questions in computational biology and biophysics. Indeed, high-throughput experiments capable of resolving protein–protein interactions remain challenging (1), even for model organisms, and experimental determination of protein complex structure is demanding.
A major advance in protein structure prediction was achieved by AlphaFold (2) and other deep learning approaches (3–5). Extensions to protein complexes have been proposed (6–9), including AlphaFold-Multimer (AFM) (7), but their performance is heterogeneous and less impressive than for monomers (10). Importantly, the first step of AlphaFold is to build multiple sequence alignments (MSAs) of homologs of the query protein sequence. The results of the CASP15 structure prediction contest demonstrated that MSA quality is crucial to further improving AlphaFold performance (11, 12). For protein complexes involving several different chains (heteromers), paired MSAs, whose rows include actually interacting chains, can provide coevolutionary signal between interacting partners that is informative about inter-chain contacts (13–16). However, constructing paired MSAs poses the challenge of properly pairing sequences. Accordingly, the quality of pairings strongly impacts the accuracy of heteromer structure prediction (9, 17, 18). Pairing interaction partners is difficult because many protein families contain several paralogous proteins encoded within the same genome. This problem is known as paralog matching. In prokaryotes, genomic proximity can often be used to solve it, since most interaction partners are encoded in close genomic locations (19, 20). However, this is not the case in eukaryotes. Large-scale coevolution studies of protein complexes (21–23) and deep learning approaches (6–9, 24) have paired sequences by using genomic proximity when possible (8, 21, 24), and/or by pairing together the closest, or equally ranked, hits to the query sequences, i.e., relying on approximate orthology (6–9, 17, 22–25).
Aside from genomic proximity and orthology, phylogeny-based methods have addressed the paralog matching problem (26–35), exploiting similarities between the evolutionary histories of interacting proteins (36–40). Other methods, based on coevolution (13, 41–45), rely on correlations in amino acid usage between interacting proteins (14, 15, 46, 47). These correlations arise from the need to maintain physico-chemical complementarity among amino acids in contact, and from shared evolutionary history (48, 49). Phylogeny and coevolution can be explicitly combined, improving performance (50). However, coevolution-based approaches are data-hungry and need large and diverse MSAs to perform well. This limits their applicability, especially to eukaryotic complex structure prediction. Nevertheless, the core idea of finding pairings that maximize coevolutionary signal holds promise for paralog matching and complex structure prediction.
We develop a coevolution-based method for paralog matching which leverages recent neural protein language models that take MSAs as inputs (2, 51). These models are one of the ingredients of the success of AlphaFold (2). We focus on MSA Transformer (51), a protein language model which was trained on MSAs using the masked language modeling (MLM) objective in a self-supervised way. We introduce Differentiable Pairing using Alignment-based Language Models (DiffPALM), a differentiable method for predicting paralog matchings using MLM. We show that it outperforms existing coevolution methods by a large margin on difficult benchmarks of shallow MSAs extracted from ubiquitous prokaryotic protein datasets. DiffPALM's performance further improves quickly when known interacting pairs are provided as examples. Next, we apply DiffPALM to the hard problem of paralog matching for eukaryotic protein complexes. For this, we take sequences paired by DiffPALM as input to AFM. Among the complexes we tested, using DiffPALM substantially improves structure prediction by AFM in some cases and, when orthologs are used as known interacting pairs, never yields significant deterioration. Overall, it achieves performance competitive with orthology-based pairing.

Results

Leveraging MSA-Based Protein Language Models for Paralog Matching.

MSA-based protein language models, which include MSA Transformer (51) and the EvoFormer module of AlphaFold (2), are trained to correctly fill in masked amino acids in MSAs with the MLM loss (see SI Appendix, MSA Transformer and Masked Language Modeling for MSAs and (51, 52) for details). To this end, they use the rest of the MSA as context, which allows them to capture coevolution. Indeed, MSA Transformer achieves state-of-the-art performance at unsupervised structural contact prediction (51), captures pairwise phylogenetic relationships between sequences (52), and can be used to generate new sequences from given protein families (53). While MSA Transformer was only trained on MSAs corresponding to single chains, inter-chain coevolutionary signal has strong similarities with intra-chain signal (13–15). As illustrated by SI Appendix, Fig. S1, MSA Transformer is able to detect inter-chain contacts from a properly paired MSA (54). SI Appendix, Fig. S1 further shows that it cannot do so from a wrongly paired MSA. Moreover, we find that the MLM loss (used for the pre-training of MSA Transformer) decreases as the fraction of correctly matched sequences increases, see SI Appendix, Fig. S2. These results demonstrate that MSA Transformer captures inter-chain coevolutionary signal.
In this context, we ask the following question: Can we exploit MSA-based protein language models to address the paralog matching problem? Let us focus on the case where two MSAs have to be paired, which is the relevant one for heterodimers. Paralog matching amounts to pairing these two MSAs, each corresponding to one of two interacting protein families, so that correct interaction partners are placed on the same row of the paired MSA. Throughout, we will assume that interactions are one-to-one, excluding cross-talk, which is valid for proteins that interact specifically (55). Thus, within each species, assuming that there is the same number of sequences from both families, we aim to find the correct one-to-one matching that associates one protein from the first family to one protein from the second family. We also cover the case where the two protein families have different numbers of paralogs within the same species, see Materials and Methods. Motivated by our finding that the MLM loss is lower for correctly paired MSAs than for incorrectly paired ones, we address the paralog matching problem by looking for pairings that minimize an MLM loss. A challenge is that the number of possible such one-to-one matchings scales factorially with the number of sequences in the species, making it difficult to find the permutation that minimizes the loss by a brute-force search. With DiffPALM, we address this challenge by formulating a differentiable optimization problem that can be solved using gradient methods, to yield configurations minimizing our MLM loss, see Materials and Methods.
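To make this concrete, the following minimal sketch scores one candidate paired MSA by its MLM loss, assuming the fair-esm package that distributes MSA Transformer; the function name and the simplified masking scheme (masking uniformly across the whole paired MSA, rather than one side only as in Materials and Methods) are illustrative rather than our exact implementation. A correct pairing should yield a lower loss than an incorrect one (SI Appendix, Fig. S2).
```python
import torch
import esm  # fair-esm package

# Load MSA Transformer and its alphabet (fair-esm API).
model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def pairing_score(paired_msa, mask_prob=0.7):
    """MLM loss of one candidate paired MSA; lower should mean better pairing.
    paired_msa: list of (label, concatenated_A_plus_B_sequence) rows."""
    _, _, tokens = batch_converter([paired_msa])  # shape (1, depth, length + 1)
    target = tokens.clone()
    # Randomly mask ordinary positions (column 0 holds special start tokens).
    mask = torch.rand(tokens.shape) < mask_prob
    mask[..., 0] = False
    masked_tokens = tokens.masked_fill(mask, alphabet.mask_idx)
    with torch.no_grad():
        logits = model(masked_tokens)["logits"]  # (1, depth, length + 1, vocab)
    return torch.nn.functional.cross_entropy(logits[mask], target[mask]).item()
```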

DiffPALM Outperforms Other Coevolution Methods on Small MSAs.

We start out by considering a well-controlled benchmark dataset composed of ubiquitous prokaryotic proteins from two interacting families, namely histidine kinases (HKs) and response regulators (RRs) (56, 57), see SI Appendix, Datasets. These proteins interact together within prokaryotic two-component signaling systems, important pathways that enable bacteria to sense and respond to environmental signals (55). They possess multiple paralogs (on the order of ten per genome, with substantial variability), and have strong specificity for their cognate partners. Because most cognate HK-RR pairs are encoded in the same operon, many interaction partners are known from genome proximity, which enables us to assess performance. In addition, earlier coevolution methods for paralog matching were tested on this dataset, allowing rigorous comparison (14, 47, 50). Here, we focus on datasets comprising about 50 cognate HK-RR pairs. Indeed, this small data regime is problematic for existing coevolution methods, which require considerably deeper alignments to achieve good performance (14, 15, 47, 50). Furthermore, this regime is highly relevant for eukaryotic complexes, because their homologs have relatively low sequence diversity, as shown by the effective depth of their MSAs in SI Appendix, Table S1. While prokaryotic proteins such as HKs and RRs feature high diversity, focusing on small datasets allows us to address the relevant regime of low diversity in this well-controlled benchmark case. We hypothesize that MSA Transformer's extensive pre-training can help to capture coevolution even in these difficult cases. To assess this, we first test two variants of our DiffPALM method (Materials and Methods) on 40 MSAs from the HK-RR dataset comprising about 50 HK-RR pairs each (SI Appendix, Datasets). We first address the de novo pairing prediction task, starting from no known HK-RR pair, and then we study the impact of starting from known pairs.
Fig. 1 shows that DiffPALM performs better than the chance expectation, obtained for random within-species matching. Moreover, it outperforms other coevolution-based methods, namely DCA-IPA (14), MI-IPA (47), which rely respectively on Potts models and on mutual information, and GA-IPA (50), which combines these coevolution measures with sequence similarity, a proxy for phylogeny. Importantly, these results are obtained without giving any paired sequences as input to the algorithm. The performance of DiffPALM is particularly good for pairs with high confidence score (Materials and Methods, Result and confidence), as shown by the "precision-10" curve, which focuses on the top 10% of predicted pairs, ranked by predicted confidence (Fig. 1). We also propose a method based on a protein language model trained on a large ensemble of single sequences, ESM-2 (650M) (5), see Materials and Methods, Pairing Based on a Single-Sequence Language Model. DiffPALM also outperforms this method, even though the latter is faster (no need for backpropagation) and is formulated as a linear matching problem, which is solved exactly. This confirms that the coevolution information contained in the MSA plays a key role in the performance of DiffPALM, which is based on MSA Transformer. A key strength of MSA Transformer and thus of DiffPALM is that they leverage the power of large language models while starting from MSAs, and thus allow direct access to the coevolutionary signal. Fig. 1 shows that both variants of DiffPALM, namely multi-run aggregation (MRA) and iterative pairing algorithm (IPA), outperform all baselines, and that the precision of MRA increases with the number of runs used (see SI Appendix, Table S2 for details). In MRA, we aggregate pairs predicted via independent optimization runs, while in IPA, we gradually add pairs with high confidence as positive examples (Materials and Methods). DiffPALM-IPA thus exploits the good results obtained for high-confidence pairs, evidenced by the high precision-10 scores in Fig. 1. Note that the distribution of precision-10 scores over the different MSAs we considered is skewed, especially after many MRA runs, see SI Appendix, Fig. S3. For many MSAs, almost perfect scores are reached, while performance is poor for a few others. MSAs with a smaller average number of sequences per species tend to yield higher precision, as the pairing task is then easier.
Fig. 1.
Performance of DiffPALM on small HK-RR MSAs. The performance of two variants of DiffPALM (MRA and IPA; see Materials and Methods, Improving precision: MRA and IPA) is shown versus the number of runs used for the MRA variant, for 40 MSAs comprising about 50 HK-RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA (14), MI-IPA (47), and GA-IPA (50). We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model (5), see Materials and Methods, Pairing Based on a Single-Sequence Language Model. With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is "precision-100," where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report "precision-10," which is calculated over the top 10% of predicted pairs, ranked by predicted confidence within each MSA (Materials and Methods). For DiffPALM, we plot the mean performance on all MSAs (colored curves), and the SE range (shaded region). For our ESM-2-based method, we consider 10 different values of masking probability p from 0.1 to 1.0, and we report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.
So far, we addressed de novo pairing prediction, where no known HK-RR pair is given as input. Can DiffPALM's precision be increased by exploiting "positive examples" of known interacting partners? This is an important question, since experiments on model species may, for instance, provide some positive examples (Materials and Methods, The Paralog Matching Problem). To address it, we included different numbers of positive examples, by using the corresponding nonmasked interacting pairs as context (Materials and Methods, Construction of an appropriate MLM loss). The Left panel of Fig. 2 shows that the performance of DiffPALM significantly increases with the number of positive examples used, reaching almost perfect performance for the highest-confidence pairs (precision-10).
Fig. 2.
Impact of positive examples, MSA depth, and extension to another pair of protein families. We report the performance of DiffPALM with five MRA runs (measured as precision-100 and precision-10, see Fig. 1), for various numbers of positive examples, on the same HK-RR MSAs as in Fig. 1 (Left panel). We also report the performance of DiffPALM (using no positive examples) versus MSA depth for both HK-RR and MALG-MALK pairs (Middle and Right panels). In all cases, we show the mean value over different MSAs and its SE, and we plot the chance expectation for reference. Note that MSA depth can vary by ±10% around the reported value because complete species are used (SI Appendix, Datasets).
Another important parameter is MSA depth. Indeed, deeper MSAs yield better performance for traditional coevolution methods (14, 47, 50). The Middle panel of Fig. 2 shows that DiffPALM performance also increases with MSA depth, while already reaching good performance for shallow MSAs. Note that MSA Transformer’s large memory footprint makes it difficult to consider very deep input MSAs (SI Appendix, Datasets).
While we focused on HK-RR pairing so far, DiffPALM is a general method. To assess how it extends to other cases, we consider another pair of ubiquitous prokaryotic proteins, namely homologs of the Escherichia coli proteins MALG-MALK, which are involved in ABC transporter complexes. These proteins form permanent complexes, while HKs and RRs interact transiently to transmit signals. The Right panel of Fig. 2 shows results obtained on 40 MSAs comprising about 50 MALG-MALK pairs, without positive examples. We observe that DiffPALM outperforms the chance expectation by a large margin. It also significantly outperforms existing coevolution methods (14, 47, 50), as well as our method based on ESM-2 (650M), see SI Appendix, Table S2. Note that all approaches yield better performance for MALG-MALK than for HK-RR, as the number of MALG-MALK pairs per species is smaller than that of HK-RR pairs. Finally, while Fig. 2 reports the final MRA performance, SI Appendix, Fig. S4 shows that both performance scores increase with the number of MRA runs.
HK-RR and MALG-MALK are interesting benchmarks because they have multiple paralogs per species and because ground-truth pairings are known from genome proximity. However, both domains involved in these pairs are sometimes present within a single chain. In fact, gene fusions often exist for interacting proteins (58), and are used as a clue to predict interactions (59). While hybrid HKs comprising both HK- and RR-specific domains are known to have lower specificity and weaker coevolutionary signal than other pairs (45), the presence of such proteins in the training set of MSA Transformer could be an ingredient of the success of DiffPALM. However, such proteins are also included in the training set of ESM-2, which we found to perform far less well than MSA Transformer for pairing. To further investigate if their presence in the training set of MSA Transformer is key to DiffPALM performance, we considered another prokaryotic pair, NUOA-NUOJ (NADH-quinone oxidoreductase). This system has very few fusions: only one protein with both domains is referenced in the InterPro database (60). We ran DiffPALM for four MSAs of 50 pairs of NUOA-NUOJ and obtained an average precision-100 score of 0.86 (SE 0.06), to be compared to a chance expectation of 0.5. Thus, DiffPALM performs well even when gene fusions are very rare.

Using DiffPALM for Eukaryotic Complex Structure Prediction by AFM.

An important and more challenging application of DiffPALM is predicting interacting partners among the paralogs of two families in eukaryotic species. Indeed, eukaryotes often have many paralogs per species (61) but eukaryotic-specific protein families generally have fewer total homologs and smaller diversity than prokaryotes. Moreover, most interacting proteins are not encoded in close proximity in eukaryotic genomes. Paired MSAs are a key ingredient of protein complex structure prediction by AFM (7, 9). When presented with query sequences, the default AFM pipeline (7) retrieves homologs of each of the chains. Within each species, homologs of different chains are ranked according to Hamming distance to the corresponding query sequence. Then, equal-rank sequences are paired. Can DiffPALM improve complex structure prediction by AFM? To address this question, we consider 15 complexes, listed in SI Appendix, Table S1, whose structures are not included in the training set of the AFM release we used unless specified otherwise (v2, SI Appendix, General Points on AFM), and for which the default AFM complex prediction was previously reported to perform poorly (7, 18) (SI Appendix, Eukaryotic Complexes).
Fig. 3 (Top panels) shows that DiffPALM can improve complex structure prediction by AFM (see SI Appendix, Fig. S5 for details). This suggests that it is able to produce better paired MSAs than those from the default AFM pipeline. In particular, substantial improvements are obtained for the complexes with PDB identifiers 6L5K and 6FYH, see SI Appendix, Figs. S6 and S7 for structural visualizations. For 6FYH, using DiffPALM also yields much higher average AFM confidence scores than the default AFM pipeline (Fig. 3, Bottom panels). For 6L5K, both pairing methods yield very high confidence scores, but none of the structures predicted with the default AFM pipeline agree with the experimental structure. We found no cases in which using DiffPALM improved AFM confidence scores but deteriorated structure predictions. In most cases, the quality of structures predicted using DiffPALM pairing is comparable to that obtained using the pairing method adopted, e.g., by ColabFold (8), where only the orthologs of the two query sequences, identified as their best hits, are paired in each species (resulting in at most one pair per species) (8, 9, 2225), see Fig. 3. Note however that, for 6PNQ, the ortholog-only pairing method is outperformed both by DiffPALM and by the default AFM pairing. Indeed, the raw and effective MSA depths are smaller for this structure than, e.g., for 6L5K and 6FYH (SI Appendix, Table S1). Thus, further reducing diversity by restricting to paired orthologs may be negatively impacting structure prediction in this case. Given the good results obtained overall with orthology-based pairings, we tried using them as positive examples for DiffPALM. Given the very good precision obtained by DiffPALM for high-confidence HK-RR pairs, we also tried restricting to high-confidence pairs. For most structures, we obtained no significant improvement over the standard DiffPALM using these variants (SI Appendix, Fig. S8). However, for 6WCW, we could generate several higher-quality structures, particularly when using orthologs as positive examples. Overall, when compared with the default AFM pairing pipeline, DiffPALM with orthologs as positive examples never significantly deteriorates prediction quality for our structures.
Fig. 3.
Performance of AFM using different pairing methods. We use AFM to predict the structure of protein complexes starting from differently paired MSAs, each of them constructed from the same initial unpaired MSAs. Three pairing methods are considered: the default one of AFM, only pairing orthologs to the two query sequences, and a single run of DiffPALM (equivalent to one MRA run). We used a single run for computational time reasons. Performance is evaluated using DockQ scores (Top panels), a widely used measure of quality for protein–protein docking (62), and the AFM confidence scores (Bottom panels), see SI Appendix, General Points on AFM. The latter are also used as transparency levels in the Top panels, where more transparent markers denote predicted structures with low AFM confidence. For each query complex, AFM is run five times. Each run yields 25 predictions which are ranked by AFM confidence score. The top five predicted structures are selected from each run, giving 25 predicted structures in total for each complex. Out of the 15 complexes listed in SI Appendix, Table S1, we restrict to those where any two of these three pairing methods yield a significant difference (>0.1) in average DockQ scores for at least one set of predictions coming from different runs but with the same within-run rank according to AFM confidence. Panels are ordered by increasing mean DockQ score for the AFM default method.
Although DiffPALM and orthology-based pairing achieve similar performance on these structure prediction tasks, DiffPALM predicts some pairs that are quite different from orthology-based ones. Indeed, SI Appendix, Fig. S9 shows that the fraction of pairs identically matched by DiffPALM and by orthology is often smaller than 0.5. SI Appendix, Fig. S10 further shows that, for the sequences that are paired differently by DiffPALM and by orthology, the Hamming distances between the two predicted partners are often above 0.5. Nevertheless, most of the pairs that are predicted both by DiffPALM and by using orthology have high DiffPALM confidence (SI Appendix, Fig. S9), confirming the importance of these pairs.
How does DiffPALM compare to traditional coevolution methods for these eukaryotic complexes? To address this, we paired chains using MI-IPA (47) for the six complexes with deepest pairable MSAs (SI Appendix, Table S1), and used these paired MSAs as input for AFM. SI Appendix, Fig. S11 shows that MI-IPA also yields some improvement over default AFM pairing for the structures improved by DiffPALM. However, DiffPALM yields stronger improvement than MI-IPA in these cases. This confirms the efficiency of DiffPALM at using coevolutionary signal to pair the sequences. Note that we focused on deep MSAs because of the depth requirements of traditional coevolution methods, and that DiffPALM performed better despite deep MSAs being split for DiffPALM due to memory limitations (SI Appendix, Datasets), which reduces the coevolutionary information available to DiffPALM.
Finally, SI Appendix, Fig. S12 compares the performance of DiffPALM to the AFM default and the ortholog-only pairing methods using the latest AFM release (v3, SI Appendix, Eukaryotic Complexes). Since all the structures considered here are included in the training set of this release, we expect the standard AFM pipeline to perform well. However, we observe that using DiffPALM yields similar or better performance than using the AFM default or ortholog-only pairing methods. In particular, the structure prediction of 6L5K remains substantially improved by using DiffPALM rather than AFM default pairing. Moreover, for 6THL, which was poorly predicted by all methods with the older AFM release (DockQ<0.1), DiffPALM yields a substantial improvement compared to both other pairing methods with the latest AFM release.

Discussion

We developed DiffPALM, a method for pairing interacting protein sequences that builds on MSA Transformer (51), a protein language model trained on MSAs. MSA Transformer efficiently captures coevolution between amino acids, thanks to its training to fill in masked amino acids using the surrounding MSA context (51–53). It also captures inter-chain coevolutionary signal, despite being trained on single-chain MSAs (54). We leveraged this ability in DiffPALM by using a masked language modeling loss as a coevolution score and looking for the pairing that minimizes it. We formulated the pairing problem in a differentiable way, allowing us to use gradient methods. On shallow MSAs extracted from controlled prokaryotic benchmark datasets, DiffPALM outperforms existing coevolution-based methods and a method based on a state-of-the-art language model trained on single sequences. Its performance quickly increases when adding examples of known interacting sequences. It also increases with MSA depth, as for traditional coevolution methods. Paired MSAs of interacting partners are a key ingredient of complex structure prediction by AFM. We found that using DiffPALM can improve the performance of AFM, and achieves performance competitive with orthology-based pairing.
Recent work (18) also used MSA Transformer for paralog matching, in a method called ESMPair. It relies on column attention matrices and compares them across the MSAs of interacting partners. This makes it quite different from DiffPALM, which relies on coevolutionary information via the MLM loss. ESMPair may be more closely related to phylogeny- or orthology-based pairing methods, since column attention encodes phylogenetic relationships (52). Thirteen of the 15 eukaryotic protein complexes we considered were also studied in ref. 18, but no substantial improvement (and often a degradation of performance) was reported for those when using ESMPair instead of the default AFM pairing, except for 7BQU. By contrast, DiffPALM yields strong improvements for 6L5K and 6FYH, and no significant performance degradation when using orthologs as positive examples. Explicitly combining coevolution and phylogeny using MSA Transformer is a promising direction to further improve partner pairing. Indeed, such an approach has already improved traditional coevolution methods (50). Other ways of improving MSA usage by AFM have also been proposed (63) and could be combined with advances in pairing. Besides improving MSA construction (64) and the extraction of MSA information, other promising approaches include exploiting structural alignments (65), using massive sampling and dropout (66), and combining AFM with more traditional docking methods (67, 68), which has, e.g., improved the structure prediction of 6A6I (67).
While DiffPALM performance increases with MSA depth, MSA Transformer’s memory and time requirements scale quadratically with MSA depth and with MSA length, limiting the size of input MSAs. Using, e.g., FlashAttention (69), or using a simpler MSA-based protein language model in place of MSA Transformer, might help alleviate some of these limitations.
DiffPALM illustrates the power of neural protein language models trained on MSAs, and their ability to capture the rich structure of biological sequence data. The fact that these models encode inter-chain coevolution, while they are trained on single-chain data, suggests that they are able to generalize. We used MSA Transformer in a zero-shot setting, without fine-tuning it to the task of interaction partner prediction. Such fine-tuning could yield further performance gains (70).
Like MSA Transformer, AlphaFold’s EvoFormer (2) also processes MSA input. Furthermore, a portion of the total loss used during its training as part of a supervised structure prediction method consisted of an MLM loss. Thus, one could develop a paralog matching method based on EvoFormer. However, the substantially larger pre-training dataset of MSA Transformer should make DiffPALM more broadly applicable. Furthermore, the monomer version of EvoFormer performs less well than MSA Transformer for mutational effect prediction, despite being an excellent contact predictor (71). Since mutational effect prediction probes the quality of the entire probability vectors output at masked positions, this hints that DiffPALM might not perform as well if we were to use the monomer version of EvoFormer instead of MSA Transformer. Nevertheless, an interesting perspective would be to integrate paralog matching into an end-to-end retraining of AlphaFold-Multimer.
The fact that DiffPALM outperforms existing coevolution methods (14, 47, 50) for shallow MSAs is reminiscent of the impressive performance of MSA Transformer at predicting structural contacts from shallow MSAs (51). While traditional coevolution methods either compute local coevolution scores for two columns of an MSA (47) or build a global model for an MSA (14, 15), MSA Transformer was trained on large ensembles of MSAs and shares parameters across them. This presumably allows it to transfer knowledge between MSAs, and to bypass the deep-MSA requirements of traditional coevolution methods (14, 15, 46, 47, 50) and of MSA-specific transformer models (72). This constitutes major progress in the use of coevolutionary signal.
After the transformative progress brought by deep learning to protein structure prediction (2–5), predicting protein complex structure and ligand binding sites is fast advancing with AFM and related methods, but also with other deep learning models based on structural representations (73–76). Combining the latter (77, 78) and, more generally, structural information (79) with the power of sequence-based language models is starting to bring even further progress.

Materials and Methods

The Paralog Matching Problem.

Goal and notations.

Paralog matching amounts to pairing two MSAs, each one corresponding to one of the two protein families considered. We assume that interactions are one-to-one. Let M(A) and M(B) be the (single-chain) MSAs of two interacting protein families A and B, and let K denote the number of species represented in both MSAs and comprising more than one unmatched sequence in at least one MSA. Species represented in only one MSA are discarded, since no within-species matching is possible for them. Species with only one unmatched sequence in each MSA are not considered further, since their pairing is trivial. There may also be Npos known interacting pairs: they are treated separately, as positive examples (see "Exploiting known interacting partners" below). Here, we focus on the unmatched sequences. For k = 1, …, K, let Nk(A) and Nk(B) denote the number of unmatched sequences belonging to species k in M(A) and M(B) (respectively).

Dealing with asymmetric cases.

The two protein families considered may have different numbers of paralogs within the same species. Assume, without loss of generality, that Nk(A) < Nk(B) for a given k. To solve the matching problem with one-to-one interactions, we would like to pick, for each of the Nk(A) sequences in M(A), a single and exclusive interaction partner out of the Nk(B) available sequences in M(B). The remaining sequences of the species in M(B) are left unpaired. In practice, we achieve this by augmenting the original set of species-k sequences from M(A) with Nk(B) − Nk(A) "padding sequences" made entirely of gap symbols. By doing so (and analogously when Nk(A) > Nk(B)), the thus-augmented interacting MSAs have the same number Nk := max(Nk(A), Nk(B)) of sequences from each species k. In practice, this method is used for the AFM complex structure prediction, while the curated benchmark prokaryotic MSAs do not have asymmetries (SI Appendix, Datasets).
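As an illustration, the padding step might look as follows (a minimal sketch; the function name is ours, and MSAs are represented as lists of (label, sequence) rows grouped by species):
```python
def pad_species(rows_a, rows_b, len_a, len_b):
    """Pad the smaller family with all-gap sequences so that both families
    contribute N_k = max(N_k(A), N_k(B)) rows for this species."""
    n_k = max(len(rows_a), len(rows_b))
    rows_a = rows_a + [("pad", "-" * len_a)] * (n_k - len(rows_a))
    rows_b = rows_b + [("pad", "-" * len_b)] * (n_k - len(rows_b))
    return rows_a, rows_b
```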

Formalization.

The paralog matching problem corresponds to finding, within each species k, a mapping that associates one sequence of M(A) to one sequence of M(B) (and reciprocally). Thus, within each species k, one-to-one matchings can be encoded as permutation matrices of size Nk×Nk. A brute-force search through all possible within-species one-to-one matchings would scale factorially with the size Nk of each species, making it prohibitive. Note that the Iterative Pairing Algorithm (IPA) (14, 47) is an approximate method to solve this problem when optimizing coevolution scores. Here, we introduce another one, which allows us to leverage the power of deep learning.
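The differentiable formulation is detailed in SI Appendix; the following PyTorch/SciPy sketch (our own naming, under simplifying assumptions) conveys its core: a Sinkhorn operator yields a soft, doubly stochastic relaxation used in the backward pass, while the hard matching operator M (Hungarian algorithm) is used in the forward pass, combined through a straight-through estimator.
```python
import torch
from scipy.optimize import linear_sum_assignment

def sinkhorn(X, n_iter=10, tau=1.0):
    """Soft relaxation: alternate row/column normalization in log space."""
    log_p = X / tau
    for _ in range(n_iter):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns
    return log_p.exp()

def matching(X):
    """Hard matching operator M(X): permutation maximizing the sum of entries."""
    rows, cols = linear_sum_assignment(X.detach().cpu().numpy(), maximize=True)
    P = torch.zeros_like(X)
    P[torch.as_tensor(rows), torch.as_tensor(cols)] = 1.0
    return P

def relaxed_permutation(X):
    """Forward pass: hard permutation; backward pass: Sinkhorn gradients."""
    soft = sinkhorn(X)
    return soft + (matching(X) - soft).detach()
```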

Exploiting known interacting partners.

Our use of a language model allows for contextual conditioning, a common technique in natural language processing. Indeed, if any correctly paired sequences are already known, they can be included as part of the joint MSA input to MSA Transformer. In this case, we exclude their pairing from the optimization process—in particular, by not masking any of their amino acids, see below. We call these known paired sequences “positive examples.” In Fig. 2, we randomly sampled species and included all their pairs as positive examples, until we reached the desired depth Npos±10%. For eukaryotic complex structure prediction, we treated the query sequence pair as a positive example.

DiffPALM: Paralog Matching Based on MLM.

Here, we explain our paralog matching method based on MLM, which we call DiffPALM. Background information on MSA Transformer and its MLM loss is collected in SI Appendix, MSA Transformer and masked language modeling for MSAs. DiffPALM exploits our differentiable framework for optimizing matchings, see SI Appendix, A differentiable formulation of paralog matching. The key steps are summarized in Fig. 4.
Fig. 4.
Schematic of the DiffPALM method. First, the parameterization matrices Xk are initialized, and then the following steps are repeated until the loss converges: 1) Compute the permutation matrix M(Xk) and use it to shuffle M(A) relative to M(B). Then pair the two MSAs. 2) Randomly mask some tokens from one of the two sides of the paired MSA and compute the MLM loss, see SI Appendix, Eq. S1. 3) Backpropagate the loss and update the parameterization matrices Xk, using the Sinkhorn operator Ŝ for the backward step instead of the matching operator M (SI Appendix, A differentiable formulation of paralog matching).

Construction of an appropriate MLM loss.

Using the tools just described, we consider two interacting MSAs (possibly augmented with padding sequences), still denoted by M(A) and M(B). Given species indexed by k = 1, …, K, we initialize a set {Xk}k=1,…,K of square matrices of size Nk×Nk (the case K = 1 corresponds to X in SI Appendix, A differentiable formulation of paralog matching). We call these "parameterization matrices". By applying to them the matching operator M [SI Appendix, Eq. S2], we obtain the permutation matrices {M(Xk)}k=1,…,K, encoding matchings within each species in the paired MSA. Using gradient methods, we optimize the parameterization matrices so that the corresponding permutation matrices yield a paired MSA with low MLM loss. More precisely, paired MSAs are represented as concatenated MSAs with interacting partners placed on the same row, and our MLM loss for this optimization is computed as follows:
1.
Perform a shuffle of M(A) relative to M(B) using the permutation matrix M(Xk) in each species k (plus an optional noise term, see below), to obtain a paired MSA M;
2.
Generate a mask for M (excluding any positive example tokens from the masking);
3.
Compute MSA Transformer’s MLM loss for that mask, see SI Appendix, Eq. S1.
Importantly, we only mask tokens from one of the two MSAs, chosen uniformly at random within that MSA with a high masking probability p ≥ 0.7. Our rationale for using large masking probabilities is that this forces the model to predict masked residues in one of the two MSAs by using information coming mostly from the other MSA (SI Appendix, Fig. S2). We stress that, if padding sequences consisting entirely of gaps are present (see "Dealing with asymmetric cases" above), we mask these symbols with the same probability as those coming from ordinary sequences. Of the two MSAs to pair, we mask the one with shorter length if no padding sequences exist (i.e., here, for our prokaryotic benchmark datasets); if lengths are comparable but one MSA contains considerably more padding sequences than the other, we preferentially mask that MSA; otherwise, we randomly choose which of the two MSAs to mask.
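A sketch of this masking step (illustrative names; side_cols is the list of columns of the MSA chosen for masking, and positive_rows indexes the positive examples):
```python
import torch

def mask_one_side(tokens, side_cols, positive_rows, mask_idx, p=0.7):
    """tokens: (depth, length) token matrix of the concatenated paired MSA."""
    mask = torch.zeros_like(tokens, dtype=torch.bool)
    mask[:, side_cols] = torch.rand((tokens.shape[0], len(side_cols))) < p
    mask[positive_rows, :] = False  # positive examples stay as clean context
    return tokens.masked_fill(mask, mask_idx), mask
```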
We fine-tuned all the hyperparameters involved in our algorithm using two joint MSAs of depth 50, constructed by selecting all the sequences of randomly sampled species from the HK-RR dataset (SI Appendix, Datasets).

Noise and regularization.

Following ref. 80, after updating (or initializing) each Xk, we add to it a noise term given by a matrix of standard i.i.d. Gumbel noise multiplied by a scale factor. The addition of noise ensures that the Xk do not get stuck at degenerate values for the right-hand side of SI Appendix, Eq. S2, and more generally may encourage the algorithm to explore larger regions in the space of permutations. As scale factor for this noise, we choose 0.1 times the sample SD of the current entries of Xk, times a global factor tied to the optimizer scheduler (see "Optimization" below). Finally, since the matching operator is scale-invariant, we can regularize the matrices Xk to have a small Frobenius norm. We find this to be beneficial and implement it through weight decay, set to w = 0.1.
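In code, the noise injection might read as follows (a sketch; global_factor stands for the scheduler-tied factor described under "Optimization" below):
```python
import torch

def add_gumbel_noise(X, global_factor):
    """Add scaled i.i.d. standard Gumbel noise to a parameterization matrix."""
    gumbel = -torch.log(-torch.log(torch.rand_like(X)))  # standard Gumbel samples
    return X + 0.1 * X.std() * global_factor * gumbel
```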

Optimization.

We backpropagate the MLM loss on the parameterization matrices Xk. After each gradient step, we mean-center the parameterization matrices. We use the AdaDelta optimizer (81) with an initial learning rate γ=9 and a “reduce on loss plateau” learning rate scheduler which decreases the learning rate by a factor of 0.8 if the loss has not decreased for more than 20 gradient steps after the learning rate was last set. The learning rate scheduler also provides the global scale factor which, together with the SD of the entries of Xk, dynamically determines the magnitude of the Gumbel noise.
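A PyTorch sketch of this loop, reusing the relaxation sketched earlier; species_sizes, n_steps, and mlm_pairing_loss are placeholders rather than our released code:
```python
import torch

# One parameterization matrix per species (zero-initialized, see below).
Xs = [torch.zeros(n, n, requires_grad=True) for n in species_sizes]
optimizer = torch.optim.Adadelta(Xs, lr=9.0, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.8, patience=20
)

for step in range(n_steps):
    loss = mlm_pairing_loss(Xs)  # masked-MLM loss of the induced paired MSA
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step(loss)
    with torch.no_grad():
        for X in Xs:
            X -= X.mean()  # mean-center after each gradient step
```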

Exploring the loss landscape through multiple initializations.

We observe that the initial choice of the parameterization set {Xk}k=1,…,K strongly impacts results. Slightly different initial conditions for Xk lead to very different final permutation matrices. Furthermore, we observe a fast decrease in the loss when the Xk are initialized to be exactly zero (our use of Gumbel noise means that we break ties randomly when computing the permutation matrices M(Xk); if noise is not used, similar results can be achieved by initializing Xk with entries very close to zero). Thus, we can cheaply probe different paths in the loss landscape by performing several short runs using zero-initialized parameterization matrices Xk. In practice, we use 20 different such short runs each consisting of 20 gradient steps. Then, we average all the final parameterizations together to warm-start a longer run made up of 400 gradient steps.
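Schematically (run is a placeholder standing in for one optimization run as sketched above, returning the final parameterization):
```python
import torch

def warm_started_run(run, n_k, n_short=20, short_steps=20, long_steps=400):
    """20 cheap zero-initialized runs, averaged to warm-start one long run."""
    finals = [run(torch.zeros(n_k, n_k), short_steps) for _ in range(n_short)]
    X0 = torch.stack(finals).mean(dim=0)  # average the final parameterizations
    return run(X0, long_steps)
```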

Result and confidence.

We observe that, even though the loss generally converges to a minimum average value during our optimization runs, there are often several distinct hard permutations associated with the smallest loss values. This may indicate a flattening of the loss landscape relative to the inherent fluctuations in the MLM loss, and/or the existence of multiple local minima. To extract a single matching per species from one of our runs (or indeed from several runs, see Improving precision: MRA and IPA), we average the hard permutation matrices associated with the q lowest losses and evaluate the matching operator [SI Appendix, Eq. S2] on the resulting averages. We find final precision-100 figures to be quite robust to the choice of q. On the other hand, for individual (warm-started) runs as described in Exploring the loss landscape through multiple initializations, precision-10 benefits from setting q to its maximum possible value of 400.
Furthermore, we propose using each entry in the averaged permutation matrices as an indicator of the model’s confidence in the matching of the corresponding pair of sequences. Indeed, pairs that are present in most low-loss configurations are presumably essential for the optimization process and are captured in most runs, pushing their confidence value close to 1. Conversely, noninteracting pairs are in most cases associated with higher losses and therefore appear sporadically, obtaining confidences close to zero. Accordingly, we refer to the averaged hard permutations used to extract a single matching per species as “confidence matrices,” and to the final in-species matchings as “consensus permutations.”
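The extraction of confidences and consensus permutations might be sketched as follows (illustrative; perms is the list of hard permutation matrices visited during optimization and losses their associated MLM losses):
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def consensus(perms, losses, q=400):
    """Average the q lowest-loss hard permutations into a confidence matrix,
    then apply the matching operator to get the consensus permutation."""
    order = np.argsort(losses)[:q]
    confidence = np.mean([perms[i] for i in order], axis=0)  # entries in [0, 1]
    rows, cols = linear_sum_assignment(confidence, maximize=True)
    perm = np.zeros_like(confidence)
    perm[rows, cols] = 1.0
    return confidence, perm
```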

Improving precision: MRA and IPA.

We propose two methods for improving precision further. In the first method, which we call Multi-Run Aggregation (MRA), we perform Nruns independent optimization runs for each interacting MSA. Then, we collect together the hard permutations independently obtained from each run, and aggregate the q=400 lowest-loss permutations from this larger collection to obtain more reliable confidence matrices and hard permutations.
The second method is an iterative procedure analogous to the Iterative Pairing Algorithm (IPA) of refs. 14 and 47, and named after it. The idea is to gradually add pairs with high confidence as positive examples. Assuming a paired MSA containing a single species for notational simplicity, the n-th iteration (starting at n=1) involves the following steps:
1.
Perform an optimization run and extract from it a confidence matrix C(n) as described in Result and confidence, using the currently available positive examples;
2.
Compute the moving average C~(n) = mean(C(n), C~(n−1), …, C~(1)) (where C~(1) ≡ C(1));
3.
Define candidate matchings via the consensus permutation M(n)=M(C~(n));
4.
Repeat Steps 1 to 3 a maximum of three times, until the average MLM loss estimated using M(n), and 200 random masks, is lower or statistically insignificantly higher than what could have been obtained using M(n−1) and the same positive examples as in Step 1;
5.
If Step 4 fails, set C~(n) = C~(n−1) and M(n) = M(C~(n−1)) (but removing rows and columns corresponding to the positive examples added at iteration n − 1);
6.
Check that the average MLM loss estimated using M(n) and 200 random masks, but only regarding as positive examples those available at the beginning of iteration n − 1, is not statistically significantly higher than the average MLM loss estimated using M(n−1) and those same positive examples;
7.
If Step 6 fails, terminate the IPA. Otherwise, pick the top 5 pairs according to C~(n), promote them to positive examples in all subsequent iterations, and remove them from the portion of the paired MSA to be optimized.
If several species are present, they are optimized together (Construction of an appropriate MLM loss) and confidence values from all species are used to select the top five pairs.
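At a high level, one IPA pass might be organized as below; every helper name is a placeholder for the corresponding step above, and the fallback logic of Steps 4 and 5 is compressed into a single test.
```python
def ipa(paired_msa, positives, max_iter=10, top_k=5):
    tilde_history = []
    for n in range(1, max_iter + 1):
        conf = optimize_and_get_confidence(paired_msa, positives)         # Step 1
        c_tilde = (conf + sum(tilde_history)) / (1 + len(tilde_history))  # Step 2
        tilde_history.append(c_tilde)
        matching_n = matching_operator(c_tilde)                           # Step 3
        if mlm_loss_significantly_increased(matching_n, positives):       # Steps 4-6
            break                                                         # Step 7: stop
        new_pairs = top_confident_pairs(c_tilde, k=top_k)                 # Step 7
        positives += new_pairs                                            # promote pairs
        paired_msa = remove_rows(paired_msa, new_pairs)
    return positives
```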

Pairing Based on a Single-Sequence Language Model.

To assess whether a single-sequence model is able to solve the paralog matching problem, we consider the 650M-parameter version of the model ESM-2 (5). We score candidate paired sequences using the MLM loss in SI Appendix, Eq. S1. In contrast with MSA Transformer, the input of the model is not paired MSAs but single paired sequences. Therefore, it is sufficient to individually score each possible pair within each species, without needing to consider all permutations. Denoting by Nk the number of sequences from each family in species k, the number of possible pairs is Nk², while the number of permutations is Nk!. This complexity reduction allows us to evaluate the scores of all possible pairs, and removes the need to backpropagate the loss through the permutation matrix. Accordingly, this method is much faster, since we only need to use the model in evaluation mode, without gradient backpropagation.
For each candidate paired sequence, we evaluate the average of the MLM losses computed over multiple random masks (with masking probability p). Once the average MLM losses are computed for all the Nk² pairs, we compute the optimal one-to-one matching by using standard algorithms for linear assignment problems (82) on the Nk×Nk matrix containing all the losses.
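A sketch of this baseline (pair_mlm_loss is a placeholder returning the ESM-2 MLM loss of one concatenated pair under one random mask):
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_one_species(seqs_a, seqs_b, pair_mlm_loss, n_masks=10, p=0.5):
    """Score all N_k^2 candidate pairs, then solve the assignment exactly."""
    n_k = len(seqs_a)
    cost = np.zeros((n_k, n_k))
    for i, a in enumerate(seqs_a):
        for j, b in enumerate(seqs_b):
            # Average MLM loss of the concatenated pair over random masks.
            cost[i, j] = np.mean([pair_mlm_loss(a + b, p) for _ in range(n_masks)])
    rows, cols = linear_sum_assignment(cost)  # minimize the total loss
    return list(zip(rows.tolist(), cols.tolist()))
```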

Assessing the Impact of Pairing on AFM Structure Prediction.

Pairing methods employed in AFM and ColabFold.

When presented with a set of query chains, AFM retrieves homologs of each of the chains by running JackHMMER (83) on UniProt, and further homology searches on other databases (7). UniProt hits are partitioned into species and ranked within each species by decreasing Hamming distance to the relevant query sequence. A paired MSA is obtained by matching equal-rank hits. Sequences left unpaired are discarded. In addition, AFM produces "block MSAs" constructed by "pairing" hits from the remaining databases with padding sequences of gaps. The input for AFM comprises the paired MSA and the block MSAs.
While sharing the same architecture and weights as AFM, ColabFold retrieves homologs using MMseqs2 (84) on ColabFoldDB (8). In each species, hits are sorted by increasing E-value, and the best hits are paired (8, 9, 2225). Thus, contrary to the default AFM pipeline, the paired MSA in ColabFold contains at most one sequence pair per species for a heterodimer. Because the databases and homology search methods used by ColabFold differ from those used by AFM, a direct comparison does not allow one to isolate the effect of their different pairing schemes. Therefore, we employed the ColabFold pairing method starting from the sequences that are paired in the default AFM pipeline.
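For reference, both baseline schemes reduce to a few lines (a sketch; we rank by Hamming distance for both variants, whereas ColabFold itself ranks by E-value, and the ranking direction is immaterial for equal-rank pairing as long as both chains use the same one):
```python
def hamming(s, t):
    """Fraction of mismatched positions between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t)) / len(s)

def pair_within_species(hits_a, hits_b, query_a, query_b, best_only=False):
    """Equal-rank pairing (AFM default) or best-hit-only pairing (ColabFold)."""
    ranked_a = sorted(hits_a, key=lambda s: hamming(s, query_a))
    ranked_b = sorted(hits_b, key=lambda s: hamming(s, query_b))
    n = 1 if best_only else min(len(ranked_a), len(ranked_b))
    return list(zip(ranked_a[:n], ranked_b[:n]))  # leftover hits are discarded
```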

Pairing using DiffPALM.

To assess the impact of DiffPALM on complex structure prediction by AFM, we started from the sequences that are paired in the default AFM pipeline. We left out species with large imbalances between the numbers of sequences in the two families considered. Specifically, if the ratio of the larger to the smaller of these two numbers exceeds an ad hoc "maximum size ratio" MSR (SI Appendix, Table S1), if there is only one sequence in both families, or if there are more than 50 sequences in at least one family, then we do not attempt pairing via DiffPALM, and revert to default AFM pairing. When the full MSA to be paired with DiffPALM is too deep and/or long, optimizing it as a whole is not possible due to GPU memory limitations. Instead, we partition it into several small enough sub-MSAs, which we optimize independently. Note that this may reduce performance, since pairing quality increases with MSA depth. We always use the query sequences as positive examples. For all these optimization runs, we use a masking probability p = 0.7.

Data, Materials, and Software Availability

A Python implementation of DiffPALM is freely available in the Zenodo archive (85).

Acknowledgments

We thank Sergey Ovchinnikov for valuable feedback on a preliminary version of this manuscript and Alex Hawkins-Hooker for interesting conversations. This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851173, to A.-F. B.).

Author contributions

U.L., D.S., and A.-F.B. designed research; U.L. and D.S. performed research; U.L. and D.S. contributed new reagents/analytic tools; U.L., D.S., and A.-F.B. analyzed data; and U.L., D.S., and A.-F.B. wrote the paper.

Competing interests

The authors declare no competing interest.

Supporting Information

Appendix 01 (PDF)

References

1. S. V. Rajagopala et al., The binary protein-protein interaction landscape of Escherichia coli. Nat. Biotechnol. 32, 285–290 (2014).
2. J. Jumper et al., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
3. M. Baek et al., Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
4. R. Chowdhury et al., Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
5. Z. Lin et al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
6. I. Humphreys et al., Computed structures of core eukaryotic protein complexes. Science 374, 1340–1352 (2021).
7. R. Evans et al., Protein complex prediction with AlphaFold-Multimer. bioRxiv [Preprint] (2021). https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1 (Accessed 11 October 2023).
8. M. Mirdita et al., ColabFold: Making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
9. P. Bryant, G. Pozzati, A. Elofsson, Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
10. H. Schweke et al., An atlas of protein homo-oligomerization across domains of life. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.06.09.544317v1 (Accessed 11 October 2023).
11. A. Elofsson, Progress at protein structure prediction, as seen in CASP15. Curr. Opin. Struct. Biol. 80, 102594 (2023).
12. L. T. Alexander et al., Protein target highlights in CASP15: Analysis of models by structure providers. Proteins 91, 1–29 (2023).
13. M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, T. Hwa, Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. U.S.A. 106, 67–72 (2009).
14. A.-F. Bitbol, R. S. Dwyer, L. J. Colwell, N. S. Wingreen, Inferring interaction partners from protein sequences. Proc. Natl. Acad. Sci. U.S.A. 113, 12180–12185 (2016).
15. T. Gueudre, C. Baldassi, M. Zamparo, M. Weigt, A. Pagnani, Simultaneous identification of specifically interacting paralogs and interprotein contacts by direct coupling analysis. Proc. Natl. Acad. Sci. U.S.A. 113, 12186–12191 (2016).
16. H. Szurmant, M. Weigt, Inter-residue, inter-protein and inter-family coevolution: Bridging the scales. Curr. Opin. Struct. Biol. 50, 26–32 (2018).
17. Y. Si, C. Yan, Protein complex structure prediction powered by multiple sequence alignments of interologs from multiple taxonomic ranks and AlphaFold2. Brief. Bioinform. 23, bbac208 (2022).
18. B. Chen et al., Improved the heterodimer protein complex prediction with protein language models. Brief. Bioinform. 24, bbad221 (2023).
19. S. Orchard et al., The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 42, D358–D363 (2013).
20. J. M. Peters et al., A comprehensive, CRISPR-based functional analysis of essential genes in bacteria. Cell 165, 1493–1506 (2016).
21. S. Ovchinnikov, H. Kamisetty, D. Baker, Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).
22. Q. Cong, I. Anishchenko, S. Ovchinnikov, D. Baker, Protein interaction networks revealed by proteome coevolution. Science 365, 185–189 (2019).
23. A. G. Green et al., Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat. Commun. 12, 1396 (2021).
24. H. Zeng et al., ComplexContact: A web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437 (2018).
25. G. Pozzati et al., Limits and potential of combined folding and docking. Bioinformatics 38, 954–961 (2021).
26. C.-S. Goh, F. E. Cohen, Co-evolutionary analysis reveals insights into protein-protein interactions. J. Mol. Biol. 324, 177–192 (2002).
27. A. K. Ramani, E. M. Marcotte, Exploiting the co-evolution of interacting proteins to discover interaction specificity. J. Mol. Biol. 327, 273–284 (2003).
28. J. Gertz et al., Inferring protein interactions from phylogenetic distance matrices. Bioinformatics 19, 2039–2045 (2003).
29. J. M. Izarzugaza et al., TSEMA: Interactive prediction of protein pairings between interacting families. Nucleic Acids Res. 34, W315–W319 (2006).
30. E. R. Tillier, L. Biro, G. Li, D. Tillo, Codep: Maximizing co-evolutionary interdependencies to discover interacting proteins. Proteins 63, 822–831 (2006).
31. J. M. Izarzugaza, D. Juan, C. Pons, F. Pazos, A. Valencia, Enhancing the prediction of protein pairings between interacting families using orthology information. BMC Bioinform. 9, 35 (2008).
32. E. R. Tillier, R. L. Charlebois, The human protein coevolution network. Genome Res. 19, 1861–1871 (2009).
33. S. Bradde et al., Aligning graphs and finding substructures by a cavity approach. EPL 89, 37009 (2010).
34. I. Hajirasouliha, A. Schönhuth, D. de Juan, A. Valencia, S. C. Sahinalp, Mirroring co-evolving trees in the light of their topologies. Bioinformatics 28, 1202–1208 (2012).
35. M. El-Kebir et al., Mapping proteins in the presence of paralogs using units of coevolution. BMC Bioinform. 14, S18 (2013).
36. F. Pazos, A. Valencia, Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng. Des. Sel. 14, 609–614 (2001).
37. D. Juan, F. Pazos, A. Valencia, High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc. Natl. Acad. Sci. U.S.A. 105, 934–939 (2008).
38. R. Jothi, M. G. Kann, T. M. Przytycka, Predicting protein-protein interaction by searching evolutionary tree automorphism space. Bioinformatics 21, i241–i250 (2005).
39. D. Ochoa, F. Pazos, Studying the co-evolution of protein families with the Mirrortree web server. Bioinformatics 26, 1370–1371 (2010).
40. D. Ochoa, D. Juan, A. Valencia, F. Pazos, Detection of significant protein coevolution. Bioinformatics 31, 2166–2173 (2015).
41. G. Casari, C. Sander, A. Valencia, A method to predict functional residues in proteins. Nat. Struct. Mol. Biol. 2, 171–178 (1995).
42. A. S. Lapedes, B. G. Giraud, L. Liu, G. D. Stormo, Correlated mutations in models of protein sequences: Phylogenetic and structural effects. Stat. Mol. Biol. Genet., IMS Lect. Notes Monogr. Ser. 33, 236–256 (1999).
43. F. Morcos et al., Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. U.S.A. 108, E1293–E1301 (2011).
44. D. S. Marks et al., Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, 1–20 (2011).
45. R. R. Cheng, F. Morcos, H. Levine, J. N. Onuchic, Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc. Natl. Acad. Sci. U.S.A. 111, E563–E571 (2014).
46. L. Burger, E. van Nimwegen, Accurate prediction of protein-protein interactions from sequence alignments using a Bayesian method. Mol. Syst. Biol. 4, 165 (2008).
47. A.-F. Bitbol, Inferring interaction partners from protein sequences using mutual information. PLoS Comput. Biol. 14, e1006401 (2018).
48. G. Marmier, M. Weigt, A.-F. Bitbol, Phylogenetic correlations can suffice to infer protein partners from sequences. PLoS Comput. Biol. 15, e1007179 (2019).
49. A. Gerardos, N. Dietler, A.-F. Bitbol, Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput. Biol. 18, e1010147 (2022).
50. C. A. Gandarilla-Perez, S. Pinilla, A.-F. Bitbol, M. Weigt, Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins. PLoS Comput. Biol. 19, e1011010 (2023).
51. R. M. Rao et al., "MSA Transformer" in Proceedings of the 38th International Conference on Machine Learning, M. Meila, T. Zhang, Eds. (PMLR, 2021), vol. 139, pp. 8844–8856. https://proceedings.mlr.press/v139/rao21a.html.
52. U. Lupo, D. Sgarbossa, A.-F. Bitbol, Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat. Commun. 13 (2022).
53. D. Sgarbossa, U. Lupo, A.-F. Bitbol, Generative power of a protein language model trained on multiple sequence alignments. eLife 12, e79854 (2023).
54. Z. Xie, J. Xu, Deep graph learning of inter-protein contacts. Bioinformatics 38, 947–953 (2022).
55. M. T. Laub, M. Goulian, Specificity in two-component signal transduction pathways. Annu. Rev. Genet. 41, 121–145 (2007).
56. M. Barakat et al., P2CS: A two-component system resource for prokaryotic signal transduction research. BMC Genom. 10, 315 (2009).
57. M. Barakat, P. Ortet, D. E. Whitworth, P2CS: A database of prokaryotic two-component systems. Nucleic Acids Res. 39, D771–D776 (2011).
58. A. J. Enright, I. Iliopoulos, N. C. Kyrpides, C. A. Ouzounis, Protein interaction maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999).
59. D. Szklarczyk et al., The STRING database in 2023: Protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646 (2023).
60. T. Paysan-Lafosse et al., InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2022).
61. K. S. Makarova, Y. I. Wolf, S. L. Mekhedov, B. G. Mirkin, E. V. Koonin, Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell. Nucleic Acids Res. 33, 4626–4638 (2005).
62
S. Basu, B. Wallner, DockQ: A quality measure for protein-protein docking models. PLoS One 11, 1–9 (2016).
63
P. Bryant, F. Noé, Improved protein complex prediction with AlphaFold-multimer by denoising the MSA profile. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.07.04.547638v1 (Accessed 11 October 2023).
64
W. Zheng, Q. Wuyun, P. L. Freddolino, “Multi-MSA strategy for protein complex structure modeling” in CASP15 Abstract (2022). https://predictioncenter.org/casp15/doc/CASP15_Abstracts.pdf.
65
J. Liu et al., Enhancing AlphaFold-Multimer-based protein complex structure prediction with MULTICOM in CASP15. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.05.16.541055v1 (Accessed 11 October 2023).
66
B. Wallner, Improved multimer prediction using massive sampling with AlphaFold in CASP15. Proteins 91, 1734–1746 (2023).
67
U. Ghani et al., Improved docking of protein models by a combination of Alphafold2 and ClusPro. bioRxiv [Preprint] (2022). https://www.biorxiv.org/content/10.1101/2021.09.07.459290v1 (Accessed 11 October 2023).
68
K. Olechnovič, L. Valančauskas, J. Dapkunas, Č. Venclovas, Prediction of protein assemblies by structure sampling followed by interface-focused scoring. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.03.07.531468v1 (Accessed 11 October 2023).
69
T. Dao, D. Fu, S. Ermon, A. Rudra, C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness” in Advances in Neural Information Processing Systems, S. Koyejo et al., Eds. (2022), vol. 35, pp. 16 344–16 359. https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf.
70
A. Hawkins-Hooker, D. T. Jones, B. Paige, “Using domain-domain interactions to probe the limitations of MSA pairing strategies” in Machine Learning for Structural Biology Workshop, NeurIPS (2022). https://www.mlsb.io/papers_2022/Using_domain_domain_interactions_to_probe_the_limitations_of_MSA_pairing_strategies.pdf.
71
M. Hu et al., “Exploring evolution-aware & -free protein language models as protein function predictors” in Advances in Neural Information Processing Systems, S. Koyejo et al., Eds. (2022), vol. 35, pp. 38 873–38 884.
72
B. Meynard-Piganeau, C. Fabbri, M. Weigt, A. Pagnani, C. Feinauer, Generating interacting protein sequences using domain-to-domain translation. Bioinformatics 39, btad401 (2023).
73
P. Gainza et al., Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
74
J. Tubiana, D. Schneidman-Duhovny, H. J. Wolfson, ScanNet: An interpretable geometric deep learning model for structure-based protein binding site prediction. Nat. Methods 19, 730–739 (2022).
75
L. F. Krapp, L. A. Abriata, F. Cortés Rodriguez, M. Dal Peraro, PeSTo: Parameter-free geometric deep learning for accurate prediction of protein binding interfaces. Nat. Commun. 14, 2175 (2023).
76
M. N. Pun et al., Learning the shape of protein micro-environments with a holographic convolutional neural network. bioRxiv [Preprint] (2022). https://www.biorxiv.org/content/10.1101/2022.10.31.514614v1.full (Accessed 11 October 2023).
77
F. Wu, L. Wu, D. Radev, J. Xu, S. Z. Li, Integration of pre-trained protein language models into geometric deep learning networks. Commun. Biol. 6, 876 (2023).
78
Y. Si, C. Yan, Protein language model embedded geometric graphs power inter-protein contact prediction. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.01.07.523121v1 (Accessed 11 October 2023).
79
J. Su et al., SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv [Preprint] (2023). https://www.biorxiv.org/content/10.1101/2023.10.01.560349v1 (Accessed 11 October 2023).
80
G. E. Mena, D. Belanger, S. Linderman, J. Snoek, “Learning latent permutations with Gumbel-Sinkhorn networks” in 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings (2018), pp. 1–22. https://openreview.net/forum?id=Byt3oJ-0W.
81
M. D. Zeiler, ADADELTA: An adaptive learning rate method. arXiv [Preprint] (2012). https://arxiv.org/abs/1212.5701 (Accessed 11 October 2023).
82
H. W. Kuhn, The Hungarian method for the assignment problem. Naval Res. Logist. Quart. 2, 83–97 (1955).
83
L. S. Johnson, S. R. Eddy, E. Portugaly, Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinform. 11, 1–8 (2010).
84
M. Steinegger, J. Söding, MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
85
U. Lupo, D. Sgarbossa, A.-F. Bitbol, Bitbol-Lab/DiffPALM: DiffPALM Public Release v1.0. Zenodo. https://doi.org/10.5281/zenodo.10462561. Deposited 5 January 2023.

Information & Authors

Published in

Proceedings of the National Academy of Sciences
Vol. 121 | No. 27
July 2, 2024
PubMed: 38913900

Data, Materials, and Software Availability

A Python implementation of DiffPALM is freely available in the Zenodo archive (85).

Submission history

Received: August 12, 2023
Accepted: December 18, 2023
Published online: June 24, 2024
Published in issue: July 2, 2024

Keywords

protein–protein interactions, protein complex structure, protein language models, coevolution, machine learning

Acknowledgments

We thank Sergey Ovchinnikov for valuable feedback on a preliminary version of this manuscript and Alex Hawkins-Hooker for interesting conversations. This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 851173, to A.-F.B.).
Author contributions
U.L., D.S., and A.-F.B. designed research; U.L. and D.S. performed research; U.L. and D.S. contributed new reagents/analytic tools; U.L., D.S., and A.-F.B. analyzed data; and U.L., D.S., and A.-F.B. wrote the paper.
Competing interests
The authors declare no competing interest.

Notes

This article is a PNAS Direct Submission.
*We employ no special tokens to demarcate the boundary between one sequence and its partner.
†In contrast, uniformly random masking with p = 0.15 was used during MSA Transformer’s pre-training (51); an illustrative sketch of this masking scheme follows these notes.
‡Based on a two-sample t-test: we say “statistically insignificantly” when p > 0.95, and “statistically significantly” when p < 0.05; an illustrative sketch of such a test also follows these notes.
§While the official AFM implementation in https://github.com/deepmind/alphafold uses UniProt “entry names” to define species, when possible we instead use NCBI TaxIDs (via UniProt mappings to NCBI TaxIDs, retrieved on 17 Dec 2022, corresponding to UniProt Knowledgebase release 2022_04), which are more accurate.
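For footnote †, the following is a minimal NumPy sketch of uniformly random token masking at probability p = 0.15. It is illustrative only: the integer encoding, the mask_msa_uniform helper, and the MASK_IDX value are assumptions for this sketch, not the actual MSA Transformer vocabulary or training code.

```python
import numpy as np

MASK_IDX = 32  # hypothetical integer index of the <mask> token (assumption)

def mask_msa_uniform(tokens: np.ndarray, p: float = 0.15, seed: int = 0):
    """Mask each position of an integer-encoded MSA independently with probability p.

    tokens: array of shape (num_sequences, alignment_length).
    Returns (masked_tokens, mask), where mask marks the positions the
    masked language model would be asked to predict.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(tokens.shape) < p  # independent Bernoulli(p) draw per position
    masked = tokens.copy()
    masked[mask] = MASK_IDX
    return masked, mask

# Example: a toy 3-sequence, 8-column "MSA" of integer tokens.
toy_msa = np.arange(24).reshape(3, 8) % 20
masked_msa, mask = mask_msa_uniform(toy_msa, p=0.15)
```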
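Likewise, for footnote ‡, a minimal sketch of a two-sample t-test using scipy.stats.ttest_ind. The compare_scores helper and the synthetic scores are hypothetical, and the footnote does not specify the exact test variant (e.g., equal- vs. unequal-variance), so this illustrates only the stated significance convention.

```python
import numpy as np
from scipy import stats

def compare_scores(scores_a, scores_b, alpha: float = 0.05):
    """Two-sample t-test between two sets of pairing-performance scores.

    Following the footnote's convention, the difference is called
    statistically significant when the p-value is below alpha (0.05).
    """
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return t_stat, p_value, bool(p_value < alpha)

# Example with synthetic scores for two hypothetical methods.
rng = np.random.default_rng(0)
a = rng.normal(0.80, 0.05, size=30)
b = rng.normal(0.75, 0.05, size=30)
print(compare_scores(a, b))
```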

Authors

U. Lupo, D. Sgarbossa, and A.-F. Bitbol

Affiliations

Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland

Notes

1. U.L. and D.S. contributed equally to this work.
2. To whom correspondence may be addressed. Email: [email protected] or [email protected].
