Pairing interacting protein sequences using masked language modeling

Significance

Deep learning has brought major advances to the analysis of biological sequences. Self-supervised models, based on approaches from natural language processing and trained on large ensembles of protein sequences, efficiently learn statistical dependencies in these data. This includes coevolution patterns between structurally or functionally coupled amino acids, which allows them to capture structural contacts. We propose a method to pair interacting protein sequences which leverages the power of a protein language model trained on multiple sequence alignments. Our method performs well for small datasets that are challenging for existing methods. It can improve structure prediction of protein complexes by supervised methods, which remains more challenging than that of single-chain proteins.


Introduction
Interacting proteins play key roles in cells, ensuring the specificity of signaling pathways and forming multi-protein complexes that act e.g. as molecular motors or receptors. Predicting protein-protein interactions and the structure of protein complexes are important questions in computational biology and biophysics. Indeed, high-throughput experiments capable of resolving protein-protein interactions remain challenging [1], even for model organisms, and experimental determination of protein complex structure is demanding.
A major advance in protein structure prediction was achieved by AlphaFold [2] and other deep learning approaches [3][4][5]. Extensions to protein complexes have been proposed [6][7][8][9], including AlphaFold-Multimer (AFM) [7], but their performance is heterogeneous and less impressive than for monomers [10]. Importantly, the first step of AlphaFold is to build multiple sequence alignments (MSAs) of homologs of the query protein sequence. The results of the CASP15 structure prediction contest demonstrated that MSA quality is crucial to further improving AlphaFold performance [11,12]. For protein complexes involving several different chains (heteromers), paired MSAs, whose rows include actually interacting chains, can provide coevolution signal between interacting partners that is informative about inter-chain contacts [13][14][15][16]. However, constructing paired MSAs poses the challenge of properly pairing sequences. Accordingly, the quality of pairings strongly impacts the accuracy of heteromer structure prediction [9,17,18]. Pairing interaction partners is difficult because many protein families contain several paralogous proteins encoded within the same genome. This problem is known as paralog matching. In prokaryotes, genomic proximity can often be used to solve it, since most interaction partners are encoded in close genomic locations [19,20]. However, this is not the case in eukaryotes. Large-scale coevolution studies of protein complexes [21][22][23] and deep learning approaches [6][7][8][9][24] have paired sequences by using genomic proximity when possible [8,21,24], and/or by pairing together the closest, or equally ranked, hits to the query sequences, i.e. relying on approximate orthology [6][7][8][9][22][23][24][25].
We develop a new coevolution-based method for paralog matching which leverages recent neural protein language models taking MSAs as inputs [2,51]. These models are one of the ingredients of the success of AlphaFold [2]. We focus on MSA Transformer [51], a protein language model which was trained on MSAs using the masked language modeling (MLM) objective in a self-supervised way. We introduce DiffPALM, a differentiable method for predicting paralog matchings using MLM. We show that it outperforms existing coevolution methods by a large margin on difficult benchmarks of shallow MSAs extracted from ubiquitous prokaryotic protein datasets. DiffPALM performance further quickly improves when known interacting pairs are provided as examples. Next, we apply DiffPALM to the hard problem of paralog matching for eukaryotic protein complexes. Among the complexes we tested, DiffPALM substantially improves structure prediction by AFM in some cases, and does not yield any significant deterioration. It also achieves performance competitive with orthology-based pairing.
which require considerably deeper alignments to achieve good performance [14,15,47,50]. Furthermore, this regime is highly relevant for eukaryotic complexes, because their homologs have relatively small sequence diversity, as shown by the effective depth of their MSAs in Table S1. While prokaryotic proteins such as HKs and RRs feature large diversity, focusing on small datasets allows us to address the relevant regime of low diversity in this well-controlled benchmark case. We hypothesize that MSA Transformer's extensive pre-training can help to capture coevolution even in these difficult cases. To assess this, we first test two variants of our DiffPALM method (see "Methods") on 40 MSAs from the HK-RR dataset comprising about 50 HK-RR pairs each (see "Supplementary material, Datasets"). We first address the de novo pairing prediction task, starting from no known HK-RR pair, and then we study the impact of starting from known pairs. Fig. 1 shows that DiffPALM performs better than the chance expectation, obtained for random within-species matching. Moreover, it outperforms other coevolution-based methods, namely DCA-IPA [14] and MI-IPA [47], which rely respectively on Potts models and on mutual information, and GA-IPA [50], which combines these coevolution measures with sequence similarity, a proxy for phylogeny. Importantly, these results are obtained without giving any paired sequences as input to the algorithm. The performance of DiffPALM is particularly good for pairs with high confidence scores (see "Result and confidence"), as shown by the "precision-10" curve, which focuses on the top 10% of predicted pairs, ranked by predicted confidence (see Fig. 1). We also propose a method based on a protein language model trained on single sequences, ESM-2 (650M) [5], see "Pairing based on a single-sequence language model". DiffPALM also outperforms this method, even though the latter is faster (no need for backpropagation) and is formulated as a linear matching problem, which is solved exactly. This confirms that the coevolution information contained in the MSA plays a key role in the performance of DiffPALM, which is based on MSA Transformer. A key strength of MSA Transformer, and thus of DiffPALM, is that they leverage the power of large language models while starting from MSAs, and thus allow direct access to the coevolution signal. Fig. 1 shows that both variants of DiffPALM, namely MRA and IPA (see "Methods"), outperform all baselines, and that the precision of MRA increases with the number of runs used (see Table S2 for details). Note that the distribution of precision-10 scores over the different MSAs we considered is skewed, especially after many MRA runs, see Fig. S3. For many MSAs, almost perfect scores are reached, while performance is poor for a few others. MSAs with a smaller average number of sequences per species tend to yield larger precision, as the pairing task is then easier.

Figure 1: …RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA [14], MI-IPA [47], and GA-IPA [50]. We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model [5], see "Pairing based on a single-sequence language model". With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is "precision-100", where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report "precision-10", which is calculated over the top 10% of predicted pairs, ranked by predicted confidence within each MSA (see "Methods"). For DiffPALM, we plot the mean performance on all MSAs (color shading), and the standard error range (shaded region). For our ESM-2 based method, we consider 10 different values of masking probability p from 0.1 to 1.0, and we report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.
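To make the two precision scores concrete, here is a minimal sketch with toy data (the helper `precision_scores` is illustrative, not part of the published pipeline):

```python
import numpy as np

def precision_scores(correct, confidence, top_frac=0.1):
    """Fraction of correct pairs among all predicted pairs (precision-100),
    and among the top `top_frac` fraction ranked by confidence (precision-10)."""
    correct = np.asarray(correct, dtype=bool)
    confidence = np.asarray(confidence, dtype=float)
    p100 = correct.mean()
    n_top = max(1, int(round(top_frac * len(correct))))
    top_idx = np.argsort(-confidence)[:n_top]  # highest-confidence predictions first
    return p100, correct[top_idx].mean()

# Toy example: 10 predicted pairs, 6 correct; the single most confident one is correct.
correct = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
confidence = [0.9, 0.5, 0.4, 0.3, 0.2, 0.1, 0.8, 0.05, 0.02, 0.01]
p100, p10 = precision_scores(correct, confidence)
```

Here precision-100 is 0.6 while precision-10 (one pair out of ten) is 1.0, illustrating how restricting to high-confidence pairs can raise precision.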
So far, we addressed de novo pairing prediction, where no known HK-RR pair is given as input. Can DiffPALM precision increase by exploiting "positive examples" of known interacting partners? This is an important question, since experiments on model species may for instance give some positive examples (see "The paralog matching problem"). To address it, we included different numbers of positive examples, by using the corresponding non-masked interacting pairs as context (see "Construction of an appropriate MLM loss"). The left panel of Fig. 2 shows that the performance of DiffPALM significantly increases with the number of positive examples used, reaching almost perfect performance for the highest-confidence pairs (precision-10).
While we focused on HK-RR pairing so far, DiffPALM is a general method. To assess how it extends to other cases, we consider another pair of ubiquitous prokaryotic proteins, namely homologs of the E. coli proteins MALG-MALK, which are involved in ABC transporter complexes. These proteins form permanent complexes, while HK-RR interact transiently to transmit signal. The right panel of Fig. 2 shows results obtained on 40 MSAs comprising about 50 MALG-MALK pairs, without positive examples. We observe that DiffPALM outperforms the chance expectation by a large margin. It also significantly outperforms existing coevolution methods [14,47,50], as well as our method based on ESM-2 (650M), see Table S2. Note that all approaches yield better performance for MALG-MALK than for HK-RR, as the number of MALG-MALK pairs per species is smaller than that of HK-RR pairs. Finally, while Fig. 2 reports the final MRA performance, Fig. S4 shows that both performance scores increase with the number of MRA runs in all cases.

Figure 2: We report the performance of DiffPALM with 5 MRA runs (measured as precision-100 and precision-10, see Fig. 1), for various numbers of positive examples, on the same HK-RR MSAs as in Fig. 1 (left panel). We also report the performance of DiffPALM for similarly-sized MALG-MALK MSAs (right panel). In both cases, we show the mean value over the 40 different MSAs with its standard error interval, and we plot the chance expectation for reference.
Using DiffPALM for eukaryotic complex structure prediction by AFM

An important and more challenging application of DiffPALM is predicting interacting partners among the paralogs of two families in eukaryotic species. Indeed, eukaryotes often have many paralogs per species [57], but eukaryotic-specific protein families generally have fewer total homologs and smaller diversity than prokaryotic ones. Moreover, most interacting proteins are not encoded in close proximity in eukaryotic genomes. Paired MSAs are a key ingredient of protein complex structure prediction by AFM [7,9]. When presented with query sequences, the default AFM pipeline [7] retrieves homologs of each of the chains. Within each species, homologs of different chains are ranked according to Hamming distance to the corresponding query sequence. Then, equal-rank sequences are paired. Can DiffPALM improve complex structure prediction by AFM? To address this question, we consider 15 complexes, listed in Table S1, whose structures are not included in the training set of the AFM release we used (see "Supplementary material, General points on AFM"), and for which the default AFM complex prediction was reported to perform poorly [7,18] (see "Supplementary material, Eukaryotic complexes").
Fig. 3 shows that DiffPALM can improve complex structure prediction by AFM (see Fig. S5 for details). This suggests that it is able to produce better paired MSAs than those from the default AFM pipeline. In particular, substantial improvements are obtained for the complexes with PDB identifiers 6L5K and 6FYH; see Figs. S6 and S7 for structural visualizations. Among the complexes we considered, 6L5K and 6FYH have large effective (pairable) MSA depths, see Table S1. Conversely, complexes with very small raw or effective MSA depths do not significantly benefit from DiffPALM. Thus, DiffPALM is sensitive to MSA depth and diversity, albeit with less stringent requirements than other coevolution methods. In most cases, the quality of structures predicted using DiffPALM pairing is comparable to that obtained using the pairing method adopted e.g. by ColabFold [8], where only the orthologs of the two query sequences, identified as their best hits, are paired in each species (resulting in at most one pair per species) [8,9,22-25]; see Fig. 3. Note however that, for 6PNQ, the ortholog-only pairing method is outperformed both by DiffPALM and by the AFM default. Indeed, the raw and effective MSA depths are smaller for this structure than e.g. for 6L5K and 6FYH (see Table S1). Thus, further reducing diversity by restricting to paired orthologs may negatively impact structure prediction in this case. Given the good results obtained overall with orthology-based pairings, we tried using them as positive examples for DiffPALM. Given the very good precision obtained by DiffPALM for high-confidence HK-RR pairs (see above), we also tried restricting to high-confidence pairs. For most structures, we obtained no significant improvement over standard DiffPALM using these variants (see Fig. S8). However, for 6WCW, we could generate several higher-quality structures, particularly when using orthologs as positive examples.
Figure 3: Performance of AFM using different pairing methods. We use AFM to predict the structure of protein complexes starting from differently paired MSAs, each of them constructed from the same initial unpaired MSAs. Three pairing methods are considered: the default one of AFM, only pairing orthologs to the two query sequences, and a single run of DiffPALM (equivalent to one MRA run). Performance is evaluated using DockQ scores (top panels), a widely used measure of quality for protein-protein docking [58], and the AFM confidence scores (bottom panels), see "Supplementary material, General points on AFM". The latter are also used as transparency levels in the top panels, where more transparent markers denote predicted structures with low AFM confidence. For each query complex, AFM is run five times. Each run yields 25 predictions, which are ranked by AFM confidence score. The five top predicted structures are selected from each run, giving 25 predicted structures in total for each complex. Out of the 15 complexes listed in Table S1, we restrict to those where any two of these three pairing methods yield a significant difference (> 0.1) in average DockQ scores for at least one set of predictions coming from different runs but with the same within-run rank according to AFM confidence. Panels are ordered by increasing mean DockQ score for the AFM default method.
Although DiffPALM achieves performance on these structure prediction tasks similar to that of orthology-based pairing, it predicts some pairs that are quite different from orthology-based pairs. Indeed, Fig. S9 shows that the fraction of pairs identically matched by DiffPALM and by orthology is often smaller than 0.5. Fig. S10 further shows that, for the sequences that are paired differently by DiffPALM and by orthology, the Hamming distances between the two predicted partners are often above 0.5. Nevertheless, most of the pairs that are predicted both by DiffPALM and by using orthology have high DiffPALM confidence (see Fig. S9), confirming the importance of these pairs.

Discussion
We developed DiffPALM, a method for pairing interacting protein sequences that builds on MSA Transformer [51], a protein language model trained on MSAs. MSA Transformer efficiently captures coevolution between amino acids, thanks to its training to fill in masked amino acids using the surrounding MSA context [51][52][53]. We showed that it also captures inter-chain coevolutionary signal, despite being trained on single-chain MSAs. We leveraged this ability in DiffPALM by using a masked language modeling loss as a coevolution score and looking for the pairing that minimizes it. We formulated the pairing problem in a differentiable way, allowing us to use gradient methods. On shallow MSAs extracted from controlled prokaryotic benchmark datasets, DiffPALM outperforms existing coevolution-based methods as well as a method based on a state-of-the-art language model trained on single sequences. Its performance quickly increases when adding examples of known interacting sequences. Paired MSAs of interacting partners are a key ingredient of complex structure prediction by AFM. We found that using DiffPALM can improve the performance of AFM, and achieves competitive performance with orthology-based pairing.
Recent work [18] also used MSA Transformer for paralog matching, in a method called ESMPair. It relies on column attention matrices and compares them across the MSAs of interacting partners. This makes it quite different from DiffPALM, which relies on coevolutionary information via the MLM loss. ESMPair may be more closely related to phylogeny-based [17] or orthology-based pairing methods, since column attention encodes phylogenetic relationships [52]. Thirteen of the 15 eukaryotic protein complexes we considered were also studied in [18], but no substantial improvement (and often a degradation of performance) was reported for those when using ESMPair instead of the default AFM pairing, except for 7BQU. By contrast, DiffPALM yields strong improvements for 6L5K and 6FYH, and no significant performance degradation. Explicitly combining coevolution and phylogeny using MSA Transformer is a promising direction to further improve partner pairing. Indeed, such an approach has already improved traditional coevolution methods [50]. Other ways of improving MSA usage by AFM have also been proposed [59] and could be combined with advances in pairing. Besides improving MSA construction [60] and the extraction of MSA information, other promising approaches include exploiting structural alignments [61], using massive sampling and dropout [62], and combining AFM with more traditional docking methods [63,64], which has e.g. allowed improving the structure prediction of 6A6I [63].
DiffPALM illustrates the power of neural protein language models trained on MSAs, and their ability to capture the rich structure of biological sequence data. The fact that these models encode inter-chain coevolution, while they are trained on single-chain data, shows their ability to generalize. We used MSA Transformer in a zero-shot setting, without fine-tuning it to the task of interaction partner prediction. Such fine-tuning could yield further performance gains [65].
The fact that DiffPALM outperforms existing coevolution methods [14,47,50] for shallow MSAs is reminiscent of the impressive performance of MSA Transformer at predicting structural contacts from shallow MSAs [51]. While traditional coevolution methods either compute local coevolution scores for two columns of an MSA [47] or build a global model for an MSA [14,15], MSA Transformer was trained on large ensembles of MSAs and shares parameters across them. This presumably allows it to transfer knowledge between MSAs, and to bypass the usual need for deep MSAs of traditional coevolution methods [14,15,46,47,50], or of MSA-specific transformer models [66]. This constitutes major progress for the use of the coevolution signal.
After the transformative progress brought by deep learning to protein structure prediction [2][3][4][5], predicting protein complex structure and ligand binding sites is fast advancing with AFM and related methods, but also with other deep learning models based on structural representations [67][68][69][70]. Combining the latter with the power of sequence-based language models may bring even further progress.

Methods
The paralog matching problem

Goal and notations. Paralog matching amounts to pairing a pair of MSAs, each one corresponding to one of the two protein families considered. We assume that interactions are one-to-one. Let M^(A) and M^(B) be the (single-chain) MSAs of two interacting protein families A and B, and let K denote the number of species represented in both MSAs and comprising more than one unmatched sequence in at least one MSA. Species represented in only one MSA are discarded since no within-species matching is possible for them. Species with only one unmatched sequence in each MSA are not considered further since pairing is trivial. There may also be N_pos known interacting pairs: they are treated separately, as positive examples (see below). Here we focus on the unmatched sequences. For k = 1, …, K, let N_k^(A) and N_k^(B) denote the numbers of unmatched species-k sequences in M^(A) and M^(B). If, say, N_k^(A) < N_k^(B), each species-k sequence of M^(A) is matched to one of the N_k^(B) available sequences in M^(B). The remaining sequences of the species in M^(B) are left unpaired. In practice, we achieve this by augmenting the original set of species-k sequences in M^(A) with "padding sequences" made entirely of gap symbols. By doing so (and analogously when N_k^(A) > N_k^(B)), the thus-augmented interacting MSAs have the same number N_k := max(N_k^(A), N_k^(B)) of sequences from each species k. In practice, this method is used for the AFM complex structure prediction, while the curated benchmark prokaryotic MSAs do not have such asymmetries (see "Supplementary material, Datasets").
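The gap-padding scheme can be sketched as follows (an illustrative helper, treating each species block as a list of aligned sequence strings):

```python
def pad_species_blocks(msa_a, msa_b, gap="-"):
    """Pad the shallower of two same-species sequence blocks with all-gap
    "padding sequences" so that both blocks have max(N_k^A, N_k^B) rows.
    The function name and interface are illustrative, not the authors' code."""
    len_a, len_b = len(msa_a[0]), len(msa_b[0])  # alignment lengths
    n = max(len(msa_a), len(msa_b))
    msa_a = msa_a + [gap * len_a] * (n - len(msa_a))
    msa_b = msa_b + [gap * len_b] * (n - len(msa_b))
    return msa_a, msa_b

# Species with two family-A sequences but only one family-B sequence:
a, b = pad_species_blocks(["ACD", "ACE"], ["WY"])
```

After padding, both blocks have two rows, with `b[1] == "--"` an all-gap padding sequence.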
Formalization. The paralog matching problem corresponds to finding, within each species k, a mapping that associates one sequence of M^(A) to one sequence of M^(B) (and reciprocally). Thus, within each species k, one-to-one matchings can be encoded as permutation matrices of size N_k × N_k. A brute-force search through all possible within-species one-to-one matchings would scale factorially with the size N_k of each species, making it prohibitive. Note that the Iterative Pairing Algorithm (IPA) [14,47] is an approximate method to solve this problem when optimizing coevolution scores. Here, we introduce another one, which allows us to leverage the power of deep learning.
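As a concrete illustration of this encoding (with hypothetical row labels), a within-species matching is simply a permutation of the rows of M^(B), and the brute-force count grows factorially:

```python
from math import factorial

# Within species k, a one-to-one matching is a permutation: row i of M^(A)
# is paired with row perm[i] of M^(B).
def apply_matching(msa_b_rows, perm):
    """Reorder the species-k rows of M^(B) so that row i faces row i of M^(A)."""
    return [msa_b_rows[j] for j in perm]

paired_b = apply_matching(["b0", "b1", "b2"], [2, 0, 1])

# Brute force over all matchings scales as N_k!:
# already about 3.6 million candidates for N_k = 10.
n_matchings = factorial(10)
```

This factorial growth is what motivates the differentiable relaxation introduced below.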
Exploiting known interacting partners. Our use of a language model allows for contextual conditioning, a common technique in natural language processing. Indeed, if any correctly paired sequences are already known, they can be included as part of the joint MSA input to MSA Transformer. In this case, we exclude their pairing from the optimization process, in particular by not masking any of their amino acids, see below. We call these known paired sequences "positive examples". In Fig. 2, we randomly sampled species and included all their pairs as positive examples, until we reached the desired depth N_pos ± 10%. For eukaryotic complex structure prediction, we treated the query sequence pair as a positive example.

DiffPALM: Paralog matching based on MLM
Here, we explain our paralog matching method based on MLM, which we call DiffPALM. Background information on MSA Transformer and its MLM loss is collected in "Supplementary material, MSA Transformer and masked language modeling for MSAs". DiffPALM exploits our differentiable framework for optimizing matchings, see "Supplementary material, A differentiable formulation of paralog matching". The key steps are summarized in Fig. 4.

Figure 4: Schematic of the DiffPALM method. First, the parameterization matrices X_k are initialized, and then the following steps are repeated until the loss converges: (1) Compute the permutation matrices M(X_k) and use them to shuffle M^(A) relative to M^(B), then pair the two MSAs. (2) Randomly mask some tokens of one of the two sides of the paired MSA and compute the MLM loss, Eq. (S1). (3) Backpropagate the loss and update the parameterization matrices X_k, using the Sinkhorn operator Ŝ for the backward step instead of the matching operator M (see "Supplementary material, A differentiable formulation of paralog matching").
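The exact operators M and Ŝ are defined in the Supplementary material; the following numpy sketch conveys the underlying idea (illustrative implementations, not the authors' code): the hard matching operator solves a linear assignment, while Sinkhorn normalization gives a smooth, doubly stochastic surrogate usable in the backward pass.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(X, n_iter=200, tau=1.0):
    """Smooth relaxation: alternately normalize rows and columns of exp(X / tau),
    approaching a doubly stochastic matrix."""
    S = np.exp(X / tau)
    for _ in range(n_iter):
        S = S / S.sum(axis=1, keepdims=True)
        S = S / S.sum(axis=0, keepdims=True)
    return S

def matching_operator(X):
    """Hard matching: the permutation matrix maximizing <P, X>."""
    rows, cols = linear_sum_assignment(-X)  # negate to maximize
    P = np.zeros_like(X)
    P[rows, cols] = 1.0
    return P

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
S, P = sinkhorn(X), matching_operator(X)
```

`S` has (approximately) unit row and column sums, while `P` is an exact permutation matrix; using Ŝ in place of M for gradients is a straight-through-style estimator.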
Construction of an appropriate MLM loss. Using the tools just described, we consider two interacting MSAs (possibly augmented with padding sequences), still denoted by M^(A) and M^(B). Given species indexed by k = 1, …, K, we initialize a set {X_k}_{k=1,…,K} of square matrices of size N_k × N_k (the case K = 1 corresponds to X in "Supplementary material, A differentiable formulation of paralog matching"). We call these "parameterization matrices". By applying to them the matching operator M [see Eq. (S2)], we obtain the permutation matrices {M(X_k)}_{k=1,…,K}, encoding matchings within each species in the paired MSA. Using gradient methods, we optimize the parameterization matrices so that the corresponding permutation matrices yield a paired MSA with low MLM loss. More precisely, paired MSAs are represented as concatenated MSAs with interacting partners placed on the same row, and our MLM loss for this optimization is computed as follows: 1. Perform a shuffle of M^(A) relative to M^(B) using the permutation matrix M(X_k) in each species k (plus an optional noise term, see below), to obtain a paired MSA M; 2. Generate a mask for M (excluding any positive example tokens from the masking); 3. Compute MSA Transformer's MLM loss for that mask, see Eq. (S1). Importantly, we only mask tokens from one of the two MSAs, chosen uniformly at random within that MSA with a high masking probability p = 0.7. Our rationale for using large masking probabilities is that it forces the model to predict masked residues in one of the two MSAs by using information coming mostly from the other MSA, see Fig. S2. We stress that, if padding sequences consisting entirely of gaps are present (see above), we mask these symbols with the same probability as those coming from ordinary sequences. Of the two MSAs to pair, we mask the one with shorter length if no padding sequences exist (i.e. here, for our prokaryotic benchmark datasets). Else, if lengths are comparable but one MSA contains considerably more padding sequences than the other, we preferentially mask that MSA. Otherwise, we randomly choose which of the two MSAs to mask.
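The one-sided masking step can be sketched as follows (numpy; the function name and interface are illustrative, and the paired MSA is viewed as a token array whose columns split into the A side and the B side):

```python
import numpy as np

def one_sided_mask(n_rows, len_a, len_b, p=0.7, positive_rows=(), side="B", seed=0):
    """Boolean mask over an (n_rows, len_a + len_b) paired MSA: mask tokens with
    probability p, only in the chosen side's columns, and never in rows that are
    positive examples (known interacting pairs kept as unmasked context)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_rows, len_a + len_b), dtype=bool)
    cols = slice(0, len_a) if side == "A" else slice(len_a, len_a + len_b)
    width = len_a if side == "A" else len_b
    mask[:, cols] = rng.random((n_rows, width)) < p
    mask[list(positive_rows), :] = False  # positive examples stay fully visible
    return mask

m = one_sided_mask(5, 3, 4, positive_rows=[0])
```

Here only family-B columns can be masked, and row 0 (a positive example) is left untouched.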
We fine-tuned all the hyperparameters involved in our algorithm using two joint MSAs of depth ∼ 50, constructed by selecting random species from the HK-RR dataset (see "Supplementary material, Datasets").
Noise and regularization. Following [71], after updating (or initializing) each X_k, we add to it a noise term given by a matrix of standard i.i.d. Gumbel noise multiplied by a scale factor. The addition of noise ensures that the X_k do not get stuck at degenerate values for the right-hand side of Eq. (S2), and more generally may encourage the algorithm to explore larger regions in the space of permutations. As scale factor for this noise we choose 0.1 times the sample standard deviation of the current entries of X_k, times a global factor tied to the optimizer scheduler (see next paragraph). Finally, since the matching operator is scale-invariant, we can regularize the matrices X_k to have small Frobenius norm. We find this to be beneficial and implement it through weight decay, set to w = 0.1.
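This noise injection can be sketched as follows (numpy; an illustrative helper, with the scheduler-tied global factor passed in as a plain argument):

```python
import numpy as np

def add_gumbel_noise(X, global_factor=1.0, base_scale=0.1, seed=0):
    """Perturb a parameterization matrix X_k with i.i.d. standard Gumbel noise
    scaled by base_scale * std(X_k) * global_factor, as described in the text."""
    rng = np.random.default_rng(seed)
    scale = base_scale * X.std() * global_factor
    return X + scale * rng.gumbel(size=X.shape)

X = np.arange(9.0).reshape(3, 3)
X_noisy = add_gumbel_noise(X)
```

Note that with zero-initialized X_k the standard deviation vanishes, so in that case ties in the matching operator are broken by other means (see "Exploring the loss landscape through multiple initializations").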
Optimization. We backpropagate the MLM loss on the parameterization matrices X_k. We use the AdaDelta optimizer [72] with an initial learning rate γ = 9 and a "reduce on loss plateau" learning rate scheduler which decreases the learning rate by a factor of 0.8 if the loss has not decreased for more than 20 gradient steps after the learning rate was last set. The learning rate scheduler also provides the global scale factor which, together with the standard deviation of the entries of X_k, dynamically determines the magnitude of the Gumbel noise (see previous paragraph).
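In PyTorch terms, this optimizer and scheduler setup might look as follows (a minimal sketch with a placeholder loss, not the authors' training loop; the weight decay from the previous paragraph is included):

```python
import torch

# One parameterization matrix per species (zero-initialized, see below).
X = [torch.zeros(5, 5, requires_grad=True)]
opt = torch.optim.Adadelta(X, lr=9.0, weight_decay=0.1)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode="min", factor=0.8, patience=20)

for step in range(30):
    loss = (X[0] ** 2).sum()   # stand-in for the MLM loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step(1.0)            # feeding a constant loss simulates a plateau
```

After more than 20 plateau steps, the scheduler multiplies the learning rate by 0.8.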
Exploring the loss landscape through multiple initializations. We observe that the initial choice of the parameterization set {X_k}_{k=1,…,K} strongly impacts results. Slightly different initial conditions for X_k lead to very different final permutation matrices. Furthermore, we observe a fast decrease in the loss when the X_k are initialized to be exactly zero (our use of Gumbel noise means that we break ties randomly when computing the permutation matrices M(X_k); if noise is not used, similar results can be achieved by initializing X_k with entries very close to zero). Thus, we can cheaply probe different paths in the loss landscape by performing several short runs using zero-initialized parameterization matrices X_k. In practice, we use 20 such short runs, each consisting of 20 gradient steps. Then, we average all the final parameterizations together to warm-start a longer run made up of 400 gradient steps.
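The warm-starting scheme above can be sketched as follows (illustrative helper; `short_run` stands in for 20 gradient steps of the actual optimizer):

```python
import numpy as np

def warm_start(short_run, n_short=20, shape=(8, 8)):
    """Run several short zero-initialized optimizations and average their final
    parameterization matrices to initialize the subsequent long run."""
    finals = [short_run(np.zeros(shape)) for _ in range(n_short)]
    return np.mean(finals, axis=0)

# Trivial stand-in dynamics, just to exercise the aggregation:
X0_long = warm_start(lambda X: X + 1.0, n_short=3, shape=(2, 2))
```

In the real method each short run would differ because of the Gumbel noise, so the average blends several explored paths in the loss landscape.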
Result and confidence. We observe that, even though the loss generally converges to a minimum average value during our optimization runs, there are often several distinct hard permutations associated with the smallest loss values. This may indicate a flattening of the loss landscape relative to the inherent fluctuations in the MLM loss, and/or the existence of multiple local minima. To extract a single matching per species from one of our runs (or indeed from several runs, see "Improving precision: MRA and IPA"), we average the hard permutation matrices associated with the q lowest losses, and evaluate the matching operator [Eq. (S2)] on the resulting averages. We find final precision-100 figures to be quite robust to the choice of q. On the other hand, for individual (warm-started) runs as described in "Exploring the loss landscape through multiple initializations", precision-10 benefits from setting q to its maximum possible value of 400.
Furthermore, we propose using each entry in the averaged permutation matrices as an indicator of the model's confidence in the matching of the corresponding pair of sequences. Indeed, pairs that are present in most low-loss configurations are presumably essential for the optimization process and are captured most of the time, pushing their confidence value close to 1. Conversely, non-interacting pairs are in most cases associated with higher losses and therefore appear only sporadically, obtaining confidences close to zero. Accordingly, we refer to the averaged hard permutations used to extract a single matching per species as "confidence matrices", and to the final in-species matchings as "consensus permutations".
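The construction of confidence matrices and consensus permutations can be sketched as follows (illustrative implementation; the matching operator is approximated here with a standard linear assignment solver):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def consensus(perms, losses, q):
    """Average the hard permutations with the q lowest losses into a confidence
    matrix, then apply the matching operator to obtain the consensus permutation."""
    order = np.argsort(losses)[:q]
    confidence = np.mean([perms[i] for i in order], axis=0)
    rows, cols = linear_sum_assignment(-confidence)  # maximize total confidence
    hard = np.zeros_like(confidence)
    hard[rows, cols] = 1.0
    return confidence, hard

eye = np.eye(3)
swap = eye[[1, 0, 2]]  # a different permutation, associated with a high loss
C, P = consensus([eye, eye, swap], losses=[0.1, 0.2, 0.9], q=2)
```

Here the high-loss permutation is excluded from the average, so both the confidence matrix and the consensus coincide with the low-loss matching.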
Improving precision: MRA and IPA. We propose two methods for further improving precision. In the first method, which we call Multi-Run Aggregation (MRA), we perform N_runs independent optimization runs for each interacting MSA. Then, we collect together the hard permutations independently obtained from each run, and aggregate the q = 400 lowest-loss permutations from this larger collection to obtain more reliable confidence matrices and hard permutations.
The second method is an iterative procedure analogous to the Iterative Pairing Algorithm (IPA) of Refs. [14,47], and named after it. The idea is to gradually add pairs with high confidence as positive examples. Assuming a paired MSA containing a single species for notational simplicity, the n-th iteration (starting at n = 1) involves the following steps:
1. Perform an optimization run and extract from it a confidence matrix C^(n) as described in "Result and confidence", using the currently available positive examples;
2. Compute the moving average C̄^(n) = mean(C^(n), C^(n−1), …, C^(1)) (so that C̄^(1) ≡ C^(1));
3. Define candidate matchings via the consensus permutation M^(n) = M(C̄^(n));
4. Repeat Steps 1-3 a maximum of 3 times, until the average MLM loss estimated using M^(n) and 200 random masks is lower than, or statistically insignificantly higher than, what could have been obtained using M^(n−1) and the same positive examples as in Step 1;
5. If Step 4 fails, set C̄^(n) = C̄^(n−1) and M^(n) = M(C̄^(n−1)) (but removing rows and columns corresponding to the positive examples added at iteration n − 1);
6. Check that the average MLM loss estimated using M^(n) and 200 random masks, but only regarding as positive examples those available at the beginning of iteration n − 1, is not statistically significantly higher than the average MLM loss estimated using M^(n−1) and those same positive examples;
7. If Step 6 fails, terminate the IPA. Otherwise, pick the top 5 pairs according to C̄^(n), promote them to positive examples in all subsequent iterations, and remove them from the portion of the paired MSA to be optimized.
If several species are present, they are optimized together (see "Construction of an appropriate MLM loss") and confidence values from all species are used to select the top 5 pairs.
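A bare-bones skeleton of this iteration might look as follows (illustrative only: the MLM-loss acceptance tests of Steps 4-7 are omitted, and confidences are represented as a dict from candidate pair to score):

```python
def ipa_skeleton(run_optimization, n_iterations=3, top_k=5):
    """Each iteration runs an optimization given the current positive examples,
    averages confidences across iterations, and promotes the top_k
    highest-confidence pairs to positive examples."""
    positives, history = [], []
    for _ in range(n_iterations):
        history.append(run_optimization(positives))  # {(i, j): confidence}
        # Moving average of confidences over all iterations so far.
        avg = {pair: sum(h.get(pair, 0.0) for h in history) / len(history)
               for pair in history[-1]}
        best = sorted(avg, key=avg.get, reverse=True)[:top_k]
        positives += [pair for pair in best if pair not in positives]
    return positives

# A fixed stand-in for the optimization run, just to exercise the loop:
fixed = {(0, 0): 0.9, (1, 2): 0.8, (2, 1): 0.1}
result = ipa_skeleton(lambda pos: dict(fixed), n_iterations=2, top_k=2)
```

With these toy confidences, the two high-confidence pairs are promoted and the low-confidence one is not.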

Pairing based on a single-sequence language model
To assess whether a single-sequence model can solve the paralog matching problem, we consider the 650M-parameter version of the ESM-2 model [5]. We score candidate paired sequences using the MLM loss in Eq. (S1). In contrast with MSA Transformer, the input of the model is not a paired MSA but a single paired sequence. Therefore, it is sufficient to individually score each possible pair within each species, without needing to consider all permutations. Denoting by N_k the number of sequences from each family in species k, the number of possible pairs is N_k^2 while the number of permutations is N_k!. This complexity reduction allows us to evaluate the scores of all possible pairs, and removes the need to backpropagate the loss through the permutation matrix. Accordingly, this method is much faster, since we only need to use the model in evaluation mode, without gradient backpropagation.
For each candidate paired sequence, we evaluate the average of the MLM losses computed over multiple random masks (with masking probability p). Once the average MLM losses are computed for all N_k^2 pairs, we compute the optimal one-to-one matching by applying standard algorithms for linear assignment problems [73] to the N_k × N_k matrix containing all the losses.
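As an illustration, a brute-force version of this assignment step could look as follows. This is only feasible for the small N_k typical of a single species; for larger matrices one would use a polynomial-time linear assignment solver such as the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def best_matching(loss):
    """Exact one-to-one matching minimising the total MLM loss.

    `loss[i][j]` is the average MLM loss of pairing sequence i of family A
    with sequence j of family B within one species.  Returns the tuple p
    such that sequence i of A is matched with sequence p[i] of B.
    """
    n = len(loss)
    return min(permutations(range(n)),
               key=lambda p: sum(loss[i][p[i]] for i in range(n)))
```

Brute force enumerates all N_k! permutations, which is exactly the cost that scoring single pairs (N_k^2 of them) avoids on the language-model side; only the final assignment needs combinatorial optimization, and it operates on a precomputed loss matrix.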

Assessing the impact of pairing on AFM structure prediction
Pairing methods employed in AFM and ColabFold. When presented with a set of query chains, AFM retrieves homologs of each chain by running JackHMMER [74] on UniProt, and performs further homology searches on other databases [7]. UniProt hits are partitioned by species and ranked within each species by decreasing Hamming distance to the relevant query sequence. A paired MSA is obtained by matching equal-rank hits. Sequences left unpaired are discarded. In addition, AFM produces "block MSAs" constructed by "pairing" hits from the remaining databases with padding sequences of gaps. The input to AFM comprises the paired MSA and the block MSAs.
While sharing the same architecture and weights as AFM, ColabFold retrieves homologs using MMseqs2 [75] on ColabFoldDB [8]. Within each species, hits are sorted by increasing E-value, and the best hits are paired [8,9,22-25]. Thus, contrary to the default AFM pipeline, the paired MSA in ColabFold contains at most one sequence pair per species for a heterodimer. Because the databases and homology search methods used by ColabFold differ from those used by AFM, a direct comparison does not isolate the effect of their different pairing schemes. Therefore, we applied the ColabFold pairing method starting from the sequences that are paired in the default AFM pipeline.
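The rank-based pairing shared by both pipelines reduces, within each species, to sorting the two hit lists and zipping them. A minimal sketch, with a hypothetical `pair_by_rank` helper and scores ordered so that lower is better (e.g. E-values):

```python
def pair_by_rank(hits_a, hits_b):
    """Pair equal-rank hits within one species, rank-matching style.

    `hits_a` / `hits_b` are lists of (score, sequence) tuples for the
    hits of chains A and B in that species, with lower score = better.
    Only the best min(len(hits_a), len(hits_b)) hits are paired; the
    surplus hits of the deeper family remain unpaired.
    """
    ranked_a = sorted(hits_a, key=lambda h: h[0])
    ranked_b = sorted(hits_b, key=lambda h: h[0])
    # zip truncates to the shorter list, dropping unmatched hits.
    return [(a[1], b[1]) for a, b in zip(ranked_a, ranked_b)]
```

This makes explicit why rank-based pairing is only a proxy for interaction: it assumes that the i-th closest homolog of chain A interacts with the i-th closest homolog of chain B, i.e. approximate orthology.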
Pairing using DiffPALM. To assess the impact of DiffPALM on complex structure prediction by AFM, we started from the sequences that are paired in the default AFM pipeline. We left out species with large imbalances between the numbers of sequences in the two families considered. Specifically, if the ratio of the larger to the smaller of these two numbers exceeds an ad-hoc "maximum size ratio" MSR (see Table S1), if there is only one sequence in both families, or if there are more than 50 sequences in at least one family, then we do not attempt pairing via DiffPALM and revert to default AFM pairing. When the full MSA to be paired with DiffPALM is sufficiently deep and/or long, optimizing it as a whole is not possible due to GPU memory limitations. Instead, we partition it into several sufficiently small sub-MSAs, which we optimize independently. We always use the query sequences as positive examples.
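Under our reading of these criteria, the per-species filter can be sketched as follows (the function and parameter names are ours):

```python
def use_diffpalm(depth_a, depth_b, msr, max_depth=50):
    """Decide whether a species block is paired with DiffPALM or left to
    the default AFM scheme, following the filters described above.

    `depth_a` / `depth_b`: numbers of sequences in the two families;
    `msr`: ad-hoc maximum size ratio; `max_depth`: depth cutoff.
    """
    if depth_a == 1 and depth_b == 1:
        return False            # a single pair: nothing to disambiguate
    if max(depth_a, depth_b) > max_depth:
        return False            # too deep for GPU memory
    ratio = max(depth_a, depth_b) / min(depth_a, depth_b)
    return ratio <= msr         # reject strongly imbalanced blocks
```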
for a suitably large l, we can define a smooth mapping Ŝ(X) = S^l(X/τ) which sends arbitrary square matrices to "soft permutations" approximating bona fide ("hard") permutations. In practice, we use τ = 1 and l = 10.
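The Sinkhorn operator S^l underlying this construction alternates row and column normalization of the entrywise exponential; a self-contained pure-Python sketch (small matrices only):

```python
from math import exp

def sinkhorn(X, tau=1.0, l=10):
    """Smooth map toward a permutation: exponentiate X / tau entrywise,
    then alternate row and column normalisation l times (Sinkhorn).

    The output is (approximately) doubly stochastic; with small tau and
    large l it approaches a hard permutation matrix.
    """
    S = [[exp(x / tau) for x in row] for row in X]
    n = len(S)
    for _ in range(l):
        # Normalise each row to sum to 1.
        S = [[v / sum(row) for v in row] for row in S]
        # Normalise each column to sum to 1.
        col = [sum(S[i][j] for i in range(n)) for j in range(n)]
        S = [[S[i][j] / col[j] for j in range(n)] for i in range(n)]
    return S
```

Lower values of τ sharpen the output toward a hard permutation, at the cost of vanishing gradients, which is why a moderate τ = 1 with the hard/soft forward-backward trick described below is used instead.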
Applying general soft permutations directly to an MSA (after one-hot encoding its residues) yields a dataset consisting of "amino acid mixtures" at each MSA position. Such datasets are out of distribution relative to MSA Transformer's pre-training, since it was trained on embeddings of single amino acids, not mixtures of them. Besides, we wish to optimize an MLM loss defined on realistic MSAs. Therefore, in order to be able to backpropagate through Ŝ while also evaluating MLM losses only on MSAs shuffled by hard permutations, we compute the full matching operator M [Eq. (S2)] in the forward pass, but propagate gradients backwards through Ŝ alone.

S1.3 Datasets
Benchmark prokaryotic datasets. We developed and tested DiffPALM using joint MSAs extracted from two datasets. The first dataset comprises 23,632 cognate pairs of histidine kinases (HK) and response regulators (RR) from the P2CS database [55,56], paired using genome proximity and previously described in [14,47]. The average number of pairs per species in this dataset is 10.23 (standard deviation: 7.85).
The second dataset consists of 17,950 ABC transporter protein pairs, homologous to the Escherichia coli MALG-MALK pair of maltose and maltodextrin transporters, also paired using genome proximity [14,21]. The average number of pairs per species in this dataset is 5.68 (standard deviation: 5.60).
From each of these two benchmark datasets of known interacting pairs, we construct paired MSAs of depth ∼50 by selecting random species from the full dataset. Specifically, species are added one by one until an MSA depth between 45 and 55 is reached. For such shallow MSAs, existing coevolution-based methods do not perform well. Note also that MSA Transformer's large memory footprint constrains the depth and length of input MSAs.
Eukaryotic complexes. We considered targets whose structures are not in the training set of AFM with v2 weights, and for which default AFM predictions are poor. Specifically, we started from the eukaryotic targets in Table A1 of [18] and in the "Benchmark 2" dataset of [7] whose PDB structures were released after the training cutoff for the AFM v2 weights (April 30, 2018). Among those, we focused on multimers with no more than 2 different types of monomers, where both monomers come from the same species, and with paired sequences no longer than 500 amino acids, due to GPU memory constraints. Finally, we further restricted to the 15 targets whose default AFM predictions yield the poorest reported DockQ scores. They are listed in Table S1. All of them are heterodimers, except 6ABO, which is a heterotetramer made of two IFFO1 and two XRCC4 molecules.

S1.4 General points on AFM
The AFM confidence score is defined as 0.8 · iptm + 0.2 · ptm, where iptm is the predicted TM-score in the interface, and ptm the predicted TM-score of the entire complex [7].
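As a trivial illustration of this score (the function name is ours):

```python
def afm_confidence(iptm, ptm):
    """AFM model confidence: 0.8 * iptm + 0.2 * ptm, weighting the
    predicted interface TM-score (iptm) more heavily than the predicted
    TM-score of the entire complex (ptm)."""
    return 0.8 * iptm + 0.2 * ptm
```

The heavy weight on iptm reflects that, for complexes, the quality of the predicted interface matters more than the global fold of each chain.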

Figure 1:
Figure 1: Performance of DiffPALM on small HK-RR MSAs. The performance of two variants of DiffPALM (MRA and IPA, see "Improving precision: MRA and IPA") is shown versus the number of runs used for the MRA variant, for 40 MSAs comprising about 50 HK-RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA [14], MI-IPA [47], and GA-IPA [50]. We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model [5], see "Pairing based on a single-sequence language model". With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is "precision-100", where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report "precision-10", which is calculated over the top 10% of predicted pairs, when ranked by predicted confidence within each MSA (see "Methods"). For DiffPALM, we plot the mean performance on all MSAs, with the standard error range shown as a shaded region. For our ESM-2 based method, we consider 10 different values of the masking probability p from 0.1 to 1.0, and report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.

Figure 2:
Figure 2: Impact of positive examples and extension to another pair of protein families. We report the performance of DiffPALM with 5 MRA runs (measured as precision-100 and precision-10, see Fig. 1), for various numbers of positive examples, on the same HK-RR MSAs as in Fig. 1 (left panel). We also report the performance of DiffPALM for similarly-sized MALG-MALK MSAs (right panel). In both cases, we show the mean value over the 40 different MSAs with its standard error interval, and we plot the chance expectation for reference.
unmatched sequences belonging to species k in M^(A) and M^(B) (respectively). Dealing with asymmetric cases. The two protein families considered may have different numbers of paralogs within the same species. Assume, without loss of generality, that N_k^(A) ≤ N_k^(B). To solve the matching problem with one-to-one interactions, we would like to pick, for each of the N_k^(A) sequences in M^(A), a single and exclusive interaction partner out of the N_k^(B) sequences in M^(B).

Figure S1:
Figure S1: Comparison of contact maps predicted by MSA Transformer for the correct pairing of an HK MSA and an RR MSA ("Correct pairs"), and for an incorrect pairing ("Shuffled pairs"). We observe that MSA Transformer correctly predicts the inter-protein contacts when given as input a paired MSA made of correctly matched sequences. Conversely, if the model is given as input a paired MSA whose rows have been shuffled before pairing, it is not able to recover the inter-protein contact map (even though it correctly recovers the intra-protein contact maps). These results suggest that MSA Transformer can distinguish between interacting and non-interacting pairs of protein sequences, despite the fact that neither dimers nor paired MSAs were in the training set used for its MLM pre-training [51].

Figure S2:
Figure S2: MLM loss vs. number of correct pairs for different masking probabilities. We use an MSA of M ≈ 50 sequences and 5 different species. To estimate the expected loss accurately, we used 20 different masks at each step. L_tar denotes the expected loss when all pairs are correctly matched. For visualization purposes, in every plot we rescale the loss by shifting it by L_tar. We find that our MLM loss in Eq. (S1) decreases as the number of correctly matched sequences in the MSA increases. We see that the sweet spot of the masking probability p (i.e. the value that gives steeper and smoother loss curves) lies at moderately high values (0.4 ≤ p ≤ 0.7). As explained in "Methods", high masking probabilities make it more challenging for the model to predict the masked amino acids using only information coming from the masked MSA, thus encouraging it to use, instead, information coming from the matched MSA. This motivates our choice of a masking probability of p = 0.7.

Figure S6:
Figure S6: Comparing the AFM default MSA pairing strategy with DiffPALM, for structure 6L5K. In both panels, we superimpose the experimental structure of 6L5K with a structure predicted using AFM. Chains A and B of the PDB structure are colored salmon and light blue respectively, while chains A and B of both predicted structures are colored bright red and bright blue respectively. (a) Comparison of the experimental structure with a typical high-confidence prediction generated with the default MSA pairing pipeline. (b) Comparison of the experimental structure with a typical high-confidence prediction generated with our MSA pairing pipeline based on DiffPALM.

Figure S7:
Figure S7: Comparing the AFM default MSA pairing strategy with DiffPALM, for structure 6FYH. Same as Fig. S6, but for 6FYH.

Figure S8:
Figure S8: Performance of AFM using different variants of DiffPALM. Same as Fig. 3, but here we compare the standard DiffPALM method with two of its variants: one where we only use pairs with high predicted confidence (≥ 0.9) as input to the AFM pipeline, and one where we use orthology-based pairs (i.e. those employed in the "Only orthologs" case shown in Fig. 3) as positive examples for DiffPALM, and use the pairs predicted by DiffPALM, as well as the positive examples, as input to AFM.

Figure S9:
Figure S9: Confidence of DiffPALM predictions. We show, for the 15 complexes listed in Table S1, histograms of the DiffPALM confidence values (see "Result and confidence"). We distinguish the orthology-based pairs that are recovered by DiffPALM, the otherwise paired orthologs, and all the other paired sequences. We indicate in the panel titles the fraction f of orthology-based pairs that are recovered by DiffPALM.

Figure S10:
Figure S10: Hamming distance to the orthologs for all the mismatched pairs predicted by DiffPALM. We show, for the 15 complexes listed in Table S1, histograms of the Hamming distance between the partner predicted by DiffPALM and the one predicted by matching orthologs to the query sequences, whenever they differ. In practice, for each sequence A in family A which is paired with a partner B using orthology, but with a different partner b using DiffPALM, we measure the Hamming distance between B and b. A similar protocol is conducted for each sequence B in family B. These distances allow us to compare the pairs predicted by DiffPALM to the orthology-based pairs. Note that the total counts of the distributions regarding MSA A and MSA B generally differ. This happens because DiffPALM may pair orthologs to padding sequences of gaps; in this case, we do not report Hamming distances.

Table S1:
Dataset of eukaryotic complexes. All 15 eukaryotic protein complexes considered here are listed by their PDB ID, and various MSA properties are given. L_A and L_B are the lengths of the aligned amino acid sequences of the two chains A and B considered. D denotes the depth of the full MSA built by AFM, consisting of the paired MSA and the block MSAs (see "Assessing the impact of pairing on AFM structure prediction"). F_paired is the fraction of sequences that are paired by AFM, i.e. the depth of the paired MSA divided by D. ⟨d_p⟩ is the average depth of the MSAs of the single species in the AFM-paired MSA; it is thus the average of the larger of the two chains' depths, since the other one is completed by padding sequences of gaps. MSR stands for "maximum size ratio": if the ratio of the larger to the smaller of the depths of MSAs A and B exceeds MSR, the species is not paired by DiffPALM. F_same is the fraction of pairs predicted by DiffPALM that are identical to the pairs predicted by the default AFM pairing method. F_pred is the ratio of the number of pairs predicted with DiffPALM to the number predicted using the default AFM pairing method. D_DiffPALM = D × F_paired × F_pred denotes the depth of the MSA which is paired by DiffPALM. D_eff^A (resp. D_eff^B) is the effective depth, corrected with phylogenetic weights (with Hamming distance threshold 0.2) [13], of the MSA associated to chain A (resp. chain B) in the MSAs which are paired by DiffPALM. These effective depths quantify MSA diversity. Rows are ordered by increasing mean DockQ score for the default AFM pairing method (cf. Fig. S5).