# Predicting protein–ligand affinity with a random matrix framework

See allHide authors and affiliations

Edited by Michael L. Klein, Temple University, Philadelphia, PA, and approved September 29, 2016 (received for review July 9, 2016)

## Significance

Developing computational methods to screen ligands against protein targets is a major challenge for drug discovery. We present a robust mathematical framework, inspired by random matrix theory, which predicts ligand binding to a target given the known ligand set of that target. Our method considers binding prediction as a denoising problem, recognizing that only some of the chemically important features associated with each ligand contribute to binding to a particular receptor. We use correlations among chemical features in the known ligand set, combined with random matrix theory, to eliminate statistically insignificant correlations. Our method outperforms existing algorithms in the literature. We show that our algorithm has the physical interpretation of estimating the ligand–target binding energy.

## Abstract

Rapid determination of whether a candidate compound will bind to a particular target receptor remains a stumbling block in drug discovery. We use an approach inspired by random matrix theory to decompose the known ligand set of a target in terms of orthogonal “signals” of salient chemical features, and distinguish these from the much larger set of ligand chemical features that are not relevant for binding to that particular target receptor. After removing the noise caused by finite sampling, we show that the similarity of an unknown ligand to the remaining, cleaned chemical features is a robust predictor of ligand–target affinity, performing as well or better than any algorithm in the published literature. We interpret our algorithm as deriving a model for the binding energy between a target receptor and the set of known ligands, where the underlying binding energy model is related to the classic Ising model in statistical physics.

- drug discovery
- random matrix theory
- protein–ligand affinity
- computational pharmacology
- statistical physics

Finding new ligands that bind to a given target is both a crucial step and a major stumbling block in modern drug discovery. Numerous attempts have been made to develop computational algorithms to predict the binding affinity of a ligand to a given receptor, which would allow potential compounds to be screened in silico, reducing costs and saving time. In particular, in response to the wealth of experimental data that exists both within pharmaceutical companies, and also in freely accessible online databases such as ChEMBL (1), approaches that attempt to “learn” from these data are increasingly gaining attention (2).

An intuitive data-driven approach builds on the hypothesis that chemical commonalities among the known ligand set reveal salient features of the binding site. A corollary is that ligands with similar chemical functionality are expected to share similar binding affinity toward a particular receptor (3, 4). This suggests that the known ligand set of a given target can be used to learn criteria that predict whether a novel ligand will bind to the target. This ligand-based approach is a powerful paradigm that does not require structural information about the receptor, which is potentially arduous to obtain, unlike other more atomistic methods such as docking or molecular dynamics.

Any ligand-based method requires a way to quantify the chemical functionalities of a ligand, and various chemical descriptors have been proposed. Examples include a vector of measured or predicted physical properties (5⇓⇓–8), a vector enumerating the presence or absence of known functional groups on the ligand (9, 10), a vectorial representation of connectivities in the molecular graph (11, 12) (known also as molecular fingerprints), and simply the 3D shape of the ligand (13⇓⇓–16). Existing approaches then take the descriptor associated with each ligand and compare ligands with each other, for example through the Tanimoto coefficient (17, 18).

Nonetheless, regardless of how ligand chemical functionalities are quantified, without fortuitously knowing a priori which ligand features determine binding, most of the chemical features describing the ligand are likely irrelevant. Whereas some of the features in the descriptor determine binding to the receptor of interest, others do not and simply add background noise. Moreover, for any particular receptor, the known set of ligands that bind to it is often smaller, or of the same order of magnitude as the number of potentially relevant chemical features. As such, the problem of ligand-based binding prediction can be recast as a problem in signal processing––can we identify those chemical ligand features that determine binding (i.e., the “signal”) amid many irrelevant ones (the “noise”) in the regime where the amount of data is not significantly larger than the number of variables being measured?

Random matrix theory (RMT) provides a natural mathematical framework for addressing this issue. Physical applications of RMT include Wigner’s study of the spectra of heavy atoms (19). In the context of data analysis, RMT gives a null model for the similarity between samples (ligands) that can be expected by chance due to finite sampling (20). Powerful analytical tools from RMT define a precise threshold that distinguishes the similarity that can be expected by chance from that which is caused by signal. These tools enable an effective and simple denoising algorithm, which allows us to recover the statistically significant signals. This denoising algorithm has been used in different fields, ranging from finance (21⇓–23) to face recognition (24, 25).

This article contains three major results: First, we show that for a randomly chosen set of molecules, the eigenvalue distribution of the covariance matrix of chemical descriptors agrees with the canonical Marčenko–Pastur (MP) distribution (26) of RMT, expected in the absence of any significant signal. Second, if we consider descriptors of pharmacologically similar molecules, i.e., those that bind to the same protein receptor, then part of the eigenvalue spectrum agrees with the MP distribution, but crucially there are eigenvalues that deviate from it significantly. These eigenvalues, and their corresponding eigenvectors, describe the statistically significant signals. The most common substructure of these eigenvectors corresponds to pharmacophores. Using these two results, we can predict with higher accuracy than known methods when an unknown ligand will bind to a receptor, constructing a unique model for each protein receptor. Finally, we provide a physical interpretation of the success of the algorithm––namely, that it is effectively inferring a model of the ligand–protein binding energy from the covariance structure of fingerprints that bind to a target protein. The underlying mathematical model is closely related to the classic Ising model in statistical physics.

## RMT Framework

To motivate the RMT framework, we focus on a popular set of descriptors that are often used in cheminformatics. Molecular fingerprints are typically constructed by first representing a ligand as a 2D molecular graph, and then considering all possible bond paths within the molecule (11, 12). The set of bond paths that characterize each molecule is unique, so that only identical molecules share exactly the same bond paths; similar molecules share most bond paths. Because the set of all possible bond paths is vast, typically fingerprints are defined by first considering bond paths that are below some threshold length (i.e., within some radius of every atom of the structure) and then mapping these bond paths to a bit string of defined length [a molecular “fingerprint” (27)] through a hash function.

The fundamental aim is to detect similarity among a set of binary strings of the same length, *p*, where each bit represents the presence or absence of a molecular feature. There is significant noise in these bit strings, because only some of the bits are truly informative––for any particular receptor, not all bond paths are equally relevant to ligand–target binding. If the individual bits of the binary strings were chosen randomly, with no information about ligand–target binding, then RMT predicts that the eigenvalue distribution of the covariance matrix of the bit strings obeys a specific analytical function known as the MP distribution. Therefore, a highly accurate test for detecting the presence of nonrandom commonalities among a set of strings is to compare the eigenvalue spectrum of their covariance matrix to the MP distribution. Any deviation necessarily reflects the presence of a signal in the data, which in this case are sets of molecular features that characterize the chosen ligand–target interaction.

Mathematically, we represent the *k*th ligand associated with the chosen receptor as a row vector of bits *N* ligands that bind to the chosen receptor can be arranged as a data matrix *N* will vary between receptors. We then remove repeated columns of the data matrix, which correspond to redundant information, and convert the data matrix to *z* scores by subtracting the column mean and normalizing each column to have unit variance. This allows us to construct the *N* × *N* correlation matrix *C* would indicate relationships between specific molecular features, suggesting that these features do not occur independently of one another in this dataset.

A fundamental result from random matrix theory describes the eigenvalue distribution of the correlation matrix *C* analytically—under certain weak assumptions, if entries in *A* are drawn from a Gaussian distribution with zero mean and unit variance, the probability of *A* having an eigenvalue *λ* is given by the MP distribution (26)**1** is that those eigenvalues above

Fig. 1*A* shows that the eigenvalue distribution of the correlation matrix of 1,000 ligands drawn randomly from ChEMBL (1) agrees quantitatively with the MP distribution. However, if instead we choose the ligands nonrandomly, by choosing the ligand sets associated with a particular protein receptor, we find a significant number of eigenvalues above the MP threshold. As examples, Fig. 1 *B* and *C* shows the eigenvalue distribution from ligand sets from ChEMBL associated with two G protein coupled receptors, the adenosine A2a receptor (AA2AR) and the

The MP distribution thus suggests an intuitive denoising algorithm for ligands that bind to a particular receptor: only eigenvectors with eigenvalues larger than the MP upper bound correspond to statistically significant features of the receptor; the other eigenvectors simply reflect random noise caused by finite sampling. The set of statistically significant features, represented as orthonormal eigenvectors, are thus orthogonal chemical features relevant for ligand binding. In other words, if there are *m* eigenvalues greater than the MP upper bound, then the linear space spanned by the *m* associated eigenvectors,

## Classification of Unknown Ligands

Intuitively, if an unknown ligand is sufficiently similar to the set of known ligands that bind to a receptor, the unknown ligand will likely also bind to the receptor. The random matrix framework gives a precise mathematical statement for this intuition: An unknown ligand is predicted to bind to a receptor if the bit-string vector corresponding to the unknown ligand (after transformation to *z* score by subtracting the sample mean and normalizing by sample variance) lies close to the subspace

Let *z* scores corresponding to the unknown ligand. The projection of **3** has the chemical interpretation that one can be confident a ligand will bind to the receptor if it contains pharmacophores found in known ligands, and is minimally decorated with other functional groups. A pharmacophore is typically a small fragment (see Fig. 4), and the chemical properties of the resulting molecule will increasingly deviate from those of the pharmacophore as one incorporates additional functional groups. The threshold parameter ε allows the tolerance of the analysis to the presence of other functional groups to be controlled, and hence an appropriate false positive/false negative tradeoff selected; this is discussed in detail below.

To test this, we consider human G protein coupled receptors (GPCRs) reported in ChEMBL. A ligand is considered to bind to a given target if its

The receiver operating characteristic (ROC) curve plots the accuracy of identifying ligands (true positives) as a function of false-positive predictions. This characteristic is commonly used to quantify the performance of classification algorithms. In particular, the area under the ROC (the so-called AUC) is the crucial figure of merit: the closer the AUC is to 1, the better the classifier. Fig. 2*A* shows that our algorithm has a mean AUC of 0.9, surpassing methods commonly used in the literature, which have a mean AUC of

The ROC curve is plotted by varying ε, the threshold parameter in Eq. **3**. Fig. 2*B* shows the effect of varying ε, represented as the percent of the training set accounted for by each choice of ε. A stringent choice of ε corresponds to a large portion of the training set being rejected by the threshold in Eq. **3**, resulting in a low false-positive rate but a high false-negative rate. Vice versa, an ε value that accounts for a larger portion of the training set has higher false-positive rate but lower false-negative rate. In the remainder of this paper, we choose ε so that **3**. With this heuristic choice, the algorithm picks out 84% of the verification set as ligands with a 7% false-positive rate (i.e., it rejects 93% of randomly selected ligands from ChEMBL).

The random matrix distribution (Eq. **1**) is crucial to the success of our algorithm. Fig. 3 shows that including too many eigenvectors into

We also report that the statistically significant eigenvectors picked out by our algorithm represent pharmacophores. Formally, a fingerprint cannot be inverted directly to give a unique chemical structure because multiple structures can lead to the same fingerprint. Nonetheless we can infer the structural motif that an eigenvector represents by the common substructure among those ligands that lie closest to that eigenvector. Fig. 4 shows the structural motif corresponding to the top two eigenvalues of AA2AR and ADRB1. Strikingly, the first eigenvector of AA2AR is precisely the adenine motif. The second eigenvector contains a thymine motif fused to a more complex scaffold. For ADRB1, the top eigenvector is the structural motif of *β*-blockers (e.g., propranolol), a class of successful antagonists which are used, e.g., to treat hypertension.

## Physical Model

Before concluding, we address the question of why this algorithm might prove effective. What is the physics encoded in those eigenvalues larger than the MP threshold, and their associated eigenvectors?

The clearest way of determining which ligands bind to a given protein would be to accurately predict the binding energy of every possible ligand to the protein. The ligand set of the protein is then given by the set of ligands with a binding affinity greater than some threshold. Accurate determination of this binding energy is extremely computationally intensive. Nonetheless, even without a first-principles determination of the ligand binding energy, we might still hope to parameterize a model of protein–ligand binding, where the parameters are determined from the set of ligands that bind to a given protein target. If sufficiently accurate, such a model of the binding energy could potentially still give accurate predictions as to which ligands bind to a given target protein.

We now demonstrate that there is a natural class of models for ligand binding where our algorithm precisely picks out the set of strongly binding ligands. To begin, we note that because we are describing ligands through their fingerprints *E* in powers of *p* also depend on the nature of the fingerprints that we use to describe the ligands. More detailed fingerprints have a better chance of accurately modeling the binding energy between the ligand and receptor than those that do not take into account parts of the molecule that bind to the receptor. The fact that the Morgan 3 fingerprints used herein have long been shown to have predictive power for ligand–target association means that they plausibly contain sufficient information to model the binding energy. It is noteworthy that because fingerprints are binary strings of length *p*, the model in Eq. **4** is equivalent to the Ising model, well known in statistical physics.

Can we deduce *w* and *J* from the fingerprints of those ligands that bind to a protein target? Here we take as input the correlation matrix of the fingerprints that bind to each protein target in question. Indeed, determining the Ising model interaction matrix *J* from the correlation matrix is a classic problem in statistical physics and biophysics (31⇓⇓⇓–35). We now argue that our random-matrix-based procedure effectively removes noise caused by finite sampling from this problem. The essence of our algorithm is the derivation of a protein-specific binding energy model *J*.

We can directly compute the correlation matrix of the fingerprints that bind to a given protein target (characterized by *Z* is the partition function, summing **5**. The correlation matrix *T*: at high temperatures, where *J* in Eq. **4**.

Correspondingly, ligand–protein target binding only occurs over a range of temperatures, and we assume that we are in the range of temperatures where the binding is effective. Our algorithm computes the correlation matrix *n* samples, where *n* is the number of ligands that bind to the target in question. Critically, *n* is the same order of magnitude as the fingerprint length *p*, so our computed covariance matrix does not converge to the equilibrium expectation––it is corrupted by noise. Our procedure of extracting the eigenvalues above the MP threshold corresponds to estimating the binding energy from the data matrix.

To see this, Fig. 5 shows a set of simulations of the Ising model. We consider fingerprints of length **5**. We take the first-order coefficients to vanish (*z* score) and choose *A* shows the spectrum of the resulting correlation matrix, formed by considering **5** with *B* shows the corresponding spectrum of the correlation matrix when *C* shows that the eigenvector corresponding to this eigenvalue is extremely well correlated with

This correlation between the eigenvector and the coupling matrix *J* gives a physical interpretation of the projection onto the subspace of eigenvectors that escape the MP distribution in Eq. **2**: We have used the data to derive a model for the binding energy of the ligand in fingerprint “coordinates,” and to determine whether an arbitrary ligand binds to the target, we are simply evaluating this binding energy. The correlation structure is lost when we use a dataset of random ligands instead of those corresponding to a single protein receptor, because in this case there is no underlying energy model to learn. Although our simulations (Fig. 5) use a rank-1 *J* for simplicity, if *J* is of higher rank, more eigenvectors will be pushed outside the MP distribution. Indeed, ref. 36 showed that random matrix denoising is related to putting in a prior that the rank of *J* (in our case the number of independent pharmacophores) is less than the number of variables (2,048 for the Morgan 3 fingerprint). We note that the Ising energy in Eq. **4** provides another way to score ligands. However, the classification accuracy does not significantly improve if the energy is estimated using the leading-order mean-field approximation (37).

Although interpreting our algorithm in terms of a binding energy function requires experimental verification through binding energy measurements, we note that this interpretation offers several conceptual insights. First, new candidate compounds could be uncovered by exploring the potential energy landscape of Eq. **4**, and jumping between different energy minima could be related to “scaffold hopping” in drug discovery (38) as the minima would correspond to structures with pharmacophores. Investigating the topology of the energy landscape and those paths that connect distinct basins (39), as well as the statistics of energy minima, could reveal properties of the binding site. Secondly, relating our algorithm to an interaction energy provides a way to extend our method to regression problems, such as predicting solubility (40).

Third, we note that chemical fingerprints may be improved by incorporating physically relevant terms such as charge and molecular volume. This is facilitated by our approach, which accounts for additional noise introduced by increasing the number of fingerprint variables. Finally, the binding energy interpretation highlights the importance of high-quality negative data, i.e., which molecules do not bind to the desired receptor. Ref. 36 shows that including repulsive patterns could improve high-dimensional inference with inverse Ising/Hopfield models. Empirically, for our system, the repulsive patterns (small eigenvalues) inferred from the data are noisy and uninformative. This can be addressed either through identification of many more ligands that bind to each protein receptor, or, perhaps more efficiently, the incorporation of negative data into this framework.

## Conclusion

We have developed a classification algorithm that predicts whether a compound will bind to a particular receptor of interest, given the known ligand set of that receptor. Our algorithm decomposes signal from noise using a robust bound that is derived from RMT. Applying our approach to human GPCRs reported in ChEMBL successfully identifies 84% of known ligands with a 7% false-positive rate, yielding an average AUC of 0.9. The methodology developed here complements the vast literature on optimizing fingerprint design, for example through the use of high-throughput screening data (7) or through application of neural networks to molecular graphs (41). The random matrix framework described here provides a robust threshold for maximizing the information extracted from correlations between structural features, while avoiding overfitting the data. The algorithm has the natural interpretation as a data-driven model for the binding energy of the ligands to the target protein, in fingerprint coordinates. This model gives a different perspective on the validity and uses chemical fingerprints for both ligand binding predictions and other purposes such as predicting ligand solubility (40) or aggregation (42), as well as revealing insights in fingerprint design.

## Acknowledgments

This research was funded by a grant from Roche Pharmaceuticals. A.A.L. acknowledges the support of a Fulbright Fellowship. L.J.C. was supported by a Next Generation Fellowship, and a Marie Curie Career Integration Grant (Evo-Couplings, Grant 631609). M.P.B. is an investigator of the Simons Foundation, and also acknowledges support from the National Science Foundation through DMS-1411694.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: ljc37{at}cam.ac.uk or alphalee{at}g.harvard.edu.

Author contributions: A.A.L., M.P.B., and L.J.C. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

## References

- ↵.
- Bento AP, et al.

- ↵.
- Jacoby E

- ↵.
- Johnson MA,
- Maggiora GM

- ↵
- ↵
- ↵
- ↵
- ↵.
- Buonfiglio R, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Kubinyi H

- ↵.
- Shin WH,
- Zhu X,
- Bures MG,
- Kihara D

- ↵
- ↵
- ↵.
- Wigner EP

- ↵.
- Edelman A,
- Wang Y

- ↵
- ↵
- ↵.
- Akemann G,
- Baik J,
- Di Francesco P

- Bouchaud JP,
- Potters M

- ↵.
- Turk MA,
- Pentland AP

- ↵
- ↵.
- Marčenko VA,
- Pastur LA

- ↵
- ↵.
- Landrum G

- ↵.
- Lagarde N,
- Zagury JF,
- Montes M

- ↵.
- Unterthiner T, et al.

- ↵
- ↵
- ↵.
- Cocco S,
- Leibler S,
- Monasson R

- ↵.
- Weigt M,
- White RA,
- Szurmant H,
- Hoch JA,
- Hwa T

- ↵
- ↵.
- Cocco S,
- Monasson R,
- Sessak V

- ↵.
- Tanaka T

- ↵
- ↵.
- Wales D

- ↵
- ↵.
- Duvenaud DK, et al.

*Advances in Neural Information Processing Systems 28*, eds Cortes C, Lawrence ND, Lee DD, Sugiyama M, Garnett R. Available at https://arxiv.org/pdf/1509.09292.pdf. Accessed October 18, 2016. - ↵

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Applied Physical Sciences

- Biological Sciences
- Biophysics and Computational Biology