New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Statistical significance of protein structure prediction by threading

Communicated by Roy Gordon, Harvard University, Cambridge, MA (received for review January 24, 2000)
Abstract
In this study, we estimate the statistical significance of structure prediction by threading. We introduce a single parameter ɛ that serves as a universal measure determining the probability that the best alignment is indeed a nativelike analog. Parameter ɛ takes into account both length and composition of the query sequence and the number of decoys in threading simulation. It can be computed directly from the query sequence and potential of interactions, eliminating the need for sequence reshuffling and realignment. Although our theoretical analysis is general, here we compare its predictions with the results of gapless threading. Finally we estimate the number of decoys from which the native structure can be found by existing potentials of interactions. We discuss how this analysis can be extended to determine the optimal gap penalties for any sequencestructure alignment (threading) method, thus optimizing it to maximum possible performance.
Protein structure prediction is a complex problem that requires significant approximations and simplifications both in models involved and in search strategy. Currently a popular and reasonably successful method is threading. In threading, a new sequence is mounted on a series of known folds with the goal of finding a fold (a sequencestructure alignment) that provides the best score (lowest energy). A standard quasienergetic scoring scheme assigns energy E^{s} to an alignment s in the hope that the lowest energy alignment bears structural similarity to the native fold of the query sequence. In this regard, threading is similar in spirit to sequence alignment (1). An essential part of all sequence alignment procedures is the evaluation of the statistical significance of obtained scores (2, 3). More recently, the problem of the statistical significance of structural alignments was addressed (3).
Despite the development of numerous approaches and applications (4–10), the statistical significance of predictions from threading calculations has not been systematically analyzed. An empirical approach was proposed by Bryant and coworkers (5, 6), who compared the best threading alignment with the threading of reshuffled random sequences. Bryant and coworkers assumed that scores (energies) of realigned random sequences are normally distributed. This approach is computationally demanding, as it requires realignment of all reshuffled random sequences with all target proteins in the database. Furthermore, the assumption of Gaussian distribution of threading scores of realigned random sequences was not justified in refs. 5 and 6.
In this study, we show that, in contrast to earlier assumptions (5, 6), the probability of successful prediction in threading calculations follows an extreme value distribution (EVD) (11). Furthermore, our analysis identifies a simple parameter that provides a fast and computationally inexpensive clue as to whether the actual threading calculation resulted in a reliable prediction or in a false positive.
Theory Development
As a theoretical background, we use the random energy model (REM). The REM was originally introduced by Derrida (12) to describe a class of spinglass models. Later Bryngelson and Wolynes postulated (13) and Shakhnovich and Gutin showed [using replica meanfield theory (14)] that the REM provides an adequate description of equilibrium properties of random heteropolymers. Subsequently the REM was successfully applied to various aspects of the proteinfolding problem (15–18) as well as to analysis of the proteinstructure prediction problem (19, 20). The general applicability and limitations of the REM in describing the energy landscape of heteropolymers have been addressed in refs. 14, 21, and 22.
Here we will formulate the REM and its underlying assumptions by using the language of threading calculations.
The energy of each threading alignment is usually taken as a sum energy of all pairwise contacts: 1 where L is the length of a query sequence, s denotes alignment, and r_{i}^{s} is a coordinate of the ith group (usually the Cα or Cβ atom) in this alignment. Δ denotes the cutoff distance for contact potential that determines which groups are interacting (usually taken 7.5–9 Å between the Cα or Cβ atoms). ξ_{i} denotes the type of amino acid at position i of the query sequence. U is a 20 × 20 matrix of interaction energy parameters between all types of amino acids. Summations here and below are taken over all residues that are farther apart than two units along the sequence, i.e., j > i + 2.
The REM formulation for threading is as follows:
The set of alignments consists of the “native” alignment having energy E_{N} and a set of M decoys.
The energies of decoys take statistically independent random values.
The most common form for the probability density of energies of decoys is Gaussian: 2 E_{av} is the average energy, and Σ is the standard deviation of energies of all decoys. The assumption that energies of decoys are independent random values is validated if contacts in the alternative conformations (decoys) are distributed independently and uniformly. Physically this means that polymer connectivity that may cause correlation between contacts plays a relatively minor role, i.e., longrange (along the sequence) contacts provide dominant contribution to the energy of an alignment. This was shown to be generally true for threedimensional compact polymers (23, 24).
Assuming independence and identical distribution of contacts in the decoys, we can make a simple estimate for the average energy of alignments E_{av} and standard deviation of alignment energies Σ. These estimates were made in our earlier publication (equations 8–14 of ref. 25); here we provide the results 3 4 where C is the number of contacts in the native conformation (we assume here that all decoys have the same number of contacts) and C_{total} is the total number of possible pairwise interactions between residues (i.e., C_{total} = ∑_{i,j=1}^{L} 1≈L(L−3)/2).
The estimates (Eqs. 3 and 4) are potentially very useful because they permit evaluation of the average energy of decoys and their standard variation directly from interaction matrix U and query sequence ξ. No explicit decoys are necessary. Importantly, the Zscore that is used as a common criterion of the quality of discrimination of the native state can be estimated: 5 Because E_{av} and Σ depend only on the composition of the query sequence (and U), one can think of Z_{REM} as a “correction” for sequence composition.
However, the REM estimates for E_{av} and Σ are obtained under certain assumptions, and the validity of the Zscore must be assessed by comparison with threading calculations (see below).
The density of states (i.e., the number of decoys found in an energy range from E to E + dE) is w(E)dE = Mf(E)dE. We also define a very important threshold energy E_{c} as 6 This threshold energy E_{c} corresponds to the bottom of the continuous part of the energy spectrum when the system “runs out” of decoys. This threshold energy determines qualitatively the features of the density of states:
at E > E_{c}, the density of states w(E) is very high;
at E < E_{c}, the density of states w(E) is very low; and
the commonly accepted estimate for E_{c} is given in refs. 14 and 18, 7 In other words, according to the REM, at E > E_{c}, there are many decoys in any energy interval Σ whereas at E_{c} the system runs out of states and the spectrum becomes discrete: one can find only occasional and rare decoys with E < E_{c}.
In threading calculations, native (or nearnative) alignment is obtained with energy E_{N}. It is clear that E_{N} needs to be below E_{c} to make a successful prediction with nativestate ranking first. More specifically, we can determine within the REM approximation the probability that native alignment ranks first. To this end, we require that all M decoys have an energy higher than E_{N}. The probability of this event is 8 The analysis of Eq. 8 simplifies when the number of decoys M is large (≫1), which is always the case in threading calculations. Straightforward calculations result in 9 (see ref. 26 for more details), where 10 is the deviation of E_{N} from E_{c}, and u and α are unitless “center” and “width” of the distribution: 11 12 The parameter ɛ_{N} is also related to the predicted Zscore Z_{REM} via a simple relation: 13 The result in Eq. 9 represents an EVD (11), which is valid for a broad range of distributions f(E) that can be converted to exponential distribution by linearization of the logf(E) at large deviations from the mean (11). However, the specific expressions for parameters u and α given by Eqs. 11 and 12 are valid for Gaussian distribution f(E) (Eq. 2) and large M.
Because for any sequencestructure pair M depends only on its length, one can consider transformation from Z_{REM} to ɛ as a “correction” for the length of the template and the query sequence.
It is very instructive to examine the qualitative features of the probability distribution (Eq. 9). When the number of decoys is very large, u→1 and α→∞, which means that ɛ serves as a very good predictor of success in the threading simulations: when ɛ > 1, the native fold ranks first with a very high degree of certainty, whereas at ɛ < 1, the native fold will surely not rank first. The results become less clearcut when the number of decoys is small. It is crucial to note that one does not need to know the native fold to evaluate E_{c}. Eq. 7 suggests that the number of decoys M and the standard variance define E_{c} completely. Both M and E_{c} can either be estimated (by using Eq. 4) or obtained directly from threading calculations (see below). Importantly, the computation of E_{c} does not require costly runs of threading with “shuffled” sequences, a method widely used to estimate the statistical significance of threading (27). Given a query sequence and a potential, one can compute E_{c} and use it as a cutoff for assessing the significance of structure prediction.
Comparison with Gapless Threading
Now we compare the predictions from the REM model with the results of “gapless threading.” In gapless threading, one takes an amino acid sequence and mounts it in every possible way (without gaps) onto known protein structures of greater length. If the sequence has a length L and the structure it is mounted onto has a length L_{str}, the total number of decoys generated by gapless threading is L_{str} − L + 1. Hence a database of about 10^{3} protein structures allows generation of about 10^{5}. . . 10^{6} decoys. Decoys obtained in this way are sorted by energy. The goal is to recognize the native structure, i.e., to have it rank first in the sorted list. Importantly, gapless threading is inappropriate for real structure prediction because native structure is not present among the set of decoys, and structures similar to the native (analogues) cannot be recognized in gapless threading for the vast majority of proteins (L.A.M. and E.I.S., unpublished results). However, gapless threading is a useful tool for generating decoys and for testing the recognition abilities of the energy function.
To construct a database, we selected a representative set of nonhomologous proteins from the fssp database (28). From this set, all structures that have no coordinates for side chains and those that are longer than 500 residues were removed. The final database contained 1,011 nonhomologous proteins. We performed allagainstall recognition by gapless threading of every sequence through all proteins of greater length. Energy was computed by using an optimized potential of interactions U(σ,η) taken from ref. 25. Of 1,011 sequences, the native structure was recognized as having rank 1 for 763 (75%) proteins. The native structure had rank = 2 for 30 other proteins. For 13 sequences, a structure similar to the native (rootmeansquare deviation < 5 Å) ranked first. A small fraction of proteins in the dataset are not stable by themselves but are stabilized only in larger complexes (e.g., protein rop is tetrameric) or by metal ions, large numbers of disulfides, etc. For those proteins, gapless threading is not expected to recognize their native structure correctly based on interactions within a monomer, and it does not (see below).
First, we test the main REM assumption that energies of all decoys are normally distributed. In Fig. 1, we plotted the energy distributions of decoys for 10 proteins and fitted each of them into Gaussian distribution. Clearly, the fit is very good throughout the whole range of energies except the lowest energies, which show a deviation from Gaussian distribution in the form of a characteristic “shoulder.” Such a “shoulder” is a typical feature of the density of states of nonrandom proteinlike sequences that fold into its native conformation with low (compared with a random sequence) energy (see also figure 3 of ref. 29). The existence of this “shoulder” suggests that potentials are good enough to distinguish the native structure from the set of decoys generated in gapless threading. Furthermore, Fig. 1B confirms the condition ɛ = 1 for the boundary of the density of states for decoys that are structurally unrelated to the native state. Decoys with ɛ > 1.2 are likely to be structurally similar to the native state (see below and Fig. 4 Inset).
We then test another important REM assumption of the independence of contacts. To this end, we compare the Z_{REM} estimated in Eq. 5 based on this assumption with the Zscore obtained directly from threading: 14 where E_{av}^{gapless} and Σ^{gapless} are obtained for each protein directly from the threading calculation as average energy and its standard deviation over all decoys. Comparison of Z_{gapless} with Z_{REM} is the first test of the REM applied to fold recognition. Fig. 2 compares Z_{gapless} with Z_{REM} for every protein in the database. The correlation between the two measures is 0.95. Notice there is no bias or nonlinearity in the plot. This comparison indicates that Z_{REM} is a very good estimate of Z_{gapless}. Note that the major assumption in estimating Z_{REM} is the statistical independence of frequencies of interresidue contacts, which is fundamental for the whole REM analysis. The good correlation shown in Fig. 2 supports the applicability of the REM to threading.
The next question we address is how well the rank of the native structure (which is the measure of success in fold recognition) can be predicted by the value of ɛ [defined in Eq. 10 and computed as ɛ = Z_{gapless}/
Fig. 3 presents a number of important results. It shows that ɛ is a very good predictor of success in fold recognition, particularly:
Two distinct regions can be seen: ɛ > 1 and ɛ < 1. As expected from the REM, when ɛ < 1, the rank of the native structure is >1 in 95% of the cases, i.e., the native structure is not recognized.
When ɛ > 1, the native structure ranks first for 94% of the sequences but not for all of them. However, even when the native structure is not recognized (rank > 1) for ɛ > 1, in the vast majority of cases (41 of 43≈95%), it is located among the 10 topscoring decoys (i.e., rank ≤ 10). According to the REM, this happens when the native structure is below the bottom of the continuous spectrum (E_{N} < E_{c}) but is intermixed with rare lowenergy decoys that also have E < E_{c}. However, there are only few such decoys, and the rank of the native structure stays low.
When ɛ > 1.5, the native structure has the first rank in 99% of the cases. In 3 of 490 cases, the native structure has rank = 2, and in 1 case, it has rank = 3. Most importantly in these four cases, the topscoring decoy has a structure similar to the native one (rootmeansquare deviation < 5 Å).
The color code of squares in Fig. 3 indicates the degree of similarity between the native structure r^{N} and the structure that ranks first r^{1}. This quantity is defined as 15 the overlap between contact maps of the two structures. Clearly, when the native structure has rank = 1, then Q = 1, and the square on Fig. 3 is colored red. More important are the cases when the native structure is not recognized but a similar structure (analogue) comes with rank = 1. One would expect an analogue to have energy similar to the native one. Hence, when the native structure is intermixed with lowenergy decoys (i.e., has a low, but not the first, rank), an analogue can rank first instead. In agreement with this expectation, we observe the following:
When the native structure is not recognized but is not very high in rank (≤10) (i.e., still below the continuous part of the spectrum), then in about 10% of the cases a nativelike structure (Q ≥ 0.7) ranks first.
In the opposite case, when the rank of the native structure is high (>10) in only 1 case of 173, the decoy with the first rank has Q ≥ 0.7.
These results bring us to the conclusion that when ɛ < 1, the native structure is not recognized, and the topscoring decoy is very unlikely to have a nativelike structure.
We also notice that when ɛ > 1, the energy of the native state belongs to the discrete spectrum (see Fig. 1). However, that does not guarantee the recognition of the native structure, because an occasional lowenergy decoy belonging to the discrete spectrum can still rank first. However, the probability of finding a random decoy with energy E well below E_{c} is small (see Fig. 3 Inset).
Now we come to the central point of our study: estimating the probability of the native fold (with energy E_{N}) ranking first among its decoys.
The result is shown in Fig. 4. It can be seen that the EVD (Eq. 9) provides the perfect functional form for the observed probability P(E_{N}, rank = 1). However, there is some quantitative disagreement between the parameters of the EVD predicted by Eqs. 11 and 12 [black lines in Fig. 4 and parameters u_{fit} and α_{fit} obtained by fitting the data into the EVD (9) (green line in Fig. 4)]. Although the predicted u and fitted u_{fit} are close to each other, the discrepancy in parameter α is more substantial, close to a factor of 2. This discrepancy indicates that the actual distribution f_{decoys}(E) for the energy of the decoys deviates from the Gaussian on the tails. In fact, f_{decoys}(E) have a Gaussian form for E close to E_{av} (E − E_{av} < 3Σ) but decays exponentially at larger deviations (see Fig. 1).
However, the form of the distribution (Eq. 9) does not depend on the Gaussian form of f(E) and thus is more generally applicable. Parameter α then can be viewed as an empirical parameter or can be obtained directly from the form of the distribution of energies of decoys.
The results presented in Figs. 3 and 4 address the crucial issue of “false positives” in protein threading: a comparison of the lowestenergy predicted alignment with E_{c} makes it possible to assess with a high degree of certainty whether the threading calculation returned a nativelike structure or a “falsepositive” misfold.
It is clear from Fig. 4 that ɛ > 1.2 almost guarantees that the native structure ranks first in threading calculations. As can be seen from Fig. 4 Inset, most of the decoys at ɛ > 1.2 are native like. Parameter ɛ depends on two factors: (i) the quality of the model and the potential U (good model and precise potential provide low E_{N}); and (ii) the number of decoys M that affect ɛ via E_{c}. Having established ɛ as a powerful criterion of success in threading prediction, we can now address the next question: among how many decoys can existing models and potential functions select the native state as first ranking (30). To this end, we solve the inequality ɛ > 1.2 vis à vis the number of decoys by using the definitions of ɛ (10) and E_{c} from Eq. 7.
Fig. 5 shows the maximal number of decoys M_{max} that can be distinguished from the native state for the protein model we use. Different data points correspond to different proteins. Each protein has its own value of E_{N} and ɛ, and therefore the criterion ɛ > 1.2 determines a different limiting number of decoys for different proteins. Nevertheless, we come to an estimate that present models and potential functions can select the native state out of about 10^{12} decoys for a protein of 150… 250 amino acids.
Is this sufficient for a realistic threading calculation? Such a calculation should allow for the possibility of gaps and insertions both in the target structure and in a sequence. This is a key requirement to make threading strategy capable of recognizing analogs. The ability to recognize analogs is crucial because in real life, threading application native structure is not available, and the only hope is that there will be a structure in the database that is not identical but is similar to the native state of the query sequence.
Practical Implications for Protein Structure Prediction
Although gapless threading is used in this paper to illustrate the important points of our analysis, our results are not limited to it. Rather they may be applicable to any threading calculation for which energies of decoys can be considered as independent random values. This assumption is clearly corroborated in the present study for gapless threading (see Fig. 2). An advanced Monte Carlo threading technique that allows gaps and insertions in both sequence and target was reported recently (8). The energy landscape of decoys generated by this threading technique was analyzed with the conclusion that the REM may be adequate to describe it (8).
We showed that parameter ɛ can serve as a reliable and computationally inexpensive predictor of success in threading calculation. This parameter is related to stability gap or “T_{f}/T_{g},” which was shown in protein folding theory to be good a determinant of sequence stability and fast folding (13, 29, 31–33).
The estimate of ɛ for gapless threading is straightforward and does not require sequence reshuffling and realignment, a computationally costly procedure. The main difference in the evaluation of ɛ for realistic gapped threading comes from the fact that the length of an alignment now varies. Σ and E_{av} depend on alignment length so that the full distribution of alignment scores cannot be described by a single Gaussian. However, the analysis of gapped threading (L.A.M., W. Chen & E.I.S., unpublished work) suggests that the distribution of scores for each alignment length l can be described by its own Gaussian with alignmentlength dependent Σ(l) and E_{av}(l). Then the density of states can be generalized to gapped threading: 16 where f_{l}(E) is the Gaussian probability density corresponding to an alignment of length l with its own E_{av}(l) and Σ(l). M(l) is the number of decoys for alignments of length l. This number can be determined from combinatorics. The average energy of all alignments is determined from w(E), and E_{c} can also be found from w(E) by using Eq. 6 [where maximal Σ(l) can be used]. Then ɛ for gapped threading can be determined from Eq 10. Note that the evaluation of ɛ does not require sequence reshuffling. This provides a fast way to recognize false positives in realistic threading calculations. In cases when lowest scoring threading alignment features ɛ ≤ 1, such alignment is most likely to be a false positive that is structurally unrelated to the native state. Alignments featuring ɛ > 1.5 are most likely to be correct predictions. A more detailed quantitative estimate of the probability of correct prediction in the range 1 < ɛ < 1.5 requires evaluation of the EVD of ɛ. Parameters α and u can be determined from the fitting of distribution of ɛ for the threading of many reshuffled random sequences of various lengths into the EVD with subsequent tabulation of the results for the range of lengths. (Each reshuffled random sequence “E_{N} ” entering the definition of ɛ in Eq. 10 corresponds to the lowestscoring alignment.) These procedures will be discussed in detail in subsequent publications.
Although gaps and insertions in realistic threading calculations are crucial, their introduction comes at the cost of a serious increase in the number of decoys, to the point that even the best potentials for the present models are unable to find analogs with lowest energy E_{analog} < E_{c} (or ɛ > 1) (25). Our results suggest a constructive way to address this problem. The allowed length of an alignment and the gap penalty may be chosen in such a way that the total number of decoys M generated by threading would not exceed a “recognition threshold” M_{max} (see Fig. 5).
Generally the number of allowed gaps and insertions (and hence the number of allowed decoys) should be chosen to achieve a maximum value of ɛ. This corresponds to maximizing the probability of a correct prediction (34). Indeed, when gaps/insertions are not restricted, ɛ may be small because the number of decoys M is large. On the other hand, restrictions on gaps/insertions that are too severe may lead to elimination of nativelike conformations from the threading set of alignments leading to lower ɛ because of loss of alignments with low energy (higher apparent E_{N}). Note that achieving maximal ɛ does not reduce the number of decoys to simple minimization: their restriction carries the risk of eliminating analogs of the native structure of the threaded sequence from the ensemble. Thus the strategy of setting optimal threading simulations adjusts gap penalties/number of fragments to achieve the maximal number of decoys that still allows recognition of the native structure by existing potentials.
Acknowledgments
L.A.M. was sponsored by the Harvard Society of Fellows. A.V.F. acknowledges the support of the Howard Hughes Medical Institute International Research Scholar Award No. 75195544702 and of the Russian Basic Research Foundation Award No. 980449303.
Footnotes

↵‡ To whom reprint requests should be addressed. Email: eugene{at}belok.harvard.edu.

Article published online before print: Proc. Natl. Acad. Sci. USA, 10.1073/pnas.160271197.

Article and publication date are at www.pnas.org/cgi/doi/10.1073/pnas.160271197
Abbreviations
 EVD,
 extreme value distribution;
 REM,
 random energy model
 Received January 24, 2000.
 Accepted June 13, 2000.
 Copyright © The National Academy of Sciences
References
 ↵
 Atschul S F,
 Madden T L,
 Schaffer A A,
 Zhang J,
 Zhang Z,
 Miller W,
 Lipman D
 ↵
 ↵
 Levitt M,
 Gerstein M
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Gumbel E J
 ↵
 ↵
 Bryngelson J D,
 Wolynes P G
 ↵
 ↵

 Bryngelson J D
 ↵
 ↵
 ↵
 Finkelstein A V
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Leadbetter H,
 Lindgren M R,
 Rootzen G
 ↵
 ↵
 ↵
 Shakhnovich E I,
 Gutin A
 ↵
 ↵
 Goldstein R,
 LutheySchulten Z A,
 Wolynes P
 ↵
 ↵
Citation Manager Formats
More Articles of This Classification
Biological Sciences
Biophysics
Related Content
 No related articles found.