A unified statistical framework for sequence comparison and structure comparison
Abstract
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., blast and fasta) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure, whereas many pairs have significant similarity in terms of structure but not sequence.
Comparison is a most fundamental operation in biology. Measuring the similarities between “things” enables us to group them in families, cluster them in trees, and infer common ancestors and an evolutionary progression. Biological comparisons can take place at many levels, from that of whole organisms to that of individual molecules. We are concerned here with the comparison on the latter level, specifically, with comparisons of individual protein sequences and structures. (For an example of systematic comparison applied to whole organisms, see refs. 1 and 2.)
Our overall aim is to describe these two types of comparisons in a self-consistent, unified framework. For sequence or structure comparison, each act of comparing one “entity” to another (that is, either comparing two sequences or two structures) involves two steps. First, the two objects are aligned optimally through the introduction of gaps in such a way as to maximize their residue-by-residue similarity. This operation generates some form of total similarity score for the number of residues matched (traditionally, a percent identity for sequences or an rms for structures, although we will use other measures). Second, one has to assess the significance of this score in the context of what is known about the proteins currently in the database.
In earlier papers, Gerstein and Levitt (3, 30) extended the work of Subbiah et al. (4) and Laurents et al. (5) and described an approach for structural alignment in an analogous fashion to the traditional approach for sequence alignment (6–9). Like sequence alignment, this method involves applying dynamic programming to a matrix of similarities between individual residues to optimize their overall correspondence through the introduction of gaps.
In this paper, we tackle the second of the two steps in protein comparison: assessing significance. We developed a simple empirical approach for calculating the significance of an alignment score based on doing an all-vs.-all comparison of the database and then curve fitting to the distribution of scores of true negatives. This allows us to express the significance of a given alignment score in terms of a P value, which is the chance that an alignment of two randomly selected proteins would obtain at least this score. We applied our approach consistently to both sequences and structures. For sequences, we could compare our fit-based P values with the differently derived statistical score from commonly used programs such as blast and fasta (10–13). The agreement we found validated our approach. For structure alignment, we followed a parallel route to derive an expression for the P value of a given alignment in terms of the structural alignment score.
Our work followed on much that recently has been done in assessing the significance of sequence and structure comparison. One of the major developments in the past few years has been the implementation of probabilistic scoring schemes (13–16). These give the significance of a match in terms of a P value rather than an absolute, “raw” score (such as percent identity). This places scores from very different programs in a common framework and provides an obvious way to set a significance cutoff (for example, at P < 0.0001, or 0.01%). P values were first used in the blast family of programs, where they are derived from an analytic model for the chance of an arbitrary ungapped alignment (10, 17). P values subsequently have been implemented in other programs, such as fasta and gapped blast, by using a somewhat different formalism (13, 18, 19).
There are currently many methods for structural alignment (20–31). Some of these are associated with probabilistic scoring schemes. In particular, one method (vast) computes a P value for an alignment based on measuring how many secondary structure elements are aligned as compared with the chance of aligning this many elements randomly (28). Another method (27, 32) expresses the significance of an alignment in terms of the number of standard deviations it scores above the mean alignment score in an all-vs.-all comparison (i.e., a Z-score).
Data Set Used for Testing.
One of the most important aspects of our analysis is that we carefully tested it against the known structural relationships. This testing allowed us to decide unambiguously whether a given comparison resulted in a true- or false-positive and to decide objectively between different statistical schemes. In particular, structures were taken from the Protein Data Bank (33, 34) and definitions of domains, structural classes, and structural similarities were taken from the Structural Classification of Proteins (scop) database (version 1.32; refs. 35–37). The creators of scop have clustered the domains in the Protein Data Bank on the basis of sequence identity (38, 39). At a sequence identity level of 40%, this clustering resulted in 941 unique sequences corresponding to the known structural domains. These 941 sequences were what we used as test data for both the sequence and structure comparisons. They contained 390 different superfamilies and 281 different folds. Because they had a considerably closer and more certain relationship than fold pairs, we concentrated here on superfamily pairs. These 2,107 nontrivial, pairwise relationships between the domains formed our set of true-positives.
Sequence Comparison Statistics.
Sequence matching was done with standard approaches. In particular, we used the ssearch implementation of the Smith–Waterman algorithm (7) [from the fasta package, version 3 (12, 40); the URL is ftp://ftp.virginia.edu/pub/fasta], with a gap-opening penalty of −12, a gap-extension penalty of −2, and the blosum50 substitution matrix [which has a maximal match score of 13 (for C to C) and an average match score of −0.36].
A probability–density function for sequence–comparison scores.
Each pairwise sequence comparison was best quantified by three numbers, S_{seq}, n, and m, where S_{seq} is the raw sequence alignment score and n and m are the lengths of the two sequences compared. Comparing all possible pairs of sequences allowed us to calculate an observed probability density, ρ^{o}_{seq}, for the chance of finding a pair of sequences with particular values for S_{seq} and ln(nm). Fig. 1A shows the density for pairs between all sequences. This includes the scores for ≈300 sequence pairs that are related closely, which clearly show up as “spots” on the right side of the plot. These high-scoring “true-positives” are removed in Fig. 1B, which shows the density for just the pairs in different structural classes (42), i.e., the pairs that definitely are unrelated. This is the density distribution that we aim to fit.
Fig. 2A shows the density distribution as a function of S_{seq} for sections at constant ln(nm). The clear linear relationship between log(ρ_{seq}^{o}) and S_{seq} at high values of S_{seq} is indicative of an extreme-value distribution, so we fit the observed density with the calculated density ρ_{seq}^{c}(Z) = exp(−Z − exp(−Z)). The variable Z was defined in terms of S_{seq} and ln(nm) by using the “Z-score-like” expression Z = (S_{seq} − μ_{seq})/σ_{seq}, where μ_{seq} = a ln(nm) + b and σ_{seq} = a are the most likely sequence score and width parameter for the distribution. The two adjustable parameters a and b were obtained by fitting the calculated density ρ_{seq}^{c}(Z) to the observed density ρ_{seq}^{o}(Z) for all values of S_{seq} and ln(nm). Substituting for μ_{seq} and σ_{seq} in the expression for Z above gave Z = (S_{seq} − a ln(nm) − b)/a = S_{seq}/a − ln(nm) − b/a.
To derive specific values for the a and b parameters, we fit the above formulas to the observed density distribution obtained by comparing pairs in different scop classes, getting a = 5.84 and b = −26.3. The fit was done by least-squares optimization by using the simplex minimizer in matlab (Math Works, Natick, MA). It has a residual of 0.084, which was calculated by using the standard relation r = Σ w_{i}(O_{i} − C_{i})^{2}/Σ w_{i}(O_{i})^{2}, where i indexes “bins” with particular S_{seq} and ln(nm) values, O_{i} = log(ρ_{seq}^{o}(Z_{i})) is the observed density in a bin, C_{i} = log(ρ_{seq}^{c}(Z_{i})) is the calculated density in a bin, w_{i} = 1/N_{i} is a weighting factor, N_{i} is the number of sequence pairs in a bin, and the summation is over all bins, i, with ln(nm) between 5.9 and 13.5.
A cumulative sequence distribution function, giving the P value.
To estimate the statistical significance of a particular comparison in terms of particular S_{seq}, n, and m values, we needed the cumulative distribution function P_{seq}(z > Z), which is defined as the probability that matching any two random sequences will give a z value greater than or equal to Z. This is just the integral of ρ_{seq}^{c}(z) = exp(−z − exp(−z)) = exp(−z) exp(−exp(−z)), from z = Z to z = ∞, so that P_{seq}(z > Z) = 1 − exp(−exp(−Z)). Writing Z in terms of S_{seq}, n, and m gives P_{seq}(s > S_{seq}) = 1 − exp(−exp(−S_{seq}/a + ln(nm) + b/a)), where the parameters a and b are given above.
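As an illustration of the formula above, the sequence P value can be sketched in a few lines of Python. This is a minimal sketch rather than the code used in the study: the function names `seq_z` and `seq_p_value` are hypothetical, and the only inputs taken from the text are the fitted values a = 5.84 and b = −26.3 and the expressions for Z and P_{seq}.

```python
import math

# Fitted parameters quoted in the text (sequence comparison).
A_SEQ = 5.84
B_SEQ = -26.3


def seq_z(s_seq, n, m, a=A_SEQ, b=B_SEQ):
    """Return Z = (S_seq - a*ln(nm) - b) / a for sequence lengths n and m."""
    return (s_seq - a * math.log(n * m) - b) / a


def seq_p_value(s_seq, n, m, a=A_SEQ, b=B_SEQ):
    """Return P(z > Z) = 1 - exp(-exp(-Z)) for the extreme-value density."""
    z = seq_z(s_seq, n, m, a, b)
    if z < -700.0:
        # exp(-z) would overflow a double; P is 1 to machine precision here.
        return 1.0
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is very small.
    return -math.expm1(-math.exp(-z))
```

A score equal to the most likely score, S_{seq} = a ln(nm) + b, gives Z = 0 and therefore P = 1 − exp(−1), about 0.63; higher scores give rapidly smaller P values.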
Relation to blast P value.
For sequence comparison without gaps, Karlin and Altschul (10, 11) derived the following cumulative distribution function: P_{K&A}(s > S_{seq}) = 1 − exp(−exp(−λ(S_{seq} − ln(Kmn)/λ))) = 1 − exp(−exp(−λS_{seq} + ln(Kmn))), where λ and K are calculated analytically based on the sequence composition and amino acid scoring matrix. Comparison of their analytical form with our P value expression shows that λ = 1/a and K = exp(b/a). Substituting the specific values for a and b that we calculated from the fit, we found that λ = 0.171 and K = 0.011. For the particular database sequences and amino acid scoring matrix used here, the values for λ calculated by Karlin and Altschul’s formula ranged from 0.217 to 0.259, all somewhat larger than our value for λ.
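The correspondence λ = 1/a and K = exp(b/a) can be checked numerically. The short Python sketch below is illustrative only; the function names `p_fit` and `p_karlin_altschul` are hypothetical, and the Karlin–Altschul form is written as 1 − exp(−Kmn exp(−λS)), which is algebraically the same as the expression in the text.

```python
import math

a, b = 5.84, -26.3          # fitted values quoted in the text

lam = 1.0 / a               # lambda implied by the fit
K = math.exp(b / a)         # K implied by the fit


def p_fit(s, n, m):
    """Fitted form: 1 - exp(-exp(-Z)) with Z = S/a - ln(nm) - b/a."""
    z = s / a - math.log(n * m) - b / a
    return 1.0 - math.exp(-math.exp(-z))


def p_karlin_altschul(s, n, m):
    """Karlin-Altschul form: 1 - exp(-K*m*n*exp(-lambda*S))."""
    return 1.0 - math.exp(-K * m * n * math.exp(-lam * s))


print(f"lambda = {lam:.3f}, K = {K:.3f}")   # lambda = 0.171, K = 0.011
```

With these substitutions the two functions return the same value for any score and pair of lengths, up to floating-point rounding.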
Relation to fasta E value.
In the fasta sequence comparison programs (12, 13, 18), the significance of a given alignment score S_{fa} is estimated by fitting an extreme-value distribution to scores resulting from comparison of a given query sequence to each sequence in the database. The distribution is recomputed for each new query so that, unlike our approach, each query sequence is associated with a different distribution function. This type of association has the advantage of allowing for any peculiarities of the query sequence (e.g., composition bias), but it also means that one cannot estimate the significance of a single pairwise comparison of two sequences.
The value used by fasta in judging the significance of a sequence similarity is known as the expectation value or E value (here E_{fa}). The P value, defined above, gives the statistical significance of a single comparison, whereas the E value is an estimate of the expected number of false-positives (dissimilar matches with a significant score) for a search of the entire database. With N_{db} entries in the database, the E value E_{seq} is calculated from our P_{seq}(s > S_{seq}) as E_{seq} = N_{db} P_{seq}. The E values we obtained were very similar to those found by fasta over a very wide range of values (Fig. 3). When one considers that our closed-form E_{seq} depends on only two parameters for all pairs whereas E_{fa} is optimized separately for each query sequence (941 × 2 = 1,882 parameters in all), this agreement is astonishing.
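The conversion from a P value to an E value is a single multiplication, sketched below in Python for concreteness. The function names and the example P value are hypothetical; the database size of 941 and the 0.01 cutoff (1% error per query) are the values used in this paper.

```python
import math


def e_value(p_value, n_db):
    """Expected number of chance matches in a search of n_db entries."""
    return n_db * p_value


def is_significant(p_value, n_db, e_cutoff=0.01):
    """True if the E value is below the cutoff (0.01 is 1% error per query)."""
    return e_value(p_value, n_db) < e_cutoff


# Example with the 941-domain test set used in the text.
N_DB = 941
p = 1.0e-6                               # hypothetical pairwise P value
print(math.log10(e_value(p, N_DB)))      # log10 of the E value
```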
Measuring coverage vs. error rate to compare different formalisms for significance statistics.
We have presented two forms of E value statistics for sequence comparison: our method, E_{seq}, which is based on fitting a twoparameter model to the observed distribution of alignment scores; and the fasta method E_{fa}, which is based on fitting different distributions for each query. Now we naturally are led to ask whether there is an objective way to decide which formalism performs the best on some representative test data.
The seminal work of Brenner et al. (39) and Brenner (43) provides a framework for such an assessment by using the known true-positives in the scop database and a coverage-vs.-error plot. To compare any two significance-statistics formalisms, we proceeded as follows for each:
(i) For each of the pairs in the all-vs.-all comparison (941 × 940 pairs), we determined an E value and noted whether the pair was a true-positive or true-negative (for true-positives, both sequences must belong to protein domains in the same superfamily in the scop classification). (ii) We sorted the pairs by increasing E value. (iii) We counted down the list from best to worst until the number of false-positives was 1% of the total number of database entries (here, this was 9 false-positives, which is ≈1% of 941). (iv) We got the threshold E value at this point, which ideally should be close to 0.01, so as to correspond to the 1% error rate per query. (v) Finally, we got the number of entries that were more significant than the threshold E value; this number defined the coverage, which should be as large as possible.
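The steps above can be sketched as a short Python function. This is an illustrative sketch under stated assumptions, not the code used in the study: the name `coverage_at_error` is hypothetical, every pair that is not a true-positive is treated as a false-positive, coverage is counted as the number of true-positives ranked ahead of the threshold, and ties in E value are left in whatever order the sort produces.

```python
def coverage_at_error(pairs, max_false_positives):
    """Return (threshold_e, coverage) for a list of scored pairs.

    pairs: iterable of (e_value, is_true_positive) tuples.
    max_false_positives: number of false-positives that defines the
        threshold (9 for 1% of 941 database entries).
    """
    ranked = sorted(pairs, key=lambda pair: pair[0])  # best (smallest E) first
    false_positives = 0
    coverage = 0
    for e_value, is_true_positive in ranked:
        if is_true_positive:
            coverage += 1
        else:
            false_positives += 1
            if false_positives == max_false_positives:
                # Threshold E value reached; coverage counts the
                # true-positives that scored better than this point.
                return e_value, coverage
    # The list ran out before the allowed number of false-positives.
    return None, coverage


# Toy example with made-up E values.
toy = [(1e-10, True), (1e-8, True), (1e-5, False),
       (1e-4, True), (1e-3, False), (1e-2, True)]
print(coverage_at_error(toy, 2))   # (0.001, 3)
```

Running the same function on the E values from two different formalisms gives a threshold and a coverage for each, which is the comparison made in the text.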
Here, we compared the coverage and error rate of our sequence score statistics with those of fasta (E_{seq} vs. E_{fa}). At the threshold E value, our sequence statistics had log E_{seq} = −1.98 and a coverage of 328, and the fasta statistics had a log E_{fa} of −1.68 and a coverage of 379. The fasta statistics had better coverage, but our statistics had an almost perfect threshold value, which should be −2 for 1% error rate.
Structure Comparison Statistics.
The procedure we used for pairwise structural alignment is described in detail in Gerstein and Levitt (3, 30) and is summarized only briefly here. Our core method was based on iterative application of dynamic programming. As such, it was a simple application of the Needleman–Wunsch sequence alignment (6). It originally was derived from the align program of Cohen (21, 31), with many subsequent refinements. One starts with two structures in an arbitrary orientation. Then one computes all pairwise distances between every atom in the first structure and every atom in the second, which results in an interprotein distance matrix in which each entry, d_{ij}, corresponds to the distance between residue i in the first structure and residue j in the second (interresidue distances usually are expressed between α-carbons). This distance matrix, d_{ij}, can be converted into a similarity matrix, S_{ij}, through the relationship S_{ij} = M/(1 + (d_{ij}/d_{o})^{2}), where M = 20 and d_{o} = 5 Å.
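The conversion from coordinates to the similarity matrix can be sketched with NumPy. This is a minimal illustration, not the original implementation: the function name `similarity_matrix` is hypothetical, the inputs are assumed to be arrays of α-carbon coordinates in Å, and M = 20 and d_{o} = 5 Å are the values given above.

```python
import numpy as np

M = 20.0    # maximum similarity, from the text
D0 = 5.0    # distance scale in angstroms, from the text


def similarity_matrix(coords_a, coords_b, m=M, d0=D0):
    """Return S_ij = M / (1 + (d_ij/d0)^2) for two sets of 3D coordinates.

    coords_a has shape (n_a, 3) and coords_b has shape (n_b, 3);
    the result has shape (n_a, n_b).
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    # Broadcasting builds every pairwise difference vector at once.
    diff = a[:, None, :] - b[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return m / (1.0 + (d / d0) ** 2)
```

Two superimposed atoms score M = 20, atoms 5 Å apart score 10, and the score falls toward zero as the distance grows.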
One applies dynamic programming to the similarity matrix to get equivalences (using a gap-opening penalty of M/2 = 10 and no gap-extension penalty) and uses them to least-squares fit the first structure onto the second one (44). Then one repeats the procedure, finding all pairwise distances and doing dynamic programming to get new equivalences, until the process converges. After an alignment is determined, it can be “refined” by eliminating the worst-fitting pairs of aligned residues and then refitting to get a new rms in a similar fashion to the core-finding procedure in Gerstein and Altman (45, 46). This refinement is necessary because the dynamic programming used tries to match as many residues as possible. (It is a global, as opposed to local, method.)
The structural comparison score and the rms.
At the end of the procedure, we were left with a number of scores characterizing our final alignment. The score optimized by dynamic programming was the sum of the similarity matrix scores S_{ij} minus the total penalty for opening gaps. We refer to this as “S_{str}.” To be more explicit, it was computed from the following formula: S_{str} = M (Σ_{ij} 1/(1 + (d_{ij}/d_{o})^{2}) − N_{gap}/2), where N_{gap} is the total number of gaps (not including gaps at the end of a chain) and the summation is carried out over all pairs, ij, of equivalenced residues. The more traditional score is the rms deviation in α-carbon position after doing a least-squares fit on the aligned atoms (the “rms”). rms-based statistics were used in our earlier work (for example, refs. 3–5) and have been used in almost all other work in structural alignment.
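Because S_{str} is the sum of the similarity scores of the aligned pairs minus a penalty of M/2 per opened gap, it can be computed from any list of aligned-pair distances. The Python sketch below illustrates this; the function name `structural_alignment_score` is hypothetical, and M = 20 and d_{o} = 5 Å are the values stated earlier in the text.

```python
def structural_alignment_score(distances, n_gap, m=20.0, d0=5.0):
    """Return S_str for one alignment.

    distances: distances in angstroms between equivalenced residue pairs
        after superposition.
    n_gap: number of internal gaps (end gaps are not counted).
    """
    similarity_sum = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    # Each opened gap costs M/2; there is no gap-extension penalty.
    return m * (similarity_sum - n_gap / 2.0)


# Two aligned pairs (0 and 5 angstroms apart) and one gap: 20 + 10 - 10.
print(structural_alignment_score([0.0, 5.0], n_gap=1))   # 20.0
```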
A probability–density function for structural alignment scores.
To derive significance statistics for the structural alignment score S_{str}, we proceeded exactly as we did for sequence comparison. Structural alignment of all pairs in the database gave us an observed probability distribution for comparison scores ρ_{str}^{o}, which was a function of the number of residues matched N and the comparison score S_{str} (Fig. 4A). This distribution contained the many pairs of structures that were similar, and these pairs stood out with high S_{str} scores. Fig. 4B shows data for pairs that were in different scop structural classes and, therefore, should not have had any structural similarity. Fig. 4B is much “cleaner” than Fig. 4A and shows the underlying distribution expected for the comparison of structures that are not similar.
Fig. 2B shows the density distribution as a function of S_{str} for sections at constant N. There is a close parallel between the structural alignment score S_{str} and the sequence alignment score, S_{seq}, in Fig. 2A, and both can be modeled by an extreme-value distribution. Thus, we fit the calculated structure density by ρ_{str}^{c}(Z) = exp(−Z − exp(−Z)), where the variable Z is defined in terms of S_{str} and N by using Z = (S_{str} − μ_{str})/σ_{str}. The most likely structure score μ_{str} and the width parameter σ_{str} have a more complicated dependence on sequence length N than was the case for sequences: μ_{str}(N) = c ln(N)^{2} + d ln(N) + e (if N < 120), μ_{str}(N) = a ln(N) + b (if N ≥ 120) and σ_{str}(N) = f ln(N) + g (if N < 120), σ_{str}(N) = f ln(120) + g (if N ≥ 120).
Continuity of function values and slopes allows a and b to be written in terms of c, d, and e. To be more specific, at N = 120, a ln(N) + b = c ln(N)^{2} + d ln(N) + e and a = 2c ln(N) + d. Thus, the expressions for μ_{str}(N) and σ_{str}(N) involve five independent parameters: c, d, e, f, and g. We determined these five parameters via least-squares optimization by using the simplex minimizer in matlab, which yielded c = 18.4, d = −4.50, e = 2.64, f = 21.4, and g = −37.5 (a = 171.8 and b = −419.3 were derived as described above). The residual was 0.288. It was given by the same formula as was used for the residual in the sequence statistics fit with O_{i} = ρ_{str}^{o}(Z_{i}), C_{i} = ρ_{str}^{c}(Z_{i}) and w_{i} = 1, and the summation was over bins with any value of S_{str} and N between 30 and 170 residues. The resulting fit of the observed and calculated distribution (Fig. 2B) was good for all values of N and S_{str}.
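The two continuity conditions determine the linear branch directly, as the following Python sketch shows. It is an illustration only; the function name `linear_branch_from_quadratic` is hypothetical, and the slope is taken with respect to ln(N), as in the condition a = 2c ln(N) + d.

```python
import math


def linear_branch_from_quadratic(c, d, e, n_join):
    """Return (a, b) so that a*ln(N) + b joins c*ln(N)^2 + d*ln(N) + e
    with equal value and equal slope (in ln N) at N = n_join."""
    ln_n = math.log(n_join)
    a = 2.0 * c * ln_n + d                          # slope continuity
    b = c * ln_n ** 2 + d * ln_n + e - a * ln_n     # value continuity
    return a, b


# Structure-score parameters quoted in the text, joined at N = 120.
a, b = linear_branch_from_quadratic(18.4, -4.50, 2.64, 120)
print(round(a, 1), round(b, 1))
```

With the rounded values of c, d, and e quoted in the text, this gives a of about 171.7 and b of about −419.1, which agrees with the magnitudes reported in the text to within the rounding of the fitted parameters.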
A cumulative structure distribution function, giving the P value.
To estimate the statistical significance of a particular structure comparison in terms of its S_{str} and N values, we proceeded as we did for sequence comparison. We integrated the score distribution to determine a cumulative distribution function P_{str}, defined as the probability that matching two random structures will give a z value greater than or equal to Z. The structure score distribution has the same extreme-value form as the sequence score distribution, so the derivation of P_{str} follows that of P_{seq}, with P_{str}(z > Z) = 1 − exp[−exp(−Z)], where Z is expressed in terms of S_{str} and N by using Z = (S_{str} − c ln(N)^{2} − d ln(N) − e)/(f ln(N) + g) (if N < 120) and Z = (S_{str} − a ln(N) − b)/(f ln(120) + g) (if N ≥ 120), and the seven parameters a, b, c, d, e, f, and g are given above.
Structural comparison statistics based on rms.
The traditional characterization of a structural alignment is in terms of the number of residues matched, N, and the rms deviation from fitting these matched residues, R. It is convenient to focus on ln(R), which ensures that there is good separation of values for small R, where the significant pairs occur. We calculated a probability distribution ρ_{rms}^{o}[ln(R),N] for the observed rms values of true-negative pairs in the same fashion as we did earlier for the observed distribution of structural alignment scores ρ_{str}^{o}(S_{str},N).
The fact that log(ρ_{rms}^{o}) varies very slowly with ln(R) near the maximum (Fig. 5) led us to fit the calculated density by using ρ_{rms}^{c}(Z) = exp(−Z^{4}), where Z is defined in terms of ln(R) and N as Z = (ln(R) − μ_{rms}(N))/σ_{rms}(N), with μ_{rms}(N) = c ln(N)^{2} + d ln(N) + e (if N < 60), μ_{rms}(N) = a ln(N) + b (if N ≥ 60) and σ_{rms}(N) = f ln(N) + g (if N < 60), σ_{rms}(N) = f ln(60) + g (if N ≥ 60). The values of the five independent parameters c, d, e, f, and g were determined by least-squares optimization by using the simplex minimizer in matlab, which yielded c = 0.155, d = −0.619, e = 1.73, f = 0.0922, and g = 0.212. (a = 0.650 and b = −0.872 were determined as before to ensure continuity.)
To estimate the statistical significance of a particular comparison in terms of its R and N values, we derived a cumulative distribution function P_{rms}(z < Z), defined as the probability that any z will be less than or equal to a given Z (for rms, smaller values are more significant). This was just the integral of ρ^{c}_{rms}(z) from z = −∞ to z = Z. Because the function exp(−z^{4}) cannot be integrated analytically, we integrated it numerically for z from −5 to Z and tabulated its value for 10,000 different Z values from −5 to 5.
Comparing structure comparison statistics: Alignment score S_{str} vs. rms.
Once we had derived structure comparison statistics based on structural alignment score S_{str} and rms, we could compare them. The same coverage-vs.-error scheme used above to compare the two formulae for sequence alignment significance could be used again here. When assessed in terms of coverage (number of true-positives found) at a given error rate on our test data, the E value statistics based on S_{str} gave a much better performance (i.e., had a larger coverage) than those based on rms. To be more specific, we compared the two approaches (E_{str} vs. E_{rms}) in exactly the same way that we previously had compared our sequence E value to that produced by fasta (E_{seq} vs. E_{fa}). We found that, at the 1% error threshold, the rms-based statistics have log(E_{rms}) = −32.8 and a coverage of 202 whereas the structural-alignment score statistics have log(E_{str}) = −1.58 and a coverage of 627. Clearly, the statistics based on S_{str} perform much better because the threshold is much more reliable (i.e., closer to the value of −2 for an error rate of 1%) and the true-positive coverage is >3-fold higher. The difference between E_{str} and E_{rms} is striking and confirms that the structure score is much better than the rms score.
There are other reasons why the structural alignment score S_{str} is a more reliable indicator than rms: (i) S_{str} depends most strongly on the best-fitting atoms whereas rms depends most on the worst-fitting atoms; (ii) S_{str} penalizes gaps, whereas rms does not; and (iii) S_{str} is formally analogous to the score one gets from a standard sequence comparison, S_{seq}, because both quantities are derived from a “dynamic-programming” similarity matrix. As dynamic programming finds a maximum score over many possible alignments, it is reasonable that both S_{str} and S_{seq} should follow an extreme-value distribution. However, this is not a trivial result, as the scores are not independent, random variables whose maximum must follow such a distribution.
Relationship Between Sequence Comparison and Structure Comparison.
Having derived sequence and structure significance scores by using all-vs.-all comparisons on the same database of 941 sequences and structures, we were in a position to compare directly structure and sequence significance scores. Fig. 6 shows such a comparison for the 2,107 pairs of proteins in our data set that are considered to be related evolutionarily according to scop (i.e., they are the true-positives in the same superfamily). The lines at log(E_{seq}) = −2 and at log(E_{str}) = −2 divide the 2,107 true-positive pairs among four quadrants, depending on whether their sequence or structure matches are significant, as follows:
Top right (1,204 pairs; nonsignificant sequence match, nonsignificant structure match).
Over half (1,204 of 2,107) of the pairs of domains thought to be evolutionarily related by scop fall into this category of having no significant match, indicating that the combination of manual measures used in scop is more sensitive than either automatic sequence or structure comparison.
Lower left (244 pairs; significant sequence match, significant structure match).
These pairs are evenly distributed in the lower left quadrant, indicating that the sequence and structure significance scores are on the same scale.
Lower right (576 pairs; nonsignificant sequence match, significant structure match).
There are many more pairs with good structure matches but without sequence matches than the converse (sequence match but no structure match). This fact objectively shows how structure is conserved more than sequence in evolution. These 576 pairs are very good test cases for threading algorithms that match a sequence to a structure, and we currently are testing them in this way.
Top left (83 pairs; significant sequence match, nonsignificant structure match).
Almost all of the pairs (70 of 83) in this category involve matches with a small number of residues (N < 70). For such short matches, the structures may be deformed and may not match well. There are seven labeled pairs that are exceptions because the match is extensive (N > 70), but the pairs structurally are less similar than would be expected from the strong sequence match. These seven exceptions involve 11 coordinate sets. Three of these sets were solved by x-ray crystallography to only medium resolution (>2.9 Å; 1mys, 1scm, and 1tlk), five were solved by NMR (1prr, 1ntr, 2pld, 2pna, and 1tnm), and three are high-resolution x-ray structures (better than 1.7 Å for 1osa, 3chy, and 1sha). None of the seven exceptional pairs involved two high-resolution structures, and it seems likely that some of the seven exceptions would have had a more significant structural match if both structures in the pair were determined to a high resolution. Furthermore, as determined from consultation of a Database of Macromolecular Movements (ref. 47; see database at http://bioinfo.mbb.yale.edu/MolMovDB), some of the seven exceptions involved proteins that had been solved in different conformational states. In particular, 1osa, 1mys, and 1scm involved proteins with the highly flexible calmodulin fold. These are clearly examples for which one would expect sequence similarity but structural differences.
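The four-way division described above amounts to two threshold tests, which can be sketched in Python as follows. The function name `quadrant` and the label strings are hypothetical, and treating a value exactly equal to the cutoff as nonsignificant is an assumption made here; the cutoff of −2 on both log E values is the one used in the text.

```python
def quadrant(log_e_seq, log_e_str, cutoff=-2.0):
    """Classify a pair by the significance of its sequence and structure
    matches, using the plot layout described in the text (sequence on
    the horizontal axis, structure on the vertical axis)."""
    seq_significant = log_e_seq < cutoff
    str_significant = log_e_str < cutoff
    if seq_significant and str_significant:
        return "lower left"      # both matches significant
    if str_significant:
        return "lower right"     # structure only
    if seq_significant:
        return "top left"        # sequence only
    return "top right"           # neither match significant


print(quadrant(-0.5, -6.0))   # lower right
```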
DISCUSSION AND CONCLUSION
Summary.
We have presented an approach for assessing in a unified statistical framework the significance of a given comparison of proteins, whether involving sequences or structures. For either sequence or structure we fit an extreme-value distribution to the observed distribution obtained from the all-vs.-all comparison of the database (i.e., between pairs of scop domains in different structural classes). For sequence comparison, this extreme-value distribution is as expected: We empirically observed for gapped alignments what Karlin and Altschul (11) derived for ungapped ones. We also gave a simple formula for the E value that is likely to be useful for pairwise comparisons without involving searches of the entire database.
For structure comparison, we found that the score distribution follows an extreme-value distribution when expressed in terms of the structural alignment score S_{str}. By using this measure, expressions for statistical significance can be formulated in an almost identical way for structure as they are for sequence. It is important to realize that, although S_{str} is produced naturally by our specific alignment method, it can be calculated from any arbitrary structural alignment. Thus, by using our formulas, a significance can be computed from the results of any structural alignment program. Using the more traditional rms deviation as a score does not lead to as reliable a measure of structural significance.
In connection with this, it is interesting that recent work (39, 43) indicates that the significance statistics based on optimized “sum” scores from dynamic programming (i.e., Smith–Waterman scores, which are essentially sums of blosum matrix values minus gap penalties) perform much better than those based on the traditional measure of sequence similarity, percentage identity. This parallels the poor performance of our structural alignment statistics based on the traditional rms. It is disconcerting that such well-established and intuitive measures as percentage identity or rms perform so much worse than the statistical measures based on the sequence or structure alignment scores.
Furthermore, it is surprising that over half of the relationships between distant homologues in scop were not statistically significant (at a rate of 1% error per query) using either pure sequence comparison or pure structure comparison. Almost all of the pairs found by sequence comparison were found by structure comparison, but there were many pairs found by structure comparison that were not found by sequence comparison. Overall, structural comparison was able to detect about twice as many of the scop distant homology superfamily pairs as sequence comparison (at the same rate of error).
Future Directions.
The approach we have used to derive statistical significance easily could be generalized to other contexts. In particular, it can be adapted to provide significance statistics for threading. We have not presented a detailed examination of the significance values for specific pairs of sequences or structures. Such an examination could prove to be a useful endeavor in the future, particularly if it focused on pairs of proteins with the same fold but insignificant E values and those with different folds but significant E values. These two classes of pairs characterize the twilight zone for structure, which has yet to be described fully.
Acknowledgments
We thank S. E. Brenner for carefully reading the manuscript and S. E. Brenner and T. Hubbard for providing the pdb40d1.32 database. M.G. acknowledges the National Science Foundation for support (Grant DBI9723182), and M.L. acknowledges the Department of Energy (Grant DEFG0395ER62135).
Footnotes

↵† To whom reprint requests should be addressed. email: michael.levitt{at}stanford.edu.

This paper was presented at the colloquium “Computational Biomolecular Science,” organized by Russell Doolittle, J. Andrew McCammon, and Peter G. Wolynes, held September 11–13, 1997, sponsored by the National Academy of Sciences at the Arnold and Mabel Beckman Center in Irvine, CA.
ABBREVIATION
scop, Structural Classification of Proteins
 Copyright © 1998, The National Academy of Sciences
References
 ↵
 ↵
 Bookstein F L
 ↵
 ↵
 ↵
 ↵
 Needleman S B,
 Wunsch C D
 ↵

 Doolittle R F
 ↵
 Gribskov M,
 Devereux J
 ↵
 Karlin S,
 Altschul S F
 ↵
 Karlin S,
 Altschul S F
 ↵
 Lipman D J,
 Pearson W R
 ↵
 ↵
 ↵
 ↵
 Pearson W R
 ↵
 Altschul S F,
 Madden T L,
 Schaffer A A,
 Zhang J,
 Zhang Z,
 Miller W,
 Lipman D J
 ↵
 ↵
 Satow Y,
 Cohen G H,
 Padlan E A,
 Davies D R

 Artymiuk P J,
 Mitchell E M,
 Rice D W,
 Willett P
 ↵
 ↵
 ↵
 Gerstein M,
 Levitt M
 ↵
 Cohen G H
 ↵
 Holm L,
 Sander C
 ↵
 ↵
 ↵
 ↵
 Hubbard T J P,
 Murzin A G,
 Brenner S E,
 Chothia C
 ↵
 ↵
 Brenner S,
 Chothia C,
 Hubbard T
 ↵
 Pearson W R,
 Lipman D J

 Henikoff S,
 Henikoff J G
 ↵
 ↵
 Brenner S E
 ↵
 ↵
 Gerstein M,
 Altman R
 ↵
 ↵