Vol. 95, Issue 11, 6073-6078, May 26, 1998
* MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2
2QH, United Kingdom; and § Sanger Centre, Wellcome Trust Genome
Campus, Hinxton, Cambs CB10 1SA, United Kingdom
Communicated by David R. Davies, National Institute of Diabetes,
Bethesda, MD, March 16, 1998 (received for review November 12, 1997)
ABSTRACT
Pairwise sequence comparison methods have been assessed using
proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin,
A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol.
Biol. 247, 536-540]. The evaluation tested the programs
BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410],
WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA
[Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci.
USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and
their scoring schemes. The error rate of all algorithms is greatly
reduced by using statistical scores to evaluate matches rather than
percentage identity or raw scores. The E-value statistical scores of
SSEARCH and FASTA are reliable: the number of
false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and
WU-BLAST2 exaggerate significance by
orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and
they are capable of detecting almost all relationships between proteins
whose sequence identities are >30%. For more distantly related
proteins, they do much less well; only one-half of the relationships
between proteins with 20-30% identity are found. Because many
homologs have low sequence similarity, most distant relationships
cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.
INTRODUCTION
Sequence database searching plays a role in virtually every branch
of molecular biology and is crucial for interpreting the sequences
issuing forth from genome projects. Given the method's central role,
it is surprising that overall and relative capabilities of different
procedures are largely unknown. It is difficult to verify algorithms on
sample data because this requires large data sets of proteins whose
evolutionary relationships are known unambiguously and independently of
the methods being evaluated. However, nearly all known homologs have
been identified by sequence analysis (the method to be tested). Also,
it is generally very difficult to know, in the absence of structural
data, whether two proteins that lack clear sequence similarity are
unrelated. This has meant that although previous evaluations have
helped improve sequence comparison, they have suffered from
insufficient, imperfectly characterized, or artificial test data.
Assessment also has been problematic because high quality database
sequence searching attempts to have both sensitivity (detection of
homologs) and specificity (rejection of unrelated proteins); however,
these complementary goals are linked such that increasing one causes
the other to be reduced.
Sequence comparison methodologies have evolved rapidly, so no
previously published test has evaluated modern versions of programs
commonly used. For example, parameters in BLAST (1) have
changed, and WU-BLAST2 (2), which produces gapped alignments, has
become available. The latest version of FASTA (3) previously tested
was 1.6, but the current release (version 3.0) provides fundamentally
different results in the form of statistical scoring.
The previous reports also have left gaps in our knowledge. For example,
there has been no published assessment of thresholds for scoring
schemes more sophisticated than percentage identity. Thus, the widely
discussed statistical scoring measures have never actually been
evaluated on large databases of real proteins. Moreover, the different
scoring schemes commonly in use have not been compared.
Beyond these issues, there is a more fundamental question: in an
absolute sense, how well does pairwise sequence comparison work? That
is, what fraction of homologous proteins can be detected using modern
database searching methods?
In this work, we attempt to answer these questions and to overcome both
of the fundamental difficulties that have hindered assessment of
sequence comparison methodologies. First, we use the set of distant
evolutionary relationships in the SCOP: Structural Classification of Proteins database (4), which is derived from structural and functional characteristics (5). The SCOP
database provides a uniquely reliable set of homologs, which are known independently of sequence comparison. Second, we use an assessment method that jointly measures both sensitivity and specificity. This
method allows straightforward comparison of different sequence searching procedures. Further, it can be used to aid interpretation of
real database searches and thus provide optimal and reliable results.
Previous Assessments of Sequence Comparison. Several previous
studies have examined the relative performance of different sequence
comparison methods. The most encompassing analyses have been by Pearson
(6, 7), who compared the three most commonly used programs. Of these,
the Smith-Waterman algorithm (8) implemented in SSEARCH
(3) is the oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two is
FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.
Biochemistry
Assessing sequence comparison methods with reliable structurally
identified distant evolutionary relationships
A Database for Testing Homology Detection. Since the discovery that the structures of hemoglobin and myoglobin are very similar though their sequences are not (29), it has been apparent that comparing structures is a more powerful (if less convenient) way to recognize distant evolutionary relationships than comparing sequences. If two proteins show a high degree of similarity in their structural details and function, it is very probable that they have an evolutionary relationship though their sequence similarity may be low.
The recent growth of protein structure information, combined with the comprehensive evolutionary classification in the SCOP database (4, 5), has allowed us to overcome previous limitations. With these data, we can evaluate the performance of sequence comparison methods on real protein sequences whose relationships are known confidently. The SCOP database uses structural information to recognize distant homologs, the large majority of which can be determined unambiguously. These superfamilies, such as the globins or the immunoglobulins, would be recognized as related by the vast majority of the biological community despite the lack of high sequence similarity. From SCOP, we extracted the sequences of domains of proteins in the Protein Data Bank (PDB) (30) and created two databases. One (PDB90D-B) contains domains that are all <90% identical to any other, whereas the other (PDB40D-B) contains those that are <40% identical. The databases were created by first sorting all protein domains in SCOP by their quality and making a list. The highest quality domain was selected for inclusion in the database and removed from the list. Also removed from the list (and discarded) were all other domains above the threshold level of identity to the selected domain. This process was repeated until the list was empty. The PDB40D-B database contains 1,323 domains, which have 9,044 ordered pairs of distant relationships, or 0.5% of the total 1,749,006 ordered pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships, representing 1.2% of all pairs. Low complexity regions of sequence can achieve spurious high scores, so these were masked in both databases by processing with the SEG program (27) using recommended parameters: 12 1.8 2.0. The databases used in this paper are available from http://sss.stanford.edu/sss/, and databases derived from the current version of SCOP may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
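The greedy selection used to build the two databases can be sketched as follows. This is only an illustration of the procedure as described in the text, not the authors' code: the function names, and the `quality` and `identity` callables, are hypothetical placeholders for whatever quality ranking and identity measure were actually used. The small `ordered_pairs` helper reproduces the pair totals quoted above.

```python
from typing import Callable, List, Sequence


def select_nonredundant(
    domains: Sequence[str],
    quality: Callable[[str], float],        # hypothetical quality measure
    identity: Callable[[str, str], float],  # hypothetical % identity measure
    threshold: float,
) -> List[str]:
    """Repeatedly keep the best remaining domain and discard its near-duplicates."""
    remaining = sorted(domains, key=quality, reverse=True)  # best quality first
    selected = []
    while remaining:
        best = remaining.pop(0)
        selected.append(best)
        # Keep only domains below the identity threshold to the chosen one;
        # everything at or above the threshold is discarded.
        remaining = [d for d in remaining if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct sequences."""
    return n * (n - 1)
```

With 1,323 domains, `ordered_pairs` gives the 1,749,006 ordered pairs quoted for PDB40D-B.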
Analyses from both databases were generally consistent, but
PDB40D-B focuses on
distantly related proteins and reduces the heavy overrepresentation in
the PDB of a small number of families (31, 32), whereas
PDB90D-B (with more
sequences) improves evaluations of statistics. Except where noted
otherwise, the distant homolog results here are from
PDB40D-B. Although the
precise numbers reported here are specific to the structural domain
databases used, we expect the trends to be general.
Assessment Data and Procedure. Our assessment of sequence comparison may be divided into four different major categories of tests. First, using just a single sequence comparison algorithm at a time, we evaluated the effectiveness of different scoring schemes. Second, we assessed the reliability of scoring procedures, including an evaluation of the validity of statistical scoring. Third, we compared sequence comparison algorithms (using the optimal scoring scheme) to determine their relative performance. Fourth, we examined the distribution of homologs and considered the power of pairwise sequence comparison to recognize them. All of the analyses used the databases of structurally identified homologs and a new assessment criterion.
The analyses tested BLAST (1), version 1.4.9MP, and WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA package, version 3.0t76 (3), which provided FASTA and the SSEARCH implementation of Smith-Waterman (8). For SSEARCH and FASTA, we used BLOSUM45 with gap penalties -12/-1 (7, 16). The default parameters and matrix (BLOSUM62) were used for BLAST and WU-BLAST2.
The "Coverage Vs. Error" Plot. To test a particular protocol (comprising a program and scoring scheme), each sequence from the database was used as a query to search the database. This yielded ordered pairs of query and target sequences with associated scores, which were sorted, on the basis of their scores, from best to worst. The ideal method would have perfect separation, with all of the homologs at the top of the list and unrelated proteins below. In practice, perfect separation is impossible to achieve so instead one is interested in drawing a threshold above which there are the largest number of related pairs of sequences consistent with an acceptable error rate.
Our procedure involved measuring the coverage and error for every threshold. Coverage was defined as the fraction of structurally determined homologs that have scores above the selected threshold; this reflects the sensitivity of a method. Errors per query (EPQ), an indicator of selectivity, is the number of nonhomologous pairs above the threshold divided by the number of queries. Graphs of these data, called coverage vs. error plots, were devised to understand how protocols compare at different levels of accuracy. These graphs share effectively all of the beneficial features of Receiver Operating Characteristic (ROC) plots (33, 34) but better represent the high degrees of accuracy required in sequence comparison and the huge background of nonhomologs. This assessment procedure is directly relevant to practical sequence database searching, for it provides precisely the information necessary to perform a reliable sequence database search. The EPQ measure places a premium on score consistency; that is, it requires scores to be comparable for different queries. Consistency is an aspect that has been largely ignored in previous tests but is essential for the straightforward or automatic interpretation of sequence comparison results. Further, it provides a clear indication of the confidence that should be ascribed to each match. Indeed, the EPQ measure should approximate the expectation value reported by database searching programs, if the programs' estimates are accurate.
The Performance of Scoring Schemes. All of the programs tested could provide three fundamental types of scores. The first score is the percentage identity, which may be computed in several ways based on either the length of the alignment or the lengths of the sequences. The second is a "raw" or "Smith-Waterman" score, which is the measure optimized by the Smith-Waterman algorithm and is computed by summing the substitution matrix scores for each position in the alignment and subtracting gap penalties. In BLAST, a measure related to this score is scaled into bits. Third is a statistical score based on the extreme value distribution. These results are summarized in Fig. 1.
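The coverage and errors per query (EPQ) definitions behind the coverage vs. error plot can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function and argument names are hypothetical, each hit is a `(score, is_homolog)` tuple, and higher scores are taken to be better (for E-values, where lower is better, one could pass the negated value).

```python
def coverage_vs_error(hits, n_homolog_pairs, n_queries):
    """Return (cutoff, coverage, epq) for every distinct score cutoff.

    hits: iterable of (score, is_homolog) for all query/target pairs found.
    n_homolog_pairs: total number of structurally defined homologous pairs.
    n_queries: number of query sequences searched.
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=True)  # best first
    points = []
    true_pos = false_pos = 0
    i = 0
    while i < len(ordered):
        cutoff = ordered[i][0]
        # Count every pair tied at this score before emitting a point.
        while i < len(ordered) and ordered[i][0] == cutoff:
            if ordered[i][1]:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((cutoff, true_pos / n_homolog_pairs, false_pos / n_queries))
    return points


def coverage_at_epq(points, max_epq):
    """Best coverage reachable without exceeding the given errors per query."""
    return max((cov for _, cov, epq in points if epq <= max_epq), default=0.0)
```

Reading off `coverage_at_epq(points, 0.01)` corresponds to the "coverage at a 1% EPQ" style of figure quoted in the results.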
Sequence Identity. Though it has been long established that percentage identity is a poor measure (35), there is a common rule-of-thumb stating that 30% identity signifies homology. Moreover, publications have indicated that 25% identity can be used as a threshold (17, 36). We find that these thresholds, originally derived years ago, are not supported by present results. As databases have grown, so have the possibilities for chance alignments with high identity; thus, the reported cutoffs lead to frequent errors. Fig. 2 shows one of the many pairs of proteins with very different structures that nonetheless have high levels of identity over considerable aligned regions. Despite the high identity, the raw and the statistical scores for such incorrect matches are typically not significant. The principal reasons percentage identity does so poorly seem to be that it ignores information about gaps and about the conservative or radical nature of residue substitutions.
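Because percentage identity can be computed over the alignment length or over the sequence lengths, the same alignment can yield noticeably different values. The sketch below illustrates this with a hypothetical helper; the function name, the `denominator` options, and the use of `-` as the gap character are assumptions made for the example and are not taken from any of the programs tested.

```python
def percent_identity(aligned_a, aligned_b, denominator="alignment",
                     len_a=None, len_b=None):
    """Percentage identity of two gapped, equal-length alignment strings.

    denominator: "alignment" (aligned columns), "shorter" or "longer"
    (length of the shorter or longer sequence).
    len_a, len_b: full sequence lengths; if omitted, the ungapped lengths
    of the aligned strings are used.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    # Identical residues only; a gap paired with anything is not a match.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    if len_a is None:
        len_a = len(aligned_a.replace("-", ""))
    if len_b is None:
        len_b = len(aligned_b.replace("-", ""))
    denominators = {
        "alignment": len(aligned_a),
        "shorter": min(len_a, len_b),
        "longer": max(len_a, len_b),
    }
    return 100.0 * matches / denominators[denominator]
```

For a short local alignment between two long proteins, supplying the full lengths through `len_a` and `len_b` gives a far lower figure than the per-alignment value, which is one reason a single identity cutoff is hard to interpret.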
Raw Scores. Smith-Waterman raw scores perform better than percentage identity (Fig. 1), but ln-scaling (7) provided no notable benefit in our analysis. It is necessary to be very precise when using either raw or bit scores because a 20% change in cutoff score could yield a tenfold difference in EPQ. However, it is difficult to choose appropriate thresholds because the reliability of a bit score depends on the lengths of the proteins matched and the size of the database. Raw score thresholds also are affected by matrix and gap parameters.
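As described earlier, a raw score sums the substitution matrix scores over the aligned positions and subtracts gap penalties. The sketch below scores an existing alignment in that way. It is illustrative only: the tiny matrix in the usage note is a made-up toy (not BLOSUM45 or BLOSUM62), and the gap convention (the first position of a gap costs `gap_open`, each further position costs `gap_extend`, with default magnitudes of 12 and 1) is an assumption, since programs differ in how they define these penalties.

```python
def raw_score(aligned_a, aligned_b, matrix, gap_open=12, gap_extend=1):
    """Score a given gapped alignment: substitution scores minus gap penalties.

    matrix: dict mapping residue pairs such as ("A", "C") to integer scores;
    a pair may be stored in either order.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score -= gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]
            gap_side = None
    return score
```

For example, with the toy matrix `{("A", "A"): 5, ("C", "C"): 12, ("A", "C"): -1}`, the alignment `AC--A` against `ACCAA` scores 5 + 12 - 12 - 1 + 5 = 9. Changing the matrix or the gap parameters shifts every score, which is why raw score thresholds are not portable between settings.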
Statistical Scores. Statistical scores were introduced partly to overcome the problems that arise from raw scores. This scoring scheme provides the best discrimination between homologous proteins and those which are unrelated. Most likely, its power can be attributed to its incorporation of more information than any other measure; it takes account of the full substitution and gap data (like raw scores) but also has details about the sequence lengths and composition and is scaled appropriately.
We find that statistical scores are not only powerful, but also easy to interpret. SSEARCH and FASTA show close agreement between statistical scores and actual number of errors per query (Fig. 4). The expectation value score gives a good, slightly conservative estimate of the chances of the two sequences being found at random in a given query. Thus, an E-value of 0.01 indicates that roughly one pair of nonhomologs of this similarity should be found in every 100 different queries. Neither raw scores nor percentage identity can be interpreted in this way, and these results validate the suitability of the extreme value distribution for describing the scores from a database search.
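To make the interpretation of expectation values concrete, the sketch below uses the Karlin-Altschul form of the extreme value statistics for local alignment scores (refs. 22 and 23), E = K·m·n·exp(-λS), together with P = 1 - exp(-E). This is an assumption-laden illustration: the programs assessed here estimate their statistics in their own ways, the function names are hypothetical, and the values of `lam` and `k` must come from the scoring system in use rather than from this example.

```python
import math


def expectation_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score.

    lam and k are the extreme value parameters of the scoring system;
    they are inputs here, not values taken from the paper.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance match, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def expected_false_positives(e_cutoff, n_queries):
    """Chance matches expected over many queries at a fixed E-value cutoff."""
    return e_cutoff * n_queries
```

With a cutoff of E = 0.01, `expected_false_positives(0.01, 100)` gives about one chance match per 100 queries, which is the reading of the E-value given in the text. For small E the P-value and E-value are nearly equal.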
Overall Detection of Homologs and Comparison of Algorithms. The results in Fig. 5A and Table 1 show that pairwise sequence comparison is capable of identifying only a small fraction of the homologous pairs of sequences in PDB40D-B. Even SSEARCH with E-values, the best protocol tested, could find only 18% of all relationships at a 1% EPQ. BLAST, which identifies 15%, was the worst performer, whereas FASTA ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and WU-BLAST2 are intermediate in their ability to detect homologs. Comparison of different algorithms indicates that those capable of identifying more homologs are generally slower. SSEARCH is 25 times slower than BLAST and 6.5 times slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than FASTA ktup = 2, but the latter has more interpretable scores.
CONCLUSION
The general consensus amongst experts (see refs. 7, 24, 25, 27 and references therein) suggests that the most effective sequence searches are made by (i) using a large current database in which the protein sequences have been complexity masked and (ii) using statistical scores to interpret the results. Our experiments fully support this view.
Our results also suggest two further points. First, the E-values reported by FASTA and SSEARCH give fairly accurate estimates of the significance of each match, but the P-values provided by BLAST and WU-BLAST2 underestimate the true extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA ktup = 1 perform best, though BLAST and FASTA ktup = 2 detect most of the relationships found by the best procedures and are appropriate for rapid initial searches.
The homologous proteins that are found by sequence comparison can be distinguished with high reliability from the huge number of unrelated pairs. However, even the best database searching procedures tested fail to find the large majority of distant evolutionary relationships at an acceptable error rate. Thus, if the procedures assessed here fail to find a reliable match, it does not imply that the sequence is unique; rather, it indicates that any relatives it might have are distant ones.**
ACKNOWLEDGEMENTS
The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy, and G. Mitchison for valuable discussion. S.E.B. was principally supported by a St. John's College (Cambridge, UK) Benefactors' Scholarship and by the American Friends of Cambridge University. S.E.B. dedicates his contribution to the memory of Rabbi Albert T. and Clara S. Bilgray.
FOOTNOTES
Present address: Department of Structural Biology, Stanford
University, Fairchild Building D-109, Stanford, CA 94305-5126.
To whom reprint requests should be addressed. e-mail:
brenner{at}hyper.stanford.edu.
**
Additional and updated information about this work, including
supplementary figures, may be found at
http://sss.stanford.edu/sss/.
ABBREVIATION
EPQ, errors per query.
REFERENCES
1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410.
2. Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480.
3. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448.
4. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540.
5. Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. G. (1996) Methods Enzymol. 266, 635-643.
6. Pearson, W. R. (1991) Genomics 11, 635-650.
7. Pearson, W. R. (1995) Protein Sci. 4, 1145-1160.
8. Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197.
9. George, D. G., Hunt, L. T. & Barker, W. C. (1996) Methods Enzymol. 266, 41-59.
10. Vogt, G., Etzold, T. & Argos, P. (1995) J. Mol. Biol. 249, 816-831.
11. Henikoff, S. & Henikoff, J. G. (1993) Proteins 17, 49-61.
12. Bairoch, A. & Apweiler, R. (1996) Nucleic Acids Res. 24, 21-25.
13. Bairoch, A., Bucher, P. & Hofmann, K. (1996) Nucleic Acids Res. 24, 189-196.
14. Henikoff, S. & Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.
15. Dayhoff, M., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. (National Biomedical Research Foundation, Silver Spring, MD), Vol. 5, Suppl. 3, pp. 345-352.
16. Brenner, S. E. (1996) Ph.D. thesis (University of Cambridge, UK).
17. Sander, C. & Schneider, R. (1991) Proteins 9, 56-68.
18. Johnson, M. S. & Overington, J. P. (1993) J. Mol. Biol. 233, 716-738.
19. Barton, G. J. & Sternberg, M. J. E. (1987) Protein Eng. 1, 89-94.
20. Lesk, A. M., Levitt, M. & Chothia, C. (1986) Protein Eng. 1, 77-78.
21. Arratia, R., Gordon, L. & Waterman, M. S. (1986) Ann. Stat. 14, 971-993.
22. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87, 2264-2268.
23. Karlin, S. & Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90, 5873-5877.
24. Altschul, S. F., Boguski, M. S., Gish, W. & Wootton, J. C. (1994) Nat. Genet. 6, 119-129.
25. Pearson, W. R. (1996) Methods Enzymol. 266, 227-258.
26. Lipman, D. J., Wilbur, W. J., Smith, T. F. & Waterman, M. S. (1984) Nucleic Acids Res. 12, 215-226.
27. Wootton, J. C. & Federhen, S. (1996) Methods Enzymol. 266, 554-571.
28. Waterman, M. S. & Vingron, M. (1994) Stat. Science 9, 367-381.
29. Perutz, M. F., Kendrew, J. C. & Watson, H. C. (1965) J. Mol. Biol. 13, 669-678.
30. Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. & Weng, J. (1987) in Crystallographic Databases: Information Content, Software Systems, Scientific Applications, eds. Allen, F. H., Bergerhoff, G. & Sievers, R. (Data Comm. Intl. Union Crystallogr., Cambridge, UK), pp. 107-132.
31. Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997) Curr. Opin. Struct. Biol. 7, 369-376.
32. Orengo, C., Michie, A., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. (1997) Structure (London) 5, 1093-1108.
33. Zweig, M. H. & Campbell, G. (1993) Clin. Chem. 39, 561-577.
34. Gribskov, M. & Robinson, N. L. (1996) Comput. Chem. 20, 25-33.
35. Fitch, W. M. (1966) J. Mol. Biol. 16, 9-16.
36. Chung, S. Y. & Subbiah, S. (1996) Structure (London) 4, 1123-1127.
37. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997) Nucleic Acids Res. 25, 3389-3402.
38. Girling, R., Schmidt, W., Jr., Houston, T., Amma, E. & Huisman, T. (1979) J. Mol. Biol. 131, 417-433.
39. Spezio, M., Wilson, D. & Karplus, P. (1993) Biochemistry 32, 9906-9916.
40. Sayle, R. A. & Milner-White, E. J. (1995) Trends Biochem. Sci. 20, 374-376.