Previous Article |
Table of Contents
| Next Article
EVOLUTION
Comparative homology agreement search: An effective combination of homology-search methods



, ¶, ||
International NRW Graduate School in Bioinformatics and Genome Research, Center of Biotechnology, Bielefeld University, 33615 Bielefeld, Germany;
Max Planck Institute for Mathematics in the Sciences, Inselstrasse 22-26, 04103 Leipzig, Germany;
Department of Medicine, AG Bioinformatics, Domagkstrasse 3, 48149 Muenster, Germany; and ¶Division of Bioinformatics, Department of Biology, University of Muenster, Schlossplatz 4, 48149 Muenster, Germany
Communicated by Walter M. Fitch, University of California, Irvine, CA, August 2, 2004 (received for review February 12, 2004)
| Abstract |
|---|
|
|
|---|
In a pairwise search, a query sequence is compared to any database sequence, yielding a confidence estimate that is supposed to indicate the probability of finding a comparably similar sequence (of the same size) in a database of random sequences. The comparison is done for every sequence in the database, and the sequences with highest confidence ("hits") are reported. The most popular pairwise-search tool is BLAST (1).
Simple profile searches make use of position-specific scoring statistics and are usually more sensitive than pairwise comparisons. The introduction of hidden Markov models (HMMs) seems to provide a firmer statistical basis for profile search. The majority of currently available profile tools use HMMs (for example, the HMMER package) (6).
Kinship between protein sequences can also lead to (and thus be recognized by) the occurrence of particular amino acid motifs (also known as patterns, signatures, or fingerprints) that were conserved throughout the evolution of the protein family in question and are believed to correlate with specific structural features and function. Motif analysis, therefore, can also be used for identifying new members of a protein family (710). Motifs are the backbone of homology-search methods such as PHI-BLAST (5). In contrast to profiles, motifs are usually short, they include a short stretch of very specific amino acids deemed relevant for function, and they are denoted by specific regular expressions.
In this study, we will show that the overall performance of homology searches can be improved if these methods are combined appropriately. The combination of methods is an advanced form of a metastudy. Important medical questions are typically studied more than once, and a metastudy compiles and analyzes the results of all relevant studies. INTERPRO (11) and METAFAM (12) present such compilations in protein-family research. Combining methods directly to generate a consensus result is also common practice in some areas of bioinformatics. Two algorithms that combine different methods are PCONS (13) (for fold recognition) and JPRED (14) (for secondary structure prediction). They improve the accuracy of results considerably.
Here we restricted ourselves to combining the following five homology-search methods: HMMSEARCH (6), TREESEARCH (15), position-specific iterated BLAST (PSI-BLAST) (16), PHI-BLAST (5), and motif alignment and search tool (MAST) (17). All of them can use a collection of sequences as search input and report a confidence estimate for each hit. The first two methods perform profile-based searches by transforming the sequences into an HMM, and TREESEARCH also uses a phylogenetic tree of the input sequences. Although PSI-BLAST can be used iteratively, each new run being based on the output of the previous one, we use one step of this algorithm only, using a profile read off from a multiple CLUSTALW alignment of the input sequences as input. For PHI-BLAST, we use a motif in the form of a "regular expression" (18) designed to represent a family-specific pattern. Such an expression can be derived from the input sequences by using PRATT (10). Then, following a suggestion from the BLAST software "read-me" file for improving the competitiveness of PHI-BLAST, we apply PSI-BLAST just once, using a profile derived from the PHI-BLAST result. Finally, MAST uses profiles derived from motif analysis.
We combine these five methods as follows: First, given a collection of query sequences, method-specific input queries structured according to the specific requirements of the individual search algorithms are automatically derived for each of the five component algorithms. Then, after these algorithms have been applied by using their respective input queries, we compute and report a "consensus hit list."
In addition to detailing the resulting consensus tool dubbed comparative homology agreement search (CHASE), we present a comparative evaluation of its performance. Needless to say, the evaluation is performed by testing CHASE on a database that is disjoint from the database used to calibrate this tool. A standalone version of the CHASE software is available upon request.
| Materials and Methods |
|---|
|
|
|---|
HMMSEARCH IP. We use CLUSTALW (21) to generate a multiple alignment that in turn is used by HMMBUILD, available with the HMMER package, to build an HMM. We calibrate the required HMM by using HMMCALIBRATE, also available as part of HMMER.
TREESEARCH IP. We use BUILD_COMPOUND, available with TREESEARCH, to generate, as required, a sequence alignment (using CLUSTALW), a phylogenetic tree [using FITCH (22)], and an HMM (using HMMBUILD).
PSI-BLAST IP. We use CLUSTALW to align the input sequences and format the alignment such that it can be used to "jump-start" a "single-run" PSI-BLAST search.
PHI-BLAST IP. We use PRATT to generate a PROSITE-like pattern and a CLUSTALW alignment to generate a consensus sequence by relative majority rule for starting a PHI-BLAST search, followed by a single run of PSI-BLAST.
MAST IP. We use multiple expectation maximization for motif elicitation (MEME) (23) to generate motifs and convert them into the required profiles.
Automatic Evaluation of Database-Search Methods and Calculation of Performance Weights. PHASE4 (24) is a system for the automatic evaluation of database-search methods. In PHASE4, the performance of a method is evaluated by its ability to find a test set of sequences in a target database by using a training set of sequences for "learning" (e.g., for calculating an HMM). To construct test and training sets, PHASE4 relies on target databases such as SCOP 1.53 (20) that classify proteins [in a strictly Linnean, i.e., a binary or (according to proper Linnean terminology) "binomial" hierarchical, fashion] according to membership in families (of closely related sequences) and in superfamilies (of not-so-closely related sequences). An "evaluation scenario" is provided by specifying a training and a test set in the target database. For example, the scenario "distant family one model" is used to evaluate a homology search method for its ability to report distant relationships in protein families by splitting off one family from a given superfamily to provide the test sequences and keeping the rest of the superfamily as training sequences. Such a test is executed, for each family in turn, for every superfamily (see Table 1 for commonly used scenarios and ref. 24 for more details).
|
We use the PHASE4 system first for evaluating the individual homology-search methods to be combined in CHASE to derive a "weighting scheme" for the methods that is based on their performance. Among several available scenarios offered by PHASE4 that define training and test sequences using the SCOP database as described before, we use one for detecting distant relationships, one for detecting close relationships, and one for detecting very close relationships (see Table 1 for details). An E value EC = 1,000 was set as a cutoff for all individual homology-search methods; sequences with a larger E value are not listed. For each method i, consider the average percentage Pi(k) of coverage of true positives while considering k misclassifications (false positives) acceptable. In this case, the average is taken over the coverage for all three scenarios mentioned above, and the coverage for a scenario is in turn the average taken over all tests. Using some fixed number k (in our case, we used k = 50), this gives rise to the weighting scheme W = W1,..., Wn (as listed in Table 2), where n is the number of methods, and the weight Wi of method i is set to Pi/(P1 +... + Pn), with the average coverage Pi divided by the ntotal sum of the average coverages of all n methods so that
holds.
|
|
Placing methods on a common scale. For each method i and each sequence s in the database, we report the sequence, provided its E value Ei(s) is below or equal to a cutoff value EC of 1,000. Then, one method is chosen to be used as a reference method, on the basis of which the E values of the other methods are rescaled (25). In CHASE, we use HMMSEARCH as our reference method. Next, before doing any E value manipulation, we take the logarithm to base 10 to transform the E values for all methods. This transformation is necessary, because E values may be very close to zero for good database hits, and we must avoid rounding problems. This way, we obtain for each sequence s taken into consideration and each method i = 1,..., n a number ei(s):= log10Ei(s) that we call the "e value" of the sequence (with a small e) for conciseness. Next, we use a regression procedure yielding the slopes and the intercepts for HMMSEARCH versus TREESEARCH, PSI-BLAST, PHI-BLAST, and MAST to rescale the e values. For example, ordinary leastsquares regression (26) applied to HMMSEARCH e values eHMM(s) and corresponding PSI-BLAST e values ePSI(s) provides the slope a and the intercept b for which the error term
[eHMM(s) a·ePSI(s) b]2 is minimized. Here, the sum is taken over all sequences s with both e values eHMM(s) and ePSI(s) below or equal to a certain threshold e0. This procedure is repeated each time CHASE is applied to search for a protein family. Slope and intercept depend on the specific data under consideration: there is no universal data-independent regression line for the various methods. For each sequence s, we then put
![]() | [1] |
For a small scaling threshold e0, the formula rescales small e values according to the regression line and keeps large e values as they are. Keeping large e values as they are may be useful, because they may be "downscaled" otherwise, suggesting a significance that is not there. In the rare case that rescaled e values exceed the threshold, they are set to precisely this threshold to keep the ranking as is. For larger e0, fewer e values are kept as they are. In Results and Discussion, we set e0 = log10(EC) = 3. Because no hits are considered for which the E value exceeds the E value cutoff EC = 1,000, all values are rescaled in this case. Nevertheless, results improve slightly for smaller e0, as discussed later.
The same scaling procedure is applied to the e values reported by the other three methods. For notational consistency, we set
for our reference method HMMSEARCH. Calculating the C value. Once we have the rescaled e values
for all n methods, we calculate the c value for each sequence s as the W-weighted sum:
![]() |
The final C value (on the original E value scale) is then obtained as C value(s):= 10c-value(s), which yields a consensus over individual homology-search methods. "Missing E values" arise if a homology search method finds a sequence not found by another, given the E value cutoff EC. In the C value formula, these missing E values are set to the cutoff E value EC.
Evaluation of CHASE. As noted above, our tool CHASE implements the above scheme by using five homology-search methods. Using the weights W1,..., Wn of the component search algorithms calculated once and for all, we compute the regression lines and the resulting C values of the sequences in each database search. Treating the C values as E values, we can use PHASE4 again to evaluate the performance of CHASE and compare its performance with that of the component algorithms. Clearly, the weights that are incorporated in (and thus the performance of) CHASE depend on the database that was used for determining these weights. In particular, if a component algorithm does very well on that database, it will get a high weight, implying that it will strongly influence the outcome of the consensus method, making it look good on that particular database, too.
To avoid this kind of circularity, we split the SCOP 1.53 database into two separate databases: the odd database, containing every second SCOP superfamily starting with the first one, and the even database, containing the rest. We used the odd database to compute the weights, W1,..., Wn, as listed in Table 2, and the even database to evaluate the performance of the resulting consensus method and compare this performance with that of its component algorithms, using again the three scenarios offered in PHASE4 as described in Table 1. As before, we used coverage vs. false-positive counts in PHASE4 as a performance evaluator, and sorting of CHASE hits was based on the C value. Sequences with a C value exceeding EC = 1,000 are not listed. By default, CHASE sets the E value cutoff EC to 1,000, and the e value threshold used for rescaling e0 to 3 (= log101,000) so that all values are rescaled. However, other cutoff values can be specified also.
| Results and Discussion |
|---|
|
|
|---|
|
|
The results of running CHASE for the SCOP superfamily featuring the FAD/NAD(P)-binding domain are shown in Table 3. C values along with rescaled E values from different methods are listed. The names of the sequences from the given family are shown in black (in the "description" column), and the others (the names of the false positives) are shown in red. We consider a family member to be classified correctly by method i if its rescaled E value is smaller than the rescaled E value of the first false positive. For the false positives listed by method i, the minimum rescaled E value is shown in red. Rescaled E values of family members f that would not be classified correctly using method i alone are marked in orange. They are larger than the smallest rescaled E value of the false positives for method i (printed in red) so that the false positive with the smallest rescaled E value would precede the family members f in the ranking based on method i. In the twilight zone of rows 1524, CHASE performs well, triggered by the rescaled E values marked in green, which indicate success for at least one method. Inspecting the consensus hit lists for all protein families under consideration in the "distant relationship" scenario, we noted that each method detects specific true positives that would not be detected if we had restricted ourselves to combining the other four.
The evaluation that we report is one scenario for CHASE. In another scenario, CHASE can be used to search for a maximum number of members of a protein family by providing expert rather then automatic input information to the component methods. Such information could, for example, be patterns described in the PROSITE database (18). These patterns may be found by PS_SCAN (18). We provide an advanced user interface by which one can submit protein sequence(s), a sequence alignment, a profile, or a pattern as input for the underlying search methods.
In the future, we would like to include more search methods. In some evaluation scenarios and for some data, methods based on support vector machines seem to be superior to some of the methods we combine (e.g., ref. 27), but they lack E values.
| Conclusion |
|---|
|
|
|---|
| Acknowledgements |
|---|
| Footnotes |
|---|
|| To whom correspondence should be addressed. E-mail: fuellen{at}uni-muenster.de.
© 2004 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
M. J. Cornell, I. Alam, D. M. Soanes, H. M. Wong, C. Hedeler, N. W. Paton, M. Rattray, S. J. Hubbard, N. J. Talbot, and S. G. Oliver Comparative genome analysis across a kingdom of eukaryotic organisms: Specialization and diversification in the Fungi Genome Res., December 1, 2007; 17(12): 1809 - 1822. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||