Reliable prediction of transcription factor binding sites by phylogenetic verification
-
Edited by Michael S. Waterman, University of Southern California, Los Angeles, CA (received for review May 20, 2005)

Abstract
We present a statistical methodology that substantially improves the accuracy of computational predictions of transcription factor (TF) binding sites in eukaryotic genomes. The method models the cross-species conservation of binding sites without relying on accurate sequence alignment. It can be coupled with any motif-finding algorithm that searches for overrepresented sequence motifs in individual species and can increase the accuracy of that algorithm. Because the method is capable of accurately detecting TF binding sites, it also enhances our ability to predict cis-regulatory modules. We applied the method to published chromatin immunoprecipitation (ChIP)-chip data in Saccharomyces cerevisiae and found that its sensitivity and specificity are 9% and 14% higher than those of two recent methods. We also recovered almost all of the previously verified TF binding sites and made predictions on the cis-regulatory elements that govern the tight regulation of ribosomal protein genes in 13 eukaryotic species (2 plants, 4 yeasts, 2 worms, 2 insects, and 3 mammals). These results provide insight into transcriptional regulation in eukaryotic organisms.
Footnotes
-
b To whom correspondence should be sent at the present address: Division of Biostatistics, Department of Medicine, Indiana University, Indianapolis, IN 46202. E-mail: shawnli{at}iupui.edu.
-
Author contributions: X.L. and W.H.W. designed research; X.L. performed research, contributed new reagents/analytic tools, and analyzed data; and X.L., S.Z., and W.H.W. wrote the paper.
-
Conflict of interest statement: No conflicts declared.
-
This paper was submitted directly (Track II) to the PNAS office.
-
Abbreviations: ChIP, chromatin immunoprecipitation; CSC, cross-species conservation; MSM, marginally significant motif; TF, transcription factor; TSS, transcription start site; RPG, ribosomal protein gene; Sc, Saccharomyces cerevisiae; NLC, network-level conservation.
-
d Coregulated genes are those that are regulated by the same TF or TF modules.
-
e Appendix 1, which is published as supporting information on the PNAS web site, gives an example showing that motif instances may not always align correctly.
-
f The anchor species is the species in which the motif-finding problem arises; i.e., if we are interested in finding the motifs in a certain species, then this species is called the anchor species. We use this name to distinguish it from all other species that are used to help find the motifs (the genes from the anchor species are called anchor genes).
-
g A grouping of MSMs is a collection of similar MSMs, where each MSM in the group belongs to a different species. See Appendix 2, which is published as supporting information on the PNAS web site, for how to obtain groupings of MSMs.
-
h Although we use “upstream sequence” in describing the method, in practice, the method should be applied to any regions that may contain cis-regulatory elements.
-
i For the yeast species, we downloaded alignments of upstream orthologs from ref. 25. For the two plant species, we performed local alignment of orthologous upstream sequences and used the best-aligned regions of 100-bp length for every orthologous pair. The 100-bp cutoff is arbitrarily chosen, but, to our knowledge, our method is not very sensitive to the background-substitution matrices. For the three mammalian species, two insect species, and two worm species, we downloaded the available alignments of the RPG upstream sequences from the University of California, Santa Cruz genome browser web site.
-
j This cutoff is arbitrary. In our experience, it works well for all the data sets from different species that we used. In the text, we use another empirical P value cutoff, 1 × 10⁻¹⁹, to report motifs.
-
k To avoid overfitting, we exclude the motif instances on the current group of orthologous genes (the group of genes to be scanned by the ancient motif) from constructing the ancient motif.
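The hold-out idea in this footnote can be illustrated with a minimal sketch. It assumes, purely for illustration, that the ancient motif is summarized as a position-specific count matrix built from equal-length motif instances; the function name `count_matrix_excluding` and the dictionary layout are hypothetical and are not taken from the paper.

```python
from collections import Counter


def count_matrix_excluding(instances_by_group, held_out_group, alphabet="ACGT"):
    """Build a per-position base-count matrix from motif instances,
    skipping every instance that belongs to the held-out ortholog group.

    instances_by_group: dict mapping a group id to a list of equal-length
    motif-instance strings (hypothetical layout, assumed for this sketch).
    """
    # Collect instances from all groups except the one about to be scanned.
    sites = [
        site
        for group, group_sites in instances_by_group.items()
        if group != held_out_group
        for site in group_sites
    ]
    if not sites:
        raise ValueError("no motif instances left after exclusion")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all motif instances must have the same length")

    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts.get(base, 0) for base in alphabet})
    return matrix


# Toy instances, invented for illustration only.
toy_instances = {"g1": ["ACG", "ACG"], "g2": ["TCG"], "g3": ["ACA"]}
toy_matrix = count_matrix_excluding(toy_instances, held_out_group="g2")
```

Because the matrix used to scan group `g2` never sees the instance from `g2`, a hit in that group cannot be explained by the group's own contribution to the motif.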
-
l PhyloCon outputs 60 predictions on average, with some redundancies. CompareProspector outputs rank-ordered motifs, for which we performed the manual search within the top three motifs.
-
m The criterion for matching TRANSFAC motifs is that there should be at most one mismatch when the putative motifs are compared with the TRANSFAC ones (see the legend of Table 5 and the supplementary files of ref. 18).
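A one-mismatch criterion of this kind can be sketched as follows. The sketch assumes that motifs are compared as plain consensus strings and that a shorter motif may be slid, without gaps, along a longer one. Both are assumptions made for illustration, as are the function names; degenerate base codes and reverse complements are not handled.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def matches_known_motif(putative: str, known: str, max_mismatches: int = 1) -> bool:
    """Return True if the shorter consensus fits some ungapped window of the
    longer one with at most `max_mismatches` differing positions."""
    # Compare case-insensitively; slide the shorter string along the longer.
    short, long_ = sorted((putative.upper(), known.upper()), key=len)
    width = len(short)
    return any(
        hamming(short, long_[start:start + width]) <= max_mismatches
        for start in range(len(long_) - width + 1)
    )


# Toy consensus strings, invented for illustration only.
same = matches_known_motif("TGACTCA", "TGACTCA")      # identical
one_off = matches_known_motif("TGACTCA", "TGACTGA")   # one mismatch
two_off = matches_known_motif("TGACTCA", "TGTCTGA")   # two mismatches
```

Here `same` and `one_off` evaluate to `True` and `two_off` to `False`, since only the first two comparisons stay within the one-mismatch limit.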
-
Freely available online through the PNAS open access option.
- Copyright © 2005, The National Academy of Sciences