A computational and experimental approach to validating annotations and gene predictions in the Drosophila melanogaster genome

  1. Mark Yandell*
  2. Adina M. Bailey§
  3. Sima Misra§
  4. ShengQiang Shu§
  5. Colin Wiel§
  6. Martha Evans-Holm§
  7. Susan E. Celniker
  8. Gerald M. Rubin*,§
  *Howard Hughes Medical Institute and §Department of Molecular and Cell Biology, University of California, Life Sciences Addition, Berkeley, CA 94720-3200; and Department of Genome Sciences, Lawrence Berkeley National Laboratory, One Cyclotron Road, Mailstop 64-121, Berkeley, CA 94720

  Contributed by Gerald M. Rubin, December 17, 2004

Abstract

Five years after the completion of the sequence of the Drosophila melanogaster genome, the number of protein-coding genes it contains remains a matter of debate; the number of computational gene predictions greatly exceeds the number of validated gene annotations. We have assembled a collection of >10,000 gene predictions that do not overlap existing gene annotations and have developed a process for their validation that allows us to efficiently prioritize and experimentally validate predictions from various sources by sequencing RT-PCR products to confirm gene structures. Our data provide experimental evidence for 122 protein-coding genes. Our analyses suggest that the entire collection of predictions contains only ≈700 additional protein-coding genes. Although we cannot rule out the discovery of genes with unusual features that make them refractory to existing methods, our results suggest that the D. melanogaster genome contains ≈14,000 protein-coding genes.

Footnotes

  • To whom correspondence should be addressed at: Department of Molecular and Cell Biology, University of California, Life Sciences Addition, Room 539, Berkeley, CA 94720-3200. E-mail: myandell{at}fruitfly.org.

  • M.Y., A.M.B., and S.M. contributed equally to this work.

  • Abbreviations: Mb, megabases; oligo, oligonucleotide; sjc, splice-junction conserved.

  • Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. CX309415–CX309654).

  • Freely available online through the PNAS open access option.
