GENETICS
Identification of putative noncoding polyadenylated transcripts in Drosophila melanogaster

*Berkeley Drosophila Genome Project and 
Department of Genome Sciences, Lawrence Berkeley National Laboratory, One Cyclotron Road, Mailstop 64-121, Berkeley, CA 94720; and
Department of Molecular and Cell Biology and ¶Howard Hughes Medical Institute, University of California, Berkeley, CA 94720
Contributed by Gerald M. Rubin, February 22, 2005
| Abstract |
|---|
Analysis of EST and cDNA collections from a number of metazoan species has identified genes encoding long polyadenylated transcripts that do not contain ORFs of lengths typical for protein-encoding mRNAs. Noncoding functions of such polyadenylated transcripts have been elucidated in only a few examples. The corresponding genes neither contain hallmark sequence motifs nor appear to have been conserved across phyla. Thus, it is impossible to systematically identify new members of this class of gene by using sequence homology and traditional gene-finding algorithms that depend on protein-coding potential. Consequently, even their approximate number has not been established for any metazoan genome. We curated polyadenylated transcripts with limited protein-coding capacity from intergenic regions of the Drosophila melanogaster genome. We used RT-PCR assays, hybridization to RNA blots and whole-mount embryos, and computational analyses to characterize candidate transcripts. We verify the structures and expression of 17 distinct, likely non-protein-coding polyadenylated transcripts. We show that the expression of many of these transcripts is conserved in other Drosophila species, indicating that they have important biological functions.
genome | noncoding RNA | Drosophila pseudoobscura | Drosophila virilis
Although long ncRNAs in Drosophila melanogaster were first described more than two decades ago (2), their functions generally remain elusive (reviewed in ref. 3). The ncRNAs that have been best studied to date in D. melanogaster appear to function as part of RNA-protein complexes, but they do not share obvious sequence motifs or secondary structures. The RNA on the X (roX1 and roX2) transcripts function in localizing the chromatin remodeling activity of the male-specific lethal complex by binding specifically to male-specific lethal proteins (reviewed in ref. 4). Transcripts from the fly Heat-shock RNA-ω (Hsr-ω) locus, produced in response to cellular heat stress, have been proposed to aid in organizing heterogeneous nuclear RNA-binding proteins (5). In addition, ncRNA produced by the polar granule component (pgc) gene and found in the pole plasm of embryos (6) recently has been shown to mediate transcriptional repression, apparently as a result of transcription factor sequestration (7). However, the interaction motifs used by these ncRNAs are not recognizable in other genes and cannot yet be used to define classes of ncRNAs. Additionally, systematic identification and annotation of long noncoding transcripts has not been possible because traditional gene-finding algorithms rely on coding potential and sequence homology.
Another approach to new ncRNA gene discovery is positional curation, searching for evidence of specific transcription in segments of the genome that lack other annotations. Microarray data from several metazoan species indicate that much more of the genome is transcribed than can be accounted for by annotated gene transcripts (8, 9). But how much of this transcriptional activity produces specific and discrete RNAs remains unclear. Positional curation has been used to identify novel candidate ncRNA sequences in the Arabidopsis and mouse genomes (10, 11). The reannotated fruitfly genome (12) and sequences from the Drosophila Gene Collection (13), a repository of high-quality full-insert cDNA sequences, offer a rich resource for discovery of novel transcripts with potentially noncoding functions.
Here, we present the results of a positional curation effort that identifies 17 previously undescribed, long putative ncRNA genes by employing RT-PCR assays, Northern analysis, and in situ hybridization to characterize 72 candidate sequences. Comparative genome analyses indicate that the very small ORFs contained in these transcripts are unlikely to encode proteins. In several cases, we have obtained direct experimental evidence for expression of orthologous transcripts in diverged Drosophila species, Drosophila pseudoobscura and Drosophila virilis, indicating that the expression of those genes as specific transcripts has been conserved during evolution.
| Materials and Methods |
|---|
χ² test. Results with P < 0.05 were retained. QRNA (19, 20) analysis also used the D. pseudoobscura genome. To identify D. pseudoobscura and D. virilis sequences orthologous to ncRNA candidate transcripts, we concatenated syntenic and/or overlapping BLASTN highest scoring pair results to nonrepetitive sequence with expectation values <1e-05. PRIMER3 (21) was used for primer design. For descriptions of molecular biology methods, oligonucleotide sequences, and experimental observations, see Supporting Materials and Methods and Data Sets 1–7, which are published as supporting information on the PNAS web site.
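The filtering and concatenation of highest scoring pairs (HSPs) described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Hsp` tuple layout, the function names, and the `max_gap` parameter are assumptions made for the example, and the synteny check mentioned in the text is not implemented here. Only the E-value cutoff (<1e-05) comes from the text.

```python
from typing import List, Tuple

# Hypothetical HSP record: (query_start, query_end, e_value),
# with 1-based inclusive coordinates on the D. melanogaster transcript.
Hsp = Tuple[int, int, float]


def merge_hsps(hsps: List[Hsp], max_evalue: float = 1e-5,
               max_gap: int = 0) -> List[Tuple[int, int]]:
    """Keep HSPs below the E-value cutoff and merge overlapping or
    adjacent query intervals into concatenated conserved regions."""
    kept = sorted((start, end) for start, end, ev in hsps if ev < max_evalue)
    merged: List[Tuple[int, int]] = []
    for start, end in kept:
        # Extend the previous region if this HSP overlaps or abuts it.
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(regions: List[Tuple[int, int]], length: int) -> float:
    """Fraction of a transcript of the given length covered by the regions."""
    return sum(end - start + 1 for start, end in regions) / length
```

A coverage value near 1.0 would correspond to a transcript showing homology along its entire length; what counts as "conserved" beyond the E-value cutoff is left to the caller.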
| Results |
|---|
Because generation of cDNA clones by internal oligo(dT) priming of transcripts is a common artifact of cDNA library-construction, we disqualified 48 of 120 sequences whose 3' poly(A)-RNA tracts appeared encoded in genomic sequence, leaving a final 72 candidate cDNAs (Table 1; see also Table 3, which is published as supporting information on the PNAS web site, for additional details).
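One simple way to flag possible internal oligo(dT) priming is to inspect the genomic sequence immediately downstream of the cDNA 3' end for an encoded A-rich tract. The sketch below assumes a 20-nt window and a threshold of 15 adenines; both values and the function name are illustrative assumptions, since the criteria actually applied are not stated in this section.

```python
def looks_internally_primed(genomic_downstream: str,
                            window: int = 20, min_a: int = 15) -> bool:
    """Return True if the genomic sequence just 3' of the cDNA end is A-rich.

    genomic_downstream: genomic bases (sense strand) starting at the position
    where the cDNA's poly(A) tail begins. window and min_a are assumed
    thresholds, not values taken from the paper.
    """
    segment = genomic_downstream[:window].upper()
    return segment.count("A") >= min_a
```

A clone flagged this way has a 3' poly(A) tract that may be templated by the genome rather than added post-transcriptionally, which is the artifact the disqualification step is meant to remove.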
To facilitate evolutionary analyses, we searched for sequences that were homologous to our transcripts in two other Drosophila species by using BLASTN. Retaining for analysis the highest scoring pairs with expectation values of 1e-5 or less, we identified considerable homology between D. melanogaster candidate ncRNA transcripts and regions of the D. pseudoobscura genome, which diverged from D. melanogaster 25 million to 30 million years ago (22): 63 (86%) candidate ncRNA transcripts had conserved regions, with most transcripts showing homology along their entire lengths. Additional analysis indicated that regions reflecting noncoding transcript orthology were usually syntenic with neighboring gene annotations. We observed similarly strong conservation of nucleotide sequence when we repeated the analysis by using an assembly of the D. virilis genome for comparison. D. virilis is more distantly related to D. melanogaster than is D. pseudoobscura (22); accordingly, we found 44 transcripts (60%) that contained homologous regions. These comparative findings support the hypothesis that the candidate transcripts represent conserved genes. On average, the candidate D. melanogaster transcripts shared 60% sequence identity with D. pseudoobscura sequence and shared 61% with D. virilis, similar to the 61% identity reported in a comparison of representative D. melanogaster and D. pseudoobscura protein-coding sequences (23).
In the genomes of the more distantly related honey bee, Apis mellifera (www.hgsc.bcm.tmc.edu/projects/honeybee), and mosquito, Anopheles gambiae (24), we found sequence conservation for 32 (44%) and 23 (32%) of candidate cDNAs, respectively. Even in these evolutionarily distant genomes, homologous regions encompassed most of the transcript lengths.
Likelihood of Translation. Because these 72 candidate ncRNA transcripts are structurally similar to protein-coding mRNAs, it was important to assess their potential for encoding polypeptides. To make this assessment, we first considered ORF lengths and initiating codons. Next, we measured whether the sequence of each ORF was more conserved compared with untranslated sequences from the same transcript. Finally, we used two interspecies comparative analysis methods, Ka/Ks and QRNA, to ascertain whether codon structure in each transcript was significantly conserved.
Having calculated the longest possible Met-initiated translation for each transcript (Table 3), we next randomized each transcript sequence and again calculated the longest ORF. After six randomization trials, we found that for 37% of the transcripts the average longest ORF length of randomized sequence was the same as or longer than that of the native, nonrandomized sequence (Table 3). For transcripts with native translations longer than those found in the average of randomized trials, the native ORFs were longer by an average of only 23 codons. We conclude that the short ORFs that occur in our candidate ncRNAs are similar in length to ORFs that occur by chance in random sequence.
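The comparison of native and randomized ORF lengths can be illustrated with a short sketch. Several details here are assumptions rather than facts from the text: randomization is done by shuffling individual nucleotides, only the three forward frames of the transcript are scanned, and an ORF is counted only if it ends at an in-frame stop codon. The function names are hypothetical; the six trials follow the text.

```python
import random
from statistics import mean

STOPS = {"TAA", "TAG", "TGA"}


def longest_met_orf(seq: str) -> int:
    """Length in codons (ATG included, stop excluded) of the longest
    ATG-initiated ORF in the three forward frames. ORFs that run off the
    end of the sequence without a stop codon are not counted (assumption)."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if start is not None:
                    best = max(best, (i - start) // 3)
                start = None
            elif codon == "ATG" and start is None:
                start = i
    return best


def mean_shuffled_orf(seq: str, trials: int = 6, seed: int = 0) -> float:
    """Average longest Met-initiated ORF over nucleotide-shuffled copies."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(trials):
        bases = list(seq)
        rng.shuffle(bases)  # preserves base composition, destroys order
        lengths.append(longest_met_orf("".join(bases)))
    return mean(lengths)
```

A transcript whose native `longest_met_orf` value does not exceed `mean_shuffled_orf` would fall in the group for which the randomized ORF was the same as or longer than the native one.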
We examined each transcript for the longest non-Met-initiated ORF to address two issues: first, the rare occurrence of non-Met-initiated translation, and second, the possibility that a cDNA may not represent the true 5' end of a transcript and thus may not include a used start codon. Within our data set, the non-Met-initiated ORFs are rarely significantly longer than the 100-codon curation criterion we applied for Met-initiated reading frames (Table 3). In addition, we ultimately excluded from our analysis any candidate cDNA that, based on Northern experiments (see below), did not appear to correspond to a full-length transcript; thus, we are unlikely to have missed a significant portion of any ORF due to a truncation of the transcript.
Although the ORFs encoded by our candidate ncRNAs are indeed short and their translated sequences have no similarity to known proteins (BLASTX analysis of these sequences was conducted as part of the D. melanogaster 3.1 annotation pipeline) (12, 15), the possibility remains that some of these sequences encode novel small peptides. To assess this possibility, we asked whether sequence conservation in the D. pseudoobscura genome is greater within the ORF than in the remainder of the transcript. We compared BLASTN results for the non-ORF regions in each transcript with results for its ORF. Of the 63 transcripts conserved in D. pseudoobscura, only 14 displayed greater sequence conservation in their longest ORF than in the remainder of their sequence (Table 3).
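To make the ORF-versus-remainder comparison concrete, the sketch below computes percent identity inside and outside an ORF from a pairwise alignment. It is only an illustration: the text says BLASTN results were compared but does not specify the metric, so the use of percent identity, the treatment of gap columns as mismatches, and the function names are all assumptions.

```python
from typing import Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns with identical bases.
    Gap columns ('-') count as mismatches; an empty alignment scores 0."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a.upper(), b.upper())
                  if x == y and x != "-")
    return 100.0 * matches / len(a)


def orf_vs_flank_identity(aln_mel: str, aln_other: str,
                          orf_start: int, orf_end: int) -> Tuple[float, float]:
    """Return (ORF identity, non-ORF identity) for an aligned transcript pair.
    orf_start and orf_end are 0-based, end-exclusive alignment columns."""
    orf = percent_identity(aln_mel[orf_start:orf_end],
                           aln_other[orf_start:orf_end])
    flank = percent_identity(aln_mel[:orf_start] + aln_mel[orf_end:],
                             aln_other[:orf_start] + aln_other[orf_end:])
    return orf, flank
```

Under this simplified measure, a transcript would count as having a more conserved ORF only when the first value exceeds the second.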
We next examined the ORFs of our transcript data set for evolutionary conservation of codon structure. We calculated the Ka/Ks ratio for the longest Met-initiated translation for each sequence in our transcript data set, comparing nonsynonymous (Ka) to synonymous (Ks) substitutions in codon structure (25). Although it has been shown that nearly 95% of D. melanogaster protein coding exons have conserved coding information as determined by Ka/Ks (23), we were able to find only seven transcripts (10%) in our data set with statistically significant conservation of any codon structure within a transcript ORF (Table 3).
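The distinction between synonymous and nonsynonymous change that underlies Ka/Ks can be shown with a deliberately simplified sketch. It only classifies differing codons in two gap-free, in-frame aligned ORFs using the standard genetic code. It does not compute Ka or Ks themselves, which require normalizing by the numbers of synonymous and nonsynonymous sites and correcting for multiple substitutions, as in the method the authors cite (25). The function name is hypothetical.

```python
from itertools import product
from typing import Tuple

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_codon_changes(orf_a: str, orf_b: str) -> Tuple[int, int]:
    """Count differing codons as (synonymous, nonsynonymous).

    Both ORFs must be gap-free, the same length, and a multiple of three.
    These are raw codon counts, not per-site substitution rates."""
    if len(orf_a) != len(orf_b) or len(orf_a) % 3:
        raise ValueError("ORFs must be in-frame and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(orf_a), 3):
        ca, cb = orf_a[i:i + 3].upper(), orf_b[i:i + 3].upper()
        if ca == cb:
            continue
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid or stop: nonsynonymous
    return syn, nonsyn
```

In a conserved protein-coding ORF, changes are expected to be biased toward the synonymous class; an ORF evolving without coding constraint shows no such bias.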
Even though Ka/Ks analysis indicated that the majority of our transcript data set does not have any conserved codon structure, we used an additional method to evaluate the coding potential of the ncRNA candidate transcripts. The QRNA algorithm relies upon stochastic pair grammars to evaluate substitutions between aligned homologous sequences (19, 20). Based on the pattern of mutations between conserved regions, the software assigns each nucleotide to one of three states: protein-coding, structural RNA, and other (potentially novel ncRNA). QRNA analysis indicated only seven candidate transcripts that contain any conserved sequence that approximates codon structure (Table 3). Three of the seven QRNA transcripts that show possible conservation of potential protein-coding sequence also have Ka/Ks results indicating conservation of codon structure (Table 3). Of these three transcripts, only one has an ORF that is more conserved than its flanking untranslated sequence.
In the absence of functional evidence, the very short ORFs found in these long transcripts, coupled with a lack of consistent support for conserved reading frames by independent methods of comparative analysis, sustain the current classification of these transcripts as likely non-protein-coding.
RT-PCR and Microarray Analyses of Candidate Noncoding Transcripts. To verify that candidate cDNAs represent expressed and processed transcripts and to validate the predicted splice junctions, we tested each of 31 putatively spliced transcripts with an RT-PCR assay, applying primers designed to amplify across predicted exon boundaries to RNA pooled from a broad range of Drosophila stages (see Table 4, which is published as supporting information on the PNAS web site). For 26 of the 31 putatively spliced transcripts, the sequences of amplification products primed by oligonucleotide pairs flanking predicted exon boundaries verify at least one splice junction (Table 4; for example, see Fig. 1 A and B); in one case a candidate transcript could be detected by RT-PCR but represented unspliced RNA. The four remaining putatively spliced candidates were not detected under our RT-PCR conditions. For 26 spliced transcripts detectable in this assay, single RT-PCRs were primed from the 5'- and 3'-most exons of those transcripts. The entire structure of the candidate cDNA was validated in this manner for 21 transcripts (Table 4). The Affymetrix Drosophila Genome 2.0 Gene Chip expression array contains all of the candidate noncoding transcripts described in this study; 92% of candidate transcripts are detected in at least one RNA sample by hybridization to this array (Table 4).
Northern Blot Analysis of Candidate Noncoding Transcripts. Northern blot analysis was performed for each of the 72 candidates to determine which correspond to discrete RNA species and whether the cDNA clone representing each of those transcripts was full length (Table 4). Radiolabeled probes corresponding to 45 candidates detect transcripts on Northern blots of poly(A)+ RNA samples representing a broad range of developmental stages and a cultured Drosophila cell line (Fig. 1C). In 17 cases, the transcript length corresponded to that of the curated cDNA (for example, see Fig. 1C); two of these probes also detected transcripts whose molecular weights are higher (Table 4).
In the remaining 28 cases where a transcript was detected by Northern blotting, the RNA appears significantly longer than the corresponding cDNA, indicating that the predominant transcript represented by that clone is longer than the original cDNA. We designed RT-PCR experiments to amplify hypothetical RNAs that would bridge candidate ncRNA transcripts and proximal exons of protein-coding genes or nearby ESTs (for example, see Fig. 2), testing a single hypothetical gene structure for most of these 28 candidates. In five cases, we observed that the candidate transcript derived from an adjacent protein-coding annotation (data not shown), emphasizing the value of Northern analysis of transcripts first identified within EST collections.
pncr Orthologs in D. pseudoobscura and D. virilis. Of the 17 D. melanogaster pncr genes, we were able to identify orthologous sequences in 13 cases in D. pseudoobscura and 9 cases in D. virilis (Table 2; for alignment of one example, see Fig. 4, which is published as supporting information on the PNAS web site). We used syntenic highest-scoring pairs to construct models of putative orthologous transcripts for both of these species. We next investigated whether these conserved sequences were expressed in D. pseudoobscura and D. virilis as discrete transcripts. By applying primers based on gene models to RNA isolated from D. pseudoobscura and D. virilis, RT-PCR was used to generate cDNAs whose identities were verified by sequencing. We were able to detect a putative ortholog in this way for 9 of the 13 D. melanogaster pncr transcripts for which genomic sequence conservation was observed. We used these RT-PCR products as probes for Northern blots in both related Drosophila species and in each case detected an orthologous transcript expressed in at least one of these two related Drosophila species; for seven pncr genes, an orthologous transcript was detected in both related species. The D. pseudoobscura and D. virilis orthologs exhibited lengths that roughly correspond to their cognate D. melanogaster transcripts (Data Sets 4 and 5). Stage-specific expression of transcripts was often observed to mirror D. melanogaster patterns (for example, see Fig. 3; other data not shown). In situ hybridizations also were carried out in D. pseudoobscura embryos (Table 2); expression patterns of three orthologs were nearly identical to those of respective D. melanogaster transcripts (for example, see Fig. 3).
| Discussion |
|---|

-element, iab-4, and bft) (2, 4–7, 29). In this work, starting from a set of 72 computationally curated candidates, we have identified 17 additional mRNA-like ncRNAs that produce distinct transcripts, tripling the number of described mRNA-like ncRNAs in Drosophila. No systematic approach to gene-finding for long, noncoding mRNA-like RNA genes in fruitfly has been previously reported. We describe previously unrecognized candidate ncRNAs that are both spliced and unspliced and, in some cases, conserved in other Dipteran genomes. We examined the expression and structures of these genes by multiple experimental approaches and demonstrated the expression of discrete orthologous transcripts of a subset of these genes in two other Drosophila species. Further investigation of the pncr set of Drosophila genes classified by this study may reveal novel RNA-mediated functions among their transcripts. With the aim of evaluating independent curations of new ncRNAs, we applied our experimental approach to a small sampling of purported ncRNAs in mouse from the RIKEN FANTOM2 collection (http://fantom2.gsc.riken.go.jp) of recently identified cDNA transcripts (11). Information on these efforts can be found in Supporting Appendix, which is published as supporting information on the PNAS web site, and Data Sets 6 and 7.
Determination of protein-coding status is the most challenging task in noncoding gene curation. Pseudogenes, truncated clones, or errors in sequence determination could give rise to transcript sequences with reduced coding capacity. Additionally, a short ORF might still be translated to produce a small peptide. To minimize cDNA artifacts, we started with high-quality sequences and screened out truncated and reversed clones by using various computational and experimental methods. We then used homology searches to eliminate pseudogenes and extensive comparative studies to assess conservation of protein-coding potential in the short ORFs. It is important to point out, however, that the argument that the pncr transcripts identified in this study are noncoding relies solely on the lack of positive evidence supporting the alternative hypothesis that these transcripts encode proteins. Demonstration of any RNA-mediated functions awaits further investigation.
Our efforts to curate long, mRNA-like ncRNA genes were nonsaturating because of our reliance on existing cDNA resources. EST and cDNA sequences have proven to be extremely valuable for the identification of protein-coding genes and their alternatively spliced transcripts (12, 30, 31). As demonstrated by this work and by ncRNA curation efforts in plant and mouse (10, 11), these resources are even more important for the identification and characterization of ncRNA genes. Although genomic hybridization technologies have begun to provide extensive evidence for transcription not accounted for by annotated protein-coding genes (9, 32), production of high-quality EST and cDNA sequence data, and experimental data such as Northern analyses, is essential for distinguishing which of these transcribed sequences encode discrete RNAs.
Because we have used EST and cDNA sequences as a source for candidate ncRNA curation, any estimate that we make of the total number of ncRNAs encoded in the Drosophila genome will depend on the extent of EST representation in our data set. In this regard, it is worth noting that the D. melanogaster control genes roX1, roX2, Hsr-ω, and pgc (the most well characterized fruitfly mRNA-like ncRNAs) each have multiple ESTs (Table 3). However, >20% of annotated protein-coding genes, as well as most of the other described mRNA-like ncRNA genes, did not have corresponding ESTs at the time of the Release 3.1 reannotation (12).
Analysis of fruitfly 5' ESTs that are not associated with annotations reveals almost 500 distinct clusters containing one or more spliced EST reads (J. Carlson, personal communication). We expect that analyses of these sequences by using the methods we have used here to characterize our 193 initial candidates will identify additional mRNA-like ncRNAs similar in properties to the 17 pncr transcripts we describe. Taking all these factors into account, our guess is that there may be a total of 50–100 long, mRNA-like ncRNA genes encoding discrete transcripts in the D. melanogaster genome. Based on the candidates we have examined in this work, many of these genes will be evolutionarily conserved, suggesting that they have important biological functions.
| Acknowledgements |
|---|
| Footnotes |
|---|
Freely available online through the PNAS open access option.
Abbreviation: ncRNA, noncoding RNA.
J.L.T. and A.M.B. contributed equally to this work.
Present address: Celera Genomics, South San Francisco, CA 94080.
** Present address: Kosan Biosciences, Hayward, CA 94545.
|| To whom correspondence may be addressed. E-mail: adina{at}fruitfly.org or rubing{at}hhmi.org.
© 2005 by The National Academy of Sciences of the USA
| References |
|---|