Modern origin of numerous alternatively spliced human introns from tandem arrays
- *Laboratoire de Génomique Fonctionnelle de Sherbrooke,
- †Département de Microbiologie et d'Infectiologie, Faculté de Médecine et des Sciences de la Santé, Université de Sherbrooke, Sherbrooke, PQ, Canada J1H 5N4
Edited by Joan A. Steitz, Yale University, New Haven, CT, and approved November 17, 2006 (received for review June 7, 2006)
Abstract
Despite the widespread occurrence of spliceosomal introns in the genomes of higher eukaryotes, their origin remains controversial. One model proposes that the duplication of small genomic portions could have provided the boundaries for new introns. If this mechanism has occurred recently, the 5′ and 3′ boundaries of each resulting intron should display distinctive sequence similarity. Here, we report that the human genome contains an excess of introns with perfect matching sequences at boundaries. One-third of these introns interrupt the protein-coding sequences of known genes. Introns with the best-matching boundaries are invariably found in tandem arrays of direct repeats. Sequence analysis of the arrays indicates that many intron-breeding repeats have disseminated in several genes at different times during human evolution. A comparison with orthologous regions in mouse and chimpanzee suggests a young age for the human introns with the most-similar boundaries. Finally, we show that these human introns are alternatively spliced with exceptionally high frequency. Our study indicates that genomic duplication has been an important mode of intron gain in mammals. The alternative splicing of transcripts containing these intron-breeding repeats may provide the plasticity required for the rapid evolution of new human proteins.
The origin of spliceosomal introns and their dissemination in higher eukaryotes are controversial issues. Different mechanisms have been proposed to explain how introns have arisen at different times during eukaryotic evolution (1). Ancestral spliceosomal introns may have evolved from organellar type II introns (2). In a manner similar to type II introns, the transposition of a spliceosomal intron through its reverse splicing into an intronless gene has been proposed as an ancient dissemination mechanism (3). Many spliceosomal introns in higher eukaryotes are considered to be relatively recent (4–6), but there is very little direct evidence for the mechanisms by which they arose. Intron insertion by transposition of a p-SINE1 element has occurred in the rice catalase A gene (7). Likewise, transposon insertion supports the creation of a new intron in the Sh2 gene of maize (8). Reverse splicing and the genomic insertion of transposons have been proposed to explain recent intron gains in nematodes (5, 9). Intron transfer among paralogues is consistent with the distribution of introns in the globin genes of Chironomus (10). The tandem genomic duplication of a portion of coding sequences with an AGGT cryptic splice site was initially proposed by Rogers (11) as a mechanism to produce the boundaries of a new intron. Although this model was rejected almost immediately based on the lack of similarity at the boundaries of introns that were examined, this mechanism was revived 10 years later to explain the origin of introns in several fish genes (12).
A new intron can also arise when a known boundary is spliced to a novel junction created by point mutations, as is the case with the exonization of Alu elements (13, 14). Recently, an intron in the SETMAR gene was shown to originate from the exonization of a mariner-like Hsmar1 transposon (15). Several human LF-SINE elements overlap ultraconserved exons, underscoring the role of retroposons in the creation of new vertebrate splice junctions (16). New introns can also emerge from genomic events that produce new combinations of intron boundaries, as is the case when a complete exon is duplicated. This latter mechanism has been active before the radiation of mammalian orders (17, 18). In this respect, alternative splicing may have favored the creation of introns from already functional or newly appearing boundaries. However, whether alternative splicing has contributed to the creation of modern introns is controversial. Consistent with this view is the observation that most of the human exons that have no orthologous versions in mouse are alternatively spliced (19). On the other hand, the mechanisms leading to the emergence of species-specific exons have not been examined, and it is unclear whether they were produced by exon-gain or -loss events (18). Because no unambiguous cases of intron gain occurring after the divergence of humans and mice 70–110 million years ago have been documented, the contribution of alternative splicing to the evolution of modern introns remains uncertain (5).
To examine the recent mechanisms of intron gain that may have occurred in mammals, we departed from the conventional phylogenetic approach used to identify modern introns. Instead, we based our analysis on the premise that many of the proposed mechanisms put forward to explain intron gain involve duplicating a genomic segment that would provide new splicing boundaries. For example, duplicating a segment containing the sequence AGGT would generate the appropriate substrate for both a 5′ and a 3′ splice junction. This model predicts that an intron produced recently by a duplication event will display extensive sequence similarity at its boundaries. Consistent with this model, we report that the human genome contains a multitude of introns carrying a direct-repeat signature at boundaries. Importantly, introns with the most-similar boundaries are always part of an array of direct repeats arranged in tandem. The young age of these introns is supported by the observations that the corresponding repeats are often lacking in mouse. Moreover, many orthologous regions in the chimpanzee display differences in the organization and sequence of the repeats. Finally, we observed that these recent introns are alternatively spliced with high frequency. Our results highlight the importance of genomic duplication as a mechanism for intron gain and suggest that the alternative splicing of pre-mRNAs containing tandem repeats has contributed to protein diversification in recent human evolution.
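The duplication model described above can be illustrated with a toy sketch. The sequence, the coordinates, and the function names below are made up for illustration and are not taken from the study; the sketch only shows that a tandem duplication of a segment containing AGGT yields a GT...AG interval whose two boundaries are identical by construction.

```python
def tandem_duplicate(seq, seg_start, seg_end):
    """Insert a second copy of seq[seg_start:seg_end] right after the first."""
    segment = seq[seg_start:seg_end]
    return seq[:seg_end] + segment + seq[seg_end:]


def intron_from_duplication(seq, seg_start, seg_end, motif="AGGT"):
    """Duplicate a segment containing `motif` and locate the candidate intron.

    Returns (new_seq, intron_start, intron_end) with 0-based, end-exclusive
    intron coordinates. The 5' junction is AG|GT in the first copy and the
    3' junction is AG|GT in the second copy (the +2 offset assumes the
    junction lies in the middle of a 4-nt motif).
    """
    segment = seq[seg_start:seg_end]
    offset = segment.find(motif)
    if offset < 0:
        raise ValueError("segment does not contain the motif")
    new_seq = tandem_duplicate(seq, seg_start, seg_end)
    intron_start = seg_start + offset + 2      # first nt after AG in copy 1
    intron_end = intron_start + len(segment)   # first nt after AG in copy 2
    return new_seq, intron_start, intron_end


# Toy example: hypothetical sequence and segment coordinates.
original = "ATGCCTGACAGGTCATTGGACTAA"
new_seq, start, end = intron_from_duplication(original, 6, 17)
intron = new_seq[start:end]
spliced = new_seq[:start] + new_seq[end:]
print(intron)
print(spliced == original)
```

In this toy case, removing the candidate intron restores the original sequence, which is why such a duplication could in principle be tolerated within a coding region.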
Results and Discussion
Human Introns with Perfectly Matching Boundaries.
To determine whether the boundaries of introns originate from duplication events, we aligned the exon–intron boundary (5′ splice site) with the intron–exon boundary (3′ splice site) of the same intron. Thus, the exon portion upstream of the 5′ splice site was aligned to the intron portion upstream of the 3′ splice site, and the intron portion downstream of the 5′ splice site was aligned to the exon portion downstream of the 3′ splice site (Fig. 1 a). For each alignment, we counted the number of perfect contiguous matches covering at least one nucleotide at the splice junctions (position +1 or −1). The analysis was performed first with 183,774 human introns obtained from the National Center for Biotechnology Information (NCBI) database (build 36). For comparison, we produced control sets of 5′ and 3′ boundaries by randomly combining exon–intron with intron–exon regions (Scramble). We find that the NCBI set yields an excess of introns with boundaries displaying perfect contiguous matches of ≥10 nt (Fig. 1 b). Fifty-five introns have boundaries with matches ≥15 nt [Table 1; supporting information (SI) Table 2 and SI Fig. 4]. The largest contiguous match was 357 nt. The analysis was also conducted on 304,008 introns obtained from the AceView database (20), which contains a higher proportion of EST records (November 2004 release). In this case, 679 introns displayed boundaries with perfect matches ranging from 15 to 530 nt (SI Fig. 5 and SI Table 3). Overall, 128 of the 679 AceView introns (18.9%) belonged to genes encoding different categories of known proteins (e.g., membrane associated, zinc finger, and immunoglobulins) implicated in a variety of processes like signal transduction. Approximately two-thirds of these genes are associated with medically important human disorders, as defined by Online Mendelian Inheritance in Man (www.ncbi.nlm.nih.gov/omim).
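The boundary comparison described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function name, the coordinate convention (0-based, end-exclusive intron coordinates on a single genomic string), and the toy sequence are assumptions made here.

```python
def boundary_match(seq, start, end):
    """Return (upstream, downstream) perfect-match lengths for one intron.

    seq   -- genomic sequence as an upper-case string (hypothetical input)
    start -- 0-based index of the first intron nucleotide (position +1)
    end   -- 0-based index of the first nucleotide of the downstream exon
    """
    up = 0
    # Walk upstream from position -1: end of the upstream exon versus
    # end of the intron.
    while start - 1 - up >= 0 and seq[start - 1 - up] == seq[end - 1 - up]:
        up += 1
    down = 0
    # Walk downstream from position +1: start of the intron versus
    # start of the downstream exon.
    while end + down < len(seq) and seq[start + down] == seq[end + down]:
        down += 1
    return up, down


# Toy example: exon1 + intron + exon2 (made-up sequences).
exon1, intron, exon2 = "CCATCAG", "GTAAGTCCCTTTCTCAG", "GTAAGACC"
genomic = exon1 + intron + exon2
up, down = boundary_match(genomic, len(exon1), len(exon1) + len(intron))
total = up + down  # contiguous match spanning the splice junction
print(up, down, total)
```

Because both walks start directly at the junction, any nonzero total necessarily covers position −1 or +1, which matches the counting rule stated in the text.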
Fig. 1. Similarity at boundaries of human NCBI introns. (a) The similarity of sequences at intron boundaries was assessed by alignment, as indicated. Only perfect matches that included at least one position directly abutting splice junctions were compiled. (b) Distribution of introns from the NCBI data set based on sequence similarity. As a control, perfect matches were computed from sets of boundaries generated from randomly combining 5′ splice site with 3′ splice site regions (Scramble). Standard deviations for the Scramble set are too small to be visible on the graph. A log value of −1 is meant to indicate zero occurrence.
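The Scramble control and the log-scaled distribution can be sketched as below. This is an illustrative reconstruction under stated assumptions: each boundary is represented as an equal-length window centered on the splice junction, the helper names are hypothetical, and the number of random recombinations used in the study is not specified here.

```python
import math
import random
from collections import Counter


def junction_match(five_region, three_region):
    """Length of the perfect contiguous match spanning the window midpoint.

    Both windows are assumed to have the same even length, with the splice
    junction located exactly in the middle.
    """
    mid = len(five_region) // 2
    up = 0
    while up < mid and five_region[mid - 1 - up] == three_region[mid - 1 - up]:
        up += 1
    down = 0
    while (mid + down < len(five_region)
           and five_region[mid + down] == three_region[mid + down]):
        down += 1
    return up + down


def scramble(five_regions, three_regions, rng):
    """Pair each 5' window with a randomly chosen 3' window (control set)."""
    shuffled = list(three_regions)
    rng.shuffle(shuffled)
    return [junction_match(f, t) for f, t in zip(five_regions, shuffled)]


def log_counts(lengths, max_len):
    """log10 of the number of introns per match length; -1 marks a zero count."""
    counts = Counter(lengths)
    return [math.log10(counts[n]) if counts[n] else -1.0
            for n in range(max_len + 1)]


# Toy windows (made up): 3 nt on each side of the junction.
fives = ["AAAGGT", "CCAGGT", "TTTGTA"]
threes = ["CCAGGT", "GGGACC", "TTTGTC"]
observed = [junction_match(f, t) for f, t in zip(fives, threes)]
control = scramble(fives, threes, random.Random(0))
print(observed, log_counts(observed, 6))
```

Repeating `scramble` with different seeds would give the spread of the control distribution, which the caption reports as too small to be visible.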
Table 1. Attributes of human introns with different levels of sequence similarity at boundaries
To improve the reliability of the data set, we extracted from the AceView database a subset of 183,852 human introns that were supported by cDNAs isolated from major compilation efforts and displaying a ≥99% match to genomic sequences. The analysis performed using this set (cDNA99) identified 105 introns with matching boundaries in the ≥15-nt category (Fig. 2 a and Table 1; SI Fig. 6). A χ² analysis indicated that the 0.71% excess of introns with matching boundaries ≥4 nt was highly significant (P < 0.00001) (SI Table 4). We also used the cDNA99 set to compare the distribution of introns in protein-coding regions vs. 5′ and 3′ UTRs. Each group contained a similar number of introns in the high-similarity category (Fig. 2 b–d). Given that the total number of introns in the protein-coding group is 6 and 12 times greater than the number of introns in the 5′ and 3′ UTR groups, respectively, introns with highly similar boundaries are more prevalent in the noncoding portions of genes. Their preferential retention in untranslated regions may be explained if these introns are not spliced efficiently. In this case, intron retention would be more deleterious when occurring in coding regions.
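The prevalence argument can be written out as a short calculation. As a simplifying assumption (the text states only that the numbers are similar), let each group contain the same number k of high-similarity introns, and let N_c, N_5, and N_3 be the total numbers of introns in the protein-coding, 5′ UTR, and 3′ UTR groups; these symbols are introduced here for illustration only.

```latex
\[
N_{5} \approx \frac{N_{c}}{6}, \qquad N_{3} \approx \frac{N_{c}}{12}
\quad\Longrightarrow\quad
\frac{k/N_{5}}{k/N_{c}} \approx 6, \qquad
\frac{k/N_{3}}{k/N_{c}} \approx 12 .
\]
```

Under this assumption, the relative frequency of high-similarity introns would be roughly 6-fold higher in 5′ UTRs and 12-fold higher in 3′ UTRs than in protein-coding regions.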
Fig. 2. Distribution of human introns with similar boundaries using the cDNA99 data set. The distribution was generated as in Fig. 1. The total distribution (a) was dissected into introns interrupting 5′ UTRs (b), protein-coding regions (c), and 3′ UTRs (d).
Notably, although most introns in the highest-similarity category were of the GT-AG type, GC-AG and AT-AC introns were strongly overrepresented in this category in all intron libraries (Table 1), indicating that duplicated segments containing AGGC or ACAT can also lead to intron gain. None of the AT-AC introns had a 5′ splice site sequence signature characteristic of U12-dependent introns. On average, human introns with matching boundaries in the ≥15-nt category were also smaller, suggesting that extra intronic sequences can be acquired over time. Pictograms displaying the nucleotide composition of a 30-nt region at intron borders indicate that the increased similarity at boundaries is associated with low pyrimidine content upstream of the 3′ splice site (SI Fig. 7). By contrast, a significant enrichment of purines was noted in the sequences flanking the splice sites in the ≥15-nt category, suggesting that the binding of SR proteins to purine-rich elements (21) may enforce the use of some of these splice sites.
Introns with Matching Boundaries Are Part of Tandem Repeats.
The presence of an extensive and contiguous stretch of identical nucleotides at the boundaries of an intron raises the possibility that the similarity might be more extensive. Thus, we assessed the similarity of the regions flanking a perfect match of any given length at intron boundaries. For a perfect match covering positions n to m, positions n − 1 and m + 1 are mismatches by definition. Similarity scores were computed for two 20-nt stretches covering positions n − 2 to n − 22 and positions m + 2 to m + 22. We find that perfect matches of ≥9 nt are flanked by regions that are more similar than the regions flanking shorter perfect matches (SI Fig. 8). Further analysis indicated that the average similarity ranged between 0.2 and 0.4 kb, with some regions extending for up to ≈2 kb (data not shown).
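The flanking-similarity measurement can be sketched as follows. The scoring scheme is not defined in the text, so this sketch assumes the simplest option, the fraction of identical positions; the function name, the window representation (two aligned, equal-length boundary strings), and the toy sequences are likewise assumptions.

```python
def flank_identity(five, three, n, m, width=20):
    """Fraction of identical positions in the windows flanking a perfect match.

    five, three -- aligned boundary windows of equal length
    n, m        -- 0-based inclusive indices of the perfect match
    Positions n - 1 and m + 1 are mismatches by definition and are skipped,
    so the flanks start at n - 2 (going left) and m + 2 (going right).
    Returns (left_identity, right_identity); None marks an empty flank.
    """
    left = range(max(0, n - 1 - width), max(0, n - 1))
    right = range(min(len(five), m + 2), min(len(five), m + 2 + width))

    def identity(indices):
        indices = list(indices)
        if not indices:
            return None
        same = sum(five[i] == three[i] for i in indices)
        return same / len(indices)

    return identity(left), identity(right)


# Toy example with 4-nt flanks (made-up sequences); the match spans 5..8.
five_window = "AAAA" + "C" + "GGGG" + "C" + "TTTT"
three_window = "AATT" + "G" + "GGGG" + "A" + "TTTA"
print(flank_identity(five_window, three_window, 5, 8, width=4))
```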
Remarkably, each one of the 55 NCBI introns with the most similar boundaries was part of a tandem repeat made up of 3–175 units (SI Fig. 9). The sequence similarity between the units of each array varied considerably among different arrays, likely reflecting the different ages of the arrays. Notably, 23 of the 55 introns could be grouped into eight families based on sequence similarity between arrays (SI Figs. 10 and 11). In two cases, one array was used to produce two different introns within the same gene (XM_929108 and XM_498627) (families 1 and 8; SI Fig. 11 a and h, respectively). Some of the arrays in families 1, 3, 4, and 7 belonged to genes that are likely paralogues, suggesting that gene duplication was responsible for disseminating the arrays in these cases. In addition, sequences similar to various portions of the Alu element in the sense or antisense orientation were identified in the arrays associated with 21 introns (SI Fig. 9). Sequences with similarity to mariner transposon elements, retrovirus-like MaLR elements, LINE family, or L1 elements, Mer2 interspersed repetitive sequence, and human satellite sequences were found in some of the arrays (SI Fig. 9).
Some of the repeat sequences associated with NCBI introns were also found in the AceView introns of the same category (SI Table 5), underscoring the success of a small number of repeats at disseminating introns. Moreover, this similarity in the sequence of the arrays extended to NCBI and AceView splicing units harboring less-similar intron boundaries (category 8–14 nt), which likely represent older introns. Thus, many precursors of modern arrays seem to have disseminated introns at earlier times during evolution. Nonetheless, 17 of the 55 NCBI introns had no relationship to introns in the 8- to 14-nt category (NCBI or AceView), suggesting that they represent the most recent introns of the set.
Orthologous Versions of Human Introns with Highly Matching Boundaries in Mouse and Chimpanzee.
The extensive stretch of perfect sequence similarity at the boundaries of human introns suggests a recent origin for these introns. We have examined this aspect by looking for equivalent regions in the mouse and chimpanzee genomes. The orthologous versions of human introns interrupting protein-coding sequences in the cDNA99 set were located by using axtNet pairwise alignments of human (hg17)/mouse (mm8) and human (hg17)/chimpanzee (panTro2). The alignments were further refined by using ClustalX and visual optimization. The mouse orthologous gene for SEMG1 could not be found. Most human introns did not align well with mouse sequences (Fig. 3 a; SI Fig. 12 a–i), suggesting a primate origin for several human tandem repeats. Importantly, the breeding of introns from tandem repeats may have operated independently in various mammalian clades, because the mouse genome also contained an excess of introns with matching boundaries (SI Fig. 13). When human and mouse ≥15-nt perfect boundary repeats were compared with one another by BLAST, a value below 20% was obtained (using an E-value cutoff of 10⁻⁶), indicating little overlap between human and mouse repeat sequences.
Fig. 3. Alignments between H. sapiens (hs), P. troglodytes (pt), and M. musculus (mm) orthologous regions. (a) A human/chimpanzee/mouse alignment for the CGREF1 intron region. (b–d) Three examples of human/chimpanzee alignments. PRH1 was taken from the NCBI36 set, whereas FBXO17 and COL18A1 are from the cDNA99 set. Asterisks indicate positions similar in all species aligned. Horizontal arrows indicate repeat units. Down and up arrows indicate 5′ and 3′ splice sites, respectively. Additional alignments can be found in SI Fig. 12.
We also noted differences in the human and chimpanzee alignments. Seven of 12 introns displaying identical splicing signals in the chimpanzee had a different length (Fig. 3 b and SI Fig. 12). In the majority of the cases, however, the sequence corresponding to one or both human splice sites was lacking (Fig. 3 c and d and SI Fig. 12), suggesting that these introns arose after the divergence from the chimpanzee ≈5 million years ago (22). However, given the provisional nature of the chimpanzee genome assembly, alignments performed with repetitive regions must be interpreted with caution. A definitive conclusion concerning the evolution of these introns in primates will require a confirmation of the sequence of these chimpanzee genes and the mapping of associated transcripts.
Nevertheless, our observations suggest that genomic duplication is responsible for the origin of many recent introns in mammalian genomes. The importance of this mechanism is almost certainly underestimated here, because we have confined our analysis to perfect contiguous matches at intron boundaries and have ignored a multitude of boundaries with extensive similarity that are interrupted by short gaps or mismatches (for a few examples, see SI Fig. 14).
Intron-Containing Repeats and Alternative Splicing.
To confirm that the NCBI introns in the ≥15-nt perfect-match category were bona fide introns, we performed RT-PCRs using total RNA from PC-3 cells. Sequence analysis of cloned amplicons from 26 of these units indicated that six units produced the expected spliced products. Remarkably, 60% of the units used novel splice site combinations (SI Table 2 and SI Fig. 15). This high value was only partly due to alternatively spliced Alu-containing introns (23), because disregarding the five Alu-containing units in the set still yielded a frequency of 50%. Using an assortment of samples, microarray analysis indicated that 74% of human multiexon genes are alternatively spliced (24). Given that human genes contain an average of six splicing units, this means that ≈10% of splicing units are alternatively spliced. This is consistent with a recent study reporting that ≈7% of the human exons conserved in mouse are alternatively spliced (25). Thus, our analysis of a single RNA sample indicates that transcripts containing repeat elements are alternatively spliced with exceptionally high frequency. The alternative splicing of transcripts carrying repeated elements has the potential to produce proteins with varying numbers of domains. This strategy may be exploited during human evolution if such proteins can provide an advantageous function.
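The genome-wide baseline quoted above can be reproduced as a back-of-the-envelope calculation. The sketch assumes, for illustration only, that each alternatively spliced gene contributes a single alternatively spliced unit out of its six; the symbols f_gene (fraction of alternatively spliced genes) and u (average number of splicing units per gene) are introduced here.

```latex
\[
f_{\mathrm{unit}} \approx \frac{f_{\mathrm{gene}}}{u} = \frac{0.74}{6} \approx 0.12 ,
\]
```

which is of the same order as the ≈10% cited in the text and well below the 50–60% observed for the repeat-containing units.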
Conclusions
Our study provides insights into the timing and mechanism of intron gain in mammals. The human and mouse genomes contain an excess of introns carrying perfectly similar boundaries often covering a stretch of ≥15 nt. That all human introns with the best-matching boundaries are embedded in tandem arrays of direct repeats indicates that genomic duplication has contributed actively to intron gain in human evolution. The similarity displayed by some repeats suggests that many intron-breeding tandem arrays have disseminated through gene duplication or by recombination into paralogous or unrelated genes. Moreover, related repeat sequences are found in seemingly older introns (i.e., introns whose perfect boundary match is less extensive), indicating that some intron-breeding repeats have disseminated on multiple occasions in human evolution. Nevertheless, the young age of the majority of these human introns is supported by the fact that they are almost always absent in mouse. Conversely, the mouse genome has evolved a set of intron-breeding repeats that have little sequence overlap with the human arrays. A provisional conclusion, based on the assumption that the chimpanzee sequences used are error-free, is that many of the introns with highly matched boundaries are human-specific. In support of the above conclusions, Zhang and Chasin (26) have recently observed that species-specific repeats are also a major route to exon gain.
Based on RT-PCR analysis, we estimate that 60% of the human introns with the most similar boundaries are alternatively spliced. Given that approximately one-third of these introns interrupt coding regions, their alternative splicing may help minimize the deleterious effect of duplications while simultaneously providing an opportunity to explore functional diversity. Globally, a tight association between clade-specific direct repeats and alternative splicing may have contributed to the evolution of different mammals, consistent with the view that alternative splicing has created hotspots for the evolution of mammalian protein sequences (27).
Last, at least 12 intron-breeding repeats identified in our study have been described as variable number of tandem repeats (VNTR) (minisatellites) and hence are highly polymorphic within the human population (28–44). Polymorphisms in the tandem-repeat sequence of the thymidylate synthase gene are used as markers for effectiveness of 5-fluorouracil in cancer patients (42–44). Moreover, polymorphisms in the VNTR of the sirtuin and CD209 genes are associated with human longevity and HIV-1 transmission, respectively (29, 37). These observations raise the intriguing possibility that individual-specific differences in such rapidly evolving and transcribed genomic regions have an impact on alternative splicing. Thus, individual-specific introns may be produced from transcribed VNTR, and this variability may contribute to the evolution of complex traits in the human population.
Methods
Bioinformatic Approaches.
The NCBI and AceView data sets were first systematically checked to minimize errors. “Fuzzy” introns, misaligned exons/introns, and data errors (in AceView) were removed after careful visual inspection and sequence comparisons. Only GT-AG, GC-AG, and AT-AC introns were used. The final NCBI and AceView data sets contained 183,774 and 304,008 introns, respectively. To calculate the number of perfectly identical nucleotides between intron boundaries, we started from the nucleotides directly at the splice junctions (positions −1/+1). The length of the perfect contiguous match was calculated in both directions. Only contiguous identities covering at least position −1 or position +1 were considered.
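The intron-type filter mentioned above can be sketched as a small helper. This is an assumption-laden illustration: the function names and the minimum-length check are hypothetical, and the manual curation steps (removal of “fuzzy” introns and misalignments) are not modeled.

```python
from typing import List, Optional

# Terminal dinucleotide pairs retained in the analysis, as stated in the text.
ACCEPTED_TYPES = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_type(intron: str) -> Optional[str]:
    """Return 'GT-AG', 'GC-AG', or 'AT-AC' for accepted introns, else None."""
    intron = intron.upper()
    if len(intron) < 4:  # too short to carry two distinct dinucleotides
        return None
    ends = (intron[:2], intron[-2:])
    return "-".join(ends) if ends in ACCEPTED_TYPES else None


def filter_introns(introns: List[str]) -> List[str]:
    """Keep only introns whose terminal dinucleotides are of an accepted type."""
    return [s for s in introns if intron_type(s) is not None]


# Toy input (made-up sequences).
candidates = ["GTAAGTTTCAG", "gcaagttag", "ATATCCTTAC", "GTAAAC"]
print([intron_type(s) for s in candidates])
print(len(filter_introns(candidates)))
```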
Data.
Annotated Homo sapiens (human) genome build 35 and AceView Gff data were downloaded from the NCBI database (www.ncbi.nlm.nih.gov and www.ncbi.nlm.nih.gov/IEB/Research/Acembly, respectively). Introns in the cDNA99 data set were derived from the AceView cDNA library. The accession origin of the cDNAs taken from the AceView library is as follows: the GenBank database (74,106), the HUGE protein database (KIAA; 2,805), the Millenium Galaxy Catalogue database (MGC; 32,859), the German Cancer Research Center database (DKFZ; AL or BX; 9,228), and the FLJ-DB database (AK; 29,711). The National Institute on Aging mouse gene index 4.0 was from the National Institute on Aging, National Institutes of Health (45), and was downloaded from http://lgsun.grc.nia.nih.gov/geneindex4/download.html. The assembled chromosomal data sets of H. sapiens (hg17 and hg18), Pan troglodytes [chimpanzee, University of California, Santa Cruz (UCSC), version panTro2], Mus musculus (mouse, UCSC mm8), and axtNet pairwise alignments of human (hg17)/chimpanzee (panTro2) and human (hg17)/mouse (mm8) were downloaded from UCSC Genome Bioinformatics (http://hgdownload.cse.ucsc.edu/downloads.html).
RT-PCR Analysis.
Oligonucleotide primers were designed to amplify a product generated by splicing of the tested intron. RT-PCRs were conducted by using DNase I-treated total RNA from PC-3 cells. RT-PCR was carried out using the Qiagen (Valencia, CA) One-Step RT-PCR kit with gene-specific reverse and forward primers. The RT-PCR products were analyzed on Caliper 90 (Caliper Life Sciences, Hopkinton, MA) to assess their size. In SI Table 2, the “RT-PCR product” column indicates our success at amplifying a product of the size indicated in the previous column identified as “RT-PCR product length.” Because we often saw other products of different sizes, we proceeded by shotgun cloning the products present in each RT-PCR after purifying the products using the QIAquick PCR purification kit (Qiagen). The RT-PCR fragments were cloned into a pBluescript vector. Plasmids were isolated, screened for inserts, and sequenced at the McGill University and Genome Quebec Innovation Centre (McGill University, Quebec, Canada). Clones were assessed for splicing events that corresponded to the expected or novel products (“splicing events expected” or “splicing events novel” columns, respectively).
Acknowledgments
We thank Valérie Watier, Ulrike Froehlich, and Geneviève Dufresne-Martin for help with RT-PCR and cloning and Jean-François Lucier and Marc Dumoulin for bioinformatics support. We thank R. F. Doolittle, L. Bonen, R. Wellinger, S. Zimmerly, D. Simon, and E. Paquet for comments on the manuscript. We thank Réseau Québecois de Calcul de Haute Performance and the Centre for Computational Science at the Université de Sherbrooke for access to the Mammouth Linux cluster. This work was supported by grants from Genome Quebec and Genome Canada. S.A.E. is a Research Scholar Senior of the Fonds de la Recherche en la Santé, Quebec. B.C. is a Canada Research Chair in Functional Genomics.
Footnotes
- ‡To whom correspondence should be addressed. E-mail: benoit.chabot{at}usherbrooke.ca
Author contributions: D.Z. and B.C. designed research; D.Z. and R.M. performed research; D.Z., R.M., S.A.E., and B.C. analyzed data; and D.Z. and B.C. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS direct submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0604777104/DC1.
- Abbreviation: NCBI, National Center for Biotechnology Information.
- © 2007 by The National Academy of Sciences of the USA


