Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / EVOLUTION
Thousands of human mobile element fragments undergo strong purifying selection near developmental genes
,

*Center for Biomolecular Science and Engineering and
Howard Hughes Medical Institute, University of California, Santa Cruz, CA 95064; and
Departments of Developmental Biology and Computer Science, Stanford University, Stanford, CA 94305
Edited by Susan R. Wessler, University of Georgia, Athens, GA, and approved March 20, 2007 (received for review December 18, 2006)
| Abstract |
|---|
|
|
|---|
exaptation | genome evolution | transposon | vertebrate cis-regulation
Complex metazoans seem to harbor significantly more conserved non-protein-coding sequence than simpler organisms (4). In vertebrates, many of these regions seem to serve as regulatory elements controlling the transcription of nearby genes (5–8). The evolution of regulatory regions is believed to be a major force behind the observed morphological diversity within the vertebrate lineage (9, 10), yet how this additional regulatory sequence was created is currently far from understood.
More than 50 years ago, when transposable elements were first discovered, B. McClintock (11) termed them "controlling elements" because of how they affect the expression of neighboring genes. Fifteen years later, Britten and Davidson (12) expanded this idea by hypothesizing that repetitive elements can act to distribute regulatory sequences throughout the genome and, in doing so, enriching, possibly even creating, whole pathways.
First glimpses of this phenomenon were explored in the pregenomic era and compiled into a hand-curated list of cases where researchers had come across individual mobile element instances that acquired a cellular role (13), a process termed "exaptation" (as opposed to adaptation) by Gould and Vrba (14). In the early genomic era, 1 Mb of the human and mouse genomes was examined for exaptation of mobile elements (15). A later analysis of 1.9 Mb of the human genome sequenced in 28 additional mammals came up with another handful of ancestral repeats evolving under strong purifying selection (16). More recent works, focusing on large families of constrained paralogous non-protein-coding sequences (17, 18), were able in two cases to explicitly implicate these families as originating from mobile elements (19, 20). Recent work has also elucidated that some mobile elements may be rich in transcription factor-binding sites (21). Combined, these observations suggest that the ideas of McClintock, Britten, and Davidson should be revisited on a genomic scale.
Here, we perform a genome-wide scan for mobile element instances exapted into putative cis-regulatory roles, by analyzing a large set of constrained nonexonic sequences with clear repetitive origins. We find this set by looking for repetitive origins in a conservative set of putative cis-regulatory regions, which covers <1.5% of the human genome and has been under strong purifying selection since the boreoeutherian ancestor (100 Mya), predating the human–dog split. We show that even by these conservative measures, thousands of constrained nonexonic elements (CNE), totaling over one million bases, including >5% of all CNEs unique to mammals, were deposited by interspersed repeats. These elements are significantly enriched near genes associated with the regulation of transcription and development. We also show that particular repeat portions are preferentially exapted into nonexonic functions and examine the reelin pathway, where all known receptor-related genes have acquired similar putative regulatory regions by conserving a repeat instance of the same type.
| Results |
|---|
|
|
|---|
We used only the highest scoring elements from each method, and augmented these elements with clear syntenic alignments between human and chicken, frog, fugu, tetraodon, or zebrafish; no neutrally evolving DNA should be alignable at these distances (23). Combined, these regions cover 3.5% of the human genome, constituting a conservative set compared with the 5% or more believed to be under purifying selection (1).
To obtain a nonexonic subset, we filtered out all regions found in any known or reliably predicted mature transcript (see Methods). Remaining regions were then required to be within syntenic alignments between human and chimp, rhesus, rat, mouse, and dog, leaving us with 1.45% of the genome as constrained boreoeutherian nonexonic elements. Each of the four conservation measures uniquely contributes >8% of this set, attesting to the value of combining rather than arbitrating between them.
In each of these six species, we then intersected this set with mobile element subfamilies annotated by RepeatMasker (24, 25). We used only mobile element subfamilies that have a presence in primates, rodents, and dog. Because these subfamilies appear across the boreoeutherian subtree, we term them "pan-boreoeutherian" [supporting information (SI) Text, section S1, and SI Fig. 5]. The intersection of our conserved nonexonic elements with the pan-boreoeutherian repeat subfamilies resulted in a set of 10,402 highly constrained nonexonic elements with clear repetitive origins. All elements are at least 50 bp long, with a maximum of 489 bp and a mean of 100 bp. The set covers just over 1 Mb (0.04%) of the human genome.
Data Set Validation. We used a second set of tools to reaffirm that these regions are indeed mobile element fragments evolving under purifying selection. First, we used Blastz (26) to realign all repeat consensus sequences to the human genome. Using sensitive thresholding, we were able to recover 98% of the constrained regions. Secondly, we validated that these regions are indeed evolving under purifying selection. The regions resisting insertions and deletions were previously shown to have a false discovery rate (FDR) of 1% (22). Using PhyloP (27) to compute the likelihood of a given multiple alignment under the species tree of neutral substitutions, all elements except 20 rejected the neutrality assumption at a FDR of 1%. The 20 exceptions all evolve less stringently within mammals, but each has a clear (>70%id) match to an orthologus region in a non-mammal and all were thus retained.
Constrained Regions Originate from All Walks of Transposon Life. Fig. 1 shows the distribution of constrained nonexonic bases with respect to the progenitor mobile element. Strikingly, despite our stringent filtering, all four characterized classes of repeats are present, with long interspersed elements (LINEs) and short interspersed elements (SINEs) contributing the bulk of the constrained nonexonic sequence.
|
Specific Parts of Mobile Elements Tend to Be Exapted. In the vast majority of instances, only a portion of the mobile element, rather than its entire length, exhibits extreme conservation. Truncation is a well known phenomenon in LINE repeats, where newly integrated copies are often truncated to varying degrees at their 5' end (30). This phenomenon is apparent in a histogram showing how many times each base in the LINE consensus appears in the human genome (Fig. 2 A and B). Yet, a similar histogram of only exapted consensus regions departs markedly from this background, peaking at very different regions for both the L2 and L3 elements. This difference is suggestive not only of exaptation per se, but of one that depends on the sequence content of the LINE elements themselves. It could be that these sections of the LINEs are functional upon insertion, or become so after a few fortuitous mutations, and are therefore more likely to be exapted [as was previously observed for exonic exaptations (19, 31)]. SI Fig. 6 A and B gives two additional examples for other classes of repeats.
|
This model is not biased by gene length because it uses only the location of the TSS. Nor is the model biased by genes residing next to gene deserts vs. those found in tight gene clusters, because larger basins of attraction result in a more probable null assignment. The overall top scoring GO term for this set is "development" (uncorrected P < 2 x 10–75). Many of its subterms are highly enriched as well, such as "system development" (4 x 10–55) and "nervous system development" (4 x 10–53). The second best scoring term is "transcription regulator activity" (2 x 10–72) along with many of its subterms. Of the more specific terms, we find particular enrichment for "cell recognition" (4 x 10–23) and related terms such as "neuron recognition," "transmembrane receptor protein tyrosine kinase signaling," "GPI anchor binding," and "cell adhesion" (3 x 10–14). SI Table 4 shows the top scoring GO terms for this test.
The exapted regions tend to cluster around individual genes, often 20–30 instances within 1 Mb. To ensure that large clusters around a handful of genes are not the sole cause for GO term enrichment, we used the same association rule to assign exaptations to genes, but instead of using a uniform null distribution over all bases in the genome, we used a hypergeometric distribution over genes, now allowing each gene to be selected only once. The GO categories of "development" and "transcription regulator activity" were again at the top of the list with somewhat diminished but still very significant P values (8 x 10–24 and 6 x 10–19, respectively). "Cell adhesion" (6 x 10–11) and related terms also featured prominently. SI Table 5 shows all of the top scoring GO terms for this second test.
One may also suspect that mobile elements in general congregate near genes enriched for the observed GO terms, perhaps because the chromatin surrounding these genes is more accessible during germline transposition. This localization would cause any random subset of mobile elements to appear highly enriched for the same GO terms. To eliminate this possibility, we conducted a third, hypergeometric test where we assigned GO terms to the set of all mobile element instances from pan-boreoeutherian subfamilies, according to the above association rule of closest TSS within 1 Mb. We then computed the likelihood that the observed GO terms for genes near exapted CNEs can be obtained by selecting a random subset of repeats. This test resulted in extreme significance for the same terms, including "transcription regulator activity," "development," and "cell adhesion" (3 x 10–64, 3 x 10–60, and 2 x 10–15, respectively). The results of this third statistical test demonstrate that the enrichment we see is not due to an insertion bias of mobile elements, because the distribution of mobile elements exapted as CNE sequence is significantly different from the distribution of all mobile elements from these same families. Landing near a gene associated with development or transcriptional regulation makes an interspersed repeat much more likely to be exapted as a CNE. SI Table 6 shows all of the top scoring GO terms for this test, and SI Text, section S2, gives formal definitions of all three tests.
In search of subsets of exaptations whose annotation would suggest specific specialization, we also investigated GO enrichments at the different taxonomic levels of interspersed repeats (classes, families, and subfamilies), as well as for smaller sets of exaptations of the same consensus portion, or of smaller subsets of the most sequence similar exapted CNEs. Overall, these smaller categories were found enriched for the same functional annotation as the all-inclusive large set (SI Text, sections S3, S7, and S8, and SI Tables referenced therein).
We compared a histogram of exapted CNEs distance from the nearest TSS with that of all bases in the human genome, as well as that of all repeat instances from pan-boreoeutherian subfamilies. As Fig. 3 shows, whereas both background distributions are similar, the exapted CNE set is clearly enriched for lying 0.1–1 Mb away from the nearest gene, at the expense of closer localizations. This enrichment suggests that these CNE are preferentially involved in distal cis-regulation.
|
|
Exaptations in the Reelin-Signaling Pathway. Spurred by Britten and Davidson's early hypothesis (12), we attempted to investigate whether exapted elements, as a whole or broken by taxonomic groups, are also enriched for in particular molecular pathways (SI Tables 17 and 18). Unfortunately, mammalian pathway annotation is currently in its infancy, with only a small fraction of pathways annotated. Nonetheless, our attention was drawn to the reelin-signaling pathway, which allows neurons to complete their migration in the developing brain. Both the L1 family of LINEs and the MIRb subfamily of SINEs have at least one exaptation near each of the four genes that are known to be involved in response to the extracellular RELN signal: VLDLR, LRP8 (ApoER2), DAB1, and FYN. VLDLR and LRP8 are transmembrane receptors that, when bound by RELN, cause the tyrosine phosphorylation of DAB1 by FYN (34). Both enrichments are equally unlikely against the background genomic distribution of L1s and MIRb (5 x 10–6 and 7 x 10–6, respectively). The pathway itself is not completely understood downstream of these four genes. Interestingly, it is thought that some of the downstream targets could be cell adhesion molecules (35, 34), matching our observation above for the enrichment of exaptation events near these genes.
The MIRb exaptations all originate from overlapping sections of the MIRb consensus. It is thus plausible that these instances add similar regulatory regions to each gene in the receptor pathway. To examine this hypothesis, we identified potential transcription factor-binding sites orthologously conserved between human, chimp, macaque, rat, mouse, and dog (see Methods). Each of the four genes has an instance near it that contains orthologously conserved sites for En-1, Oct-1, YY-1, SRY, and v-Myb. However, whereas the position of these binding sites is conserved between species for each gene separately (CNE orthologs), no single predicted binding site comes from the same bases of the progenitor MIRb for all four genes (CNE paralogs). In fact, even irrespective of known binding sites, there are almost no columns where a base is perfectly conserved across all four paralog subtrees. The MIRb copies near each gene seem to have diverged differently from the consensus. However, the progenitor consensus sequence often has multiple predicted binding sites for each of the five transcription factors. Some orthologs seem to have conserved one or more of the instances, whereas their paralogs have conserved different binding sites for the same factor, thus presumably retaining function while diverging in sequence (SI Fig. 7).
Flux of CNE Exaptation from Mobile Elements.
Our conservative set of 10,402 exapted CNEs implicates 4% of all CNEs (2.5% of CNE bases) predating the human–dog split as having clear origins in mobile element insertions. Some CNEs, however, are as old as the vertebrate lineage itself (36). Current estimates suggest that repeat families can be recognized only if they are younger than 200 million years (Myr) (37), implying that some observed CNEs may well have evolved from exaptation of repeat families that have since decayed to the point where they cannot be recognized as such. We can thus attempt to refine our estimate of CNEs from mobile element origins, by considering only the subset of CNEs born after the avian–mammal split, represented by all 180,954 CNEs not found in the chicken genome. Of the identified exapted regions, 9,903 have no clear syntenic ortholog in chicken, suggesting that at least 5.5% of all CNEs born on this branch are from mobile elements. This estimate should increase as closer outgroups to the carnivore split, such as platypus and opossum are published (see SI Text, section S9, and SI Fig. 8). Normalizing for the estimated branch length between the avian and carnivore splits, we obtain a lower bound on the rate of exaptation on this branch that is
22,000 mobile elements exapted as CNEs for every substitution per site of branch length. The number of sampled species and branch lengths on the primate tree is currently insufficient to identify elements that have come under purifying selection during this time frame. However, well established examples make it clear that the mechanism of interspersed repeat exaptation into gene regulatory roles persists in the primate lineage (38, 39). A hypothetical extrapolation from the above estimate using the branch length from the extant human genome back to the speciation of Galago, one of the most distantly related primates, suggests that a substantial set of 2,650 CNE elements under strong purifying selection have been exapted from mobile elements in humans since this early primate ancestor.
| Discussion |
|---|
|
|
|---|
Britten and Davidson (12) hypothesized that the dispersion of repetitive sequences with strong exaptation potential throughout the genome could allow for a whole "battery" of genes to suddenly become coregulated, augmenting an existing pathway, or even creating one from scratch, especially in the context of development. Remarkably, our much more recent appreciation for the complexity and modularity of vertebrate gene regulation serves only to strengthen this early insight. Exapted elements are indeed extremely enriched for clustering near developmental genes, even when considering the background distribution of transposon insertions. In fact, transposons seem to be biased against inserting and remaining near genes involved in developmental regulation (40). Thus, our enrichment is not due to an insertional bias of transposons, but rather a bias in retention, suggesting that they may carry something that may affect the regulation of these genes either beneficially or detrimentally. Developmental functions also dominate the list of genes flanking the largest spatial clusters of exapted elements in the genome. By transposing into the chromatin-accessible region surrounding an active transcription start site, mobile elements may seed novel transcription factor-binding sites and, through both functional and nonfunctional insertions, repeatedly drive older cis-regulatory elements further away from their target genes. Thus, whereas Britten and Davidson (12) did not foresee distal cis-regulation at distances of a megabase 36 years ago, their theory of transposon-mediated regulatory network evolution indirectly predicts it, and our observations provide circumstantial support for this theory. However, to fully verify their hypothesis, we must understand how these exapted CNEs affect developmental gene expression in the context of their regulatory networks (41).
Reelin signaling is believed to result in the activation of genes involved in cell adhesion, a theme that has been seen often in recent papers exploring the evolution of regulatory elements. Exapted instances of the previously discovered LF-SINE, which is thought to have been most active at the base of the tetrapods, are enriched near genes involved in cell adhesion (19). The elements we explore here originated mostly along the mammalian branch and also show significant enrichment for being near cell adhesion genes. A recent work looking at rapidly evolving regions in the human lineage also reports a strong enrichment near genes involved in cell adhesion (7). These results suggest that cell adhesion genes (perhaps mostly those involved in brain wiring) have been constantly refining their expression patterns throughout the last 300 Myr and into the present day.
The majority of current whole-genome experimental and computational approaches to gene regulation, such as tiling arrays used in ChIP-chip experiments (42), and transcription factor-binding site prediction (43), choose to ignore repetitive regions, for pragmatic reasons, assuming that most if not all are inert. Our analysis, however, suggests that, whereas the fraction of repeat copies that have come under strong purifying selection is indeed small, as a fraction of all putative regulatory elements under the same selective pressures, they constitute a pronounced minority. Indeed, as our appreciation for the contributions of repeats to different aspects of genome evolution continues to grow (44), it now seems that these unwanted, and often ignored, children of the genome played multiple crucial roles during the evolution of the human lineage.
| Methods |
|---|
|
|
|---|
Generation of Constrained Nonexonic Elements. Three mammalian sources of conserved elements were used: top-scoring elements resisting insertion and deletions from ref. 22 covering 2% of the human genome; same-sized set of elements resistant mostly to substitutions, generated by running phastCons (4) on a syntenic multiple alignment of human, chimp, macaque, mouse, rat, and dog; and a same-sized set comprising all sliding windows along a syntenic alignment of human, mouse, and dog, where 42 of the 50 bases (84%) of alignment columns were identical, resisting both substitution and in-dels. All regions of the human genome that syntenically aligned to chicken, frog, tetraodon, zebrafish, or fugu at 70% identity or better over at least 50 bases were also added to our final set. We filtered the resulting set of constrained elements by removing bases overlapping any known or reliably predicted mature transcript: refSeq and UCSC known genes, any GenBank cDNA/mRNA reliably alignable to human from a variety of species, human spliced ESTs, known and predicted pseudo genes, RNA genes, micro RNAs, and all Ensembl and Exoniphy gene predictions (45). After filtering, our constrained nonexonic set totaled 1.45% of the human genome.
Comparison with Neutral Rate. We used a model of neutral evolution computed by PhyloP (27) from 4-fold degenerate sites in the ENCODE regions (46).
Calculating GO Enrichment. All UCSC hg18 Known Genes (45) splice variants were combined into human gene loci. Each locus was assigned a representative TSS and the union of all Gene Ontology (GO) annotations (33) assigned to its variants. All loci lacking meaningful GO annotation were removed, leaving a set of 14,277 annotated loci.
Identification of Potential Binding Sites. We used the Transfac free matrices (version 6.0) and search tools to identify potential transcription factor-binding sites (47). All sites were found by the P-Match search tool while minimizing the sum of false-positive and false-negative hits.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: CNE, constrained nonexonic element; GO, Gene Ontology; TSS, transcriptional start site; LINE, long interspersed element; SINE, short interspersed element.
To whom correspondence should be addressed. E-mail: bejerano{at}stanford.edu
Author contributions: C.B.L., G.B., and D.H. designed research; C.B.L. and G.B. performed research; C.B.L. and G.B. analyzed data; and C.B.L., G.B., and D.H. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0611223104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
A. Tsirigos and I. Rigoutsos Human and mouse introns are linked to the same processes and functions through each genome's most frequent non-conserved motifs Nucleic Acids Res., May 1, 2008; (2008) gkn155v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Sasaki, H. Nishihara, M. Hirakawa, K. Fujimura, M. Tanaka, N. Kokubo, C. Kimura-Yoshida, I. Matsuo, K. Sumiyama, N. Saitou, et al. Possible involvement of SINEs in mammalian-specific brain formation PNAS, March 18, 2008; 105(11): 4220 - 4225. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Stephen, M. Pheasant, I. V. Makunin, and J. S. Mattick Large-Scale Appearance of Ultraconserved Elements in Tetrapod Genomes and Slowdown of the Molecular Clock Mol. Biol. Evol., February 1, 2008; 25(2): 402 - 408. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Karolchik, R. M. Kuhn, R. Baertsch, G. P. Barber, H. Clawson, M. Diekhans, B. Giardine, R. A. Harte, A. S. Hinrichs, F. Hsu, et al. The UCSC Genome Browser Database: 2008 update Nucleic Acids Res., January 11, 2008; 36(suppl_1): D773 - D779. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Green 2x genomes Does depth matter? Genome Res., November 1, 2007; 17(11): 1547 - 1549. [Full Text] [PDF] |
||||
![]() |
M. Pheasant and J. S. Mattick Raising the estimate of functional human sequences Genome Res., September 1, 2007; 17(9): 1245 - 1253. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||