Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / GENETICS
Identifying pattern-defined regulatory islands in mammalian genomes


Department of Chemistry and Biochemistry, University of Colorado, Boulder, CO 80309
Communicated by Marvin H. Caruthers, University of Colorado, Boulder, CO, May 1, 2007 (received for review March 15, 2007)
| Abstract |
|---|
|
|
|---|
1.1% of the half-billion base pairs covered in the search and are located mainly in noncoding regions of the genome. We show that the premise of PRIs can be used to identify previously known and novel cis-regulatory regions controlling genes regulated by myogenic differentiation. Thus, PRIs may represent a fundamental property of the architecture of cis-regulatory elements in mammalian genomes, and this feature can be exploited to pinpoint critical transcriptional regulatory elements governing cell type-specific gene expression.
cis-regulatory elements | combinatorial regulation | computational prediction | myogenesis | transcription
The sequencing of many genomes has enabled comparative genomics approaches to annotate functions of genomic sequences. Indeed, exploiting phylogenetic conservation has proved to be a productive approach for finding cis-regulatory elements in less organismically complex species where the regulatory regions tend to be short and compact (3–10). Unfortunately, identifying cis-regulatory elements in vertebrate genomes is difficult because of the increased size and complexity of their genomes. For example, functional cis-regulatory elements typically consist of short stretches of DNA sequences (<500 bases) and are often scattered in the noncoding regions hundreds or thousands of base pairs away from the transcriptional start site (11). In addition, the increased size of the genome implies that an increased number of known binding-site sequence motifs will occur by random chance, even though only a fraction of them are likely to figure prominently in transcription regulation. Finally, although genomewide location analysis using ChIP-on-chip may ultimately elucidate many transcription factor binding sites in mammalian genomes, identification of cis-elements by this approach is impeded by the daunting size of the genomes and the one-by-one nature of probing transcription factor binding. Therefore, unraveling functional cis-regulatory elements buried in the genome remains a challenge because of the insufficient knowledge of the organization and architecture of regulatory DNA.
A defining feature for mammalian transcriptional regulation is combinatorial control, in which multiple transcription factors are required simultaneously to regulate gene expression (2, 12, 13). However, it is unclear what characteristics of functional cis-regulatory regions will facilitate their identification in the context of combinatorial control. A number of studies suggest that combinatorial regulation often relies on direct interactions between participating transcription factors and that these interactions depend on proper spatial orientation of transcription factors with respect to each other (2, 12, 13). In addition, there is significant conservation at the level of transcriptional regulation among human, mouse, and rat species and functional cis-regulatory elements frequently cluster in small regions of the genome (2, 12, 13). An offshoot of this observation is that the architecture of cis-regulatory sequence motifs will likewise be conserved. These fundamental features prompted us to develop a motif-centric computational algorithm to search for regions in mammalian genomes that harbor clusters of multiple transcription factor binding sites that are absolutely conserved with respect to order and spacing across three mammalian genomes (human, mouse, and rat).
| Results |
|---|
|
|
|---|
The algorithm for finding PRIs includes the following steps [supporting information (SI) Fig. 5]. First, we developed a custom curated version of the transcription factor database (TFD) (14, 15). The TFD contains experimentally characterized transcription factor binding sites. Our custom curation minimized redundancy and eliminated many of the poorly defined binding sites that could potentially affect the robustness of the algorithm. Second, we located all putative transcription factor binding sites from the TFD in the same orientation as the gene feature within 20 kb on either side of the start codon of all annotated gene orthologs in all three species [determined by the participation of a given gene in a homologene group as defined by the National Center for Biotechnology Information (16)]. This amounted to 13,520 sets of sequences. Third, we identified PRIs for each gene with a pattern-matching approach that locates clusters of binding sites whose order and spacing are conserved across the three genomes. Spacing conservation was determined by calculating the difference in genomic position of a given binding site in the human genome with respect to its position in the mouse and rat genomes. We called this set of numbers the genomic position offset (GPO). Binding sites with the same GPO in one 40-kb region were grouped into a PRI.
To define a minimum number of binding sites within a PRI in an effort to ensure that the PRIs found are statistically significant, we evaluated the likelihood of different numbers of binding sites occurring in concert in random sequences. Specifically, we selected the 1,000 sequence sets in our database with PRIs containing the greatest number of binding sites and extracted the 40-kb region centered about the start codon of each of these genes in each genome. Each of these sequences was then randomized while maintaining GC distribution and subjected to the PRI search algorithm. This procedure was repeated 100 times per gene. In total, we obtained 100,000 sets of randomized sequences that were searched for PRIs (see SI Text for further details).
A histogram depicting the distributions of PRIs in the scrambled and natural sequences is shown in Fig. 1. We found that PRIs from the natural sequences generally had more binding sites than those from the scrambled sequences. Only 1% of PRIs from scrambled sequences contained seven or more binding sites, implying that we could be 99% confident that PRIs with seven or more binding sites from natural sequences are unlikely to occur by chance and likely represent a cis-regulatory region. Although it is quite likely that functional PRIs exist with fewer binding sites, we wanted to set a conservative threshold to account for any residual redundancy in the TFD. Therefore, the number of binding sites should be thought of as a computational index and one that could change as the customized TFD expands. With the current database, we computationally define a PRI as a region in the genome in which at least seven distinct binding sites are conserved in order and spacing across the three mammalian genomes.
|
540,800,000 bases, or about one-sixth, of the human genome. Of this, we estimate that
6,154,852 bases, equal to 1.1% of the searched regions, are covered by PRIs. The mean PRI size is 335 bp, even though we placed no constraints on PRI size but did constrain number of binding sites. It is important to note that any apparent clustering of binding sites is a natural feature of the genomic organization and is not imposed artificially. To ease exploration of the PRI database, we created a web-based interface (http://barcode.colorado.edu/pri/). (For more information relating to the web site, please see SI Text, and see SI Table 1 for a catalog of all binding sites conserved within PRIs to be examined in this study). Using the interface, we recovered many cis-regulatory regions in our database that have been reported in the literature, including well characterized regions from IL-2, myogenin, and CDC2.
To make a broader assessment of the biological relevance of PRIs, we surveyed 100 regulatory regions reported in the literature to ask how frequently PRIs in our database overlap with regions defined by functional studies. As shown in SI Table 2, documented functional transcription regulatory elements can be matched with the PRI database by using the default search on the web site in 54 of the 100 cases examined. There are several reasons that could explain why the PRI algorithm did not find all regions. First, nine of the genes controlled by these regions are not found in a National Center for Biotechnology Information Homologene group. In addition, it is probable that not all genes are subject to combinatorial regulation. It is also important to note that any very distal elements would not have been covered by the search. Finally, binding sites that have yet to be discovered or that do not conform to a canonical consensus sequence would not be in the TFD. However, new binding sites can be added to the TFD as they are discovered. Overall, our analysis suggests that the PRI database is highly enriched with experimentally defined functional regulatory regions.
In an effort to assess the uniqueness of the PRI approach, we evaluated the performance of other programs dedicated to genomewide identification of cis-regulatory regions across species, PReMod (17) and EEL (18). We used the default search parameters for each database. Interestingly, EEL did not predict any of the same 100 literature-defined regions tested with PRI. Conversely, PReMod finds 63 of 100; 43 of these overlap with those found by PRI (SI Table 2). Curiously, of 12 regions heretofore unexplored with respect to transcriptional regulation but predicted by PRI, PReMod predicts only 4, yielding overall totals of 66 of 112 regions for PRI and 67 of 112 for PReMod. As detailed below, we tested these regions for their potential as genomic sites of transcriptional regulation. (For a more detailed comparison of these two predictive programs, please see SI Text.)
Identifying Skeletal Muscle-Specific PRIs. A corollary to the above survey is that the PRI approach can be used to discover novel, biologically relevant cis-regulatory regions on a genomewide scale. To test this proposition, we experimentally investigated as yet uncharacterized regulatory regions of genes that may be involved in myogenic differentiation. Myogenesis is principally orchestrated by a series of transcriptionally controlled events governed by myogenic regulatory factors (MRFs) (19, 20). A key MRF is MYOD, which binds to a short, degenerate consensus binding site (CANNTG) called an E-box (20). The probability of finding an E-box in the genome is 1/256, only a few of which are likely to be functional. However, MYOD often acts in concert with other transcription factors (20); therefore, we expect that functional E-boxes occur within PRIs. Examples of transcription factors known to cooperate with MYOD are members of the MEF2 family (19, 20) and the SRF family (21). Thus, as a result of the characteristics of the E-box and the combinatorial nature of MYOD-mediated transcriptional regulation, finding functional PRIs in myogenesis poses an excellent challenge to the PRI hypothesis.
To test this hypothesis, we developed a systematic approach with several criteria to sift through the PRI database. We first extracted all PRIs with E-boxes and/or MEF2 binding sites. Next, we crossed this list with a list of genes up-regulated during myogenesis as revealed by several microarray studies (22–24). We deemed this step necessary because many other transcription factors in addition to MYOD are known to bind E-boxes and MEF2 family members can regulate other processes, such as neurogenesis (25). We chose 12 genes, of which the regulatory elements that mediate responses to myogenic signals in mice are unidentified (TNNI2, AQP1, EDA2R, PGCP, PKIA, CPA1, FHL3, MAX, RDH5, SOX6, TPM2, and USP2), for further study. Schematic representations of the organization of associated PRIs are displayed in SI Fig. 6. Some of the PRIs display unexpected features. For example, the PRI for FHL3 is found in the intron of an upstream gene (UTP11L) in the opposite orientation, and the PRI for EDA2R encompasses an exon. Although it is generally believed that transcriptional regulatory regions do not reside within exons, we still were interested in pursuing this region.
We then proceeded to verify that these genes are up-regulated during myogenesis with real-time PCR. We chose to study these candidate PRIs in the C2C12 mouse myoblast cell line as it expresses MYOD and can recapitulate skeletal myotube formation. Fig. 2A tabulates the fold change in transcript level after either 1 or 4 days of serum deprivation. We included the genes SERPINE1 and CUEDC2 as negative controls as the expression of SERPINE1 is known to be regulated in response to TGF-
through an E-box that does not lie within a PRI (26) and CUEDC2 harbors a MEF2 binding site in its upstream region that also is not covered by a PRI. MYOG serves as a positive control for this study as it is a well studied muscle-specific gene whose documented regulatory region (27, 28) overlaps with a PRI. Of the 12 candidate genes, 10 were up-regulated in at least one time point tested. The remaining two, AQP1 and MAX, were not amenable to real-time PCR analysis. However, AQP1 is known to be strongly expressed in adult cardiac and skeletal muscle (29), and this is corroborated by the expression pattern of AQP1 on a microarray following muscle regeneration after injury (23). In addition, it can be demonstrated by microarray (22) that transcript levels of MAX are induced as an early myogenic event.
|
-responsive region (3APP) of SERPINE1 is not bound by MYOD at the E-box known to be critical for TGF-
signaling. Additionally, the positive control region from MYOG is bound by both MYOD and MEF2 as expected.
To test the capacity of candidate PRIs to drive transcription during myogenic differentiation, six of these PRIs were cloned into a luciferase reporter gene vector upstream of a minimal TATA box-containing promoter and subsequently transfected into C2C12 cells. We induced cell differentiation and measured luciferase activity in GM and differentiated myotubes 1 day after induction. As shown in Fig. 3A, minTATA, which is the parental construct for all of our cloned PRIs, has minimal basal activity and just 1.8-fold induction. In contrast, mMYOG –184
+11, a construct that harbors the same region probed in the ChIP experiments, showed robust transcriptional activity and induction upon myogenic differentiation. The 3APP region of SERPINE1 displays minimal activity during myogenesis and is induced only 1.3-fold, suggesting that an E-box alone without the proper combination of other transcription factor binding sites is insufficient for mediating gene activation during myogenic differentiation. The six PRIs showed robust basal transcriptional activity as well as induction greater than the minTATA construct upon myogenic differentiation.
|
We were therefore able to successfully mine our PRI database by asking candidate PRIs to fulfill the following criteria: (i) the PRI contains binding sites for transcription factors known to regulate myogenesis; (ii) the associated gene is differentially expressed during myogenesis as demonstrated by microarray and real-time PCR; (iii) the transcription factors for which there are conserved binding sites in the PRI bind this region as assessed by ChIP; and (iv) the PRIs can drive transcription during myogenesis and depend on the integrity of conserved binding sites as determined by reporter gene assays.
Assessing the Importance of Spacing Between Conserved Binding Sites in PRIs. We next explored the idea that the identity of the intervening bases between binding sites is not critical. An objective variation in spacer base identity can be found in the form of orthologous human sequences. As an illustration of this idea, SI Fig. 7 depicts the conservation of binding sites and the differences between spacing base pairs of human and mouse versions of PRI 8 of AQP1. While there is 100% conservation at the level of the conserved binding sites, only 61% of the intervening sequence is conserved, which is satisfying because it is a good illustration of what we expect to find given the PRI hypothesis is true. Fig. 4A demonstrates that the human ortholog of mouse AQP1 PRI 8 is quite active in our experimental C2C12 system, supporting the notion that the identity of spacer bases between conserved binding sites in PRIs is not crucial.
|
| Discussion |
|---|
|
|
|---|
The experiments presented here have verified the idea that binding-site order and spacing are defining features of regulatory regions, particularly those that serve as sites of combinatorial control. What we do not know is how stringent these criteria should be. As enhancer properties are not well enough defined to be certain how relaxed the criteria can be, we allowed no tolerance for any deviations in binding-site order and spacing. Programs such as PReMod are more flexible from this point of view. Although PReMod also performs well, it is possible that this relaxation with respect to conservation could result in a higher false positive rate. At the same time, there is no existing program that can guarantee identification of all regulatory regions, as clearly evidenced by the fact that PRI and PReMod each find literature-defined regulatory regions that the other does not. Thus, it is distinctly possible that there are key features of regulatory regions that are not yet known but that could improve the algorithm and allow it to be more inclusive.
An interesting observation is that there are frequently multiple PRIs associated with a given gene. It is possible that this is correlated with the complex expression patterns observed for many of the genes of higher eukaryotes. For example, different PRIs could be used in different scenarios. This also establishes the potential for long-range interactions between regulatory regions, which could allow fine-tuning of the expression level of a given gene in different biological settings. As a result, we anticipate that there could be cases where a PRI cloned out of its genomic context and thereby isolated from any potential interactions could not recapitulate the expression pattern of the associated gene. Conversely, we encountered a few cases of PRIs from genes not normally differentially expressed in muscle that could unexpectedly drive reporter gene expression in a C2C12 differentiation time course. This aberrant behavior could be a good example of chromatin structure playing an important role in determining DNA accessibility to transcription factors as the plasmid-borne PRI would escape much of the influence of chromatin modifications normally present. In the future, perhaps overlaying our database with databases that contain information about chromatin structure, such as DNA methylation or histone posttranslational modifications, could allow even more confident forecasting of which PRIs are directing transcription in different scenarios.
There are many potential applications of the PRI algorithm to answering diverse questions. For instance, by examining a list of genes containing PRI-associated binding sites for a given transcription factor, it could be possible to predict the biological process in which this transcription factor is involved. This, in effect, would constitute a functional annotation of cis-regulatory elements. We performed such an analysis on all of the genes in the PRI database with conserved MEF2 binding sites. We compared these genes with all genes in the genome and looked for enrichment of specific functional categories by using Gene Ontology (33). As a control, we also performed the same analysis on genes that contain MEF2 binding sites in their upstream regions but that are not conserved in PRIs. The analysis done on genes with MEF2 binding sites in PRIs returned significantly low (<0.05) binomial P values for functional categories compatible with myogenesis and neurogenesis (data not shown). No such enrichment was observed with genes whose MEF2 consensus binding sites were not within PRIs. We anticipate that one could apply this approach to uncharacterized binding sites to predict what system might use them.
In addition to discovering new cis-regulatory regions, the computational framework developed to support our analysis could be exploited to perform global analyses of transcriptional networks. One obvious question one can ask is what is the network of genes controlled by two or more transcription factors suspected of working together to achieve combinatorial transcriptional regulation? The first search tool on the PRI web site can serve as a launch pad for such an analysis. Further analysis could be done to functionally categorize the PRIs or determine whether key combinations of transcription factors might cooperate to regulate gene expression under different conditions, for example, in response to cytokines, in governing tissue-specific expression patterns, or in coordinating developmental timing events.
Finally, PRIs can be viewed from an evolutionary perspective. It has already been proposed that the complexity of gene regulatory mechanisms is proportional to the complexity of the organism (2), more so than gene number or gene conservation. Could it be that natural mutations within the PRI region that change spacing, binding-site sequence, or orientation alter the transcriptional output and drive the phenotypic variations that distinguish species?
In summary, identification of PRIs coupled with the experimental validation presented here has revealed heretofore unexplored characteristics of the architecture of cis-regulatory regions. This analysis underscores the importance of the order and spacing of transcription factor binding sites that serve as the platform for transcriptional regulation of gene expression in mammalian genomes.
| Methods |
|---|
|
|
|---|
-responsive elements cloned previously between the two restriction sites. Point mutations and deletions were generated with the QuikChange II kit (Stratagene, La Jolla, CA) per the manufacturer's instructions. Cell Lines and Luciferase Assay. C2C12 mouse myoblasts were a gift from Leslie Leinwand (University of Colorado) and maintained in DMEM supplemented with L-glutamine, penicillin, streptomycin, and 20% FBS. Myogenic differentiation was induced by serum deprivation. Briefly, cells were washed with PBS, and the culture media were replaced with DMEM supplemented with 5% horse serum. For luciferase assays, C2C12 cells were plated at 5,000 cells per cm2 in triplicate and 24 h after were transfected with FuGENE6 transfection reagent (Roche, Indianapolis, IN) with 1 µg of the indicated plasmids and 10 ng of pRL-TK for normalization according to the manufacturer's conditions. Twenty-four hours after transfection, GM were harvested and differentiation was induced to yield DM. DM cells were harvested at the time points indicated. Reporter gene assays were performed by using the Dual Luciferase kit (Promega, Madison, WI) according to the manufacturer's instructions and read with a Dynex luminometer.
Real-Time PCR. Total RNA was isolated from C2C12 cells with an RNeasy kit (Qiagen, Valencia, CA) following the manufacturer's instructions. The total RNA was then reverse-transcribed with a Transcriptor First Strand cDNA synthesis kit using Anchored oligo-(dT)18 (Roche). The real-time PCR experiment was performed by using LightCycler FastStart DNA Master(PLUS) SYBR Green I following the manufacturer's instructions on a Light Cycler 2.0 (Roche). Expressions were normalized to the control gene EPB7.2, which was selected from a microarray for its high expression level and minimal fold change during myogenic differentiation. For oligo sequences used in this analysis, please refer to SI Text.
ChIP. ChIP assays were performed essentially according to Lambert and Nordeen (34) except that antibodies against MYOD (sc-760; Santa Cruz Biotechnology, Santa Cruz, CA) and MEF2 (sc-313; Santa Cruz Biotechnology) were used. The results were analyzed by PCR on a Mastercycler Gradient (Eppendorf, Hamburg, Germany). For more information, please refer to SI Text.
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: PRI, pattern-defined regulatory island; TFD, transcription factor database; GM, growing myoblasts; DM, differentiating myotubes.
To whom correspondence should be addressed. E-mail: xuedong.liu{at}colorado.edu
Author contributions: T.H.C. and K.K.B.B. contributed equally to this work; T.H.C., K.K.B.B., and X.L. designed research; K.K.B.B. performed research; T.H.C. and Y.L.K. contributed new reagents/analytic tools; T.H.C., K.K.B.B., and X.L. analyzed data; and K.K.B.B. and X.L. wrote the paper.
*Present address: Department of Neurology and Neurological Sciences, Stanford University School of Medicine, Stanford, CA 94305. ![]()
Present address: Dharmacon Research, Lafayette, CO 80026. ![]()
The authors declare no conflict of interest.
This article contains supporting information online at www.pnas.org/cgi/content/full/0704028104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||