Xie et al. 10.1073/pnas.0701811104.
Fig. 4. Distribution of distance between CNEs and their nearest gene starts.
Fig. 5. Steps for motif discovery.
Fig. 6. Expected number of random matches for k-mers of different size in a 100-Mb sequence.
Fig. 7. Comparison between LM9 and previously known NRSE motif.
Fig. 8. Three CTCF motifs forming LM2*.
Fig. 9. Distribution of predicted CTCF sites in different chromosomes.
Fig. 10. Genes separated by predicted CTCF sites are less correlated in expression. The correlation coefficient between neighboring gene pairs is shown as a probability density (a) and as a cumulative distribution (b). Red, correlation between genes separated by a CTCF site; gray, correlation between randomly chosen gene pairs; blue, correlation between genes not separated by CTCF sites, sampled such that their intergene distance distribution matches that of the gene pairs separated by CTCF sites.
Fig. 11. LM2* distribution in vertebrate and invertebrate genomes. (a) Number of LM2* matching sites in 11 species. Matching sites were predicted using a stringent threshold, but requiring no across-species conservation. (b) Density of matching sites. (c) Size of tested genomes (only non-repeat-masked regions).
SI Text
Neighboring genes separated by intergenic CTCF sites tend to have uncorrelated gene expression patterns, providing functional evidence that CTCF sites indeed serve as insulators. The list of ≈15,000 CTCF sites thus provides a genome-wide map of potential insulator sites in human. We observe a few instances of local clustering of CTCF sites, including six regions containing the protocadherin α/γ clusters, the T cell receptor α and β loci, and the Ig heavy and λ loci (SI Table 4). It is possible that CTCF sites may play special roles in demarcating regions within these gene families.
Constructing the CNE Data Set. To compile the CNE data set, we started from a list of 2.2 million conserved elements curated previously by Siepel et al. [Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. (2005) Genome Res 15:1034-1050; the University of California, Santa Cruz, Genome Browser (http://genome.ucsc.edu)], based on the whole genome alignment of 17 vertebrates including 12 mammals. We then removed all elements that overlapped protein-coding regions or that did not have corresponding sequence lying in the syntenic regions of the mouse and dog genomes. This resulted in 829,730 elements with a total length of 62 Mb. The annotation of protein-coding genes was based on the known genes deposited in the University of California, Santa Cruz, Genome Browser.
Constructing PWMs and Searching for Motif Instances. To construct a position weight matrix (PWM) for each motif, we aligned all CNE sequences matching any of the k-mers belonging to the motif cluster. We then computed the frequency of each nucleotide at each aligned position, with $p_{ij}$ representing the frequency of nucleotide $j$ at position $i$. The information content of the motif at position $i$ is defined as $I_i = 2 + \sum_j p_{ij} \log_2 p_{ij}$. To identify matching instances, we calculated a log-odds score to evaluate how well a sequence matched the PWM. The log-odds score is defined as $LO = \sum_i \log_2\left(p_{i,j(i)} / b_{j(i)}\right)$, where $j(i)$ is the nucleotide at position $i$ of the sequence and $b_j$ is the background frequency of nucleotide $j$. The log-odds score was then normalized to obtain a final score between 0 and 1: $S = (LO - LO_{\min}) / (LO_{\max} - LO_{\min})$, where $LO_{\max}$ and $LO_{\min}$ are the maximum and minimum scores, respectively, that the matrix can possibly achieve.
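The PWM scoring described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names, the pseudocount (added here only to avoid taking the logarithm of zero; the text does not mention one), the uniform background, and the toy sites are all assumptions.

```python
import math

BASES = "ACGT"

def build_pwm(aligned_sites, pseudocount=0.5):
    """Per-position nucleotide frequencies p[i][j] from equal-length sites.

    The pseudocount is an assumption of this sketch, not part of the source.
    """
    pwm = []
    for i in range(len(aligned_sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in aligned_sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def information_content(column):
    """I_i = 2 + sum_j p_ij * log2(p_ij), in bits."""
    return 2 + sum(p * math.log2(p) for p in column.values() if p > 0)

def log_odds(pwm, seq, background):
    """LO = sum_i log2(p_{i,j(i)} / b_{j(i)})."""
    return sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, seq))

def normalized_score(pwm, seq, background):
    """S = (LO - LO_min) / (LO_max - LO_min), a value between 0 and 1."""
    lo = log_odds(pwm, seq, background)
    lo_max = sum(max(math.log2(col[b] / background[b]) for b in BASES) for col in pwm)
    lo_min = sum(min(math.log2(col[b] / background[b]) for b in BASES) for col in pwm)
    return (lo - lo_min) / (lo_max - lo_min)

# Hypothetical toy alignment and a uniform background (both assumptions).
sites = ["ACGT", "ACGT", "ACGA", "ACGT"]
background = {b: 0.25 for b in BASES}
pwm = build_pwm(sites)
consensus_score = normalized_score(pwm, "ACGT", background)
worst_score = normalized_score(pwm, "CACC", background)
```

In this toy example the consensus sequence attains the maximum normalized score and a sequence built from the least frequent nucleotide at every position attains the minimum. A matrix whose columns all equal the background would make the denominator zero, so a real implementation would need to guard against that case.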
Conservation of Matching Motif Instances. Upon identifying a matching motif instance in humans we determined whether the instance is conserved in orthologous regions of other mammals. We proceeded by first extracting aligned sequences in the whole-genome alignment of 12 mammals (from the University of California, Santa Cruz, Genome Browser: http://genome.ucsc.edu). We then determined those species in which the corresponding aligned sequence also contains a matching instance. We defined an instance as conserved if the evolutionary tree connecting all species with a matching instance has a total branch length (measured in rate of mutations per nucleotide) >0.85. (For reference, the total branch length connecting human, mouse, rat, and dog is 0.76.)
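The branch-length criterion can be illustrated with a small sketch. The toy tree below, its internal node names, and its branch lengths are hypothetical and chosen only for illustration; they are not the values of the 12-mammal tree used in the study. The function sums the branches of the minimal subtree connecting the species in which a matching instance was found.

```python
# Hypothetical toy tree: child -> (parent, branch length).
# Branch lengths are illustrative, not the values of the real mammalian tree.
TREE = {
    "human": ("euarchontoglires", 0.10),
    "mouse": ("rodent", 0.10),
    "rat": ("rodent", 0.10),
    "rodent": ("euarchontoglires", 0.20),
    "euarchontoglires": ("root", 0.05),
    "dog": ("root", 0.15),
}

def connecting_branch_length(tree, species):
    """Total length of the branches connecting the given species."""
    species = set(species)
    hits = {}  # node -> number of selected species lying below its parent edge
    for leaf in species:
        node = leaf
        while node in tree:  # walk up until the root is reached
            hits[node] = hits.get(node, 0) + 1
            node = tree[node][0]
    # An edge is on a connecting path only if some, but not all, of the
    # selected species lie below it; edges above their common ancestor
    # have every selected species below them and are excluded.
    return sum(tree[n][1] for n, c in hits.items() if c < len(species))

def is_conserved(tree, species_with_match, threshold=0.85):
    """Apply the >0.85 total-branch-length criterion stated in the text."""
    return connecting_branch_length(tree, species_with_match) > threshold

rodents_only = connecting_branch_length(TREE, ["mouse", "rat"])
all_four = connecting_branch_length(TREE, ["human", "mouse", "rat", "dog"])
```

With these made-up branch lengths, a match present in all four toy species still falls below the 0.85 threshold, mirroring the reference point in the text that human, mouse, rat, and dog alone span a total branch length of 0.76.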
Affinity-Capture Assay. To study binding to a motif, we incubated 1 pmol of 5′-biotinylated oligos containing four consecutive instances of the motif (Operon, Huntsville, AL) with 70 μl of HeLa nuclear extract (Promega, Madison, WI) in a stringent binding buffer [50 mM Tris·HCl, pH 7.5/150 mM NaCl/0.25 mM EDTA/0.5 mM DTT/0.1% Tween 20/350 μg of BSA/70 μg of poly(dI/dC)] at room temperature for 1 h. The oligos were captured by using streptavidin-coated magnetic beads (Dynal, Carlsbad, CA) and washed three times in binding buffer with BSA and three times without BSA. Remaining proteins were eluted by boiling in loading buffer, separated by SDS/PAGE, and assayed by Western blot using polyclonal antibodies for RFX1 (sc-10652; Santa Cruz Biotechnology, Santa Cruz, CA) or CTCF (sc-5916; Santa Cruz Biotechnology). Probe sequences are as follows: LM1a, GCTGTTGCCATGGAAACCAG; LM1b, TGTTGCTTAGCAACA; LM2a, CCACCAGGTGGCAGCAGA; LM2b, CCACTAGATGGCAGTGTT. For testing the binding of CTCF to LM2, LM7, and LM23 and their mutants, we used the following probes: LM2, TCTCCACCAGATGGCAGCA; LM2 mutant, TCTCTACCACATTGCAGCA; LM7, TCCAGCAGGTGGCGCTGTC; LM7 mutant, TCGAGCATGTAGCGCTGTC; LM23, CTGACCACCAGGTGGTGCTGTT; LM23 mutant, CTGACTACCAAGTCGTGCTGTT.