CTCF mediates chromatin looping via N-terminal domain-dependent cohesin retention

Significance
The DNA-binding protein CCCTC-binding factor (CTCF) and the cohesin complex function together to establish chromatin loops and regulate gene expression in mammalian cells. It has been proposed that the cohesin complex, moving bidirectionally along DNA, extrudes the chromatin fiber and generates chromatin loops when it pauses at CTCF binding sites. To date, the mechanisms by which cohesin localizes at CTCF binding sites remain unclear. In the present study we define two short segments within the CTCF protein that are essential for localization of cohesin complexes at CTCF binding sites. Based on our data, we propose that the N-terminus of CTCF and the 3D geometry of the CTCF–DNA complex act as a roadblock constraining cohesin movement and establishing long-range chromatin loops.


Fig. S1. Overlap of CTCF and cohesin occupancy in multiple cell lines. (A) Venn diagram representing the overlap of CTCF and RAD21 ChIP-seq binding regions mapped in MCF7 cells. (B) Heatmaps of CTCF (red), RAD21 (pink) and IgG (black) occupancy at genomic regions bound by CTCF, by RAD21, or by both in MCF7 cells demonstrate that CTCF peaks and RAD21 peaks that do not overlap with each other nevertheless show some enrichment of RAD21 and CTCF occupancy, respectively. The heatmaps correspond to the overlapping CTCF and RAD21 binding sites in panel (A), with the connection between the two panels shown by black arrows. (C) Average profiles of CTCF (blue), RAD21 (pink), and IgG (green) occupancy at the binding sites determined in panel (A) confirm the enrichment of RAD21 and CTCF occupancy at CTCF and RAD21 peaks not overlapping with each other, respectively. The connection between the two panels (B and C) is shown by black brackets. (D) Based on the enrichment of CTCF and RAD21 occupancies mapped in human (MCF7 and HepG2) and mouse (mES and CH12) cells, three classes of CTCF and RAD21 sites were identified (labelled on the left). Venn diagrams illustrate the overlap of these three classes between the two human and the two mouse cell types. The percentages show how many of the sites from the cell type with the smaller number of sites overlap with those from the cell type with the larger number of sites. CTCF and RAD21 sites were the most reproducible, followed by CTCF sites depleted of RAD21 enrichment, while Cohesin-Non-CTCF (CNC) sites were more cell type-specific.

Fig. S2. The cohesin loading factor NIPBL is sufficient to explain CTCF-independent cohesin occupancy in different mouse and human cell lines, while the tissue-specific transcription factors ESR1 (MCF7), CEBPA (HepG2), and OCT4 (mESC) do not generally overlap with cohesin (RAD21). (A) Heatmaps of ESR1 (purple), CTCF (red), RAD21 (pink) and NIPBL (green) occupancy at 51,395 ESR1 binding sites mapped in MCF7 cells. (B) Genome browser view of CTCF, RAD21 and ESR1 ChIP-seq data in MCF7 cells confirms that ESR1 binding sites generally do not coincide with cohesin occupancy. (C) Heatmaps of CEBPA (black), CTCF (red), RAD21 (pink), and NIPBL (green) occupancy at 89,721 CEBPA binding sites mapped in HepG2 cells demonstrate that CEBPA binding sites generally do not coincide with CTCF-depleted cohesin binding sites. (D) Genome browser view of CTCF, RAD21, NIPBL, and CEBPA occupancy in HepG2 cells shows that RAD21 sites depleted of CTCF correspond better to NIPBL binding sites than to CEBPA sites, highlighted by red arrows. (E) Heatmaps of OCT4 (blue), CTCF (red), RAD21 (pink) and NIPBL (green) occupancy at 32,338 OCT4 binding sites mapped in mESCs.

Fig. S5. Comparison of the 5K lost CTCF sites with all CTCF sites mapped in CH12 cells with respect to their genomic distribution and their association with epigenetic marks and transcription factors. (A) Genomic distribution of the remaining (56K) and the lost (5K) CTCF ChIP-seq peaks in mut CH12 cells in comparison with the genomic distribution of all CTCF peaks (61K) mapped in wt CH12 cells (left). The lost 5K sites and the remaining 56K CTCF sites showed a similar distribution with respect to genomic context. (B) A heatmap showing row z-scores of overlapping ChIP-seq data for multiple transcription factors and histone modifications (labeled at the right of the heatmap) with the 5K lost CTCF sites (Lost 5K), with 5K sites randomly selected from the total of 61K CTCF sites mapped in wt CH12 cells (Random 5K), and with all 61K CTCF sites mapped in wt CH12 cells (Total 61K).

Venn diagrams representing overlap of CTCF ChIP-seq data with RNA-seq data in wt and mut CH12 cells. The genes that significantly changed expression upon deletion of ZFs 9-11 in CTCF (A) overlapped with CTCF peaks lost in mut CH12 cells. Gene coordinates were extended 100 kb up- and downstream of their transcription start and end sites. The majority (60%) of deregulated genes had a lost CTCF peak within 100 kb, suggesting that they might be direct targets of CTCF. (C) The major pathways that are significantly deregulated in mut CH12 cells compared to wt CH12 cells.

Fig. S8. Ectopically expressed CTCF constructs restore CTCF occupancy at the majority of CTCF sites lost in mut CH12 cells. (A) Heatmaps demonstrating that V5-tagged full-length (FL) CTCF restores CTCF occupancy at the 5K lost sites in mut CH12 cells. FL-CTCF was mapped by ChIP-seq with both CTCF and V5-tag Abs, shown at the top of the heatmap in comparison with CTCF occupancy in both wt and mut CH12 cells. (B) Heatmaps showing 188 lost CTCF sites that do not restore occupancy upon ectopic expression of CTCF in mut CH12 cells. (C) Heatmaps demonstrating that the binding pattern of FL-CTCF and truncated mutants in mut CH12 cells generally reproduced that of full-length CTCF, including the occupancy at the 5K lost CTCF sites. (D) Genome browser view of CTCF, V5, and RAD21 ChIP-seq data mapped in wt and mut CH12 cells. The App promoter, residing in a CpG island (green track), contains one of the 188 "permanently" lost CTCF sites (B) (shown by red arrows).

Fig. 4. First, we selected 2529 CTCF-anchored chromatin loops by overlapping CTCF ChIP-seq data with a deeply sequenced Hi-C dataset in which 3331 chromatin loops were identified in wt CH12 cells (PMID: 25497547). Second, we selected 344 loops that overlapped with the 5K lost CTCF sites at one or both anchors. Third, we selected 70 loops for Hi-C analysis by removing the short-range loops (those that span less than 300 kb).
(pink), and Chimera2 (purple) occupancy at the 5K lost CTCF sites demonstrates an overall gain of cohesin occupancy following the gain of Chimera2 occupancy, albeit to a lower extent than with FL-CTCF stably expressed in mut CH12 cells. K-means ranked clustering of ChIP-seq data along the 5K lost CTCF sites shows that only some of them were enriched with cohesin, reflecting Chimera2 occupancy, while the majority of the lost CTCF sites were occupied by cohesin following FL-CTCF occupancy. Clusters shown on the right side of the heatmap explain the observed patterns.

No coimmunoprecipitation of CTCF with any cohesin subunits was detected when DNA-assisted protein interactions were inhibited by ethidium bromide. V5-tag and cMYC Abs were used as negative controls. YY1 and PARP1 Abs were used as positive controls for CTCF. All four cohesin subunits (RAD21, SMC1, SMC3, and SA2) were co-immunoprecipitated together. The asterisk shows a nonspecific band that does not correspond to the molecular weight of the SMC3 protein.

) to see if the proteins form a stable complex that can be supershifted with both CTCF and RAD21 Abs. EMSA with CTCF-cohesin overlapping nuclear extract fractions demonstrated that the labelled DNA-protein complexes could be supershifted with antibodies against CTCF and BORIS, a known interacting partner of CTCF, but not with antibodies against the cohesin subunit RAD21, thus confirming the absence of CTCF-cohesin complexes in the nuclear extracts, consistent with our co-IP results (Fig. S23). The 32P-labelled p53 promoter probe, described in (PMID: 26268681), was used in the EMSA assay. The black arrows show the supershift with both CTCF and BORIS Abs, but not with RAD21 Abs (red arrows).

Amino acid sequences highlighted in red, blue, green, black and purple belong to CTCF, BORIS, AZF, flexible linker and V5-tag peptides, respectively. CTCF and BORIS ZFs are underlined; the N-terminus of both proteins is in bold. In sequence #16, the amino acids that have been shown to be poly(ADP)ribosylated are replaced by alanine (A, highlighted in black and underlined).

Bioinformatic analysis of ChIP-seq data
Single-end sequences generated by the Illumina Genome Analyzer (36-60 bp reads) were aligned against either the human (build hg19) or mouse (build mm9) genome using the Bowtie program with the default parameters (8), except that sequence tags mapping to more than one location in the genome were excluded from the analysis using the -m 1 option. Peaks were called using Model-based Analysis for ChIP-seq (MACS2) with default parameters (https://github.com/taoliu/MACS). The ChIP-seq data were visualized using the Integrative Genomics Viewer (IGV) (9). The peak overlaps between ChIP-seq data sets were determined with the BedTools Suite (10). We defined peaks as overlapping if at least 1 bp of each peak overlapped. The tag density profiles were generated using the coverage option of the BedTools Suite (10), normalized to the number of mapped reads, and plotted in Microsoft Excel. The heatmaps and the average profiles of ChIP-seq tag densities for different clusters were generated using the seqMINER 1.3.3 platform (11). We used the k-means ranked method for clustering normalization. Position weight matrices were calculated using the Multiple EM for Motif Elicitation (MEME) software (12). The sequences under the summit of ChIP-seq peaks were extended 100 bp upstream and downstream for motif discovery. We ran MEME with the parameters -mod oops -revcomp and either -w 40 or -w 20 to identify the long and short CTCF motifs, respectively, considering both DNA strands. The genomic distribution of CTCF ChIP-seq peaks relative to reference genes was analyzed using the Cis-regulatory Element Annotation System (CEAS) (13).
To call the genomic regions bound by CTCF, by RAD21, or by both proteins in the four cell lines (Fig. 1C-D, F), we calculated CTCF and RAD21 ChIP-seq tag densities at each binding region. For this we combined CTCF and RAD21 binding sites into a composite set, extended the summit of peaks to 300 bp, and calculated the normalized CTCF or RAD21 ChIP-seq tag density at each binding region using the BedTools coverage option. We classified the sites as "Cohesin-Non-CTCF" or "CTCF depleted of RAD21" if the tag density of the two factors differed by more than 3-fold at the binding region. To calculate the percent of cohesin (RAD21) occupancy at the lost CTCF sites in Fig. 7, we calculated RAD21 ChIP-seq tag densities (normalized to the number of mapped reads) following the ectopic expression of empty vector, FL-CTCF, or chimeric or mutant constructs in mut CH12 cells. The RAD21 ChIP-seq tag density at the lost CTCF sites following ectopic expression of the empty vector was taken as 0% cohesin occupancy, and the density following expression of FL-CTCF was taken as 100%. The percent of cohesin occupancy at the lost CTCF sites for chimeric and mutant proteins was calculated on this scale between 0% and 100%. In the case of chimeric proteins (Fig. 7B), we calculated RAD21 ChIP-seq tag density only at the lost CTCF sites that showed a similar occupancy (ChIP-seq tag density) for FL-CTCF and the corresponding chimeric protein. All ChIP-seq data have been deposited in the Gene Expression Omnibus (GEO) repository with the following GEO accession number: GSE137216.
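The site classification and the 0-100% occupancy scale described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the inputs are assumed to be tag densities already normalized to the number of mapped reads, and the linear interpolation between the empty-vector and FL-CTCF densities is an assumption about how the scale was computed, not a statement of the exact procedure used.

```python
def classify_site(ctcf_density, rad21_density, fold=3.0):
    """Label one binding region from normalized CTCF and RAD21 tag densities.

    A site is assigned to a single-factor class only when one density
    exceeds the other by more than `fold` (3-fold in the text).
    """
    if rad21_density > fold * ctcf_density:
        return "Cohesin-Non-CTCF"
    if ctcf_density > fold * rad21_density:
        return "CTCF depleted of RAD21"
    return "CTCF and RAD21"


def percent_cohesin_occupancy(construct, empty_vector, fl_ctcf):
    """Rescale a RAD21 tag density so that empty vector = 0% and FL-CTCF = 100%.

    Assumption: a linear scale between the two reference densities.
    """
    span = fl_ctcf - empty_vector
    if span == 0:
        raise ValueError("FL-CTCF and empty-vector densities are identical")
    return 100.0 * (construct - empty_vector) / span


if __name__ == "__main__":
    print(classify_site(ctcf_density=10.0, rad21_density=40.0))
    print(percent_cohesin_occupancy(construct=6.0, empty_vector=2.0, fl_ctcf=10.0))
```

Multiplying rather than dividing avoids a division by zero when one factor has no tags at a region; no pseudocount is applied in this sketch.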

RNA-seq
The RNA sequencing library preparation and sequencing procedures were carried out according to Illumina protocols. FASTQ files were mapped to the UCSC Mouse reference (mm9) using TopHat2 (14) with the default parameter setting of 20 alignments per read and up to two mismatches per alignment. The aligned reads (BAM files) were analyzed with Cufflinks 2.0 to estimate transcript relative abundance using the UCSC reference annotated transcripts (mm9). The expression of each transcript was quantified as the number of reads mapping to the transcript divided by the transcript length in kilobases and by the total number of mapped reads in millions (FPKM). Cuffdiff was applied to obtain the list of deregulated genes. Transcripts with a more than 2-fold change in expression and a p-value less than 0.005 were used for further analysis. RNA-seq data have been deposited in the GEO repository with the following accession number: GSE137216.
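As a worked illustration of the FPKM definition and the selection thresholds given above, the following Python sketch computes FPKM from raw counts and applies the 2-fold and p < 0.005 filters. The function names are hypothetical, and the handling of transcripts with zero expression in one condition is an assumption; Cufflinks and Cuffdiff perform these steps internally with their own statistical model.

```python
def fpkm(fragments, transcript_length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped reads."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = total_mapped / 1_000_000
    return fragments / length_kb / mapped_millions


def is_deregulated(fpkm_wt, fpkm_mut, p_value, min_fold=2.0, max_p=0.005):
    """Apply the thresholds from the text: >2-fold change and p < 0.005."""
    if p_value >= max_p:
        return False
    low, high = sorted((fpkm_wt, fpkm_mut))
    if high == 0:
        return False  # not expressed in either condition
    if low == 0:
        return True   # assumption: on/off changes count as deregulated
    return high / low > min_fold


if __name__ == "__main__":
    # 500 fragments on a 2 kb transcript in a library of 20 million mapped reads
    print(fpkm(500, 2_000, 20_000_000))
    print(is_deregulated(fpkm_wt=10.0, fpkm_mut=25.0, p_value=0.001))
```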

Hi-C
In situ Hi-C experiments were performed as previously described using the MboI restriction enzyme (15). The crosslinked pellets (1.5 million cells) were incubated and washed with 200 μL of lysis buffer (10 mM Tris-HCl pH 8.0, 10 mM NaCl, 0.2% Igepal CA630, 33 μL Protease Inhibitor (Sigma, P8340)) on ice, and then incubated in 50 μL of 0.5% SDS for 10 min at 62°C. After heating, 170 μL of 1.47% Triton X-100 was added and incubated for 15 min at 37°C. To digest chromatin, 100 U MboI and 25 μL of 10X NEBuffer2 were added followed by overnight incubation at 37°C. The digested ends were filled and labeled with biotin by adding 37.5 μL of 0.4 mM biotin-14-dATP (Life Tech), 1.5 μL of 10 mM dCTP, 10 mM dTTP, 10 mM dGTP, and 8 μL of 5 U/μL Klenow (New England Biolabs) and incubating at 23°C for 60 minutes with shaking at 500 rpm on a thermomixer. Then the samples were mixed with 1x T4 DNA ligase buffer (New England Biolabs), 0.83% Triton X-100, 0.1 mg/mL BSA, and 2000 U T4 DNA Ligase (New England Biolabs, M0202), and incubated at 23°C for 4 hours to ligate the ends. After the ligation reaction, samples were resuspended in 550 μL 10 mM Tris-HCl, pH 8.0. To reverse the crosslinks, 50 μL of 20 mg/mL Proteinase K (New England Biolabs) and 57 μL of 10% SDS were mixed with the samples and incubated at 55°C for 30 minutes, and then 67 μL of 5 M NaCl were added followed by overnight incubation at 68°C. After cooling at room temperature, 0.8X AMPure (Beckman-Coulter) purification was performed, and the samples were sonicated to a mean fragment length of 400 bp using a Covaris M220. Two rounds of AMPure (Beckman-Coulter) bead purification were performed for size selection. Biotin-labeled DNA was purified using Dynabeads MyOne T1 Streptavidin beads (Invitrogen). The beads were washed with 400 μL of 1x Tween Wash Buffer (5 mM Tris-HCl pH 7.5, 0.5 mM EDTA, 1 M NaCl, 0.05% Tween-20), and resuspended in 300 μL of 2x Binding Buffer (10 mM Tris-HCl pH 7.5, 1 mM EDTA, 2 M NaCl).
The beads were added to samples and incubated for 15 minutes at room temperature. Then the beads were washed twice by adding 600 μL of 1x Tween Wash Buffer. Then the beads were equilibrated once in 100 μL 1x NEB T4 DNA ligase buffer (New England Biolabs) followed by removal of the supernatant using a magnetic rack. To repair the fragmented ends, the beads were resuspended in 100 μL of the following: 88 μL 1X NEB T4 DNA ligase buffer (New England Biolabs, B0202), 2 μL of 25 mM dNTP mix, 5 μL of 10 U/μL T4 PNK (New England Biolabs), 4 μL of 3 U/μL NEB T4 DNA Polymerase (New England Biolabs), 1 μL of 5 U/μL Klenow (New England Biolabs). The beads were incubated for 30 minutes at room temperature. The beads were washed twice by adding 600 μL of 1x Tween Wash Buffer. To add a dA-tail, the beads were resuspended in 100 μL of the following: 90 μL of 1X NEBuffer2, 5 μL of 10 mM dATP, and 5 μL of 5 U/μL Klenow (exo-) (New England Biolabs). The beads were incubated for 30 minutes at 37°C. The beads were washed twice by adding 600 μL of 1x Tween Wash Buffer. Following the washes, the beads were equilibrated once in 100 μL 1x NEB Quick Ligation Reaction Buffer (New England Biolabs) and the supernatants were removed using a magnetic rack. The beads were then resuspended in 50 μL 1x NEB Quick Ligation Reaction Buffer. To ligate adapters, 2 μL of NEB DNA Quick Ligase (New England Biolabs) and 3 μL of Illumina Indexed adapter were added to the beads and incubated for 15 minutes at room temperature. The supernatant was removed and the beads were washed twice by adding 600 μL of 1x Tween Wash Buffer. Then the beads were resuspended once in 100 μL 10 mM Tris-HCl, pH 8.0, followed by removal of the supernatant and resuspension again in 50 μL 10 mM Tris-HCl, pH 8.0.
After determining an optimal PCR cycle number using the KAPA DNA Quantification kit (Kapa Biosystems), 6 cycles of PCR amplification were performed with the following reaction mixture: 10 μL Phusion HF Buffer (New England Biolabs), 3.125 μL 10 μM TruSeq Primer 1, 3.125 μL 10 μM TruSeq Primer 2, 1 μL 10 mM dNTPs, 0.5 μL Phusion HotStartII, 20.75 μL ddH2O, 11.5 μL Bead-bound Hi-C library. PCR products were subjected to a final purification using AMPure beads (Beckman-Coulter) and were eluted in 30 μL 10 mM Tris-HCl, pH 8.0. Libraries were sequenced on the Illumina HiSeq 4000 platform. Hi-C data have been deposited in the GEO repository with the following accession number: GSE136122.

Hi-C data analysis
Hi-C reads (paired-end, 50 bases) were aligned against the mm9 genome using BWA-mem (16). PCR duplicate reads were removed using Picard MarkDuplicates. We used Juicebox (17) with the -q 30 -f options to create the hic file and to visualize Hi-C data. The aggregate analysis of chromatin loops was performed using APA (17) with default parameters at 10 kb resolution. The list of chromatin loops identified in wild-type CH12 cells was downloaded from (15).
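The loop-selection criteria stated for the Hi-C analysis in this supplement (loops with a lost CTCF site at one or both anchors, excluding loops that span less than 300 kb) can be sketched in Python as follows. This is only an illustration: the function names and the tuple layout are hypothetical, intervals are assumed to be half-open, the 1 bp overlap rule is borrowed from the ChIP-seq peak definition, and measuring loop span between anchor midpoints is an assumption.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals share at least 1 bp."""
    return start_a < end_b and start_b < end_a


def select_loops(loops, lost_sites, min_span=300_000):
    """Keep loops of at least `min_span` with a lost CTCF site at an anchor.

    loops:      iterable of (chrom, a1_start, a1_end, a2_start, a2_end)
    lost_sites: iterable of (chrom, start, end)
    """
    lost_sites = list(lost_sites)
    selected = []
    for loop in loops:
        chrom, s1, e1, s2, e2 = loop
        # Assumption: loop span is the distance between anchor midpoints.
        span = abs((s2 + e2) // 2 - (s1 + e1) // 2)
        if span < min_span:
            continue
        hits_lost_site = any(
            site_chrom == chrom
            and (overlaps(s1, e1, start, end) or overlaps(s2, e2, start, end))
            for site_chrom, start, end in lost_sites
        )
        if hits_lost_site:
            selected.append(loop)
    return selected


if __name__ == "__main__":
    loops = [
        ("chr1", 100_000, 110_000, 500_000, 510_000),  # long, has a lost site
        ("chr1", 100_000, 110_000, 200_000, 210_000),  # shorter than 300 kb
        ("chr2", 0, 10_000, 900_000, 910_000),         # no lost site at anchors
    ]
    lost = [("chr1", 105_000, 105_300)]
    print(select_loops(loops, lost))
```

The linear scan over lost sites is adequate for a few thousand sites and loops; an interval tree or a BedTools intersection would be the natural choice for larger inputs.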

Published next-generation experiments used in this study
ChIP-seq data for CTCF and RAD21 in K562 and HepG2 cell lines used in the study: GSE32465 (18), GSE38163 (19), GSE30263 (20), GSE25021 (21), GSE36030 (19),

supplied with 0.5% bovine serum albumin (BSA) for 2 h at room temperature under constant rotation. After 2 h of incubation, the beads were washed three times with PBS + 0.5% BSA. Protein extracts of wt CH12 cells were prepared with RIPA Lysis buffer (Millipore) containing 50 mM Tris-HCl, pH 7.4, 1% Nonidet P-40, 0.25% sodium deoxycholate, 500 mM NaCl, 1 mM EDTA, and 1× protease inhibitor cocktail (Roche Applied Science). Next, the antibody-bound beads were incubated with 1.5 mg of protein extracts in the presence of ethidium bromide (100 μg/μL) overnight at 4°C with constant rotation. Of note, the protein extracts were pre-cleared with 30 μL of DiaMag protein G-coated magnetic beads for 2 h under constant rotation at 4°C. The immunoprecipitates were collected using a magnetic rack, washed five times with PBS + 0.5% BSA, dissolved in 1X LDS Sample buffer (Invitrogen) supplemented with DTT (50 mM final concentration), and boiled for 5 min at 90°C. Immunoprecipitated samples were resolved by SDS-PAGE, transferred to a PVDF membrane, and incubated with the indicated antibodies. Detection was performed using ECL reagents.