Genome-wide colocalization of RNA–DNA interactions and fusion RNA pairs

Significance It remains formidable to predict what unreported RNA pairs can form new fusion transcripts. By systematic mapping of chromatin-associated RNAs and their respective genomic interaction loci, we obtained genome-wide RNA–DNA interaction maps from two noncancerous cell types. The gene pairs involved in RNA–DNA interactions in these normal cells exhibited strong overlap with those with cancer-derived fusion transcripts. These data suggest an RNA-poise model, where the spatial proximity of one gene’s transcripts and the other gene’s genomic sequence poises for the creation of fusion transcripts. We validated this model with 96 additional lung cancer samples. One of these additional samples exhibited fusion transcripts without a corresponding fusion gene, suggesting that genome recombination is not a required step of the RNA-poise model.

. Companion tests and targeted therapies have been developed to identify and treat fusion-gene defined cancer subtypes (2,4). Efforts of detection of fusion transcripts have primarily relied on analyses of RNA sequencing (RNA-seq) data (1,3,(5)(6)(7)(8). A recent study analyzed 9,966 RNA-seq datasets across 33 cancer types from The Cancer Genome Atlas (TCGA) and identified more than 15,000 fusion transcripts (4).
Despite the large number of gene pairs in identified fusion transcripts, it remains formidable to predict what unreported pair of genes may form a new fusion transcript. Recent analyses could not identify any distinct feature of fusion RNA forming gene pairs (9). Here, we report a characteristic pattern of the 2D distribution of the genomic locations of the gene pairs involving RNA-DNA interactions that provides insights into understanding the creation of fusion transcripts.
Chromatin-associated RNAs (caRNAs) provide an additional layer of epigenomic information in parallel to DNA and histone modifications (10). The recently developed mapping of RNAgenome interactions (MARGI) technology enabled identification of diverse caRNAs and the respective genomic interacting locations of each caRNA (6). In this work, we developed an improved MARGI experimental pipeline called iMARGI. Compared with MARGI, iMARGI reduced the required amount of input cells by 100-fold to ∼5 million cells. We used iMARGI to map RNA-DNA interactions in human embryonic kidney (HEK) and human foreskin fibroblast (HFF) cells. The detected RNA-DNA interactions often appeared on the gene pairs involved in cancer-derived fusion transcripts. The widespread RNA-DNA interactions on the gene pairs involved in fusion transcripts suggest a model wherein the RNA of gene 1 by interacting with the genomic sequence of gene 2 is poised for being spliced into gene 2's nascent transcript and thus creating a fusion transcript. Consistent with this model, we identified an RNA fusion in a new cancer sample that does not involve the creation of a fusion gene.

Results
Characteristics of Genome-Wide RNA-DNA Interaction Maps. To systematically characterize caRNAs and their genomic interaction locations, we developed iMARGI, an enhanced version of the Significance It remains formidable to predict what unreported RNA pairs can form new fusion transcripts. By systematic mapping of chromatin-associated RNAs and their respective genomic interaction loci, we obtained genome-wide RNA-DNA interaction maps from two noncancerous cell types. The gene pairs involved in RNA-DNA interactions in these normal cells exhibited strong overlap with those with cancer-derived fusion transcripts. These data suggest an RNA-poise model, where the spatial proximity of one gene's transcripts and the other gene's genomic sequence poises for the creation of fusion transcripts. We validated this model with 96 additional lung cancer samples. One of these additional samples exhibited fusion transcripts without a corresponding fusion gene, suggesting that genome recombination is not a required step of the RNA-poise model.  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10  chr11  chr12  chr13  chr14  chr15  chr16  chr17  chr18  chr19  chr20  chr21  chr22  chrX  chrY   chr1   chr2   chr3   chr4   chr5   chr6   chr7   chr8   chr9   chr10   chr11   chr12   chr13   chr14  chr15  chr16  chr17  chr18  chr19  chr20  chr21  chr22  MARGI assay (10). The main difference between iMARGI and MARGI is that iMARGI carries out the ligation steps in situ (Fig. 1A), whereas MARGI performs these ligation steps on streptavidin beads. We applied iMARGI to HEK and HFF cells to yield 361.2 million and 355.2 million 2 × 100-bp paired-end sequencing read pairs, respectively. These resulted in 36.3 million (HEK) and 17.8 million (HFF) valid RNA-DNA interaction read pairs, in which ∼35%, 10%, and 55% were proximal, distal, and interchromosomal interactions, respectively ( Fig. 1C and SI Appendix, Fig. S1A). The proximal and distal interactions were defined as the intrachromosomal interactions where the RNA end and DNA end were mapped to within and beyond 200 kb, respectively. Following Sridhar et al. (10), we removed proximal read pairs from further analysis because proximal interactions likely represent interactions between nascent transcripts and their neighboring genomic sequences. Hereafter, we refer to the union of distal and interchromosomal interactions as remote interactions. The rest of this paper deals only with remote interactions. Among the remote RNA-DNA interactions, both cell types exhibited an ∼1:5 ratio of intra-and interchromosomal interactions (Fig. 1D). The 2D map of RNA-DNA interactions exhibited more interactions near the diagonal line ( Fig. 1E and SI Appendix, Fig. S1D). Within each chromosome, the number of iMARGI read pairs negatively correlated with their genomic distances ( Fig. 2A, red and blue circles).

52%
Comparison of iMARGI with MARGI. We compared iMARGI with the MARGI technology previously described (10). iMARGI requires only ∼5 million cells to start the experiment, whereas MARGI requires ∼500 million cells. MARGI has two technical variations called pxMARGI and diMARGI, which differ by the degree of chromatin fragmentation (10). We compared the iMARGI, pxMARGI, and diMARGI datasets generated from HEK293T cells. These datasets had roughly comparable amounts of raw read pairs (SI Appendix, Table S1).
First, we compared the distribution of the read pairs. iMARGI and pxMARGI produced similar amounts of valid interchromosomal (∼19 million) and distal (1-4 million) read pairs (SI Appendix, Table S1). diMARGI generated many fewer valid interchromosomal (∼0.5 million) and distal (∼26,000) read pairs. Second, we compared by the numbers of discovered caR-NAs. Under the smallest possible threshold (1 valid read pair), iMARGI revealed a similar amount of caRNAs to that of pxMARGI, with mRNA, lincRNA, pseudogene RNA, and antisense RNA as the most abundant types of caRNAs (SI Appendix, Fig. S2). diMARGI revealed severalfold fewer caRNAs in every RNA type (SI Appendix, Fig. S2), consistent with its many fewer usable read pairs. Third, we compared by the "RNA attachment level" (10) on every genomic segment. We segmented the genome into 100-kb bins and counted the number of read pairs with the DNA end mapped to each bin. iMARGI and pxMARGI exhibited a strong correlation (Pearson correlation = 0.88, P value <2.2×10 −16 ; SI Appendix, Fig. S3A). The correlation increased with the bin size (SI Appendix, Fig. S3 B and C). diMARGI data exhibited much weaker correlation to iMARGI data (SI Appendix, Fig. S3 D-F), likely also due to its very small amount of usable read pairs.
The Most Significant RNA-DNA Interactions Colocalized with the Gene Pairs Forming Fusion Transcripts in Cancer. We set off to identify the most significant distal RNA-DNA interactions from the iMARGI data. Excluding extremely abundant noncoding RNAs, such as XIST, the top gene pair with the largest amount of interchromosomal and distal iMARGI read pairs in HEK cells was FHIT-PTPRG ( Fig. 2B and SI Appendix, Fig. S4A). Investigating this gene pair, we found the reporting of FHIT-PTPRG fusion transcripts from kidney, liver, head and neck, lung, and prostate cancers (7). The second largest amount of interchromosomal and distal iMARGI read pairs was from GPC5-GPC6 ( Fig. 2B and SI Appendix, Fig. S4B). Fusion transcripts from this second-ranked gene pair were reported from liver and prostate cancers (7). Notably, 5 of the top 10 gene pairs were reported as fusion transcripts in cancers (1,7). These findings led us to systematically analyze the relationship between RNA-DNA interactions and fusion transcripts.

Nonuniform Distribution of the RNA Pairs Contributing to Fusion
Transcripts. We asked whether there is any global characteristic of genome-wide distribution of the RNA pairs that contribute to fusion transcripts. To this end, we subjected the previously reported 16,410 fusion transcripts that were derived from 9,966 A B samples across 33 cancer types from TCGA project to further analysis (4). On average, there were two fusion transcripts per sample (Fig. 3A). The 16,410 fusion transcripts corresponded to 15,144 unique RNA pairs. Hereafter we refer to these gene pairs as fusion transcript-contributing RNA pairs ("Futra pairs"). More than 95% (14,482 of 15,144) of Futra pairs occurred only in 1 sample of the 9,966 samples analyzed (Fig. 3B). These data confirmed the scarcity of recurrent gene pairs in fusion transcripts (11,12). We visualized the frequency of Futra pairs in a 2D heatmap, which we call a "fusion map" (Fig. 3C and SI Appendix, Fig. S5A). The 2D distribution was not uniform, with more intrachromosomal than interchromosomal gene pairs (odds ratio = 27.91, P value <2.2 × 10 −16 , χ 2 test). A total of 8,891 and 6,253 Futra pairs were intra-and interchromosomal, respectively. Chromosomes 1, 12, and 17 harbored the largest amounts of intrachromosomal gene pairs (SI Appendix, Fig. S5B). Chromosomes 1, 11, 12, 17, and 19 contribute to the largest amounts of interchromosomal gene pairs (SI Appendix, Fig. S5C). Higher density of gene pairs appeared on the diagonal line of the fusion map, suggesting enrichment of gene pairs within chromosomes or large chromosomal domains (Figs. 3C and 4A). We quantified the relative distances between the Futra pairs. The number of intrachromosomal Futra pairs negatively correlated with their chromosomal distance ( Fig. 2A, purple circles). Taken together, Futra pairs exhibited nonuniform distribution in the genome, characterized by enrichment of intrachromosomal pairs and preference to smaller genomic distances.

Differences Between the Genomic Locations of Futra Pairs and
Genome Interactions. We asked to what extent Futra pairs may correlate with genome interactions. Forty-one percent (6,253 of 15,144) of Futra pairs were interchromosomal, whereas less than 15% of chromosomal conformation capture-derived read pairs were interchromosomal (13,14). The intrachromosomal Futra pairs exhibited enrichment in large chromosomal domains (Figs. 3C and 4A and SI Appendix, Fig. S6A). These enriched chromosomal domains ranged from approximately 1/10th to 1/3rd of a chromosome in lengths, which are ∼10-20 times the typical sizes of topologically associated domains (TADs) (15). Taken together, Futra pairs exhibited different global distribution characteristics from those of genome interactions.  A model that may explain the colocalization of RNA-DNA interactions and Futra pairs is that RNA-DNA interactions in the normal cells poise for creation of fusion transcripts. Recognizing that this model cannot be tested by perturbation due to the very small likelihood for a fusion transcript to occur in a cancer sample, we carried out two other tests. First, we tested whether the cancer-derived Futra pairs were detectable in normal cells. We reanalyzed the merged RNA-seq datasets of more than 75 million 2 × 100-bp paired-end read pairs from HEK293T cells (16) and ran STAR-Fusion (17) on these datasets, which reported a total of 8 Futra pairs. None of the previously derived 15,144 Futra pairs from TCGA RNA-seq data were detected in HEK293T cells. In addition, we specifically tested for EML4-ALK fusion transcripts, which were reported in nonsmall cell lung carcinoma (NSCLC) (18), and there were RNA-DNA interactions between EML4 RNA and the ALK genomic locus in HEK and HFF cells (see Fig. 6A). Neither PCR nor quantitative PCR analysis detected EML4-ALK fusion transcripts in HEK293T cells (SI Appendix, Fig. S8), whereas both assays detected fusion transcripts in a NSCLC cell line (H2228) (SI Appendix, Fig. S8). Taken together, these data suggest that, although cancer-derived Futra pairs colocalized with RNA-DNA interactions in normal cells, the fusion transcripts found in cancer are not present in the normal cells.

RNA-DNA Interactions in Normal Cells Are Predictive of Fusion
Transcripts in New Cancer Samples. Next, we tested whether the RNA-DNA interactions in normal cells are predictive of fusion transcript formation in cancer. To this end, we analyzed a validation cohort comprising 96 new lung cancer samples from patients who were not part of the TCGA cohorts. We also analyzed a NSCLC cell line (H2228). RNA was extracted and targeted RNA-seq was carried out with Illumina's TruSight RNA Pan-Cancer Panel. Of these 96 samples, 27 did not yield sufficient RNA for sequencing, whereas the other 69 samples produced a sequencing library and yielded on average 3.9 million uniquely aligned read pairs per sample (Fig. 5A). STAR-Fusion (17) was applied to this dataset and it reported a total of 42 fusion transcripts from these 69 samples (Fig. 5 B and C). These 42 fusion transcripts included EML4-ALK and FRS2-NUP107 fusions, which were also reported from the 9,966 TCGA cancer samples, as well as 40 new fusion transcripts that were not previously documented. The small amount of recurring Futra pairs between these additional cancer samples and TCGA samples is expected from the small fraction of recurring Futra pairs across the TCGA samples (Fig. 3B).
Among these 42 Futra pairs detected from the validation cohort, 37 (88.1%) colocalized with RNA-DNA interactions in the assayed normal cells, supporting the idea that RNA-DNA interactions in the already assayed normal cells are predictive of Futra pairs in cancer (odds ratio = 106.51, P value <2.2 × 10 −16 , χ 2 test). We asked whether only intrachromosomal Futra pairs colocalized with RNA-DNA interactions. Nineteen of the  H2228  1  3  10  11  13  14  15  17  18  19  21  22  23  24  25  26  27  28  30  31  32  33  35  36  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  55  56  57  58  59  60  61  62  64  73  74  75  76  77  78  79  80  81  82  83  84  87  88  89  90  91  92  95  We tested whether genome rearrangement is a prerequisite step for the creation of fusion transcripts from fusion-susceptible pairs by choosing EML4-ALK fusion transcripts for this test because EML4-ALK is a fusion-susceptible pair (Fig. 6A), EML4-ALK fusion transcripts are detected in one of our new tumor samples (sample no. 44) (Fig. 6B), and there is an FDA-approved diagnosis kit (Vysis ALK Break Apart FISH) based on DNA FISH detection of the EML4-ALK fusion gene. Break Apart assays were performed by Knight Diagnostic Laboratories at the Oregon Health & Science University according to standardized protocols. We subjected the remaining tissue from sample no. 44 to DNA FISH analysis. None of our eight attempts yielded any DNA FISH signal in the remaining tissue from either control or ALK probes. We therefore could not ascertain whether there was genome rearrangement in the only sample with detectable EML4-ALK fusion transcripts.

Discussion
Abundance of Genome Rearrangement-Independent Fusion Transcripts. Our RNA FISH and DNA FISH analyses revealed a cancer sample that contained a fusion transcript without the corresponding fusion gene. Such an example, although not often seen in the literature, may not be a rare case (11). The lack of reports is likely attributable to the research attention paid to the other side of the coin, i.e., the fusion transcripts created by fusion genes (2). Indeed, ∼36-65% of fusion transcripts derived from cancer RNA-seq data were attributable to genome rearrangement (Low Pass bars, figure S1A of ref. 1). However, this is likely an overestimate because when low-quality whole-genome sequencing (WGS) data were removed, only ∼30-45% of fusion transcripts had corresponding WGS reads (High Pass bars, figure S1A of ref. 1). These published results are consistent with the notion that fusion genes do not account for all observed fusion transcripts and suggest the occurrence of fusion transcripts independent of genome rearrangement. However, we recognize that to date, the sheer amount of validated fusion RNAs independent of genome rearrangement remains limited, which warrants future investigation.
The RNA-Poise Model Allows for Splicing Errors. Fusion transcripts can be created by two processes. The better-recognized process is through transcription of a fusion gene that was created by genome rearrangement. The less-recognized process is by RNA splicing errors, where two separate transcripts were spliced together (transsplicing) (24). Transsplicing does not involve genome rearrangement. A theoretical gap in the splicing error model is that transsplicing can happen only to two RNA molecules that are close to each other in 3D space; however, except for neighboring genes (11), the chances for two RNA molecules transcribed from distant chromosomal locations to meet in space are small. Therefore, it remains difficult to perceive a biophysical process in which fusion transcripts are created by splicing errors.
The RNA-poise model fills this theoretical gap. The preinstallation of gene 1's transcripts on gene 2's genomic sequence positions gene 2's nascent transcripts spatially close to gene 1's transcripts, allowing for the possibility of transsplicing. Furthermore, the majority of splicing events are cotranscriptional. The availability of transcripts of gene 1 during gene 2's transcription allows for the opportunity to perform cotranscriptional transsplicing.
Breaking Down the RNA-Poise Model by RNA-DNA Interactions.
Remote RNA-DNA interactions could be created by at least two means. First, the caRNA can target specific genomic sequences, which could be mediated by tethering molecules (RNA targeting, Fig. 7). Second, the spatial proximity of the genomic sequences in 3D space could bring the nascent transcripts of one gene to the genomic sequence of another gene (RNA confinement, Fig. 7). Both means of RNA-DNA interactions provide spatial proximity between two RNA molecules and thus allow for splicing errors. In addition, the spatial proximity of two genes in the RNA confinement model could enhance the chances of genome rearrangement of the spatially close genomic sequences and thus create fusion genes (25). Thus, the RNA-poise model can be regarded as a union of two submodels, depending on the process of RNA-DNA interaction. One submodel (targeting-poise model, Fig. 7) could create fusion transcripts only by transsplicing. The other submodel (confinement-poise model, Fig. 7) could create fusion transcripts by either transsplicing or creation of fusion genes.

Materials and Methods
Reference Genome and Gene Annotations. Human genome assembly hg38/GRCh38 and Ensembl gene annotation release 84 (GRCh38.84) were used throughout all data analyses.

RNA-poise model
Targeting-poise Confinement-poise Fig. 7. RNA-poise model. In this model, the transcripts of one gene (RNA 1, red bar) can exhibit spatial proximity to another gene (RNA 2, purple bar) due to tethering (RNA targeting) or spatial proximity of the two genes (RNA confinement). Both cases could enhance splicing errors (gray arrows), whereas the proximity of genomic sequences may also facilitate gene fusion (gray arrow on the right), which subsequently produces fusion RNA.
TCGA Derived Fusion Transcripts. The TCGA RNA-seq-derived fusion transcripts were downloaded from the Tumor Fusion Gene Data Portal (www.tumorfusions.org) (26). Tier 1 and tier 2 fusion transcripts were used in our analyses. Genomic coordinates were converted to hg38 by liftOver. Following Davidson et al. (8), Futra pairs within 200 kb on hg38 were removed. The data of Futra pairs were mainly processed using R (27) with Bioconductor packages GenomicRanges (28) and InteractionSet (29).
Visualization of Futra Pairs. Heatmaps of the count matrix were plotted using Bioconductor package ComplexHeatmap (30). Genomic plots of Futra pairs were created with GIVE (31).
Constructing iMARGI Sequencing Libraries. Nuclei preparation and chromatin digestion. Approximately 5 × 10 6 cells were used for the construction of an iMARGI sequencing library. Cells were cross-linked in 1% formaldehyde at room temperature (RT) for 10 min with rotation. The cross-linking reaction was quenched with glycine at 0.2 M concentration and incubated at RT for 10 min. Cells were pelleted, washed using 1× PBS, and aliquoted into ∼5 × 10 6 in each tube. To prepare nuclei, cross-linked cells were incubated in 1 mL of cell lysis buffer (10 mM Tris·HCl, pH 7.5, 10 mM NaCl, 0.2% Nonidet P-40, 1× protease inhibitor) on ice for 15 min and homogenized with dounce homogenizer pestle A for 20 strokes on ice. Nuclei were pelleted and weighed to estimate the pellet volume (10 mg of nuclei pellet was estimated to be about 10 µL  Fig. 1A, which is the same as with MARGI library configuration (10). Since AluI restriction enzyme recognizes "AGCT" sequence and leaves "CT" at the 5 end of the cut, we expect the first two bases of DNA end (Read 2) to be enriched with CT.

Analysis of iMARGI Sequencing Data.
Mapping iMARGI read pairs. The detailed iMARGI data-processing methods can be found in our GitHub repository (https://github.com/Zhong-Lab-UCSD/iMARGI methods). Briefly, they include three main steps. First, the read pairs were cleaned by in-house scripts. According to the library con-struction design, read pairs were filtered out if the 5 -most two bases of their DNA end (Read 2) were not CT. In addition, the first two bases of the RNA end (Read 1) were removed as they are random nucleotides. Then, the cleaned read pairs were mapped to the human genome (hg38), using bwa mem (version 0.7.17) with parameters "-SP5M" (32). Finally, pairtools (version v0.2.0, https://github.com/mirnylab/pairtools) and in-house scripts were used to parse, deduplicate, and filter the mapped read pairs. The valid read pairs that were mapped to genomic locations within 200 kb of each other were defined as proximal interactions, which were excluded from our analysis. GenomicRanges (28) and InteractionSet (29) were used for further analysis of iMARGI data. Visualization of iMARGI read pairs. Heatmaps of the count matrix were plotted using Bioconductor package ComplexHeatmap (30). Genomic plots of iMARGI read pairs were created with Bioconductor package Gviz (33) and GIVE (31). Intersection of iMARGI read pairs and Futra pairs. An iMARGI read pair was regarded as overlapping with a Futra pair when the RNA end was strandspecifically mapped to the gene body of one gene in the Futra pair and the DNA end was mapped to the gene body ±100 kb flanking regions of the other gene in the Futra pair.
FuseFISH Analysis. Probe design. Oligonucleotide probes were designed to hybridize to exons 2-6 of EML4 RNA and exons 20-23 of ALK RNA. These exons were chosen because they were present in all of the observed variations of EML4-ALK fusion genes. These probes were 35-40 nt in length, with similar GC contents and melting temperatures.

Conjugation of quantum dots to oligonucleotide probes.
Oligos were modified on the 5 end with a primary amino group and a spacer of 30 carbons to minimize steric hindrance of probe-RNA hybridization. These probes were conjugated with quantum dots through the amino group using EDC reaction (34). The probes were subsequently purified with 0.2 µm membrane filtration and 100,000 molecular weight cutoff (MWCO). The retentate of the 100,000 MWCO was subjected to dynabeads MyOne SILANE purification to remove any remaining unconjugated probes. A subsequent 0.2-µm membrane filtration was used to remove any final aggregates.

Hybridization of adherent cell lines.
Probe hybridization in H2228 cells was carried out as previously described (19,21). Briefly, probes were added to the hybridization solution and incubated with the cells at 37 • C overnight. Cells were washed and resuspended in 1× PBS for imaging. Hybridization of tissue samples. Probe hybridization in tissues was carried out as previously described (35). Briefly, a hybridization solution with probes was added to the surface of the parafilm to form droplets. A tissue slice (5-10 µm in thickness) fixed on a glass coverslip was gently placed over the hybridization solution. The mixture was incubated at 37 • C overnight. The tissue was subsequently washed with wash buffer and resuspended in 1× PBS for imaging. Imaging and analysis. Cells or tissues were imaged in 1× PBS through widefield fluorescence imaging using an Olympus IX83 inverted microscope at 60× oil immersion objective (N.A. = 1.4). Image processing was carried out as previously described (36). Briefly, single transcripts were detected using an automated thresholding algorithm that searches for robust thresholds, where counts do not change within a range. Fusion transcripts were determined by searching for colocalization of detection transcripts by overlap between predicted centers within a radius.
RNA Sequencing and Analysis. RNA was extracted with Trizol from an H2228 cell line and lung cancer tissue samples of the approximate size 3 mm×3 mm× 30 µm per sample. RNA sequencing was carried out using the TruSight RNA Pan-Cancer Panel (Illumina) following the manufacterer's protocol. All of the RNA-seq data, including HEK293T public data, the H2228 cell line, and lung cancer sample sequencing data, were mapped to the human genome (hg38) using STAR (v2.5.4b) with default parameters (37). Fusion transcripts were called using STAR-Fusion (v0.8.0) (17), requiring both numbers of supporting discordant read pair and junction-spanning read larger than zero and the sum of them larger than 2.