New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology
Global analysis of somatic structural genomic alterations and their impact on gene expression in diverse human cancers
Edited by Mary-Claire King, University of Washington, Seattle, WA, and approved October 21, 2016 (received for review April 19, 2016)

Significance
Structural changes in chromosomes can alter the expression and function of genes in tumors, an important driving mechanism in some tumors. Whole-genome sequencing makes it possible to detect such events on a genome-wide scale, but comprehensive investigations are still missing. Here, enabled by a massive amount of whole-genome sequencing data generated by The Cancer Genome Atlas consortium, we map somatic structural changes in 600 tumors of diverse origins. At a global level, we find that such events often contribute to altered gene expression in human cancer, and also highlight specific events that may have functional roles during tumor development.
Abstract
Tumor genomes are mosaics of somatic structural variants (SVs) that may contribute to the activation of oncogenes or inactivation of tumor suppressors, for example, by altering gene copy number amplitude. However, there are multiple other ways in which SVs can modulate transcription, but the general impact of such events on tumor transcriptional output has not been systematically determined. Here we use whole-genome sequencing data to map SVs across 600 tumors and 18 cancers, and investigate the relationship between SVs, copy number alterations (CNAs), and mRNA expression. We find that 34% of CNA breakpoints can be clarified structurally and that most amplifications are due to tandem duplications. We observe frequent swapping of strong and weak promoters in the context of gene fusions, and find that this has a measurable global impact on mRNA levels. Interestingly, several long noncoding RNAs were strongly activated by this mechanism. Additionally, SVs were confirmed in telomere reverse transcriptase (TERT) upstream regions in several cancers, associated with elevated TERT mRNA levels. We also highlight high-confidence gene fusions supported by both genomic and transcriptomic evidence, including a previously undescribed paired box 8 (PAX8)–nuclear factor, erythroid 2 like 2 (NFE2L2) fusion in thyroid carcinoma. In summary, we combine SV, CNA, and expression data to provide insights into the structural basis of CNAs as well as the impact of SVs on gene expression in tumors.
Somatic structural variants (SVs) in genomic DNA can lead to changes in gene copy number, which may contribute to tumor development by altering gene expression levels (1). Several studies have tried to systematically investigate the effect of copy number alterations (CNAs) on tumor transcriptomes (2, 3), but this can only explain part of the expression changes observed in tumors. Notably, there are several other mechanisms by which SVs can alter tumor transcriptional output, including creation of fusion genes. These may produce chimeric mRNAs encoding novel oncogenic proteins (4, 5) or, alternatively, fusion mRNAs that maintain their protein-coding sequence while being transcriptionally induced or repressed due to swapping of 5′ ends, including the promoter (6). Additionally, SVs can affect mRNA levels by bringing distant regulatory elements into the proximity of transcription start sites (TSSs) without structurally altering the mRNA (7). More systematic investigations of the impact of SVs on gene expression levels in tumors are, however, lacking, in part due to the fact that mapping of SVs is considerably more challenging compared with determining CNA profiles. Additionally, although CNAs arise as a result of SVs, the two have generally been studied in isolation and the general structural basis of somatic CNAs in tumors is poorly investigated.
Recent studies have started to use whole-genome sequencing (WGS) to identify SVs in tumors, and multiple tools have been published (8⇓⇓⇓–12). These are based on the idea that breakpoints should reveal themselves on the basis of discordant read pairs (13) or split reads mapping partially to both ends of fused chromosomal segments (14). Several factors complicate the analysis, in particular mappability issues due to repetitive sequence regions (15). Indeed, it has become clear that the results produced by different methods are not consistent, and some studies have intersected multiple approaches to provide a presumed high-confidence set of predictions (16, 17). Adding to the challenges is the difficulty of assessing performance: True positive sets have thus far been obtained through simulated genomic sequences (18), but this will not reflect the true complexity of cancer genomes. Improved benchmarking is thus desirable before SVs can be comprehensively studied with WGS in human cancers.
Here, enabled by the availability of a large cancer cohort subjected to WGS by The Cancer Genome Atlas (TCGA) consortium, we map somatic SVs in 600 tumors across 18 different cancer types and determine relationships between SVs, CNAs, and mRNA expression. We compare SV breakpoints with those from microarray-based CNA data to evaluate and optimize tools and parameters for SV detection, which additionally gives insights into the structural basis of CNAs. We next investigate relationships between SVs and changes in gene expression levels, both in the context of gene fusions as well as SVs affecting putative regulatory regions near TSSs. Finally, we combine breakpoints determined from WGS and RNA-sequencing (RNA-seq) data to detect high-confidence fusion genes across this large cohort.
Results
CNA-Based Evaluation of SV Detection Methodology.
We first noted large discrepancies, for example in terms of SV types, when applying several available tools for SV detection to WGS data from TCGA tumor/normal pairs (Fig. 1A), and therefore sought to carefully evaluate multiple options. A general problem in assessing SV results is the lack of true positive data for comparison (19). To partially overcome this problem, we made use of orthogonal results in the form of CNA profiles from the Affymetrix SNP6 microarray platform (SI Methods). CNAs arise as a consequence of somatic SVs in the genome, and segmented CNA data therefore provide a true positive subset. Although this subset is incomplete, an ideal experimental and computational pipeline should essentially be able to detect all CNA breakpoints.
Evaluating SV detection tools. (A) Four SV detection tools were applied to 11 WGS tumor/normal pairs. The distribution of SV classes is shown for each tool. SVDetect identifies two classes of SVs: inter- and intrachromosomal translocations. The “other” class for Meerkat represents tandem duplications. (B) The fraction of CNA breakpoints overlapping with SV breakpoints was used as a proxy to determine the sensitivity of each tool (red bars). Randomized CNA breakpoints were used to assess the specificity of each tool (blue bars).
We applied four available tools, BreakDancer (20), SVDetect (21), DELLY (22), and Meerkat (11), to high-coverage WGS data from 200 TCGA tumors of diverse origins, and quantified overlaps between SV and CNA breakpoints (<15-kb tolerance). To reduce the computational complexity, only a single chromosome (chr) was considered in this analysis (chr8). To also address the specificity of the methods, we investigated overlaps with a set of randomized CNA breakpoints that maintained a similar positional distribution across the genome (SI Methods). SVDetect had the highest sensitivity score (35%), followed by Meerkat (29%) (Fig. 1B). However, Meerkat performed considerably better than SVDetect in terms of specificity (Fig. 1B). Additionally, Meerkat performed favorably compared with the other methods when applied to 11 complete tumor genomes of diverse origins (Fig. S1), and therefore we chose Meerkat for SV detection.
Evaluating SV detection tools. SV detection tools were evaluated, shown on the x axis. The fraction of CNA breakpoints overlapping with SV breakpoints was used as a proxy to determine the sensitivity of each tool (red bars). Randomized CNA breakpoints were used to assess the specificity of each tool (blue bars). Meerkat had the highest sensitivity score (38%), followed by SVDetect (28%), DELLY (27%), and BreakDancer (26%).
A Map of Somatic SVs Across 600 Tumor Genomes.
We next used high-coverage WGS data from TCGA to map somatic SVs in 600 tumors across 18 cancer types (Dataset S1), and combined these results with somatic CNA (Affymetrix SNP6 microarrays) and expression (RNA-seq) data from the same samples. In total, 51,446 somatic SVs were identified (on average 87 events per sample). Of these, 36,382 were intrachromosomal (13,745 deletions, 13,335 tandem duplications, and 9,302 inversions), whereas 14,953 were interchromosomal translocations. The number of SVs and the relative distribution between SV classes varied considerably within and across cancer types (Fig. 2A).
Overview of detected somatic SVs and overlaps with CNAs across 18 cancer types. (A) Samples are grouped by cancer type and ordered by the number of detected SVs. (B) The fraction of CNA breakpoints (Affymetrix SNP6 data) detected by the SV detection pipeline is shown for each tumor. Red lines indicate the median in each cancer type. Cancer types are ordered by the degree of copy number instability (average CNA segments). Segments involving breakpoints in centromeres and telomeres (within 2 Mb) were filtered out for this analysis, as these may not be assessable by SV callers. (C) Scatter plot showing correlation between the number of CNA and SV breakpoints in each sample. “Mostly arm level” indicates that more than half of all chromosome arms have arm-level CNAs. “Chromothripsis effected” indicates at least one chromosome arm with more than 30 copy number segments. (D and E) Two examples of complex relationships between CNAs and SVs. Blue represents deletions, and red represents amplifications. The linear copy number amplitude is indicated for the CNA data. (F) Structural basis of copy number amplifications and deletions. Only CNA segments where both breakpoints could be matched to those of a single SV event were considered (15-kb tolerance). Pie chart slices are color coded similar to Fig. 2A.
Confirming previous data (10, 23⇓–25), breast (BRCA), ovarian (OV), and lung squamous (LUSC) tumors had a large number of SVs (>100 SVs per sample on average). Likewise, this number was high in uterine (UCEC) and stomach (STAD) tumors (107 and 125 per sample, respectively). Few SVs were found in copy number stable cancers such as thyroid carcinoma (THCA) (26) (∼8 per sample on average). Similarly, clear cell and chromophobe kidney carcinomas (KIRC and KICH) and low-grade glioma (LGG) tumors exhibited a small number of SVs (Fig. 2A).
We next investigated the relationship between somatic CNA and SV breakpoints. On average, 34% of CNA breakpoints (segments with absolute log2 amplitude ≥0.3, SNP6 platform) detected across all 600 tumors were confirmed in the SV data. Overlaps were reduced in CNA stable tumors such as KICH, which are dominated by arm-level events and therefore may lack breakpoints detectable by WGS (Fig. 2B). Absolute numbers of CNA and SV breakpoints were strongly correlated across tumors (Pearson’s r = 0.81; Fig. 2C). However, a small number of samples had elevated CNAs not reflected in the SV data (Fig. 2C), again explainable by the predominance of arm-level changes in these tumors (Fig. S2). These results further support the validity of the SV data and show that both approaches are effective for quantifying overall chromosomal instability.
Copy number profile of arm-level dominant tumors. Screenshot from Integrative Genomics Viewer (IGV) showing the copy number changes in cases with mostly arm-level events. The number of CNV breakpoints in these cases is much higher than SV breakpoints detected using WGS. Chromosome-level or arm-level CN change is hard to detect using WGS. GBM, glioblastoma.
The overlap between CNA and SV data at the level of complete events (for example, both breakpoints for an amplification segment) was considerably lower (∼10%) compared with breakpoint-level overlaps. In many cases, this was explained in the SV data by a more complex structural basis than immediately evident from CNA profiles. Examples include apparent closely spaced large copy number deletion segments, revealed by the SV data to be due to a large (for example, whole-chromosome) deletion event superimposed on several small-amplitude–neutralizing tandem duplications (Fig. 2D). Similarly, apparent clustered copy number amplification segments could sometimes be explained by a larger duplication in combination with several small deletions (Fig. 2E). These results emphasize that care needs to be taken when drawing conclusions about structural chromosomal changes based on CNA data alone.
Of all CNA deletion segments (log2 amplitude ≤−0.3) with a clear correspondence in the SV data, we found that 97% were categorized as deletions in the SV analysis (Fig. 2F). Likewise, 90% of amplified regions were due to tandem duplication, whereas only 5% each overlapped with deletions or inversions.
Global Impact of Somatic SVs on Tumor mRNA Levels.
For somatic SVs to be functional in cancer, they need to have an influence on transcription in some way, by altering mRNA structures or levels. We found that a large fraction (56%) of SV breakpoints were in genic regions, and that on average 112 genes per sample overlapped with at least one SV breakpoint, showing that SVs have considerable potential to alter tumor transcriptional output. A known mechanism for gene activation by SVs is the swapping of strong and weak promoters in the context of gene fusions. This is described for many individual genes, both with and without alteration of the protein-coding region (6, 27, 28), but it remains unclear to what extent such events have a measurable global influence on tumor expression profiles.
Based on the SV data, we identified local rearrangements predicted to result in a fusion gene involving a strong promoter juxtaposed to a normally weakly expressed gene. We thus considered only gene pairs where the 5′ partner was expressed at a considerably higher average level within a given cancer type (greater than fourfold difference; SI Methods). We found that these criteria were often predictive of strong induction of the 3′ fusion partner: In 62/313 cases (20%), we observed induction of the 3′ partner above threefold in the structurally affected tumor compared with other tumors of the same cancer type (Fig. 3A). This should be compared with only 8% and 3% for fusions involving incorrect gene orientation or a weakly expressed promoter, respectively (Fig. 3A; P = 1.1e-5 compared with incorrect orientation, P = 2.7e-7 compared with weak promoters, and P = 1.3e-6 compared with randomly picked cases; two-sided Kolmogorov–Smirnov test).
Impact of SVs on gene expression levels in tumors. (A) Gene fusion events leading to swapping of strong and weak promoters have a global influence on mRNA expression levels in tumors. Cumulative distribution function (CDF) plot showing log2-transformed expression differences for genes predicted in the SV data to be activated due to fusion with a stronger promoter, comparing the SV-affected sample with the mean of other samples in the same cancer type. Control analyses involving fusions not predicted to cause activation of the 3′ partner (incorrect strand orientation or weakly expressed fusion partner) are shown by dashed lines, in addition to a randomized set (randomly picked sample within the relevant cancer type, median of 100 randomizations). Inversions and deletions were used for this analysis. (B) Global influence of promoter-proximal SVs on mRNA levels. CDF plot showing expression differences for genes harboring an upstream SV event (−10 to 0 kb), comparing the affected sample with the mean of other samples in the same cancer type. Genes with SVs in a region farther upstream (−110 to 100 kb) and randomized controls are shown. (C) Expression vs. copy number change for TERT in KICH. Samples with promoter-proximal SV breakpoints are marked in orange. P values were calculated using the Wilcoxon rank-sum test comparing the expression of the altered tumors with other samples. (D–F) Same as C for KIRC, READ, and SKCM, respectively. Samples with TERT promoter mutations are indicated in purple.
Among the 62 cases showing induced expression were several confirmatory observations (Fig. S3), including prominent activation of ERG in prostate cancer (29) as well as RET (30) and ALK (31) in THCA, by fusion with strong promoters from TMPRSS2, CCDC6, and STRN, respectively. Notable was strong (8.1-fold relative mean) mRNA induction of protein kinase C beta (PRKCB) through fusion with USP7, a highly expressed deubiquitinase, in colorectal carcinoma (COAD). PRKCB is a protein kinase linked to tumor growth and survival (32) that is druggable by the small-molecule inhibitor enzastaurin (33). Although fusions involving PRKCB were recently detected in fibrous histiocytoma (34), they have not been described in COAD. Additionally, among the induced transcripts were several long noncoding RNAs (lncRNAs), including the intergenic RP11-521D12.1, strongly (39.9-fold) induced through fusion with ASAP2 in BRCA. Although the functional relevance remains to be established, this establishes promoter swapping as a mechanism for activation of lncRNAs in cancer.
Promoter hijacking due to gene fusions resulting in mRNA induction. (A) Cases caused by deletions. The box plots show mRNA levels of fusion partners in relevant cancers. Red dots represent the affected samples. 3′ partners of fusions are shown in green. Only cases where the 3′ partner in the affected sample was expressed above threefold compared with other samples of the same cancer type are shown here. (B) Same as A but for inversions. Here, weakly expressed partners of the fusions are shown in green. On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers.
A few recent studies show that structural rearrangements can contribute to transcriptional activation through shuffling of cis-regulatory elements near the TSS of genes, without altering mRNA structure (35, 36). This has been described in individual cases, but the impact on a more global scale remains unknown. To investigate this, we identified cases of genes being affected by upstream SVs (−10 to 0 kb from the TSS). We next considered the relative expression levels for these genes in affected samples compared with unaffected samples of the same cancer type while avoiding confounding effects from copy number changes and chromothripsis (SI Methods and Fig. S4). Cases with SVs in a region −110 to 100 kb upstream, or randomly picked cases, were considered as controls. We found that SV-affected genes/cases showed a weak trend toward overall increased mRNA levels (Fig. 3B; P = 0.069 and P = 0.0099, respectively, compared with the control sets). Although it remains to be determined to what extent the observed events are under positive selection, the results suggest upstream SVs may often contribute to transcriptional activation of genes in human tumors.
Relationship between chromotripsis and gene expression levels. The cumulative distribution function plot shows relative expression levels for genes/samples affected by chromthripsis. For a given gene, these relative levels were determined by comparing the expression level in the affected sample to the mean level of all samples with the same cancer type. Similarly, relative levels for genes/cases affected by overlapping simple SVs are shown. Randomly picked samples where it was affected by neither chromothripsis nor simple SVs were considered as controls. One hundred randomizations were performed, and the plot shows the mean. The plots show log2-transformed relative expression levels.
Notably, among top cases showing increased expression in relation to upstream SVs (40 cases above fourfold) was telomere reverse transcriptase (TERT) in two different cancers, KICH and KIRC (Fig. S5). The data from KICH, where TERT upstream SVs were found in two samples with high TERT mRNA levels, confirm recently published results (7) (Fig. 3C). In KIRC, the sample with the highest TERT expression also harbored an upstream SV, whereas an additional affected sample lacked mRNA elevation in this cancer (Fig. 3D). Although not considered in the global screen due to simultaneous copy number gain, TERT upstream SVs were found also in one rectal adenocarcinoma (READ) and one melanoma (SKCM) sample (Fig. 3 E and F), the latter notably lacking TERT promoter mutations. These results further support a role for upstream SVs in activation of TERT in cancer.
Genes having an SV breakpoint within 10 kb upstream of the TSS and showing mRNA induction in the affected sample. mRNA level and copy number amplitude are shown for each sample in a given cancer type, with SV-affected samples indicated by orange dots. Only genes with at least a fourfold increase in the SV-affected sample are shown here.
Improved Gene Fusion Detection by Combined DNA/RNA Analysis.
Recent studies have relied on RNA-seq data to detect fusion genes in comprehensive tumor materials (37, 38). However, similar to SV analysis, RNA-seq–based fusion detection is prone to both false positives and negatives (39), for example due to read-mapping issues or misinterpreted germ-line events (40). Therefore, we explored whether fusion gene detection can be improved by combining RNA and DNA data.
We applied FusionCatcher (41) to identify fusion transcripts in 431/600 samples with available paired-end RNA-seq data. In total, 9,657 fusions (on average 22.4 per sample) were reported, of which 1,467 were in-frame (Dataset S2). In parallel, we annotated all genomic SV breakpoints and identified 9,854 predicted gene fusions, of which 740 were inversions, 1,240 were deletions, 2,511 were duplications, and 5,363 were interchromosomal translocations (Fig. 4A). Of these, 5,209 (52%) involved genes fused in the correct orientation, thus having the potential to produce a fusion transcript. We found that only 299 unique pairs of genes were detected in both DNA and RNA data, of which 97 were in-frame according to FusionCatcher (Fig. S6). In this high-confidence set with dual DNA/RNA support, there were 78 genes recurrently involved in fusions in more than one sample (Table S1).
Combined DNA/RNA approach improves detection of gene fusions. (A) Venn diagram (not proportional) showing the number of fusions detected by analysis of RNA-seq data and in the genomic SV data, as well as overlaps between the two. (B) Enrichment of known cancer genes (from the Cancer Gene Census) among the top recurrently fused genes identified by analysis of RNA-seq data, genomic SVs, or a combined DNA/RNA approach. The graphs show the cumulative fraction of known cancer genes as a function of gene rank. Rank was determined for individual genes based on the frequency of recurrence, taking into account all fusions in which a given gene was predicted to participate. (C) A fusion between EIF2AK4 and BUB1B in one PRAD tumor, due to a deletion in chromosome 15. The deleted region is shown in blue, whereas nonaltered regions are shown in gray. mRNA levels [represented as reads per kilobase of transcript per million mapped reads (RPKM)] for both fusion partners across all PRADs are shown, with the affected sample indicated in red.
List of fusions detected by both SV-based and RNA-seq–based analyses. For each fusion, the underlying structural basis is shown. Bars on the right show the number of recurrences across all tumors, color-coded by cancer type similar to Fig. 2A. Genes listed in cancer gene census are shown in bold.
Recurrent fused genes detected by the DNA/RNA approach: List of unique genes fused with multiple genes in different cancer types
Next, we evaluated the two approaches, either alone or in combination, by investigating the enrichment of known cancer drivers in the Cancer Gene Census (CGC) database among the top recurrently fused genes (42). We found that neither method alone showed notable CGC enrichment, although the SV-based analysis performed slightly better (Fig. 4B). However, combining both approaches resulted in a notable CGC enrichment among top-scoring genes, with 14 of the 78 recurrently fused genes (18%) being listed in the CGC (random expectation ∼2.5%). This suggests that DNA and RNA approaches to fusion gene detection complement each other, such that a combination of both provides more relevant results.
In the set of high-confidence detections with dual DNA/RNA support (Fig. S6) were several known functional fusions, including TMPRSS2–ERG in 6/20 prostate adenocarcinoma tumors (PRAD) and RET–CCDC6 in THCA (2/34 tumors). Results also included EIF2AK4 fused to BUB1B, a mitotic checkpoint kinase, by a deletion in one PRAD tumor, with strong transcriptional induction of BUB1B in this sample (Fig. 4C).
Novel Mechanism for Nuclear Factor, Erythroid 2 Like 2 Activation in Cancer.
Notable among the high-confidence fusions was paired box 8 (PAX8) fusing with nuclear factor, erythroid 2 like 2 (NFE2L2) in one THCA tumor, revealed by the SV data to be due to tandem duplication on chromosome 2 (Fig. 5A). This event was associated with potent NFE2L2 transcriptional activation (Fig. 5B) as well as loss of the KEAP1 interaction domain in NFE2L2 (Fig. 5A). NFE2L2 (NRF2) is a transcription factor that coordinates the cellular response to oxidative stress, which is activated in some tumors by mutations that disrupt interaction with its inhibitor KEAP1 (43). However, NFE2L2 fusions involving loss of the KEAP1 interaction domain have not been described, and the result is thus suggestive of a novel mechanism for NFE2L2 activation in cancer.
Activation of the NFE2L2 pathway by a PAX8–NFE2L2 fusion gene in thyroid carcinoma. (A) PAX8 fused to the transcription factor NFE2L2 as a result of tandem duplication on chromosome 2. Nonaltered regions are shown in gray, and the tandem duplication is indicated in red. Major domains are indicated. The 5′ end of PAX8 replaces the inactivating KEAP1 interaction domain in NFE2L2. (B) Fusion with PAX8 is associated with strong transcriptional induction. The plot shows mRNA levels for both fusion partners in all thyroid carcinoma samples, with the affected tumor indicated in red. (C) Strong activation of NFE2L2 targets in the tumor expressing the PAX8–NFE2L2 fusion transcript. mRNA levels of NFE2L2 and four canonical target genes (GCLC, GCLM, TXNRD1, and NQO1) are shown across 572 samples with available exome data from TCGA. The PAX8–NFE2L2–expressing tumor, as well as one additional tumor identified as having a predicted NFE2L2-activating mutation in KEAP1, is indicated in green. P values were calculated using Student’s t test, comparing the expression of the two altered tumors with the remaining samples. (D) Transcriptome profiling of TRAMP-C1 murine prostate carcinoma cells transduced with a lentivirus expressing the PAX8–NFE2L2 fusion transcript or a control lentivirus. Volcano plot showing 534 genes significantly induced and 215 genes repressed at q < 0.01. (E) Enriched gene sets (Pnominal < 0.01) calculated by GSEA. The familywise error rate (FWER) calculated by GSEA is shown on the x axis, and the dashed line represents PFWER = 0.05. WikiPathway (57) gene sets were used for this analysis. (F) Enrichment plot for the transcriptional activation by NRF2 gene set. (G) Altered expression of NFE2L2 target genes Gclc and Gclm was confirmed using RT-qPCR. Expression levels are shown relative to ubiquitin (mUb). P values were determined using Student’s t test. Bars indicate the mean, and error bars indicate SD.
To further evaluate the PAX8–NFE2L2 fusion, we investigated the expression of known target genes downstream of NFE2L2. We found that known NFE2L2 targets were strongly induced in the tumor that carried the PAX8–NFE2L2 fusion, supporting its function as a potent activator of the NFE2L2 transcriptional program (Fig. 5C). By considering an expanded set of THCA tumors with available exome sequencing data, we identified one tumor with a KEAP1 mutation, likewise associated with strong induction of NFE2L2 target mRNAs (Fig. 5C). This further supports that NFE2L2 activation may contribute to tumorigenesis in some thyroid carcinomas.
We next synthesized the PAX8–NFE2L2 fusion transcript and cloned it into a lentiviral vector (Fig. S7), and used this to overexpress it in the murine prostate adenocarcinoma cell line TRAMP-C1 (SI Methods). We used high-coverage RNA sequencing (>30 million reads per sample) to compare the expression profiles of cells expressing the PAX8–NFE2L2 fusion transcript with cells transduced with an empty vector. We observed a prominent transcriptional response dominated by inductions (534 induced and 215 repressed genes at q < 0.01; Fig. 5D).
PAX8–NRF2 fusion gene constructs.
To identify affected pathways, we next performed gene set enrichment analysis using GSEA (44). Whereas no gene sets were significantly down-regulated, four sets were enriched among induced genes (adjusted P < 0.05; Fig. 5E). Notably, these included experimentally determined targets of NFE2L2 (“transcriptional activation by NRF2”; Fig. 5F), as well as “glutathione metabolism,” “oxidative stress,” and “arylhydrocarbon receptor” signaling, all of which are known to be regulated by NFE2L2 (43, 45). Induction of the canonical NFE2L2 targets Gclc and Gclm was confirmed by quantitative RT-PCR (2.4- and 3.1-fold, respectively; P < 0.022), further showing that the PAX8–NFE2L2 fusion protein is a capable activator of NFE2L2 target mRNAs (Fig. 5G).
SI Methods
Data Overview.
Sequencing data from 600 tumors in 18 nonembargoed cancer types were obtained from the Cancer Genomics Hub (cgHub) repository (Dataset S1). Aligned high-coverage WGS tumor/normal pairs (bam file >75 Gb) together with matched RNA-seq data for 431/600 samples were used in this study. In addition, Affymetrix SNP6 copy number data for the matched tumors (seg files) were downloaded.
Benchmarking SV Detection Tools.
To compare the performance of WGS-based SV detection tools in terms of both sensitivity and specificity, we implemented a benchmarking method based on a comparison with CNA breakpoints determined from Affymetrix SNP6 array data. CNA segments with absolute amplitude less than 0.3 and segment size less than 10 kb were removed. Similarly, intrachromosomal SVs smaller than 10 kb were filtered out. Sensitivity scores were measured as the fraction of CNA breakpoints that had an SV breakpoint within 15 kb. To assess the specificity of the results, we randomized the positions of the CNA breakpoints by adding a normally distributed random noise with an SD of 1 Mb to each breakpoint, followed by assessment of overlaps with SV breakpoints as described above.
Benchmarking was performed on 11 samples selected to have a high number of CNA breakpoints. Our benchmarking method was used to assess SV breakpoints derived from DELLY, SVDetect, BreakDancer, and Meerkat, all applied to BWA bam files available from TCGA. For SVDetect, these parameters were specified: window_size 10000, step length 3000, nb_pairs_threshold 3, and strand filtering 0. Meerkat was run using parameters suggested for >20× coverage genomes. DELLY (version 0.7.2) and BreakDancer (version 1.1.2) were used with default parameters.
Mapping Structural Variations Using WGS.
Somatic SVs for 600 tumor/normal pairs were mapped using Meerkat (11). A variant was considered only if it had at least three discordant read pairs and one read spanning the breakpoint junction. The “somatic call” function was used to identify somatic events. Next, simple SVs with two breakpoints spanning the same repeat family in the RepeatMasker track taken from the University of California Santa Cruz Table Browser view (58) were removed. Finally, SV breakpoints were annotated against the GENCODE (version 17) gene annotation (59). In cases where breakpoints overlapped with both a coding and a noncoding gene, the coding gene was used for annotation.
Integrative Analysis of SV and Expression Data.
Gene expression levels were quantified using RNA-seq bam files as described previously (47). To find cases of promoter swapping due to gene fusion events, we identified SVs predicted to result in a gene fusion. First, we considered SVs where both breakpoints were within genes expressed in at least one sample in the relevant cancer type (RPKM >0) and where the two genes differed at least fourfold in regard to their median expression level in the same cancer type. The relative expressions (fold-change values) were quantified using the mean. We found that regions with a high density of SV events (200-kb genomic tiles having more than 10 breakpoints), likely due to chromothripsis, showed a high degree of transcriptional change (Fig. S4), and cases where the relevant gene was affected by chromothripsis were therefore excluded. We determined whether the two genes were correctly oriented or not by taking into account their annotated directions of transcription and the type of SV event.
To detect promoter-proximal cases, we considered SVs that had one breakpoint within 10 kb upstream of the TSS of a coding gene and where the other breakpoint was in an intergenic region. Specifically for deletions, the other breakpoint had to be farther upstream. Only samples that were copy number-neutral (absolute copy number amplitude <0.3) and not affected by chromothripsis with respect to the affected gene were considered. Genes that were expressed in at least one sample in the relevant cancer type (RPKM >0) were considered for further analysis. The relative expressions (fold-change values) were quantified using the mean. For both analyses, a pseudovalue of 0.01 was used to avoid division by zero when quantifying log2 ratios. As a control, randomly chosen samples within the relevant cancer type were used and expression fold change was calculated the same way as described above.
Fusions transcripts were identified using FusionCatcher (41) with default parameters. In the combined DNA/RNA fusion analysis, we included events that were supported by both data types in at least one sample and would in these cases also consider additional tumors where the event was supported by only one data type.
Plasmid Construction and Isolation.
PAX8–NFE2L2 fusion gene constructs (Fig. S7) were synthesized and cloned into the Lv229 lentiviral plasmid vector (GeneCopoeia). The plasmid along with a matching empty vector control was transformed into laboratory-propagated Stbl3 cells (Thermo Fisher Scientific) made competent using Mix & Go Reagent (Zymo Research). Plasmids were isolated using the Qiagen Plasmid Plus Midi Kit.
Cell Culture.
TRAMP-C1 cells and HEK-293T cells were purchased from ATCC and cultured using DMEM media from Gibco (Thermo Fisher Scientific) supplemented with 10% (vol/vol) FBS from HyClone (GE Healthcare Life Sciences) and 50 μg/ml gentamicin and 1 mM sodium pyruvate (only for HEK-293T) (Thermo Fisher Scientific).
Virus Production and Transduction.
HEK-293T cells were seeded on 10-cm plates (Sarstedt) 1 d before transfection. Cells were transfected with the aforementioned plasmids using the Lenti-Pac HIV Expression Packaging Kit (GeneCopoeia), and supernatant was collected at 42, 46, 50, and 66 h post transfection. The virus-containing supernatant was passed through 0.45-µm low–protein-binding filters (Sarstedt).
For each transduction, 4 mL filtered supernatant was concentrated by ultracentrifugation at 50,000 × g for 90 min at 4 °C. Supernatant was carefully discarded and the pellet was dissolved in 500 µL media. TRAMP-C1 cells seeded on 12-well plates (Sarstedt) were transduced with a 1:1 mixture of media and concentrated virus. Twelve hours post transduction, fresh media were added and cells were allowed to grow for another 24 h.
RNA Extraction and Quantitative RT-PCR Analysis.
Quick-RNA MiniPrep (Zymo Research) was used per the manufacturer’s recommendation to isolate RNA. cDNA was prepared and quantitative RT-PCR was performed as previously described (60) using the primers below (F, forward; R, reverse):
RNA Sequencing.
RNA library preparation and sequencing were performed on an Illumina HiSeq 2500 (GATC Biotech) using RNA from one wild-type TRAMP-C1 cell line, three cultures transduced with empty vector, and three cultures transduced with viral vector containing the PAX8–NFE2L2 fusion; 50-bp single-end reads were obtained. The data were deposited in the Gene Expression Omnibus under accession number GSE79756.
Differential Expression Analysis.
RNA-seq reads were aligned to the GRCm38 reference genome using STAR (61) with default parameters. Gene read counts were quantified using HTSeq (62) against the GENCODE 17 reference annotation (using the options “-s no” and “-m intersection-strict” for htseq-count). Differential expression testing was performed using DESeq2 (63). Gene set enrichment analysis was performed by the GSEA Preranked (44) function using genes ranked by log2 fold change. Gene sets in WikiPathways (57) were used for the enrichment analysis.
Discussion
In the present study, we used an integrative workflow to analyze structural rearrangements in 600 tumors from 18 different cancer types. We rely on coanalysis of somatic CNA and SV data to optimize tools and parameters for SV detection, and to provide insights into the general structural basis of CNAs in tumors. Further coanalysis with RNA-seq data provided insights into the overall influence of SVs on gene expression in tumors.
Surprisingly, although there was a clear correlation between the number of CNA breakpoints and SV breakpoints seen in each tumor, only 34% of copy number breakpoints had correspondence in the SVs detected by WGS. It is likely that this partly can be explained by limited sensitivity in detecting SV breakpoints, for example due to insufficient read coverage in some regions. However, another contributing factor could be false positives or positional uncertainties arising from the circular binary segmentation algorithm used for detecting CNA breakpoints in array-based copy number data.
TERT is a subunit of telomerase essential for telomere maintenance and long-term proliferation potential of cancer cells. Both somatic promoter mutations (46⇓–48) and copy number gains are known to contribute to TERT activation in tumors. Recent data from KICH suggest that also SVs in the promoter of TERT could be important (7). Our results extend this to additional cancers, thus more firmly establishing regulatory SVs as a mechanism for TERT mRNA induction.
Earlier studies have emphasized large discrepancies between different callers for detecting fusions in RNA-seq data and that false positive rates are high (39). This should contribute to the limited overlap observed here between DNA- and RNA-based fusion predictions. Additionally, read-through transcription events may produce fusion transcripts involving neighboring genes, which have no correspondence in the SV data, and some genomic fusion may not be expressed, further reducing overlaps. Although both DNA and RNA data have limitations when it comes to detecting fusions, we found that a combination of the two data types reduced the noise inherent to each approach. This is in agreement with recent studies that combine DNA and RNA evidence to improve detection of gene fusions (49, 50).
BUB1B encodes a mitotic checkpoint kinase also known as BUBR1 that ensures proper chromosome segregation by slowing down mitosis (51). Hemizygous loss in mice results in increased susceptibility to carcinogen-induced colon cancer development (52) and accelerated aging (53). On the other hand, elevated expression of BUB1B has also been associated with cancer progression (54), and overexpression of Bub1b in mice results in tumorigenesis (55). Fusions involving this gene have not been previously described, and the functional relevance of the EIF2AK4–BUB1B fusion found here deserves further investigation.
NFE2L2 or KEAP1 mutations, indicative of activation of the NFE2L2/NRF2 pathway that coordinates the cellular response to oxidative stress, are common in some cancers, including lung adenocarcinoma, but are uncommon in THCA. However, a recent study established that the NFE2L2 pathway is frequently altered in THCA through other mechanisms, including KEAP1 promoter hypermethylation (56). Our results in THCA suggest an additional way of activating NFE2L2, whereby a fusion gene is formed that lacks the inhibitory KEAP1 interaction domain. This further establishes that NFE2L2 activation is important in THCA, and suggests a novel mechanism for activation of NFE2L2 in cancer.
In summary, by integrative analysis of genomic and transcriptomic data from a large tumor cohort, we provide multiple insights into structural rearrangements and their effects on gene expression in human cancer.
Methods
WGS data from 600 TCGA tumors were analyzed for somatic SVs using Meerkat (11). Gene expression levels were quantified as described previously (47). Fusions transcripts were identified using FusionCatcher (41) with default parameters. A detailed description of the methods used in the study is provided in SI Methods.
Acknowledgments
The results published here are in whole or in part based upon data generated by The Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions that constitute TCGA Research Network can be found at https://cancergenome.nih.gov. This work was supported by grants from the Knut and Alice Wallenberg Foundation (to E.L. and J.A.N.), Swedish Foundation for Strategic Research (to E.L.), Swedish Medical Research Council (to E.L. and J.A.N.), Swedish Cancer Society (to E.L. and J.A.N.), Åke Wiberg Foundation (to E.L.), Lars Erik Lundberg Foundation for Research and Education (to E.L.), Region Västra Götaland (ALF Grant to J.A.N.), Adlerbertska Forskningsstiftelsen (to J.B.), Assar Gabrielsson Foundation (to J.B.), W&M Lundgren Foundation (to J.B.), and Sahlgrenska Universitetssjukhusets Stiftelser (to J.B.). The computations were in part performed on computing resources provided by the SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project b2012108.
Footnotes
- ↵1To whom correspondence should be addressed. Email: erik.larsson{at}gu.se.
Author contributions: B.A.-M., J.A.N., and E.L. designed research; B.A.-M., J.B., and J.W.K. analyzed data; and B.A.-M. and E.L. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: The sequence reported in this paper has been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession no. GSE79756).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1606220113/-/DCSupplemental.
Freely available online through the PNAS open access option.
References
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Hillmer AM, et al.
- ↵
- ↵
- ↵
- ↵.
- Korbel JO, et al.
- ↵.
- Suzuki S,
- Yasuda T,
- Shiraishi Y,
- Miyano S,
- Nagasaki M
- ↵
- ↵.
- Mohiyuddin M, et al.
- ↵
- ↵
- ↵.
- Liu B, et al.
- ↵
- ↵.
- Zeitouni B, et al.
- ↵.
- Rausch T, et al.
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Tomlins SA, et al.
- ↵
- ↵
- ↵
- ↵
- ↵.
- Graff JR, et al.
- ↵.
- Płaszczyca A, et al.
- ↵
- ↵
- ↵
- ↵
- ↵.
- Carrara M, et al.
- ↵.
- Liu S, et al.
- ↵.
- Nicorici D, et al.
- ↵
- ↵.
- Jaramillo MC,
- Zhang DD
- ↵.
- Subramanian A, et al.
- ↵.
- Shin S, et al.
- ↵.
- Horn S, et al.
- ↵
- ↵.
- Huang FW, et al.
- ↵
- ↵.
- Zhang J, et al.
- ↵
- ↵.
- Dai W, et al.
- ↵
- ↵.
- Ding Y, et al.
- ↵.
- Ricke RM,
- Jeganathan KB,
- van Deursen JM
- ↵.
- Martinez VD, et al.
- ↵.
- Kutmon M, et al.
- ↵.
- Karolchik D, et al.
- ↵.
- Harrow J, et al.
- ↵.
- Bhadury J, et al.
- ↵.
- Dobin A, et al.
- ↵.
- Anders S,
- Pyl PT,
- Huber W
- ↵
Citation Manager Formats
Sign up for Article Alerts
Article Classifications
- Biological Sciences
- Biophysics and Computational Biology