The genome landscape of ERα- and ERβ-binding DNA regions

Liu et al. 10.1073/pnas.0712085105.

Supporting Information

Files in this Data Supplement:

SI Dataset 1
SI Dataset 2
SI Dataset 3
SI Table 2
SI Figure 8
SI Figure 9
SI Text
SI Table 3
SI Figure 10
SI Dataset 4




SI Dataset 1



SI Dataset 2



SI Dataset 3



Table 2. Differences in distance to closest TSS between partitions

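| Partition | ERa+ERb | ERa+ERb ERa-ERb | ERa+ERb ERb+ERb | ERa-ERb | ERb+ERb | ERa+ERb ERa-ERb ERb+ERb |
| --- | --- | --- | --- | --- | --- | --- |
| ERa+ERb | - | 0.3011 | 0.0002419 | 0.7411 | 0.0002668 | 0.03028 |
| ERa+ERb ERa-ERb | | - | 2.744e-08 | 0.04972 | 5.091e-09 | 6.096e-06 |
| ERa+ERb ERb+ERb | | | - | 5.35e-05 | 0.9131 | 0.01222 |
| ERa-ERb | | | | - | 2.398e-05 | 0.01664 |
| ERb+ERb | | | | | - | 0.007791 |
| ERa+ERb ERa-ERb ERb+ERb | | | | | | - |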

Table cells show P values from two-sided Wilcoxon tests.
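As a rough illustration of how one cell of such a table could be computed, the sketch below implements a two-sided rank-sum test with a normal approximation. It is not the authors' code: the source does not name the software used, the function names `average_ranks` and `rank_sum_two_sided` are hypothetical, and the variance is not corrected for ties, so P values will differ slightly from those of an exact implementation.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_two_sided(x, y):
    """Two-sided P value from a rank-sum test (normal approximation,
    no tie correction of the variance; an assumption of this sketch)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2   # Mann-Whitney U for sample x
    mu = n1 * n2 / 2                           # mean of U under the null
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12) # SD of U under the null
    z = (u1 - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Here `x` and `y` would be the TSS distances of the regions in two partitions; repeating the call for every pair of partitions fills the upper triangle of the table.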





SI Figure 8

Fig. 8. Heatmap representation of global differences in predicted TFBS densities between the partitions. This is the full version of Fig. 6, with all TF models included (Fig. 6 is a reduced version that omits the dendrograms and the TFs with no preference for any set). Columns represent the different partitions in the Venn diagram (Fig. 3). Rows are densities of predicted TFBSs. The gene name for each predicted TFBS is indicated to the left. TFBS densities are shown as Z scores, ranging from -15 (strong under-representation, red) to +15 (strong over-representation, blue). Columns and rows are clustered by similarity.





SI Figure 9

Fig. 9. Distance bias of ERα- and ERβ-binding regions. Labels of partitions and associated colors are as in the Venn diagram in Fig. 3. This is a full-scale version of Fig. 5. (A) Boxplot showing the distributions of distances to the closest TSS for regions within the different partitions. (B) Histogram of distances to the closest TSS. Negative distances indicate regions upstream of TSSs; positive distances indicate regions downstream of TSSs. Note that only the region around the TSS is shown.





SI Text

Distances to TSSs. In the main text, we use boxplots to visualize the differences in distance between binding regions and TSSs. These distributions can also be visualized as cumulative distributions of binding events (SI Fig. 10).
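A minimal sketch of the two quantities involved, the signed distance to the closest TSS and the cumulative fraction of regions within a given distance, is shown here. It assumes that distance is measured from the region midpoint and that upstream and downstream are defined relative to the strand of the TSS; the source does not specify either detail, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def signed_distance_to_closest_tss(region_mid, tss_list):
    """Signed distance from a region midpoint to its closest TSS.

    tss_list: non-empty list of (position, strand) tuples on one chromosome,
    sorted by position, with strand '+' or '-'.
    Negative values mean the region is upstream of the TSS, positive values
    mean it is downstream (relative to the direction of transcription).
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, region_mid)
    # Only the TSSs immediately left and right of the midpoint can be closest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, strand = min(candidates, key=lambda t: abs(region_mid - t[0]))
    offset = region_mid - pos
    return offset if strand == '+' else -offset


def cumulative_fraction(distances, thresholds):
    """Fraction of regions whose absolute distance is <= each threshold."""
    abs_sorted = sorted(abs(d) for d in distances)
    n = len(abs_sorted)
    return [bisect_right(abs_sorted, t) / n for t in thresholds]
```

Plotting `cumulative_fraction` against increasing thresholds for each partition gives curves of the kind described for SI Fig. 10.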

Clustering Binding Regions. For each region in a set A, we identify the region in another set B that has the highest overlap with it (expressed as the fraction of the nucleotides of the region in A). We annotate the region in A with this value and continue through the list of regions in set A. The values form a distribution in which 0 corresponds to no overlap and 1 to total overlap. We then repeat the analysis starting from set B, thereby correcting for differences in region sizes.
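The per-region overlap annotation described above can be sketched as follows. Regions are assumed to be `(chrom, start, end)` tuples with half-open coordinates, and the name `best_overlap_fractions` is hypothetical; the quadratic loop is for clarity only.

```python
def best_overlap_fractions(set_a, set_b):
    """For each region in set_a, return the largest fraction of its
    nucleotides that is covered by any single region in set_b."""
    fractions = []
    for chrom_a, start_a, end_a in set_a:
        length_a = end_a - start_a
        best = 0  # stays 0 when no region in set_b overlaps
        for chrom_b, start_b, end_b in set_b:
            if chrom_a != chrom_b:
                continue
            # Negative values mean the two intervals do not overlap.
            overlap = min(end_a, end_b) - max(start_a, start_b)
            best = max(best, overlap)
        fractions.append(best / length_a)
    return fractions


# The analysis is run in both directions because the fraction is
# expressed relative to the region in the first argument.
set_a = [("chr1", 100, 200), ("chr1", 500, 600)]
set_b = [("chr1", 150, 400), ("chr2", 100, 200)]
a_vs_b = best_overlap_fractions(set_a, set_b)
b_vs_a = best_overlap_fractions(set_b, set_a)
```

Running the function in both directions shows why the reverse comparison is needed: the same 50-nucleotide overlap covers half of the shorter region but only a fifth of the longer one.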

This analysis was performed for all sets versus each other giving six pair-wise comparisons.

Because overlap between regions in different sets, where it exists, is almost total, we decided to cluster all regions that overlap by at least 50% (expressed as a percentage of the nucleotides of the smaller region) (Fig. 3).
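One way to turn the 50% criterion into clusters and Venn diagram partitions is sketched here. The source states only the overlap threshold; linking regions transitively (single linkage, implemented with union-find) is an assumption of this sketch, as are the function names and the `(set_name, chrom, start, end)` tuple layout.

```python
def cluster_regions(regions, min_fraction=0.5):
    """Group regions that overlap by at least min_fraction of the smaller region.

    regions: list of (set_name, chrom, start, end) with positive length.
    Returns a sorted list of clusters, each a list of indices into regions.
    """
    parent = list(range(len(regions)))

    def find(i):
        # Find the cluster representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(regions)):
        _, chrom_i, start_i, end_i = regions[i]
        for j in range(i + 1, len(regions)):
            _, chrom_j, start_j, end_j = regions[j]
            if chrom_i != chrom_j:
                continue
            overlap = min(end_i, end_j) - max(start_i, start_j)
            smaller = min(end_i - start_i, end_j - start_j)
            if overlap >= min_fraction * smaller:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(regions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def partition_label(cluster, regions):
    """Name a cluster by the sorted set names of its member regions."""
    return " ".join(sorted({regions[i][0] for i in cluster}))
```

A cluster containing one ERa+ERb region and one ERb+ERb region would be labeled `ERa+ERb ERb+ERb`, matching the partition names used in the tables.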

Global Transcription Factor Binding Site Overrepresentation. If a transcription factor has no preference for one sequence set (partition) over another, the number of sites for that transcription factor in each set should be influenced only by the total number of nucleotides in the set. In other words, the sites should be distributed among the sequence sets in proportion to the size of each set. Assuming independence between sequence sets (rows) and transcription factors (columns), we sampled 1,000 new columns for each transcription factor based on the total site number for that transcription factor and the size of each sequence set. From these samples we estimate the expected mean and standard deviation under the independence assumption and calculate a Z score for each transcription factor in each sequence set. The Z score is calculated as:

$$Z = \frac{\mathrm{Obs} - \mathrm{Exp}}{\mathrm{Sd}}$$

where Obs is the observed number of sites for a transcription factor within a given sequence set, Exp is the expected number within the set under the independence assumption, and Sd is the standard deviation of this expectation. A negative Z score indicates a relative depletion of these transcription factor binding sites compared with the other groups, whereas a positive score indicates a relative enrichment. As such, these Z scores can be viewed conceptually as an "expression" value for the sequence set: a value that represents how important the model is in defining the sequence set relative to the other sequence sets. Correspondingly, we can use a heatmap representation to cluster Venn diagram partitions and transcription factors, much as genes and time-course measurements are clustered in expression analysis. For heatmap analysis, we use the heatmap.2 function in the gplots package in R, with standard settings. In the heatmap, each Venn diagram partition is a row, and each TF model used (with its prediction Z scores) is a column. Z scores are colored from red (under-representation) to blue (over-representation). The full figure resulting from this analysis is shown in SI Fig. 8. However, this figure contains a large number of uninformative transcription factor models that show no particular preference for any of the sequence sets. Removing all columns with a maximum absolute score below 4 gives a reduced representation of the differences between the sequence sets (as shown in Fig. 6).
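The sampling scheme and the filtering step can be sketched as below. The sketch reads "sampling new columns" as drawing the total number of sites for one transcription factor multinomially, with probabilities proportional to sequence set size; that reading, the function names, and the fixed random seed are assumptions rather than details given in the text.

```python
import random
from statistics import mean, stdev


def z_scores(observed, set_sizes, n_samples=1000, seed=0):
    """Z score per sequence set for one transcription factor.

    observed:  dict mapping set name -> observed number of sites (Obs)
    set_sizes: dict mapping set name -> total nucleotides in the set
    Requires at least one site in total, otherwise Sd is zero.
    """
    rng = random.Random(seed)
    names = list(observed)
    weights = [set_sizes[name] for name in names]
    total_sites = sum(observed.values())

    samples = {name: [] for name in names}
    for _ in range(n_samples):
        # Redistribute all sites across sets in proportion to set size.
        draws = rng.choices(names, weights=weights, k=total_sites)
        for name in names:
            samples[name].append(draws.count(name))

    result = {}
    for name in names:
        exp = mean(samples[name])   # Exp under independence
        sd = stdev(samples[name])   # Sd of that expectation
        result[name] = (observed[name] - exp) / sd
    return result


def informative_tfs(z_by_tf, threshold=4):
    """Keep TF models whose maximum absolute Z score reaches the threshold."""
    return {
        tf: z for tf, z in z_by_tf.items()
        if max(abs(v) for v in z.values()) >= threshold
    }
```

For example, 80 sites in one set and 20 in another set of equal size gives a strongly positive score for the first set and an equally strong negative score for the second, so that model would survive the filter.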

Pairwise Overrepresentation. Using the same settings in the ASAP tool as above, we calculated over-representation of all transcription factors between the different partitions in an all-versus-all fashion. We kept one partition as the positive set and sequentially used each of the other partitions as the background set, creating a total of 42 test environments. Over-representation was calculated using Fisher's exact test on the number of TFBSs per nucleotide in the different partitions. The results, applying a P value threshold of 0.01 or below, are summarized in SI Dataset 4.
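The pairwise design can be illustrated with the sketch below. It assumes a 2 x 2 table of site-containing versus remaining nucleotides in the positive and background sets and a one-sided test for enrichment in the positive set; the source does not spell out either choice, and ASAP's internal implementation may differ. The labels `P1` to `P7` are placeholders for seven partitions, which is what 42 ordered pairs implies.

```python
from itertools import permutations
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_greater(sites_pos, nt_pos, sites_bg, nt_bg):
    """One-sided Fisher's exact P value for enrichment in the positive set.

    sites_*: number of predicted TFBSs; nt_*: number of nucleotides.
    Sums hypergeometric probabilities for sites_pos or more sites
    falling in the positive set, with all margins held fixed.
    """
    total_sites = sites_pos + sites_bg
    total_nt = nt_pos + nt_bg
    log_denominator = log_choose(total_nt, total_sites)
    p_value = 0.0
    for k in range(sites_pos, min(total_sites, nt_pos) + 1):
        p_value += exp(
            log_choose(nt_pos, k)
            + log_choose(nt_bg, total_sites - k)
            - log_denominator
        )
    return min(p_value, 1.0)


# Every ordered (positive, background) pair of seven partitions.
partitions = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]  # placeholder labels
test_environments = list(permutations(partitions, 2))
```

Each entry of `test_environments` would be evaluated once per transcription factor model, and results with a P value of 0.01 or below would be retained.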

TFBS Analysis Using an Ab Initio Approach. Strong patterns may exist in the dataset that are not described in JASPAR. To make sure that the patterns observed with the JASPAR matrix models are the strongest patterns in the dataset, we used an ab initio method (MoAn, E.V., O. Winther, and A. Krogh, unpublished data) that finds patterns that discriminate between two groups of sequences. In general, the motifs found are consistent with the JASPAR-based analysis, which indicates that the GC and TA enrichments in the ERB and ERA sets are not an artifact of the models within JASPAR and are the strongest signals in these sets.

Evolutionary Conservation. For all nucleotides in each region, we retrieved PhastCons scores from 28 vertebrate alignments from the UCSC browser database. Briefly, each nucleotide receives a PhastCons score from 0 to 1, which corresponds to the likelihood that the position is under selective constraint across the species. Each region therefore has a corresponding vector of scores.

To see the general conservation properties of each partition, we align the score vectors belonging to a partition by their midpoints (not by their sequence). In this alignment, we plot the mean PhastCons score in each column (Fig. 7).
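The midpoint alignment amounts to averaging scores at equal offsets from each region's center. A minimal sketch follows; the window parameter `half_width` and the function name are hypothetical, and columns with no data are reported as `None`.

```python
def midpoint_profile(score_vectors, half_width):
    """Mean score per column after centering each vector on its midpoint.

    score_vectors: list of per-nucleotide score lists, one per region.
    Returns 2 * half_width + 1 means, for offsets -half_width..+half_width.
    """
    width = 2 * half_width + 1
    sums = [0.0] * width
    counts = [0] * width
    for scores in score_vectors:
        mid = len(scores) // 2
        for offset in range(-half_width, half_width + 1):
            idx = mid + offset
            # Shorter regions simply do not contribute to distant columns.
            if 0 <= idx < len(scores):
                sums[offset + half_width] += scores[idx]
                counts[offset + half_width] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Computing one profile per partition and plotting the profiles together gives the kind of comparison described for Fig. 7.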

In general, regions binding ERa are more conserved, and show a more defined profile with higher conservation toward the center, than regions binding ERb. The ERa+ERb ERa-ERb ERb+ERb partition lies between these two types of conservation profile. In the TFBS analysis above, we observed that the number of potential ER sites within regions is much higher in the ERb-bound regions. This might explain the more diffuse conservation profile for these regions: if they contain many active sites that are picked up by the ChIP experiment, centering the alignment at the midpoint of the region will produce diffuse conservation scores, because the active site might not be at the midpoint. If this is true, individual ER binding sites within the partitions should not differ significantly in conservation score. To test this hypothesis, we took the ER binding sites predicted in the TFBS analysis and extracted the mean PhastCons score over the nucleotides of each predicted site in each region. The distributions of mean scores are not significantly different between the partitions (P = 0.14, Kruskal-Wallis test), indicating that individual sites in the partitions on average have similar evolutionary constraints, and that the shallower conservation profiles in the ERb-binding sets are related to the higher number of ER binding sites within these regions.
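The per-site comparison can be sketched as two steps: averaging PhastCons scores over each predicted site, then computing the Kruskal-Wallis H statistic across partitions. The function names are hypothetical, and the sketch stops at the statistic; a P value would be obtained by comparing H with a chi-square distribution with one fewer degrees of freedom than the number of groups.

```python
def site_mean(scores, start, end):
    """Mean score over the half-open interval [start, end) of one region."""
    window = scores[start:end]
    return sum(window) / len(window)


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples.

    Raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        run = j - i + 1
        tie_term += run ** 3 - run
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank
        i = j + 1

    weighted = 0.0
    position = 0
    for group in groups:
        rank_sum = sum(ranks[position:position + len(group)])
        weighted += rank_sum ** 2 / len(group)
        position += len(group)

    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))
```

Each group passed to `kruskal_wallis_h` would hold the `site_mean` values of all predicted ER sites in one partition.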

Gene Ontology Analysis. We first aimed to assign target genes to regions. This is challenging because most regions are far from the closest TSS. We picked all regions in a partition that have a TSS-region distance of 10 kb or less. This gave a set of target gene symbols for each partition, using RefSeq annotation. Because several regions within the same set might be linked to the same gene, we removed redundancies. We analyzed pairs of gene lists using the GoStat tool (http://gostat.wehi.edu.au/) with standard settings, focusing on the sets bound preferentially by ERb versus ERa. We found no significant difference in gene function in any of the pairs tested, suggesting that the types of genes regulated by the two factors are generally not different.
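The target gene assignment can be sketched as follows. The sketch assumes that each region is linked to its single closest TSS and that distance is measured from the region midpoint; neither detail is stated in the text. The function name, the tuple layouts, and the gene symbols in the example are hypothetical.

```python
def target_genes(regions, tss_list, max_distance=10_000):
    """Non-redundant gene symbols whose TSS is closest to a region and
    lies within max_distance nucleotides of the region midpoint.

    regions:  list of (chrom, start, end)
    tss_list: list of (gene_symbol, chrom, tss_position)
    """
    genes = set()  # a set removes genes linked to several regions
    for chrom, start, end in regions:
        mid = (start + end) // 2
        best = None  # (distance, symbol) of the closest TSS so far
        for symbol, tss_chrom, tss_pos in tss_list:
            if tss_chrom != chrom:
                continue
            distance = abs(mid - tss_pos)
            if best is None or distance < best[0]:
                best = (distance, symbol)
        if best is not None and best[0] <= max_distance:
            genes.add(best[1])
    return sorted(genes)


example_regions = [
    ("chr1", 1000, 2000),      # 1.5 kb from GENE_A
    ("chr1", 5000, 6000),      # 2.5 kb from GENE_A (same gene, kept once)
    ("chr1", 480000, 481000),  # 19.5 kb from GENE_B, too far
    ("chr2", 0, 100),          # no TSS on this chromosome
]
example_tss = [("GENE_A", "chr1", 3000), ("GENE_B", "chr1", 500000)]
example_targets = target_genes(example_regions, example_tss)
```

The resulting per-partition gene lists are what would be submitted pairwise to a GO enrichment tool.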





Table 3. Primer pairs used for ChIP followed by real-time PCR (Fig. 4)

| Region name | Primers | Chromosome | chromStart (hg18) | chromEnd (hg18) |
| --- | --- | --- | --- | --- |
| α+βα-ββ+β1 | 5'-GGGATTTCCAGGGCCAAT-3' | chr1 | 19795362 | 19796205 |
| | 5'-GCCGTGACCAGGCCTTT-3' | | | |
| α+βα-ββ+β2 | 5'-GATGGATGGGAACACATTGGT-3' | chr1 | 201324679 | 201325622 |
| | 5'-TGGTGGCGGAGCACAAA-3' | | | |
| α+βα-ββ+β3 | 5'-CGTACAACCGGAGGGACAGA-3' | chr6 | 122972517 | 122972958 |
| | 5'-TTTCAATTCCCTTCCTGCTTTC-3' | | | |
| α+βα-ββ+β4 | 5'-TCAGATGCCCCCTGTCAGTT-3' | chr3 | 162613088 | 162613421 |
| | 5'-CAGCCAGCCACAGACAGCTA-3' | | | |
| α+βα-β1 | 5'-TCTAACAACATGAAGGGAAAAAACAA-3' | chr6 | 11155031 | 11155281 |
| | 5'-CAACAGCCGCAGGGTTCT-3' | | | |
| α+βα-β2 | 5'-TGGAGCGCAGGCTGTGA-3' | chr22 | 45517437 | 45517931 |
| | 5'-TATGGCACTCCTGAGCACTCA-3' | | | |
| α+βα-β3 | 5'-TTGCAGGGATCAGCTCATGTT-3' | chrX | 40322015 | 40322445 |
| | 5'-GTCCAGCAGTGAGTTCTGAGTGA-3' | | | |
| α+ββ+β1 | 5'-GCCCGAGAGGCATTTGTATTT-3' | chr22 | 19601001 | 19601221 |
| | 5'-AAGTGGGTAACCTGGCTATCATG-3' | | | |
| α+ββ+β2 | 5'-AGGCCCCCGGGATGA-3' | chr3 | 197498994 | 197499290 |
| | 5'-TGACCCTGGGCCATTCC-3' | | | |
| α+ββ+β3 | 5'-GAGAGCAAAAAGCCAAGGTTACA-3' | chr1 | 149246684 | 149247587 |
| | 5'-CTACAGCCTCGGCAAATATCTTC-3' | | | |





SI Figure 10

Fig. 10. Fraction of binding regions within a partition that are within a given distance from a RefSeq TSS. Note that the distribution is cumulative. Line colors for the partitions are as in Fig. 3.





SI Dataset 4

This Article

  1. PNAS February 19, 2008 vol. 105 no. 7 2604-2609