Hyperconserved CpG domains underlie Polycomb-binding sites

Tanay et al. 10.1073/pnas.0609746104.

Supporting Information

Files in this Data Supplement:

SI Figure 5
SI Figure 6
SI Figure 7
SI Figure 8
SI Figure 9
SI Figure 10
SI Figure 11
SI Table 1
SI Figure 12
SI Figure 13
SI Figure 14
SI Text




SI Figure 5

Fig. 5. Dinucleotide distribution around CpGs and GpCs. Shown are the dinucleotide distributions around 8 million intergenic, non-CpG island, nonrepetitive human CpG dinucleotides (Upper), computed by centering the sequences on the CpGs and counting dinucleotide multiplicities at each relative position. The same analysis applied to GpC dinucleotides is shown in Lower, reflecting the low information content around GpC as opposed to the highly informative CpG sequence context.
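The counting procedure described in this legend can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `dinucleotide_profile`, the `flank` parameter, and the choice to skip occurrences too close to the sequence ends are assumptions, and the filtering for intergenic, non-CpG island, nonrepetitive loci is not shown.

```python
from collections import Counter, defaultdict


def dinucleotide_profile(seq, center="CG", flank=3):
    """Count dinucleotides at each offset relative to every `center` occurrence.

    Returns {offset: Counter({dinucleotide: count})}; offset 0 is the centre
    dinucleotide itself. Occurrences without a full window of `flank`
    positions on both sides are skipped (an assumption of this sketch).
    """
    seq = seq.upper()
    profile = defaultdict(Counter)
    start = seq.find(center)
    while start != -1:
        if start - flank >= 0 and start + flank + 2 <= len(seq):
            for offset in range(-flank, flank + 1):
                pos = start + offset
                profile[offset][seq[pos:pos + 2]] += 1
        start = seq.find(center, start + 1)
    return dict(profile)
```

Calling the same function with `center="GC"` would give the corresponding GpC-centered profile for comparison.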





SI Figure 6

Fig. 6. Dinucleotide distributions around conserved and diverged CpGs. Shown are dinucleotide distributions around human CpGs that are conserved in chimp (Center), compared with distributions around human CpGs that diverged in chimp to TpG (plus strand deamination, Left) or CpA (minus strand deamination, Right).





SI Figure 7

Fig. 7. Architecture of the M-score model. Shown are the dinucleotide log odds for plus strand deamination [log(p^+(s[i+j]s[i+j+1], j) / p(s[i+j]s[i+j+1], j)); see Methods], dissected into three main components to illustrate the architecture of the M-score model. (A) The near-context effect, including the effects of the five nucleotides flanking the CpG. As shown, there are marked differences between the dinucleotide frequencies of conserved and deaminated CpGs at these offsets. (B) The GC content effect. This profile was computed by smoothing the log-odds profile by averaging over one period (10 nucleotides). The trend reflects the long-range effect of GC content around CpGs, showing that as GC content increases, the rate of deamination decreases. (C) The nucleosome effect. This profile was generated by subtracting the period average (B) from the log-odds profile. The remaining pattern is highly similar to the previously characterized nucleosome pattern (Fig. 1), showing that CpGs located specifically in phase with the AA/TT period of a typical nucleosome positioning sequence are less likely to be deaminated.
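The decomposition in B and C (a period average, then the residual after subtracting it) can be illustrated with a short sketch. The function names, the centered window, and the truncated averaging at the profile edges are assumptions made for illustration; the legend specifies only that the smoothing averages over one period of 10 nucleotides.

```python
def period_average(profile, period=10):
    """Moving average over `period` consecutive positions.

    The window is roughly centered on each position and truncated at the
    ends of the profile (edge handling is an assumption of this sketch).
    """
    n = len(profile)
    half = period // 2
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + period)
        window = profile[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def decompose(profile, period=10):
    """Split a log-odds profile into a smooth trend and a periodic residual."""
    trend = period_average(profile, period)
    residual = [x - t for x, t in zip(profile, trend)]
    return trend, residual
```

For a profile that is the sum of a slowly varying trend and a 10-nucleotide oscillation, `trend` would recover the slow component (the analog of B) and `residual` the oscillation (the analog of C).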





SI Figure 8

Fig. 8. (A) Distribution of M-scores. Shown are the distributions of M-scores for CpGs inside (Left) and outside (Right) of CpG islands. The M-score forms a continuum that generalizes the current distinction between CpG islands and non-CpG islands and provides finer resolution for correctly modeling CpG divergence rates.





SI Figure 9

Fig. 9. (A) Correlation of M-scores and HEP methylation data. Shown are Spearman correlations between the M-score of each CpG and its methylation level in the HEP data set. The correlation was computed separately for each of the tissues used in the HEP study. (B and C) The M-score outperforms G+C content in predicting methylation levels. Shown is the average HEP methylation for CpGs with specific M-score values (x axis), when the analysis is restricted to CpGs in regions with G+C content between 55% and 60%. As shown in C, G+C content in this range is a very poor predictor of methylation levels.
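For readers unfamiliar with the statistic in A, a Spearman correlation is the Pearson correlation of the ranks of the two variables. The sketch below is a generic, standard-library illustration with tied values given their average rank; the function names are hypothetical and it is not taken from the original analysis.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, e.g. of M-scores vs. methylation levels."""
    return pearson(average_ranks(x), average_ranks(y))
```

In the setting of the figure, `x` would hold the per-CpG M-scores and `y` the methylation levels measured in one tissue, with one call per tissue.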





SI Figure 10

Fig. 10. Lower average heterozygosity for chimp-conserved CpG SNPs. Shown are the cumulative distributions of average heterozygosity for 89,522 human CpG SNPs that were conserved in chimp and 55,098 CpG SNPs that were diverged in chimp. Conserved CpGs show much lower average heterozygosity (P < 10^-300, Kolmogorov-Smirnov test), suggesting that a significant portion of human CpGs are evolving under some selective pressure.
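The comparison of two cumulative distributions in this legend rests on the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between the two empirical cumulative distributions. The following is a minimal standard-library sketch of that statistic only; the function name is hypothetical and the P value calculation is not shown.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Both empirical CDFs are step functions that change only at observed
    values, so the maximum gap can be found by checking every data point.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Here `sample_a` and `sample_b` would be the average heterozygosity values of the conserved and diverged CpG SNPs, respectively.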





SI Figure 11

Fig. 11. Global distribution of COCAD scores. Shown is the distribution of COCAD scores for intervals centered on all intergenic or intronic CpGs. The distribution shows that CpG divergence in 95% of the genome is predicted by our context model and directs further analysis toward hyperconserved (and, to a lesser extent, hyperdiverged) regions.





SI Figure 12

Fig. 12. Non-CpG divergence in and around HCGDs-Suz12 domains. For each PRC2 domain overlapping an HCGD, the divergence rate in non-CpG, noncoding loci was computed. This divergence rate was compared with that of the regions immediately flanking the domain. Shown is the distribution of divergence rate differences, indicating that apart from CpGs, HCGDs-PRC2 domains are not more conserved than their surroundings.





SI Figure 13

Fig. 13. Suz12 binding for CpG islands inside and outside of conserved CpG domains. Shown are distributions of average Suz12-binding ratios for CpG islands that are part of a conserved CpG domain (blue) or away from any conserved CpG domain (gray). The marked difference (KS = 0.59, P < 10^-81) indicates that although many CpG islands are part of HCGDs and are bound by PRC2, many others seem unrelated to these mechanisms.





SI Figure 14

Fig. 14. Relative decrease in CpG C-to-T mutations at HCGDs is correlated with the CpG's M-score. Shown are data for CpGs in HCGDs. CpGs were grouped into bins by M-score and regional mutation rate. For each bin, the graph depicts the M-score (x axis) and the ratio of the observed to the predicted fraction of chimp TpG/CpA aligned to the human CpG (y axis). For CpGs with low M-scores, the observed/expected ratio is close to 1, whereas for higher M-scores the ratio decreases to around 0.35. One possible explanation for the data is that slow divergence of CpGs at HCGDs is driven by low levels of germ-line methylation. At low M-scores, methylation is low a priori and the contribution of m5C deamination to the predicted mutation rate is marginal; therefore, the decrease in methylation density in HCGDs has a small effect on overall CpG divergence. For high M-scores, m5C deamination dominates all other mutations, and the effect of lower germ-line methylation at HCGD loci is significant.
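The binned observed/expected ratio described in this legend can be sketched as below. This is a simplified illustration under stated assumptions: CpGs are binned by M-score only (the additional grouping by regional mutation rate is omitted), and the record layout, bin edges, and function name are hypothetical rather than taken from the original analysis.

```python
from bisect import bisect_right


def observed_expected_ratio(records, edges):
    """Observed / predicted divergence ratio per M-score bin.

    records: iterable of (m_score, diverged, predicted_prob), where
             `diverged` is True if the aligned chimp site is TpG or CpA and
             `predicted_prob` is the model's predicted divergence probability.
    edges:   sorted bin boundaries; bins are half-open [edges[k], edges[k+1]).
    Returns one ratio per bin, or None for bins with no predicted divergence.
    """
    n_bins = len(edges) - 1
    observed = [0] * n_bins
    expected = [0.0] * n_bins
    for m_score, diverged, predicted in records:
        b = bisect_right(edges, m_score) - 1
        if 0 <= b < n_bins:  # scores outside the edges are ignored
            observed[b] += 1 if diverged else 0
            expected[b] += predicted
    return [o / e if e > 0 else None for o, e in zip(observed, expected)]
```

Because both fractions share the same denominator (the number of CpGs in the bin), the ratio of observed to predicted fractions reduces to the observed count divided by the sum of predicted probabilities.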

This Article

PNAS March 27, 2007 vol. 104 no. 13 5521-5526