Bayesian inference of chromatin structure ensembles from population-averaged contact data
- aStatistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
- bDepartment of NMR-based Structural Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
- cUnité de Bioinformatique Structurale, Institut Pasteur, 75015 Paris, France;
- dMicroscopic Image Analysis Group, Jena University Hospital, 07743 Jena, Germany
See allHide authors and affiliations
Edited by Ken A. Dill, Stony Brook University, Stony Brook, NY, and approved February 24, 2020 (received for review June 18, 2019)

Significance
High-throughput chromosome conformation capture (5C, Hi-C) experiments deliver fascinating insights into 3D genome organization. They are typically performed on a large population of cells and therefore yield only average numbers of genomic contacts. Nevertheless, population Hi-C data are often interpreted in terms of a single genomic structure, which ignores all cell-to-cell variability. We propose a probabilistic, statistically rigorous method to infer chromatin structure ensembles from population Hi-C and similar data that takes the ensemble nature of the data explicitly into account and allows us to infer the number of structures required to explain the data.
Abstract
Mounting experimental evidence suggests a role for the spatial organization of chromatin in crucial processes of the cell nucleus such as transcription regulation. Chromosome conformation capture techniques allow us to characterize chromatin structure by mapping contacts between chromosomal loci on a genome-wide scale. The most widespread modality is to measure contact frequencies averaged over a population of cells. Single-cell variants exist, but suffer from low contact numbers and have not yet gained the same resolution as population methods. While intriguing biological insights have already been garnered from ensemble-averaged data, information about three-dimensional (3D) genome organization in the underlying individual cells remains largely obscured because the contact maps show only an average over a huge population of cells. Moreover, computational methods for structure modeling of chromatin have mostly focused on fitting a single consensus structure, thereby ignoring any cell-to-cell variability in the model itself. Here, we propose a fully Bayesian method to infer ensembles of chromatin structures and to determine the optimal number of states in a principled, objective way. We illustrate our approach on simulated data and compute multistate models of chromatin from chromosome conformation capture carbon copy (5C) data. Comparison with independent data suggests that the inferred ensembles represent the underlying sample population faithfully. Harnessing the rich information contained in multistate models, we investigate cell-to-cell variability of chromatin organization into topologically associating domains, thus highlighting the ability of our approach to deliver insights into chromatin organization of great biological relevance.
In recent years, methods based on chromosome conformation capture (3C) (1) have emerged as a powerful tool to investigate the three-dimensional (3D) organization of genomes on previously inaccessible length scales. Due to experimental advances and decreasing sequencing costs, genome-wide 3C methods such as high-throughput chromosome conformation capture (Hi-C) and Hi-C variants (3, 4) have yielded contact maps with resolutions of up to 1 kb. High-resolution contact maps have provided fascinating insights into the organization of chromatin domains and compartments of various sizes and their role in gene regulation (2, 4⇓⇓–7).
Although Hi-C can be performed in single cells (8⇓⇓⇓–12), much richer data are usually obtained by analyzing populations of cells at the price of a more difficult interpretation of the data: Contact maps obtained in population Hi-C experiments show an average over millions of cells. Given that single-cell Hi-C experiments revealed significant structural variability among cell nuclei (8), it remains difficult to assess the information content and degree of structural heterogeneity reflected by population Hi-C data. A prime example highlighting this problem is topologically associating domains (TADs). TADs (see ref. 13 for a recent review) are linear segments of chromatin that span between 0.1 and 1 Mb. TADs have first been identified in Hi-C (5, 7) and chromosome conformation capture carbon copy (5C) (6) matrices where they manifest themselves as square-shaped blocks along the diagonal that show an elevated number of internal contacts. While TADs have emerged as fundamental units of gene regulation in many species (13), their boundary regions are often not clearly defined and it is uncertain to which extent TADs exist in individual cells rather than only as an averaging effect in cell populations.
Much effort has gone into the development of chromatin structure determination approaches, in which a single consensus structure is calculated (14). In light of recent results on cell-to-cell variability and considering experimental limitations and biases, consensus structure approaches are fundamentally limited, if not flawed (15). Several approaches taking into account the heterogeneity of cell populations have been explored. Matrix deconvolution has been used to unmix contact frequency matrices without explicitly modeling 3D structures (16, 17). While some structural information is already apparent in contact matrices, it is useful to find an approximate 3D representation giving rise to the data. For example, a 3D chromosome model allows the detection of three-way interactions not accessible in 3C data (18). Furthermore, data from other sources such as imaging can be included in the modeling process. A second approach is thus to calculate an ensemble of structures that reflect the structural heterogeneity of chromosomes more realistically than consensus structure approaches and reproduce the contact map only on average. This can be achieved in two different ways: First, one can optimize a set of structures generated from a polymer model in the presence of restraints enforcing that they reproduce the experimental data on average (3, 19⇓–21). A second possibility is to adjust the parameters of a polymer model such that by simulating the polymer model an ensemble of structures is obtained that reproduces the experimental data on average (22⇓–24).
In this article, we extend the scope of our previous work on Bayesian chromosome structure inference from single-cell Hi-C data (25) to population-averaged contact data, combining the explicit modeling of chromatin structure populations with a rigorous probabilistic interpretation of the data. We leverage the full power of Bayesian inference to compute multistate models of chromosome structures, which together reproduce the population contact data.
Results
Bayesian Inference of Chromatin Structure Ensembles.
Our approach is based on the inferential structure determination (ISD) (ref. 26 and SI Appendix) framework. ISD infers a structural model x along with unknown modeling parameters α from experimental data D and background information I. The inference problem is solved by sampling from the joint posterior distribution for x and α, conditioned on D and I, which can be expressed using Bayes’ theorem as
To accommodate the fact that most 3C experiments are performed on populations of molecules with possibly very different conformations due to cell-to-cell heterogeneity and dynamics, we extend the scope of the ISD approach to the inference of multistate models
At the center of a Bayesian approach is a generative model for the data
Main components of the proposed chromatin structure inference approach. (A) Forward model to calculate idealized contact frequency matrices from structures that are part of a multistate model. The contact matrices arising from each structure are summed together and multiplied with a scaling factor to yield the back-calculated contact frequency matrix on the right. (B) Bayesian model comparison illustrated for a polynomial curve-fitting example. Shown are the evidence (blue) and the residual mismatch between model and data (red) for polynomials of increasing degree. Data points and best-fitting models are shown below the x axis.
Proper modeling of prior information about model parameters is an essential ingredient of any Bayesian inference approach. We model the physical interactions between beads within a single structure
Bayesian Choice of the Size of the Structure Ensemble.
The number of states n is a crucial parameter in all modeling approaches that compute ensembles of chromatin structures (3, 19, 20). Existing multistate modeling approaches choose n in an ad hoc manner. For example, in the Genomic Organization Reconstructor Based on Conformational Energy and Manifold Learning (GEM) software (20) n defaults to 4 and differs largely from
From a Bayesian perspective, the number of states n is a model parameter that should be inferred from the data. Depending on the true heterogeneity of the chromosome structures, n will adopt smaller or larger values. If n is too small, we underfit the contact data, resulting in highly strained structure models. If n is too large, the structural states are only loosely defined and potentially overfit the contact data. Therefore, a simple goodness-of-fit criterion is not sufficient to select n. We have to balance goodness of fit against model complexity. Fig. 1B illustrates this point for polynomial curve fitting. The degree of the polynomial determines the flexibility of the fitting function and therefore corresponds to the number of states n. For data generated from a parabola, the goodness of fit reaches an optimum if we use polynomials of degree two and higher, eventually resulting in overfitting: The fitted curve closely follows each data point.
A unique feature of Bayesian data analysis is that it offers an objective, albeit often computationally challenging framework for model comparison (27). Bayesian model comparison ranks alternative n-state models according to their model evidences
Proof of Concept: Reconstructing Coarse-Grained Protein Structures from Simulated Average Contacts.
As a proof of concept, we first apply our inference approach to contact data simulated from two protein structures. We consider beads-on-a-string models of two protein domains, the B1 immunoglobin-binding domain of streptococcal protein G (Protein Data Bank [PDB] identifier 1PGA) and the SH3 domain of human Fyn (PDB identifier 1SHF). Only positions of
Reconstruction of protein structures from simulated data. (A and B) Reference structures of the SH3 domain (A) and the B1 domain of protein G (B). (C and D) “Sausage representation” of reconstructed folds (
Fig. 2 D and E shows the two reconstructed states from the
Inferring Chromatin Structure Populations from 5C Data.
We now apply our extended ISD approach to a 5C (28) dataset covering the X-inactivation center in mouse embryonic stem cells (mESCs) (6). The available contact counts are already corrected for common biases and we use them as is. We restrict our analysis to contacts between primer sites within the two consecutive TADs harboring the Tsix and Xist promoters, respectively, and their boundary region. Similar to previous work on the same data (22), we model the corresponding
Determination of the optimal number of states by Bayesian model comparison. (A) Model evidences and average likelihoods for various numbers of states at 15-kb resolution. Inset shows a close-up of the log model evidence for models using more than 10 states. (B) Scatter plot showing the correlation between back-calculated and experimental data at 3-kb resolution. (C) Representative sample from an
In the following, we analyze high-resolution chromatin structure ensembles obtained from a posterior simulation for
Visual inspection of sample populations (Fig. 3C) reveals a large structural variability which cannot be captured by only a few, let alone a single structure. This heterogeneity is already apparent in the low-resolution models (SI Appendix, Fig. S5).
Bayesian Multistate Models Agree with Independent Experimental Data.
Before further analyzing our models, we validate the inferred multistate chromatin models with previously measured fluorescence in situ hybridization (FISH) (29, 30) data on loci in the Tsix TAD (22). In FISH experiments, fluorescent probes are designed that bind selectively to parts of a nucleic acid with high sequence complementarity. Imaging these probes thus allows the localization of a single locus within a cell. Fig. 4A shows that experimental distances are in excellent agreement with distances obtained from the simulated structures, confirming that our approach yields structures matching data from experimental cell populations not only on average, but also correctly reproducing the spread of the population. Further validation of our modeling approach comes from considering data obtained from female mESCs before and 48 h after onset of differentiation (6). In this cell line, transcription of the Xist TAD is up-regulated while transcription levels of the Tsix TAD remain the same (6). After again choosing
(A–C) Validation of multistate chromatin models inferred from 5C data on male (A) and female (B and C) mESCs. (A) Pairwise distances between several loci in the Tsix TAD as measured by DNA FISH and in inferred multistate models. Names of genes overlapping the probes are shown. (B and C) Distribution of the radius of gyration for data obtained before onset of differentiation (B) and 48 h later (C). Dashed lines indicate sample means.
Chromatin Structure Ensembles Reveal Cell-to-Cell Heterogeneity and TAD Intermingling.
We now consider the heterogeneity within explicit structure populations. Fig. 5A shows that the radius of gyration varies within a certain range within a structure population, but population-wide averages are comparable across posterior samples. This indicates that our MCMC algorithm converged and draws structure populations with similar characteristics.
Heterogeneity of chromatin structure ensembles. (A) Radius of gyration across representative structure populations (gray circles) and averaged over structure populations (black line) as a function of simulation progress. (B) Distribution of TAD boundaries detected from contact matrices obtained from single-structure population members (dark gray) and back-calculated from full populations (light gray). CTCF binding sites are shown as vertical black lines. Bin populations are on a logarithmic scale. (C) Overlap between densities fitted to both TADs as measured by the correlation coefficient between the mass densities obtained by applying a Gaussian kernel with a bandwidth of one bead diameter. The green reference distribution (“no data”) was obtained by generating models from the prior with an additional radius of gyration bias that ensures that the level of compaction of the prior ensemble matches the degree of compaction observed in the posterior ensemble (“full data”). The orange histogram (“no inter contacts”) was obtained with data lacking inter-TAD contacts.
A unique advantage of analyzing the 5C data with a multistate model is that it allows us to gain insight into the question of to which extent TAD formation is a feature of single structures or whether it is an averaging phenomenon. To study this question, we use a simple heuristic to determine the position of the TAD boundary from a structure by finding the best split of the model contact matrix such that the contact density within both parts achieves the highest value. Fig. 5B suggests that, assuming the formation of two TADs with unknown and possibly variable position, the TAD boundary position determined from single-structure contact matrices can vary significantly within a structure population. If the same TAD boundary detection heuristic is applied to contact frequency matrices obtained by summing single population members’ contact matrices, however, the distribution of TAD boundary positions is sharply peaked at a value of 97. Visual inspection of the 5C contact matrix suggests that a TAD boundary at bead 97 is more appropriate than the value of 109 reported in earlier work using the same data binning (22) (SI Appendix, Fig. S6). We also compare the inferred TAD boundary positions with the location of CTCF binding sites (31). CTCF is a key player in TAD organization and often demarcates TAD boundaries (32). Several CTCF binding sites are located within the range of inferred TAD boundary positions as measured from back-calculated population matrices. We also observe that several CTCF binding sites coincide with distinct peaks in the TAD boundary position distribution derived from single-contact matrices. These results suggest that TAD boundary positions derived both from structure populations and from single structures are indeed biologically relevant results.
To investigate possible TAD intermingling, we convert the structure of each TAD into a mass density by using a Gaussian blur kernel and measure the correlation coefficient between the two densities for each structure (see SI Appendix, Fig. S7 for examples). Fig. 5C shows that, compared to structures calculated from data lacking inter-TAD contacts and structures calculated without any data, considerable TAD intermingling occurs, indicated by a higher overlap between the TAD densities.
Comparison with Other Multistate Modeling Approaches.
We compare the structures estimated by our extended ISD approach to the results of two previous approaches, which also model 3C data by multistate models. We modified the publicly available code for the Population-Based Genome Structure Modeling Package (PGS) (33), the software used by Alber and coworkers (19), to work with the genomic region considered in this work. Calculations for three different numbers of states show good agreement of the resulting structure populations with the aforementioned FISH data (22) for some loci pairs, while for others the distance distributions differ significantly (SI Appendix, Fig. S8). On average, the structure populations calculated with PGS show similar radii of gyration to those of our structures, but the radius of gyration distributions of the PGS structure populations are much broader. It is noteworthy that for structure populations calculated using PGS, neither FISH loci distance nor radius of gyration histograms vary significantly with the number of states, which span two orders of magnitude. Distances between most FISH loci are significantly different in both ISD and PGS structure populations compared to a simulation of the null model without any data given by the polymer model used in our ISD approach (SI Appendix, Fig. S8). We also calculated structure populations using GEM (20), but found that, irrespective of the number of states, which we varied between 4 and 100, there is virtually no heterogeneity in the populations (SI Appendix, Figs. S9–S11).
Discussion
The calculation of chromatin structure ensembles is reminiscent of the determination of protein structure ensembles. The most common approaches for Bayesian modeling of protein ensembles are reweighting and replica averaging (34). Reweighting methods assign weights to members of a structure library (35⇓⇓–38). Replica-averaging techniques (39⇓–41) infer a finite number of conformers simultaneously from the experimental data. Our approach can be seen as a replica-averaging strategy. However, whereas in replica averaging the number of states is held fixed, our goal is to determine the number of structures directly from the data. Therefore, our method can be viewed as a rigorous minimal-ensemble method based on the model evidence.
Using both simulated and 5C data, we showed that the proposed Bayesian approach to chromatin structure inference not only is able to construct multistate models from contact data that agree well with independent biological observations, but also allows us to address crucial questions about the cell-to-cell variability of chromatin architecture. We demonstrated this by using multistate models to uncover significant heterogeneity in the organization of two consecutive TADs, with respect to the location of the boundary region separating the TADs as well as the overlap between them. This structural heterogeneity of TAD organization agrees with previous results on the same dataset obtained by simulating from a polymer model, whose parameters were optimized such that the input data could be reproduced (22). We note that, in comparison, our structures are somewhat different: While in our models, structures are rather compact, most structures shown in ref. 22 seem very extended with an elongated region. Furthermore, in the models inferred with ISD, both TADs tend to have larger radii of gyration (SI Appendix, Fig. S12). Different results on structural diversity of intra-TAD organization have been obtained in an earlier modeling-based attempt based on a coarse-mixture model (42), where it was found that the 3D structure of
Our method comes with an objective measure for the number of states required to accurately model population-based data, a feature previous approaches (3, 19, 20) are missing. The number of states used in previous work varies from very low numbers of states (
Our work opens up several avenues of further research. A more efficient implementation using molecular modeling packages such as OpenMM (47), possibly leveraging the power of graphics processing units (GPUs), would increase the feasible range of both resolutions and system size. For example, we estimate that modeling the mouse X chromosome at a resolution of 40 kb (4,250 beads) involves a 100- to 200-fold increase in computation time. In the current implementation, the time required to evaluate the likelihood scales quadratically with the number of beads. With an implementation improving this bottleneck, computation of full chromosome structures should be feasible. More detailed structural prior information encoding, e.g., a priori contact probabilities between loci based on epigenetic marks (48, 49) and more detailed forward and error models for the data would likely result in more accurate 3D models. For example, a fully probabilistic treatment of biases in Hi-C data could be included in the generative model so that raw, unnormalized Hi-C data can be used as an input. Eventually, with these improvements, the model comparison approach implemented in this work could rise from merely helping to set an important modeling parameter to a measure of cell-to-cell variability with the potential to answer a plethora of important biological questions. Finally, we note that our approach is not limited to 3C data, but could also be useful for other kinds of contact data in which structural heterogeneity can be suspected, for example cross-linking studies of protein complexes.
Materials and Methods
Statistical Model for 5C Data.
In our generative model, we compute a contact matrix with entries
Bayesian Model Comparison.
The posterior probability of a particular choice of n is
Code Availability.
We implemented our method in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).
Acknowledgments
We thank Luca Giorgetti for sharing DNA-FISH data. S.C. and M.N. acknowledge funding from the Agence National de la Recherche (ANR-11-MONU-0020). S.C. thanks Christian Griesinger for funding and support. M.H. acknowledges funding by the German Research Foundation under Grant SFB 860, TP B09 as well as funding by the Carl Zeiss Foundation.
Footnotes
- ↵1To whom correspondence may be addressed. Email: michael.habeck{at}uni-jena.de.
Author contributions: S.C., M.N., and M.H. designed research; S.C. and M.H. performed research; S.C. contributed new reagents/analytic tools; S.C. and M.H. analyzed data; and S.C., M.N., and M.H. wrote the paper.
The authors declare no competing interest.
This article is a PNAS Direct Submission.
Data deposition: Code for our method has been deposited in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).
This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1910364117/-/DCSupplemental.
Published under the PNAS license.
References
- ↵
- J. Dekker,
- K. Rippe,
- M. Dekker,
- N. Kleckner
- ↵
- E. Lieberman-Aiden et al.
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- Q. Szabo,
- F. Bantignies,
- G. Cavalli
- ↵
- ↵
- ↵
- ↵
- E. Sefer,
- G. Duggal,
- C. Kingsford
- ↵
- ↵
- H. Tjong et al.
- ↵
- ↵
- S. Wang,
- J. Xu,
- J. Zeng
- ↵
- ↵
- B. Zhang,
- P. G. Wolynes
- ↵
- M. Di Pierro,
- B. Zhang,
- E. L. Aiden,
- P. G. Wolynes,
- J. N. Onuchic
- ↵
- ↵
- W. Rieping,
- M. Habeck,
- M. Nilges
- ↵
- ↵
- J. Dostie et al.
- ↵
- P. R. Langer-Safer,
- M. Levine,
- D. C. Ward
- ↵
- C. Cui,
- W. Shu,
- P. Li
- ↵
- ↵
- R. Ghirlando,
- G. Felsenfeld
- ↵
- N. Hua et al.
- ↵
- ↵
- ↵
- ↵
- ↵
- W. Potrzebowski,
- J. Trewhella,
- I. Andre
- ↵
- ↵
- ↵
- M. Bonomi,
- C. Camilloni,
- A. Cavalli,
- M. Vendruscolo
- ↵
- ↵
- B. Bintu et al.
- ↵
- ↵
- ↵
- ↵
- ↵
- M. Barbieri et al.
- ↵
- M. Di Pierro,
- R. R. Cheng,
- E. Lieberman Aiden,
- P. G. Wolynes,
- J. N. Onuchic
- ↵
- ↵
- N. D. Lawrence,
- M. Girolami
- M. Habeck
Citation Manager Formats
Article Classifications
- Biological Sciences
- Biophysics and Computational Biology