Skip to main content

Main menu

  • Home
  • Articles
    • Current
    • Special Feature Articles - Most Recent
    • Special Features
    • Colloquia
    • Collected Articles
    • PNAS Classics
    • List of Issues
  • Front Matter
    • Front Matter Portal
    • Journal Club
  • News
    • For the Press
    • This Week In PNAS
    • PNAS in the News
  • Podcasts
  • Authors
    • Information for Authors
    • Editorial and Journal Policies
    • Submission Procedures
    • Fees and Licenses
  • Submit
  • Submit
  • About
    • Editorial Board
    • PNAS Staff
    • FAQ
    • Accessibility Statement
    • Rights and Permissions
    • Site Map
  • Contact
  • Journal Club
  • Subscribe
    • Subscription Rates
    • Subscriptions FAQ
    • Open Access
    • Recommend PNAS to Your Librarian

User menu

  • Log in
  • My Cart

Search

  • Advanced search
Home
Home
  • Log in
  • My Cart

Advanced Search

  • Home
  • Articles
    • Current
    • Special Feature Articles - Most Recent
    • Special Features
    • Colloquia
    • Collected Articles
    • PNAS Classics
    • List of Issues
  • Front Matter
    • Front Matter Portal
    • Journal Club
  • News
    • For the Press
    • This Week In PNAS
    • PNAS in the News
  • Podcasts
  • Authors
    • Information for Authors
    • Editorial and Journal Policies
    • Submission Procedures
    • Fees and Licenses
  • Submit
Research Article

Bayesian inference of chromatin structure ensembles from population-averaged contact data

View ORCID ProfileSimeon Carstens, Michael Nilges, and View ORCID ProfileMichael Habeck
  1. aStatistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
  2. bDepartment of NMR-based Structural Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
  3. cUnité de Bioinformatique Structurale, Institut Pasteur, 75015 Paris, France;
  4. dMicroscopic Image Analysis Group, Jena University Hospital, 07743 Jena, Germany

See allHide authors and affiliations

PNAS April 7, 2020 117 (14) 7824-7830; first published March 19, 2020; https://doi.org/10.1073/pnas.1910364117
Simeon Carstens
aStatistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
bDepartment of NMR-based Structural Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
cUnité de Bioinformatique Structurale, Institut Pasteur, 75015 Paris, France;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Simeon Carstens
Michael Nilges
cUnité de Bioinformatique Structurale, Institut Pasteur, 75015 Paris, France;
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Michael Habeck
aStatistical Inverse Problems in Biophysics, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany;
dMicroscopic Image Analysis Group, Jena University Hospital, 07743 Jena, Germany
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Michael Habeck
  • For correspondence: michael.habeck@uni-jena.de
  1. Edited by Ken A. Dill, Stony Brook University, Stony Brook, NY, and approved February 24, 2020 (received for review June 18, 2019)

  • Article
  • Figures & SI
  • Info & Metrics
  • PDF
Loading

Significance

High-throughput chromosome conformation capture (5C, Hi-C) experiments deliver fascinating insights into 3D genome organization. They are typically performed on a large population of cells and therefore yield only average numbers of genomic contacts. Nevertheless, population Hi-C data are often interpreted in terms of a single genomic structure, which ignores all cell-to-cell variability. We propose a probabilistic, statistically rigorous method to infer chromatin structure ensembles from population Hi-C and similar data that takes the ensemble nature of the data explicitly into account and allows us to infer the number of structures required to explain the data.

Abstract

Mounting experimental evidence suggests a role for the spatial organization of chromatin in crucial processes of the cell nucleus such as transcription regulation. Chromosome conformation capture techniques allow us to characterize chromatin structure by mapping contacts between chromosomal loci on a genome-wide scale. The most widespread modality is to measure contact frequencies averaged over a population of cells. Single-cell variants exist, but suffer from low contact numbers and have not yet gained the same resolution as population methods. While intriguing biological insights have already been garnered from ensemble-averaged data, information about three-dimensional (3D) genome organization in the underlying individual cells remains largely obscured because the contact maps show only an average over a huge population of cells. Moreover, computational methods for structure modeling of chromatin have mostly focused on fitting a single consensus structure, thereby ignoring any cell-to-cell variability in the model itself. Here, we propose a fully Bayesian method to infer ensembles of chromatin structures and to determine the optimal number of states in a principled, objective way. We illustrate our approach on simulated data and compute multistate models of chromatin from chromosome conformation capture carbon copy (5C) data. Comparison with independent data suggests that the inferred ensembles represent the underlying sample population faithfully. Harnessing the rich information contained in multistate models, we investigate cell-to-cell variability of chromatin organization into topologically associating domains, thus highlighting the ability of our approach to deliver insights into chromatin organization of great biological relevance.

  • chromosome conformation capture
  • genome structure modeling
  • Bayesian inference
  • model comparison

In recent years, methods based on chromosome conformation capture (3C) (1) have emerged as a powerful tool to investigate the three-dimensional (3D) organization of genomes on previously inaccessible length scales. Due to experimental advances and decreasing sequencing costs, genome-wide 3C methods such as high-throughput chromosome conformation capture (Hi-C) and Hi-C variants (3, 4) have yielded contact maps with resolutions of up to 1 kb. High-resolution contact maps have provided fascinating insights into the organization of chromatin domains and compartments of various sizes and their role in gene regulation (2, 4⇓⇓–7).

Although Hi-C can be performed in single cells (8⇓⇓⇓–12), much richer data are usually obtained by analyzing populations of cells at the price of a more difficult interpretation of the data: Contact maps obtained in population Hi-C experiments show an average over millions of cells. Given that single-cell Hi-C experiments revealed significant structural variability among cell nuclei (8), it remains difficult to assess the information content and degree of structural heterogeneity reflected by population Hi-C data. A prime example highlighting this problem is topologically associating domains (TADs). TADs (see ref. 13 for a recent review) are linear segments of chromatin that span between 0.1 and 1 Mb. TADs have first been identified in Hi-C (5, 7) and chromosome conformation capture carbon copy (5C) (6) matrices where they manifest themselves as square-shaped blocks along the diagonal that show an elevated number of internal contacts. While TADs have emerged as fundamental units of gene regulation in many species (13), their boundary regions are often not clearly defined and it is uncertain to which extent TADs exist in individual cells rather than only as an averaging effect in cell populations.

Much effort has gone into the development of chromatin structure determination approaches, in which a single consensus structure is calculated (14). In light of recent results on cell-to-cell variability and considering experimental limitations and biases, consensus structure approaches are fundamentally limited, if not flawed (15). Several approaches taking into account the heterogeneity of cell populations have been explored. Matrix deconvolution has been used to unmix contact frequency matrices without explicitly modeling 3D structures (16, 17). While some structural information is already apparent in contact matrices, it is useful to find an approximate 3D representation giving rise to the data. For example, a 3D chromosome model allows the detection of three-way interactions not accessible in 3C data (18). Furthermore, data from other sources such as imaging can be included in the modeling process. A second approach is thus to calculate an ensemble of structures that reflect the structural heterogeneity of chromosomes more realistically than consensus structure approaches and reproduce the contact map only on average. This can be achieved in two different ways: First, one can optimize a set of structures generated from a polymer model in the presence of restraints enforcing that they reproduce the experimental data on average (3, 19⇓–21). A second possibility is to adjust the parameters of a polymer model such that by simulating the polymer model an ensemble of structures is obtained that reproduces the experimental data on average (22⇓–24).

In this article, we extend the scope of our previous work on Bayesian chromosome structure inference from single-cell Hi-C data (25) to population-averaged contact data, combining the explicit modeling of chromatin structure populations with a rigorous probabilistic interpretation of the data. We leverage the full power of Bayesian inference to compute multistate models of chromosome structures, which together reproduce the population contact data.

Results

Bayesian Inference of Chromatin Structure Ensembles.

Our approach is based on the inferential structure determination (ISD) (ref. 26 and SI Appendix) framework. ISD infers a structural model x along with unknown modeling parameters α from experimental data D and background information I. The inference problem is solved by sampling from the joint posterior distribution for x and α, conditioned on D and I, which can be expressed using Bayes’ theorem asPr(x,α|D,I)=Pr(D|x,α,I)Pr(x,α|I)Pr(D|I).[1]Here it is assumed that a single structure x can explain the data D, as is the case for, e.g., proteins with a stable and unique fold. Sampling from Pr(x,α|D,I) results in a statistically well-defined ensemble of structures with associated estimates of nuisance parameters α, each being assigned a probability given by the posterior distribution.

To accommodate the fact that most 3C experiments are performed on populations of molecules with possibly very different conformations due to cell-to-cell heterogeneity and dynamics, we extend the scope of the ISD approach to the inference of multistate models X=(x1,…,xn) consisting of n structures xk, each represented by a polymer model. The central quantity in this extended ISD framework is the posterior distributionPr(X,α|n,D,I)=Pr(D|X,α,n,I)Pr(X,α|n,I)Pr(D|n,I),[2]sampling from which results in a hyperensemble, that is, a representative set of multistate models Xi with associated nuisance parameters αi, in which each pair (Xi,αi) has a probability given by Eq. 2. In the case of a single state (n=1), we recover the original ISD formulation and infer a consensus structure.

At the center of a Bayesian approach is a generative model for the data Pr(D|X,α,n,I) given a set X of n structures. To reflect the data density appropriately and allow for sufficiently fast computation, we represent a genomic region by a beads-on-a-string model and bin the input contact matrix such that each bin corresponds to one bead. The (coarse-grained) data D={fij} then comprise, for a given pair of beads (i,j), all experimentally accessible contact frequencies between loci mapping to the bins represented by beads i and j. Our forward model computes a contact matrix for each of the n states in X and averages these to obtain an idealized contact frequency matrix f^ij (Fig. 1A). We model deviations from idealized contact frequencies due to experimental noise, forward model inaccuracy, and, in the case of Hi-C data, data normalization with Poisson distributions whose rates are given by the idealized count data.

Fig. 1.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 1.

Main components of the proposed chromatin structure inference approach. (A) Forward model to calculate idealized contact frequency matrices from structures that are part of a multistate model. The contact matrices arising from each structure are summed together and multiplied with a scaling factor to yield the back-calculated contact frequency matrix on the right. (B) Bayesian model comparison illustrated for a polynomial curve-fitting example. Shown are the evidence (blue) and the residual mismatch between model and data (red) for polynomials of increasing degree. Data points and best-fitting models are shown below the x axis.

Proper modeling of prior information about model parameters is an essential ingredient of any Bayesian inference approach. We model the physical interactions between beads within a single structure xk similar to our model for single-cell data (25). Further details on our choice of prior distributions are explained in SI Appendix.

Bayesian Choice of the Size of the Structure Ensemble.

The number of states n is a crucial parameter in all modeling approaches that compute ensembles of chromatin structures (3, 19, 20). Existing multistate modeling approaches choose n in an ad hoc manner. For example, in the Genomic Organization Reconstructor Based on Conformational Energy and Manifold Learning (GEM) software (20) n defaults to 4 and differs largely from n=10,000 used by Alber and coworkers (19).

From a Bayesian perspective, the number of states n is a model parameter that should be inferred from the data. Depending on the true heterogeneity of the chromosome structures, n will adopt smaller or larger values. If n is too small, we underfit the contact data, resulting in highly strained structure models. If n is too large, the structural states are only loosely defined and potentially overfit the contact data. Therefore, a simple goodness-of-fit criterion is not sufficient to select n. We have to balance goodness of fit against model complexity. Fig. 1B illustrates this point for polynomial curve fitting. The degree of the polynomial determines the flexibility of the fitting function and therefore corresponds to the number of states n. For data generated from a parabola, the goodness of fit reaches an optimum if we use polynomials of degree two and higher, eventually resulting in overfitting: The fitted curve closely follows each data point.

A unique feature of Bayesian data analysis is that it offers an objective, albeit often computationally challenging framework for model comparison (27). Bayesian model comparison ranks alternative n-state models according to their model evidences Pr(D|n,I); the ratio Pr(D|n1,I)/Pr(D|n2,I) is the Bayes factor for preferring n1 over n2 states. The model evidence is the goodness of fit with the data obtained when averaging over all structures of the n-state model weighted by the prior. The evidence has a built-in penalty for model complexity. In the curve-fitting example (Fig. 1B), Pr(D|n,I) selects the correct degree of the polynomial (n=2), although polynomials of degree >2 achieve a better fit. This is because with increasing degree the number of curves that do not fit the data also increases, which is reflected by a decrease in the model evidence.

Proof of Concept: Reconstructing Coarse-Grained Protein Structures from Simulated Average Contacts.

As a proof of concept, we first apply our inference approach to contact data simulated from two protein structures. We consider beads-on-a-string models of two protein domains, the B1 immunoglobin-binding domain of streptococcal protein G (Protein Data Bank [PDB] identifier 1PGA) and the SH3 domain of human Fyn (PDB identifier 1SHF). Only positions of Cα atoms were taken into account, and the SH3 domain was truncated by three C-terminal residues to match the length of 1PGA. This results in two very different conformations of a chain of 56 beads (Fig. 2 A and B). We generated simulated contact frequencies and used ISD to infer representative multistate models. We simulated the posterior distributions for n=1,2,3,4,5,10 states. Model comparison shows that the model with n=2 states is strongly preferred, as indicated by Pr(D|n=2,I)/Pr(D|n=1,I)>10307 and Pr(D|n=2,I)/Pr(D|n=3,I)>10160 (Fig. 2C). It is reassuring that while the evidence decreases for n>2, models with an even number of states are preferred over models with a similar, but uneven number of states. The reason for this is that for even n, half of the states adopt the conformation of the B1 domain and the other half adopt the conformation of the SH3 domain, thus having no superfluous parameters (SI Appendix, Fig. S2). For uneven n, on the other hand, there is at least one superfluous state (SI Appendix, Figs. S3 and S4), contributing uninformative parameters which are penalized in the evidence. We find that, for n>2, the average scale factor α decreases monotonically with an increasing number of states (Fig. 2F), in agreement with our intuition that α compensates for the smaller number of simulated states compared to typical numbers of counts in the data.

Fig. 2.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 2.

Reconstruction of protein structures from simulated data. (A and B) Reference structures of the SH3 domain (A) and the B1 domain of protein G (B). (C and D) “Sausage representation” of reconstructed folds (n=2 states). Tube thickness indicates the local precision of bead positions. (E) Model evidences for different numbers of states. (F) Average values of sampled scale factor.

Fig. 2 D and E shows the two reconstructed states from the n=2 simulation which deviate from the ground truth by a root-mean-square deviation (rmsd) of bead positions of (1.46±0.11) Å for the B1 domain and (1.59±0.21) Å for the SH3 domain. ISD thus recovers both conformations with high accuracy.

Inferring Chromatin Structure Populations from 5C Data.

We now apply our extended ISD approach to a 5C (28) dataset covering the X-inactivation center in mouse embryonic stem cells (mESCs) (6). The available contact counts are already corrected for common biases and we use them as is. We restrict our analysis to contacts between primer sites within the two consecutive TADs harboring the Tsix and Xist promoters, respectively, and their boundary region. Similar to previous work on the same data (22), we model the corresponding ∼920-kb region by M beads, each representing a genomic length of ∼(920/M) kb. To keep the computational efforts for sufficient Markov chain Monte Carlo (MCMC) sampling reasonable, we chose a rather coarse resolution with one bead representing 15 kb, resulting in a total of 62 beads. Simulating the ISD posterior distribution for n=1,5,10,15,20,25,30,35,40, and 100 states and calculating model evidences, we find that the multistate model with n=30 states is preferred by the data (Fig. 3A). This result establishes that Bayesian model comparison is indeed able to determine a unique, optimal number of states from population-averaged contact frequencies. Low-resolution models provide only limited biological insight; therefore we increased the number of beads to M=308 and computed multistate models at 3-kb resolution. Our model for the chromatin polymer then coincides with the model chosen in previous work (22) on the same dataset.

Fig. 3.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 3.

Determination of the optimal number of states by Bayesian model comparison. (A) Model evidences and average likelihoods for various numbers of states at 15-kb resolution. Inset shows a close-up of the log model evidence for models using more than 10 states. (B) Scatter plot showing the correlation between back-calculated and experimental data at 3-kb resolution. (C) Representative sample from an n=30 simulation at 3-kb resolution. Tsix TAD is colored red and the Xist TAD is in blue.

In the following, we analyze high-resolution chromatin structure ensembles obtained from a posterior simulation for n=30. Of three simulations with identical parameters (but different, random initial state), we chose the one with the highest mean posterior probability. While the optimal number of states for low- and high-resolution models does not necessarily have to match, for n=30, the back-calculated data reproduce the experimental data very well, with most mismatches stemming from short-range contacts contributing little structural information (Fig. 3B). Determining the optimal number of states at the 3-kb resolution is currently not feasible, because accurate and exhaustive sampling of very high-dimensional posterior distributions is required for calculating good estimates of the evidence. This is outside the scope of our current implementation and this contribution.

Visual inspection of sample populations (Fig. 3C) reveals a large structural variability which cannot be captured by only a few, let alone a single structure. This heterogeneity is already apparent in the low-resolution models (SI Appendix, Fig. S5).

Bayesian Multistate Models Agree with Independent Experimental Data.

Before further analyzing our models, we validate the inferred multistate chromatin models with previously measured fluorescence in situ hybridization (FISH) (29, 30) data on loci in the Tsix TAD (22). In FISH experiments, fluorescent probes are designed that bind selectively to parts of a nucleic acid with high sequence complementarity. Imaging these probes thus allows the localization of a single locus within a cell. Fig. 4A shows that experimental distances are in excellent agreement with distances obtained from the simulated structures, confirming that our approach yields structures matching data from experimental cell populations not only on average, but also correctly reproducing the spread of the population. Further validation of our modeling approach comes from considering data obtained from female mESCs before and 48 h after onset of differentiation (6). In this cell line, transcription of the Xist TAD is up-regulated while transcription levels of the Tsix TAD remain the same (6). After again choosing n=30, we simulated the posterior densities with contact data before and after the onset of differentiation. To monitor structural differences between both populations, we use the radius of gyration which measures the spatial extension of a structure. The distributions of the radii of gyration of both TADs reveal that before differentiation (Fig. 4B), the Xist TAD assumes more densely packed conformations as opposed to after differentiation (Fig. 4C), while the packing of the Tsix TAD stays, on average, the same. This agrees with the notion that to be highly transcribed, chromatin has to be loosely packed to provide easy access for the transcription machinery. We furthermore compare the structures calculated at 15-kb resolution with structures calculated from recent single-cell Hi-C data (10) at the same resolution. While structures calculated with the approach presented in this work are more extended, other general structural features agree reasonably well (SI Appendix).

Fig. 4.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 4.

(A–C) Validation of multistate chromatin models inferred from 5C data on male (A) and female (B and C) mESCs. (A) Pairwise distances between several loci in the Tsix TAD as measured by DNA FISH and in inferred multistate models. Names of genes overlapping the probes are shown. (B and C) Distribution of the radius of gyration for data obtained before onset of differentiation (B) and 48 h later (C). Dashed lines indicate sample means.

Chromatin Structure Ensembles Reveal Cell-to-Cell Heterogeneity and TAD Intermingling.

We now consider the heterogeneity within explicit structure populations. Fig. 5A shows that the radius of gyration varies within a certain range within a structure population, but population-wide averages are comparable across posterior samples. This indicates that our MCMC algorithm converged and draws structure populations with similar characteristics.

Fig. 5.
  • Download figure
  • Open in new tab
  • Download powerpoint
Fig. 5.

Heterogeneity of chromatin structure ensembles. (A) Radius of gyration across representative structure populations (gray circles) and averaged over structure populations (black line) as a function of simulation progress. (B) Distribution of TAD boundaries detected from contact matrices obtained from single-structure population members (dark gray) and back-calculated from full populations (light gray). CTCF binding sites are shown as vertical black lines. Bin populations are on a logarithmic scale. (C) Overlap between densities fitted to both TADs as measured by the correlation coefficient between the mass densities obtained by applying a Gaussian kernel with a bandwidth of one bead diameter. The green reference distribution (“no data”) was obtained by generating models from the prior with an additional radius of gyration bias that ensures that the level of compaction of the prior ensemble matches the degree of compaction observed in the posterior ensemble (“full data”). The orange histogram (“no inter contacts”) was obtained with data lacking inter-TAD contacts.

A unique advantage of analyzing the 5C data with a multistate model is that it allows us to gain insight into the question of to which extent TAD formation is a feature of single structures or whether it is an averaging phenomenon. To study this question, we use a simple heuristic to determine the position of the TAD boundary from a structure by finding the best split of the model contact matrix such that the contact density within both parts achieves the highest value. Fig. 5B suggests that, assuming the formation of two TADs with unknown and possibly variable position, the TAD boundary position determined from single-structure contact matrices can vary significantly within a structure population. If the same TAD boundary detection heuristic is applied to contact frequency matrices obtained by summing single population members’ contact matrices, however, the distribution of TAD boundary positions is sharply peaked at a value of 97. Visual inspection of the 5C contact matrix suggests that a TAD boundary at bead 97 is more appropriate than the value of 109 reported in earlier work using the same data binning (22) (SI Appendix, Fig. S6). We also compare the inferred TAD boundary positions with the location of CTCF binding sites (31). CTCF is a key player in TAD organization and often demarcates TAD boundaries (32). Several CTCF binding sites are located within the range of inferred TAD boundary positions as measured from back-calculated population matrices. We also observe that several CTCF binding sites coincide with distinct peaks in the TAD boundary position distribution derived from single-contact matrices. These results suggest that TAD boundary positions derived both from structure populations and from single structures are indeed biologically relevant results.

To investigate possible TAD intermingling, we convert the structure of each TAD into a mass density by using a Gaussian blur kernel and measure the correlation coefficient between the two densities for each structure (see SI Appendix, Fig. S7 for examples). Fig. 5C shows that, compared to structures calculated from data lacking inter-TAD contacts and structures calculated without any data, considerable TAD intermingling occurs, indicated by a higher overlap between the TAD densities.

Comparison with Other Multistate Modeling Approaches.

We compare the structures estimated by our extended ISD approach to the results of two previous approaches, which also model 3C data by multistate models. We modified the publicly available code for the Population-Based Genome Structure Modeling Package (PGS) (33), the software used by Alber and coworkers (19), to work with the genomic region considered in this work. Calculations for three different numbers of states show good agreement of the resulting structure populations with the aforementioned FISH data (22) for some loci pairs, while for others the distance distributions differ significantly (SI Appendix, Fig. S8). On average, the structure populations calculated with PGS show similar radii of gyration to those of our structures, but the radius of gyration distributions of the PGS structure populations are much broader. It is noteworthy that for structure populations calculated using PGS, neither FISH loci distance nor radius of gyration histograms vary significantly with the number of states, which span two orders of magnitude. Distances between most FISH loci are significantly different in both ISD and PGS structure populations compared to a simulation of the null model without any data given by the polymer model used in our ISD approach (SI Appendix, Fig. S8). We also calculated structure populations using GEM (20), but found that, irrespective of the number of states, which we varied between 4 and 100, there is virtually no heterogeneity in the populations (SI Appendix, Figs. S9–S11).

Discussion

The calculation of chromatin structure ensembles is reminiscent of the determination of protein structure ensembles. The most common approaches for Bayesian modeling of protein ensembles are reweighting and replica averaging (34). Reweighting methods assign weights to members of a structure library (35⇓⇓–38). Replica-averaging techniques (39⇓–41) infer a finite number of conformers simultaneously from the experimental data. Our approach can be seen as a replica-averaging strategy. However, whereas in replica averaging the number of states is held fixed, our goal is to determine the number of structures directly from the data. Therefore, our method can be viewed as a rigorous minimal-ensemble method based on the model evidence.

Using both simulated and 5C data, we showed that the proposed Bayesian approach to chromatin structure inference not only is able to construct multistate models from contact data that agree well with independent biological observations, but also allows us to address crucial questions about the cell-to-cell variability of chromatin architecture. We demonstrated this by using multistate models to uncover significant heterogeneity in the organization of two consecutive TADs, with respect to the location of the boundary region separating the TADs as well as the overlap between them. This structural heterogeneity of TAD organization agrees with previous results on the same dataset obtained by simulating from a polymer model, whose parameters were optimized such that the input data could be reproduced (22). We note that, in comparison, our structures are somewhat different: While in our models, structures are rather compact, most structures shown in ref. 22 seem very extended with an elongated region. Furthermore, in the models inferred with ISD, both TADs tend to have larger radii of gyration (SI Appendix, Fig. S12). Different results on structural diversity of intra-TAD organization have been obtained in an earlier modeling-based attempt based on a coarse-mixture model (42), where it was found that the 3D structure of 71% and 82% of reconstructed TADs (depending on the dataset) corresponded to the consensus structure, which is in contrast to the highly heterogeneous ensembles we obtained. Cell-to-cell variability of TADs has also been addressed on the basis of single-cell Hi-C contact matrices with at best unclear, if not contradicting, results (8, 11). In structures calculated from single-cell Hi-C data (9), TADs have been shown to vary widely in terms of compaction between individual cells. Independent from Hi-C, imaging methods have recently been used to investigate intercell variability of chromatin structure: Multiplexed superresolution FISH imaging (43) showed that TAD-like domain structures with strong physical segregation exist in single cells, but the location of domain boundaries is highly variable. The latter observation agrees with our results as shown in Fig. 5B, while in our structures, physical segregation between the Tsix and Xist TADs seems less pronounced (Fig. 5C). This is evident by comparing the overlap of both TADs observed in the posterior ensemble and a reference ensemble (prior with additional compaction). Although the overlap tends to be higher in the reference ensemble (0.23±0.07), there is still substantial overlap of both TADs in the posterior (0.19±0.07), indicating that both TADs are not fully segregated (as is the case in the posterior based on data missing the inter-TAD contacts: 0.03±0.02). Another recent study based on high-throughput FISH (44) found that TAD structure is highly variable between single cells and that TADs frequently overlap in space.

Our method comes with an objective measure for the number of states required to accurately model population-based data, a feature previous approaches (3, 19, 20) are missing. The number of states used in previous work varies from very low numbers of states (n=4) (20) to very large populations (n=104) (3, 19). Bayesian model comparison can, provided sufficiently exhaustive MCMC sampling, uniquely and objectively determine this number. While our method includes consensus structure calculation as a special case, it clearly demonstrates that this kind of data is much more appropriately modeled using a larger number of structures. At the same time, the Bayesian model comparison approach employed in this work helps to avoid overfitting the data. It is also important to note that our multistate model-based approach is not limited to 5C/Hi-C data. Instead, any kind of pairwise contact frequency data could be used. This allows chromatin structure modeling from data obtained from recent ligation-free alternatives to 3C based methods (45, 46).

Our work opens up several avenues of further research. A more efficient implementation using molecular modeling packages such as OpenMM (47), possibly leveraging the power of graphics processing units (GPUs), would increase the feasible range of both resolutions and system size. For example, we estimate that modeling the mouse X chromosome at a resolution of 40 kb (4,250 beads) involves a 100- to 200-fold increase in computation time. In the current implementation, the time required to evaluate the likelihood scales quadratically with the number of beads. With an implementation improving this bottleneck, computation of full chromosome structures should be feasible. More detailed structural prior information encoding, e.g., a priori contact probabilities between loci based on epigenetic marks (48, 49) and more detailed forward and error models for the data would likely result in more accurate 3D models. For example, a fully probabilistic treatment of biases in Hi-C data could be included in the generative model so that raw, unnormalized Hi-C data can be used as an input. Eventually, with these improvements, the model comparison approach implemented in this work could rise from merely helping to set an important modeling parameter to a measure of cell-to-cell variability with the potential to answer a plethora of important biological questions. Finally, we note that our approach is not limited to 3C data, but could also be useful for other kinds of contact data in which structural heterogeneity can be suspected, for example cross-linking studies of protein complexes.

Materials and Methods

Statistical Model for 5C Data.

In our generative model, we compute a contact matrix with entries f^ijk for each state xk such that f^ij=∑k=1nf^ijk. Ideally, two beads i and j are in contact if their distance dijk is smaller than some cutoff dc resulting in binary-valued f^ijk. Because we use a gradient-based MCMC sampling algorithm, we require the posterior probability to be differentiable and thus approximate a hard contact with a flipped sigmoidal function s(d)=(γd(1+γ2d2)−12+1)/2, where γ controls the steepness of the smoothed contact such that for large γ the hard contact criterion is recovered: limγ→∞s(d)=1−θ(d), where θ is the step function. The total number of contacts between two beads ∑kf^ijk is not directly comparable to the experimental contact matrix, because the number of states n does not match the (unknown) number of molecules analyzed in the experiment. We therefore need to scale the calculated contacts with a positive factor α, which we treat as a nuisance parameter:f^ij=α∑k=1nf^ijk=α∑k=1ns(dc−dijk).We model the discrepancy between experimental and back-calculated data using a Poisson distribution with rate f^ij. For a single contact, the likelihood is then given byPr(fij|X,α)∝[f^ij(X,α)]fije−f^ij(X,α).Assuming that the single measurements are independent, the likelihood of the complete dataset is the product over all contacts:Pr(D|X,α,I)∝exp−∑j>iMf^ij(X,α)∏j>iM[f^ij(X,α)]fij.It is important to note that we could have chosen a different error model. In principle, Bayesian model comparison allows us to determine the error model that is most compatible with the data. However, we restrict this study to a Poisson error model, because the Poisson distribution is discrete and has been used previously to model Hi-C data.

Bayesian Model Comparison.

The posterior probability of a particular choice of n isPr(n|D,I)=Pr(D|n,I) Pr(n|I)Pr(D|I).Assuming that a priori all n are equally likely, the ratio of two choices for the number of states n and n′ isOnn′=Pr(n|D,I)Pr(n′|D,I)=∫dXdα Pr(D|X,α,n,I) Pr(X,α|n,I)∫dXdα Pr(D|X,α,n′,I) Pr(X,α|n′,I).[3]The odds ratio Onn′ is called the Bayes factor and tells us which model to prefer: For Onn′>1, an n-state model describes the data better; while Onn′<1 indicates a preference for a model based on n′ states. The evaluation of the model evidence involves calculating two possibly intractable and high-dimensional integrals, which in all but the most trivial cases have to be approximated numerically. As several of the occurring distributions are of a nonstandard form, we use histogram reweighting (50, 51) to approximate the evidences.

Code Availability.

We implemented our method in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).

Acknowledgments

We thank Luca Giorgetti for sharing DNA-FISH data. S.C. and M.N. acknowledge funding from the Agence National de la Recherche (ANR-11-MONU-0020). S.C. thanks Christian Griesinger for funding and support. M.H. acknowledges funding by the German Research Foundation under Grant SFB 860, TP B09 as well as funding by the Carl Zeiss Foundation.

Footnotes

  • ↵1To whom correspondence may be addressed. Email: michael.habeck{at}uni-jena.de.
  • Author contributions: S.C., M.N., and M.H. designed research; S.C. and M.H. performed research; S.C. contributed new reagents/analytic tools; S.C. and M.H. analyzed data; and S.C., M.N., and M.H. wrote the paper.

  • The authors declare no competing interest.

  • This article is a PNAS Direct Submission.

  • Data deposition: Code for our method has been deposited in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).

  • This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1910364117/-/DCSupplemental.

Published under the PNAS license.

References

  1. ↵
    1. J. Dekker,
    2. K. Rippe,
    3. M. Dekker,
    4. N. Kleckner
    , Capturing chromosome conformation. Science 295, 1306–1311 (2002).
    OpenUrlAbstract/FREE Full Text
  2. ↵
    1. E. Lieberman-Aiden et al.
    , Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
    OpenUrlAbstract/FREE Full Text
  3. ↵
    1. R. Kalhor,
    2. H. Tjong,
    3. N. Jayathilaka,
    4. F. Alber,
    5. L. Chen
    , Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30, 90–98 (2012).
    OpenUrlCrossRefPubMed
  4. ↵
    1. S. S. P. Rao et al.
    , A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
    OpenUrlCrossRefPubMed
  5. ↵
    1. J. R. Dixon et al.
    , Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
    OpenUrlCrossRefPubMed
  6. ↵
    1. E. P. Nora et al.
    , Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
    OpenUrlCrossRefPubMed
  7. ↵
    1. T. Sexton et al.
    , Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 148, 458–472 (2012).
    OpenUrlCrossRefPubMed
  8. ↵
    1. T. Nagano et al.
    , Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013).
    OpenUrlCrossRefPubMed
  9. ↵
    1. T. J. Stevens et al.
    , 3D structure of individual mammalian genomes studied by single cell Hi-C. Nature 544, 59–64 (2017).
    OpenUrlCrossRefPubMed
  10. ↵
    1. T. Nagano et al.
    , Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 547, 61–67 (2017).
    OpenUrlCrossRefPubMed
  11. ↵
    1. I. M. Flyamer et al.
    , Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature 544, 110–114 (2017).
    OpenUrlCrossRefPubMed
  12. ↵
    1. V. Ramani et al.
    , Massively multiplex single-cell Hi-C. Nat. Methods 14, 263–266 (2017).
    OpenUrlCrossRefPubMed
  13. ↵
    1. Q. Szabo,
    2. F. Bantignies,
    3. G. Cavalli
    , Principles of genome folding into topologically associating domains. Sci. Adv. 5, eaaw1668 (2019).
    OpenUrlFREE Full Text
  14. ↵
    1. M. V. Imakaev,
    2. G. Fudenberg,
    3. L. A. Mirny
    , Modeling chromosomes: Beyond pretty pictures. FEBS Lett. 589, 3031–3036 (2015).
    OpenUrlCrossRefPubMed
  15. ↵
    1. J. M. O’Sullivan,
    2. M. D. Hendy,
    3. T. Pichugina,
    4. G. C. Wake,
    5. J. Langowski
    , The statistical-mechanics of chromosome conformation capture. Nucleus 4, 390–398 (2013).
    OpenUrlCrossRefPubMed
  16. ↵
    1. I. Junier,
    2. Y. G. Spill,
    3. M. A. Marti-Renom,
    4. M. Beato,
    5. F. le Dily
    , On the demultiplexing of chromosome capture conformation data. FEBS Lett. 589, 3005–3013 (2015).
    OpenUrlCrossRefPubMed
  17. ↵
    1. E. Sefer,
    2. G. Duggal,
    3. C. Kingsford
    , Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations. J. Comput. Biol. 23, 425–438 (2016).
    OpenUrl
  18. ↵
    1. G. Gürsoy,
    2. Y. Xu,
    3. A. L. Kenter,
    4. J. Liang
    , Computational construction of 3D chromatin ensembles and prediction of functional interactions of alpha-globin locus from 5C data. Nucleic Acids Res. 45, 11547–11558 (2017).
    OpenUrlCrossRef
  19. ↵
    1. H. Tjong et al.
    , Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. Proc. Natl. Acad. Sci. U.S.A. 113, E1663–E1672 (2016).
    OpenUrlAbstract/FREE Full Text
  20. ↵
    1. G. Zhu et al.
    , Reconstructing spatial organizations of chromosomes through manifold learning. Nucleic Acids Res. 46, e50 (2018).
    OpenUrlCrossRef
  21. ↵
    1. S. Wang,
    2. J. Xu,
    3. J. Zeng
    . Inferential modeling of 3D chromatin structure. Nucleic Acids Res. 43, e54 (2015).
  22. ↵
    1. L. Giorgetti et al.
    , Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950–963 (2014).
    OpenUrlCrossRefPubMed
  23. ↵
    1. B. Zhang,
    2. P. G. Wolynes
    , Topology, structures, and energy landscapes of human chromosomes. Proc. Natl. Acad. Sci. U.S.A. 112, 6062–6067 (2015).
    OpenUrlAbstract/FREE Full Text
  24. ↵
    1. M. Di Pierro,
    2. B. Zhang,
    3. E. L. Aiden,
    4. P. G. Wolynes,
    5. J. N. Onuchic
    , Transferable model for chromosome architecture. Proc. Natl. Acad. Sci. U.S.A. 113, 12168–12173 (2016).
    OpenUrlAbstract/FREE Full Text
  25. ↵
    1. S. Carstens,
    2. M. Nilges,
    3. M. Habeck
    , Inferential structure determination of chromosomes from single-cell Hi-C data. PLoS Comput. Biol. 12, 1–33 (2016).
    OpenUrlCrossRef
  26. ↵
    1. W. Rieping,
    2. M. Habeck,
    3. M. Nilges
    , Inferential structure determination. Science 309, 303–306 (2005).
    OpenUrlAbstract/FREE Full Text
  27. ↵
    1. L. Wasserman
    , Bayesian model selection and model averaging. J. Math. Psychol. 44, 92–107 (2000).
    OpenUrlCrossRefPubMed
  28. ↵
    1. J. Dostie et al.
    , Chromosome conformation capture carbon copy (5C): A massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, 1299–1309 (2006).
    OpenUrlAbstract/FREE Full Text
  29. ↵
    1. P. R. Langer-Safer,
    2. M. Levine,
    3. D. C. Ward
    , Immunological method for mapping genes on Drosophila polytene chromosomes. Proc. Natl. Acad. Sci. U.S.A. 79, 4381–4385 (1982).
    OpenUrlAbstract/FREE Full Text
  30. ↵
    1. C. Cui,
    2. W. Shu,
    3. P. Li
    , Fluorescence in situ hybridization: Cell-based genetic diagnostic and research applications. Front. Cell Dev. Biol. 4, 89 (2016).
    OpenUrl
  31. ↵
    1. F. Yue et al.
    , A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
    OpenUrlCrossRefPubMed
  32. ↵
    1. R. Ghirlando,
    2. G. Felsenfeld
    , CTCF: Making the right connections. Genes Dev. 30, 881–891 (2016).
    OpenUrlAbstract/FREE Full Text
  33. ↵
    1. N. Hua et al.
    , Producing genome structure populations with the dynamic and automated PGS software. Nat. Protoc. 13, 915–926 (2018).
    OpenUrl
  34. ↵
    1. M. Bonomi,
    2. G. T. Heller,
    3. C. Camilloni,
    4. M. Vendruscolo
    , Principles of protein structural ensemble determination. Curr. Opin. Struct. Biol. 42, 106–116 (2017).
    OpenUrlCrossRef
  35. ↵
    1. C. K. Fisher,
    2. A. Huang,
    3. C. M. Stultz
    , Modeling intrinsically disordered proteins with Bayesian statistics. J. Am. Chem. Soc. 132, 14919–14927 (2010).
    OpenUrlCrossRefPubMed
  36. ↵
    1. B. Rozycki,
    2. Y. C. Kim,
    3. G. Hummer
    , SAXS ensemble refinement of ESCRT-III CHMP3 conformational transitions. Structure 19, 109–116 (2011).
    OpenUrlCrossRefPubMed
  37. ↵
    1. G. Hummer,
    2. J. Kofinger
    , Bayesian ensemble refinement by replica simulations and reweighting. J. Chem. Phys. 143, 243150 (2015).
    OpenUrlCrossRefPubMed
  38. ↵
    1. W. Potrzebowski,
    2. J. Trewhella,
    3. I. Andre
    , Bayesian inference of protein conformational ensembles from limited structural data. PLoS Comput. Biol. 14, e1006641 (2018).
    OpenUrl
  39. ↵
    1. B. Roux,
    2. J. Weare
    , On the statistical equivalence of restrained-ensemble simulations with the maximum entropy method. J. Chem. Phys. 138, 084107 (2013).
    OpenUrlCrossRefPubMed
  40. ↵
    1. A. Cavalli,
    2. C. Camilloni,
    3. M. Vendruscolo
    , Molecular dynamics simulations with replica-averaged structural restraints generate structural ensembles according to the maximum entropy principle. J. Chem. Phys. 138, 094112 (2013).
    OpenUrlCrossRefPubMed
  41. ↵
    1. M. Bonomi,
    2. C. Camilloni,
    3. A. Cavalli,
    4. M. Vendruscolo
    . Metainference: A Bayesian inference method for heterogeneous systems. Sci. Adv. 2, e1501177 (2016).
    OpenUrlFREE Full Text
  42. ↵
    1. M. Hu et al.
    , Bayesian inference of spatial organizations of chromosomes. PLoS Comput. Biol. 9, e1002893 (2013).
    OpenUrlCrossRefPubMed
  43. ↵
    1. B. Bintu et al.
    , Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science 362, eaau1783 (2018).
    OpenUrlAbstract/FREE Full Text
  44. ↵
    1. E. H. Finn et al.
    , Extensive heterogeneity and intrinsic variation in spatial genome organization. Cell 176, 1502–1515.e10 (2019).
    OpenUrlCrossRef
  45. ↵
    1. R. A. Beagrie et al.
    , Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524 (2017).
    OpenUrlCrossRefPubMed
  46. ↵
    1. S. A. Quinodoz et al.
    , Higher-order inter-chromosomal hubs shape 3d genome organization in the nucleus. Cell 174, 744–757.e24 (2018).
    OpenUrlCrossRefPubMed
  47. ↵
    1. P. Eastman et al.
    , OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, 1–17 (2017).
    OpenUrlCrossRef
  48. ↵
    1. M. Barbieri et al.
    , Complexity of chromatin folding is captured by the strings and binders switch model. Proc. Natl. Acad. Sci. U.S.A. 109, 16173–16178 (2012).
    OpenUrlAbstract/FREE Full Text
  49. ↵
    1. M. Di Pierro,
    2. R. R. Cheng,
    3. E. Lieberman Aiden,
    4. P. G. Wolynes,
    5. J. N. Onuchic
    . De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. Proc. Natl. Acad. Sci. U.S.A. 114, 12126–12131 (2017).
    OpenUrlAbstract/FREE Full Text
  50. ↵
    1. A. M. Ferrenberg,
    2. R. H. Swendsen
    , New Monte Carlo technique for studying phase transitions. Phys. Rev. Lett. 61, 2635–2638 (1988).
    OpenUrlCrossRefPubMed
  51. ↵
    1. N. D. Lawrence,
    2. M. Girolami
    1. M. Habeck
    , “Evaluation of marginal likelihoods via the density of states” in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, N. D. Lawrence, M. Girolami, Eds. (Proceedings of Machine Learning Research, 2012), pp. 486–494.
PreviousNext
Back to top
Article Alerts
Email Article

Thank you for your interest in spreading the word on PNAS.

NOTE: We only request your email address so that the person you are recommending the page to knows that you wanted them to see it, and that it is not junk mail. We do not capture any email address.

Enter multiple addresses on separate lines or separate them with commas.
Bayesian inference of chromatin structure ensembles from population-averaged contact data
(Your Name) has sent you a message from PNAS
(Your Name) thought you would like to see the PNAS web site.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Citation Tools
Bayesian inference of chromatin structure ensembles from population-averaged contact data
Simeon Carstens, Michael Nilges, Michael Habeck
Proceedings of the National Academy of Sciences Apr 2020, 117 (14) 7824-7830; DOI: 10.1073/pnas.1910364117

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
Request Permissions
Share
Bayesian inference of chromatin structure ensembles from population-averaged contact data
Simeon Carstens, Michael Nilges, Michael Habeck
Proceedings of the National Academy of Sciences Apr 2020, 117 (14) 7824-7830; DOI: 10.1073/pnas.1910364117
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
  • Tweet Widget
  • Facebook Like
  • Mendeley logo Mendeley

Article Classifications

  • Biological Sciences
  • Biophysics and Computational Biology
Proceedings of the National Academy of Sciences: 117 (14)
Table of Contents

Submit

Sign up for Article Alerts

Jump to section

  • Article
    • Abstract
    • Results
    • Discussion
    • Materials and Methods
    • Acknowledgments
    • Footnotes
    • References
  • Figures & SI
  • Info & Metrics
  • PDF

You May Also be Interested in

Water from a faucet fills a glass.
News Feature: How “forever chemicals” might impair the immune system
Researchers are exploring whether these ubiquitous fluorinated molecules might worsen infections or hamper vaccine effectiveness.
Image credit: Shutterstock/Dmitry Naumov.
Reflection of clouds in the still waters of Mono Lake in California.
Inner Workings: Making headway with the mysteries of life’s origins
Recent experiments and simulations are starting to answer some fundamental questions about how life came to be.
Image credit: Shutterstock/Radoslaw Lecyk.
Cave in coastal Kenya with tree growing in the middle.
Journal Club: Small, sharp blades mark shift from Middle to Later Stone Age in coastal Kenya
Archaeologists have long tried to define the transition between the two time periods.
Image credit: Ceri Shipton.
Mouse fibroblast cells. Electron bifurcation reactions keep mammalian cells alive.
Exploring electron bifurcation
Jonathon Yuly, David Beratan, and Peng Zhang investigate how electron bifurcation reactions work.
Listen
Past PodcastsSubscribe
Panda bear hanging in a tree
How horse manure helps giant pandas tolerate cold
A study finds that giant pandas roll in horse manure to increase their cold tolerance.
Image credit: Fuwen Wei.

Similar Articles

Site Logo
Powered by HighWire
  • Submit Manuscript
  • Twitter
  • Facebook
  • RSS Feeds
  • Email Alerts

Articles

  • Current Issue
  • Special Feature Articles – Most Recent
  • List of Issues

PNAS Portals

  • Anthropology
  • Chemistry
  • Classics
  • Front Matter
  • Physics
  • Sustainability Science
  • Teaching Resources

Information

  • Authors
  • Editorial Board
  • Reviewers
  • Subscribers
  • Librarians
  • Press
  • Cozzarelli Prize
  • Site Map
  • PNAS Updates
  • FAQs
  • Accessibility Statement
  • Rights & Permissions
  • About
  • Contact

Feedback    Privacy/Legal

Copyright © 2021 National Academy of Sciences. Online ISSN 1091-6490