BIOLOGICAL SCIENCES / GENETICS
Chromosomal periodicity of evolutionarily conserved gene pairs


*Department of Genetics,
Harvard–Massachusetts Institute of Technology Division of Health Sciences and Technology, and
Harvard–Partners Center for Genetics and Genomics, Harvard Medical School, Boston, MA 02115; and
Departments of Biology and Biomedical Engineering, and Bioinformatics Program, Boston University, Boston, MA 02215
Edited by John R. Roth, University of California, Davis, CA, and approved May 11, 2007 (received for review December 6, 2006)
| Abstract |
|---|
chromosome structure | computational genomics | nucleoid | spatial organization
Bacterial chromosomes must be compacted 1,000-fold to fit within the cell. The resulting structures could therefore be highly disordered; for example, 10 kb of uncompacted DNA (1/400th of the genome) could span the entire cell. However, in vivo the chromosome exhibits a high degree of order. At a local level, it is wound into ~10-kb supercoiled domains that topologically isolate different regions of the genome from each other (7, 8). At larger scales, certain regions of the genome are physically inaccessible to each other, suggesting that loci undergo limited diffusion (9). More recently, fluorescence microscopy has shown that loci are not randomly positioned in the cell but occupy reproducible 3D positions that undergo specific cell-cycle movements (10–14). In Escherichia coli and Caulobacter crescentus, evidence furthermore suggests that the positions of loci in the cell are linearly correlated with their coordinate along the genome, with the origin and terminus at opposite cell poles (11, 12). In E. coli, recent data confirm this linear correlation but suggest the origin is located at midcell, with the two arcs of the chromosome in two different (longitudinal) halves of the cell (13, 14). At a finer scale, below the resolution of current confocal microscopy, positional correlations and periodicities in sequence (15), expression levels (16–19), and transcription factor-binding sites (17) suggest a functionally important, possibly regular chromosome conformation. However, beyond coarse high-level outlines, the structure has been largely inaccessible to experiment and remains largely unknown.
Here, we approach the problem of chromosome structure from an evolutionary perspective. Our method, based on comparative genomics, is similar to statistical coupling analysis in proteins (20). In proteins, the 3D arrangement of specific residues is critical for function; for example, the WW protein domain contains a small 3D network of residues that is crucial for both folding and function (20). Maintaining this arrangement constrains the identities of the amino acids at the involved residues and causes them to coevolve, generating statistical correlations in a multiple sequence alignment (20).
In the chromosome, we reasoned analogously that if a particular 3D arrangement of genes is critical for function, such genes would tend to occupy genomic locations where they can achieve this arrangement in the folded chromosome (Fig. 1); for example, a regulatory region may tend to occupy positions where it can be folded close to the gene it regulates, as in the β-globin locus control region (21). We reasoned that the coevolution of genes to locations compatible with such 3D arrangements would create statistical correlations in gene locations, analogous to the observed correlations in protein residues. Given the dynamical and fluid nature of chromosome organization, we expected such constraints to be less rigid than those found in protein residues, yet potentially significant enough to create correlations detectable in a multiple genome comparison.
| Results |
|---|
Based on these criteria, we searched across 10 million gene pairs in >100 genomes and selected 22,500 strongly SC pairs [supporting information (SI) Methods and SI Table 1; Fig. 1b]. In any given genome, genes belonging to these SC pairs will occupy a specific set of genomic locations: for many SC pairs, the two genes will be located close together along the genome. However, in this same genome, the genes in other SC pairs may be far apart (Fig. 1a). In addition, many of the genes may be concentrated in particular regions of the genome. We reasoned that if the correlated pairs are constrained by chromosome structure, their distributions along a given genome (Fig. 1c) might reveal structural features of the chromosome fold, for example, regions that are folded into spatial proximity (Fig. 1d) or are constrained to particular subspaces of the nucleoid or cell.
We first investigated properties of the SC pairs across all organisms and found that the genes in a pair exhibit a strong preference for positions that are symmetric about the origin of replication (SI Fig. 6). This symmetry is consistent with fluorescence microscopy in C. crescentus (10–12) and with observations of symmetry in genome alignments and gene order (24–26).
We next examined the detailed genomic organization of the SC pairs in a single organism, E. coli. First, we analyzed the distribution of distances, i.e., the number of times a particular distance separates genes in an SC pair along each chromosome arc (defined by the origin and terminus of replication; see SI Fig. 6a). Because the SC genes were chosen based on their tendency for closeness in bacterial genomes, the expectation (under a null hypothesis of otherwise randomly positioned genes) is that most distances will be close to zero, and that the frequency of larger distances will taper smoothly to zero (see SI Fig. 7a). In E. coli, however, we observe a markedly different pattern (Fig. 2). Distances near zero are indeed overrepresented, but there is also a series of significant peaks at 117, 234, 351 kb, and n × 117 kb (n integer), out to 1.4 Mb. Using a Fourier transform, we confirmed a strong and highly significant periodicity (P < 0.002; see Fig. 2 Inset and Methods). The SC pairs are therefore not randomly spaced along the genome but prefer specific genomic intervals of n × 117 kb. A similar periodicity is observed in the distances between the SC genes located on different arcs of the chromosome (SI Fig. 7b).
We next sought to understand this link in more detail, in particular the relationship between the periodicities in the SC pairs and the positioning of highly transcribed genes in E. coli; in addition, because the SC pairs were chosen by using orthologs conserved over many genomes, we simultaneously examined the connection between SC pairs and the positioning of highly conserved genes. We therefore constructed two new pair sets in which the gene pairs were selected randomly from E. coli by using probabilities proportional to their level of transcription (transcription pairs) or conservation (conservation pairs) (see Methods). The distance distributions of these new pair sets are therefore enriched in distances that separate highly transcribed or conserved genes along the chromosome, allowing us to examine preferences in the chromosomal spacing of these genes in a manner similar to the SC pairs. We first compared the distance distributions of these two new pair sets with the SC pairs. In contrast to the SC pairs, we found no periodicity in conservation pairs (Fig. 4 a and d) and a weak 117-kb periodicity in transcription pairs [Fig. 4 a and d, consistent with previous observations of 115 kb in transcription (18, 19)].
| Discussion |
|---|
117 kb could generate several gene clusters spaced at 117 kb by splitting a single initial gene cluster by 117-kb recombination events. However, the clusters generated from splitting two different initial clusters would not naturally be in phase with each other. In general, such local constraints cannot easily explain a global periodicity of positions that extends in almost perfect phase along each half of the genome. Even an extreme case of recombination hotspots spaced at n × 117 kb along the chromosome could maintain the 117-kb periodicity only if the SC paired genes were constrained to the very center or edges of each 117-kb stretch. Otherwise, a single inversion would destroy the periodicity. In addition, any horizontal gene transfer would disrupt the periodicity unless the fragment were small (≪117 kb) or ~117 kb long with SC genes at the center or edges. We cannot rule out the possibility that such rearrangement processes contribute to the observed patterns, e.g., symmetric inversions about the origin of replication (24–26) could explain the observed symmetry of SC paired genes. However, the localization of SC genes in an in-phase set of periodically spaced islands suggests that some selective pressure beyond these processes maintains these genes at these specific locations. Structural constraints due to the spatial organization of the chromosome offer a simple explanation. In-phase positional periodicities in amino acid sequences are a canonical structural motif seen in proteins, where they indicate the presence of a specific face on a periodic structure (27), for example, the face of hydrophobic residues in contact with the membrane in a transmembrane domain (27). Because of the regular period, these spatially contiguous structural faces are composed of residues that are separated by periodic intervals along the sequence. In the E. coli chromosome, the periodic distributions suggest an analogous structural organization, a regular 117-kb looping, and a single structural face of each chromosome half, along which SC pairs are predominantly localized.
In Fig. 5, we depict three conformations consistent with these constraints. Note that these conformations are simplified backbones rather than exact structures; for example, the coiled backbone in each panel represents 21-fold compacted DNA, which may consist of more complicated substructure (SI Fig. 12b), e.g., of 10–12 (possibly irregular and stochastic) topological domains (7, 8). The basic feature of each configuration is a regular 117-kb coiling along the backbone of the circular chromosome, which creates regions of high pair density along a face or faces of the structure. This 117-kb looped circular chromosome could be arranged in multiple ways within the cell. It could be flattened with the origin and terminus positioned at opposite cell poles (Fig. 5 a and b), compatible with previous observations in C. crescentus (12) and E. coli (11). Alternatively, as suggested by recent experimental data in E. coli (13, 14), the origin could be positioned at midcell, placing the right and left arcs in different cell halves (Fig. 5c). Other arrangements are also possible, including longer-range periodicities (SI Figs. 12a and 13). In addition, the structure could undergo dynamical transitions (Fig. 5d), e.g., between a longitudinally symmetric configuration (Fig. 5 a or b) and a transverse symmetric one (Fig. 5c), while constantly maintaining the structural features suggested by our analysis. In fact, the origin has been observed to move from cell pole to midcell before replication (11, 14). Additional experimental data, however, will be required to discriminate between these different possibilities.
If the distributions of the SC pairs are the product of a structural periodicity and the localization of SC pairs along specific structural faces of the E. coli chromosome, what could be responsible for these features? Given the correlation between SC pairs and transcription, transcription is an attractive candidate for causing localization: the localization of certain highly transcribed genes along the structural faces (see helical moments in Fig. 5) would have the advantage of creating spatial subregions in which highly transcribed genes could be accessed by limited diffusion of RNA polymerase or by RNA polymerase fixed in factories. Such subregions are consistent with experimental observations of foci of RNA polymerase in E. coli (28, 29) [which, like the transcription correlation we observe, are specific to log-phase growth (28)] and with transcription factories in the eukaryotic nucleus (3). If the pair-dense faces are oriented inward (Fig. 5a), the density of transcription could help mechanically hold the chromosome arms together (explaining the symmetry in SI Figs. 6 and 7). Alternatively, if oriented on the surface (Fig. 5b), they could allow nascent transcripts to be accessed by ribosomes and the membrane (30).
Spatial localization of highly transcribed genes alone, however, would not generate periodicity. Rather, periodicity requires a second constraint, a regular loop size analogous to the 3.5-residue turn of an α-helix. This suggests an intrinsic property of the chromosome or of its binding proteins (e.g., H-NS, MukBEF) (6); for example, the association of chromosomal DNA with proteins that induce a regular curvature would create periodic loops. Helices are also known to be the energetically optimal way of confining a string to certain geometrical spaces (31); thus, a 117-kb looping may be the spontaneous outcome of physicochemical properties including macromolecular crowding, supercoiling, DNA persistence length, and cell dimension.
The bacterial chromosome, however, is known to be a highly dynamic structure (2, 3, 12, 13) and the proposed models (Fig. 5) are based on patterns in gene positioning that are inherently static. Two points should be emphasized. First, the proposed features may relate to a specific portion of the cell cycle, perhaps log-phase growth, as indicated in SI Fig. 9. However, such features could be maintained despite dynamical changes (e.g., a reorientation from Fig. 5 a to c). In addition, it is likely that replication plays a dominant role in determining the structure of the chromosome (13, 14). In particular, the proposed looping would be advantageous for replication, allowing the chromosome to rapidly unwind and rewind with minimal local entanglement, like two coaxial [Fig. 5c (13)] or parallel [Fig. 5 a and b (12)] stacks of ropes.
The relationship of the proposed features to existing experimental data also bears discussion. These features are consistent with confocal microscopy (10, 12, 13), recombination (32), transposon insertion (33), and atomic force microscopy (34). In addition, our findings may yield insight into previously observed periodicities of 96 kb in transcription factor-binding sites (17), 90–120 kb in wavelet analysis (15, 16), and 115 kb in transcription levels (18, 19) in E. coli. In particular, our analysis suggests that these periodicities reflect an in-phase 117-kb grid, occupied by top-transcribed, top-conserved, and SC gene pairs. The significance of the pattern in the SC pairs may be due to the special vantage point of comparative genomics, where loci are identified based on the combined results of evolution acting on multiple genomes.
Independent of structure, our approach reveals significant gene organization on the chromosome. More general comparative analyses of how genes, gene pairs, or higher multiplets of genes are positioned in the genome should yield further insight into chromosome architecture. Similar methods should also be applicable to eukaryotes. The particular structural faces we propose and the chromosome-wide structural periodicity make specific predictions, which must be tested experimentally. To this effect, we are examining other bacteria for similar patterns (see SI Fig. 15 for C. crescentus, which displays a similar strong periodicity at 113 kb) and have developed a multiplex method of chromosome conformation capture (3C) (35) to measure the distances between thousands of chromosomal loci simultaneously at the resolution of our model. Ultimately, genome sequences and their structures may be highly interdependent aspects of a single finely tuned system. Evolutionary conservation should provide a powerful means of unraveling this interdependence.
| Methods |
|---|
Genomic Data. The genomes were obtained from GenBank and consisted of 105 bacterial and three eukaryotic genomes (Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Caenorhabditis elegans, which were included to represent particularly distant species).
Chromosomal Proximity.
For a pair of genes x and y, we calculated the tendency toward chromosomal proximity by using the difference in the order in which genes appear along the chromosome (gene-order difference) (23). For each genome g in a set G of genomes, we evaluated the probability Pg that the pair and its orthologs have a gene-order difference D less than or equal to the observed gene-order difference dg(x, y). Pg was calculated numerically under the null hypothesis that orthologous genes are randomly ordered on the chromosome. In genomes with multiple chromosomes, the gene-order distance between genes on different chromosomes was assigned to be greater than the maximum number of genes on any chromosome.
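As a rough illustration of this null model, the sketch below computes P(D ≤ d) for a single genome by exact enumeration. It assumes a circular chromosome in which the second gene of the pair is equally likely to occupy any other slot; the function names and the circular-distance convention are choices made for this example, and the numerical procedure and phylogenetic weighting actually used (SI Methods) are not reproduced.

```python
from fractions import Fraction


def circular_order_difference(i: int, j: int, n_genes: int) -> int:
    """Gene-order difference between slots i and j on a circular chromosome."""
    k = abs(i - j) % n_genes
    return min(k, n_genes - k)


def prob_order_difference_at_most(d: int, n_genes: int) -> Fraction:
    """P(D <= d) when gene y is placed uniformly at random relative to gene x.

    Assumption (not from the paper): each of the n_genes - 1 other slots
    is equally likely for y.
    """
    if n_genes < 2:
        raise ValueError("need at least two genes")
    # Fix x at slot 0 (the circle is symmetric) and enumerate the slots for y.
    hits = sum(
        1
        for j in range(1, n_genes)
        if circular_order_difference(0, j, n_genes) <= d
    )
    return Fraction(hits, n_genes - 1)
```

Under these assumptions, two adjacent genes (d = 1) in a genome of 4,000 genes have a null probability of 2/3999, which shows why repeated adjacency across many genomes quickly becomes significant.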
To correct for the variable phylogenetic divergence of query genomes, we constructed a UPGMA (36) phylogenetic tree based on a pairwise phylogenetic distance between genomes g1 and g2. Note that the use of an alternative phylogenetic reconstruction method (neighbor-joining) does not affect our conclusions. We used a phylogenetic distance based on gene content (37), specifically, the mutual information between E. coli ortholog occurrence vectors in two genomes. The probabilities Pg from each genome were weighted based on the phylogenetic tree, by using an approach similar to the method of phylogenetic contrasts (36) (see SI Methods for details). The orthology mapping was established by using best bidirectional orthologs from the Kyoto Encyclopedia of Genes and Genomes (KEGG) Sequence Similarity Database (www.genome.ad.jp/kegg/ssdb).
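The gene-content measure can be illustrated with a short sketch that computes the mutual information between two presence/absence vectors, one per genome, indexed by the same list of E. coli genes. The function name is hypothetical, and how the mutual information is converted into a distance for tree building is not specified here, so only the mutual information itself is computed.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Mutual information (in bits) between two equal-length 0/1 vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))   # counts of (a_i, b_i) combinations
    count_a = Counter(a)         # marginal counts for genome 1
    count_b = Counter(b)         # marginal counts for genome 2
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts
        mi += (n_xy / n) * math.log2(n_xy * n / (count_a[x] * count_b[y]))
    return mi
```

Two genomes with identical ortholog content give the maximum value for their marginal frequencies, whereas statistically independent occurrence vectors give zero.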
Phylogenetic Co-Occurrence. Phylogenetic profile co-occurrence probability was calculated by using the extended hypergeometric distribution method described in Kharchenko et al. (38), which also includes a correction for the phylogenetic divergence. The orthologs were determined by using best bidirectional BLASTP hits against the National Center for Biotechnology Information NR protein data set. Organisms containing orthologs for <1% of E. coli genes were excluded from calculations.
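For orientation, the following sketch shows the plain hypergeometric tail probability that underlies co-occurrence tests of this kind. It is not the extended method of Kharchenko et al. (38): the correction for phylogenetic divergence is omitted, and the function and parameter names are assumptions made for this example.

```python
from math import comb


def cooccurrence_pvalue(n_genomes: int, n_x: int, n_y: int, n_both: int) -> float:
    """P(overlap >= n_both) if the genomes containing x and y were independent.

    n_genomes: total number of genomes considered
    n_x, n_y:  number of genomes containing gene x, gene y
    n_both:    number of genomes containing both genes
    """
    if not (0 <= n_both <= min(n_x, n_y) <= max(n_x, n_y) <= n_genomes):
        raise ValueError("inconsistent counts")
    total = comb(n_genomes, n_y)
    # Sum the hypergeometric probabilities for overlaps of n_both or more.
    tail = sum(
        comb(n_x, k) * comb(n_genomes - n_x, n_y - k)
        for k in range(n_both, min(n_x, n_y) + 1)
    )
    return tail / total
```

A small value indicates that two genes are present together in more genomes than expected by chance, given how often each occurs on its own.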
Distributions of Distances and Positions and Fourier Transform.
We constructed a histogram of the distances between genes for all SC pairs in E. coli. The histogram was transformed into a continuous probability density by using a Gaussian smoothing window (σ = 4 kb) and normalizing the total density over the entire genome to 1. A discrete Fourier transform of the data was computed from 0 to 1,000 kb by using a Tukey window to taper the ends (ratio of 0.5 for tapered to untapered length). The periodicity is independent of the maximum distance value. We calculated the statistical significance by repeating the smoothing and Fourier analysis on 10,000 randomizations in which the positions of the operons involving SC paired genes [determined from Price et al. (39)] were randomized within their chromosomal arc. The P value was determined by counting the number of randomizations with a Fourier peak as strong as or stronger than the 117-kb SC pair peak.
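The main steps of this analysis (Gaussian smoothing, Tukey tapering, Fourier amplitude at a candidate period, and an empirical P value) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: a 1-kb grid, a Gaussian kernel truncated at 4 standard deviations, a Tukey parameter of 0.5 interpreted as the tapered fraction of the window, and a Fourier sum evaluated directly at the period of interest. All function names are hypothetical, and the operon-level randomization itself is not reproduced.

```python
import cmath
import math
from typing import List, Sequence


def smoothed_density(distances_kb: Sequence[float], max_kb: int,
                     sigma_kb: float = 4.0) -> List[float]:
    """Gaussian-smoothed density of distances on a 1-kb grid, normalized to sum to 1."""
    density = [0.0] * (max_kb + 1)
    for d in distances_kb:
        lo = max(0, math.floor(d - 4 * sigma_kb))   # kernel truncated at 4 sigma
        hi = min(max_kb, math.ceil(d + 4 * sigma_kb))
        for x in range(lo, hi + 1):
            density[x] += math.exp(-0.5 * ((x - d) / sigma_kb) ** 2)
    total = sum(density)
    return [v / total for v in density] if total > 0 else density


def tukey_window(n: int, alpha: float = 0.5) -> List[float]:
    """Tukey (tapered cosine) window; alpha is the tapered fraction (assumed)."""
    if n < 2 or alpha <= 0:
        return [1.0] * n
    edge = alpha * (n - 1) / 2.0
    window = []
    for i in range(n):
        k = min(i, n - 1 - i)                       # distance to the nearest end
        window.append(0.5 * (1 - math.cos(math.pi * k / edge)) if k < edge else 1.0)
    return window


def fourier_amplitude(signal: Sequence[float], period_kb: float) -> float:
    """Amplitude of the tapered, mean-subtracted signal at one period (1-kb grid)."""
    mean = sum(signal) / len(signal)
    window = tukey_window(len(signal))
    acc = sum(
        w * (s - mean) * cmath.exp(-2j * math.pi * i / period_kb)
        for i, (s, w) in enumerate(zip(signal, window))
    )
    return abs(acc)


def empirical_p_value(observed: float, null_values: Sequence[float]) -> float:
    """Fraction of randomizations with a peak as strong as or stronger than observed."""
    return sum(1 for v in null_values if v >= observed) / len(null_values)
```

In use, `fourier_amplitude` would be applied to the observed density and to the density from each randomized gene arrangement, and the resulting amplitudes passed to `empirical_p_value`.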
The density of SC pairs was computed by counting the number of SC pairs involving genes at each position along the chromosome, smoothing with the Gaussian window (σ = 8 kb), and normalizing by the overall gene density. The 1D grid is defined as a set of positions n × (grid spacing) + p along the chromosome, where the grid spacing is the distance between grid points (the period), p is the offset (or phase) (set separately for each arc), and n is an integer. We evaluate the fit of the distributions to the grid using the sum of the distances of each peak to the nearest grid point (over all choices of p for each grid spacing) as the error measure (see SI Fig. 8).
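The grid-fit error can be sketched as below. The text does not spell out how the choices of p are combined, so this example assumes the error for a given spacing is the minimum, over phases scanned in fixed steps, of the summed peak-to-grid distances; the function name and the 1-kb phase step are likewise assumptions.

```python
from typing import Sequence, Tuple


def grid_fit_error(peaks_kb: Sequence[float], spacing_kb: float,
                   phase_step_kb: float = 1.0) -> Tuple[float, float]:
    """Return (error, best_phase) for a grid with points at n * spacing + phase.

    error is the sum over peaks of the distance to the nearest grid point,
    minimized over the scanned phases (assumed interpretation).
    """
    best_error = float("inf")
    best_phase = 0.0
    n_steps = max(int(spacing_kb / phase_step_kb), 1)
    for step in range(n_steps):
        phase = step * phase_step_kb
        error = 0.0
        for x in peaks_kb:
            r = (x - phase) % spacing_kb        # offset past the previous grid point
            error += min(r, spacing_kb - r)     # nearest grid point may be on either side
        if error < best_error:
            best_error, best_phase = error, phase
    return best_error, best_phase
```

Scanning a range of spacings with this function and plotting the error against the spacing would show a minimum at the period that best matches the observed peaks.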
Expression Correlation.
We calculated an average of the absolute transcript level for wild-type standard growth conditions (4-morpholinepropanesulfonic acid minimal glucose, doubling time 28 h) using five Affymetrix (Santa Clara, CA) microarray data sets extracted from the ASAP database [www.genome.wisc.edu/tools/asap.htm, Allen et al. (16)]. These data were smoothed by using a Gaussian window (σ = 6 kb) and normalized by the overall gene density as above. We calculated the Pearson correlation coefficient of the smoothed data with the pair position density, sampling once every 12 kb to avoid smoothing artifacts (and averaging over all choices of the sampling phase). P was computed by using Student's t test with n − 2 degrees of freedom (where n is the number of data points).
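The correlation test can be illustrated with a small sketch that returns the Pearson coefficient r together with the statistic t = r·sqrt((n − 2)/(1 − r²)), which is compared against a Student's t distribution with n − 2 degrees of freedom. The function name is hypothetical; the inputs stand for the two smoothed profiles sampled once every 12 kb, and the lookup of P from the t distribution is left to a statistics library or table.

```python
import math
from typing import Sequence, Tuple


def pearson_with_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, int]:
    """Return (r, t, degrees_of_freedom) for paired samples x and y."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1.0:
        t = math.copysign(math.inf, r)          # perfect correlation
    else:
        t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t, n - 2
```

Sampling at intervals larger than the smoothing width matters here because neighboring smoothed points are not independent, which would otherwise inflate n and the apparent significance.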
Transcription and Conservation Pair Sets.
We constructed pair sets based on the levels of transcription (Ti) and conservation (Ci) of genes in E. coli (GE.coli), with i ∈ GE.coli, by using log-phase transcript level from Allen et al. (16) for transcription and the number of orthologs of a gene (using best bidirectional orthologs from KEGG Sequence Similarity Database) for conservation. Each pair in the transcription pair set was chosen by randomly selecting two genes from GE.coli, where the probability of selecting gene i is pi = Ti/Ttot, with Ttot = Σi Ti. Similarly, for selecting pairs in the conservation pair set we used probabilities pi = Ci/Ctot. Distance distributions and Fourier spectra were calculated as for the SC pairs.
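The weighted pair selection can be sketched as follows, with each gene drawn with probability proportional to its weight (transcript level or ortholog count). The function name and the rule that a pair must contain two distinct genes are assumptions made for this example.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_weighted_pairs(weights: Dict[str, float], n_pairs: int,
                          seed: Optional[int] = None) -> List[Tuple[str, str]]:
    """Draw gene pairs where P(gene i) = w_i / sum of all weights."""
    rng = random.Random(seed)
    genes = list(weights)
    w = [weights[g] for g in genes]
    if sum(1 for v in w if v > 0) < 2:
        raise ValueError("need at least two genes with positive weight")
    pairs: List[Tuple[str, str]] = []
    while len(pairs) < n_pairs:
        # random.choices normalizes the weights internally.
        a, b = rng.choices(genes, weights=w, k=2)
        if a != b:                              # assumption: pairs hold distinct genes
            pairs.append((a, b))
    return pairs
```

Passing transcript levels as `weights` yields a transcription pair set and passing ortholog counts yields a conservation pair set; the resulting pairs can then go through the same distance and Fourier analysis as the SC pairs.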
Pair sets limited to the top k transcribed genes were created by choosing i ∈ GE.coli(k, T), where GE.coli(k, T) is the set of top k transcribed genes. Similarly, we defined pairs for the top k conserved genes by sampling from a subset GE.coli(k, C) and for the top k SC genes by taking the subset of the initial SC pairs in which both genes are elements of GE.coli(k, SC), the set of k genes most represented in the initial SC pair set.
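The top-k subsets described above can be sketched with three small helpers: one ranks genes by a score (transcription or conservation), one finds the k genes most represented in a pair set, and one keeps only the pairs whose genes both fall in an allowed set. The names are hypothetical and ties are broken arbitrarily.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def top_k_by_score(scores: Dict[str, float], k: int) -> Set[str]:
    """The k genes with the highest score (e.g., transcript level)."""
    ranked = sorted(scores, key=lambda g: scores[g], reverse=True)
    return set(ranked[:k])


def top_k_most_represented(pairs: Iterable[Pair], k: int) -> Set[str]:
    """The k genes that appear in the largest number of pairs."""
    counts = Counter(g for pair in pairs for g in pair)
    return {g for g, _ in counts.most_common(k)}


def pairs_within(pairs: Iterable[Pair], allowed: Set[str]) -> List[Pair]:
    """Keep only the pairs in which both genes belong to the allowed set."""
    return [(a, b) for a, b in pairs if a in allowed and b in allowed]
```

For the transcription and conservation sets, the weighted sampling would be restricted to the genes returned by `top_k_by_score`; for the SC set, `pairs_within` would be applied with the output of `top_k_most_represented`.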
| Footnotes |
|---|
Abbreviations: SC, statistically correlated.
¶To whom correspondence should be addressed. E-mail: dsegre{at}bu.edu
Freely available online through the PNAS open access option.
Author contributions: M.A.W., G.M.C., and D.S. designed research; M.A.W., P.K., and D.S. performed research; P.K. contributed new reagents/analytic tools; M.A.W., P.K., G.M.C., and D.S. analyzed data; and M.A.W. and D.S. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/cgi/content/full/0610776104/DC1.
© 2007 by The National Academy of Sciences of the USA