EVOLUTION
Quantifying the mechanisms for segmental duplications in mammalian genomes by statistical analysis and modeling

*Department of Biology,
Courant Institute of Mathematical Sciences, and
School of Medicine, New York University, New York, NY 10003
Edited by Charles R. Cantor, Sequenom, Inc., San Diego, CA, and approved January 19, 2005 (received for review October 26, 2004)
| Abstract |
|---|
30% of the recent human segmental duplications were caused by a recombination-like mechanism, among which 12% were mediated by the most recently active repeat, Alu. But a significant proportion of the duplications are caused by some mechanism independent of the repeat distribution. A less sure but similar picture is found in the rodent genomes. A further analysis on the physical features of the flanking sequences suggests that one of the uncharacterized duplication mechanisms shared by the mammalian genomes is surprisingly well correlated with the physical instability in the DNA sequences.
segmental duplication | genomic instability | interspersed transposable elements | Markov models | copy number fluctuation
3.5–5% of the human genome (1, 2), 1.2–2% of the mouse genome (3, 4), and 3% of the rat genome (5) consist of recent segmental duplications (genomic sequence blocks whose identity level is >90% and length is >1 kb). Nonetheless, a clear delineation of the mechanisms responsible for those recent duplications in the mammalian genomes remains elusive: unequal crossovers usually cause tandem duplications, and the long interspersed transposable element 1 (L1) retrotransposon machinery can only cause interspersed duplications of <1 kb (6). Recently, a detailed analysis of the duplication breakpoints in a specific genomic region showed that some segmental duplications may have been caused by Alu-mediated recombination events (7). Later, Bailey et al. (8) reported that a significant portion of the interspersed segmental duplications terminated within an Alu repeat. These results led to the suggestion that the primate-specific burst of Alu retrotransposition activity is the primary cause of the recent boom of segmental duplications in the human genome (8). However, given the highly dynamic nature of the Alu repeats in the recent past (9), an estimate of their contribution to the segmental duplication process could be biased if their evolutionary dynamics are not taken into consideration.
To quantitatively assess the relative contribution of the Alu recombination mechanism to the process of segmental duplication without bias, we developed a dynamic mathematical model that formulates the evolution of the repeat distribution in the duplication flanking regions (see Fig. 1 for the definition of flanking regions) as a Markov process, with time measured by the divergence level that the duplicated sequences have accumulated since duplication. The results from the model suggest that, although the duplication flanking regions may have been involved in Alu recombination significantly more often than pairs of randomly selected genomic regions, Alu recombination contributes to only 10–12% of the segmental duplications in the human genome.
| Methods |
|---|
Repeat Analysis. The repeats were identified according to the genome annotation database (http://genome.ucsc.edu). In this study, we considered a repeat to be present in a flanking sequence if it was longer than 100 bp. For a pair of flanking regions to be identified as having a common repeat (labeled as +/+), the repeat sequences had to be on the same side of the duplicated segments, face the same direction, and share at least 100 bp of high homology. For the Alu family, sequences from any subfamilies were treated as sharing high homology (12, 13). For the L1 family, however, only sequences from the same subfamily were treated as highly homologous (14). In our model, the frequency of +/+ flanking region pairs in each age group was further normalized by subtracting the average frequency of repeats inside the duplicated segments, assuming that the repeats inside the duplicated region resulted from some repeat-independent mechanism and were uniformly distributed.
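The classification rule above can be illustrated with a minimal Python sketch. The `Repeat` record, its field names, and the `classify_pair` helper are hypothetical and are not part of the original analysis. The sketch also assumes at most one annotated repeat per flanking region, uses the 100-bp presence threshold as a stand-in for the shared-homology requirement, and labels any pair of non-matching repeats as +/-.

```python
from dataclasses import dataclass
from typing import Optional

MIN_LEN = 100  # bp threshold stated in the text


@dataclass(frozen=True)
class Repeat:
    """Hypothetical record for one annotated repeat in a flanking region."""
    family: str     # e.g. "Alu" or "L1"
    subfamily: str  # e.g. "AluY" or "L1PA2"
    side: str       # "left" or "right" of the duplicated segment
    strand: str     # "+" or "-"
    length: int     # bp of the repeat inside the flanking region


def homologous(a: Repeat, b: Repeat) -> bool:
    """Alu matches at the family level; L1 only within the same subfamily."""
    if a.family != b.family:
        return False
    if a.family == "L1":
        return a.subfamily == b.subfamily
    return True


def classify_pair(a: Optional[Repeat], b: Optional[Repeat]) -> str:
    """Return '+/+', '+/-' or '-/-' for one pair of flanking regions."""
    # A repeat counts as present only if it is longer than the threshold.
    a = a if a is not None and a.length > MIN_LEN else None
    b = b if b is not None and b.length > MIN_LEN else None
    if a is None and b is None:
        return "-/-"
    if a is None or b is None:
        return "+/-"
    # Assumption: both repeats exceeding MIN_LEN stands in for
    # "share at least 100 bp of high homology".
    if homologous(a, b) and a.side == b.side and a.strand == b.strand:
        return "+/+"
    return "+/-"
```

For instance, two Alu repeats from different subfamilies on the same side and strand would be labeled +/+ under these assumptions, whereas two L1 repeats from different subfamilies would not.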
Model Parameters. All but two of the model parameters could be derived from the existing literature. They are enumerated in Table 4 in Supporting Appendix, which is published as supporting information on the PNAS web site. We chose a flanking region size that is large enough to minimize the effect of mapping and annotation errors (by allowing some gaps and shifts; see Figs. 5–7, which are published as supporting information on the PNAS web site) and yet sufficiently restrictive to distinguish the signals from the genomic background noise. To establish the most appropriate size of the flanking regions to be used in the study, we applied the model to the data sets generated from several different flanking region lengths (200, 500, 1,000, and 2,000 bp). The estimate of repeat recombination, measured by the h1 value (see Model for definition), peaks in the 500- and 1,000-bp data sets, suggesting that these two sizes are the optimal choices. The data presented in this report used a flanking region length of 500 bp.
Model Evaluation. We used a cross-validation method to test the performance of the model and the confidence intervals of the estimated parameters. The complete data set was randomly partitioned into two equally sized groups: an in-sample set to estimate the parameters and an out-of-sample set to cross-validate and measure the significance of the estimated parameters. The goodness of fit was tested in the out-of-sample data by using the parameters estimated from the in-sample data (for details, see Supporting Appendix). In Results, we report the mean values and standard deviations of the parameters estimated in 50 independent trials.
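The repeated half-split procedure can be sketched as follows. This is an illustration only: the `estimate` and `score` callables are hypothetical placeholders for the model fit and the goodness-of-fit measure described in Supporting Appendix, and the toy estimator used here (the fraction of +/+ pairs) is an assumption chosen to keep the example self-contained.

```python
import random
import statistics


def split_half(items, rng):
    """Randomly partition items into two equally sized groups."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    # With an odd number of items, one item is left out of both groups.
    return shuffled[:mid], shuffled[mid:2 * mid]


def cross_validate(items, estimate, score, trials=50, seed=0):
    """Estimate on the in-sample half, score on the out-of-sample half."""
    rng = random.Random(seed)
    params, scores = [], []
    for _ in range(trials):
        in_sample, out_sample = split_half(items, rng)
        p = estimate(in_sample)
        params.append(p)
        scores.append(score(p, out_sample))
    return statistics.mean(params), statistics.stdev(params), scores


# Toy stand-ins (assumptions, not the model from the text).
def frac_shared(labels):
    return sum(1 for x in labels if x == "+/+") / len(labels)


def abs_error(p, labels):
    return abs(p - frac_shared(labels))
```

Calling `cross_validate(labels, frac_shared, abs_error)` on a list of configuration labels returns the mean and standard deviation of the estimate over 50 random splits, together with the out-of-sample scores.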
Stability and Flexibility Computation. The helix stability of the DNA duplex was estimated by the average strand dissociation Gibbs free energy (ΔG) in overlapping 50-bp windows, computed by the nearest neighbor model experimentally verified by Breslauer et al. (15). The DNA flexibility was estimated by the average twist angle in overlapping 50-bp windows computed by the method in ref. 16.
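A windowed average over dinucleotide steps can be sketched as below. The function takes the per-step table as an argument because the published nearest neighbor ΔG values (15) and twist angles (16) are not reproduced here. The `PLACEHOLDER_TABLE` is invented purely so that the example runs, and averaging over the 49 steps of a 50-bp window is an assumption about how the per-window mean is formed.

```python
def window_averages(seq, step_values, window=50):
    """Average a per-dinucleotide value over overlapping windows (stride 1).

    step_values maps each dinucleotide (e.g. "AT") to a number; a window of
    `window` bases contains `window - 1` dinucleotide steps.
    Raises KeyError for bases missing from the table (e.g. "N").
    """
    seq = seq.upper()
    steps = [step_values[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n = window - 1
    if len(steps) < n:
        return []
    total = sum(steps[:n])
    averages = [total / n]
    for i in range(n, len(steps)):
        total += steps[i] - steps[i - n]  # slide the window by one base
        averages.append(total / n)
    return averages


# Placeholder values only (NOT the published parameters): 1.0 plus the
# number of G/C bases in the step, so AT-rich steps score lowest.
PLACEHOLDER_TABLE = {
    a + b: 1.0 + (a in "GC") + (b in "GC")
    for a in "ACGT"
    for b in "ACGT"
}
```

With a real parameter table, the same function would yield the per-window ΔG profile, and a second table of twist angles would yield the flexibility profile.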
| Results |
|---|
However, to test the above hypothesis, one needs to consider the highly active history of the overrepresented repeats in the duplication flanking regions and the reliability of the genome assembly and duplication mapping data. Therefore, we analyzed the hypothesis in detail through a mathematical model that incorporates the evolutionary dynamics of the active repeats and minimizes the effect of assembly or mapping errors.
Model. The repeats that caused duplications by recombination should reside on the same side of the duplicated segment, face the same direction, and share enough homologous sequence. Therefore, intuitively, we could directly estimate the contribution of repeat recombination to duplication by measuring the excess of such repeat configurations in the flanking regions of the newly duplicated segments, before mutation events erode the sequence. However, the newly duplicated segments are almost identical and are, therefore, most prone to genome assembly errors, making the estimates unreliable. In contrast, if we used the "older" duplications, which are less prone to assembly errors, we could overestimate or underestimate the contribution of the repeats. For instance, actively amplifying transposable repeats can be inserted into the flanking regions after duplication and can form a configuration that falsely suggests a recombination event, resulting in overestimation of the repeat contribution. Conversely, the repeats in the flanking regions can lose their initial configuration after the recombination event because of point mutations and deletions, leading to underestimation of the repeat contribution.
To resolve the above dilemma, we incorporated the evolutionary dynamics of the repeats and the duplicated segments in our model. Over time, all of the repeats in the flanking regions, regardless of whether they have caused the duplication by recombination, are subject to changes in their configurations. Assuming that the mechanisms of segmental duplication and their relative contribution have been well conserved over time, the current repeat configuration in the flanking regions of duplications of different ages may be viewed as sampled from its stationary distribution. If the evolutionary rates of the repeats and the duplicated segments are known, the relative contribution of repeat recombination to segmental duplications can be estimated from the stationary distribution.
To explain the model, we begin by introducing some notation. In our model, each pair of duplication flanking regions is assigned to a state specified by the configuration of the interspersed repeats in the flanking regions and the age of the duplication event. There are three possible repeat configurations in a pair of flanking regions (defined in Fig. 1): the flanking regions share a common repeat when they contain a repeat from the same family in the same direction and with sufficient length of homology (+/+) (see Methods); one of them has a repeat and the other has no repeat or a repeat in a different direction (+/−); or neither of them contains repeats (−/−). The ages of the duplication events are estimated by the sequence divergence level between the duplicated segments and are grouped into bins of a fixed divergence interval. A flanking region pair is assigned to the age group k according to the divergence level d of the corresponding duplicated segments. The divergence interval is chosen to be 1% based on the sample size needed in each age group to draw statistical conclusions without being too affected by corrupting noise (see Table 5 in Supporting Appendix for details). This partition results in eight age groups after the duplications with extremely low divergence levels (d < 0.5%) are omitted because they are prone to assembly errors.
In the following text, a state vector represents the frequencies of flanking region pairs in the kth age group with the different configurations of the repeats from family X at evolution time t, and a second vector represents the configurations of repeat X in the flanking regions of the new duplications at evolution time t. Let h1 = 1 − h0 represent the fraction of the duplications caused by the repeat recombination mechanism, and, among those, consider the fraction mediated by repeat family X. (The product of h1 and this family-specific fraction represents the relative contribution of the repeat family X to the duplications through the recombination-like mechanism.) The vector for the new duplications can be expressed by using h1, the family-specific fraction, and the X repeat distribution in randomly paired sequences from the genome (RX) (for details, see Supporting Appendix). Our model tests the following hypotheses: null hypothesis, recombination between repeats from family X does not contribute to segmental duplications, i.e., the relative contribution is zero; alternative hypothesis, recombination between repeats does contribute to segmental duplications, i.e., the relative contribution is greater than zero.
The model describes the dynamically changing state distribution of the flanking regions as a Markov process over evolutionary time under the effect of accumulating mutations and repeat amplifications. Table 1 lists in detail all of the possible transitions between states in a small time interval (Δt) and the corresponding transition probabilities expressed in the evolutionary rates of the repeats and duplicated segments. A schematic representation of the model integrating the details in a small example is displayed in Fig. 2.
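To make the Markov formulation concrete, the following toy sketch propagates a three-state configuration distribution (+/+, +/−, −/−) across eight age groups. It is not the model of Table 1: the transition structure, the rate names `decay`, `insert`, and `insert_match`, the numerical values, and the mixture form of the initial distribution are all assumptions made for illustration.

```python
STATES = ("+/+", "+/-", "-/-")


def transition_matrix(decay, insert, insert_match):
    """Hypothetical one-step transitions between repeat configurations.

    decay        : probability that an existing repeat configuration erodes
    insert       : probability that a repeat-free pair gains one repeat
    insert_match : probability that a +/- pair gains a matching repeat
    """
    return [
        [1 - decay, decay, 0.0],                         # from +/+
        [insert_match, 1 - insert_match - decay, decay],  # from +/-
        [0.0, insert, 1 - insert],                        # from -/-
    ]


def initial_distribution(contribution, background):
    """Assumed mixture: a fraction `contribution` of new duplications start
    as +/+, and the rest follow the background distribution."""
    rest = 1 - contribution
    return [
        contribution + rest * background[0],
        rest * background[1],
        rest * background[2],
    ]


def step(dist, matrix):
    """Advance the distribution by one age group (row vector times matrix)."""
    return [sum(dist[i] * matrix[i][j] for i in range(3)) for j in range(3)]


def age_profile(x0, matrix, groups=8):
    """Distributions for age groups 1..groups, starting from x0."""
    profile = [x0]
    for _ in range(groups - 1):
        profile.append(step(profile[-1], matrix))
    return profile
```

Under these assumptions, comparing `age_profile` computed with a contribution of zero against one computed with a positive contribution shows how an initial excess of +/+ pairs decays with duplication age, which is the kind of signal the model is fitted to.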
, where k
0. For a detailed example of stationary states, see Fig. 2. Under those assumptions, we can evaluate the two free parameters of the model (h1 and the fraction of the recombination-caused duplications mediated by repeat family X) based on the observed data if the evolutionary rates are known (see Supporting Appendix for details).
We applied the model to the duplication flanking regions in the human genome, using the distribution of their states specified by repeats from the Alu (X = Alu) and L1 (X = L1) families, respectively, whose evolutionary rates have been well characterized (see Table 4) (9). Two different data sets (hg15 and hg16) (1, 2) were used. The free parameters in the model and their corresponding standard deviations were determined by cross-validation (see Methods). For both data sets (Fig. 3 and Fig. 9, which is published as supporting information on the PNAS web site), the model with the estimated parameters fit the state distribution of the flanking regions specified by Alu repeats exceedingly well (P > 1 − 10⁻⁴ in the goodness-of-fit test; see Supporting Appendix), whereas the null model (with no contribution from repeat recombination; see Supporting Appendix) could not explain the observed Alu distribution adequately (P = 0.04). As expected, the null model explained the L1 distribution in the flanking regions quite well (P = 0.86), although the model with the estimated parameters did slightly better (P > 1 − 10⁻⁴). See Table 2 for a list of the relative contributions of Alu and L1 by recombination to the recent segmental duplications in the human genome as estimated by the model.
) from the original data set to three control data sets: the permuted data set is created by randomly switching the partners in the flanking region pairs while preserving the total repeat frequencies. The outside and inside data sets are obtained from positions farther outside or inside the duplicated regions, respectively. The results are listed in Table 2. As anticipated by the model, the estimated contributions in the permuted and outside data, where random distribution is expected, are very close to 0, whereas in the inside data, where no random distribution is expected, the estimates are very close to 1 (Table 2). The contribution of Alu recombination to the duplications estimated from the flanking data is 12%, which is significantly higher than the estimates from the permuted and outside data sets. However, the contribution of L1 recombination estimated from the flanking data set is much lower and does not differ significantly from that of either the permuted or the outside data set.
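The permuted control can be sketched in a few lines. The `permute_partners` helper and the representation of each flanking region pair as a two-element tuple are assumptions made for this illustration. Shuffling one side of the pairs keeps every flanking region in the data set, so the total repeat frequencies are preserved while the pairing is randomized.

```python
import random


def permute_partners(pairs, seed=0):
    """Randomly re-pair flanking regions while keeping every region.

    pairs: list of (flank_a, flank_b) tuples, where each flank is any
    description of one flanking region (e.g. its repeat annotation).
    """
    rng = random.Random(seed)
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    rng.shuffle(second)  # switch partners; the multiset of flanks is unchanged
    return list(zip(first, second))
```

Re-estimating the model parameters on the output of `permute_partners` would then give the background level expected when partners are unrelated.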
The hg15 and hg16 data sets were independently mapped by different research groups using different strategies (1, 2), and it has been shown that the earlier map (hg15) contains more artifacts caused by assembly errors than the later one (2). Despite such differences, the model still gives consistent results between the two assemblies. It is also reassuring to find that, for both repeat families, the model estimated that the fraction of the duplications caused by the recombination-like mechanism (h1) is 30% (for details, see Supporting Appendix), although their contributions to the duplication mechanisms are quite different. The consistency in the parameter values suggests the robustness of our model against errors in assembly, mapping, and annotation. This robustness is mostly due to the parsimony of the model and the way in which the model accounts for a reasonable amount of errors and efficiently removes the corrupting noise.
For the mouse and rat genomes, a good estimate of the evolutionary dynamic parameters of the interspersed repeats is still lacking. Furthermore, the available duplication mappings in the rodent genomes are likely to be less accurate because of the unfinished status of the genome assemblies (3, 5). Those factors prevented us from applying the model to the rodent data sets as accurately as we did for the Alu and L1 repeats in the human genome. However, if one approximates the mutation rates in the rodent genomes by doubling the corresponding rates in the human genome and the rodent L1 insertion rate by tripling the human L1 insertion rate, then it is possible to reach a fairly good fit for the L1 distribution in the mouse and rat data sets (Fig. 9). The contribution of the L1 repeats to the recent segmental duplications through the recombination-like mechanism is then estimated at 10% in the rodent genomes.
In conclusion, in all of the mammalian genomes examined, our model estimates that 10–12% of the recent segmental duplications were caused by the recombination between the most active interspersed repeat elements in the genome (Alu in human and L1 in rodents). The results from the model further suggest that the segmental duplications are likely to be caused by multiple mechanisms, and a large fraction (70%) of the duplications are caused by some unknown mechanism independent of the interspersed repeat distributions, which is consistent with the conclusions of ref. 20.
Further Sequence Analysis: Physical Properties. In the process of searching for repeat-independent mechanisms, we discovered an enrichment of DNA sequences that are physically unstable around the duplication boundaries. The physical properties of the DNA duplex play an important role in the initial step of many molecular processes, as shown in transcription (21), replication (22), and the large genome rearrangement events that originate from the chromosomal fragile sites (10, 11). Therefore, it is possible that similar properties can initiate or facilitate the segmental duplication process in the mammalian genomes.
To explore possible repeat-independent explanations and to avoid the bias introduced by the AT-rich regions in Alu and L1 repeats, we analyzed the flanking sequences that do not contain any repeats for their helix stability (15) and DNA flexibility (16) (see Methods for details). These two features are suggested to be the specific characteristics of the fragile sites in the genome where genetic rearrangements frequently occur (10, 11). In the mouse and human data sets, there is a slight decrease of the average helix stability and an increase of the average DNA flexibility at the duplication junction compared with the other regions either inside or outside the duplicated segments (Fig. 4). To test the significance of these observations, we counted the number of duplication junctions (−250 to +250 bp flanking the boundary) that contain sequence sites with both exceptionally low helix stability and exceptionally high flexibility. The criteria for recognizing such a site are that the average ΔG in the 50-bp window centered on it is <1.3 kcal·mol⁻¹·bp⁻¹ (the bottom 0.5% of random genomic sequences) and that the average twist angle is >14° per bp (the top 0.5% of random genomic sequences) (10, 11). We found enrichment of such potential fragile sites in the repeatless duplication flanking regions from all three mammalian genomes compared with the randomly selected genomic regions (Table 3). The enrichment of these characteristic sites is statistically significant in all of the data sets, except in the rat genome, where it is just on the verge of being significant. Interestingly, the significance level increases with the degree of finishing of the genome assemblies, suggesting that the lack of significance in the rat genome could be explained by the incompleteness of the current assembly.
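The two-threshold criterion for a candidate fragile site can be sketched as follows. The function names and the input layout (one ΔG value and one twist value per window position across a junction) are assumptions for illustration, and the statistical test behind Table 3 is not reproduced here.

```python
DG_MAX = 1.3      # kcal/mol per bp: average dG must fall below this
TWIST_MIN = 14.0  # degrees per bp: average twist must exceed this


def fragile_positions(dg_windows, twist_windows):
    """Indices where a window is both unusually unstable and flexible."""
    return [
        i
        for i, (dg, twist) in enumerate(zip(dg_windows, twist_windows))
        if dg < DG_MAX and twist > TWIST_MIN
    ]


def count_fragile_junctions(junctions):
    """Count junctions that contain at least one candidate site.

    junctions: iterable of (dg_windows, twist_windows) pairs, one pair per
    duplication junction (hypothetical input layout).
    """
    return sum(1 for dg, tw in junctions if fragile_positions(dg, tw))
```

Applying `count_fragile_junctions` to the repeatless flanking regions and to randomly selected genomic regions would give the two counts whose comparison is described in the text.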
| Discussion |
|---|
Interspersed segmental duplications are significantly more abundant in the human genome than in the rodent genomes (3–5). It was suggested that the difference is due to the recent burst of primate Alu retrotransposition activity (8). However, the rough estimates from our model suggest that the relative contribution from the most active repeats through the recombination-like mechanism remains more or less constant in the human and rodent genomes. Therefore, the answer to why the genomes have different amounts of segmental duplications is to be sought elsewhere [for example, in the difference in the tolerance for large duplications, the difference in effective population sizes, or the finishing stage of the genome assembly (23)].
Segmental duplications have been shown to be associated with the genome rearrangement events during species evolution (24, 25) and with the copy number fluctuations (26–29) and other rearrangements (30) in genomic sequences during cancer development. Therefore, some of the mechanisms used by segmental duplications, such as recombination mediated by interspersed repeats (31, 32), may be shared by other genomic rearrangement events. The fragile sites we found in the duplication flanking sequences, together with the association of segmental duplications with the breakpoints of the syntenic blocks (24, 25), suggest that another common mechanism could be correlated with specific physical properties of the DNA sequences. In fact, it has been suggested that segmental duplications in yeast are caused by breakage-induced replication triggered by replication fork stalling at the AT-rich replication termination sites (33). Future research on these topics may rely on mathematical models akin to the ones proposed here.
| Acknowledgements |
|---|
| Footnotes |
|---|
This paper was submitted directly (Track II) to the PNAS office.
Abbreviation: L1, long interspersed transposable element 1.
To whom correspondence should be addressed at: NYU Bioinformatics Group, New York University, 715 Broadway, Room 1002, New York, NY 10003. E-mail: mishra{at}nyu.edu.
© 2005 by The National Academy of Sciences of the USA
| References |
|---|