Quantifying the mechanisms for segmental duplications in mammalian genomes by statistical analysis and modeling

Zhou and Mishra. 10.1073/pnas.0407957102.

Supporting Information

Files in this Data Supplement:

Supporting Data Set 1
Supporting Data Set 2
Supporting Data Set 3
Supporting Data Set 4
Supporting Appendix
Supporting Figure 5
Supporting Figure 6
Supporting Figure 7
Supporting Figure 8
Supporting Figure 9




Fig. 5. Schematic definitions of gap, shift, and internal indel. A gap is the distance from the matching homologous repeats in the flanking regions to the duplication boundary. Gaps are expected when the duplication boundaries or repeat boundaries are not annotated precisely or when mutations have accumulated densely in a small fragment within the duplicated region. A shift is the difference in the positions of the matching homologous repeats in the flanking regions. Some of the large shifts may be caused by the random pairing of Alu repeats in the flanking regions that are not related to duplications. Other shifts could be due to insertions and deletions accumulated after the initial duplication event. Internal indels are the gaps in the alignment of the duplicated segments, caused by insertions and deletions after the duplication event.





Fig. 6. Analyses of the gaps between the matching homologous repeats in the flanking regions and the mapped duplication boundaries. We measured the gap sizes in all of the duplication pairs with matching Alu repeats (+/+) in their flanking regions. (A and A') The distribution of the gap sizes, shown as a histogram. The majority of the gaps are very small (≈50% are smaller than 10 bp). To characterize the larger gaps (>10 bp), the gap sequences are aligned by using dynamic programming (match, 1; mismatch, –0.5; gap, –0.5). (B and B') The homology levels (proportion of matched positions in the alignment) of the gap sequences estimated from the alignment results (red bars), in comparison with the estimated homology levels of random genomic sequences of similar sizes (blue bars). The homology levels of the gap sequences form a bimodal distribution (Mode 1 and Mode 2). Mode 1 is similar to the random sequence results, and Mode 2 is significantly larger (close to 1). The presence of Mode 2 could be explained by imprecision in the mapped duplication boundary positions. Mode 1 may have been caused by the random pairing of Alu repeats in the flanking regions that are not related to duplications, or by erosion of the duplicated sequences due to mutation accumulation after the initial duplication events. (C and C') The size distribution of the gaps that have relatively low homology levels (those from Mode 1). Very few long gaps have a low homology level, suggesting that most of the long gaps are due to inaccuracies in the boundary mapping. In contrast, the proportion of shorter gaps from Mode 1 is very large. Because, under a fixed mutation rate, a shorter fragment is more likely to show a low homology level by chance, this observation is consistent with the presence of mismapped small fragments that are slightly more mutated but should be part of the duplicated regions.
It is interesting to note that, in the hg16 data set, the gap size distribution is more skewed toward zero than in the hg15 data set; the bimodality in the distribution of the homology levels from larger gaps is less pronounced, and the proportion of larger gaps with low homology levels is smaller. These observations may reflect improved duplication mapping in the later assembly version (hg16).
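The homology level used in Fig. 6 (proportion of matched positions in the alignment) can be illustrated with a minimal sketch. The caption gives only the scoring scheme (match, 1; mismatch, –0.5; gap, –0.5); the choice of a global (Needleman-Wunsch style) alignment with a linear gap penalty, the tie-breaking order, and the function name align_homology are assumptions made for illustration and may differ from the procedure actually used.

```python
def align_homology(a, b, match=1.0, mismatch=-0.5, gap=-0.5):
    """Globally align a and b (assumed alignment mode); return (score, homology).

    homology = matched columns / total alignment columns.
    """
    n, m = len(a), len(b)
    # score[i][j]: best score for aligning a[:i] with b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # back[i][j]: move that produced score[i][j]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        back[i][0] = "up"
    for j in range(1, m + 1):
        score[0][j] = j * gap
        back[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            candidates = (
                (score[i - 1][j - 1] + sub, "diag"),  # align a[i-1] with b[j-1]
                (score[i - 1][j] + gap, "up"),        # a[i-1] against a gap
                (score[i][j - 1] + gap, "left"),      # b[j-1] against a gap
            )
            # On ties, max keeps the first candidate, so "diag" is preferred.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back to count alignment columns and matched columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            matches += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif move == "up":
            i -= 1
        else:
            j -= 1
        columns += 1
    homology = matches / columns if columns else 0.0
    return score[n][m], homology


# Toy sequences, not taken from the data set.
print(align_homology("ACGT", "ACGA"))
```

Applied to each pair of gap sequences longer than 10 bp, and to random genomic sequences of similar sizes, a function of this kind would yield the two sets of homology levels compared in B and B'.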





Fig. 7. Analyses of the shifts between the positions of the matching homologous repeats in the duplication flanking regions. (A and A') The distribution of the shift sizes between the matching Alu repeats in the duplication flanking regions. In most cases, the shift sizes are small (≈70% are smaller than 40 bp in hg16). The shifts in the positions of the matching repeats could have been caused by insertions and deletions after the initial duplication events or by the random pairing of Alu repeats in the flanking regions that are not related to the duplication process. (B and B') To examine the shift sizes expected from insertion and deletion events after duplication, we analyzed the internal indel sizes in the duplication regions aligned by LAGAN (1). (C and C') The shift sizes expected from the random pairing of Alus are computed from randomly paired genomic sequences of the same size (500 bp). The distribution of shift sizes (A and A') has a shape intermediate between the internal indel distribution (B and B') and the random pairing distribution (C and C'): both the shift size distribution and the internal indel distribution are skewed toward zero, but the distribution of shift sizes is flatter, possibly because of the presence of randomly paired Alu repeats in the flanking regions (C and C'). Because we have included the random pairing case in our model (see Appendix), we expect that such cases will be removed as genomic background and will not enter into our estimate of the contribution of the repeats to the duplication process. To test our assumption that a considerable proportion of the shifts are caused by insertions and deletions after duplication, we examined the correlation between duplication age and shift size.
(E and E') As expected, the age distribution of the duplications containing large internal indels (>median) shows that older duplications (with lower duplication identity levels) are more likely to have larger insertions and deletions inside their duplicated regions. (D and D') Similarly, older duplications also tend to have larger shift sizes (>median) in their flanking regions than younger duplications. These data are noisier because the sample size is much smaller than that of the internal indel data set. The evolution of the shift in the flanking region is therefore consistent with the evolution of insertion and deletion events inside the duplicated regions. As in Fig. 6, in the hg16 data set the shift size distribution is more similar to the internal indel distribution, and the correlation between duplication age and shift size is stronger than in the hg15 data set, suggesting better duplication mapping because of the improvement in assembly accuracy.

  1. Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Green, E. D., Sidow, A. & Batzoglou, S. (2003) Genome Res. 13, 721–731.




Fig. 8. The frequencies of various repeat subfamilies detected in the duplication flanking regions in the human (hg16 and hg15), mouse (mm3), and rat (rn3) genomes. The fractions of the flanking sequences containing repeats of each subfamily are compared with two control sets: sequences randomly selected from the whole genome and sequences randomly selected from inside the duplication regions. The names of the different subfamilies of long interspersed transposable element 1 (L1), Alu in the human genome, and short interspersed transposable elements (SINEs) in the rodent genomes are listed on the x axis, roughly ordered according to their age (from younger to older). Two-sample t tests are used to test the statistical significance of the repeat overrepresentation in the flanking regions compared with each of the two controls. **, The frequency in the flanking regions is significantly higher than that for both of the controls, with P < 0.05. The statistics are based on the following sample sizes. hg16: random regions, 20,918; inside the duplication regions, 13,321; flanking sequences, 9,788. hg15: random regions, 18,864; inside the duplication regions, 9,562; flanking sequences, 7,652. mm3: random regions, 15,824; inside the duplication regions, 6,766; flanking sequences, 3,288. rn3: random regions, 6,274; inside the duplication regions, 3,631; flanking sequences, 1,652.
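The comparison behind the ** marks in Fig. 8 can be sketched as follows. The caption states only that two-sample t tests were used with a threshold of P < 0.05; the unequal-variance (Welch) form, the one-sided alternative, the normal approximation to the t distribution (reasonable only for samples in the thousands, as here), and the function names are all assumptions made for illustration. The repeat counts in the usage lines are invented; only the sample sizes are the hg16 values quoted in the caption.

```python
import math


def t_test_from_counts(k1, n1, k2, n2):
    """Welch-style two-sample t statistic for 0/1 indicator data.

    k1 of n1 sequences in sample 1 contain the repeat; k2 of n2 in sample 2.
    Returns (t, one-sided p-value for 'sample 1 frequency is higher').
    The p-value uses a normal approximation (an assumption for large n).
    """
    p1, p2 = k1 / n1, k2 / n2
    # Unbiased sample variance of a 0/1 variable with mean p.
    v1 = p1 * (1 - p1) * n1 / (n1 - 1)
    v2 = p2 * (1 - p2) * n2 / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0:
        raise ValueError("both samples have zero variance")
    t = (p1 - p2) / se
    p_value = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p_value


def overrepresented(flank, random_ctrl, inside_ctrl, alpha=0.05):
    """True if the flanking frequency is significantly higher than BOTH controls.

    Each argument is a (count_with_repeat, sample_size) pair.
    """
    return all(
        t_test_from_counts(*flank, *ctrl)[1] < alpha
        for ctrl in (random_ctrl, inside_ctrl)
    )


# Sample sizes are the hg16 values from the caption; the counts are hypothetical.
flank = (1500, 9788)
random_ctrl = (2500, 20918)
inside_ctrl = (1700, 13321)
print(overrepresented(flank, random_ctrl, inside_ctrl))
```

Requiring significance against both controls mirrors the caption's criterion for **: a subfamily that is enriched relative to the genome-wide background but not relative to the interior of the duplicated regions would not be marked.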





Fig. 9. The fit of the model to the distribution of Alu and long interspersed transposable element 1 (L1) repeats in the duplication flanking regions in the human genome hg15 assembly, and to the distribution of L1 repeats in the mouse (mm3) and rat (rn3) genomes. The fractions of flanking region pairs with different repeat distribution patterns are computed in each group of sequence divergence levels. We estimated the parameters and fitted our model to the distributions of Alu and L1 in the flanking sequences of the duplication pairs, respectively. The known evolution parameters for Alu and L1 in the human genome are listed in Table 4. The evolution parameters for L1 in the rodent genomes are approximated by doubling the mutation rate and tripling the L1 insertion rate of the human genome. The symbols represent the real data, and the smooth lines are the theoretical trajectories of the model for the optimal choices of the parameters h1 and f1. The total numbers of flanking region pairs are as follows: hg15, 3,836; mm3, 1,644; rn3, 826.

This Article

  1. PNAS March 15, 2005 vol. 102 no. 11 4051-4056