# Inevitability and containment of replication errors for eukaryotic genome lengths spanning megabase to gigabase

Edited by James E. Cleaver, University of California, San Francisco, CA, and approved July 26, 2016 (received for review March 1, 2016)

## Significance

Errors in DNA replication can never be completely avoided. By combining a minimal model that takes into account the positions of replication origins (the regions on the DNA where replication initiates) with experimental evidence, we show that genome size strongly influences the frequency of replicative errors. Our work reveals that (*i*) simple eukaryotes are able to achieve a very low probability of replicative errors by having a moderate number of origins placed at regular intervals; (*ii*) this strategy is ineffective in eukaryotes with larger genomes, such as human, for which replicative errors are inevitable; and (*iii*) in these organisms, even moderate numbers of origins can provide containment of replication errors to very low levels, which can be repaired subsequently.

## Abstract

The replication of DNA is initiated at particular sites on the genome called replication origins (ROs). Understanding the constraints that regulate the distribution of ROs across different organisms is fundamental for quantifying the degree of replication errors and their downstream consequences. Using a simple probabilistic model, we generate a set of predictions on the extreme sensitivity of error rates to the distribution of ROs, and how this distribution must therefore be tuned for genomes of vastly different sizes. As genome size changes from megabases to gigabases, we predict that regularity of RO spacing is lost, that large gaps between ROs dominate error rates but are heavily constrained by the mean stalling distance of replication forks, and that, for genomes spanning ∼100 megabases to ∼10 gigabases, errors become increasingly inevitable but their number remains very small (three or less). Our theory predicts that the number of errors becomes significantly higher for genome sizes greater than ∼10 gigabases. We test these predictions against datasets in yeast, *Arabidopsis*, *Drosophila*, and human, and also through direct experimentation on two different human cell lines. Agreement of theoretical predictions with experiment and datasets is found in all cases, resulting in a picture of great simplicity, whereby the density and positioning of ROs explain the replication error rates for the entire range of eukaryotes for which data are available. The theory highlights three domains of error rates: negligible (yeast), tolerable (metazoan), and high (some plants), with the human genome at the extreme end of the middle domain.

The proper maintenance of genetic information is of fundamental importance to the survival of all organisms, and many molecular mechanisms exist to ensure that the genetic sequence encoded by DNA is maintained unaltered generation after generation (1–3). To preserve the integrity of genetic information and to avoid aberrant ploidy, it is crucial that the entire DNA is copied exactly once; replicating only part of the DNA results in potential corruption of genes, and replicating certain parts of the DNA more than once would perturb chromosome structure and strongly affect gene dosage (4–6). Not surprisingly, regions of underreplicated and overreplicated DNA are common in cancer (7, 8).

DNA replication is a particularly complex process in eukaryotic organisms with large genomes distributed across multiple chromosomes. Multiple checkpoints exist to ensure that, once replication starts, the whole DNA is faithfully replicated before the chromosomes are segregated. Underreplication and overreplication of DNA are prevented by using predefined points of replication initiation called replication origins (ROs) (3, 9).

During late mitosis and the G1 phase of the cell division cycle, each potential RO is “licensed” for a single initiation event by being loaded with minichromosome maintenance proteins 2–7 (MCM2-7) double hexamers. To prevent rereplication of DNA segments, the ability to license new origins ceases before cells enter S phase. During this phase, hundreds to thousands of licensed ROs are activated throughout the genome (10). Bidirectional replication forks (RFs) are established at active ROs, each driven by a single MCM2-7 hexamer, allowing DNA polymerases to copy the DNA (Fig. 1*A*). Despite being highly reliable molecular machines, RFs can, on rare occasions, irreversibly stall (11). The activation of additional ROs can overcome the problem of irreversibly stalled RFs, as a new fork will eventually meet the stalled one, hence replicating all of the intervening DNA. However, if adjacent right-moving and left-moving forks stall and no additional ROs are available between them, the DNA in between the two forks will remain unreplicated (Fig. 1*B*). This phenomenon constitutes a major replication error for the cell, which is commonly called a double-fork stall (DFS) (Fig. 1). The occurrence of DFSs is therefore a key obstacle for cells to either avoid or overcome to maintain replication fidelity. The molecular processes underlying the management of DFSs are an active field of study, and insults to these processes have been associated with different pathologies (11–13).

In our previous work, we introduced a simple probabilistic theory to determine the probability of replication failure arising from DFSs for a given set of ROs in a genome (14). The theory depends on two key assumptions: that the cell has no time constraint in completing the process (i.e., that all licensed ROs are allowed to be activated as necessary) and that there is a constant small probability per nucleotide for each individual replication fork to irreversibly stall. Mathematical analysis of the theory showed that, in organisms with a genome length comparable to yeasts (

In this article, we extend our theory to study much larger genomes (100 Mbp to 10 Gbp), which are typically found in metazoa and plants. Our theory requires as input the positions of ROs along the genome and yields a number of clear predictions concerning the rates of DFSs, using both mathematical and computational approaches. These predictions were tested on available datasets describing RO distribution in one plant (*Arabidopsis*) (15), one invertebrate (*Drosophila*) (16), and two independent human datasets (reporting different human cell lines) (17, 18) (Table S1). Note that the two human datasets were derived using different approaches to RO detection, and hence the number and positions of ROs vary between them. The two datasets are largely compatible, with a reported 70% overlap in genomic sites containing ROs (18) (see also Fig. S1), and can therefore be used to test the robustness of our theory to experimental and biological variation.

Our theoretical and computational analysis leads to a series of direct predictions, which are all found to be consistent with all datasets analyzed, revealing a picture of great simplicity. The robustness of DNA replication in eukaryotes can be maintained so long as the largest replicon (inter-RO distance) is well below the median stall distance

## Results

### The “Central Equation” for Determining Replication Errors.

In our previous work (14), we derived a mathematical equation for the genome-wide probability of DFSs, based on the distribution of ROs and the stalling distance. In *Materials and Methods*, we use the same theoretical framework to derive more general equations that are applicable to genomes containing arbitrarily large replicons. To use our theoretical results, we require detailed information on the location of ROs. A number of datasets have been published that provide the locations of ROs in eukaryotes, along with the total genome length. The datasets analyzed here cover *Saccharomyces cerevisiae* (20), *Schizosaccharomyces pombe* (21), *Arabidopsis thaliana* (15), *Drosophila melanogaster* (16), and five human tissue culture cell lines (IMR-90, HeLa, hESC, iPSC, and K562) from Besnard et al. (17) (denoted by “B” in the following) and Picard et al. (18) (denoted by “P”). Because the work of Picard et al. used more modern techniques (particularly in peak identification), it might be considered a more reliable dataset; comparison with the Besnard et al. dataset is useful in assessing the experimental uncertainties in some of the data.

Because of the very low probability of a DFS in any given replicon, we can show that the statistics of DFSs are Poisson to a very high level of accuracy (*SI Text*), and that the probability of no DFSs genome-wide has the form e^{−*λ*}. We remind the reader that, for a Poisson distribution, *λ* also describes both the mean and the variance of the distribution. For a given genome with *K* ROs, we denote the replicons by the *K* − 1 values indexed by *i* (*i* = 1, …, *K* − 1). These data can then be used in the “central equation” arising from our theory (Eq. **1**). The quantity *λ* contains a single unknown parameter, *N*_{s}, which is inversely proportional to the very small probability of stalling per nucleotide (14).

On the right-hand side of Eq. **1**, we can identify the two distinct contributions of the genome length (first term) and of the RO distribution (second term). Genome length determines a baseline probability of DFSs that can be lowered by increasing the number of ROs and/or changing their distribution along the genome; indeed, as we have shown previously (14), for a given number of ROs, equally distributing them across the genome is the optimal arrangement to minimize the probability of DFSs. Therefore, the different terms on the right-hand side of Eq. **1** establish a hierarchy of contributions to the probability of DFSs, with genome length being the most important factor, followed by RO number and then RO distribution (Fig. 2).

In organisms with relatively small genomes, such as yeasts (∼10 Mbp), an average density of 1 RO per ∼20 Kbp allows the maintenance of very small probabilities of genome-wide DFSs. Application of Eq. **12** to the yeast datasets gives values around 10^{−3} for the probability of one or more DFSs, consistent with our previous analysis. With the increase in genome size from around 10 Mbp (in yeasts) to around 10 Gbp (in human), Eq. **12** shows that the probability of DFSs increases by approximately two orders of magnitude, to more than 0.5 for human genomes (Fig. 3*A*). This huge increase in error rate occurs despite essentially no shift in the mean replicon size (Fig. 3*B*). Therefore, it is absolutely necessary for these organisms to have molecular machinery able to repair DFSs.

### The Bias Toward Uniformly Spaced Replication Origins Is Progressively Lost in Larger Genomes.

The regularity of the RO distribution can be assessed by computing the coefficient of variation of the replicon lengths, denoted by *R*, defined as the ratio of their SD to their mean. For a perfectly uniform distribution of equally spaced ROs, *R* is equal to 0. On the other hand, computational analysis indicates that, when ROs are randomly distributed on the genome, the value of *R* is very close to 1 (14).
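The definition of *R* can be made concrete with a short sketch. The function names, the toy chromosome length, and the number of origins are illustrative assumptions and are not taken from the datasets analyzed here.

```python
import random
import statistics

def replicon_lengths(origins):
    """Distances between consecutive origins (K origins give K - 1 replicons)."""
    ordered = sorted(origins)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def regularity_R(origins):
    """Coefficient of variation: SD of the replicon lengths divided by their mean."""
    lengths = replicon_lengths(origins)
    return statistics.pstdev(lengths) / statistics.mean(lengths)

# Equally spaced origins: every replicon has the same length, so R = 0.
uniform = [i * 20_000 for i in range(500)]

# Origins placed uniformly at random on a 10-Mbp toy chromosome: R is close to 1.
rng = random.Random(0)
scattered = [rng.uniform(0, 10_000_000) for _ in range(500)]

print(regularity_R(uniform))    # 0.0
print(regularity_R(scattered))
```

Values of *R* below 1 then indicate spacing more regular than random, and values above 1 indicate an excess of unusually long (and short) replicons.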

In the yeast genomes (diploid genome sizes ∼20 Mbp), we previously showed that their RO distributions were strongly biased toward uniform spacing with values of *R* ranging from 0.72 to 0.77 (Fig. 3*C*). The probability of DFSs is very small in yeasts due to the small genome size, and optimization of the RO positions by lowering *R* reduces this even further. However, as discussed above, organisms with larger genomes have a significantly higher probability of DFS events, which results in the need for additional molecular mechanisms to cope with the consequences (19), and the presence of such mechanisms means there is little to be gained in uniformly ordering ROs on the genomes. Thus, our expectation is that *R* should be significantly larger in organisms with larger genomes compared with the values found in yeast. Statistical analysis of the available data confirms this expectation (Fig. 3*C*). *Arabidopsis* and *Drosophila* (diploid genome sizes ∼250 Mbp) have values of *R* around unity (i.e., approximating a random distribution). Particularly striking is the fact that, in human genomes (∼6,000 Mbp), the values of *R* are significantly larger than unity, indicating that ROs are not spaced purely randomly and that both the number and size of large replicons are significantly greater than expected by chance. This unexpected distribution has important consequences that will be discussed in *Large Replicons in Human Genomes*.

The probability of a DFS in a given replicon increases with the replicon length according to Eq. **6** (*Materials and Methods*) and is plotted in Fig. 3*D*. The probability has a strongly nonlinear form: increasing as the square of the replicon length for lengths much less than the stalling distance, and saturating to unity for lengths significantly greater than the stalling distance. Fig. 3*E* provides a graphical representation that highlights the dramatic shift in variation of replicon lengths, or, equivalently, the per replicon rate of DFS, by plotting the predicted probability of DFSs across the largest chromosome of different organisms. It is apparent that the variation in probability of error increases by approximately one order of magnitude from yeast to *Drosophila*, and then again by approximately one order of magnitude from *Drosophila* to human.
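The quadratic-then-saturating shape can be illustrated with a minimal sketch. It assumes that each fork's stall distance is exponentially distributed with scale `n_s` (one way of expressing a constant per-nucleotide stall probability), so that a DFS occurs when the stall distances of the two converging forks sum to less than the replicon length. This is not necessarily the parametrization of Eq. **6**; `n_s` and the function names are hypothetical.

```python
import math
import random

def dfs_probability(length, n_s):
    """P(both converging forks stall before meeting) for one replicon.

    Assumption: stall distances are exponential with scale n_s, so the sum of
    two of them follows an Erlang-2 distribution with this closed-form CDF.
    """
    x = length / n_s
    return 1.0 - (1.0 + x) * math.exp(-x)

def dfs_probability_mc(length, n_s, trials=100_000, seed=1):
    """Monte Carlo estimate under the same assumption, as a cross-check."""
    rng = random.Random(seed)
    hits = sum(
        rng.expovariate(1.0 / n_s) + rng.expovariate(1.0 / n_s) < length
        for _ in range(trials)
    )
    return hits / trials

n_s = 10_000_000  # illustrative stalling scale in bp
for length in (20_000, 200_000, 2_000_000, 100_000_000):
    print(length, dfs_probability(length, n_s))
print(dfs_probability_mc(20_000_000, n_s), dfs_probability(20_000_000, n_s))
```

Under this assumption, a 10-fold increase in length raises the probability roughly 100-fold while the replicon is much shorter than `n_s`, and the probability approaches 1 once the replicon is several times longer than `n_s`.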

### Large Replicons in Human Genomes Cause the Most Errors but Are Bounded by the Stalling Distance.

Consistent with our analysis of the values of *R*, we would expect the largest replicons in the genome to be very significantly different in diploid genomes of size ∼20 Mbp, ∼250 Mbp, and ∼6 Gbp (represented by yeasts, *Drosophila*/*Arabidopsis*, and human, respectively), with significantly larger replicons appearing in those genomes with *R* larger than unity. As seen in Fig. 4*A*, this is exactly what is observed, with the largest replicons being ∼60 Kbp in yeasts (∼120 Kbp expected for a random distribution), 151 Kbp in *Drosophila* (207 Kbp expected if random), 773 Kbp in *Arabidopsis* (663 Kbp expected if random), and ∼5 Mbp in human (∼300 Kbp expected if random). The tendency for larger and larger replicons can also be seen by the significant increase in outliers in the box plots of replicon lengths for the different organisms considered (Fig. 4*B*). As is clear from Fig. 3*D*, the probability of a DFS in a given replicon increases dramatically as the length of the replicon approaches

In the human genome, given that errors are very likely, we can determine the range of replicon lengths that are the main contributors to the DFS. We grouped the replicons into five cohorts: very small (XS; <1 Kbp), small (S; 1 to 10 Kbp), medium (M; 10 to 100 Kbp), large (L; 100 Kbp to 1 Mbp), and very large (XL; >1 Mbp). The frequency of replicons in these five cohorts is shown in Fig. 5 *A* and *B* for IMR90 from the B and P studies. The most common range of replicons is S, followed by M, the shift from S to M being due to the coalescence of small replicons in the Picard et al. study. L and XL replicons appear only at low frequency. Despite this, Fig. 5 *C* and *D* shows that the cohort of L replicons dominates as the source of error, which is due to the fact that the DFS probability increases nonlinearly with the replicon length (Fig. 3*D*). The error rate due to the small number of XL replicons is significantly smaller compared with the L replicons. An important consequence of this finding is that there will be a very limited impact on genome-wide error rates from false negatives, which primarily affect the distribution of XL replicons.
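A sketch of the cohort grouping shows how a rare cohort can still dominate the error. The replicon lengths are made up, lower cohort bounds are treated as inclusive (the text does not specify which side the boundaries fall on), and the quadratic weight is only a stand-in for the full DFS probability in the regime well below the stalling distance.

```python
from collections import Counter, defaultdict

# Cohort upper bounds in bp, following the five size classes in the text.
COHORTS = [("XS", 1_000), ("S", 10_000), ("M", 100_000), ("L", 1_000_000)]

def cohort(length_bp):
    """Name of the size class; lower bounds are inclusive (an assumption)."""
    for name, upper in COHORTS:
        if length_bp < upper:
            return name
    return "XL"

def cohort_counts(lengths_bp):
    """Number of replicons in each cohort."""
    return Counter(cohort(n) for n in lengths_bp)

def cohort_error_share(lengths_bp, weight=lambda n: n * n):
    """Fraction of the total error weight carried by each cohort.

    The default weight is the quadratic small-replicon scaling; it ignores
    saturation for replicons approaching the stalling distance.
    """
    totals = defaultdict(float)
    for n in lengths_bp:
        totals[cohort(n)] += weight(n)
    grand = sum(totals.values())
    return {name: value / grand for name, value in totals.items()}

# Made-up lengths: many small replicons, few large ones.
lengths = [5_000] * 1000 + [50_000] * 300 + [300_000] * 20
print(cohort_counts(lengths))
print(cohort_error_share(lengths))
```

In this toy example, the 20 L replicons account for about 70% of the error weight even though they make up under 2% of all replicons.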

Interestingly, in both datasets, for all cell lines, a closer examination of the error rates in the vicinity of the L cohort shows a surprisingly uniform distribution of error rate, which is suggestive of ROs being placed so as to “spread the risk” of error across size scales. In Fig. 5 *E* and *F*, the probability of DFS in each 10-Kbp interval in the range 10 to 300 Kbp is shown for the Besnard et al. (Fig. 5*E*) and Picard et al. (Fig. 5*F*) datasets for primary IMR90 cells. These replicons are the ones that contribute the most to the DFS probability. The maxima are relatively broad, particularly for the B dataset, for which the probability of DFS in each 10-Kbp interval is approximately constant at 0.030 to 0.035 across replicons spanning from 40 Kbp to 200 Kbp. For replicons significantly smaller than the stalling distance, one can infer, from the theory, that ROs are placed in such a way as to give a power law, with a frequency of replicons that decreases as the inverse square of the replicon length, thereby spreading the probability of a DFS equally among all size classes (described by Eq. **S6**). Fig. 5 *G* and *H* shows that there is a remarkable concordance between the theoretical frequency distribution (in blue) and the frequency distribution in the data for the IMR90 cell line in both datasets (in red). There is also excellent agreement with the theoretical distribution in all of the other cell lines in both datasets (Figs. S2−S4). These results can be interpreted in terms of “spreading the damage” as widely as possible in the replicon size region of maximal DFS errors, as a power law is the most effective way to delocalize errors from any single cohort of replicon lengths.

### Replication Errors Are Common but Low in Number for Higher Eukaryotes.

As discussed in *The “Central Equation,”* our theory predicts that the distribution of the number of DFSs in a given genome is Poisson-distributed to a very high degree of accuracy. We have applied our theory to the human cell lines datasets to test this prediction. As shown in Fig. 6, for all cell lines, from both laboratories, the distribution of DFSs is indeed Poisson-distributed, regardless of being primary or tumoral cell lines. Statistical analysis confirms that the computationally derived probability distribution of DFSs is statistically indistinguishable from the fitted Poisson distribution. Interestingly, we find a very low probability (<10%) of encountering more than three DFSs in the replication of the entire diploid human DNA per cell cycle. Therefore, despite the high probability of the presence of DFSs (∼80%), in ∼90% of cells undergoing DNA replication, the expected number of DFSs is predicted to be three or less, with one or two errors being the most likely occurrences. Indeed, we find that the parameter

Given that DFSs in human cell lines are almost inevitable, it is somewhat surprising to find that their number is quite sharply constrained to be essentially one, two, or three. This might indicate that the mechanism that deals with such errors has a very low capacity. If, as suggested in Moreno et al. (19), the defects induced by DFSs can be resolved in the following cell cycle by segregating unreplicated DNA to daughter cells, DNA strand breaks could be generated at each DFS. Because the number of ways that double-strand breaks could be incorrectly rejoined increases as the factorial of the number of breaks, this might constrain the number of tolerated DFSs to about three or fewer. We provide a rationale for putative biological mechanisms in *Discussion*, and our arguments lead us to consider two different biomarkers for double-strand breaks that would arise from DFS errors; these are the presence of 53BP1 nuclear bodies in the G1 phase of the subsequent cell cycle and the presence of ultrafine anaphase bridges (UFBs) during mitosis. Our theory suggests that the numbers of both 53BP1 nuclear bodies and UFBs follow a Poisson distribution with a value of

We performed an experimental analysis of 53BP1 in IMR90 cells and both 53BP1 and UFBs in U2-OS cells, and we measured the frequency of their occurrence during the cell cycle at a single-cell level (19). In agreement with our predictions, the experimental distributions of both 53BP1 nuclear bodies and UFBs fit to a Poisson distribution (Fig. 7 *A*−*C*). Statistical analyses indicate that both a naïve fitting using the mean of the data and a more advanced approach that accounts for potential errors introduced by the experimental procedure of the immunofluorescence experiments (Fig. 7 *A*−*C*) produce distributions that are not statistically different from Poisson distributions for both 53BP1 nuclear bodies (*P* values between 0.61 and 1 for both IMR90 and U2-OS cells) and UFBs (*P* values between 0.53 and 1 for U2-OS cells). Additionally, the fitted

As a more quantitative analysis, we compared the *D*). Additional comparisons with the *D*). Interestingly, the range of variation observed in the experimental value of

In both IMR90 and HeLa cells, the experimentally derived *ca.* 8 h) (22). If the true RO density is twice that measured, one can show that the largest gap would be halved, giving a value of 2 Mbp, which is in line with the estimate of 2 Mbp for the longest stretch of DNA that could be replicated in the duration of S phase (assuming a fork speed of ∼2 Kbp per minute (23), and remembering that a large replicon will be replicated almost symmetrically by forks traveling from either end).
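The 2-Mbp estimate can be checked with a line of arithmetic. The sketch assumes that the "*ca.* 8 h" quoted above is the duration of S phase and that a gap is closed by two forks converging from its ends at constant speed; the function and constant names are illustrative.

```python
FORK_SPEED_BP_PER_MIN = 2_000   # ~2 Kbp per minute, from the text
S_PHASE_HOURS = 8               # assumed S-phase duration (ca. 8 h)

def max_replicon_bp(fork_speed_bp_per_min, s_phase_hours, forks=2):
    """Longest gap that `forks` converging forks can cover within S phase."""
    return forks * fork_speed_bp_per_min * s_phase_hours * 60

print(max_replicon_bp(FORK_SPEED_BP_PER_MIN, S_PHASE_HOURS))  # 1920000
```

The result, about 1.9 Mbp, rounds to the 2 Mbp quoted for the longest stretch of DNA that could be replicated within S phase.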

### Effect of Variation of the Stalling Distance.

In applying our theory to the RO position data for various human cell lines, we can vary the numerical value of

First, we analyzed the overall probability of DFSs occurring as *A*). Therefore, DFSs are inevitable for smaller values of

Our analysis stresses the inevitability of DFS errors during replication of the human genome and calls for a shift in our approach with respect to how the problem has been viewed in the past. On varying the median stalling distance in human cells, the probability of exactly one DFS genome-wide reaches a maximum between 10 and 15 Mbp, depending on the particular cell line and dataset used (Fig. 8 *B* and *C*). Furthermore, on varying the stalling distance, we find that the probabilities of exactly two or exactly three DFSs occurring also have peaks in the range 6 to 10 Mbp, again depending on the cell line and the dataset used (Fig. 8 *B* and *C*). To probe the likelihood of a small number of errors occurring, we plotted the probability of observing one, two, or three DFSs as stalling distance was varied (Fig. 8 *D* and *E*). These results show a very pronounced maximum for

Finally, we can measure the average number of DFSs when *F* and *G*). As explained in *Replication Errors Are Common*, fitting the Poisson distribution to 53BP1 and UFB experimental data gives values of *F* and *G* as black, blue, and red lines). The intersection of the decaying curve with these two lines provides another independent estimate of the stalling distance, which we find to be between 8 and 16 Mbp, depending on the cell line and dataset used. Our analysis of the statistics of DFSs in human cell data on varying the stalling distance therefore provides very strong evidence for the robustness of this parameter, with a value in the range 8 to 15 Mbp, consistent with previous estimates from our analysis of yeast RO distributions, and with direct experimental estimates (14, 24).

### Effect of Varying the Number of Licensed ROs.

Interestingly, among the cell types we analyzed, there was no major difference in the mean replicon length (Fig. 3*B*). Fig. 9 shows how decreasing mean replicon length would reduce the probability of DFSs in a generic organism. The black, light blue, and blue lines illustrate the mean replicon length to achieve a fixed probability of DFSs under the optimal situation of equally spaced ROs. All of the datasets analyzed in the article have a mean replicon length ranging between 10 and 100 Kbp (shaded pink in Fig. 9). Because of the relatively small genome sizes of yeasts, so long as ROs are evenly spaced, this mean replicon length can achieve a tolerable DFS probability of ∼0.1%, similar to the chromosome missegregation rate (14). To maintain a low probability of DFSs as in yeasts, longer genomes would require a much lower mean replicon length or, in other words, much higher density of ROs on the genome. Because the MCM2-7 double hexamer that licenses an RO has a footprint of ∼60 bp (25, 26), this provides an absolute limit to the possible replicon length (dashed line in Fig. 9). It is just about possible for organisms with ∼6,000-Mbp genomes to achieve yeast-like DFS probabilities, but the genome would have to be almost completely packed with MCM2-7, which might leave the genome unable to perform its major function of providing the template for transcription. Because this saturation is implausible for normal cells, additional postreplicative mechanisms must be in place to deal with the inevitable DFSs. For this reason, regularity in RO distribution is not an effective safeguard against DFSs in organisms with larger genomes.
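A rough scaling argument reproduces the "almost completely packed" conclusion. The sketch assumes equally spaced ROs and the quadratic small-replicon regime, in which the expected number of DFSs scales as (number of replicons) × (replicon length)², i.e., as genome length × replicon length. The reference values are the approximate figures quoted in the text; the function name is hypothetical.

```python
def matched_replicon_bp(ref_replicon_bp, ref_genome_bp, target_genome_bp):
    """Replicon length giving the same expected DFS count as the reference.

    Assumes equally spaced origins and the quadratic small-replicon regime,
    where the expected count scales as genome length times replicon length.
    """
    return ref_replicon_bp * ref_genome_bp / target_genome_bp

YEAST_REPLICON_BP = 20_000          # ~1 RO per 20 Kbp
YEAST_GENOME_BP = 20_000_000        # diploid, ~20 Mbp
HUMAN_GENOME_BP = 6_000_000_000     # diploid, ~6,000 Mbp
MCM_FOOTPRINT_BP = 60               # MCM2-7 double hexamer footprint

needed = matched_replicon_bp(YEAST_REPLICON_BP, YEAST_GENOME_BP, HUMAN_GENOME_BP)
print(round(needed, 1), round(needed / MCM_FOOTPRINT_BP, 2))  # 66.7 1.11
```

Under these assumptions, a human genome would need one RO roughly every 67 bp to match the yeast DFS probability, barely more than the ∼60-bp footprint of a single MCM2-7 double hexamer.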

## Discussion

Faithful DNA replication is fundamental to preserve the genetic content of cells and to avoid the severe pathologies that arise when DNA is improperly replicated. The appropriate location and activation of ROs is fundamental to ensuring that replicative errors are minimized. Here we show that understanding the principles that govern distribution of ROs provides quantitative insights into the way that different organisms maintain genetic integrity. By using a probability theory approach, based on a one-parameter model with simple yet plausible assumptions, we have developed a set of measures and predictions that further this understanding. The excellent agreement of our theoretical predictions with experimental data strongly supports the validity of our model assumptions. Moreover, it allows us to explore the rich system-level diversity of features and constraints associated with DNA replication.

### Replicative Errors Are Inevitable in Larger Genomes.

Increased phenotypic complexity of organisms is generally associated with larger genome length, and metazoans have much larger genomes compared with yeast: The diploid human genome is ∼600 times larger than the haploid yeast genome. Despite this large difference in genome size, the replication machinery is essentially conserved (4). Over the past few decades, much effort has been devoted to understanding the molecular mechanisms involved in eukaryotic DNA replication and the associated damage-repair mechanisms. However, less is known about the system-level structures and processes that allow replication fidelity across the different scales of eukaryotic complexity, mirrored by genome lengths spanning over three orders of magnitude across yeast to human. We have used a theoretical approach, previously validated in yeasts (14), to predict the probability of DFSs for different organisms with widely different genome lengths, and for which detailed RO distribution data are available.

Our central equation shows that there is a hierarchy of contributions to the probability of DFS, with genome length being the most important factor, followed by RO number and then RO distribution. This effectively creates different classes of probabilities of DFS errors (∼10^{−3}, ∼10^{−2}, and ∼1) for the respective classes of organisms according to their genome lengths (∼20 Mbp, ∼250 Mbp, and ∼6 Gbp). Interestingly, among the cell types we analyzed, there was no major difference in the density of ROs, i.e., mean replicon length. One possible explanation for this is that to make a significant effect on reducing DFSs, the RO density in organisms with genomes of 250 Mbp or more would lead to excessive clashes with the transcriptional machinery. The third component of our equation—the uniformity of replicon length, i.e., *R*—also reflects these classes (with values of <1, ∼1, and >1, respectively), indicating that, as the probability of DFSs approaches 1 in larger genomes, the pressure toward a regular RO distribution is lifted.

### Inevitability Is Mitigated by Containment in Longer Genomes and Beyond.

DFSs are the primary cause of DNA double-strand breaks during replication (27–29), and are likely to be major contributors to the development of cancer and other pathologies, such as those associated with aging (30, 31). The inevitability of DFSs in longer genomes requires cellular mechanisms that can deal with such errors efficiently. In related work, we provide experimental evidence for one such postreplicative mechanism, involving the segregation of unreplicated DNA via UFBs and its protection by 53BP1 before being resolved in the next S phase (19). We have demonstrated very good agreement in the numbers and statistical distribution of experimental measurements of both 53BP1 and UFBs with the predictions of Poisson statistics from our theory, supporting the validity of our conclusions, and indicating that DFSs in the experimental systems are well approximated as independent events.
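The Poisson picture ties together two figures quoted in *Results* for human cell lines: a ∼80% probability of at least one DFS and a <10% probability of more than three. A minimal sketch, using only the relation P(no DFS) = e^{−*λ*} and the standard Poisson formula (the function names are illustrative):

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def lam_from_p_any(p_any):
    """Poisson mean implied by P(at least one event) = 1 - exp(-lam)."""
    return -math.log(1.0 - p_any)

lam = lam_from_p_any(0.80)   # ~80% chance of at least one DFS
p_more_than_three = 1.0 - sum(poisson_pmf(k, lam) for k in range(4))
print(round(lam, 2), round(p_more_than_three, 3))  # 1.61 0.08
```

With *λ* ≈ 1.6, exactly one DFS is the single most likely outcome, followed by two, and only about 8% of cells would carry more than three.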

Analysis of the data available for human cell lines within our theoretical framework shows that RO density and distribution constrain the number of DFSs per cell cycle to three or less for nearly all cells. This limit on the number of DFSs may partially be explained by the difficulty in properly recombining two strands of DNA when end-joining is used. For example, if four DFSs occur and need to be fixed, eight strands will be generated, and only one of the 24 theoretically possible combinations is correct. From our experimental observations, cells with large numbers of 53BP1 nuclear bodies and UFBs showed increased blebbing and apoptosis. This suggests that large numbers of DFSs could compromise the working of the cell and the efficiency of the repair mechanism. Thus, our theory, in light of the experimental data, shows a contingent trade-off between the inevitability of DFS occurrence and the difficulty of its resolution (i.e., apparently requiring sophisticated molecular machinery for detection and repair). It is worth stressing that our central equation for

Another important requirement for the containment of replicative errors in larger genomes is an upper limit in the length of large replicons. Longer replicons correspond to a higher probability of DFSs (Fig. 3*D*). Our theory indicates that the largest tolerable replicons in human cell lines are bounded by ∼0.5

As a final note, it is worth stressing that some organisms, particularly plants, have very large genomes, with

## Materials and Methods

### Experimental Setup.

For the 53BP1 and UFBs experiments, U2-OS and IMR-90 cell lines from the American Type Culture Collection were maintained in Dulbecco’s Modified Eagle’s medium (Invitrogen), supplemented with 10% (vol/vol) FBS (Invitrogen) and penicillin and streptomycin at 37 °C in 5% CO_{2}. Standard immunofluorescence protocols were used for the 53BP1 and UFBs staining. Briefly, cells were fixed with 4% formaldehyde, permeabilized with 0.1% Triton in PBS, and blocked in 0.5% fish gelatin (G-7765; Sigma). Samples were incubated overnight with primary antibodies. To identify G1-phase cells, cells were incubated with 40 µM EdU (Invitrogen) for 30 min before fixation, and then incubated with Cyclin A (1:300, ab16726; Abcam). For the detection of 53BP1, cells were also stained with GFP (1:2,000, ab13970; Abcam). To stain incorporated nucleotides, the Click-iT-EdU kit was used as instructed by the manufacturer (C10337; Invitrogen). For staining UFBs, cells were incubated with BLM (1:200, sc-7790; Santa Cruz). Alexa secondary antibodies (Invitrogen) were used for 1 h. Microscopy images were acquired using an Olympus IX70 DeltaVision deconvolution microscope and a CCD camera. Data from microscopy experiments were analyzed using Volocity 3D analysis software (Perkin-Elmer).

### Datasets Used and Statistical Analysis.

Limited direct experimental evidence exists on ROs in plants and metazoans, and most data focus on the genomic density, rather than localization, of ROs (33, 34). Therefore, the main results of our article are framed in the context of available datasets describing genome-wide RO positions. Lower-quality datasets have been considered, where appropriate, to provide additional challenge to the theoretical predictions and their interpretation. *Saccharomyces cerevisiae* ROs were obtained from the highly curated DNA Replication Origin Database (20) with selection criteria discussed in ref. 14. To provide additional validation, we considered another yeast species in this article: *Schizosaccharomyces pombe* (21). RO distribution data were also obtained for the following multicellular organisms: *Arabidopsis thaliana* (15), *Drosophila melanogaster* (16), and human. Human data for the four cell lines IMR90, HeLa, hESC, and iPSC were derived as discussed in ref. 17, and different datasets for IMR90, HeLa, and K562 cell lines were obtained from ref. 18. A summary of the datasets is presented in Table S1.

When RO positions were defined by genomic ranges, the middle point of the range was used as the genomic location of the RO. Moreover, to limit the problems associated with technological limitations in sequencing the centromeric regions of chromosomes, the largest replicon of each chromosome (corresponding to the centromeric region) was excluded from the analysis in all of the organisms considered.
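The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (tuples of chromosome, range start, range end) and the function name `replicon_lengths` are hypothetical, and the sketch considers only origin-to-origin distances, ignoring the segments between chromosome ends and the nearest RO.

```python
from collections import defaultdict


def replicon_lengths(origins):
    """Return inter-RO distances per chromosome.

    `origins` is an iterable of (chromosome, start, end) tuples (a
    hypothetical input layout). Each RO is placed at the midpoint of its
    range, and the largest replicon of each chromosome is dropped, as a
    stand-in for excluding the centromeric region.
    """
    midpoints = defaultdict(list)
    for chrom, start, end in origins:
        midpoints[chrom].append((start + end) / 2)

    lengths = {}
    for chrom, mids in midpoints.items():
        mids.sort()
        # Distance between each pair of adjacent ROs.
        gaps = [b - a for a, b in zip(mids, mids[1:])]
        if gaps:
            # Remove one occurrence of the largest replicon.
            gaps.remove(max(gaps))
        lengths[chrom] = gaps
    return lengths


# Toy coordinates, invented purely for illustration.
example = [
    ("chrI", 100, 200),
    ("chrI", 1000, 1100),
    ("chrI", 5000, 5200),
    ("chrI", 5600, 5800),
    ("chrII", 300, 400),
]
print(replicon_lengths(example))
```

With the toy input, the midpoints on `chrI` are 150, 1050, 5100, and 5700, so the 4,050-nucleotide replicon is the one excluded; a chromosome with a single RO contributes no replicons.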

Probabilities of DFSs were obtained from RO position data using the formulas detailed in the following mathematical derivations. To allow standardized comparisons in computing the probability of DFS, all of the organisms were considered as diploid. Poisson fits of the computationally derived distribution of DFSs were computed using the probability of no DFSs. Poisson fits of the experimental data were computed either from the sample mean (naïve fit) or by minimizing the difference from the observed frequencies of DFS counts strictly larger than zero (filtered fit). Differences between distributions were assessed using chi-squared tests.
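The three Poisson fits mentioned above can be illustrated with a short Python sketch. For a Poisson distribution the probability of zero events is exp(−λ), so λ follows directly from the probability of no DFSs; the naïve fit uses the sample mean. The filtered fit is shown here as a least-squares grid search over the frequencies of counts greater than zero, which is an assumption: the text does not specify the exact objective or optimizer. All function names are hypothetical.

```python
import math


def poisson_pmf(m, lam):
    """Poisson probability of exactly m events for rate lam."""
    return math.exp(-lam) * lam ** m / math.factorial(m)


def lambda_from_p0(p_no_dfs):
    """Rate implied by the probability of no DFSs: P(0) = exp(-lam)."""
    return -math.log(p_no_dfs)


def lambda_naive(counts):
    """Naive fit: the Poisson rate equals the sample mean of the counts."""
    return sum(counts) / len(counts)


def lambda_filtered(freqs, grid_step=0.001, lam_max=10.0):
    """Filtered fit (ASSUMED objective): least squares on counts m > 0.

    `freqs` maps a DFS count m to the observed fraction of cells with
    that count. The zero class is ignored, and a simple grid search
    stands in for whatever optimizer was actually used.
    """
    best_lam, best_err = None, float("inf")
    for i in range(1, int(lam_max / grid_step) + 1):
        lam = i * grid_step
        err = sum(
            (poisson_pmf(m, lam) - f) ** 2 for m, f in freqs.items() if m > 0
        )
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam


# Synthetic frequencies generated from a known rate, for illustration only.
synthetic = {m: poisson_pmf(m, 1.2) for m in range(9)}
print(lambda_from_p0(synthetic[0]), lambda_filtered(synthetic))
```

Ignoring the zero class is useful when cells without any detectable foci or bridges may be over- or under-counted, which is presumably why the filtered fit is reported alongside the naïve one.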

### Model Derivation and Mathematical Details.

The baseline assumptions that have been used to construct the mathematical model have been described elsewhere (14) and will not be discussed here. In yeast, the size of the largest replicon, i.e., inter-RO distance, is significantly smaller than

Let *D* be the distance between two adjacent ROs located respectively at *n* = 0 and *n = N*, where *N* − 1 is the number of nucleotides within *D*. As shown in ref. 14, the probability of a double stall in *D* (DSD) is given by the following expression:

Therefore,

Evaluating the sums using the formula for a geometric series, we have

Thus,

Expressing the product as the exponential of the sum of the logarithms gives

Because *q* is an extremely small number,

Combining Eq. **3** with Eq. **5**, we obtain

Let us define the distance between the adjacent (*k* + 1)th and *k*th ROs as *N*_{k}. The probability of double stall between this pair of ROs will be denoted as *P*_{k}. Thus,

The genome-wide probability of no double stall, which will be denoted as Prob(NDS), is given by the product of probability of no double stalls in each replicon, i.e.,

Combining Eq. **7** and Eq. **8**, we have

Let *N*_{g} be the genome length; then

Thus,

Similarly,

Therefore, combining Eqs. **9**, **10**, and **11**, we have

or

We have shown before (14) that

As given by Eq. **1**, where the negative of the quantity in parentheses is denoted by *λ*. Further derivations and mathematical details are provided in *SI Text* and Table S2.
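The structure of this derivation, a product of per-replicon probabilities evaluated as the exponential of a sum of logarithms, can be sketched numerically. The per-replicon formula used below is an assumption, not a transcription of the equations of this article: it is the continuum limit of a model in which each of the two converging forks stalls independently at a constant rate *q* per nucleotide, giving a double-stall probability of 1 − (1 + *Nq*)e^{−*Nq*} for a replicon of length *N*. The value of *q* and the replicon lengths are purely illustrative.

```python
import math


def p_double_stall(n, q):
    """ASSUMED per-replicon double-stall probability (continuum limit).

    Two forks with independent, exponentially distributed stalling
    distances (rate q per nucleotide) both stall before meeting with
    probability 1 - (1 + n*q) * exp(-n*q).
    """
    x = n * q
    return 1.0 - (1.0 + x) * math.exp(-x)


def lam(replicons, q):
    """lambda = -ln Prob(NDS), accumulated as a sum of logarithms.

    Under the assumed formula, ln(1 - P_k) = log1p(x_k) - x_k with
    x_k = N_k * q, which avoids multiplying many numbers close to 1.
    """
    return sum(n * q - math.log1p(n * q) for n in replicons)


def prob_no_double_stall(replicons, q):
    """Genome-wide probability of no double stall: exp(-lambda)."""
    return math.exp(-lam(replicons, q))


# Hypothetical inputs: 200 replicons of 50 kb and an illustrative q.
q = 1e-7
regular = [50_000] * 200
irregular = [10_000, 90_000] * 100  # same total length, uneven spacing
print(prob_no_double_stall(regular, q), prob_no_double_stall(irregular, q))
```

Under this assumed form, each term behaves like (*N*_{k}*q*)²/2 when *N*_{k}*q* is small, so for a fixed genome length and number of ROs, unevenly spaced origins give a larger *λ* than regularly spaced ones, and the largest replicons contribute most.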

### Software Used.

Data analysis was performed using R version 2.15 and RStudio version 0.98.978 (https://www.rstudio.com).

## SI Text

### Mathematical Derivations.

#### Probability of a specific number of DFSs.

The previous approach can be extended to calculate the probability of an arbitrary number of double stalls. The probability of exactly one DFS, which will be called Prob(1DS), can be calculated directly as

Combining Eq. **13** with Eq. **8**, we obtain

Let Prob(*m*DS) be the probability of *m* DFSs; the following conventions will be used:

Combining *R*_{1}, *R*_{2}, *R*_{3}, *R*_{4}, *R*_{5}, and *R*_{6} with Eq. **S2**, we can obtain the probability of one to six DFSs as follows:

*R*_{i}. Therefore, we write

*S*_{1} = *λ*.

Now, from Eq. **8** and Eq. **S4**, we have

#### Frequency of replicons of particular size.

We have shown in Eq. **3** that the probability of a DFS in the region of DNA between a pair of adjacent ROs separated by *N* nucleotides is

*M* replicons whose size is in the vicinity of *N*. The probability of no error occurring from this cohort would be the following product:

*θ*. Substituting Eq. **S5** into this expression, and recognizing that all of the *N*_{k} are close to *N*, enables us to rewrite the probability of no error from the cohort as

For *Nq* ≪ 1, it is straightforward to show, from expanding the denominator, that *M* ≈ 1/*N*^{2}.

## Acknowledgments

We thank Dianbo Liu and Sam Palmer for helpful discussions. A.M., J.T.C., and J.J.B. acknowledge support from Cancer Research UK (Grant C303/A14301) and the Wellcome Trust (Grant WT096598MA). M.A.M., L.A., and T.J.N. acknowledge support from the Scottish Universities Life Science Alliance. T.J.N. acknowledges support from the National Institutes of Health (Physical Sciences in Oncology Centers, U54 CA143682). The authors also acknowledge High Performance Computer resources partially supported by the Wellcome Trust (Strategic Grant 097945).

## Footnotes

^{1}M.A.M. and L.A. contributed equally to this work.

^{2}To whom correspondence should be addressed. Email: T.Newman{at}dundee.ac.uk.

Author contributions: M.A.M., L.A., A.M., J.J.B., and T.J.N. designed research; M.A.M., L.A., A.M., J.T.C., J.J.B., and T.J.N. performed research; M.A.M. and L.A. analyzed data; M.A.M., L.A., A.M., J.J.B., and T.J.N. wrote the paper; M.A.M. and L.A. performed mathematical calculations; A.M. and J.T.C. performed biological experiments; and T.J.N. developed the mathematical model and performed calculations.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1603241113/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- (1) Nielsen O, Løbner-Olesen A
- (5) Diffley JFX
- (6) Arias EE, Walter JC
- (11) Cobb JA, et al.
- (14) Newman TJ, Mamun MA, Nieduszynski CA, Blow JJ
- (16) Cayrou C, et al.
- (19) Moreno A, et al.
- (20) Siow CC, Nieduszynska SR, Müller CA, Nieduszynski CA
- (21) Hayashi M, et al.
- (22) Cooper GM
- (24) Maya-Mendoza A, Petermann E, Gillespie DAF, Caldecott KW, Jackson DA
- (26) Evrin C, et al.
- (28) Allen C, Ashley AK, Hromas R, Nickoloff JA
- (29) Jones RM, Kotsantis P, Stewart GS, Groth P, Petermann E
- (32) Francis D, Davies MS, Barlow PW
- (34) Mahbubani HM, Chong JPJ, Chevalier S, Thömmes P, Blow JJ