## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Greedy scheduling of cellular self-replication leads to optimal doubling times with a log-Frechet distribution

Edited by Ken A. Dill, Stony Brook University, Stony Brook, NY, and approved January 13, 2015 (received for review September 29, 2014)

## Significance

In production scheduling, a temporal assignment of tasks to their respective processing units is sought, such that precedence and resource constraint are satisfied and an optimization goal is minimized. So how is bacterial self-replication scheduled? Scarcity of resources creates a hard-to-solve scheduling problem in the worst case. Here, we show that proper loading of metabolic and catalytic reservoirs results in optimal scheduling, provided that the catalytic processing units are “greedy,” i.e., do not idle if they can perform a task. The distribution of doubling times of optimally scheduled self-replication has a universal form—the log-Frechet distribution. Recent measurements of doubling times of *Escherichia coli* in different media well fit this distribution, suggesting that *E. coli* optimally scheduled its replication in these experiments.

## Abstract

Bacterial self-replication is a complex process composed of many de novo synthesis steps catalyzed by a myriad of molecular processing units, e.g., the transcription–translation machinery, metabolic enzymes, and the replisome. Successful completion of all production tasks requires a schedule—a temporal assignment of each of the production tasks to its respective processing units that respects ordering and resource constraints. Most intracellular growth processes are well characterized. However, the manner in which they are coordinated under the control of a scheduling policy is not well understood. When fast replication is favored, a schedule that minimizes the completion time is desirable. However, if resources are scarce, it is typically computationally hard to find such a schedule, in the worst case. Here, we show that optimal scheduling naturally emerges in cellular self-replication. Optimal doubling time is obtained by maintaining a sufficiently large inventory of intermediate metabolites and processing units required for self-replication and additionally requiring that these processing units be “greedy,” i.e., not idle if they can perform a production task. We calculate the distribution of doubling times of such optimally scheduled self-replicating factories, and find it has a universal form—log-Frechet, not sensitive to many microscopic details. Analyzing two recent datasets of *Escherichia coli* growing in a stationary medium, we find excellent agreement between the observed doubling-time distribution and the predicted universal distribution, suggesting *E. coli* is optimally scheduling its replication. Greedy scheduling appears as a simple generic route to optimal scheduling when speed is the optimization criterion. Other criteria such as efficiency require more elaborate scheduling policies and tighter regulation.

An *Escherichia coli* bacterium is a remarkably efficient self-replicating organism. Given the right conditions, this rod-shaped bacterium will consume external metabolites and grow, adding new membrane-bound volume, while concurrently replicating its content composed chiefly from the transcription–translation machinery, metabolism, and DNA. After all of the essential elements have been replicated and spatially segregated, *E. coli* divides by completing the construction of the division plane, approximately midway between the old and the new poles. The two copies can both continue to self-replicate, as long as the permissive conditions persist.

The study of self-replication as an industrial process was pioneered by John von Neumann (1). In his first, less known model, the kinematic self-replicator, he envisioned a room full of elementary parts and a nontrivial self-replicating machine that copies itself by consuming these parts as substrates. The main goal of von Neumann was to understand how a physical system can become more complex over time. Motivated by the introduction of the universal Turing machine to the theory of computation, he introduced the concept of a “universal constructor” (*U*)—a machine that can read instructions and translate them into assembly of any machine in the factory, including itself, provided all of the substrates are available. A self-replicating factory is called nontrivial if it contains such a *U* machine as a component. In contrast, trivial self-replication is a simple autocatalytic process, such as template replication or crystal growth.

In von Neumann’s model, replication unfolds as follows (2). First, a new chassis is made, and then the universal constructor is triggered to start reading the instructions and assembling all of the internal machinery, including itself. However, the new factory is not functional until a copy of the instructions is made. To keep the design simple, von Neumann suggested that the instructions should not instruct their own replication but rather be template-replicated by a dedicated machine, *R* (produced by *U*), that is triggered upon the completion of the instruction translation phase. Remarkably, these observations were made before the discovery of the molecular structure of DNA, and the molecular mechanism of DNA replication, transcription, and translation.

The emphasis of von Neumann’s model is on the logical design of such a factory. Here we study the temporal organization of self-replication and, in particular, its scheduling. von Neumann’s model assumed serial dynamics but cells do not grow serially but rather parallelize their growth. However, using a nonoptimal parallel scheduling scheme still results in slow overall growth compared with the optimum. This difference can have a significant effect on fitness in a competitive environment.

The “scheduling problem” is the problem of finding such a schedule. It is known that, if the number of processing units available is smaller than the number of production tasks that require them, finding an optimal schedule is typically computationally hard in the worst case, although certain heuristics are known to provide good approximations for the average instances (3⇓–5).

Here, we study the scheduling problem of a self-replicating bacterial cell. Surprisingly, our analysis of recently measured datasets of *E. coli* exponentially growing in a stationary medium (6, 7) reveals that the measured distribution of doubling times fits well to the predicted distribution of doubling times of an optimally scheduled self-replicating factory. This suggests that *E. coli* is optimally scheduling its replication in these media. To explain this result, first, a coarse-grained picture of a bacterial cell is presented, and its reaction graph is briefly introduced. The concept of a project graph, well-known in system engineering, is invoked to represent the temporal precedence constraints that exist among all of the de novo synthesis tasks that define the project. We introduce the project graph of a self-replicating and balanced factory and define the “replicative buffer”—the number of basic self-replicating units within a cell. A replicative buffer greater or equal to 1 allows a large class of random scheduling algorithms known as list algorithms (8) to obtain optimal completion times. We derive the distribution of optimal completion times for a special type of project graph representing balanced production. Finally, we present the aforementioned analysis of a recently published dataset of *E. coli* growing exponentially in a rich medium (6) and in minimal media supplemented with glucose or glucose and amino acids (7), and show that the data fit well to our predicted optimal universal curve, suggesting that *E. coli* in good growth conditions at steady state is optimally scheduling its replication. We then conclude with a short discussion and give an outlook to future extensions of this framework.

## The Cell as an Autocatalytic Cycle

In a bacterial cell, prominent examples for processing units include metabolic enzymes, RNA and DNA polymerases (RNAP, DNAP), and ribosomes, which, like processing units in a factory, are required for a production task (biosynthesis), consume input materials and free energy, are not consumed during the process, yet are essential for its successful completion.

There are two unique features characterizing a self-replicating factory. The first is “closure.” Processing units convert raw material into products, while consuming free energy. In a self-replicating factory, the products are the processing units. Thus, when all of the processing units complete their production tasks, each processing unit is present in duplicate (or more). To produce a processing unit may require several other processing units, self included. The second feature is “essentiality”—each processing unit is required by at least one other reaction that produces a different type of processing unit.

To illustrate how closure is obtained in a bacterial cell, we present a coarse-grained schematic of it in Fig. 1. Each symbol in the figure represents a family of functionally related macromolecules. For example, the replisome (the *R* machine in von Neumann’s model), which is composed of many types of proteins and protein complexes, is represented by a single DNAP.

All proteins are synthesized by ribosomes, composed of ribosomal proteins (represented by the colored hexagons in the figure) and ribosomal RNA (rRNA) represented by a green strand. The rRNA is transcribed from DNA by RNAP, which is a self-assembled protein complex composed, in bacteria, from five different proteins.

To synthesize proteins, ribosomes require a pool of charged transfer RNA (tRNA) and a family of auxiliary proteins such as elongation, initiation, and maturation factors, as well as aminoacyl-synthetase proteins that catalyze the tRNA-amino acids charging. All of these auxiliary proteins are represented by the EF-Tu protein (green circle in Fig. 1). The tRNA is transcribed by RNAP and reaches its mature form with the help of some dedicated proteins (of course, mRNA is also transcribed by RNAP). Finally, membrane and division plane synthesis is facilitated by dedicated proteins. All these processes require numerous metabolites such as ribonucleosides and deoxyribonucleosides, amino acids, lipids and oligosaccharide, ATP, GTP, and NADH. All metabolites are either synthesized or imported from the outside by the metabolic machinery, which is mainly composed of metabolic proteins, represented by orange triangles in Fig. 1.

## The Catalytic Reaction Graph

To model the structure of a self-replicating factory, we follow ref. 9 and introduce a directed graph—the catalytic reaction graph,

Let *f* be the external set of materials (assumed to be fixed and abundant). For example, nucleotides and amino acids both belong to the set *F*. Let *I* be the DNA (instruction set), and *R* be the replisome machinery, which catalyzes the template replication of *I*. Furthermore, let *U* be the transcription–translation machinery (the universal constructor) that reads instructions from *I* and produces a copy of itself as well as all other processing units (proteins) that belong to *P*. The set *P* is responsible for biosynthesis processes and their control in the cell. It contains all of the proteins that are not members of *R* or *U*. For example, *P* includes metabolic proteins, transporters, and transcription factors. The set *V* represents the membrane-bound volume, made from all of the membranal layers including, e.g., the embedded protein transporters (which are a subset of *P*). Note that the membrane-bound volume is essential for all reactions, because in its absence reactions come to a halt due to molecular overcrowding.

To complete the definition of the catalytic reaction graph, we define the reaction node set. There are four types of reactions: *F* from the external materials in *f* (metabolism). *P*, *U*, and *R* (as well as RNA for private consumption) from the set *F* (transcription and translation), and the reactions in *I* while consuming elements from *F* using the existing *I* as template (DNA replication). The catalytic set *C* is composed of the subsets *P*, *U*, *V*, *R*, and *I*.

The following set of reactions summarizes the above description:*F*, *P*, and *I* are the union of the sets

In Fig. 2*A*, the directed reaction graph of the system described by Eq. **1** is presented. Fig. 2*B* illustrates the types of molecules composing each material node in the reaction graph using the same graphical notation as in Fig. 1. The exact composition is not assumed to be one molecule per type. For example, *U* have a different number of ribosomes, RNA polymerases, elongation, initiation maturation and termination factors, aminoacyl transferases, and tRNAs.

## From the Reaction Graph to the Project Graph

The project evaluation and review technique (PERT) is extensively used in system engineering as a means for scheduling important events in a project, and for estimating the expected project completion time and its distribution (10). A key concept in PERT is the project graph—a directed acyclic graph whose nodes represent milestone events in the project, marking the initiation or completion of a major project task; the directed edges represent the tasks themselves. In a production project, the initiation or completion of a production task (de novo biosynthesis) are the events, whereas the edge that connects them is the production task. Each edge in the project graph is also equipped with a nonnegative weight—the task duration. Edges also have resource demands—number and type of processing units and input materials required to perform them. We introduce two unique nodes in the project graph, *s* (start) and *t* (terminal), that mark the beginning and the end. The *s* node connects to all nodes that otherwise do not have entry nodes, and all of the otherwise terminating nodes feed into the *t* node (Fig. 3). We call a path from start to end an *st* path.

The structure of the project graph captures the temporal precedence constraints among all production tasks. The nodes function as “AND” gates, not allowing a new task to start (traverse an outgoing edge) before all predecessor tasks (preceding edges) are completed. Thus, the project graph is directed and acyclic, as it describes the progress in time of all of the tasks in the project.

Upon constructing the project graph and assigning durations to all its edges, the “optimal project completion time” *st* path, known as the critical path. Other *st* paths can have “slack”—a duration gap between their completion time and

In Fig. 2*C*, the project graph is constructed from the reaction graph (Fig. 2*A*) by splitting the nodes to account for their temporal order of appearance. For example, when all input material nodes depicted in orange and green in Fig. 2*C* are simultaneously present at

## The Scheduling Problem

In order for a biosynthetic task to be completed, the associated catalysts have to be allocated to the reaction along with the necessary input materials. Assuming the input materials are abundant, the following “scheduling problem” arises—how to temporally assign tasks to processing units such that the completion time of the entire project is minimized (5). In the noisy cellular milieu, we cannot expect scheduling algorithms to be precise. Instead, we ask what is expected from a randomized algorithm. To better characterize the scheduling problem, it is useful to introduce the following parameters. The workload *SI Text*, section 7, for an estimate of the number of self-replicating units in *E. coli* doubling every 24 min based on protein and DNA replication workloads).

When there are excess of catalysts, then whenever there is a demand for them, it can readily be met, provided that they never idle if there is an available task (greedy scheduling). Thus, although excess is wasteful in terms of free energy cost, it is beneficial in terms of speed, as it allows for optimal completion time, determined only by the critical path duration. Moreover, this excess also allows for proper balancing in the presence of random delays that affect the actual demand. However, at some point, excess of catalysts can become suboptimal, because without further control, a greedy scheduling policy can deplete either the input resources or the catalytic pool, by allocating catalysts to reactions that have randomly completed earlier, causing a speedup in the progress of certain activities at the expense of others. This problem is mitigated by balancing as discussed below. Note that because the different task durations are stochastic variables, *c*.

## Basic Unit of Parallel Self-Replication

Now we apply the average parallelism concept to the self-replicating factory. We first define the initial demand vector *C*, of size *j* required by reaction *i* (*S* is defined similarly, i.e., *j* consumed by reaction *i*. The vectors

Let

When, on the other hand, *p* is an integer, the doubling time *i*th reaction by a complete set of processing units *p* parallel self-replicating basic units advancing in unison with some residual statistical noise. In the absence of correlations, the resulting distribution of completion times is the maximum of identically and independently distributed random variables and hence equals to one of the three classical “extreme value statistics” limit distributions. However, due to the requirement for balanced production, the different reaction rates are coupled, and hence correlated. Thus, a different approach for calculating the distribution of doubling times is required.

To dynamically balance their inventory of catalysts and substrates, cells often use end-product feedback inhibition. For example, in the biosynthesis of the amino acid tryptophan, a metabolically costly amino acid, excess tryptophan binds to the enzyme at the top level of the dedicated metabolic pathway that produces it, thus down-regulating its own production (14). A beautiful mechanism for translational control was discovered by Nomura and coworker (15), where excess ribosome proteins that fail to bind to their targets alternatively bind to their respective mRNA, thus stopping their own production. All of these mechanisms use the excess end product, i.e., the part that was not consumed by downstream reactions, as a negative-feedback signal, down-regulating their own production, resulting in a dynamically balanced production line.

We define the basic unit of parallel self-replication as the smallest complete set of processing units and materials that is self-consistent: **1**). It is also natural to define the basic “serial” unit of self-replication as the minimal set of all initial states

An important class that interpolates between fully serial and fully parallel production is the pipelined self-replication, where the demand vector is charged in a way that allows one basic self-replicating unit to produce another basic self-replicating unit (marked with a star in Fig. 2*C*), with a time lag. Even before the catalysts finish building all of the machinery of the starred unit (next generation), the newly formed elements from the starred unit start replicating the double-star unit (marked with a double star in Fig. 2*C*). Thus, catalysts that finish their role in producing a given unit move to the next job of producing the same unit for a later generation. At steady state, there are multiple basic self-replicating units, with overlapping and lagging production cycles, until finally division (which is also scheduled periodically) resets this process. To conclude, in pipelined self-replication, there is an overlap in the production of several basic self-replicating units and an overall noninteger average parallelism (the integer part counts the number of overlapping cycles). The scheduling of self-replication can naturally lead to pipelined production, if tasks are ordered properly, and full parallelism is also possible when the lag times are set to zero across the factory.

## Calculating Optimal Doubling Time

Consider a given project graph, and assume that the duration of each task is a random variable distributed according to a distribution that is exponentially decaying at large times. To calculate the project completion time, we introduce the exponentially weighted adjacency matrix of the project graph, which is defined as follows:*β* is suppressed for brevity. The matrix trace *k* activities, exponentially weighted by their total duration. In the limit *k* steps and its integer multiples. We further divide *k*, because there are *k* equally contributing nodes on each cycle of length *k*, depending on where we locate the cycle’s start. Summing over all *k* values, we obtain a discrete path integral, which is a function of *β*. Dividing by *β*, we obtain the following:*l*th eigenvalue of **4** becomes exact in the limit **4** and further explanation, we refer the reader to the *SI Text*, sections 1–3 and Figs. S1 and S2.

## Balanced Self-Replication Has a Periodic Project Matrix

Let **1** and the notation presented in Fig. 2,

The balancing mechanism is important, as it prevents overproduction of certain materials on the expense of others. It locks the factory to the critical path. Recall that the node set *q* disjoint sets, *F*, *P*, *U*, *V*, *R*, or *I*, to

The project completion time of an irreducible periodic matrix with a large spectral gap (*SI Text*, section 3) can be calculated analytically, due to the discrete rotational symmetry of the adjacency matrix spectrum. The eigenvalues satisfy **4** and noticing *β* (see *SI Text*, section 3, for more details), we obtain the following:*SI Text*, section 2). The distribution of the maximal eigenvalue of a random Wishart matrix with heavy tails was recently shown to belong to the Frechet class when the entries are power law distributed (18). Thus, for *SI Text*, sections 1–6). This relation to the Wishart ensemble can breakdown if (*i*) *β* is small, where the Tracy–Widom distribution is expected (18). (*ii*) The spectrum is not gapped, e.g., when the project is not balanced (as in a totally serial project, where there is only one *st* path).

Our hypothesis regarding the existence of a replicative buffer can be tested experimentally by measuring the number of origins of replication right after division

## Comparison with Measured Distributions of Doubling Times

To test our prediction of optimal doubling time, we analyzed two recent datasets of *E. coli* growing in different media. The first dataset from ref. 6 is of strain K12, substrain MG1655 growing in Luria–Bertani (LB) medium. The second dataset is taken from ref. 7 and is of *E. coli* strains SX701 and JE116, based on *E. coli* strain BW25993 (7) growing in M9 medium supplemented with glucose, or M9 medium supplemented with glucose and amino acids. The experiment of ref. 6 contains an impressive number of single-cell growth and division measurements, out of which we filter roughly 200,000 clean divisions of either mother, daughter, or granddaughter *E. coli* cells exponentially growing in LB broth at 37 °C within a narrow microfluidic canal, at an average doubling time of 21 min (6). The experiment of ref. 7 contains 6,000 growth and division measurements of *E. coli* growing in **6**, and the second is by fitting an exponent to the measured growth curve. We obtained practically identical doubling times in the two methods. After filtering noisy data (fits with *β* is an arbitrary positive number. These parameters were then used for the log-Frechet distribution. As evident in Fig. 4, all doubling times from the three environments fit very well with the log-Frechet distribution. We tested two competing distributions, the gamma and the log-normal distributions, and found that they do not fit well with the data in contrast to the log-Frechet distribution (*SI Text*, Figs. S3 and S4).

## Conclusion and Outlook

We studied the distribution of doubling times of an optimally scheduled self-replicating bacteria. Invoking PERT and the notion of balanced growth (19), a periodic structure for the project graph of a self-replicating production process was suggested. The distribution of optimal doubling times for this graph structure was calculated and found to have a universal shape—the log-Frechet distribution. Contrary to hard combinatorial optimization problems, finding the optimal scheduling in a self-replicating factory is possible using a greedy scheduling scheme, provided that the inventory of processing units and material inputs are a larger-than-one multiple of the demand vector. To maintain such a stoichiometric balance, some level of control over the rate of production of different types of processing units is required, which leads to a periodic project graph (i.e., a project graph with a periodic adjacency matrix). To corroborate our model, we analyzed *E. coli* in three different media using the dataset of ref. 6 for LB media and of *E. coli* in M9 media with glucose and either with or without amino acids (data from ref. 7) and showed that their doubling-time distributions agrees well with the predicted optimal completion time distribution—the log-Frechet distribution, suggesting that *E. coli* is optimally scheduling its self-replication in these experiments. This natural tendency toward optimality when minimizing the doubling time is the optimization criterion, suggests that avoiding optimal growth in a permissive environment is in fact a challenge for the cell, which requires additional regulation.

## Acknowledgments

I gratefully acknowledge the support of the Fulbright Foundation, the Eric and Wendy Schmidt Foundation, and the Jensen Fellowship.

## Footnotes

- ↵
^{1}Email: rami.pugatch{at}gmail.com.

Author contributions: R.P. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The author declares no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1418738112/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵.
- von Neumann J,
- Burks AW

- ↵.
- Cairns-Smith AG

- ↵
- ↵
- ↵.
- Sinnen O

- ↵
- ↵
- ↵
- ↵
- ↵.
- Kerzner H

- ↵.
- Trietsch D

- ↵.
- Goldratt EM,
- Cox J

- ↵
- ↵.
- Stryer L

- ↵.
- Cole JR,
- Nomura M

*Escherichia coli*. Proc Natl Acad Sci USA 83(12):4129–4133 - ↵.
- Jensen JB,
- Gutin GZ

- ↵
- ↵.
- Auffinger A,
- Arous GB,
- Peche S

- ↵.
- Cooper S

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Biological Sciences
- Systems Biology

- Physical Sciences
- Physics

## Jump to section

- Article
- Abstract
- The Cell as an Autocatalytic Cycle
- The Catalytic Reaction Graph
- From the Reaction Graph to the Project Graph
- The Scheduling Problem
- Basic Unit of Parallel Self-Replication
- Calculating Optimal Doubling Time
- Balanced Self-Replication Has a Periodic Project Matrix
- Comparison with Measured Distributions of Doubling Times
- Conclusion and Outlook
- Acknowledgments
- Footnotes
- References

- Figures & SI
- Info & Metrics