# Mathematical model of colorectal cancer initiation

^{a}Department of Applied Mathematics, University of Washington, Seattle, WA 98195;^{b}Oncode Institute, 3521 AL Utrecht, The Netherlands;^{c}Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences, 3584 CT Utrecht, The Netherlands;^{d}University Medical Center Utrecht, 3584 CX Utrecht, The Netherlands

See allHide authors and affiliations

Contributed by Hans Clevers, June 21, 2020 (sent for review February 28, 2020; reviewed by Niko Beerenwinkel and Marek Kimmel)

## Significance

Cancer evolution cannot be observed directly in patients, and new methodologies are needed for obtaining a quantitative understanding of this obscure process. We develop and analyze a stochastic model of malignant transformation in the colon that precisely quantifies the process of colorectal carcinogenesis in patients through loss of tumor suppressors *APC* and *TP53* and gain of the *KRAS* oncogene. Our study employs experimentally measured mutation rates in the colon and growth advantages provided by driver mutations. We calculate the probability of a colorectal malignancy, the sizes of premalignant lesions, and the order of acquisition of driver mutations during colorectal tumor evolution. Our model can be used as a starting point in evaluating early detection and treatment efforts.

## Abstract

Quantifying evolutionary dynamics of cancer initiation and progression can provide insights into more effective strategies of early detection and treatment. Here we develop a mathematical model of colorectal cancer initiation through inactivation of two tumor suppressor genes and activation of one oncogene, accounting for the well-known path to colorectal cancer through loss of tumor suppressors *APC* and *TP53* and gain of the *KRAS* oncogene. In the model, we allow mutations to occur in any order, leading to a complex network of premalignant mutational genotypes on the way to colorectal cancer. We parameterize the model using experimentally measured parameter values, many of them only recently available, and compare its predictions to epidemiological data on colorectal cancer incidence. We find that the reported lifetime risk of colorectal cancer can be recovered using a mathematical model of colorectal cancer initiation together with experimentally measured mutation rates in colorectal tissues and proliferation rates of premalignant lesions. We demonstrate that the order of driver events in colorectal cancer is determined primarily by the fitness effects that they provide, rather than their mutation rates. Our results imply that there may not be significant immune suppression of untreated benign and malignant colorectal lesions.

Worldwide efforts to sequence cancer genomes over the past decade have provided insight into the number and type of genetic alterations implicated in tumor progression (1). Recent work suggests that the acquisition of just three driver alterations (2), often inactivation of two tumor suppressor genes (TSGs) and activation of one oncogene (1, 3), is sufficient to produce an invasive phenotype. In the case of colorectal cancer (CRC), the most significant driver events are inactivation of the TSGs *APC* and *TP53* and activation of the *KRAS* oncogene (4). Activation of an oncogene requires a mutation of a single copy of the gene; inactivation of a TSG requires two genetic alterations (inactivation of both copies) (5), leading to approximately five mutational stages on the way to CRC.

Early work on multistage modeling of cancer incidence aimed to determine the number of mutations (stages) needed for cancer initiation by fitting multistage models to cancer incidence data (6, 7). Notably, Armitage and Doll (6) studied sequential accumulation of neutral mutations in a tissue and reported that approximately six stages were needed for CRC initiation. Knudson (7) used mathematical modeling and incidence data to conclude that two hits are needed for initiation of retinoblastoma; this finding was subsequently confirmed by the discovery of the Retinoblastoma (*Rb*) TSG. Subsequent work studied two-stage models allowing for clonal expansions (8, 9) or *n*-stage models where mutations accumulate in a predetermined (linear) order, typically allowing the last premalignant stage to provide proliferative advantage to cells, with the goal of explaining reported cancer incidence and estimating mutation and proliferation rates from cancer incidence data (10, 11). However, even though some driver mutations tend to be acquired before others on the way to cancer, this order is frequently violated (12⇓–14).

In the context of CRC, Luebeck, Moolgavkar, and coworkers studied the two- to five-stage multistage clonal expansion (MSCE) model, which includes a clonal expansion of cells in the final premalignant stage (10, 11), and showed that it provides a better fit to CRC incidence data compared to Armitage and Doll’s neutral model (11). The MSCE model was later expanded to include the clonal expansion of malignant cells, allowing for modeling of the size distribution of the malignant population (15). Recently, Lang et al. studied cancer incidence and size distributions of premalignant (adenoma) and malignant cells in a two-stage branching process (16) model of CRC initiation and fit their results to data on size distribution of adenomas and carcinomas to obtain estimates for rates of development of CRC (17).

In contrast to previous approaches, here we use a genomically informed stochastic model in which we allow mutations to occur in any order, leading to a biologically realistic complex network of possible evolutionary pathways that lead to cancer. Our approach is based on experimentally testable hypotheses about molecular mechanisms and gives predicted incidence and timing of mutations as outputs. We parameterize the model using recent experimental estimates of in vivo mutation rates in the colon (18) and selective advantages of driver mutations of interest (19, 20) and show that reported lifetime CRC risk can be recapitulated by assuming that *APC* inactivation and *KRAS* activation provide small proliferative advantages to colorectal crypts, which is consistent with current experimental measures. We use the model to provide estimates of the sizes of premalignant lesions and mutation order in the earliest stages of colorectal tumorigenesis.

## Results

### Model and Parameters.

We develop and analyze a stochastic mathematical model of CRC initiation through acquisition of three driver events: inactivation of tumor suppressors *APC* and *TP53* and activation of the *KRAS* oncogene. In this model, normal (wild-type) tissue consists initially of *N* colorectal crypts. Inactivation of a TSG requires inactivation of both alleles (TSG^{−/−}). *KRAS*, an oncogene, can be activated through mutation of a single copy of the gene (*KRAS*^{+}). We do not impose any particular order in which these genetic events are acquired. This leads to a complex network of available five-step pathways from an initially wild-type colorectal crypt to the one that has collected all three driver events (Fig. 1*A*).

Each allele of a TSG can be inactivated through either mutation or loss of the allele via a loss of heterozygosity (LOH) event. Loss of both alleles is not allowed, as this is usually fatal for the cell. *r*_{G} is the rate of mutation of a gene G, and *r*_{LOH} is the rate of loss of an allele of a TSG (*Materials and Methods*). In our model, the wild-type crypt genotype is denoted by (0,0,0), corresponding to no driver alterations in any of the three genes. An *APC* wild-type crypt (*APC*^{+/+}), whose genotype is denoted (0,*,*), can inactivate a single copy of the *APC* gene (*APC*^{+/−}) through LOH, which occurs with rate *r*_{LOH}, leading to genotype (1,0,0), or through mutation, which occurs with rate *r*_{APC}, leading to genotype (2,0,0).

If the first copy of *APC* was inactivated through LOH, the second copy can only be inactivated through mutation. *r*_{APC} is the rate of driver mutation of the wild-type *APC* gene, which has two alleles available for an inactivating mutation. In an *APC*^{+/−} gene, there is only one wild-type allele available for an inactivating mutation, so mutation from genotype (1,*,*) occurs with rate *r*_{APC}/2, leading to genotype (3,*,*). If the first copy of *APC* was inactivated through mutation, the second copy can be inactivated through LOH [with rate *r*_{LOH}/2, leading to genotype (3,*,*)], or through a second mutation [with rate *r*_{APC}/2, leading to genotype (4,*,*)]. The possible driver events in the *APC* gene and their corresponding rates are summarized in Fig. 1*B*. Genetic alterations in *TP53* occur in a similar manner as described above for *APC* (with mutation rate *r*_{TP53} instead of *r*_{APC}). For example, we denote by (0,2,0) a crypt that has mutated a single copy of *TP53*. A *KRAS* wild-type crypt has genotype (*,*,0) and can gain an activating *KRAS* mutation (*KRAS*^{+}) with rate *r*_{KRAS}, leading to genotype (*,*,1). The assumptions described above lead to 50 possible genotypes (5 · 5 · 2), including the four malignant genotypes (3, 3, 1), (4, 3, 1), (3, 4, 1), and (4, 4, 1). These transitions and their corresponding rates are summarized in Fig. 1*B*.

We can view the model of CRC initiation through the acquisition of the three driver events as the process of movement of crypts from the initial wild-type genotype (0,0,0) through a “space” consisting of different possible premalignant genotypes and ending in one of the malignant genotypes. In addition, we allow crypts that have accumulated certain driver alterations to undergo fission and increase in number. Crypt fission involves the division of a single crypt into two daughters through a bifurcation of a parental crypt (21). While normal crypts divide only very rarely in adult colonic tissue (20), crypt fission drives the expansion of mutant crypts in premalignant colorectal lesions (adenomas) (21). We assume that crypt fission occurs stochastically and follows a pure birth process (22) with some fixed birth (i.e., division) rate that depends on the crypt genotype.

Many of the parameter values for this model have only recently become available. Mutation rates are gene-specific: They are the product of the number of positions in the gene that will result in (in)activation if mutated (driver positions) and the point mutation rate. The number of driver positions in the specific gene can be estimated from genetics knowledge and cancer mutation databases (23, 24) (see *Materials and Methods* for details). Intestinal stem cells reside at the bottom of intestinal crypts, with each crypt containing five to eight stem cells (25). It has recently become possible to measure the rate at which mutations accumulate in normal tissues, using organoids developed from single healthy stem cells from the colon, small intestine, and liver (18). Blokzijl et al. (18) reported that individual stem cells of the colon accumulate about 40 mutations per year across the entire genome, leading to a rate of *u* ∼1.25*10^{−8} mutations per year per base pair. As colonic stem cells divide on average every 5 d (26), this measured mutation rate corresponds to mutation rate of ∼1.7 * 10^{−10} per base pair per stem cell division, in line with estimates of normal mutation rate per cell division in human cells (27). New mutations that appear in a single crypt will either be lost or will fixate in the crypt (28), suggesting that the measured rate of accumulation of mutations applies at the level of a single crypt.

There are approximately *n* = 10^{7} to 10^{8} crypts in the human colon (26, 29). It has been shown that *APC* inactivation and *KRAS* activation lead to clonal expansion of crypts carrying one of these alterations (30, 31). Division rates of *APC*^{−/−} and *KRAS*^{+} crypts have been recently measured to be *b*_{1} = 0.2/y and *b*_{2} = 0.07/y, respectively (19, 20). In contrast, *TP53* has been shown not to provide any fitness advantage on its own in normal conditions (32). As normal crypts divide only very rarely (20), we set division rates of all genotypes without inactivated *APC* or activated *KRAS* to 0. In the following, we will consider three scenarios: 1) all drivers are neutral, including *APC* and *KRAS*, and division rates of all genotypes are set to 0; 2) *APC* provides growth advantage: division rate of *APC*^{−/−} crypts is set to *b*_{1} = 0.2/y, genotypes without *APC* inactivated have division rates set to 0; and 3) *APC* and *KRAS* provide growth advantage: division rate of *APC*^{−/−} crypts is *b*_{1} = 0.2/y, division rate of *KRAS*^{+} crypts is *b*_{2} = 0.07/y, division rate of double-mutant *APC*^{−/−}*KRAS*^{+} crypts is *b*_{12} = *b*_{1} + *b*_{2} = 0.27/y, and all other genotypes have division rates set to 0. Setting the division rate of double-mutant *APC*^{−/−}*KRAS*^{+} crypts to *b*_{12} = *b*_{1} + *b*_{2} amounts to assuming that there are no epistatic interactions between *APC* and *KRAS*: While this is an area of active research, the size and direction of any such effect, if it exists, is not presently known with confidence for these two genes (33). Using the three scenarios, we will demonstrate that neutral accumulation of driver mutations is not sufficient to produce reported lifetime risk of CRC.

### Time of Appearance of the First Malignant Crypt.

We are interested in the time of appearance of the first crypt that has collected all three driver events—the first malignant crypt. Approximately 15% of CRCs harbor all three driver events in question (*APC*, *TP53*, and *KRAS*) (34). The lifetime risk of colorectal adenocarcinoma (35) is ∼4.2%. Thus, the lifetime risk of developing CRC containing these three driver events is ∼0.6%. We will compare our model predictions to this incidence rate.

We analyze the model through computer simulations and analytic calculation. First, we study a model in which driver events are neutral until all three drivers are collected, that is, we set crypt division rates of all genotypes to 0. We show that the neutral assumption leads to incidence rates that are many orders of magnitude lower than what is reported. In the case of neutral accumulation of driver mutations, the probability that a malignant crypt exists by time *t* is given by (*SI Appendix*)**1**, assuming *n* = 10^{8} crypts in the colon (29). If all mutations are neutral, probability of a malignant crypt at age 80 y (26) is less than 1 in 100 million—six orders of magnitude lower compared to the ∼1% of the population that develop CRC through acquisition of these three driver events. The assumption that all three driver mutations are neutral is thus not consistent with reported incidence.

In agreement with this analysis, it has been shown that *APC* inactivation in colorectal stem cells does lead to a fitness advantage, and that stem cells that sustain *APC* loss will go on to produce an adenoma (polyp), which can be up to a centimeter in diameter or larger (30). This will dramatically increase the chances for subsequent driver mutations within one of the cells of the *APC*^{−/−}-mutant clonal polyp.

We generalize Eq. **1**, which was derived under assumption of no crypt fission, to the case when *APC* inactivation provides crypts with proliferative advantage in the form of division rate *b*_{1} (*SI Appendix*). All other genotypes (without inactivated *APC*) have division rates set to 0. In that case, probability of a fully malignant crypt being present by time *t* is given by**2** and compare it to computer simulations of the process. Eq. **2** agrees well with probability of malignancy obtained from simulations (Fig. 2). Assuming that *APC* inactivation provides proliferative advantage to crypts leads to the probability of a malignant crypt at age 80 y, *P*(80) = 3.4*10^{−5}, which is still approximately three orders of magnitude lower than reported incidence.

We note that Eq. **2** does not take into account the increased chance that an *APC*-mutated stem cell will take over a crypt. It was measured that a stem cell with a single inactivated copy of the gene (APC^{+/−}) has a 2.1 times higher chance to take over a crypt compared to a wild-type stem cell, and that a stem cell with both copies inactivated has a 2.8 times increased chance to take over a APC^{+/−} crypt (32, 36). Thus the probability from Eq. **2** should be multiplied by *c*_{1} = 2.1*2.8 = 5.88 to obtain the corrected probability of a malignant crypt at age 80 y, *P*_{c}(80) = 2*10^{−4}, still 30-fold lower than reported incidence.

Finally, we study a model in which *APC* and *KRAS*, but not *TP53*, provide selective growth advantage to crypts, as *TP53* mutation does not seem to provide selective advantage on its own in normal conditions (32) (i.e., without the presence of colitis). *KRAS* has been reported to provide growth advantage to colorectal stem cells and crypts, typically leading to small clonal expansions of *KRAS*-mutated crypts (31). In this final scenario, we set division rate of *APC*^{−/−} crypts to *b*_{1} = 0.2/y, division rate of *KRAS*^{+} crypts to *b*_{2} = 0.07/y, division rate of double-mutant *APC*^{−/−}*KRAS*^{+} crypts to *b*_{12} = *b*_{1}+*b*_{2} = 0.27/y, and all other genotypes have division rates set to 0.

We derive the approximate formula for the probability that a malignant crypt is present by age *t* (*SI Appendix*):*c* = *c*_{1} * *c*_{2} = 5.88 * 3.6 = 21.2 is the correction to the formula to account for increased chance of fixation of *APC* and *KRAS* mutated stem cells in a crypt (32, 36). We plot Eq. **3** as a function of age *t* and compare it to computer simulations of the mathematical model of CRC initiation in Fig. 2. Eq. **3** is in good agreement with computer simulations of the process until age 60 y but starts to overestimate the probability of malignancy at later times. We note that the chance that a malignant crypt is present at age 80 y from computer simulations of our model is 0.5%, very close to the lifetime risk of CRC through the *APC*/*TP53*/*KRAS* pathway of 0.6% (Fig. 2). The comparison remains similar if we use cumulative CRC risk until age 85 y from the SEER database (35), 0.15*5% = 0.75%, which allows for 5 y between the first malignant cell and cancer detection (15).

In Fig. 2, we used *n* = 10^{8} for the number of colorectal crypts (29). As the estimates for the number of crypts vary between 4*10^{7} (26) and 10^{8} (29), we explored which degree of epistasis would explain the lifetime CRC risk, assuming that *n* = 4*10^{7}. Using computer simulations of our model, the closest fit to the reported lifetime risk is obtained by assuming that the division rate of double *APC*^{−/−}*KRAS*^{+} mutants is *b*_{12} = 0.31/y, which is ∼15% higher than the growth rate of double *APC*^{−/−}*KRAS*^{+} mutants without epistasis (0.27/y).

### Number of Mutated Crypts.

We next use our model to calculate the number of mutated colonic crypts as a function of time. The formulas for expected numbers of crypts that have inactivated a single tumor suppressor or activated the *KRAS* oncogene are given in *SI Appendix* and plotted in Fig. 3*A*. *KRAS* activation requires only a single hit; initially, the number of *KRAS*-mutated crypts is much higher than crypts with inactivated *APC* or *TP53*. However, due to the much faster fission rate of *APC*-inactivated crypts, their number surpasses the number of *KRAS*-mutated crypts by middle age. The number of crypts with only *TP53* inactivated remains low throughout the human lifetime, due to a lack of growth advantage provided by *TP53* alone (32) (Fig. 3*A*).

Formulas for the expected numbers of double-mutant colonic crypts as a function of age are given in *SI Appendix* and plotted in Fig. 3*B*. *APC*^{−/−}*KRAS*^{+} crypts far outnumber the other two types of doubly mutated crypts: By age 60 y, the expected number of *APC*^{−/−}*KRAS*^{+} crypts is ∼1,000, whereas the numbers of *APC*^{−/−}*TP53*^{−/−} and *TP53*^{−/−}*KRAS*^{+} crypts stay below ∼10 and below 1, respectively, throughout the human lifespan (Fig. 3*B*).

### Order of Mutations.

The primary outstanding question that we aim to address is in what order these three functional mutations (*APC*, *TP53*, and *KRAS*) typically appear over the course of cancer initiation. To answer this, we must differentiate between two relevant questions: 1) In which order do driver mutations arise in a typical colorectal crypt, and 2) what is the order of mutations that was experienced by the first malignant crypt? A typical colorectal crypt will not have any of the driver events in question: 99.998% of *N* initially healthy crypts will not have activated *KRAS* or deactivated *APC* or *TP53* by age 80 y. Of the small number of *N* initial crypts that have collected a driver event, the most likely first event in a typical crypt is the activation of *KRAS*, as it only requires a single mutation (hit). This starts to answer question 1: A large majority of crypts will be functionally normal, and the most likely first driver accumulated by a typical colorectal crypt is *KRAS*.

We are however, more interested in question 2, the order of driver events that will actually produce a malignant crypt during a human lifespan. To answer this question, we developed a new Monte Carlo simulation, related to the previous simulation we used to measure cancer incidence. In the new simulation, instead of simply moving cells between different sites in the “space” of possible genotypes, all possible orders of the five mutational steps were tracked (*SI Appendix*). Cells could arrive at one of the four possible cancerous genotypes (corresponding to double mutation, or mutation and LOH of the two tumor suppressors) by 270 different possible paths (Fig. 4 *A*, *Left*). The probability for each of these paths to be taken by a crypt that became malignant by age 80 y was then measured, and that path was classified into one of the six possible sequences of the three driver events (Fig. 4 *A*, *Right*).

The result of this mutational order analysis is shown in Fig. 4*B*. The majority of the first driver events in a path that ended in malignancy are in fact the inactivation of *APC*, accounting for 67.9% of first events. All of the remainder (32.1%) were activation of *KRAS*, as *TP53* was never observed to be lost first. The majority of second driver events were gain of *KRAS* (63.8%), all of which followed from *APC* being lost first. The second most common second driver event (32.1%) was the inactivation of *APC*: In a small minority of cases (4.1%), *TP53* was lost second. The last driver event in an outstanding majority of cases (95.9%) was the loss of *TP53*.

It is remarkable that the loss of *APC* was so often the first mutational event to have occurred on the path to malignancy, given that in a typical colorectal crypt *KRAS* gain is the first driver event. Our explanation for this is that the loss of *APC* carries the largest fitness advantage of any event in our model, so cells arriving at a cancerous genotype will tend to come from lineages in which *APC* is lost early. The loss of *TP53* was overwhelmingly the last event to occur out of the three. We attribute this to the lack of a fitness advantage ascribed to *TP53* loss, and the consequent small number of crypts which carry *TP53*^{−/−} as one of the first driver events (Fig. 3 *A* and *B*).

The order of mutations in our mathematical model is strikingly similar to the well-known order of mutations on the path to CRC that was first described by Fearon and Vogelstein (37). In line with our results, *KRAS* mutations were found to be present with high frequency in the earliest preneoplastic lesions in the colon and absent in morphologically normal crypt areas (38). However, while *KRAS* mutations are very common in small nondysplastic lesions, such lesions have limited potential to progress to larger tumors and ultimately cancer (39).

In contrast, lesions with *APC* inactivation as the first event have a much stronger capacity to progress (39). *APC* was found in the earliest neoplastic lesions that can be examined (39) and is believed to precede both *KRAS* and *TP53* in the adenoma–carcinoma sequence of CRC (37). Progression of *APC*-mutated adenomas usually occurs through subsequent acquisition of *KRAS* and other genes (39, 40). *KRAS* mutations were found in ∼10% of colorectal adenomas smaller than 1 cm and in ∼40% of adenomas larger than 1 cm (41), thus most likely to appear after *APC* was inactivated and started an adenoma, but before malignant transformation. As in our model, evidence points to inactivation of *TP53* being a relatively late event in colorectal tumor progression (12, 37, 42), associated with the transition from benign adenoma to malignant carcinoma (43).

In *SI Appendix*, we derive an approximate analytic result for the relative probability of each individual path to malignancy (Fig. 4 *A*, *Left*). Using the formula, we find that there are four most likely paths to malignancy, each accounting for ∼10% of the conditional probability to reach a malignant state by age 80 y. The relative likelihood of taking one of these four paths, *X*_{i}, to a malignant state *x* by time *t* is given by*b*_{1} is the division rate of *APC*^{−/−} crypts, *b*_{2} is the division rate of *KRAS*^{+} crypts, and *b*_{12} is the division rate of double-mutant *APC*^{−/−}*KRAS*^{+} crypts. The four most likely paths to malignancy all have two copies of *APC* inactivated first, followed by activation of *KRAS*, followed by inactivation of two copies of *TP53*. The four comes from the possible ordering of inactivation of two copies of TSGs (mutation followed by LOH, or vice versa). We note that these four most likely paths are exactly the paths along which the driver mutations are accumulated in exact order of highest to lowest growth advantage.

A striking finding from our analytic results on path probabilities (see *SI Appendix* for details) is that the relative probabilities of each path to malignancy are independent of transition rates between genotypes (i.e., driver mutation rates and rate of LOH) and only depend on the division rates of intermediate genotypes visited by a particular path. We have also performed lineage-tracking computer simulations to explore whether relative probabilities of paths to malignancy are independent of transition (mutation) rates in the process and find that relative path probabilities remain approximately the same across different values of transition rates (*SI Appendix*, Fig. S3).

## Discussion

In this paper we develop and analyze a stochastic model of CRC initiation through a biologically realistic complex network of possible mutational genotypes on the path to malignancy. In this model, we calculate the probability that a colorectal crypt has collected three driver events of interest (*APC*, *KRAS*, and *TP53*) and became malignant as a function of age. We find that the reported lifetime risk of CRC can be recovered using a mathematical model of CRC initiation together with experimentally measured mutation rates in colorectal tissues and proliferation rates of premalignant lesions. This result has several important implications. First, it sheds light on the important mutation–selection debate in cancer. In our model, selection acts through increasing the division rate of crypts with driver mutations above the base rate for crypts with no such drivers, enabling a slow but (on average) exponential increase in the number of crypts with drivers. Our results demonstrate that neither increased mutation rates nor large proliferation rates of premalignant lesions are necessary to explain observed cancer incidence rates. Our findings are in agreement with recent results from Wojdyla et al. (44), who also find that neither increased mutation rates nor large growth advantages are needed to explain incidence of single-driver malignancy secondary myelodysplastic syndrome. Another intriguing implication of our findings is that the immune system may not be exerting significant pressure on the premalignant and malignant lesions during tumor evolution, as the reported lifetime risk of colorectal malignancy (which includes those cancers that escaped immune surveillance and were detected in the clinic) is not lower than the probability of developing CRC predicted by the model. Our model includes all premalignant lesions and cancers with the three drivers in question that were expected to appear throughout the human lifespan. Had there been a significant immune suppression of these tumors, we would expect the prevalence of detected cancers to be significantly lower than the prevalence predicted by our model. This finding is in agreement with recent reports that there may not be significant immune selection pressure during natural tumor evolution (45).

Our model provides a precise quantitative understanding of the process of premalignant and malignant transformation in the human colon and can be used as a starting point in evaluating early detection and treatment efforts. In particular, our results imply that, even though individual *KRAS*^{+} lesions are less likely to progress to malignancy compared to *APC*^{−/−} lesions, *KRAS* mutation may be the initiating driver event in as many as a third of CRCs containing this alteration; thus, efforts to target *KRAS* (46) may prove particularly important in a significant fraction of patients diagnosed with CRC harboring a *KRAS* mutation.

While we compare our model predictions with the lifetime risk of CRC from the SEER database (35), we are not able to compare our model predictions to age-specific CRC incidence, as the SEER database does not include information on age-specific incidence of CRC with our three drivers of interest (*APC*, *TP53*, and *KRAS*). A secondary issue is that other factors such as hereditary predispositions and familial syndromes are expected to have a significant effect on the shape of the incidence curve. For example, it is thought that up to one-third of CRCs exhibit increased inherited familial risk, due to several well-defined familial syndromes as well as a number of less penetrant, but possibly more frequent, susceptibility genes (47). These inherited conditions include those with germline inactivation of one copy of the *APC* gene (which results in one less required step on the way to CRC), germline mutations in mismatch repair genes (which lead to an increased mutation rate), or mutations in many other (possibly unknown) genes with lesser known modes of action (47).

Many previous studies of (colorectal) cancer initiation employ models in which mutations accumulate in a predetermined order, typically with one to three initial neutral steps needed for appearance of a benign lesion (adenoma), followed by a clonal expansion of the benign lesion, and a single mutation leading from adenoma to carcinoma (10, 11, 15, 17). The advantages of such approaches are that they can provide simple unified models that can serve as a description of the process of cancer initiation in the entire at risk population and can be fit to cancer incidence curves to extract parameters of the model. The drawbacks of such models are that they do not typically ascribe specific molecular events to the steps in the model, do not allow for mutations to occur in any order, and do not allow for multiple clonal expansions during the premalignant phase. In contrast, our model allows for a complex network of intermediate premalignant genotypes, all possible orders in which mutations can accumulate, and multiple clonal expansions prior to malignancy; it also ascribes specific molecular events to the steps on the path to malignancy and can thus be parameterized with experimentally measured mutation and expansion rates. However, our detailed model can only describe 15% of all CRC cases (i.e., those that occur through acquisition of *APC*, *TP53*, and *KRAS* driver mutations) and cannot be fit to full cancer incidence curves using currently available epidemiologic data.

Several previous models of CRC initiation study dynamics of individual cells in colorectal crypts (48). We circumvent the need to analyze the accumulation of mutations in the stem cell populations of colonic crypts by using the recently measured rate of accumulation of mutations in normal tissues (18). New mutations are either lost or fixate in the crypt (28), which allows us to focus on crypts as the level of selection. Research suggest clonal stabilization time (i.e., time for a newly introduced mutation to be lost or fixate in the crypt) in mice crypts is ∼1 mo and in human colon ∼1 y (28, 49). As the reported times apply to neutral mutations, it is expected that driver mutations that we are studying fixate even faster, on the order of months if not less, and thus the measured rate of accumulation of mutations in colonic stem cells can also be applied at the level of individual crypts.

Healthy cells are chromosomally stable, but most CRCs are chromosomally unstable. Thus, at some point during tumor evolution cells acquire chromosomal instability (CIN). Several early studies suggested that CIN is an early event in colorectal tumorigenesis, even preceding *APC* inactivation (50, 51). However, more recent studies suggest that extensive aneuploidy such as seen in the presence of CIN occurs after inactivation of *TP53* (52, 53) or possibly inactivation of both *APC* and *TP53* (54). Since *TP53* inactivation usually follows *APC* inactivation and *KRAS* activation (37, 42), it is likely that acquisition of our three driver events of interest precedes significant CIN.

We use the measured accumulation rate of mutations in normal tissues, which includes contribution from the mutations that arise from unavoidable errors associated with DNA replication, as well as any environmental factors (55). Once the relative contributions to the overall mutation rate from these two sources are known, our model can be used to quantify the relative proportions of sporadic CRCs that arise predominantly due to replicative errors (“bad luck”) (26) or environmental factors (56).

We find that only a small fraction of human colorectal crypts will carry one of the driver mutations we study (fully inactivated *APC* or *TP53*, or activated *KRAS*). Our results do not contradict the recent report of probable driver mutations in around 1% of normal colorectal crypts in middle-aged individuals (57). Lee-Six et al. (57) examined ∼1,000 colorectal crypts from 42 individuals and examined the crypts for mutations in 90 known CRC genes. None of the crypts carried a driver mutation in *KRAS*, and no crypts carried two inactivating alterations of *APC* or *TP53*, implying that the frequency of such alterations is <0.1%.

Here we assume that point mutation rate does not increase during early stages of premalignant colorectal neoplasia. As more detailed experimental measures become available, our framework could in the future be extended to account for the potential increase in mutation rates of premalignant mutational genotypes on the road to malignancy. Of note, as the predicted probability of CRC from our model is of the same order of magnitude as the reported lifetime risk of CRC, we do not expect mutation rates to be significantly increased prior to malignant transformation (58).

## Materials and Methods

### Inactivation Rates for TSGs.

A single copy of a wild-type TSG can be inactivated through either mutation or loss of the allele via an LOH event. Inactivating mutation rate per two wild-type copies of a TSG is *r*_{TSG}, where TSG can be either *APC* or *TP53*. Rate of LOH per two wild-type copies of a TSG is *r*_{LOH}. Thus, if the crypt is TSG wild type (TSG^{+/+}), it can inactivate the first copy of the tumor suppressor through either mutation or LOH with a joint rate of *r*_{TSG} + *r*_{LOH}. If one copy of a TSG is already inactivated (TSG^{+/−}) through mutation, the second copy can be lost by either mutation or LOH, with the rate (*r*_{TSG} + *r*_{LOH})/2. If, on the other hand, the first copy was lost through an LOH event, the second copy can only be lost through mutation, with rate *r*_{TSG}/2. In other words, we are implicitly assuming that the loss of both copies through LOH is fatal for the cells.

### Gene-Specific Driver Mutation Rates.

We estimate mutation rates of individual driver genes *D*, *r*_{D}, by multiplying the base pair mutation rate *u* with the number of positions in the gene that will lead to (in)activation if mutated, *n*_{D}. In other words, *r*_{D} = *n*_{D}**u*. We previously estimated the number of driver positions in the TSG *APC* (23). A copy of an *APC* gene can be inactivated by a genetic alteration that results in a stop codon. There are 8,529 protein-coding bases encoding 2,843 codons in the *APC* gene. Of these 2,843 codons, there are 1,045 codons in which a base substitution resulting in a stop codon could occur. Only one base per each of the 1,045 codons could mutate to produce a stop codon (e.g., GAG could only mutate to TAG to produce a stop codon). Similarly, typically only one of the three possible substitutions at that base could result in a stop codon. Therefore, there are 1,045/3 = 348 bases “available” for creating stop codons, that is, 348 driver positions in the *APC* gene. Stop codons can also result from small insertions or deletions, and their abundance compared to single base substitutions has been estimated to be 0.74:1 (23). The total number of driver positions in *APC* is therefore *n*_{APC} = 604 (174% of 348 nonsense driver positions).

There are 1,179 protein-coding bases in the *TP53* gene, encoding 393 codons. Out of the 393 codons, 110 codons can be mutated via a single base substitution to produce a stop codon. In these 110 codons, only one out of three bases can be mutated to produce a stop codon. At 95 of these 110 bases, only one of the three possible substitutions can produce a stop codon. At 15 out of the 110 bases, two out of the three possible substitutions can produce a stop codon. Thus the total number of driver positions in the *TP53* gene is 1.74*(110 + 15)/3 = 73. Finally, there are approximately *n*_{KRAS} ∼20 positions in the *KRAS* gene that lead to its activation if mutated (59).

### Rate of LOH.

The rate of LOH in the absence of CIN can be estimated from mutational data from CRCs with microsatellite instability (MIN), since MIN and CIN are mutually exclusive (60, 61). Huang et al. (62) report that the ratio of MIN cancers with two inactivating mutations in *APC* to those with one inactivating mutation and one LOH event is 1:7. Let us denote by *k* the ratio between the rate of LOH and the rate of inactivating mutation in MIN cancers. The fraction of cases where the first event is LOH is *k*/(*k* + 1), and the fraction of cases where first event is mutation is 1/(*k* + 1). The fraction of cases where first event is a mutation and second event is LOH is 1/(*k* + 1) *k*/(*k* + 1), and the fraction of cases where both events are mutations is 1/(*k* + 1) *1*/(*k* + 1). It follows that the ratio of cases where one event is LOH to those where both events are mutations is *k*^{2} + 2*k*. From *k*^{2} + 2*k* = 7 we obtain that *k* ∼1.8. Thus, the rate of LOH at the *APC* locus in MIN cancers is 1.8 times higher compared to the rate of inactivating mutation. CRCs with MIN have ∼10 times increased point mutation rate compared to non-MIN cancers (63, 64), so we conclude that the rate of LOH is ∼18 times higher compared to the normal rate of inactivating mutation in *APC*: *r*_{LOH} = 18* 604*1.25*10^{−8} = 1.36*10^{−4} per year. We assume that rates of LOH at *APC* and *TP53* are approximately the same.

## Data Availability.

Simulation code is available at https://github.com/chaypaterson/CRC_Initiation.

## Acknowledgments

This work was supported in part by the University of Washington Transitional Support Program.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: h.clevers{at}hubrecht.eu or ibozic{at}uw.edu.

Author contributions: H.C. and I.B. designed research; C.P. and I.B. performed research; and C.P., H.C., and I.B. wrote the paper.

Reviewers: N.B., ETH Zurich; and M.K., Rice University.

The authors declare no competing interest.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2003771117/-/DCSupplemental.

Published under the PNAS license.

## References

- ↵
- B. Vogelstein et al.

- ↵
- C. Tomasetti,
- L. Marchionni,
- M. A. Nowak,
- G. Parmigiani,
- B. Vogelstein

- ↵
- J. G. Reiter,
- C. A. Iacobuzio-Donahue

- ↵
- ↵
- ↵
- ↵
- A. G. Knudson Jr.

- ↵
- ↵
- ↵
- E. G. Luebeck,
- S. H. Moolgavkar

- ↵
- R. Meza,
- J. Jeon,
- S. H. Moolgavkar,
- E. G. Luebeck

- ↵
- C. S.-O. Attolini et al.

- ↵
- ↵
- ↵
- E. G. Luebeck,
- K. Curtius,
- J. Jeon,
- W. D. Hazelton

- ↵
- D. Axelrod,
- M. Kimmel

- ↵
- B. M. Lang,
- J. Kuipers,
- B. Misselwitz,
- N. Beerenwinkel

- ↵
- ↵
- ↵
- ↵
- ↵
- N. T. J. Bailey

- ↵
- I. Bozic et al.

- ↵
- COSMIC

- ↵
- ↵
- C. Tomasetti,
- B. Vogelstein

- ↵
- C. Tomasetti,
- I. Bozic

- ↵
- F. Campbell et al.

- ↵
- ↵
- H. Lamlum et al.

- ↵
- H. J. Snippert,
- A. G. Schepers,
- J. H. van Es,
- B. D. Simons,
- H. Clevers

- ↵
- L. Vermeulen et al.

- ↵
- J. van de Haar et al.

- ↵
- ↵National Cancer Institute Surveilance, Epidemiology, and End Results Program, Cancer Stat Facts: Colorectal Cancer. https://seer.cancer.gov/statfacts/html/colorect.html. Accessed 16 December 2019.
- ↵
- I. Bozic,
- M. A. Nowak

- ↵
- ↵
- ↵
- J. Jen et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- T. Wojdyla et al.

- ↵
- J. Van den Eynden,
- A. Jiménez-Sánchez,
- M. L. Miller,
- E. Larsson

- ↵
- ↵
- ↵
- ↵
- Y. Yatabe,
- S. Tavaré,
- D. Shibata

- ↵
- I. M. Shih et al.

- ↵
- M. A. Nowak et al.

- ↵
- ↵
- ↵
- ↵
- C. Tomasetti,
- L. Li,
- B. Vogelstein

- ↵
- ↵
- ↵
- ↵
- I. G. Serebriiskii et al.

- ↵
- ↵
- A. de la Chapelle,
- H. Hampel

- ↵
- J. Huang et al.

- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Genetics

- Physical Sciences
- Applied Mathematics