# Highly variable recessive lethal or nearly lethal mutation rates during germ-line development of male *Drosophila melanogaster*

Edited* by Wen-Hsiung Li, University of Chicago, Chicago, IL, and approved July 22, 2011 (received for review January 5, 2011)

## Abstract

Each cell of higher organism adults is derived from a fertilized egg through a series of divisions, during which mutations can occur. Both the rate and timing of mutations can have profound impacts on both the individual and the population, because mutations that occur at early cell divisions will affect more tissues and are more likely to be transferred to the next generation. Using large-scale multigeneration screening experiments for recessive lethal or nearly lethal mutations of *Drosophila melanogaster* and recently developed statistical analysis, we show for male *D. melanogaster* that (*i*) mutation rates (for recessive lethal or nearly lethal) are highly variable during germ cell development; (*ii*) first cell cleavage has the highest mutation rate, which drops substantially in the second cleavage or the next few cleavages; (*iii*) the intermediate stages, after a few cleavages to right before spermatogenesis, have at least an order of magnitude smaller mutation rate; and (*iv*) spermatogenesis also harbors a fairly high mutation rate. Because germ-line lineage shares some (early) cell divisions with somatic cell lineage, the first conclusion is readily extended to a somatic cell lineage. It is conceivable that the first conclusion is true for most (if not all) higher organisms, whereas the other three conclusions are widely applicable, although the extent may differ from species to species. Therefore, conclusions or analyses that are based on equal mutation rates during development should be taken with caution. Furthermore, the statistical approach developed can be adopted for studying other organisms, including the human germ-line or somatic mutational patterns.

Because mutations manifest their effect through cell descendants, it is essential to determine the timing and rate of mutations during individual development. However, little is known about this fundamental aspect of life even for well-studied model organisms. Germ-line mutations (i.e., mutations that occur in the lineage of germ cells) are of particular importance because only they are inherited, and thus may have a lasting effect on a population. The past few decades have witnessed tremendous advances in obtaining the rate of mutation per generation for genes for many organisms (1–3). There has been slow but steady progress in documenting and understanding the relationship between various human genetic disorders and parental ages (4–7), ever since nearly a century-old observation (8) that achondroplasia was more frequently found in children whose fathers were more advanced in age. In comparison, little is known about the details of mutation at different stages of germ cell development. Our knowledge from biochemistry, individual development, and observation of the frequencies of some human genetic disorders indicates that mutation rates at different stages may differ. Knowing the details of mutational distribution during germ cell development will not only improve the understanding of many genetic disorders but shed light on broader issues in mutation research, particularly in population/evolutionary biology. Dissecting the mutational distribution requires not only knowledge of the dynamics of germ cell lineage and high-resolution data but a proper integration of both. To date, available observations and experiments from humans have yet to lead to a breakthrough in this area, perhaps partly because of the complexity of human germ-line development, the difficulty in separating compounding factors in observations, and a lack of proper mathematical models to integrate the information.

Central to the dissection of the mutational pattern during germ-line development is the observation of mutants in families that each have many offspring. Furthermore, different mutations leading to observable mutants in the same family need to be identified. *Drosophila* was among the first higher organisms used to identify spontaneous and induced mutations (9, 10). The development of a germ cell lineage in *Drosophila melanogaster* has been continuously studied for the past 70 y. As a result, the dynamics of the germ cell population are well understood. This study takes advantage of well-established techniques from decades of *Drosophila* research to generate an unprecedented mutation dataset in a well-controlled environment. The mutation screening experiment we used led to cost-effective observations of the number of mutants and the frequency of each independent mutation (usually 1 or 2) in each of 8,618 families.

Also necessary to the understanding of mutational patterns is a proper statistical framework for inference. We developed a likelihood framework for analyzing such data, which can be described as follows. For each family, suppose that there are, at most, two mutations. Let $n_0$ be the number of families without any mutation; $n_i$ the number of families with one mutation of size $i$ ($i > 0$) (i.e., the number of mutants among offspring is $i$); and $n_{ij}$ the number of families with two mutations, one of size $i$ and one of size $j$. Then, the likelihood of the data is

$$L \propto p_0^{n_0} \prod_{i} p_i^{n_i} \prod_{i \le j} p_{ij}^{n_{ij}}, \tag{1}$$

where $p_0$ is the probability that there is no mutation in a family; $p_i$ is the probability that there is one mutation of size $i$; and $p_{ij}$ is the probability that there are two mutations, one of size $i$ and one of size $j$. To make inferences about mutation rates at various stages of germ-line development, it is necessary to express $p_i$ and $p_{ij}$ in terms of mutation rates at various stages. The germ cell divisions from a fertilized egg to sperm will be divided into $I$ intervals. Suppose the mutation rate per cell division for the $i$-th interval is $u_i$, and define $\boldsymbol{u} = (u_1, \ldots, u_I)^T$. Ideally, $I$ is equal to the total number of cell divisions, such that the mutation rate at each cell division can be inferred; however, even with the large volume of data from our experiment, we still only have the resolution for a relatively small value of $I$. Nevertheless, tremendous insight into the rate variations can be learned. For the genealogy of a sample, each cell division corresponds to a segment of a branch. Let $t_k$ be the number of cell divisions from the $k$-th interval and $\boldsymbol{t} = (t_1, \ldots, t_I)^T$. Fig. 1 shows a hypothetical example of a sample genealogy of five cells with five cell divisions divided into three intervals (1: [1, 1], 2: [2, 4], and 3: [5, 5]), which results in $\boldsymbol{t}^T = (1, 9, 5)$. Assume that the number of mutations in a branch follows a Poisson distribution, with its parameter equal to the branch length times the mutation rate per cell division. Then, for the genealogy in Fig. 1, the probability of no mutation is $\exp[-(u_1 + 9u_2 + 5u_3)]$. In general, the number of mutations in a given genealogy is a Poisson variable with the parameter $\boldsymbol{t}^T\boldsymbol{u}$. Therefore, the probability that there is no mutation in a given genealogy is

$$e^{-\boldsymbol{t}^T\boldsymbol{u}} \approx 1 - \boldsymbol{t}^T\boldsymbol{u} + \tfrac{1}{2}\,\boldsymbol{u}^T A_0 \boldsymbol{u},$$

where $A_0 = \boldsymbol{t}\boldsymbol{t}^T$. One does not generally know the sample genealogy; therefore, taking into consideration many possible genealogies for the sample, we have

$$p_0 \approx 1 - \bar{\boldsymbol{t}}^T\boldsymbol{u} + \tfrac{1}{2}\,\boldsymbol{u}^T \bar{A}_0 \boldsymbol{u},$$

where $\bar{\boldsymbol{t}}$ and $\bar{A}_0$ are, respectively, the expected values of $\boldsymbol{t}$ and $A_0$ over all possible sample genealogies, which can be estimated numerically (an example is given in *Materials and Methods*). Similar but more involved analysis expresses the other required probabilities for maximum-likelihood analysis, $p_i$ and $p_{ij}$ ($i > 0$), in terms of $\boldsymbol{u}$ and constant vectors and matrices that can be estimated in the same way as $\bar{\boldsymbol{t}}$ and $\bar{A}_0$. The likelihood function, together with these expressions, allows for both the estimation and the hypothesis testing of $\boldsymbol{u}$ using the maximum-likelihood framework.
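The Poisson step above can be illustrated numerically. The following is a minimal sketch, assuming the Fig. 1 vector of branch segments (1, 9, 5) and a set of per-division rates that are purely hypothetical; it compares the exact zero-mutation probability with a second-order expansion of the exponential, in which the squared term equals the quadratic form built from the outer product of the segment vector with itself.

```python
import math

def prob_no_mutation(t, u):
    """Exact Poisson probability of zero mutations: exp(-t.u)."""
    rate = sum(tk * uk for tk, uk in zip(t, u))
    return math.exp(-rate)

def prob_no_mutation_quadratic(t, u):
    """Second-order expansion 1 - t.u + (t.u)^2 / 2.

    Note that (t.u)^2 equals u^T (t t^T) u, the quadratic form in u.
    """
    rate = sum(tk * uk for tk, uk in zip(t, u))
    return 1.0 - rate + 0.5 * rate * rate

t = (1, 9, 5)            # branch segments per interval for the Fig. 1 genealogy
u = (5e-3, 1e-5, 1e-3)   # hypothetical per-division rates, for illustration only

exact = prob_no_mutation(t, u)
approx = prob_no_mutation_quadratic(t, u)
print(exact, approx)
```

With rates of this magnitude the two values agree closely, which is why a low-order expansion in the rates is a reasonable working form.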

## Results

### Mutational Distribution.

A total of 8,618 families were successfully screened in our experiment over a 4-y period. Throughout the paper, a lethal or nearly lethal mutation is defined as one leading to no more than 1% of the surviving z/z offspring, which means that at least 100 offspring need to be examined for each claimed mutant. To minimize the chance that a mutant is not counted because of randomness, allelism tests were conducted for all lines with the percentage of z/z individuals up to 5%. Furthermore, to make the claim that two mutant lines share the same mutation, we required that among the offspring of the cross, the percentage of z/z individuals must also be no more than 1%. This stringent requirement will ensure a high quality for each identified cluster of mutants but has a slight tendency to lead to smaller cluster sizes than the true ones. Our plan was to screen 20 lines for each family; however, to ensure success, most families were screened for more than 20 lines. In our analyses, we randomly removed the extra lines in some families, such that each family has exactly 20 lines. We carried out analyses on several slightly different datasets derived as such. The results are virtually the same. Thus, we report one such analysis only. To make the framework of inference (Eq. **1**) applicable, we excluded several families with 3 or 4 mutations. Table 1 gives the frequencies of various mutation configurations. The distribution of families with various numbers of mutations can be derived from Table 1. Of the 8,618 families successfully screened, 954 harbored mutations, for a total of 1,036 different mutations. The numbers of families with 0, 1, and 2 mutations are, respectively, 7,664, 872, and 82. Among the 872 families with 1 mutation, 755 led to a singleton mutant. Roughly, the number of families with *i* mutations is an order of magnitude smaller than that with *i* − 1 mutations. 
The number of mutants sharing the same mutation is said to be the size of that mutation or cluster size. Each of the mutations thus falls into a size between 1 and 20. The frequencies of various size mutations can also be derived from Table 1, and they are given in Table 2. Although a mutation predominantly leads to a singleton mutant, the mean size of the clusters is 2.03 (i.e., a mutation leads, on average, to 2.03 mutants in a family of 20 offspring).
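The family counts quoted above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers given in the text; the variable names are illustrative.

```python
# Number of families by number of independent mutations, as quoted in the text.
families = {0: 7664, 1: 872, 2: 82}

total_families = sum(families.values())
families_with_mutation = families[1] + families[2]
total_mutations = sum(k * n for k, n in families.items())

print(total_families, families_with_mutation, total_mutations)
```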

### Statistical Inference.

The pattern of mutation rates along the germ cell lineage can be explored by dividing the germ cell development into intervals, such that estimates of the mutation rate, as well as the hypothesis test, can be made. For male *D. melanogaster*, each sperm from a young mature male is expected to have experienced 36 or more divisions, among which the first 14 divisions belong to the cleavage stage, the last 5 to spermatogenesis, and those between to gastrulation and organogenesis, in which the germ cells are known as germ-line stem cells that divide asymmetrically. In our analyses, we explored several ways to partition the divisions and found that the overall results are consistent. Therefore, we shall report the analysis based on one configuration that captures the essence of the results. The intervals are given in Table 3.

Before the likelihood analysis of mutation rates, the coefficients in the expressions of $p_i$ and $p_{ij}$ need to be known; they can be estimated using a Monte Carlo approach. Such estimation requires simulating genealogies that represent the sample from the sperm population in a male *D. melanogaster*, which can be accomplished through a two-step process. The first step is to simulate the dynamics of population size from a fertilized egg to the time spermatozoa are sampled. This step takes advantage of experimental evidence accumulated over decades of *Drosophila* research. The principal evidence (11–13) we used is as follows: (*i*) after the eighth cell division, about 4–6 cells become the primordial germ cells (PGCs); (*ii*) after the 12th division, the PGC number ranges from 23 to 52; (*iii*) after the 14th division, there are 5–9 PGCs in each gonad; and (*iv*) from the 15th division to right before spermatogenesis, the number of PGCs remains more or less constant. Once the first step is completed, we take a sample of 20 alleles from the population and use a coalescent-based approach (14) to simulate their ancestral process back to the fertilized egg. The process is repeated for many replicates, from which the estimates of coefficients can be derived. Our experience indicates that 100,000 replicates are usually sufficient; however, to ensure a high accuracy, we obtained our estimates from 500,000 replicates.

Once the coefficients in the expressions of $p$ are obtained, one can proceed to estimate $\boldsymbol{u}$. The maximum-likelihood estimates of $\boldsymbol{u}$ under eight different assumptions are given in Table 4. Under the assumption that mutation rates at different stages are equal, the common mutation rate per cell division is $0.345 \times 10^{-3}$ (corresponding to the row of $H_1$ in Table 4). On the other hand, when no constraint is imposed, $\boldsymbol{u}^T \times 10^{3} = (5.043, 0.001, 0.001, 0.006, 1.225)$, which indicates that the mutation rate at the first cell division is the highest, followed by the rate in the spermatogenesis stage. Note that the SEs associated with these two estimates are substantially smaller than the estimates themselves, indicating excellent quality of these estimates. Without any constraint on mutation rates, the per generation mutation rate can be estimated using Eq. **13** or, alternatively, by a prior method (15). Although the two estimates differ little, the maximum-likelihood estimate is superior because it carries a substantially smaller SE.

In addition to estimating the mutation rates, likelihood ratio tests of several hypotheses about $\boldsymbol{u}$ can be constructed from the values in Table 4, and their values are given in Table 5. For example, to test the null hypothesis that mutation rates over the development of the germ lines are all equal against the alternative that mutation rates at different intervals can all be different, we have the log-likelihood ratio statistic equal to $Lr = -2[\ln(L_1) - \ln(L_8)] = 751$; compared with the critical value of the $\chi^2$ distribution with 4 df, this result is highly significant. Table 5 shows that the hypothesis of constant mutation rates is overwhelmingly rejected with each of the seven alternative hypotheses. In comparison, hypotheses about two or more of $u_2$, $u_3$, and $u_4$ being equal cannot be rejected. We noted that the estimated mutation rate for the first cell division is considerably higher than that of spermatogenesis. Table 5 shows that the hypothesis of these two rates being equal is also rejected, as well as the hypothesis that the mutation rate in spermatogenesis is the same as that of the previous interval.
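To give a sense of how extreme a statistic of 751 is for 4 df, the tail probability of the χ² distribution can be computed directly. The sketch below uses the standard closed form of the χ² survival function for an even number of degrees of freedom; the function name is illustrative, and it is not the software used in the study.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2m, P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(751.0, 4)     # statistic and df quoted in the text
p_at_critical = chi2_sf_even_df(9.488, 4)  # near the usual 5% critical value
print(p_value, p_at_critical)
```

The second call is a sanity check: at roughly 9.488, the familiar 5% critical value for 4 df, the function returns a value close to 0.05.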

## Discussion

Our large-scale experiment was designed for detecting differences of mutation rates at various stages as small as half of the maximum rate. Taking advantage of the high resolution of the data, an inference framework that incorporates the knowledge of *Drosophila* development, and a rigorous statistical/computational method, we explored both the estimates of mutation rates at different stages of germ-line development as well as hypothesis testing. Our analyses show beyond a doubt that mutation rates vary significantly during the development of germ cells. Overall, the mutation rates in germ cell development exhibit a U shape, with the highest rate at the first cleavage, followed by a high rate at spermatogenesis, whereas most divisions in the middle have a mutation rate one or more orders of magnitude smaller. The likelihood ratio test of the null hypothesis of equal mutation rates for all cell divisions is overwhelmingly rejected with all the alternatives considered. Therefore, the notion of constant mutation rates during germ cell development should be abandoned.

Although it may not be too surprising to see a higher mutation rate at the first cleavage, it is unexpected in our analysis that the rate drops sharply from the second cleavage onward for the remaining divisions of cleavage. This is because the prior observation was that *Drosophila*-fertilized eggs divide about every 10 min in the cleavage stage and cellular activity is controlled by maternal proteins stored in the egg. Because the switch to zygotic control occurs at the end of the cleavage stage, one would probably expect that a significant change of mutation rate would occur around the end of the cleavage stage. To guard against incorrect assumption artifacts, we examined the consequences of alternative assumptions on the dynamics of germ cell lineage and on the outcome of the analysis, among which the assumption on the population after the eighth division appears to be most influential. It turns out that if one relaxes the range of the germ cells after the eighth division from 4–6 to 4–10, or restricts it to 2–4, and increases the total number of germ cell divisions from 36 to 40, the numerical results differ only slightly and all major conclusions remain the same. Our analysis also assumes that PGCs are formed by random sampling from the 256 cells after the eighth division. Although this is consistent with the *Drosophila* embryonic development literature (16), it is conceivable that some degree of nonrandomness leading to PGCs may exist because of spatial localization of closely related cells. The effect of nonrandom sampling can be investigated by restricting the germ cell population size at an earlier stage to be smaller than it normally should be (which is 4.72 ancestral cells at the 32-cell stage). Therefore, restricting the population size at the 32-cell stage to 4, 3, and 1–2 corresponds roughly to mild, modest, and severe sampling bias, respectively. For each of these restrictions, the same likelihood analysis was carried out. 
The likelihood under the assumption of random sampling has the largest value. For mild sampling bias, the log-likelihood value decreases slightly and all the conclusions made under the random sampling remain the same. For modest sampling bias, the estimated mutation rate at the first cleavage is larger than that for the second cleavage, but the difference is no longer significant. The log-likelihood value with modest bias is, however, significantly smaller than that of random sampling, such that the assumption of modest bias can be rejected at the 1% level. For severe sampling bias, the log-likelihood value decreases even more substantially. Taking the results of these additional analyses into consideration, we conclude that the mutation rate at the first cleavage is high. The rates drop sharply either immediately after the first division or in the next couple of cleavages, even with the possibility that sampling at the 256-cell stage may be biased to some extent (but extremely biased sampling is very unlikely).

Our study also indicates that the mutation rate at spermatogenesis is quite high, although significantly smaller than that of the first cleavage. There appear to be good reasons to expect this, because part of meiosis will weaken DNA repair mechanisms. Although our experiment screens for germ-line mutations of the male fly, sexual differentiation occurs late in development; thus, our conclusion of a high mutation rate for the first cleavage applies to the female fly as well. The per generation mutation rate is estimated to be 1.25%, which is comparable to previous estimates of completely recessive lethal mutations [1.2% in one study by Woodruff et al. (17) and 1.9% in another study by Woodruff et al. (18)].

Although making the experiment more manageable by examining only newly matured males, our experimental data do not allow one to address the potential rate changes during aging, which is an important aspect of mutation, particularly with regard to humans. Nevertheless, the results from this study have a number of implications. It is conceivable that the first conclusion stated in the abstract is true for most (if not all) higher organisms, whereas the other three conclusions are widely applicable, although the extent may differ from species to species. Therefore, conclusions or analyses that are based on equal mutation rates during development should be taken with caution. If overwhelmingly high mutation rates of the first cleavage (or first few cleavages) hold true, cells at the early stage of development will have accumulated a large number of mutations, which will then increase the opportunity for selection to act early. It will be of great interest to see if a similar mutation pattern holds for other organisms, particularly for humans. If so, it will be necessary to reevaluate some conclusions or approaches that have been based on assumptions of equal mutation rates. For example, the so-called “male-driven evolution” (19) can be better understood in light of the present work. It has been noted from various studies that the ratio of male to female cell divisions is often considerably larger than the ratio of estimated male to female mutation rates (20), which should be so if mutation rates in the first or first few cell divisions are two or more orders of magnitude larger than those in subsequent cell divisions.

Furthermore, the statistical approach developed in this paper can be adopted for studying other organisms, including the human germ-line or somatic mutational patterns. For humans, different approaches will be needed to generate mutations, and advances in the next generation of sequencing technology will undoubtedly help to accelerate the study of mutational patterns in the development of humans.

## Materials and Methods

### Experiment.

The mutation screening experiment employs a three-generation assay to screen autosomal recessive lethal or nearly lethal mutations in about 1,200 genes in *D. melanogaster* (18), which takes advantage of the balancer chromosomes that were pioneered by H. J. Muller for the purpose of maintaining newly isolated mutations, including recessive lethals, without selection (21, 22). Balancers for each of the major chromosomes of *D. melanogaster* contain multiple inversions and one or more dominant visible mutations. The inversions, which are mapped by the use of giant polytene chromosomes, act as crossover suppressors, and the clearly visible dominant mutations allow for the identification of heterozygotes. With these chromosome stocks, new lethal or nearly lethal mutations are balanced in the heterozygous state against the balancer chromosomes and the new lethal is not lost over time by recombination. Three types of autosomal haploid chromosomes (genomes), denoted by β, γ, and z, were used in the experiment.

The β-type balancer is homozygous lethal and is marked with the dominant visible and recessive lethal mutations, including Curly (*Cy*) wings, Lobe (*L*) eye, and Ultrabithorax (*Ubx*) enlarged halteres. It segregates as a unit and suppresses crossing over on both the second and third chromosomes (23). The γ-chromosome is also homozygous lethal and carries dominant markers. Type z represents a haploid genome with WT second and third chromosomes.

The experiment was designed to screen β/z male offspring of crosses between a single β/z male and multiple β/γ females to see if a new lethal or nearly lethal mutation occurred in chromosome z during the germ-line development of the father. Therefore, each family consists of offspring from the following cross:

Multiple β/γ ♀ × single β/z ♂

A total of 20–40 β/z ♂ offspring were each subjected to the following assay:

- *F*_{1}: Multiple β/γ virgin ♀ × single β/z ♂
- *F*_{2}: Multiple β/z virgin ♀ × multiple β/z ♂
- *F*_{3}: Observe number of z/z individuals

If a β/z male in the *F*_{1} step carries a lethal or nearly lethal mutation in the z chromosome, no surviving or few (≤1%) z/z individuals will be observed among the *F*_{3} offspring. The number of genes in *D. melanogaster* that harbor recessive lethal mutations is estimated (24) to be around 3,000. When there was more than one mutant in a family, allelism tests were conducted to determine if they shared the same mutation. This is done by crossing β/z offspring from different mutant lines. If the offspring of the cross have no or only a few z/z individuals, the two mutant lines can be considered to share the same mutation. The experiment was carried out at Yunnan University from October 2004 to October 2008.

A mating scheme similar to that described above was used successfully in earlier assays for the occurrence of mutation clusters in several laboratories (17, 18, 25–27). It was estimated that lethal or nearly lethal mutations identified by the assay span about 1,200 genes.

### Statistical Inference.

The germ cell divisions from a fertilized egg to sperm can be divided into $I$ intervals. Suppose the mutation rate per cell division for the $i$-th interval is $u_i$, and define $\boldsymbol{u} = (u_1, \ldots, u_I)^T$. For the genealogy of a sample, each cell division corresponds to a segment of a branch. A branch is said to be of size $i$ if it has exactly $i$ descendants in the sample. Let $a_{ik}$ be the total number of cell divisions from interval $k$ that are of size $i$, $\boldsymbol{a}_i = (a_{i1}, \ldots, a_{iI})^T$, and $\boldsymbol{t} = \sum_i \boldsymbol{a}_i$. That is, $t_k$ is the number of cell divisions from the $k$-th interval. For the genealogy shown in Fig. 1, we have $\boldsymbol{a}_1 = (0, 4, 5)^T$ because the branch in the first interval is of size 5. There are four cell divisions in the second interval that are of size 1 (2 in the branch leading to c as well as 1 to d and e each), and all cell divisions in the third interval are of size 1. Similarly, $\boldsymbol{a}_2 = (0, 4, 0)^T$, $\boldsymbol{a}_3 = (0, 1, 0)^T$, $\boldsymbol{a}_4 = (0, 0, 0)^T$, and $\boldsymbol{a}_5 = (1, 0, 0)^T$. Direct counting leads to $\boldsymbol{t}^T = (1, 9, 5)$, which can also be obtained by summing $\boldsymbol{a}_i$ ($i = 1, \ldots, 5$).
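The bookkeeping for the Fig. 1 example can be reproduced in a few lines. The sketch below stores the five size-class vectors quoted in the text, sums them to recover the vector of cell divisions per interval, and forms one outer product of a size-class vector with that total as an example of the matrices used in the next step; the helper name `outer` is illustrative.

```python
# Size-class vectors for the Fig. 1 genealogy: a[i][k] is the number of cell
# divisions in interval k+1 that lie on branches with exactly i descendants.
a = {
    1: (0, 4, 5),
    2: (0, 4, 0),
    3: (0, 1, 0),
    4: (0, 0, 0),
    5: (1, 0, 0),
}

# Total number of cell divisions per interval: element-wise sum over sizes.
t = tuple(sum(column) for column in zip(*a.values()))

def outer(x, y):
    """Outer product x y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]

A1 = outer(a[1], t)   # matrix a_1 t^T for the size-1 class
print(t)
print(A1)
```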

Suppose the number of mutations in a branch follows a Poisson distribution with its parameter equal to the branch length times the mutation rate per cell division. Then, given a genealogy, the number of mutations in the genealogy is also a Poisson variable with parameter $\boldsymbol{t}^T\boldsymbol{u}$. Therefore, the probability that there is no mutation in the genealogy is given by Eq. **3**. The probability that there is only one mutation of size $i$ in the genealogy is equal to

$$(\boldsymbol{a}_i^T\boldsymbol{u})\, e^{-\boldsymbol{t}^T\boldsymbol{u}} \approx \boldsymbol{a}_i^T\boldsymbol{u} - \boldsymbol{u}^T A_i \boldsymbol{u},$$

where $A_i = \boldsymbol{a}_i\boldsymbol{t}^T$. The probabilities that there are only two mutations, one of size $i$ and another of size $j$, are obtained similarly, with separate expressions for $i \neq j$ and $i = j$.

Because, apart from a few exceptions, all the families that have mutants in the experiment harbor either one or two mutations, we will not proceed further, although the approach can be extended to cover more complex situations.
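As a sketch of where such expressions come from, assume that mutation counts on different branches are independent Poisson variables, so that the number of size-$i$ mutations has parameter $a_i^T u$ and the total number has parameter $t^T u$. Under that assumption the exact probabilities, and their leading terms when expanded to second order in $u$, are as follows. The truncation order and the way the quadratic terms are written here are illustrative and may not match the notation of the published equations.

```latex
\begin{align*}
\Pr(\text{one mutation, of size } i \mid g)
  &= (a_i^T u)\, e^{-t^T u}
  \approx a_i^T u - u^T a_i t^T u, \\
\Pr(\text{two mutations, of sizes } i \neq j \mid g)
  &= (a_i^T u)(a_j^T u)\, e^{-t^T u}
  \approx u^T a_i a_j^T u, \\
\Pr(\text{two mutations, both of size } i \mid g)
  &= \tfrac{1}{2}\,(a_i^T u)^2\, e^{-t^T u}
  \approx \tfrac{1}{2}\, u^T a_i a_i^T u.
\end{align*}
```

Each line is the product of the Poisson probabilities for the relevant size classes, with all other classes contributing a factor for zero mutations; the factor of one half in the last line comes from the Poisson probability of exactly two events in a single class.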

Without knowing the sample genealogy, the probabilities in Eqs. **3**–**5** have to be integrated over all possible genealogies for a sample. Let $p_0$ be the probability that there is no mutation; $p_i$ the probability that there is one mutation of size $i$; and $p_{ij}$ the probability that there are two mutations, one of size $i$ and one of size $j$. They are, respectively, given by the corresponding expressions in which $\delta_x = 1$ when $x = 0$ and 0 otherwise, and in which each coefficient vector or matrix is replaced by its mean over genealogies. The above result thus leads to Eqs. **3**–**5**. Maximum-likelihood estimates of $\boldsymbol{u}$ can be derived from $\ln(L)$, which, from Eq. **1**, is

$$\ln L = n_0 \ln p_0 + \sum_{i} n_i \ln p_i + \sum_{i \le j} n_{ij} \ln p_{ij} + \text{constant}. \tag{12}$$

From Eq. **12**, the asymptotic covariance of the estimates can also be obtained. Let $c_k$ be the number of cell divisions in the $k$-th interval. Then, the per generation mutation rate, $u$, can be estimated as

$$\hat{u} = \sum_{k=1}^{I} c_k \hat{u}_k,$$

where $\hat{u}_k$ is the maximum-likelihood estimate of $u_k$. The variance of this estimate follows from the asymptotic covariance of the estimates. Suppose the total number of mutant lines in the experiment is $M$ and the total number of lines screened is $N$. Then, an alternative estimate of $u$ is $M/N$, which is unbiased regardless of whether the mutation rates during development are identical (15).

Hypotheses can be tested through the use of the likelihood ratio. For example, to test the null hypothesis $H_1$, that mutation rates at different cell divisions are all equal, against the alternative hypothesis $H_8$, that rates may all be different, the log-likelihood ratio test statistic is

$$Lr = -2[\ln(L_1) - \ln(L_8)],$$

where $L_1$ and $L_8$ are the maximized likelihoods under $H_1$ and $H_8$, respectively. The statistic is asymptotically a $\chi^2$ variable with $I - 1$ df.

### Estimation of Coefficients and Simulation of Genealogy.

A key to the statistical inference described above is the mean values of the various coefficients in Eqs. **3**–**5**. Because of their hierarchical relationship, only two of them are fundamental. By definition, each element of such a mean vector or matrix is the sum, over all possible genealogies g of the sample, of the value of that element in genealogy g weighted by Pr(g), the probability of genealogy g. Although their analytical solutions are intractable, they can be estimated with sufficient accuracy by computer simulation, which takes into consideration the developmental knowledge of the male *D. melanogaster*. Specifically, suppose M genealogies of the sample are simulated; then, each of these quantities can be estimated by the average of the corresponding element over the M simulated genealogies.
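The Monte Carlo averaging described above can be sketched for the simplest pair of coefficients, the mean of the division-count vector and the mean of its outer product with itself. The input below is a toy list of two per-genealogy vectors: the first is the Fig. 1 example and the second is hypothetical, standing in for the output of a genealogy simulator that is not shown here.

```python
def mean_coefficients(t_vectors):
    """Average t and the outer product t t^T over simulated genealogies."""
    m = len(t_vectors)
    dim = len(t_vectors[0])
    t_bar = [sum(t[k] for t in t_vectors) / m for k in range(dim)]
    a0_bar = [
        [sum(t[k] * t[l] for t in t_vectors) / m for l in range(dim)]
        for k in range(dim)
    ]
    return t_bar, a0_bar

# Toy input: (1, 9, 5) is the Fig. 1 genealogy; (1, 7, 5) is hypothetical.
simulated = [(1, 9, 5), (1, 7, 5)]
t_bar, a0_bar = mean_coefficients(simulated)
print(t_bar)
print(a0_bar)
```

Note that the mean of the outer product is not the outer product of the means (here the middle diagonal entry is 65, not 8 × 8 = 64), which is why the matrix has to be averaged genealogy by genealogy rather than computed from the mean vector.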

Adopting the common practice in population genetics, we used a discrete generation model for the cells in the germ-line lineage, which assumes that the population at the *i*-th generation consists of cells that are potentially ancestral to the spermatozoa, each of which has divided *i* times since the fertilized egg. Let *N*(*i*) be the population size at the *i*-th generation. The model further assumes that each cell divides into two daughter cells, and the (*i* + 1)-th generation is formed by sampling from the pool of these daughter cells. Developmental knowledge is used to specify the sampling schemes, which will be illustrated by example. The genealogy of a sample of *D. melanogaster* male germ cells can be simulated by a two-step process.

The first step is to simulate the composition of the *i*-th population. The *N*(*i*) cells at the *i*-th generation can be divided into two groups, one [*N*_{2}(*i*)] consisting of those that have siblings and another [*N*_{1}(*i*)] consisting of those that do not have a sibling. The simulation can be done sequentially as follows. Starting with a fertilized egg [thus *N*(0) = 1 at the 0th generation], the first division yields 2 daughter cells. Both can potentially be ancestral to the sperm cells; thus, *N*_{1}(1) = 0, *N*_{2}(1) = 2. These two cells divide into 4 cells, which then form the second generation; continuing this process will lead to *N*(7) = *N*_{2}(7) = 2^{7} = 128. Among the 256 daughter cells, only 4–6 are PGCs; thus, the eighth generation consists of cells that are a sample from these 256 cells. The main result shown in this paper assumes that the PGCs are a random sample from the 256 cells, but the algorithm can easily handle nonrandom sampling (the effects of nonrandom sampling are included in *Discussion*): first, randomly select a number between 4 and 6 (say 5), and then randomly select 5 cells of these 256 cells [and record the values of *N*_{2}(8) and *N*_{1}(8), which form the population at the eighth generation]. These 5 cells will then divide to form generation 9, and the process continues, leading to *N*(11) = 40 and 80 cells in their daughter pool. Because it is known that *N*(12) is between 23 and 52, similar to the previous situation, a random number between 23 and 52 is determined and the corresponding number of cells is sampled from the pool to form the 12th population. After the 14th division, the population splits into two, each consisting of 5–9 cells, and starts the stem cell period, which is characterized by asymmetric division. 
This can be modeled by assuming for each stem cell that one of its daughter cells at each division becomes a new stem cell, with a small probability (say 0.001) of being replaced by the second daughter cell of another stem cell. After the 31st division, the derived nonstem cells go into spermatogenesis, which results in spermatozoa. We modeled this by a simple model that assumes the cells after the 31st division resume symmetrical divisions and the last 5 divisions represent the process of spermatogenesis.
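The early part of this first step can be sketched in code. The following Python fragment is only an illustration of the bookkeeping through the 12th generation, assuming random sampling of PGCs; the function names and the representation of a cell by the index of its parent are assumptions made here, and the stem cell period and spermatogenesis are not covered.

```python
import random
from collections import Counter


def next_generation(parents, sample_size=None, rng=random):
    """Divide every parent cell in two and optionally subsample the daughters.

    Each daughter is labeled with the index of its parent, so two cells are
    siblings exactly when they carry the same label.
    Returns (cells, n1, n2), where n1 counts cells without a sibling in the
    new population and n2 counts cells whose sibling is also present.
    """
    pool = [p for p in range(len(parents)) for _ in range(2)]
    cells = pool if sample_size is None else rng.sample(pool, sample_size)
    counts = Counter(cells)
    n2 = sum(c for c in counts.values() if c == 2)
    n1 = len(cells) - n2
    return cells, n1, n2


def simulate_sizes_through_generation_12(rng=random):
    """Return a list of (N(i), N1(i), N2(i)) for i = 0, ..., 12."""
    cells = [0]            # generation 0: the fertilized egg
    history = [(1, 1, 0)]  # the egg has no sibling
    for gen in range(1, 13):
        if gen == 8:
            size = rng.randint(4, 6)    # 4-6 PGCs among 256 daughters
        elif gen == 12:
            size = rng.randint(23, 52)  # N(12) is between 23 and 52
        else:
            size = None                 # all daughters are kept
        cells, n1, n2 = next_generation(cells, size, rng)
        history.append((len(cells), n1, n2))
    return history
```

Calling `simulate_sizes_through_generation_12(random.Random(1))` returns one realization of the population sizes; only the sizes are kept between steps, which is all the second step requires.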

The second step in the simulation of the genealogy of a sample is the coalescent process with given population sizes at each division from the first step. A sample of *n* cells is taken from the 36th population, and their coalescence is determined backward in time. Consider *k* random cells taken from the *i*-th population; the number of coalescent events is then equal to the number of pairs of sibling cells among these *k* cells. For example, suppose *N*(*i*) = 10, *N*_{2}(*i*) = 4, and *k* = 4. Then, the probability of having two coalescents going back one generation is equal to

for having one coalescent, and 159/210 for having no coalescent.
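A one-generation coalescent step of this kind can be worked out by direct counting. The sketch below is a hedged illustration, not the authors' code: it assumes that the *N*_{2}(*i*) cells form *N*_{2}(*i*)/2 disjoint sibling pairs and that the *k* cells are drawn uniformly without replacement; the function name is hypothetical, and a different sampling scheme would give different values.

```python
from fractions import Fraction
from math import comb


def coalescent_distribution(n_total, n_with_sibling, k):
    """Exact distribution of the number of sibling pairs among k sampled cells.

    Assumes n_with_sibling cells form n_with_sibling // 2 disjoint sibling
    pairs and that the k cells are a uniform sample without replacement.
    Returns a dict mapping the number of coalescents j to its probability.
    """
    n_pairs = n_with_sibling // 2
    n_single = n_total - n_with_sibling
    total = comb(n_total, k)
    dist = {}
    for j in range(min(n_pairs, k // 2) + 1):
        rest = k - 2 * j  # cells still to choose, with no complete pair
        ways = 0
        # s = number of remaining pairs contributing exactly one cell
        for s in range(min(n_pairs - j, rest) + 1):
            ways += comb(n_pairs - j, s) * 2**s * comb(n_single, rest - s)
        dist[j] = Fraction(comb(n_pairs, j) * ways, total)
    return dist
```

With *N*(*i*) = 10, *N*_{2}(*i*) = 4, and *k* = 4, there are C(10, 4) = 210 equally likely samples and exactly one of them contains both sibling pairs, so `coalescent_distribution(10, 4, 4)[2]` is `Fraction(1, 210)`.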

The model of germ cell development, dynamics of the sizes of germ cell populations, and their relationship to the sample genealogy are illustrated by Fig. 2. Note that because only population sizes after each division are recorded in the first step, the genealogical relationship of the cells sampled in the second step is unknown and there are many plausible genealogies. One important feature of the sample genealogy is that it always traces back to the fertilized egg rather than stopping at the most recent common ancestor (MRCA); consequently, its height (from the time of sampling back to the fertilized egg) is a constant that is identical to the height of the germ-line lineage (36 divisions in our analysis). This is a marked difference from the genealogy of a sample in population genetics, where every sample may have a different age for its MRCA. Therefore, the meanings of the intervals of divisions remain the same regardless of whether one is referring to the history of the germ line or the sample genealogy.

Table 6 shows the coefficient estimates for the interval divisions in the main text; because of space limitation, SEs are given for only a subset of the components. Because the first cell division leads to 2 cells, the second division to 4 cells, and so on, it follows that, on average, 1.889 of the 2 cells are present in the sample genealogy, 2.842 of the 4 cells are present in the genealogy, and so on. The SE of the latter estimate, for example, is 0.694/707 = 9.8 × 10^{−4}, which shows the high accuracy of the estimation. In our final analysis, coefficients were estimated with at least 1 million simulated genealogies.
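The quoted SE has the form of a standard deviation divided by the square root of the number of simulated genealogies. A minimal sketch of that arithmetic follows, assuming that 0.694 is the standard deviation across replicates and that 707 is the square root of the replicate count (an inference from the numbers, not something stated in the text); the function name is hypothetical.

```python
import math


def standard_error(sample_sd, n_replicates):
    """Standard error of a Monte Carlo mean: sd / sqrt(number of replicates)."""
    return sample_sd / math.sqrt(n_replicates)


# Assumption: a divisor of 707 corresponds to 707**2 = 499,849 replicates,
# that is, roughly half a million simulated genealogies.
n_implied = 707 ** 2
print(f"{standard_error(0.694, n_implied):.1e}")  # prints 9.8e-04
```

Under the same assumption, doubling the number of replicates to about 1 million would shrink the SE by a factor of about 1.41.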

## Acknowledgments

We thank all who contributed to this project, particularly those who performed part of the experiment, including Ji-fen Li, Zhen Xie, Fang Chen, Jian-rui Zhou, Qun-li Wang, Lei-hua He, Qian-qian Zhao, Zong-jun Luo, Rui-lin Zhang, Ji Yao, Rui-hong Zhang, Xing-qin Yin, Xiao-qing Yang, Ji-qin Liu, Ji-xin Yang, Xing-yun Wang, Tian-fen Zhang, Yong-ping Meng, Qiu-qi Li, Yan-hong Du, Shu-li Xiong, Mei Zhang, Mei-lan Huo, Li-xian Lou, Xiao-yun Jiang, and Ya-ting Liu. This work was supported, in part, by grants from the Chinese National Science Foundation (Grant 30570248 to Y.-X.F., Grant 30460026 to J.-J.G., and Grant 30621092 to Y.-P.Z.) and the National Basic Research Program of China (973 Program, Grant 2007CB411600 to Y.-P.Z.), funds from the Bureau of Science and Technology of Yunnan Province of China (to Y.-P.Z.), and the Endowment Fund from the University of Texas (to Y.-X.F.).

## Footnotes

^{1}To whom correspondence may be addressed. E-mail: yunxin.fu{at}uth.tmc.edu or zhangyp{at}mail.kiz.ac.cn.

Author contributions: J.-J.G., R.C.W., Y.-P.Z., and Y.-X.F. designed the experiment; R.C.W. contributed fly stocks; J.-J.G., X.-R.P., J.H., L.M., J.-M.W., Y.-L.S., and S.A.B. performed research; Y.-X.F. developed statistical methods and analyzed data; and Y.-X.F. wrote the paper.

The authors declare no conflict of interest.

*This Direct Submission article had a prearranged editor.

Freely available online through the PNAS open access option.
