New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution

Contributed by Phil Green, June 10, 2004
Related Article
 Biography of Phil Green Sep 28, 2004
Abstract
We describe a model of neutral DNA evolution that allows substitution rates at a site to depend on the two flanking nucleotides (“context”), the branch of the phylogenetic tree, and position within the sequence and implement it by using a flexible and computationally efficient Bayesian Markov chain Monte Carlo approach. We then apply this approach to characterize phylogenetic variation in contextdependent substitution patterns in a 1.7megabase genomic region in 19 mammalian species. In contrast to other substitution types, CpG transition substitutions have accumulated in a relatively clocklike fashion. More broadly, our results support the notion that contextdependent DNA replication errors, cytosine deamination, and biased gene conversion are major sources of naturally occurring mutations whose relative contributions have varied in mammalian evolution as a result of changes in generation times, effective population sizes, and recombination rates.
Despite their fundamental role in evolution and genetic disease, relatively little is known about the causes of naturally occurring mutations in mammalian genomes. Even basic questions, such as the relative proportions attributable to replication errors or to chemical or radiation damage, remain unresolved. Neutrally evolving genomic DNA in principle provides a faithful record of the mutations occurring within it, and through its analysis, an increasingly complex picture of the characteristics of the mutation process is emerging. Studies of pseudogenes have found that transition substitutions occur at higher rates than transversions and that substitutions from S (G or C) to W (A or T) nucleotides generally occur at a higher rate than those from W to S (1, 2). The nucleotides that flank a site have a large (≈50fold) effect on substitution rate (3, 4); the most dramatic instance is CpG dinucleotide “hotspots” (5), where the elevated rate reflects deamination of methyl cytosine, but there are significant (and as yet not understood) effects of other flanking nucleotides as well. Such “context effects” are also detected in studies of singlenucleotide polymorphisms (6) and diseasecausing mutations (7).
With the availability of large genomic datasets, more subtle trends are being uncovered. Comparison of human and mouse genomic sequences have revealed that substitution rates vary by position on a large scale (8). Recombination rate is correlated both with overall substitution rate (9) and with the ratio of W→S to S→W rates (10), the latter correlation probably reflecting biased gene conversion (11–13). There is an asymmetry in the substitution process within transcribed regions, with higher rates of purine than of pyrimidine transitions on the nontranscribed strand (14); this is hypothesized to result from an asymmetry in DNA polymerase errors that is uncovered by transcriptioncoupled repair.
Analysis of evolutionarily diverse, multispecies datasets, such as those being developed by the NISC Comparative Sequencing Program (15), provides increasing opportunity to gain insight into the biological factors that may underlie these observations by studying how trends vary across organisms and sequences. For this purpose, it is useful to have models of neutral evolution that incorporate as many of the above complexities as possible. In addition to illuminating the mutation process, such models should improve our ability to detect functional features in the genome as nonneutrally evolving regions, and they should help increase the effectiveness of standard sequence analysis methods, such as alignment and phylogenetic reconstruction. Inadequate models not only reduce analysis power but can also lead to misleading conclusions (16–18).
Of the mutational complexities mentioned above, the most difficult to accommodate mathematically is context dependence of rates. Several recent studies have developed evolutionary models allowing for context dependence but in an approximate or partial fashion. These models include a Markov chain Monte Carlo (MCMC) approach for estimating CpG effects in pairwise alignment (19); an approximate likelihood method that requires fixing the ancestral sequence to deduce CpG effects along a star topology (20, 21); and two studies that assume approximate models of neighbor dependence along a tree to estimate contextdependent rates (ref. 22 and research.microsoft.com/research/pubs/view.aspx?tr_id=687).
In this paper, we describe a rigorous evolutionary model incorporating context effects that are allowed to depend on sequence position and lineage. We implement the model by means of a flexible and computationally efficient Bayesian MCMC approach that permits large numbers of model parameters to be estimated simultaneously and reliably and apply it to analyze variation in substitution trends across a 19species mammalian phylogeny in a 1.7megabase (Mb) genomic region. We find that, in contrast to other contextdependent rates, CpG transition substitutions have accumulated in a relatively clocklike fashion; our analysis also helps illuminate factors that may underlie failure of the molecular clock for other substitution types. We find variation in the ratios of W→StoS→W rates and CpG transitions to total rate, which appears to reflect varying generation times, effective population sizes, and recombination rates during mammalian evolution. We also gain further insight into the transcriptionassociated substitution asymmetry.
Methods
Evolutionary Model. We assume we are given an alignment of several homologous sequences, together with sequence annotations indicating the presence of biological features, and a rooted evolutionary tree indicating the ancestral relationships among the sequences. The alignment is taken to imply which positions are homologous in the different sequences. Bases present in some sequences but not others at a given position reflect insertion or deletion events (indels), which we assume have been assigned (e.g., by parsimony) to particular locations on the tree. Conditional on the alignment and the assigned indel locations, we seek to model neutral sequence evolution (base substitutions and the composition of inserted segments) along the tree, allowing substitution rates to depend on the two flanking bases and on position within the sequence and within the tree. Evolution is assumed to occur independently along each tree branch.
To simplify computation, each branch is partitioned into two or more small discrete time units (such that the average substitution rate per time unit is ≤0.005), and, at most, one substitution at each sequence site is permitted per time unit. We index the tree positions that separate time units as t = 0,..., m, with 0 being the root, m – k + 1,..., m being the k leaves (corresponding to the observed sequences), and 1,..., m – k being the internal positions. For each t, let b_{t} be the branch on which t lies and (if t ≠ 0) β _{t} the tree position that immediately precedes t. We assume the indexing is such that β _{t} < t.
For bases x ≠ z, let ψ _{ib} (wxy→z) = κ _{b} _{στ}λ_{ση}(wxy→z) be the probability that, in one time unit of branch b = b_{t} , the base x at position i and time β _{t} mutates to z at time t, given neighboring bases w and y at time β _{t} . The index σ = σ _{i} specifies the “region type” in which i falls; η = η _{b} specifies a grouping of branches; and τ = τ _{wxy} _{→} _{z} specifies the “substitution type” of the contextdependent substitution wxy→z. We allow two region types (transcribed and untranscribed). In an initial analysis (see Results and Discussion), we assume a single τ that includes all contextdependent substitutions and allow variation among lineages by means of different η values for each of the major clades. In subsequent analyses, we assume a single η and allow variation among lineages by means of different τ values for each set of similarly behaved contextdependent substitutions. We scale λ_{ση} such that its weighted average for each choice of σ, η, and τ is 1, i.e., where f_{wxy} is the trinucleotide frequency in observed sequences; hence, the product of the scaling factor κ _{b} _{στ} and the number of time units in branch b approximates the expected number of substitutions of type τ for each target base per applicable σ site along the branch. The probability of no substitution is ψ _{ib} (wxy→x) = 1 – Σ _{z} _{≠} _{x} ψ _{ib} (wxy→z). For σ corresponding to untranscribed regions, we assume that complementary events have equal rates: λ_{ση}(wxy→z) = λ_{ση}(y ^{c} x ^{c} w ^{c}→z ^{c}), where x ^{c} denotes the complement of x. For notational convenience, we let ψ _{ibt} (wxy→z) = 1if t = 0orif w, x, y, or z = φ (the gap character).
We model the distribution of bases x that are at the root or that are newly inserted as an inhomogeneous secondorder Markov chain with transition parameters π_{ρ}(xv, w), where v and w are the bases that immediately precede x. The index ρ = ρ _{i} permits different categories of sequence composition. If x is not a root or a newly inserted base or if v or w = φ, we let π_{ρ}(xv, w) = 1. Secondorder Markov chains have been found to roughly approximate shortterm dependencies in DNA sequences (23). In our analyses, we allow four distinct ρ values that reflect whether the sequence position i is transcribed and whether it is within an annotated repeat. No symmetry conditions are imposed on the π values.
Let X_{it} ∈ {A, C, G, T, N, φ} denote the base or gap at the ith site of the sequence at tree position t for 1 ≤ i ≤ n and 0 ≤ t ≤ m, where n is the number of alignment columns. Because indel locations are assumed known, the set of X_{it} assigned as gaps is fixed. We let X_{it} = N if and only if m – k + 1 ≤ t ≤ m and the corresponding position in the observed sequence has an unspecified base. Let denote the nearest neighboring base (ignoring gaps) to the left of X_{it} in X _{·} _{t} ; we set = N if X_{it} is the first base in the sequence. Similarly, denotes the nextnearest neighboring base to the left of X_{it} , and X ^{+} the neighboring base to the right. We think of X _{it} as the “complete data,” composed of the observed data D (corresponding to X_{it} with m – k + 1 ≤ t ≤ m) and the “missing data” M (corresponding to X_{it} with 0 ≤ t ≤ m – k).
Under our model, the X_{it} form a unilateral Markov random field (24), and we define the probability of X, conditional on the parameter values and indel locations, as where θ = {λ, π, κ} consists of the substitution rate parameters λ_{ση}, the root and inserted sequence distribution parameters π_{ρ}, and the branchscaling parameters κ _{b} _{στ}. Note that for a given i and t, at least one of the two factors is 1. In the special case in which the sequences are ungapped, κλ is small, and λ is independent of time, sequence position, and neighboring bases, this discretetime model closely approximates the continuoustime model of DNA sequence evolution described in ref. 25.
Bayesian MCMC. We adopt the Bayesian approach to statistical inference (26) and estimate the posterior parameter distribution p(θD) implied by the observed data D. A prior distribution p(θ) represents any known information regarding the parameters before D is observed; we use uninformative priors that assign equal probability density to all combinations of parameter values satisfying λ, κ, π > 0, Σ _{z} _{≠} _{x} κ _{b} _{στ}λ_{ση}(wxy→z) < 1, and Σ _{x} _{≠G} π_{ρ}(xv, w) < 1. With increasing length of sequence data, p(θD) approaches a normal distribution centered on the maximum likelihood estimate of the parameters, independently of the prior (27).
A powerful approach to Bayesian analysis of missing data problems is MCMC sampling from p(θ, MD), the joint distribution of the parameters and of the missing data given the observed data (28). A Markov chain with stationary distribution p(θ, MD) is used to generate the sample (θ^{(1)}, M ^{(1)}), (θ^{(2)}, M ^{(2)}),...,(θ^{(sk)}, M ^{(sk)}), where each (θ^{(i)}, M ^{(i)}) is some realization of θ and M.
At each step of this chain, we update either a single parameter θ_{i} or a single missing data component M_{i} = {X_{it} t = 0,..., m – k}. Each θ_{i} is updated according to the distribution p(θ_{i}θ_{–i}, M, D), where θ_{–i} denotes all parameters other than θ_{i}, and M_{i} is updated according to p(M_{i} θ, M _{–} _{i}, D). (Details regarding the updating procedure and other implementation issues are available as Supporting Text, which is published as supporting information on the PNAS web site). Total run time for analysis of our dataset, with 0.84 billion updates, was ≈36 hours on a single 1.2GHz IBM POWER4+ processor.
After a large number of updates, the sample of realizations is effectively drawn from p(θ, MD). To reduce the storage requirements, we record the parameter values only for a set of evenly spaced realizations of θ, i.e., θ^{(k)}, θ^{(2k)},..., θ^{(sk)} for some k. The posterior distribution p(θD) is then approximated by the sample distribution of θ^{(k)}, θ^{(2k)},..., θ^{(sk)}, and the sample mean is an estimate of θ_{i}.
The difference between and the true value θ_{i} may be decomposed as , where is the (unknown) maximumlikelihood estimate. There should be no strong dependency between the two terms, and so . We estimate by using the initial monotone_{1} sequence estimator (29) and as the sample variance [because maximum likelihood estimates and posterior distributions both have variances approximated by the inverse Fisher information (30)]. As the number of samples s increases, the estimate of is approximately constant, whereas decreases at the rate 1/s (29). For our analysis runs below, the estimated is ≈3% of the estimated , suggesting that most of the variance in the estimates arises from the finite amount of data rather than the MCMC approach.
By asymptotic normality of (27), 95% confidence intervals for are approximated by . Estimates of functions of the parameters, such as averages of substitution rates over contexts, are derived as with variance and confidence intervals estimated as above. Reliability of the variance estimates and of the normality assumptions was checked by simulation (see below).
The software used in this analysis is available from www.phrap.org.
Dataset. We analyzed sequences of the greater cystic fibrosis transmembrane conductance regulator region from 19 mammals (Fig. 1), generated by the NISC Comparative Sequencing Program (15) and aligned by using tba (32). (The dataset is available at www.nisc.nih.gov/data.) This region spans ≈1.7 Mb and includes nine genes. Sequence positions present in at least three species, including representatives from at least two of four major clades (primates, rodents + rabbit, carnivores + horse + artiodactyls + hedgehog, and armadillo), along with segments <10 bases present in only one or two species, were retained; all other positions were excluded. We also removed positions falling in lowcomplexity regions, CpG islands, or known exons for any species, because the substitution process for such regions is not being modeled here. About 13% of the sequence was not present in enough species, and a further 5.5% was filtered out by content, leaving 746 kb of transcribed and 543 kb of untranscribed sequence in human and a total of 8.9 Mb of transcribed and 5.2 Mb of untranscribed sequence in all 19 mammals (see Supporting Text for filtering methods and Table 2, which is published as supporting information on the PNAS web site, for lengths of sequence available in each species). Although some of this sequence may be under selection, nonexonic selected regions appear to comprise a small percentage (<4%) of the mammalian genome (8) and should have minimal impact on our analyses.
Sequencing errors and misalignments may obscure true substitution events or incorrectly suggest their occurrence. Repeating the analyses with poorly aligned sequences removed (as described in Supporting Text) did not yield qualitatively different results, although it did decrease the branch lengths between distantly related species.
Our analysis approach requires that the tree location of indels be held fixed. At any sequence position where a gap occurs in the alignment, indels were mapped to the tree so as to minimize the total number of events. When more than one such mapping was possible, we chose the one that maximized the number of tree positions assigned as gaps so as to minimize the subsequent computational burden. This choice has the effect of placing insertion events at the last common ancestor of all sequences in which the base is present and of placing deletions immediately after internal nodes.
Reliability of Estimates. Several features of our approach (the complexity of the model, the use of a discretetime approximation, and issues regarding MCMC convergence and applicability of asymptotic theoretical results) make it important to assess the reliability of our parameter estimates and inferred distributions. We performed this assessment by analyzing simulated datasets that matched the real data in the amount of sequence of each type and history of insertions and deletions. Evolution of the sequences was simulated using a continuoustime version of the model described above, with parameter values that had previously been estimated from the real data for a particular analysis. Our Bayesian MCMC approach was then used to reestimate these parameters from the simulated dataset. For the simulation analysis, the error of each parameter estimate is known: It is the difference between the estimate and the value used for the simulation. If our variance estimates and normality assumptions are correct, then this error divided by its estimated standard deviation should follow a standard normal distribution. The agreement is excellent for the branch length and rate parameters (Fig. 2), suggesting that (provided our evolutionary model is correct) the parameter values and confidence intervals estimated from the real data by means of Bayesian MCMC are reliable. Variances of the root and inserted sequence distribution parameters tended to be underestimated; however, these were not directly considered in subsequent analyses and did not appear to affect the reliability of other estimates. Note that this simulation does not evaluate issues such as uncertainty in indel placement and the validity of the model, e.g., the assumption that rates depend only on the immediately flanking nucleotides and feature type.
Results and Discussion
Our evolutionary model allows the substitution rate at each site to depend on the two flanking nucleotides (“context”), the branch of the phylogenetic tree, and the type of biological feature in which the site is located. Parameter estimates and confidence intervals are obtained by Bayesian MCMC analysis, and their reliability is checked by using simulations (see Methods). To reveal substitutional trends during mammalian evolution, we analyzed a dataset consisting of orthologous sequences from 19 mammals for a 1.7Mb genomic region that was filtered to remove exons and other sequences likely to be nonneutrally evolving.
Variation of Context Effects Across Lineages. In an initial MCMC analysis, we explored the broad pattern of rate variation across the mammalian tree by estimating separate contextdependent rate matrices for transcribed and untranscribed regions for each of five clades (primates, rodents + rabbit, carnivores + artiodactyls + horse, hedgehog, and armadillo) and for a sixth group comprising three ancestral branches (Fig. 1), a total of 12 matrices. Each matrix has 192 parameters representing the rates of contextdependent substitution events wxy→z, where w and y are the 5′ and 3′ neighbors of x and z is the base to which x mutates (our model assumes the event affects a single nucleotide at a time, so that w and y are unchanged). In untranscribed regions (but not transcribed regions, cf. ref. 14), we assume that rates of complementary events are equal, so the number of potentially distinct matrix parameters reduces to 96. Branchspecific scaling factors (again estimated separately for transcribed and untranscribed regions) allow variation by branch within a clade by means of a multiplicative factor applied simultaneously to all contextdependent rates.
Comparison of the contextdependent rates across clades (Fig. 3) indicates that they are broadly similar but have some intriguing systematic differences. The differences appear primarily to be shifts of groups of rates of similar types parallel to the diagonal in log–log scale, implying that, within each group, a single multiplicative factor relates the (untransformed) rates in one clade to those in the other. This pattern suggests that the set of contextdependent substitutions wxy→z can be partitioned into subsets, which we call “types,” such that rate variation across the tree is largely captured by lineagespecific multiplicative shifts in the baseline rate for each type. Each specific context within a type modifies the baseline rate for that type in a manner that is largely independent of lineage.
Consequently, we adopt a model in which a single matrix of contextdependent rates applies to all tree branches but is modified by scaling factors that are specific for each branch and substitution type. To determine the optimal partitioning into types for this purpose in an objective and statistically rigorous manner, we carried out a weighted ANOVA, taking advantage of the fact that we have reliable variance and covariance estimates for the parameter estimates from the MCMC analysis. A partitioning into 14 types turns out to capture most of the variation across the phylogeny (Table 1). One of these types (NCG→T in our notation) corresponds to CpG→TpG substitutions, which are thought to arise primarily from deamination of methylated C (5). The remaining types apparently lack such simple mechanistic interpretations. We attempt below to gain some insight into the factors causing type rates to vary differentially across the mammalian tree.
This model has somewhat fewer parameters while allowing additional branchspecific variation within clades, and it captures the major variation of context effects across the tree in the variation of branchscaling factors. We obtained new parameter estimates by a MCMC analysis using this model, with 14 scaling factors per branch in untranscribed regions and with 28 scaling factors in transcribed regions to allow rate differences between each substitution type and its complement.
The estimated contextdependent substitution rates for untranscribed regions are shown in Fig. 4. The trends are similar to those noted in previous studies of human pseudogenes (3, 4). In particular, NTN→N substitution rates tend to increase with the number of flanking purines, and NCG→N rates are increased compared with NCH→N rates, for both NCG→R transversions and NCG→T transitions.
The Molecular Clock Hypothesis for Different Substitution Types. The molecular clock hypothesis (33), which states that substitutions accumulate at a rate proportional to clock time in all lineages, is known to fail for neutral nucleotide substitutions in mammals (34), with some lineages (e.g., rodents) showing greatly elevated rates relative to others. Because our substitution types capture crossphylogeny variation in relative rates, we were interested in the possibility that they might differ in the degree to which they deviate from clocklike behavior. To investigate this possibility, we computed for each substitution type the variance of the set of normalized roottoleaf distances (Fig. 5). By this criterion, the types fall roughly into three groups: NCG→T has a very low variance of 0.032; other NCN→N types have intermediate variances in the range 0.08–0.10; and NTN→N types have high variances in the range 0.10–0.18. In particular, NCG→T substitutions apparently occur at close to clocklike rates, whereas NTN→N rates are the least clocklike. These trends are corroborated by examination of tree shape (Fig. 6).
Suggested explanations for deviation from clocklike behavior include the “generation time” hypothesis (34, 35), which proposes that organisms with shorter generation times (more precisely, having a higher average number of germline cell divisions per year) have higher substitution rates as a result of DNA replication errors; and the “metabolic rate” hypothesis (36), which proposes that organisms with higher weightspecific metabolic rates have higher mutation rates as a result of oxidative damage. Because generation time and metabolic rate are both correlated with body size, these hypotheses have been difficult to distinguish (36, 37). For example, the lineages having the greatest excess over the mean branch length in our data (Fig. 6) are rodent, rabbit, and hedgehog, which have both relatively short generation times and relatively high metabolic rates.
Oxidative damage, however, is predicted to mainly induce mutations of G or C to A or T (38). Our results indicate that the most pronounced phylogenetic variation instead involves T→N substitutions. Thus, varying rates of oxidative damage do not appear to be a major direct cause of mammalian nuclear substitution rate variation [they may be more relevant for mitochondrial rates (38)]. On the other hand, armadillo and horse, which have among the lowest metabolic rates of these mammals (39, 40) but relatively short generation times, have the shortest branch lengths for all substitution types (Fig. 6), which suggests that metabolic rate may influence mutation rate by mechanisms other than oxidation. One possible mechanism, proposed in ref. 36, is errorprone repair of oxidatively damaged DNA, which involves “replication error” in a more general sense (not associated with cell division). This suggestion has the appealing feature of allowing variation in substitution rates to be attributed to a single mechanism, DNA replication. Further support for the influence of replication errors on mutation rate comes from the fact that males, which have more germline cell divisions than females, also have higher mutation rates (41).
Our finding that NCG→T substitutions are relatively clocklike presumably reflects the fact that most NCG→T mutations arise from hydrolytic deamination of methylated C (5), which should be relatively unaffected by DNA replication (note incidentally that this chemical reaction does not involve oxidative damage). Conversely, the fact that NTN→N types show the greatest variation suggests that most of these mutations may be DNA replicationassociated. Because other NCN→N substitutions show intermediate variation, they likely include both a replicationdependent component and a more clocklike component, the latter perhaps being deamination of (unmethylated) cytosine (42). Of course, it is likely that there is some variation in the accuracy and efficiency of replication and repair machinery in these organisms as well (e.g., see refs. 43 and 44), which may play a contributing role.
Given the relatively clocklike behavior of NCG→T rates, we expect the branchspecific ratio of NCG→T rate to overall substitution rate (plotted in Fig. 7A ) to correlate with those factors (generation time or metabolic rate) that are responsible for overall rate variation relative to clock time across the tree. Note that there is a 2fold variation in this ratio, from ≈30 in great apes to <15 in hedgehog. In general, the correlation appears stronger with generation time than metabolic rate; branches in Fig. 7A that correspond to extant species with shorter generation times tend to have lower ratios. The ratios also generally are lower in ancestral branches than descendant branches: seven of the 17 ancestral branches have significantly (P < 0.05) lower ratios than their descendants, whereas only one has significantly higher ratio (Fig. 27, which is published as supporting information on the PNAS web site). This finding suggests generation times have tended to lengthen in several lineages, which is consistent with paleontological evidence indicating a trend of increasing body size (Cope's rule) in many mammalian lineages (45, 46).
The trend of increasing NCG→T/overall rate also provides an alternative interpretation for observations in ref. 21. Arndt et al. (21) conjecture, based on analyzing the rates of NCG→T substitutions relative to other substitutions in human repeats, that the NCG→T rate has been increasing since the time of the mammalian radiation. We suggest instead that the NCG→T rate has remained relatively constant, whereas the rate of other, replicationdependent errors has decreased during primate evolution because of increasing generation times.
S Versus W Substitution Bias. Bias in the relative rates of W→S and S→W neutral substitutions is thought to be a major driver of genome G + C content (47). Most eukaryotes have A + T rich genomes, suggesting that there is a phylogenetically widespread mutational pressure favoring S→W over W→S mutations. However, a strong circumstantial case has recently emerged (10–13) that biased gene conversion, a tendency to repair W:S mismatches to C:G rather than T:A in DNA heteroduplexes formed during recombination, acts as a significant counterbalancing force that mitigates or reverses this mutational pressure by increasing the frequency of W→S and decreasing the frequency of S→W substitutions. The magnitude of the biased gene conversion effect is predicted to be positively correlated with conversion rate, effective population size, and the strength of the repair asymmetry (48).
Examination of the W→S/S→W ratio by tree branch (Fig. 7B ) reveals significant variation across the mammalian phylogeny. Assuming that this pattern reflects variation in the effects of biased gene conversion and that the basic characteristics of the recombination process have remained relatively constant, we expect that higher values of W→S/S→W should reflect higher effective population sizes (N _{e}) and/or recombination rates along certain branches, with N _{e} likely dominating the trends because it is thought to vary over a greater range than recombination rate among most mammals (an order of magnitude or more versus a factor of two or three). In general, N _{e} does appear to explain much of the variation. For example, the rodent + rabbit clade has relatively high W→S/S→W ratios compared with most other mammals, presumably reflecting their large N _{e} [the average crossover rate in rodents is only about half that in humans (49)]. Chimp has a significantly higher ratio than human, consistent with its apparently larger N _{e} (50). Some of the variability in ratios may reflect recombination rate differences, however; for example, the higher ratio for rat relative to mouse, which has also been observed on a genomewide scale (51), is consistent with rat's somewhat higher recombination rate [for the rat and mouse chromosomes containing the cystic fibrosis transmembrane conductance regulator region, the crossover rates are 0.55 centimorgans per Mb and 0.45 centimorgans per Mb, respectively (49)]. The exceptionally high ratio for dog may arise from the fact that dog tends to have shorter chromosomes and therefore likely has a higher average recombination rate per megabase (10) than other mammals (the cystic fibrosis transmembrane conductance regulator region is on the 163Mb chromosome 7 in human and the 72Mb chromosome 14 in dog).
It is interesting to note that, just as Fig. 7A appears to provide a window into variation of generation times across the mammalian phylogeny, Fig. 7B may provide a window into variation of effective population sizes and recombination rates.
TranscriptionAssociated Substitutional Asymmetry. We have previously found (14) that there is an asymmetry in substitution rates in transcribed regions, with pyrimidine transitions (T→C and C→T) occurring at higher rates on the transcribed strand than purine transitions (note that in ref. 14, substitutions were read on the coding, or nontranscribed, strand). We confirm that pattern here (Fig. 8) and find that the degree of asymmetry varies by neighboring context, with the ATA→C, VTG→C substitution type having the strongest asymmetry. In addition, we find significant asymmetry for four of the seven transversion substitution types.
The degree of transcriptional asymmetry for each context is correlated with the substitution rate in untranscribed regions for NTN→N substitutions, whereas no strong correlation is clear for NCN→N substitutions (Fig. 9). Moreover in general, there appears to be little or no transcriptional asymmetry for the contexts having the lowest substitution rates within each type. These observations suggest that each NTN→N type has a baseline rate of substitutions that occur via a symmetric process, and that contexts within a type act to increase the rate over the baseline via a mechanism that is subject to the asymmetry. Under the model proposed in ref. 14, the underlying asymmetry is at the level of replication errors made by DNA polymerase, namely a greater frequency of misinserted purines than of misinserted pyrimidines. The fact that the correlation we see is strongest for NTN→N substitutions is then consistent with the suggestion above that those substitutions have the highest proportion of replicationdependent errors.
Summary. Bayesian MCMC offers a powerful and flexible approach to the elucidation of molecular evolution trends and should allow increasingly complex models of the mutation and substitution process to be investigated. We have used it here to characterize mammalian variation in contextdependent substitution patterns in a 1.7Mb genomic region. Our results appear to support the hypotheses that contextdependent DNA replication errors, cytosine deamination, and biased gene conversion are the major sources of naturally occurring point mutations and that the relative contributions of these have varied in mammalian evolution as a result of varying generation times, effective population sizes, and recombination rates. In particular, CpG transitions have accumulated in a relatively clocklike fashion, in comparison with other contextdependent substitution types.
Acknowledgments
We thank the NISC Comparative Sequencing Program, in particular Eric Green, Bob Blakesley, Gerry Bouffard, and Pam Thomas, for generating and providing the annotated sequence; Matthew Stephens and Joseph Felsenstein for helpful suggestions; and Matthew Stephens and Eric Green for comments on the manuscript. This work was supported by the Howard Hughes Medical Institute, the National Institutes of Health, and the Poncin Foundation.
Footnotes

↵ † To whom correspondence may be addressed. Email: dhwang{at}u.washington.edu or phg{at}u.washington.edu.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected on May 1, 2001.

Abbreviations: MCMC, Markov chain Monte Carlo; Mb, megabase.

See accompanying Biography on page 13991.
 Copyright © 2004, The National Academy of Sciences
References
 ↵
 ↵
 ↵
 ↵

↵
Ehrlich, M. & Wang, R. Y. H. (1981) Science 212 , 1350–1357. pmid:6262918

↵
Zhao, Z. & Boerwinkle, E. (2002) Genome Res. 12 , 1679–1686. pmid:12421754
 ↵
 ↵
 ↵

↵
Meunier, J. & Duret, L. (2004) Mol. Biol. Evol. 21 , 984–990. pmid:14963104

↵
Galtier, N., Piganeau, G., Mouchiroud, D. & Duret, L. (2001) Genetics 159 , 907–911. pmid:11693127

↵
Birdsell, J. A. (2002) Mol. Biol. Evol. 19 , 1181–1197. pmid:12082137
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵

↵
Arndt, P. F., Petrov, D. A. & Hwa, T. (2003) Mol. Biol. Evol. 20 , 1887–1896. pmid:12885958

↵
Siepel, A. & Haussler, D. (2004) Mol. Biol. Evol. 21 , 468–488. pmid:14660683
 ↵
 ↵

↵
Tavaré, S. (1986) Lect. Math. Life Sci. 17 , 57–86.

↵
Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. (2003) Bayesian Data Analysis (Chapman & Hall/CRC, Boca Raton, FL), 2nd Ed.

↵
Heyde, C. C. & Johnstone, I. M. (1979) J. R. Stat. Soc. B 41 , 184–189.

↵
Wilson, I. J. & Balding, D. J. (1998) Genetics 150 , 499–510. pmid:9725864

↵
Geyer, C. J. (1992) Stat. Sci. 7 , 473–483.

↵
Ferguson, T. S. (1996) A Course in Large Sample Theory (Chapman & Hall, London).

↵
Springer, M. S., Murphy, W. J., Eizirik, E. & O'Brien, S. J. (2003) Proc. Natl. Acad. Sci. USA 100 , 1056–1061. pmid:12552136

↵
Blanchette, M., Kent, W. J., Riemer, C., Elnitski, L., Smit, A. F. A., Roskin, K. M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E. D., et al. (2004) Genome Res. 14 , 708–715. pmid:15060014
 ↵
 ↵
 ↵

↵
Martin, A. P. & Palumbi, S. R. (1993) Proc. Natl. Acad. Sci. USA 90 , 4087–4091. pmid:8483925
 ↵
 ↵
 ↵

↵
Boily, P. (2002) J. Exp. Biol. 205 , 3207–3214. pmid:12235198
 ↵

↵
Fryxell, K. J. & Zuckerkandl, E. (2000) Mol. Biol. Evol. 17 , 1371–1383. pmid:10958853

↵
Drake, J. W., Charlesworth, B., Charlesworth, D. & Crow, J. F. (1998) Genetics 148 , 1667–1686. pmid:9560386
 ↵

↵
Alroy, J. (1998) Science 280 , 731–734. pmid:9563948
 ↵
 ↵

↵
Nagylaki, T. (1983) Proc. Natl. Acad. Sci. USA 80 , 6278–6281. pmid:6578508

↵
JensenSeaman, M. I., Furey, T. S., Payseur, B. A., Lu, Y., Roskin, K. M., Chen, C.F., Thomas, M. A., Haussler, D. & Jacob, H. J. (2004) Genome Res. 14 , 528–538. pmid:15059993

↵
Yu, N., JensenSeaman, M. I., Chemnick, L., Kidd, J. R., Deinard, A. S., Ryder, O., Kidd, K. K. & Li, W.H. (2003) Genetics 164 , 1511–1518. pmid:12930756
 ↵