New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Population genomic inference of recombination rates and hotspots

Edited by Jasper Rine, University of California, Berkeley, CA, and approved February 12, 2009 (received for review January 14, 2009)
Related Article
 In This Issue Apr 14, 2009
Abstract
As more human genomic data become available, finescale recombination rate variation can be inferred on a genomewide scale. Current statistical methods to infer recombination rates that can be applied to moderate, or large, genomic regions are limited to approximated likelihoods. Here, we develop a Bayesian fulllikelihood method using Markov Chain Monte Carlo (MCMC) to estimate background recombination rates and hotspots. The probability model is inspired by the observed patterns of recombination at several genomic regions analyzed in spermtyping studies. Posterior probabilities and Bayes factors of recombination hotspots along chromosomes are inferred. For moderatesize genomic regions (e.g., with <100 SNPs), the fulllikelihood method is used. Larger regions are split into subintervals (typically each having between 20 and 50 markers). The likelihood is approximated based on the genealogies for each subinterval. The background recombination rates, hotspots, and parameters are evaluated by using a parallel computing approach and assuming shared parameters across the subintervals. Simulation analyses show that our method can accurately estimate the variation in recombination rates across genomic regions. In particular, clusters of hotspots can be distinguished even though weaker hotspots are present. The method is applied to SNP data from the HLA region, the MS32, and chromosome 19.
 ancestral recombination graph
 Bayesian inference
 linkage disequilibrium
 Markov Chain Monte Carlo
 recombination hotspot
Describing finescale recombination rate variation and the distribution of recombination hotspots across chromosomes are important goals in population genetics. Recombination is one of the fundamental evolutionary forces affecting patterns of polymorphism variation across genomes. The distribution of recombination rates and hotspots helps to reveal the molecular basis of meiotic crossovers and is a crucial factor in association study designs because of its effect on the pattern of linkage disequilibrium in human genomes (1). Traditional linkage data from pedigrees typically do not provide estimates of recombination on a fine scale because of the limited number of meioses (2, 3). Spermtyping analyses can experimentally provide estimates of recombination rates but are laborious and expensive, so only a few regions of the human genome have been studied. In addition, only recombination rates in males can be revealed from these studies (reviewed in ref. 4). Statistical inferences of recombination rates based on population genetic data represent a major approach to obtain an overall picture of finescale recombination rates and hotspot locations in the human genome, especially at present, with more population genomic data become available daily [e.g., data generated by the HapMap project (5), etc.].
Both spermtyping analyses and statistical inferences based on population genetic data (typically SNPs) reveal similar patterns of finescale recombination rates across chromosomes. These studies suggest that recombination rates vary significantly over genomic regions, and most recombination tends to occur in particular regions of chromosomes (with interval sizes of ≈1–2 kb) known as recombination “hotspots,” whereas in other “background” regions, many fewer recombinations occur (4, 6).
A number of statistical methods have been developed to estimate recombination rates by using population genetic data, including those using summary statistics, full likelihoods, and approximated likelihoods (reviewed in refs. 7 and 8). Fulllikelihood methods use all information contained in the data and, in principle, should provide more accurate estimates. However, because fulllikelihood methods based on the Ancestral Recombination Graph (ARG) (9, 10) involve integrating a large number of variable dimension genealogies, it has been challenging to develop efficient methods based on full likelihoods that are applicable to largescale data (10–13). Several methods based on approximate likelihoods have therefore been developed (14–16) and applied to human genomic data (17–19). We recently developed a fulllikelihood Bayesian Markov Chain Monte Carlo (MCMC) method for estimating finescale recombination rates (20). In our method, genealogies underlying a sampling of chromosomes are effectively modeled by using marginal individual SNP genealogies related through an ARG. Simulation studies showed that our fulllikelihood method performed well under different simulation scenarios and can be applied to smalltomoderatesize chromosomal intervals (e.g., with ≤100 SNPs).
Several methods for detecting recombination hotspots have also been developed that search for regions of accelerated recombination rates by comparison with surrounding regions or with overall background rates (16, 18, 21, 22). Auton and McVean (24) recently incorporated a model of hotspots and background recombination into the LDhat package to simultaneously estimate finescale recombination rates and detect recombination hotspots. Our method differs from Auton and McVean's composite likelihood method in several ways. Most importantly, for moderatesize genomic regions (e.g., ≤100 SNPs), the posterior probability of recombination rates is obtained by a fulllikelihood method. If genomic regions are larger, they are divided into nmarker subregions (typically, an appropriate choice for n is between 20 and 50). The likelihood is then calculated conditional on the genealogies for each subregion, and parameters are evaluated jointly across all subregions.
Here, we present a model of recombination rates and hotspots whose design is based on the observed distribution of recombination hotspots at several genomic regions obtained from spermtyping studies (4, 6). The background rates between SNPs are assumed to be independently distributed following a Γdistribution. Piecewise estimators of recombination rate change have been developed that accommodate recombination hotspots (18, 23). Auton and McVean (24) presented a model in which recombination hotspots are uniformly scattered over the region being analyzed.
In our hotspot model, a 2parameter Markov process is used to describe the distribution of the intervals between hotspots and the duration of hotspots. Both duration of hotspots and the distances between hotspots are exponentially distributed. One feature of the distribution of recombination hotspots that has been revealed by sperm typing analyses is that hotspots are sometimes clustered. The advantage of using an exponential distribution to describe the distances between hotspots is that the mode of the distribution is 0, and the variance is large. Thus, scenarios in which hotspots are far apart and in which they are clustered can be accommodated. An example of a pattern of recombination hotspots generated by using the model is given in Fig. 1. The recombination rates across chromosomal regions are a combination of the 2 independent processes (the summation of the rates from the background rate and the hotspots rates).
A reversible jump MCMC scheme is used to estimate background recombination rate and hotspots by using a population sample of SNPs from regions of the genome (25). The position and intensity of hotspots, background recombination rates, and other parameters are sampled from the Markov chains. The chromosomal intervals are divided into bins (e.g., with size 100 bp) for estimating the posterior probability and the Bayes factor (BF) that each interval contains a hotspot and the average intensity of the hotspot in each interval. To identify hotspots, a hotspot is defined by 2 BF thresholds: HT_{1} and HT_{2}. If BF (hotspot at location i) > HT_{1}, the local mode is used to estimate the modal BF of the hotspot, and the hotspot extends until BF (hotspot at location j) < HT_{2}. The hotspot is then inferred to be on the interval (i, j). Parameter HT_{1} represents the criterion for detecting hotspots. Larger HT_{1} implies higher confidence that the identified hotspots are true. Parameter HT_{2} determines the boundaries, and thus the width, of an estimated hotspot conditional on HT_{1}. The power and type I error rates can be adjusted by modifying values of HT_{1} and HT_{2}.
Results
We examined the performance of our method by applying it to both simulated data and human population genetic data. Population genetic datasets spanning the HLA (26) and MS32 (27) regions that have been previously studied by sperm crossover analysis were analyzed by our method and compared with previous results. The method was also applied to a SNP dataset across human chromosome 19 sampled from the AfricanAmerican population (28).
Simulation Studies.
To evaluate the statistical performance of the method, we used the msHOT program (29, 30) to simulate 3 sets of data. Common parameters used in all simulations include a sample size of 50 chromosomes, a population size (N_{e}) equal to 10^{4}, a mutation rate per site per generation (μ) equal to 10^{−8}, a background recombination rate of 0.15 cM/Mb (ρ = 0.06/kb, given N_{e} = 10^{4}), and a chromosomal interval of size 30 kb. Only sites with minor allele frequencies (MAF) ≥0.05 were retained and used in the analyses.
We first examined the performance of the method by considering hotspots with 2 different intensities: ρ = 40/kb (for dataset S_{1}) and ρ = 10/kb (for dataset S_{2}), representing a relatively strong and a weaker recombination hotspot. For all 3 datasets, the location of the hotspots is assumed to be the same (at a position between 15 and 16.5 kb from the left of the interval). The average [minimum, maximum] number of SNPs for S_{1} and S_{2} are 34.58 [14, 73] and 34.22 [12, 64], respectively.
The BF of recombination hotspot locations and other parameters of the model, obtained by using the program IR, were reported. Different hotspot threshold values, HT_{1} and HT_{2}, were considered to examine how the falsepositive rate, power, and the estimated hotspot intensity and width change. Here, a hotspot is counted as correct when the estimated hotspots overlapped with a true hotspot; otherwise, it is counted as incorrect. The falsepositive rate is the number of incorrect hotspots over the number of intervals examined, and the power is the percentage of successfully identified hotspots over the number of true hotspots.
As expected, both the power and the falsepositive rates increase as HT_{1} decreases [see supporting information (SI) Fig. S1 A and B]. The width of the estimated hotspots is determined by HT_{2} given HT_{1}. In general, HT_{2} can be assumed to be 2.5, and the estimated widths are approximately consistent for different HT_{1} values (Fig. S1C). The results assuming HT_{1} = 5 and HT_{2} = 2.5 are summarized in Table 1. When a hotspot is strong, the estimated intensity tends to be underestimated, because the genealogical trees are independent due to the large number of recombinations, as was pointed out by other authors (24).
The second simulation study aimed to examine the ability of the method to identify clustered hotspots. A spermtyping analysis of the HLA region (15) revealed 2 clusters of hotspots: DNA13 and DMB12. The distance between hotspot centers is 4.01 kb for DNA1 and DNA2, 7.97 kb for DNA2 and DNA3, and 3.25 kb for hotspots DMB1 and DMB2. In our simulation study, we assumed the centers of the 2 recombination hotspots are 5 kb apart, with one hotspot locating between 11.75 and 13.25 and the other between 16.75 and 18.25 kb. As was revealed by the spermtyping analysis, weaker hotspots can exist with stronger hotspots within clusters. In the simulation study, we assumed the hotspot on the left was weak with ρ = 6 (H_{1}) and the other hotspot had a moderate intensity with ρ = 30 (H_{2}). Fifty replicate datasets were simulated (dataset S_{3}), and the average [minimum, maximum] number of SNPs was 35.36 [15, 66]. If assuming HT_{1} = 5 and HT_{2} = 2.5, the falsepositive rate was 2/50 and the average power for both hotspots was 0.67 (0.5 for H_{1} and 0.84 for H_{2}). Of 50 intervals, there were 21 intervals for which both hotspots were identified, 4 intervals for which only H_{1} was identified, 21 intervals for which only H_{2} was identified, and 4 intervals for which no hotspots were identified. In all cases, none of the estimated hotspots span both H_{1} and H2, indicating that the method can discriminate between single hotspots and clusters.
The third simulation study examined the performance of the method when the approximatedlikelihood method was used by splitting larger intervals into nSNP subintervals. Of the 250 simulated samples from the above 2 simulation studies, samples that contain ≥40 SNPs were used in the third simulation study (77 in total). The intervals were broken into 2 subintervals with an approximately equal number of SNPs for each interval. The likelihood was approximated by multiplying likelihoods from subintervals with parameters shared across the entire interval. Results are listed in Table 2. The estimates are comparable between the 2 methods, even though for 14 of 77 intervals, the true hotspot locations were split across 2 subintervals.
Analysis of HLA and MS32 Regions.
We applied our method to 2 datasets from the HLA and MS32 regions that have been previously studied by sperm typing (26, 27). The HLA dataset consists of 274 SNPs distributed across 0.216 Mb, sampled from 50 unrelated individuals. Six hotspots were revealed in the sperm typing study (26). The MS32 dataset consists of 206 SNPs sampled from 80 individuals and distributed across 0.206 Mb. Both regions have been previously analyzed by using coalescent methods for the analysis of genotypes (18, 24, 27).
For our analysis, the region was divided into subregions, each with 20 markers. The posterior distribution of recombination hotspots and background rates was inferred across the entire region. Only the locations of recombination hotspots were compared. The intensities of hotspots predicted by using the 2 approaches were not expected to be the same, because the hotspot intensities inferred by using population genetic data are a product of ρ = 4N_{e}c and N_{e} likely varies across chromosomal regions due to selection. Moreover, population genetic rates are average rates over females and males. The BF of recombination hotspots across the 2 regions is shown in Fig. 2.
The hotspot locations estimated by using our method are, in general, consistent with those obtained from sperm crossover analysis. The hotspots that were independently discovered by spermtyping analysis also had high BFs in our population genetic analyses. Only hotspot NID3 in the MS32 region showed a lower BF.
Analysis of Human Chromosome 19.
We applied our method to a human variation dataset for chromosome 19 (28). The dataset consists of 23 AfricanAmerican individuals. In total, there are 18,406 SNPs on chromosome 19 from the sample. The whole chromosome was divided into 92 intervals that were analyzed separately. There are 200 markers on each interval for the first 91 intervals and 206 markers for the last interval. Each interval was split into subintervals with 20 markers (26 markers for the last segment of the last interval) in a parallel computation assuming shared parameters across the subintervals. Hotspots and background recombination rates were estimated for each interval. Figs. 3 and 4 show the BFs of recombination hotspots and the expected background recombination rates across chromosome 19 estimated by using our method. There is strong evidence for recombination hotspots in at least 10 locations on each chromosome arm. Comparing the hotspots locations with those inferred previously from the HapMap data, the majority appear to overlap (Table S1).
Discussion
In this article, we present a model of background recombination rates and hotspots to describe the changes of finescale recombination rates over genomes. It is an extension of our recently developed fulllikelihood method for estimating recombination rates by using population genetic data. Fulllikelihood methods use all information in the data and should provide more accurate estimates and have higher power to detect recombination hotspots, but the disadvantage of such methods is that they are computationally intensive. Our fulllikelihood method efficiently models the ancestral genealogy of a sample by using marker ancestry vectors to avoid modeling nonancestral lineages, which not only add a computational burden but can cause convergence problems as well. Currently, the fulllikelihood method can be applied to moderatesize chromosomal intervals. For larger intervals or whole genomes, an approximation to the fulllikelihood is used that divides an entire region, or chromosome, into subregions with the likelihood approximated by using the products of likelihoods for genealogies on each subregion but with model parameters shared across the entire region.
The results of our analyses suggest that it is possible to accurately infer recombination hotspot locations and intensities across chromosomes. In particular, our method can accurately distinguish clustered hotspots, even though weaker hotspots may be present. By choosing different criteria for identifying hotspots, the power and the falsepositive rate changes accordingly. In general, it should be profitable to study those hotspots with high BFs at the molecular level; the falsepositive rates are extremely low in these cases. Other parameters in the recombinationrate model might also be interesting, such as how the expected distances between hotspots and the background recombination rates change over the genome. In addition, variables and parameters of the ARG model may be of interest and can be estimated by sampling from the Markov chains. Such parameters include the posterior probabilities of genealogies at each SNP site and the distribution of recombination breakpoints over genomic regions.
The program InferRho (IR) and the simulated data in both msHOT and IR formats can be obtained from http://rannala.org, or by contacting Y.W.
Materials and Methods
Bayesian Inference of FineScale Recombination Rates.
Let Θ be a vector of parameters, including θ = 4N_{e}μ and ρ = 4N_{e}c, where N_{e} is the effective population size, μ is the sitespecific mutation rate per generation, and c is the recombination rate per generation in cM/Mb. Given G_{S}, the genealogical trees (τ⃗) for each marker position are then obtained. The posterior distribution of ρ⃗, is numerically evaluated by MCMC. In the Metropolis–Hastings (MH) algorithm, proposed changes include modifying the SNP genealogy by changing a local topology or by adding (or removing) a pair of recombination and coalescent nodes, modifying ancestral alleles, modifying haplotypes (if the phase of the data are unknown), modifying alleles at sites with missing alleles in the sample, and modifying the parameters θ and ρ.
Background Recombination Rates.
The background recombination rates between SNPs are assumed to follow a Γdistribution with shape parameter a_{ρ*} and scale parameter s_{ρ*}. In the analyses, s_{ρ*} is fixed, and a_{ρ*} is estimated in the MCMC.
Recombination Hotspots.
It is assumed that the distribution of recombination hotspots along chromosomes follows a Markov process. Hotspots arise with instantaneous rate λ_{1} and revert with instantaneous rate λ_{2}. The waiting distance until the occurrence of a hotspot is therefore exponentially distributed with parameter λ_{1}, and the waiting distance until the loss of a hotspot is exponentially distributed with parameter λ_{2}. The values of 1/λ_{1} and 1/λ_{2} represent the average distance between hotspots and the average duration of a hotspot, respectively.
Three variables are associated with each hotspot (H), denoted by X_{1}, X_{2}, and Z, and represent the starting location, the ending location, and the strength of the hotspot. Variable Z is assumed to be lognormally distributed with parameters μ_{Z} and σ_{Z}. Considering s hotspots across a chromosomal region, the distribution of the ith recombination hotspot given the adjacent hotspot on its left is where i = {1, …, s − 1}. If i = 0, replace H_{i−1} → X_{2} with the location of the first site of the interval in the above equation. If i = s − 1, and H_{i} → X_{2} is equal to the right bound of the interval (or the last marker on the interval), then f(H_{i} → X_{2}∣H_{i} → X_{1}, λ_{2}) = exp[−λ_{2}(H_{i} → X_{2} − H_{i} → X_{1})] to represent the fact that the end of a hotspot exceeds the right bound of the chromosomal interval. The joint density of s hotspots (H⃗) on the chromosomal interval is given by the product of the above equation over s hotspots multiplied by the density on parameters λ_{1}, λ_{2}, μ_{Z}, and σ_{Z}.
Given ρ⃗* and H⃗ across the chromosomal interval, the probability distribution of the SNP genealogy can be obtained. The posterior distribution described in Eq. 1 becomes Note that 4 parameters in the model are fixed to avoid parameter identifiability issues as well as to incorporate information from other independent studies. Parameter λ_{2} is fixed to be 1,000, which corresponds to 0.001 Mb as the average width of hotspots. Other parameters are chosen to have a less informative prior, such that μ_{Z} = 9 and σ_{Z} = 1.5, so the 95% interval for the strength of a hotspot is [428.40, 153,268.41] per Mb. Because background recombination rates are intended to describe low rates with relatively small variance, parameter s_{ρ*} is fixed to be 50. For example, if a_{ρ*} = 2, the 95% interval of the prior distribution on background rate is [12.11, 278.58] per Mb. If the size of a chromosomal interval is large, the interval is divided into K subintervals, denoted by X = {X_{i}}, to improve computational efficiency. The SNP genealogies underlying subintervals are treated as independent but allowing parameters to be jointly estimated across an entire interval. The independence assumption can be relaxed if needed, and the sizes of subintervals can be adjusted depending on the available computing resources. In this case, the posterior distribution is approximated by For additional information see SI Text, Figs. S2 and S3, and Table S2.
Acknowledgments
We thank Matthew Stephens for suggestions regarding the prior distribution on background recombination rates and 2 anonymous reviewers for their helpful comments. This work was supported by National Institutes of Health Grant HG01988 (to B.R.).
Footnotes
 ^{1}To whom correspondence should be addressed. Email: ygwang{at}ucdavis.edu

Author contributions: Y.W. and B.R. designed research; Y.W. performed research; B.R. contributed new reagents/analytic tools; Y.W. analyzed data; and Y.W. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/cgi/content/full/0900418106/DCSupplemental.
References
 ↵
 ↵
 ↵
 Coop G,
 et al.
 ↵
 ↵
 ↵
 ↵
 ↵
 ↵
 Hudson RR
 ↵
 ↵
 Kuhner MK,
 Yamato J,
 Felsenstein J
 ↵
 Fearnhead P,
 Donnelly P
 ↵
 Nielsen R
 ↵
 Hudson RR
 ↵
 McVean G,
 Awadalla P,
 Fearnhead P
 ↵
 Li N,
 Stephens M
 ↵
 ↵
 McVean G,
 et al.
 ↵
 Myers S,
 Bottolo L,
 Freeman C,
 McVean G,
 Donnelly P
 ↵
 Wang Y,
 Rannala B
 ↵
 ↵
 Fearnhead P
 ↵
 De Lorio M,
 de Silva E,
 Stumpf M
 ↵
 Auton A,
 McVean G
 ↵
 Green PJ
 ↵
 ↵
 ↵
 Hinds DA,
 et al.
 ↵
 Hellenthal G,
 Stephens M
 ↵
 Hudson RR
Citation Manager Formats
Sign up for Article Alerts
Article Classifications
 Biological Sciences
 Genetics