Previous Article |
Table of Contents
| Next Article
BIOLOGICAL SCIENCES / GENETICS
Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome
,
,¶



Departments of *Molecular Biophysics and Biochemistry and
Genetics, Yale University School of Medicine, New Haven, CT 06520;
European Molecular Biology Laboratory, 69117 Heidelberg, Germany; Departments of ¶Molecular, Cellular, and Developmental Biology and ||Computer Science, Yale University, New Haven, CT 06520; and **Department of Pediatrics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
Communicated by Francis H. Ruddle, Yale University, New Haven, CT, April 30, 2007 (received for review January 11, 2007)
| Abstract |
|---|
|
|
|---|
300bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
copy number polymorphism | human genome variation | structural variants
Recently, high-resolution comparative genome hybridization (HighRes-CGH) has been developed (16, 17), a technology that can detect signatures of CNVs at large-scale. Its underlying principle, array comparative genome hybridization [(array)CGH (18)], involves cohybridization of differentially fluorescent-labeled genomic DNA from both an individual and a reference to a microarray and can be applied cost-efficiently across many samples. HighRes-CGH achieves an unprecedented resolution owing to recent advances in high-density tiling microarrays (19–21), allowing for an immense number of distinct oligonucleotide probes per chip. For instance, when using HighRes-CGH with a chromosome 22-wide 85-bp tiling path step size (denoting its theoretical resolution; i.e., the median distance between genomic regions specifically targeted by probes) in conjunction with PCR, we recently managed to identify the breakpoints of a 1.4-Mb disease-associated deletion that had previously been characterized only at standard cytogenetic resolution (16).
However, HighRes-CGH data are hard to interpret in an ad hoc fashion, and, as yet, no systematic computational approach has been developed for mapping human CNVs at the level of base pairs. Above all, the considerable enhancement in theoretical resolution of HighRes-CGH comes with additional costs: the noise level obtained from microarray readouts is high because of the short hybridizing probes and the complexity of genomic DNA, causing cross-hybridization. Novel computational approaches are thus required to benefit from the technology's theoretical resolution. In particular, tiling microarrays open new avenues; their high-resolution data enable us to make use of correlations between array signals and the actual nucleotide sequence. Here, we present an approach for CNV breakpoint identification that integrates signals from HighRes-CGH arrays and nucleotide sequence statistically, facilitating the accurate detection of CNV breakpoints. A thorough analysis of 10 individuals demonstrates its predictive power: >400 breakpoints were mapped and, in eight instances, predicted coordinates are confirmed through DNA sequencing.
| Results |
|---|
|
|
|---|
|
|
|
2,500 parameters, to gradually collapse to alternative models with fewer parameters in an intuitive fashion [i.e., through reducing numbers of bins and states of the model; see supporting information (SI) Text and SI Fig. 4]. At the extreme end of this collapse is the "core" parameterization, a univariate three-state HMM not considering nucleotide sequence information and requiring 10 parameters. Integrating experimental validations into the overall analysis of array data with BreakPtr should allow us to leverage a small amount of additional knowledge to incrementally improve the set of gold standards and refine our breakpoint predictions. We thus implemented a concept related to semisupervised machine-learning, allowing iterative optimization of BreakPtr parameters through incremental incorporation of validations (Fig. 2 B and Materials and Methods). In brief, (i) initial parameter estimation was performed by using a set of known or approximately mapped deletions and duplications; (ii) an expectation maximization (EM)-based algorithm (25) was used for parameter optimization; (iii) CNVs and their breakpoints were predicted; (iv) breakpoints were validated by DNA sequencing; (v) this process was iterated, which allowed refinement of parameters and predicted CNVs.
Fine-Mapping CNV Breakpoints.
After developing BreakPtr, we tested the approach in detail, focusing on human chromosome 22 and the
-globin locus (16) (a 100-kb region on chromosome 11). In total, 10 samples were analyzed, including eight subjects with known genetic disorders (16) and two "healthy" individuals (see SI Table 2). Because of the small set of available gold standards, BreakPtr was initially applied by using the core parameterization. Parameter estimation was performed by using a set of experiments targeting approximately mapped chromosomal aberrations that involve the 22q11 chromosomal band (see SI Table 3). Parameters of the state corresponding to unaffected genomic DNA were estimated based on an experimental control (Materials and Methods). In total, 232 putative CNVs were identified by BreakPtr (i.e., 464 breakpoints, flanking 121 duplications and 111 deletions, with median size 15 kb and mean 85 kb; see SI Table 4), many of which may be widespread in humans. In particular, 67 (29%) overlap with the genomic coordinates of previously reported CNVs listed in the Database of Genomic Variants (2). By taking into account estimated mapping resolutions of previous studies, we tentatively assigned refined CNV breakpoint coordinates to 36 of the 108 genomic locations to which breakpoints had previously been approximately mapped. (Note that breakpoint-mapping resolutions of previously carried out surveys, i.e., the expected uncertainties of mapping, were, in several instances, unknown to us and thus estimated by using criteria given in SI Text). Altogether, predicted CNVs intersected with 210 different genes. Because our survey included patients with known chromosomal disorders, not all of these genes may intersect CNVs in healthy individuals. Nevertheless, 91 genes did not overlap with the respective critical regions of the previously diagnosed chromosomal disorders (16) and are thus candidates for genes commonly varying in copy number.
Using BreakPtr (core model), we further reanalyzed the association between CNV breakpoints and SDs (SI Text). Indeed, for >2/3 of the predicted CNVs on chromosome 22, at least one breakpoint intersected with a SD. This represents a >4-fold enrichment over random (i.e., compared with "shuffled" CNVs with randomized genomic locations), consistent with previous estimates at lower resolution (9).
Benchmarking of Predictions.
Agreement with previously mapped breakpoints.
To evaluate our breakpoint predictions critically, we first focused on breakpoints that were previously precisely mapped in the regions analyzed here. Indeed, we identified all four previously sequenced breakpoints (16, 26) in the individuals available to us at nucleotide level (Table 1): both of the physical boundaries of a 619-bp heterozygous deletion causing
-thalassemia (26) and the breakpoints of a 1.4-Mb heterozygous deletion associated previously with 22q11-deletion syndrome (16). Furthermore, the heterozygosity (16, 26) of both deletions was correctly identified by BreakPtr.
|
constant region 1; RefSeq: BC012159). The deletion, which may be involved in normal variation of the immune system, overlaps a genomic region of previously reported copy-number variation (2). Second, we attempted to identify the breakpoints of a BreakPtr-predicted
1-kb homozygous deletion located in a region for which, to our knowledge, no CNVs have as yet been reported. The latter deletion intersects with a conserved noncoding element upstream of the HMG2L1 gene (which encodes high-mobility group protein 2-like 1; RefSeq: HMG2L1), and may thus cause variation at the level of gene-regulation. Subsequent to PCR analysis, we sequenced all four breakpoints, leading to the discovery of 18,231 bp and 975 bp deletions (Table 1; and SI Figs. 5 and 6). Furthermore, the observed PCR bands supported the predicted copy number ratios. By comparing genomic coordinates of predicted and validated breakpoints, we determined an effective resolution of BreakPtr of
330 bp (taking into account both the four earlier mapped and the four previously uncharacterized breakpoints).
Comparison of core and full parameterization.
We further compared BreakPtr's core parameterization to the full model. In particular, when using the small set of the only four earlier mapped (disease-associated) breakpoints for estimating the parameters for dbHMM transition states (i.e., Transitions B and B' in Fig. 3A), the full model yielded an improved effective resolution (
280 bp). Before significance can be established, more breakpoint sequences need to be solved. Nevertheless, when using the full model, the fraction of predicted CNVs overlapping with previously reported CNVs already showed a slight increase (from 29% to 31%), and we expect that with the availability of larger sets of gold standards the full parameterization should cause robust improvements over alternative models.
Use of BreakPtr to refine previously mapped breakpoints. We believe that the considerable overlap of our predictions with previously reported CNVs indicates that BreakPtr will help in refining many approximately mapped breakpoints. To exemplify this, we analyzed GM15510, a sample derived from a healthy subject previously studied by Tuzun et al. (3) by using fosmid paired-end sequencing: four of six CNVs (67%) previously identified (3) in the chromosomal regions studied here intersected with CNVs predicted by BreakPtr (core parameterization). For these, high-resolution breakpoint assignments are available from SI Table 4. Note that all cases missed by BreakPtr represent insertions detected by fosmid paired-end sequencing (3) and, thus, are not necessarily CNVs, by definition, because they may at least partly contain sequences not present in the human reference genome or products of balanced translocations not affecting gene dosage, which are not detectable by HighRes-CGH. We expect BreakPtr to be suited for refining the coordinates of many previously reported CNVs.
Breakpoint Fine-Mapping Suggests Abundance of CNPs and Mendelian Transmission.
The fine-mapping of CNV breakpoints should enable in-depth analysis of CNV frequency and inheritance across individuals; specifically, correspondences between partially overlapping CNVs cannot be reliably assessed in the absence of precisely mapped breakpoints. For instance, several CNVs reported in our study appear to be common: 11% of the 232 predicted CNVs were observed in at least two unrelated individuals (when applying a margin of ±330 bp for breakpoint identification) and, thus, most likely represent CNPs, i.e., common CNVs. (Because of the relatively small number of individuals analyzed here, the actual fraction of CNPs will presumably be considerably higher; see SI Table 5.) To further exemplify this, we carried out a pilot study examining by PCR the distribution of the previously uncharacterized 975-bp CNV across 19 HapMap individuals (including relatives and unrelated subjects). PCR results suggest Mendelian transmission of the deletion [in agreement with recent observations concerning CNV inheritance (4, 8)] and common occurrence in different populations. Altogether, the CNV was detected in
20% of the surveyed chromosomes of HapMap individuals, consistent with our provisional estimate based on BreakPtr predictions in the 10 individuals analyzed by HighRes-CGH (SI Fig. 7 and SI Table 4). This indicates that the predictive resolution of BreakPtr alone enables analyzing CNV frequency and inheritance.
| Discussion |
|---|
|
|
|---|
Finally, to evaluate the prospect of performing breakpoint validations on a large-scale, we studied the design requirements of HighRes-CGH experiments. For instance, when removing half of the probes of the chromosome 22 microarray analyzed here, effective resolutions at 0.5–1 kb were observed (data not shown), resolutions well suited for breakpoint validation. Given the ever-increasing feature density of microarray slides, surveys such as the one described here will soon be performed on a genome-wide scale. For instance, Nimblegen (Madison, WI) has recently begun producing arrays with 2.1 million probes: if by using those arrays with a 170-bp tiling path step size, only nine microarrays per individual may enable genome-wide breakpoint mapping (thereby, BreakPtr analysis is unlikely to be limiting; see SI Text). Eventually, large-scale fine-mapping and sequencing of breakpoints will shed new light on CNV origin, inheritance, population frequency, and associations of CNVs with phenotypes.
| Materials and Methods |
|---|
|
|
|---|
385,000 different probes at an 85-bp tiling path step size were designed as described (16). Labeled genomic DNA of human subject and reference samples (the latter sample, i.e., the control, comprising a pool of genomic DNA from seven healthy male individuals, from Promega, Madison, WI) were cohybridized to the arrays (16, 17). Fluorescence intensities were obtained for each spot (16) ("probe"). Fluorescence intensity normalization was performed by using the Qspline algorithm (31). We further included and reanalyzed HighRes-CGH data from a previous study (16). CNV Breakpoint Prediction by Using the Full Parameterization. HighRes-CGH data were scored by using BreakPtr (source codes available from http://breakptr.gersteinlab.org). Its full model encompasses a seven-state dbHMM operating with two emission channels: i.e., it uses normalized fluorescent intensity log2-ratios and a value quantifying the redundancy of the underlying DNA sequence derived from BlastZ (32) alignments [which can be used for identifying SDs (32, 33)]. Normalized BlastZ-scores (32) from genome-wide human-vs.-human (i.e., BlastZ-self chain) alignments depleted of lineage-specific common (interspersed) repeats were retrieved from the University of California (Santa Clara, CA) Genome Browser (http://genome.ucsc.edu; default parameters according to the Self-Chain Track, i.e., minimum BlastZ raw score = 10,000; normalized BlastZ-score = raw score/no. of bases aligned). We used cumulative scores that were obtained by summing up normalized BlastZ-scores for each BlastZ-hit intersecting with the genomic coordinate of a probe. This measure correlates with the redundancy of the nucleotide sequence, in particular SDs, and we thus considered it for incorporation into the dbHMM. BreakPtr's transition states reflect the propensity of breakpoints to coincide with SDs. Breakpoints are predicted also if not coinciding with SDs, because the model architecture allows transition states to be omitted. The dbHMM emits discrete symbols for each genomic coordinate targeted by a probe, with each symbol corresponding to a bin of the emission distribution associated with particular microarray values and nucleotide sequence composition (Fig. 3). Bins were constructed in the following way: cumulative scores were divided among N1 = 5 bins, with bin sizes selected by using the condition to place approximately equal numbers of data points into each bin. Normalized fluorescence intensity log2 ratios were assigned to N2 = 100 bins according to the following procedure: values between –1 and 1 were assigned to N2 – 2 bins covering equally sized fluorescence intensity log2-ratio intervals. Further, log2 ratios < –1, and log2 ratios >1, were assigned to additional bins. Predictions were robust to bin size selection. Most probable state assignments were found by using the Viterbi algorithm (23). BreakPtr assigns breakpoints to locations of transition to (or from) "deletion" and "duplication" states. In this particular study, given the small amount of available gold standards, emission distributions of the full model were refined by Gaussian smoothing (after parameter estimation) by using parameters that resemble the distribution of emission values of the normal state (i.e., unaffected genomic DNA). This step is unnecessary if the criteria based on Scott (24) are fulfilled (see below).
Alternative Parameterizations. Alternative parameterizations (core and intermediate models) are implemented according to Scott (24), depending on the amount of available data (see SI Text and SI Fig. 4). For instance, the core model uses a univariate continuous three-state HMM with Gaussian emission and 10 parameters. Focusing on HighRes-CGH data, it is not trained on DNA sequence signatures. Alternative models use the same structure of modules (Finder/Annotator/Flagger; Fig. 2) as the full model.
Training/Optimization of BreakPtr.
BreakPtr was iteratively optimized by using training data and gold standards. For instance, the core model was trained as follows: (i) parameter estimation was based on a labeled training set, i.e., parameters for CNV-states were estimated from approximately mapped breakpoints (see SI Table 3). For the normal state, a control microarray experiment [same DNA applied to both channels (16)] was used. Transition probabilities pb were estimated for each transition between states by dividing the number of previously known aberrations (based on approximately mapped deletions/duplications) in the concatenated training set by the number of probes. Then, transition probabilities pw within states were set:
|
|
(ii) Parameter optimization was carried out by using a larger unlabeled training set (in practice, the entire set of arrays is used). (iii) Breakpoints were predicted, and (iv) validated. These four steps were iterated, by using refined breakpoints for each round. Steps i and ii can be viewed as a special case of semisupervised learning (34), because unlabeled data are used to improve a model based on labeled data. Steps iii and iv are closely related to active sampling/learning, because a subset of unlabeled data is selected and validated, and results are subsequently incorporated in the rescoring process. BreakPtr's alternative parameterizations are optimized similarly to the core model: Transition states were trained based on data points corresponding to an interval ±1,000 bp from a mapped breakpoint. Transition-state emission distributions were assumed to be identical for both (the telomeric and centromeric) boundaries of CNVs. BreakPtr allows transition probabilities to be refined by using EM or to be adjusted manually. The latter, e.g., allows the stringency of CNV detection to be adjusted gradually if a CNV reported by using a complementary technology was initially missed.
Dosage Estimation. Copy number ratios were identified by using a six-state HMM (Annotator). Parameters were estimated from biological samples (if corresponding samples were missing, parameters were manually specified; see SI Table 3 and SI Fig. 8).
Identification of False-Positive Signals Resulting from Cross-Hybridization. Cross-hybridization of probes may, to a certain extent, affect HighRes-CGH experiments; putative cross-hybridization is identified by using a sequence alignment-based approach (this module, Flagger, is described in detail in SI Text).
Validation of Breakpoints. Genomic regions surrounding predicted breakpoints were amplified by conventional PCR or vectorette PCR and sequenced (16).
| Acknowledgements |
|---|
|
|
|---|
| Footnotes |
|---|
Abbreviations: (array-)CGH, array comparative genome hybridization; CNP, copy number polymorphism; CNV, copy number variant; EM, expectation maximization; HMM, Hidden Markov Model; dbHMM, discrete-valued bivariate HMM; HighRes-CGH, high resolution CGH; SD, segmental duplication.
To whom correspondence may be addressed. E-mail: jan.korbel{at}yale.edu, michael.snyder{at}yale.edu, or mark.gerstein{at}yale.edu
Author contributions: J.O.K. and A.E.U. contributed equally to this work; J.O.K., A.E.U., S.M.W., M.S., and M.B.G. designed research; J.O.K., A.E.U., and F.G. performed research; A.E.U., F.G., J.D., P.S., G.Z., and B.S.E. contributed new reagents/analytic tools; J.O.K., A.E.U., F.G., J.D., T.E.R., S.M.W., M.S., and M.B.G. analyzed data; and J.O.K. and M.B.G. wrote the paper.
The authors declare no conflict of interest.
Data deposition: Microarray data have been deposited in the Gene Expression Omnibus repository (accession no. GSE6010).
This article contains supporting information online at www.pnas.org/cgi/content/full/0703834104/DC1.
© 2007 by The National Academy of Sciences of the USA
| References |
|---|
|
|
|---|
Related articles in PNAS:
This article has been cited by other articles in HighWire Press-hosted journals:
![]() |
A. S. Lee, M. Gutierrez-Arcelus, G. H. Perry, E. J. Vallender, W. E. Johnson, G. M. Miller, J. O. Korbel, and C. Lee Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies Hum. Mol. Genet., April 15, 2008; 17(8): 1127 - 1136. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Cahan, L. E. Godfrey, P. S. Eis, T. A. Richmond, R. R. Selzer, M. Brent, H. L. McLeod, T. J. Ley, and T. A. Graubert wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data Nucleic Acids Res., April 1, 2008; 36(7): e41 - e41. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Nozawa, Y. Kawahara, and M. Nei From the Cover: Genomic drift and copy number variation of sensory receptor genes in humans PNAS, December 18, 2007; 104(51): 20421 - 20426. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. O. Korbel, A. E. Urban, J. P. Affourtit, B. Godwin, F. Grubert, J. F. Simons, P. M. Kim, D. Palejev, N. J. Carriero, L. Du, et al. Paired-End Mapping Reveals Extensive Structural Variation in the Human Genome Science, October 19, 2007; 318(5849): 420 - 426. [Abstract] [Full Text] [PDF] |
||||
![]() |
ARTICLE WATCH J. Biomol. Tech., September 1, 2007; 18(4): 268 - 273. [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||