New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
Minimal haplotype tagging

Contributed by Louis M. Kunkel, June 16, 2003
Abstract
The high frequency of singlenucleotide polymorphisms (SNPs) in the human genome presents an unparalleled opportunity to track down the genetic basis of common diseases. At the same time, the sheer number of SNPs also makes unfeasible genomewide disease association studies. The haplotypic nature of the human genome, however, lends itself to the selection of a parsimonious set of SNPs, called haplotype tagging SNPs (htSNPs), able to distinguish the haplotypic variations in a population. Current approaches rely on statistical analysis of transmission rates to identify htSNPs. In contrast to these approximate methods, this contribution describes an exact, analytical, and lossless method, called BEST (Best Enumeration of SNP Tags), able to identify the minimum set of SNPs tagging an arbitrary set of haplotypes from either pedigree or independent samples. Our results confirm that a small proportion of SNPs is sufficient to capture the haplotypic variations in a population and that this proportion decreases exponentially as the haplotype length increases. We used BEST to tag the haplotypes of 105 genes in an AfricanAmerican and a EuropeanAmerican sample. An interesting finding of this analysis is that the vast majority (95%) of the htSNPs in the EuropeanAmerican sample is a subset of the htSNPs of the AfricanAmerican sample. This result seems to provide further evidence that a severe bottleneck occurred during the founding of Europe and the conjectured “Out of Africa” event.
Singlenucleotide polymorphisms (SNPs) are an invaluable tool to uncover the genetic basis of common diseases (1, 2) by providing a highresolution map of the genome and allowing researchers to associate variations in a particular genomic region to observable traits. Unfortunately, the sheer number of SNPs in the human genome, which makes SNPs so useful as markers, also makes genomewide association studies unfeasible. However, the number of distinct combinations of SNP alleles (haplotypes) encountered in human samples is a small fraction of the possible haplotypes that would arise if alleles were distributed randomly. This haplotypic structure of the genome lends itself to the selection of a parsimonious set of SNPs, called haplotype tagging SNPs (htSNPs), able to distinguish the haplotypic variations in a population.
Given a set of haplotypes in a genomic region, identified through statistical (3, 4) or molecular (5, 6) methods, the process of haplotype tagging is in principle deterministic. Unfortunately, this problem is also computationally intractable (7), because its solution requires the testing of every possible combination of SNPs in the haplotype set, and the number of these combinations grows exponentially with the number of SNPs in the haplotype set. Current approaches rely on approximate methods to identify htSNPs. Most efforts (8–12) have focused on the identification of a secondary haplotype structure across several large regions of the genome. This substructure comprises regions of limited recombination, called haplotype blocks, bounded by small regions characterized by higher recombination rates. Within these smaller regions, htSNPs can be readily identified by eye or by bruteforce search. An alternative approach (13) searches for htSNPs by maximizing the haplotype diversity “explained” by a set of SNPs. Using this method, Johnson et al. (13) were able to identify htSNPs accounting for up to 80% of the genomic variations in the populations they analyzed. Despite their differences, both blockbased and direct approaches use stochastic methods to identify a reduced set of SNPs sufficient to characterize a genomic region in a population. A common concern about these approaches is that the loss of information induced by their stochastic nature could lead to overlooking rare variations responsible for less frequent diseases (14).
In contrast to these approximate approaches, we introduce the first exact, analytical, lossless solution to the problem of identifying the minimum set of SNPs accounting for the variations in an arbitrary genomic region. This method, called Best Enumeration of SNP Tags (BEST), does not follow a suboptimal heuristic or some approximate, stochastic approach but takes advantage of a peculiar aspect of the genome (the relatively small number of haplotypes with respect to the number of SNPs) to confine the source of complexity to a smaller search space. In this way, the reliability of the identified htSNPs will be only a function of the inferred haplotypes, and the haplotype tagging process will not induce any further information loss. Experimental results show that BEST runs to completion in a matter of seconds even for genomic regions containing >200 SNPs.
The method described in this contribution can take as input haplotypes inferred from crosssectional samples via stochastic systems (3, 4) or from pedigree data. Therefore, we applied our method to both a set of 105 genes from 47 independent subjects and haplotypes for 9 genes from pedigree data described in ref. 13. Our results confirm that a small proportion of SNPs is sufficient to capture the haplotypic variations in a population and show that this proportion decreases exponentially as the number of SNPs in the haplotype increases. Comparing BEST to the method proposed by Johnson et al. (13), we also show that, in two genes of nine, our method finds smaller sets of htSNPs, suggesting that BEST improves their original results, even for comparatively small haplotypes.
Materials
Data Collection. SNP genotype data for 24 selfdescribed African Americans (12 female) and 23 European Americans (11 female) were obtained for 105 genes: 85 genes from the University of Washington–Fred Hutchinson Cancer Research Center Variation Discovery Resource Program for Genomic Applications (http://pga.mbt.washington.edu) and 20 genes from the Innate Immunity Programs for Genomic Applications (http://innateimmunity.net). All sequencing was performed on the same anonymized DNA samples from the Coriell Cell Repositories (http://locus.umdnj.edu/nigms), using the same Big Dye Terminator sequencing chemistry and equipment (Applied Biosystems). Both sites used the same software and virtually identical protocols for base calling, assembly, and SNP determination, as detailed on each of the respective web sites.
We also tested our method by using published haplotypes, which included 5 genes in a maximum of 418 multiplex families from the Diabetes UK Warren 1 Repository, 3 genes in 598 subjects from Finnish families with at least one sibling diagnosed with type 1 diabetes, and one gene from United Kingdom blood donors (13).
Haplotype Identification. Haplotypes comprising all SNPs with minor allele frequency of ≥10% were inferred for each gene, independently in each ethnic sample, by using default settings of the phase program (3) for each of the 105 genes. To account for the inherent error rate of current genotyping technologies, we selected only those haplotypes seen more than once (frequency of >4%).
Methods
Preliminary Definitions. We regard a SNP as a variable bearing, at most, four states, one for each of the possible alleles (A, T, G, and C), although, in practice, SNPs with more than two states are relatively uncommon. This variable can also encode insertion (+) and deletion (–) polymorphisms. An allele is the assignment of a value to a SNP: we will say that, in a particular individual, the SNP in a particular locus bears the value. A haplotype is a set of contiguous alleles, as they appear in the population of interest. A haplotype set is a set of contiguous SNPs, and it is identified by a set of haplotypes found in the population of interest. Fig. 1 shows the unique haplotypes from SNPs with ≥10% rare allele frequency found in gene TLR7 from an AfricanAmerican sample; columns represent SNPs, rows represent haplotypes, cells represent alleles, and the entire table is the haplotype set.
A haplotype set can be a haplotype block, an entire gene, or an arbitrary genomic region. A SNP is derivable from a set of SNPs if its alleles are uniquely identified by a combination of alleles of the SNPs in such a set. A set of SNPs is sufficient to derive the SNPs in a haplotype set if all the SNPs in the haplotype set can be derived from the SNPs in such set. A set of SNPs is necessary to derive the SNPs in a haplotype set if, when one of its members is removed, at least one SNP in the haplotype set, including itself, is no longer derivable. A tagging set is the set of SNPs in a haplotype set necessary and sufficient to derive all of the SNPs in the set. Among all of the tagging sets, a minimal tagging set is a set containing the minimum number of SNPs, and its members will be called htSNPs. For a given haplotype set, there may be more than one minimal tagging set. Our goal is to find at least one of them.
Haplotype Tagging.Fig. 2 gives a skeletal description of the haplotype tagging algorithm BEST. The algorithm takes as input a set S of haplotypes, each representing a unique set of values of each SNP in the haplotype set, such as the haplotype set displayed in Fig. 1. The algorithm returns a minimal set of SNPs from which all of the other SNPs in the haplotype set can be derived. A preliminary step of the algorithm is to convert the haplotype set into binary form. The colors in Fig. 1 encode the binary conversion of a haplotype set. Although we consider here only the case of biallelic SNPs, this encoding can be easily generalized to triallelic SNPs and is currently implemented as such. When two SNPs share the same binary representation, they are termed binary equivalent. Any tagging set including a SNP will be equivalent to a tagging set where it is replaced by one of its binary equivalent SNPs. After the binary conversion, the algorithm will keep only one member of each group of binary equivalent SNPs.
A fundamental property of this binary representation is that if a SNP is derivable from a set of other SNPs, then it will be derivable from any superset of such a set. This property spares the exponential effort of identifying the set of SNPs deriving a SNP. The second critical property exploited by BEST is that the tagging set identified by adding, at each step, the SNP that derives the maximum number of SNPs leads to SNP sets containing the minimal tagging set.
Let S = {s_{1},..., s_{m}} denote the m SNPs in a haplotype set. BEST recursively partitions the set S in two groups H and D such that S = H – D, where H is the minimal tagging set (the smallest set of SNPs necessary and sufficient to derive all of the SNPs in the haplotype set), D is the set of (m – k) SNPs that are derivable from H, and a SNP d_{j} is derivable from H if the value of d_{j} in each haplotype can be expressed as a Boolean function f(.) of elements of H. A property of these Boolean functions is that if a SNP d_{j} is derivable from a subset H′ of H, then d_{j} is also derivable from any subset of H containing H′. This property follows from the fact that if a SNP is derivable from set of SNPs H′, then its alleles are uniquely identified by the allele combinations of the SNPs in H. Therefore, f(h_{1},..., h_{k}) = f(h_{1},..., h_{k}, h_{k}_{+}_{1}, h_{k}_{+}_{2},...) for all h_{1},..., h_{k} in H′. In this way, one can check whether a SNP s_{j} is derivable from a subset of SNPs in S without necessarily knowing the specific subset, therefore avoiding the exponential cost of the search. In the skeletal description of the algorithm in Fig. 2, this operation is performed by the function DERIVABLE(x, Y), which returns true if the combinations of the alleles of a subset of the SNPs in the set Y uniquely identify the alleles of the SNP x. The function DERIVED(Y) returns the set of SNPs derivable from the set of SNPs Y. We can show that the cost of the function DERIVED(Y) is polynomial by noting that if a SNP x is derivable from a set of SNPs Y, it is also derivable by any superset of Y. Hence, to check the derivability of a SNP x from a set Y, we just check whether the alleles of at least one member of Y match the alleles of x. The function DERIVED(Y) is a simple iteration of the function DERIVABLE(x,Y) across the SNP set Y and can be executed in polynomial time.
The first step of the algorithm is the generation of the set H_{1} of htSNPs that are not derivable from any subset of S. This set is identified by examining the Boolean dependency of each s_{j} on the set of SNPs S \ s_{j}, and each s_{j} that is not derivable is assigned to H_{1}. The elements of H_{1} determine a partition of the remaining SNPs into . The set D_{1} contains the SNPs derivable from H_{1}, whereas C_{1} is the set of SNPs that are not derivable from H_{1} and are therefore candidate htSNPs. Next, an augmentation procedure is applied to move one or more elements from C_{1} to H_{1} and from C_{1} to D_{1}. First, the elements of C_{1} are sorted according to this criterion: if the set D_{2}_{i} of SNPs derivable from has cardinality greater than the cardinality of the set D_{2}_{j} of SNPs derivable from . If the criterion identifies only one c_{1} such that for all other c_{j} in C_{1}, then , D_{2} = D_{21}, and , and the procedure is repeated on the set C_{2} until the set of candidate SNPs is empty. When more than one set D_{2}_{i} of SNPs derivable from has the same size of D_{21}, then parallel partitions , D_{2}_{i} and D_{2}_{i}) are generated, and the augmentation procedure is repeated on each of them. When none of the SNPs in C_{1} augments the set D_{1}, then pairs of SNPs are treated as one single variable by the augmentation procedure. When the set of candidate SNPs is empty, if the augmentation procedure returns one or more necessary and sufficient sets H of htSNPs, then the algorithm stops. If no such set is found, the whole procedure is repeated on each of the sets H.
Proof of Optimality. We prove by induction on k, the minimum number of htSNPs, that the smallest set of necessary and sufficient htSNPs returned by BEST is the minimum set of htSNPs tagging the haplotype set at hand. Suppose first that k = 1, so that the minimum set of htSNPs consists only of one SNP. Because k = 1, then all elements of S are binary equivalent and BEST will return one of the m minimal tagging sets, each set given by {s_{i}}, for i = 1,..., m. Any of this set will be the minimum solution. Next, suppose the result is true for any minimum set of size k – 1 (that is, if the minimum set of htSNPs has size j ≤ k – 1, then we assume that BEST returns one of the minimum solutions of size j) and we show that the result is true when the minimum set consists of k htSNPs. More precisely, we assume that H = {h_{1},..., h_{n}} is the set returned by BEST ordered according to the augmentation procedure, and we show that n = k. Suppose that we can decompose the set of SNPs S into , where S_{k}_{–}_{1} is the subset of S that is decomposed into , D_{k}_{–}_{1}, H_{k}_{–}_{1} consists of the first k – 1 htSNPs in H, D_{k}_{–}_{1} is the set of SNPs that are derivable from H_{k}_{–}_{1}, and S_{k} is the set of SNPs in S that are not derivable from H_{k}_{–}_{1}. By induction, H_{k}_{–}_{1} is the minimum set of htSNPs for S_{k}_{–}_{1}, and it is equivalent to any minimum set of htSNPs for S_{k}_{–}_{1}. Furthermore, because we are assuming there is a minimum solution of size k, there exists at least one htSNP in S_{k} that, once added to H_{k}_{–}_{1}, makes of the minimum set of htSNPs. Because the set of SNPs derivable from is the whole set , then the cardinality of the set of SNPs derivable from is the largest, and therefore is the set of htSNPs found by BEST and n = k. If a decomposition does not exist, then we can decompose , where and H_{k}_{–}_{i} consists of the first (k – i) SNPs, and repeat the same argument.
Results
BEST was first used to tag the 105 genes, ranging from 5 to 229 SNPs in length, described in Materials. The results of this analysis are summarized in Table 1. These results confirm that a small proportion of SNPs (14% for African Americans and 10% for European Americans) is sufficient to capture the variations in a haplotype set. It is interesting to note that the proportion of SNPs required to tag a gene decreases exponentially as the number of constituent SNPs increases. Fig. 3 shows the sharp exponential decay of the number of htSNPs as the size of the haplotype increases in both populations. This decay is due to the fact that, as the total number of SNPs increases, the observed haplotypes are likely to be a smaller ratio of the entire sample space, which grows exponentially with the number of SNPs. The algorithm takes advantage of the limited haplotype diversity in the genome to achieve an exponential saving in genotyping as the haplotype length increases.
An interesting finding is that the majority of the htSNPs in the EuropeanAmerican sample also appear as htSNPs in the AfricanAmerican sample. Because a minimal tagging set is not necessarily unique, these proportions can only be taken as a lower bound of the shared htSNPs. Still, an average of 95% (and in most cases 100%) of the htSNPs in the EuropeanAmerican sample are a subset of the htSNPs of the AfricanAmerican sample. For 98 genes (94%), all of the htSNPs found in the EuropeanAmerican sample were also found in the AfricanAmerican sample. This finding is strikingly identical across the vast majority of the 105 genes here considered and suggests that the lower variability of the EuropeanAmerican population is indeed the result of a depletion of an original gene pool, consistent with a severe bottleneck occurring during the founding of Europe and the proposed “Out of Africa” event (11, 15, 16).
An important result is that BEST successfully identifies the minimum set of htSNPs even in haplotype sets that would be unfeasible to tackle by exhaustive enumeration or prone to error by eye. We also analyzed nine genes described by Johnson et al. (13). Fig. 4 Left shows the results obtained by BEST and compares them with the results published in the original report. For two genes of nine, CASP10 and SDF1, BEST found multiple alternative htSNP sets smaller than those reported in the original report. For SDF1, the largest of the genes described in the original report spanning 22 SNPs, Johnson et al. determine that five htSNPs are required to tag haplotypes with frequency of >5%. In contrast, BEST reveals 10 equivalent sets of four htSNPs for their data. Even for the smaller gene, CASP10, with just 11 SNPs, BEST was able to identify four alternative sets of three htSNPs, against the single set of four htSNPs in the original report. These results suggest that, although their method is able to find sets of tagging SNPs, these are not necessarily optimal. In all genes, BEST identified equivalent SNPs. In our experience, knowing alternative equivalent htSNP sets is often valuable in practice when individual htSNPs prove difficult to genotype because of flanking repeat regions.
Discussion
Haplotypebased studies are today considered one of the most promising approaches to discover the genetic basis of common diseases. One consequence of the haplotypic nature of the human genome is that only a subset of the SNPs in a haplotype will be sufficient to unambiguously distinguish the haplotypes. This feature of the genome promises to significantly reduce the number of SNPs required to completely genotype a sample and, in so doing, render feasible genomewide association studies. The identification of haplotype blocks created by the evolutionary history of the genome is an important step toward the identification of redundant SNPs, but the fulfillment of the promise of haplotypebased studies rests on the possibility of identifying which SNPs are actually able to tag a haplotype set with no information loss. This contribution described a feasible, exact, and lossless method able to identify such htSNPs and analytically tag an arbitrary stretch of the genome.
Current approaches focus on the identification of htSNPs based on linkage disequilibrium and on stochastic measures of haplotype diversity. Although these efforts provide useful insight into the natural history of the genome, we have shown that analytical haplotype tagging of arbitrary genomic regions is more efficient at identifying parsimonious sets of htSNPs. We believe that the ability of our method to identify the minimum tagging set for an arbitrary region of the genome can be instrumental in delivering on the promise of haplotypebased studies. Furthermore, the ability of our method to identify alternative minimal sets of htSNPs, when available, can be valuable in practice when htSNPs prove difficult to genotype. Coupling BEST with a map of human haplotypes would provide investigators with a powerful tool to design association studies.
A computer program implementing the method described here is available at http://genomethods.org/best.
Acknowledgments
We thank Emanuela Gussoni (Harvard Medical School), Stefano Monti (Massachusetts Institute of Technology/Whitehead Institute, Cambridge, MA), Alberto Riva (Harvard Medical School), and the referees for their insightful comments. This research was supported, in part, by National Science Foundation Grant ECS0120309 (to M.F.R. and P.S.) and National Institutes of Health Grants HL66795 (to S.T.W. and I.S.K.) and P01 NS40828 (to L.M.K. and I.S.K.). L.M.K. is an investigator of the Howard Hughes Medical Institute.
Footnotes

↵§§ To whom correspondence should be addressed at: Informatics Program, Children's Hospital, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115. Email: marco_ramoni{at}harvard.edu.

Abbreviations: BEST, Best Enumeration of SNP Tags; SNP, singlenucleotide polymorphism; htSNP, haplotype tagging SNP.
 Copyright © 2003, The National Academy of Sciences
References
 ↵
Lander, E. S. (1996) Science 274, 536–539.pmid:8928008
 ↵
Collins, F. S., Guyer, M. S. & Chakravarti, A. (1997) Science 278, 1580–1581.pmid:9411782
 ↵
 ↵
 ↵
 ↵
 ↵
Garey, M. R. & Johnson, D. S. (1979) Computers and Intractability: A Guide to the Theory of NPCompleteness (Freeman, New York).
 ↵
Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., et al. (2001) Science 294, 1719–1723.pmid:11721056
 ↵
 ↵
Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. (2002) Science 296, 2225–2229.pmid:12029063
 ↵
 ↵
Casci, T. (2002) Nat. Rev. Genet. 3, 573.
 ↵
Reich, D. & Goldstein, D. (1998) Proc. Natl. Acad. Sci. USA 95, 8119–8123.pmid:9653150
 ↵