New Research In
Physical Sciences
Social Sciences
Featured Portals
Articles by Topic
Biological Sciences
Featured Portals
Articles by Topic
 Agricultural Sciences
 Anthropology
 Applied Biological Sciences
 Biochemistry
 Biophysics and Computational Biology
 Cell Biology
 Developmental Biology
 Ecology
 Environmental Sciences
 Evolution
 Genetics
 Immunology and Inflammation
 Medical Sciences
 Microbiology
 Neuroscience
 Pharmacology
 Physiology
 Plant Biology
 Population Biology
 Psychological and Cognitive Sciences
 Sustainability Science
 Systems Biology
A multimetric approach to analysis of genomewide association by single markers and composite likelihood

Contributed by Newton Morton, December 19, 2007 (received for review September 15, 2007)
Abstract
Two case/control studies with different phenotypes, marker densities, and microarrays were examined for the most significant single markers in defined regions. They show a pronounced bias toward exaggerated significance that increases with the number of observed markers and would increase further with imputed markers. This bias is eliminated by Bonferroni adjustment, thereby allowing combination by principal component analysis with a Malecot model composite likelihood evaluated by a permutation procedure to allow for multiple dependent markers. This intermediate value identifies the only demonstrated causal locus as most significant even in the preliminary analysis and clearly recognizes the strongest candidate in the other sample. Because the three metrics (most significant single marker, composite likelihood, and their principal component) are correlated, choice of the n smallest P values by each test gives <3n regions for followup in the next stage. In this way, methods with different response to marker selection and density are given approximately equal weight and economically compared, without expressing an untested prejudice or sacrificing the most significant results for any of them. Large numbers of cases, controls, and markers are by themselves insufficient to control type 1 and 2 errors, and so efficient use of multiple metrics with Bonferroni adjustment promises to be valuable in identifying causal variants and optimal design simultaneously.
The last century provided a leapfrog from linkage in Drosophila to its slow development in human genetics, then to acceleration as blood groups and isozymes were replaced by DNA markers. This progress stimulated the Human Genome Project that provided a physical reference, which led in turn to the HapMap Project and endless diversity (1). Two milestones in this progression were recognition of DNA block and step structure (2) and advocacy of genomewide association (GWA) to identify common causes of disease (3). These developments were dramatically capped by success in six of seven common diseases (4), with most significant single markers (msSNPs) localized in the physical map in base pairs without reference to neighboring markers or a map in linkage or linkage disequilibrium (LDU). However, power to detect small effects, epistasis, rare causal genes, and disease determinants other than SNPs is low (5, 6). Will this simple method suffice if the numbers of cases, controls, and markers are greatly increased? Should other methods that may be more efficient be examined? It will be some time before such uncertainties are resolved, but a beginning can be made.
At this point it may seem that the many GWA studies now being conducted offer an embarras de choix. However, public tests are delayed necessarily by consortium agreements and unnecessarily by the assumption that only consortium members may be trusted to behave ethically (7). One way of filling this gap is to analyze simulated data that mimic some of the important features of real data. We evade the limitations of this approach to an unknown world by collaborating with two research groups on large sets of single nucleotide polymorphisms (SNPs) genotyped in a substantial number of cases and controls, with the understanding that the origin, disease, SNPs, locations, and other confidential information not be divulged before publication by the consortia (sample A), whereas after publication, the information is no longer confidential (sample B). In this way, some of the many questions raised by advances in association mapping can be realistically addressed before the expected flood of association data requiring GWA mapping and metaanalysis, without compromising anonymity of participants, data access agreements, the authorship rights of consortia to first publication, and other ethical constraints. However, the scope of this approach is presently limited to the first phase of GWA and is therefore heuristic rather than decisive. Only the importance of these issues justifies their examination at this time.
Stage 1 is typically a genome scan of nonoverlapping regions with the object of identifying the regional msSNP in the physical map or the maximum likelihood point estimate in a linkage or LDU map with multiple markers specified by a commercial gene chip, using conventional formula (8). Markers with rare minor allele frequencies or violating Hardy–Weinberg proportions are conventionally discarded. The most significant regions however defined are then submitted to phase 2 analysis in a different and preferably larger sample with more markers in the region of interest. Typically, the selected set is <1% of the number of regions in phase 1 and therefore can be studied in greater detail, although absence of an appropriate gene chip increases effort. More versatile phase 2 strategies that rely on sequencing are being developed with potential to detect rare causal genes not limited to SNPs (9). Limitations of any approach include choice of markers, sample selection and size, and methods of analysis. If msSNPs are used, regional significance is exaggerated relative to composite likelihood unless a Bonferroni adjustment or other appropriate correction is made as described in Materials and Methods, which also gives the way in which composite likelihood and the adjusted msSNP are combined by principal component analysis to yield three metrics. A unique feature of our algorithm is selection of the n most significant regions for each of the metrics, which generates <3n regions for phase 2. The msSNP is most sensitive to omission of a causal SNP, whereas composite likelihood is most sensitive to omission of neighboring SNPs. These metrics are given approximately equal weight and may be economically compared without expressing an untested prejudice or sacrificing the most significant result for any of them. Given adequate allowance for N regions (for example by setting P < 0.05/N), phase 3 may turn to sequence analysis with reasonable assurance both of a causal locus in the region and little probability of omitting an equally strong signal elsewhere.
Results
Composite Likelihood.
Sample A with 5,387 regions gave 53 nominal P values <0.01 for composite likelihood under H_{1}. Sample B with 3,078 regions gave similar results, with 28 regions having nominal P values <0.01. The number of permutation replicates in these subsets was raised from 1,000 to 50,000 to assure reliable estimates of variance and therefore P. The smallest P value in sample A increased 10fold from the initial value with 1,000 replicates, but there was no systematic effect on larger P values, which supports the conjecture that a number of replicates >10/P assures a reliable estimate of P (9). Taking χ_{2} ^{2} = −2lnP for sample A, the estimates of its mean and variance over all regions are μ = 2.0 and V = 4.2. Constraining V to its expected value of 4 under H_{0}, the estimate of μ remains 2.0. For sample B, the variance did not require adjustment because the values were μ = 2 and V = 4. Regression analyses with P values from composite likelihood as the dependent variable revealed no significant effect of either regional LDU length or SNP number for sample A or B.
Bonferroni Adjustment for msSNPs.
Evidence from msSNPs is very different. The value of nominal P falls far short of 1 in every region. In both samples, the maximum P value is near 0.3, showing the bias in selecting the msSNP from at least 30 SNPs in each region. We therefore assigned regions to subsets with limited SNP number diversity and determined R by regula falsi, where R is the effective number of independent SNPs in a given subset, and S is the weighted mean number of SNPs in a region. Table 1 shows these values for each of the subsets for the two samples. Sample A has more subsets and a more even distribution of regions per subset. Sample B has lower SNP density and therefore often requires >10 LDU to have at least 30 SNPs. Not surprisingly, regression on LD length is significant only in sample B.
The highly significant relationship between S and R was investigated by regression in each sample. Weighting by m, the regional number of markers, sample A gives a good fit to the linear model R = aS. Sample B does not fit a linear model, the quadratic term being significant (P = 0.0035). The best fit with two parameters is to the model R = a(1 − e ^{−bS}) (Table 2). Compared with sample A, the lower SNP density in sample B has a more variable distribution on the LD map. The relation between R and S must vary among msSNP samples, just as the error variance for composite likelihood does, but the Bonferroni correction is easily made. By using these relationships, a value of R can be calculated for each region based on the number of SNPs in the region (S). Then R is used to correct the P value, P_{ci} = 1 − (1 − P _{ni}) ^{R} . Taking χ_{2} ^{2} = −2lnP_{c} , for sample A, the estimates of mean and variance over all regions are μ = 2.0 and V = 5.2. Constraining V to its expected value of 4 under H_{0}, the estimate of μ becomes 1.8, corresponding to β = 1.1 (see Materials and Methods). For sample B, the values are μ = 2 and V = 4 as before.
Combination of Evidence.
We applied principal component analysis to all regions for the χ^{2} values of composite likelihood and msSNPs, calculated as −2lnP_{c} and adjusted to V = 4 where necessary. The first principal component was converted to a rank, which was then transformed (10) to P and χ_{2} ^{2}. For sample A, the largest combined χ^{2} (2 df) is 17.2, and the top 50 are all >9.3. For sample B the largest χ_{2} ^{2} is 16.1, and the top 50 are all >8.2. As with composite likelihood and msSNPs examined separately, no region met the critical significance level of 0.05/N, corresponding to χ_{2} ^{2} of 23.17 for sample A and 22.06 for sample B.
Among the 100 most significant regions (50 from each sample) identified by combination of the two χ^{2} values, 4 had a second principal component with value >4, indicating substantially greater significance for the msSNP than for composite likelihood. In no instance was the converse observed (second principal component less than −4). Local LDU maps for these outlier regions were constructed from control data, and the derived composite likelihoods were compared with initial results from the cosmopolitan HapMap. There was little difference between the LDU maps in terms of the fit to the data and the structure and length of the maps. The composite likelihood χ^{2} values were also very similar (Table 3). These 4 cases had the largest difference in χ^{2} between the msSNP and the next most significant SNP in their respective samples, as expected for an LDU subregion with low SNP density.
The high rank of the 4 msSNPs even after Bonferroni correction is not paralleled by composite likelihood, the rank of which exceeds 500 and shows no suggestion of association even in local LDU maps constructed from control data that give slightly larger estimates of ε for the rate of LDU decline. This inconsistency cannot be resolved until metaanalysis of multiple samples has determined the error rate of the two methods. Provisionally, the first principal component PC1 is favored because it combines the evidence, providing an intermediate P value that lies within the top regions without competing with the most significant ones. Absence of a standard error for the msSNP complicates metaanalysis, whether or not the PC1 is used.
Tables 4 and 5 summarize all regions within the 10 most significant by composite likelihood, msSNP, or combined rank (PC1). The last is most significant in sample A for one region and least significant for another, whereas in sample B the distribution is 1 most significant and 2 least significant. All five differences are small and occur among the combined ranks of 10 or less. Sample A has greater variability than sample B in composite likelihood rank, despite its greater number of individuals. This variability may well be an artifact of the greater number of regions in sample A generated by more SNPs and the convention of at least 30 SNPs per region. Instead of the 30 tests expected if the three metrics were independent, there are 20 tests for sample A, in 8 of which χ_{2} ^{2} is greater for composite likelihood than for the Bonferroniadjusted msSNP. In sample B, there are 16 tests, in 9 of which χ_{2} ^{2} for composite likelihood is greater. The most powerful test has yet to be determined, but prejudice in phase 2 can be avoided by selecting regions with the smallest P values for any metric. If a large number of cases, controls, and markers is used, we conjecture that minimal rank need not exceed 10 as in Tables 4 and 5.
Discussion
“Classical genetics emerged in 1900 with the rediscovery of Mendelism and ended in 1953 with the publication of the doublehelical structure of DNA” (11). Its definitive characteristic was experimental as anticipated by Mendel (12), and therefore the observational priority of de Maupertuis (13) in deducing dominant inheritance from a human pedigree of polydactyly is seldom recognized, although the decisive role of cytogenetic observation in the emergence of classical genetics is generally acknowledged. Human genetics (despite continued reliance more on observation than experimentation) played an increasing role in science during the postclassical half century, culminating in the Human Genome Project and its HapMap successor. For the first time, tests of the assumptions underlying evolutionary genetics were becoming feasible, although not probative. Genetic epidemiology has become less familyoriented in rising to a new challenge: “The success of the HapMap will be measured in terms of the genetic discoveries enabled, and improved knowledge of disease etiology” (1). As the unit in all branches of genetics shrinks from the species to the cell, distinction between observation and experimentation becomes fastidious. However designated, the challenge will be to determine how to exploit the massive accumulation of genomic data soon to be released.
Many of our initial results in GWA tests were not anticipated. It proved surprisingly easy to obtain a Bonferroni adjustment for msSNPs, despite the diversity of their regional lengths and SNP densities, but as shown in Table 2, the two samples conform to different twoparameter models. We conjecture that this represents different sampling rules. When evidence was combined in the PC1, the most significant region in sample B gave a point location that coincided with a gene identified in multiple samples as causal. Its region ranked 8 for composite likelihood and 1 for msSNP and PC1. However, when the 50 most significant regions in each sample were examined, none met the conservative Bonferroni level of 0.05/N, where N is the number of regions in a given genome scan. Of the 100 most significant regions when the two samples were pooled, 4 gave evidence only from the msSNP, with no support from composite likelihood. Were those msSNPs type 1 errors or wrongly placed? When the cosmopolitan HapMap was replaced by the local map, evidence from composite likelihood became even weaker. Was SNP coverage inadequate in those regions? At present, there can be no objective recognition of the more reliable test, and so we include PC1 that places the 4 outliers among the most significant regions, but far from the top. The second principal component was used only to detect discrepancies between evidence from msSNPs and composite likelihood and thereby to retain for metaanalysis the 4 outliers that were not identified by composite likelihood.
During the past year much has been learned about association mapping, but the field is still in its infancy compared with a century for linkage. Whereas composite likelihood has proved its utility in regional studies, reliance on msSNPs in genomewide tests has produced some notable successes that account for a small fraction of disease association. Many causal markers must remain to be identified (5, 6). Very large numbers of cases and controls have been invoked as a panacea, either for a single incisive study or metaanalysis of multiple smaller studies that by luck or design use the same most predictive SNP. One alternative is to infer SNPs that may exist and if so may be correctly imputed (14). The authors of that approach note that “the optimal way to combine called genotypes with imputed data is not clear.” A valid analysis requires Bonferroni correction of significance attributed to the inferred msSNP based on both observed and imputed SNPs. The type 1 and 2 errors of these alternatives have not yet been examined. If association mapping is approached as carefully and from as many directions as linkage analysis of major loci in the last century, high power and reliability will be attained.
Materials and Methods
Samples and Regions.
Sample A is composed of affected and normal subsamples, termed case and control, respectively. It has >200,000 SNPs typed in 403 cases and 395 controls, analyzed on an Illumina 300M gel. Because no results have yet been published, we are committed in this work to minimal description. Sample B dichotomizes the 7.5% tails of a quantitative trait: for our present purpose we classify high values as cases and low values as controls. This German sample is part of the material that led to recognition of the NOS1 regulator NOS1AP as a modulator of cardiac repolarization. The quantitative trait is the electrocardiographic (ECG) QT interval with ≈30% heritability. Typing was by Affymetrix oligonucleotide arrays containing 115,571 SNPs as described in detail (15). Our sample differs from these data only by inclusion of males and females, both corrected for heart rate, age, and sex as published and giving 208 cases and 201 controls. Hardy–Weinberg tests were conducted on controls in both samples A and B, excluding SNPs with χ_{1} ^{2} >10. The remaining SNP data in each sample were split into nonoverlapping regions, each of which covers at least 10 LDU and contains a minimum of 30 SNPs without breaking blocks of LD. Because of differences in SNP density, doing so gives 5,387 regions in sample A and 3,078 regions in sample B.
LDU Maps and CHROMSCAN.
Genomic patterns of LD are informative for locating disease genes, and power increases when a causal marker is typed. LDU maps describe these patterns more accurately than kilobase maps (16). Physical locations were taken from build 35 of the human genome sequence (University of California, Santa Clara, May 2004). Unlike physical maps, studyspecific and various genomewide LDU maps are available corresponding to the four HapMap samples separately and combined (17). The LDU map with the highest SNP density, largest sample size, and closest to the experimental data should be optimal. We therefore use the cosmopolitan LDU maps constructed from Phase II HapMap data (release no. 20, January 2006) and available at www.som.soton.ac.uk/research/geneticsdiv/epidemiology/LDMAP/map2.htm.
The advent of GWA analysis led to dramatic increase in the computation time of CHROMSCAN, which analyses regions sequentially. Therefore, a parallel version (CHROMSCANcluster) has been developed to analyze multiple regions simultaneously (18; www.som.soton.ac.uk/research/geneticsdiv/epidemiology/chromscan/). In this way, large datasets can be studied without difficulty. These and earlier applications of composite likelihood are based on the Malecot model, two subhypotheses of which are used to test for a causal polymorphism within each region. Model A, which assumes no association between affection status and SNPs, is taken as the null hypothesis and compared with model D, which estimates disease location (S), its intercept under complex inheritance (M), and residual association (L). The test statistic depends on the composite likelihoods of these two models. To account for autocorrelation between SNPs as a result of LD, significance is determined empirically by a rankbased permutation test. To determine accurate levels of significance, the number of permutation replicates must approach 10/P. We therefore use 1,000 replicates to perform preliminary screens and, where necessary, increase this number to 50,000. The only deviation from the recent description of CHROMSCAN (9) is that P values are now taken from Ewens (10) instead of Tukey (19) because the former more closely approximates a uniform distribution (last section). CHROMSCAN assumes allelic additivity because a causal SNP may well not have been tested and nonadditivity degrades with recombination.
msSNPs.
To compare evidence from composite likelihood and single SNPs, we identify the msSNP for each region. Selecting the msSNP from such a large number of SNPs (30 or more) biases the nominal χ_{1} ^{2} and conventional P value computed on the null hypothesis. This bias was confirmed by regressing msSNP P values on LDU length and SNP number. Under H_{0}, the P values for random SNPs should correspond to χ_{2} ^{2} = −2lnP (20), with variance V = 4 and mean μ = 2. If selection of msSNPs were unbiased, adjustment of V would give an estimate of μ near 2, whereas μ is less sensitive to small values of P and therefore would not provide a good estimate of V. We must reduce the bias in μ before adjusting V.
Step 1.
Because regions defined above vary in the number of SNPs, our first problem is to select subsets with limited diversity but including a large number of regions so that estimates of the Bonferroni parameter R will be accurate. For a given subset, the weighted mean number of SNPs is S = Σf _{i} m _{i}/Σf _{i}, where f _{i} is the number of regions with m _{i} SNPs.
Step 2.
Let R be the effective number of independent SNPs in a subset assigned S SNPs. The Bonferroni model assumes a corrected P value of P_{ci} = 1 − (1 − P _{ni}) ^{R} . To obtain a mean of χ_{2c} ^{2} = 2 when χ_{2c} ^{i2} = −2lnP_{ci} , we take and solve the equation by regula falsi to give the Bonferroni P_{ci} with the desired mean χ_{2c} ^{2} of 2. This method requires two estimates of R flanking the final estimate so that one gives a negative solution and the other gives a positive solution. These values of R are then iterated until a solution sufficiently close to 0, in this case to five decimal places, is obtained (Table 1). The relationship between R and S was then determined by regression so that a value of R could be assigned to each region given S. Corrected P values for msSNPs are then given by 1 − (1 − P _{ni}) ^{R} .
Step 3.
To set the variance of χ_{2c} ^{2} to 4 requires dividing both χ_{2c} ^{2} and μ by to give the desired variance with mean 2/β, which is acceptable only if β ≈ 1.
This calculation greatly reduces the significance of msSNPs but conserves the order of the nominal P values without consideration of smaller effects of numbers of SNPs tested in the region and length of the region in kilobases or LDU. Analysis of composite likelihood is simpler because steps 1 and 2 are not required. Whereas the power of an msSNP depends on its inclusion in a sample that is a small fraction of all SNPs in the genome, composite likelihood does not require inclusion of a causal SNP. Instead of a single P value, composite likelihood gives a point estimate, standard error, and information that allow inference of confidence intervals and efficient metaanalysis.
Combination of Evidence.
The relationship between the corrected msSNP χ_{2} ^{2} and composite likelihood converted to χ_{2} ^{2} was determined by using correlation analysis. A principal component analysis based on this correlation matrix gives a PC1 with positive coefficients and a second principal component (PC2) that is negative when the msSNP has a lower rank than composite likelihood. PC1 was used to order and rank all N regions. Following Ewens (10), this was converted to a P value by rank/IN to give χ_{2} ^{2} = −2lnP for each region based on the combination of evidence. For each sample, the regions with the highest χ_{2} ^{2} defined in this way were examined further.
Calculation of Empirical P Values from Composite Likelihood.
Association mapping of a gene contributing to complex phenotypes requires an efficient estimate of genomic location and its standard error, derived from autocorrelated markers whose complex relationships must be parsimoniously approximated. This is a classical problem for composite likelihood formed by adding together individual component log likelihoods, each of which corresponds to a marginal or conditional event (21). Composite likelihood is often used in statistical genetics to make inferences about current or ancestral populations. It invariably encounters the problem that its component log likelihoods are not independent, and so conventional estimates of P values and standard errors are approximate. However, reliable values can be obtained by simulation of n replicates. Under H_{0}, the rankbased distribution of P is uniform with mean 1/2 and variance 1/12 in the limit as n → ∞. Many recipes have been proposed for reliability with smaller values of n, all of the form P _{i} = (i − A)/(n + B) for the ith value of P, where A and B are constants. Because the mean of the arithmetic progression from i = 1, … ., n is ı̄ = (n + 1)/2, the mean of P _{i} is (n + 1 − 2A)/2(n + B), which equals ½ only if B = 1 − 2A or n → ∞. Two suggestions for Monte Carlo methods, A = B = 0 and A = −1, B = 1, are biased in opposite directions with finite n. Because the mean is specified without error, the variance is computed with df = n.
This is shown in Table 6 for n = 10, which is unrealistically small but shows how different values of A and B behave. Taking A = B = 0, the mean for a uniform distribution is exceeded and the variance is underestimated, but the smallest P value (P _{1}) is uniquely 1/n. This is critical because P for single samples under H_{1} is estimated by interpolation from replicates under H_{0}, requiring higher accuracy for the smallest values that are of greatest interest. All models that give the expected mean of 0.5 underestimate P _{1}, most grossly when A is set at 0.525 to recover the correct variance, by using regula falsi to estimate A so that the variance is correct to five decimal places (i.e., 0.08333). Progress with increasing n is shown in Table 7. The estimate of skewness γ_{1} is 0 and of excess γ_{2} is −1.2, as expected for a uniform distribution. The variance approaches 1/12 in the last three models for n = 100 but more slowly as B increases. The ad hoc fit of the variance by the last model for n = 10 deviates for n = 100 and 1,000. Even at n = 100,000 (data not shown), A = B = 0 is the only model that gives P _{1} = 1/n, although A = 0, B = 1 converges quickly. The latter has the advantage of giving a correct mean but at the cost of a more conspicuous underestimate of the variance. All of the other models give misleading estimates of P _{1} unless n is substantially >10,000.
Our role has been to determine the properties of different models that until now have been considered competitive to estimate significance levels for composite likelihood by Monte Carlo methods, widely used for association mapping and other applications of population genetics. This evidence demonstrates that the model of Ewens is best among the several alternatives that have been disputed and the infinite number that could be proposed. Alternatives should be abandoned.
Acknowledgments
This work was sponsored by grant GM69414 from the National Institute of Health. We are grateful to Dan Arking and Avarinda Chakravarti for facilitating access to the KORA data (15).
Footnotes
 ^{‖}To whom correspondence should be addressed at: Human Genetics Division, School of Medicine, Duthie Building, Mailpoint 808, Southampton General Hospital, University of Southampton, Tremona Road, Southampton SO16 6YD, United Kingdom. Email: nem{at}soton.ac.uk

Author contributions: N.M. designed analysis; D.C., A.P., C.G., H.E.W., S.K., and T.M. performed research; A.R.C. contributed new analytical tools; J.G., W.T., and W.Z. analyzed data; and N.M. wrote the paper.

The authors declare no conflict of interest.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 ↵

↵
 Risch N ,
 Merikangas K
 ↵
 ↵
 ↵
 ↵
 ↵

↵
 Bentley DR
 ↵

↵
 Carlson EA

↵
 Mendel G
 ↵
 ↵
 ↵

↵
 Maniatis N ,
 et al.

↵
 Tapper W ,
 et al.

↵
 Lau W ,
 Kuo TY ,
 Tapper W ,
 Cox S ,
 Collins A
 ↵

↵
 Fisher RA
 ↵
 ↵

↵
 van der Waerden BL
 ↵

↵
 Blom C
Citation Manager Formats
Sign up for Article Alerts
Jump to section
You May Also be Interested in
More Articles of This Classification
Biological Sciences
Genetics
Related Content
 No related articles found.