# Measuring missing heritability: Inferring the contribution of common variants

^{a}Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel 69978;^{b}Broad Institute of MIT and Harvard, Cambridge, MA 02142;^{c}Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139; and^{d}Department of Systems Biology, Harvard Medical School, Boston, MA 02155

See allHide authors and affiliations

Contributed by Eric S. Lander, October 10, 2014 (sent for review June 15, 2014)

### This article has a Letter. Please see:

- Relationship between Research Article and Letter - September 28, 2015

## Significance

Studies have identified thousands of common genetic variants associated with hundreds of diseases. Yet, these common variants typically account for a minority of the heritability, a problem known as “missing heritability.” Geneticists recently proposed indirect methods for estimating the total heritability attributable to common variants, including those whose effects are too small to allow identification in current studies. Here, we show that these methods seriously underestimate the true heritability when applied to case–control studies of disease. We describe a method that provides unbiased estimates. Applying it to six diseases, we estimate that common variants explain an average of 60% of the heritability for these diseases. The framework also may be applied to case–control studies, extreme-phenotype studies, and other settings.

## Abstract

Genome-wide association studies (GWASs), also called common variant association studies (CVASs), have uncovered thousands of genetic variants associated with hundreds of diseases. However, the variants that reach statistical significance typically explain only a small fraction of the heritability. One explanation for the “missing heritability” is that there are many additional disease-associated common variants whose effects are too small to detect with current sample sizes. It therefore is useful to have methods to quantify the heritability due to common variation, without having to identify all causal variants. Recent studies applied restricted maximum likelihood (REML) estimation to case–control studies for diseases. Here, we show that REML considerably underestimates the fraction of heritability due to common variation in this setting. The degree of underestimation increases with the rarity of disease, the heritability of the disease, and the size of the sample. Instead, we develop a general framework for heritability estimation, called phenotype correlation–genotype correlation (PCGC) regression, which generalizes the well-known Haseman–Elston regression method. We show that PCGC regression yields unbiased estimates. Applying PCGC regression to six diseases, we estimate the proportion of the phenotypic variance due to common variants to range from 25% to 56% and the proportion of heritability due to common variants from 41% to 68% (mean 60%). These results suggest that common variants may explain at least half the heritability for many diseases. PCGC regression also is readily applicable to other settings, including analyzing extreme-phenotype studies and adjusting for covariates such as sex, age, and population structure.

Comprehensive genomic studies have begun to uncover the genetic basis of common polygenic inherited diseases and traits, identifying thousands of loci and pointing to key biological pathways. However, the genetic variants implicated so far account for less than half the estimated heritability of most diseases and traits (1). Explaining the remainder of the heritability—often termed “missing heritability”—is of considerable biological interest and medical importance. This is our third article on exploring the mystery of missing heritability.

In our first paper (2), we noted that some of the apparently missing heritability may arise from a methodological issue. Specifically, we showed that the presence of genetic interactions among loci might substantially inflate estimates of the total (narrow-sense) heritability and thus overstate the extent of missing heritability. However, this likely is only a partial explanation.

In our second paper (3), we explored the design of association studies to discover genetic variants associated with the risk of a disease or trait. Specifically, our paper focused on rare variant association studies (RVASs), for which large-scale comprehensive efforts are just becoming feasible with advances in sequencing technology. The approach involves sequencing every human gene in a large case–control study to see whether the aggregate frequency (or “burden”) of a set of rare variants differs between cases and controls. (Rare variants may be defined operationally as having frequency ≤0.5%.) We studied how the power of RVASs depends on various factors, such as the selection coefficient against null alleles, the type of rare variants to be aggregated (based, for example, on allele frequency and mutational type), and the population studied. We concluded that RVASs with adequate power to detect genetic effects of interest should involve at least 25,000 cases.

In this third paper, we turn our focus to common variant association studies (CVASs). (Such studies typically are referred to simply as genome-wide association studies, or GWASs, but we prefer the term CVAS to highlight the complementarity with RVAS.) CVAS involves testing millions of common genetic variants for correlation with disease in case–control studies. CVAS has the advantages that one can enumerate the complete set of common variants in a population; each variant is frequent enough to be tested individually; and variants may provide information about a nearby region (as the result of linkage disequilibrium). Whereas RVAS only now is becoming feasible, CVAS became practical with the advent of inexpensive large-scale genotyping arrays roughly a decade ago. CVASs have been performed for hundreds of diseases, involving a total of approximately 2 million samples. The fruits of these studies include the discovery of hundreds of loci for inflammatory bowel disease, schizophrenia, early heart disease, and type 2 diabetes (4).

Whereas early association studies in the 1990s used loose thresholds for statistical significance (e.g., *P* ≤ 0.05) and were notoriously irreproducible, CVAS imposes an extremely stringent threshold for statistical significance (on the order of *P* ≤

An open question has been how much of the heritability of a trait is attributable to common variants. A straightforward approach for inferring the heritability due to common variants is to add up the estimated heritability contributed by each of the genetic variants that have achieved clear genome-wide statistical significance. This calculation typically yields a relatively low proportion—e.g., 5–10% for height (5⇓⇓–8), which has estimated heritability of 80%. The obvious problem with the approach is that it provides only a lower bound that likely is a substantial underestimate because it ignores the many loci that have not yet reached genome-wide significance. Thus, there has been considerable interest in ways to estimate the total heritability attributable to common variants in an indirect manner that does not require definitively identifying the loci.

Visccher and coworkers [Yang et al. (9)] made major contributions to this program, focusing on the situation of quantitative phenotypes studied in a random population sample. The fundamental idea is to estimate the heritability due to common variants by studying the extent to which the phenotypic similarity across pairs of individuals in a sample is explained by their genotypic similarity at common variants. Rather than using simple correlation, they used a family of elegant statistical models, called linear mixed models (LMMs), and estimated the heritability using a technique called restricted maximum likelihood (REML) estimation. By applying REML to a study of height, they estimated that the common variants examined explained ∼50% of the heritability, which was considerably greater than the 5–10% obtained from summing the contributions of variants that have achieved statistical significance. Following this pioneering work, the REML approach has been applied to many other CVASs of quantitative phenotypes, yielding significant increases in estimated heritability explained by common variants.

Subsequently, the same group [Lee et al. (2011) (10)] sought to extend this approach to disease traits (or, more generally, any binary phenotypes) in case–control studies. Modifying the REML method for this setting, they applied it to three diseases studied by the Wellcome Trust Case Control Consortium (WTCCC): type 1 diabetes, Crohn’s disease, and bipolar disorder. The estimated heritability explained by common variants was considerably higher than that obtained from summing the contributions of individual loci. Their approach has since been applied to numerous other disease phenotypes, with similar results (see, e.g., refs. 11⇓–13).

Here, we reexamine the REML approach for disease traits in case–control studies. We identify a flaw in the underlying assumptions that creates a serious bias in the REML estimate for disease traits. Consequently, we show that the REML approach of Lee et al. (10) underestimates the heritability explained by common variants. The magnitude of the bias is affected by many factors, most notably the study size, the prevalence of the disease in the population, the proportion of cases in the study, the true underlying heritability, and the number of genotyped SNPs [some of these factors were noted independently by others (14, 15)]. For example, the simulation studies below show that for a disease with 0.1% prevalence in which common variants actually explain 50% of the heritability, the REML approach applied to a balanced case–control study of 4,000 individuals will yield, on average, an estimate as low as 30–35% (depending on the number of SNPs used). Importantly, this bias increases with study size, suggesting that the inaccuracy of REML estimates will increase as larger and larger CVASs are conducted.

Instead, we propose a general framework for heritability estimation, which we term phenotype correlation–genetic correlation (PCGC) regression, which produces unbiased estimates for case–control studies of disease traits. The approach generalizes traditional regression-based approaches used in genetics (16, 17). We show that PCGC regression yields substantially higher estimates of the heritability explained by common variants for several diseases, including Crohn’s disease, bipolar disorder, type 1 diabetes, early-onset myocardial infarction (MI), schizophrenia, and multiple sclerosis (MS). The PCGC framework also is suitable for other settings involving phenotype-guided sampling, including selection of extreme phenotypes for quantitative traits.

Below, we begin by outlining a general model for a quantitative trait, and review methods for heritability estimation in this situation. We then turn to case–control studies of disease traits and describe current efforts for estimating heritability. We discuss the challenges induced by case–control sampling and use simulations to demonstrate that current methods result in serious downward-biased heritability estimates. To overcome this problem, we introduce an alternative approach, PCGC regression for heritability estimation, and demonstrate that it provides unbiased estimates and improved accuracy. We then use PCGC regression to estimate the heritability due to common SNPs in several case–control studies. Our results show that the fraction of heritability explained by common SNPs is larger than previously thought. We conclude by describing several extensions of our PCGC framework to other CVAS scenarios, such as accounting correctly for additional covariates and analyzing extreme-phenotype studies.

## Results

### General Model of Quantitative Traits.

In the general case, a quantitative phenotype *p* depends on genotype *g* and environment *e*, according to a function ψ. For the *i*th individual, we have *p*_{i} *=* ψ*(g*_{i}*,e*_{i}*)*. Here, the genotype *g*_{i} *=* (*g*_{i1}*, g*_{i2}*, . . . g*_{in}) is the diploid genotype at every variant site across the genome, where *g*_{ik} is the number of copies of a designated allele at the *k*th variant site, and *f*_{j} is the frequency of the designated allele. (We will assume that the variant sites are in linkage equilibrium and all alleles are biallelic, although these assumptions may be relaxed.)

This definition is completely general. It may be used with whole-genome sequence data, with the variant sites corresponding to every nucleotide in the human genome. Moreover, the function ψ allows for arbitrary gene–gene (GxG) and gene–environment (GxE) interactions.

For convenience, we will work below with “normalized” phenotypes *p*_{i} and genotypes *g*_{ik}, where the quantities *p*_{i} and *g*_{ik} have been centered to have mean 0 and standardized to have variance 1.

### Additive Model for Quantitative Traits.

Analyses of heritability typically assume a much simpler additive model. We do so here, assuming that our quantitative trait follows a simple additive model with no GxG or GxE interactions. We also assume there is no correlation in the environmental effects among individuals. We write*u*_{k} denotes the normalized effect of the *k*th variant. This model is illustrated in Fig. 1*A*.

### Defining Heritability.

Heritability quantifies how much of the variability of *p* is the result of variability in *g*. There are two types of heritability: broad-sense heritability *H*^{2}, which measures the full contribution of genes, and narrow-sense heritability *h*^{2}, which is meant to capture the “additive” contribution of genes (see ref. 2 for definitions). For our additive model, these definitions coincide, and we have

### Estimating Heritability from Significantly Associated Variants.

Suppose a certain subset *S* of variants has been shown definitively to be associated with our quantitative trait, based on a large CVAS with a stringent threshold for statistical significance. The heritability explained by these known loci may be estimated as**3** fails to include many additional variants associated with the trait because they have not reached statistical significance in the sample—for example, because of low effect sizes or lower minor allele frequencies. It is clear that most current disease studies have many false negatives, inasmuch as the number of loci identified has been continuing to grow with sample size.

### Estimating the Aggregate Impact of All Variants.

The challenge thus is to estimate the heritability attributable to all common variants associated with a disease, without actually identifying these variants. The basic idea follows from the classical notion, articulated by Galton (18) in the 19th century, that the heritability of a trait is captured by the extent to which genetic similarity between individuals predicts their phenotypic similarity. The challenge is to convert this notion into a mathematical procedure for estimating heritability.

Yang et al. (9) made the key observation that because we are interested in the value *u*_{k}, the effect sizes may be regarded as “nuisance parameters.” They adopted a “random effects” model in which the *u*_{k} are treated as identical and independently distributed random variables drawn from a distribution with mean 0 and variance

Under our additive model above with random effects, we have the following elegant relationship:*G*_{ij} denotes the genetic correlation between individuals *i* and *j*, given by**4** provides an intuitive method for estimating heritability—regressing the empirical phenotypic correlations (*p _{i}p_{j}*) onto the genetic correlations

*G*

_{ij}. The estimated slope of this regression is an unbiased estimator of the heritability. This procedure is known in the genetics literature as Haseman–Elston regression (16, 17). We refer to the general idea of regressing the phenotypic correlations onto the genetic correlations and using the slope for heritability estimation as PCGC and note that Haseman–Elston regression is a special case of PCGC regression.

To apply Eq. **4**, we need to know the genetic correlation *G*_{ij} between people. However, a problem shared by all heritability estimation methods is that these correlations are unknown and need to be estimated from the data. A classical approach used in human genetics is to use the expected kinship coefficient for individuals in a pedigree (for example, 50% for siblings, 3.125% for second cousins, or 0 for unrelated individuals). However, the genetic correlation between related individuals may vary considerably around the kinship coefficient, and unrelated individuals may have substantial cryptic sharing, resulting in nonzero correlations. For example, the genetic correlation between siblings varies around the expected value of 50% with an SD of 3.6% (19). Hence, more accurate estimates of *G*_{ij} may be obtained by directly examining partial genotype or complete sequence information. We return to the problem of estimating *G*_{ij} below.

### Improving Heritability Estimates with REML.

Although the PCGC regression approach for estimating heritability is easy to understand and implement, and produces unbiased estimators, it does not, in fact, make full use of all available information. In a sense, PCGC regression looks only at pairs of individuals at a time, whereas there is additional information in looking at trios of individuals simultaneously, or even at the entire cohort simultaneously. More precisely, the PCGC regression estimator is a moments-based estimator. In statistics, maximum likelihood estimators generally are preferred because they can extract more information and provide more precise estimates.

To use maximum-likelihood estimation, one is required to put forward an explicit probabilistic model of the data (by contrast, the PCGC regression approach requires only independence assumptions, namely that the genetic and environmental effects are independent, and that the environmental effects of different individuals are independent). The common additional assumptions are that the genetic component *g* and the environmental component *e* of the phenotype both follow a normal (Gaussian). In this case, the phenotype *p* is distributed normally, and the joint distribution of the phenotype vector *p**= (p*_{1}*,p*_{2}*,…*,*p*_{n}) is given by*I* denotes the identity matrix, and *G* is the genetic correlation matrix whose off-diagonal entries are the pairwise correlations *G*_{ij} and its diagonal is 1. This model is a special case of a more general statistical approach known as random effects models, and the problem of estimating heritability is a special case of the problem of estimating variance components.

Yang et al. (9) use this framework for the estimation of heritability, which allows them to apply well-established methods such as REML to obtain estimates of

### Model for Disease Traits.

From the description above, estimating the heritability due to common variants may be considered largely solved for the case of a quantitative trait. However, the primary focus of medical genetics is disease traits—which are binary (affected vs. unaffected) rather than quantitative. Disease traits pose further challenges.

Disease phenotypes traditionally have been modeled by a liability threshold model (illustrated in Fig. 1*B*; see ref. 2 for details). The model assumes the existence of a quantitative trait, called the “liability score” and denoted *l*. As above, we have *l* = *g* + *e*, where the genetic component *g* and the environmental component *e* both are normally distributed and uncorrelated with each other. Individuals are affected if and only if their liability score exceeds a threshold *t*. The value of *t* determines the prevalence of the disease in the population, so the liability threshold model can accommodate diseases of any frequency by adjusting the threshold parameter accordingly.

Heritability may be calculated based on either the unobserved liability scale (denoted *K* is the prevalence of the disease in the population, *t* is the threshold, and

### Adapting REML to Case–Control Studies of Disease Traits.

Lee et al. (10) sought to use this framework to adapt the REML method to disease traits. In the case of a random sample from the population, they offer a simple recipe: (*i*) code the disease phenotype as a 0/1 variable, (*ii*) use the REML procedure for a quantitative trait to calculate the heritability of the 0/1 on this observed scale, and (*iii*) convert the resulting estimate to the liability scale as in Eq. **7**. There is an important caveat: although the liability is a continuous quantitative trait, the 0/1 variable itself is not and therefore does not actually fit the likelihood function assumed in the REML method (Eq. **6**). Although this approach yields unbiased estimates (*SI Appendix*, section 5.4.2), the resulting estimates no longer are maximum-likelihood estimates; thus, some of the favorable properties of maximum likelihood estimates no longer are guaranteed to hold.

Real disease studies, however, rarely involve a random sample from the population: the number of affected individuals captured in the sample would be too small. Instead, geneticists use case–control studies, in which cases are considerably oversampled relative to their prevalence in the population. Because of this oversampling of cases, several assumptions of the probabilistic model of REML are violated: (*i*) the marginal distributions of the genetic and environmental effects, as well as of the liability, no longer are normal; (*ii*) the multivariate distribution of these effects no longer is multivariate normal; and (*iii*) the genetic and environmental effects no longer are independent. We illustrate two of these problems below.

### Problem 1: Nonnormality of the Liability.

Whereas the liability score follows a normal distribution in the case of a random population sample, it does not when case–control sampling is used: the oversampling of cases inflates the right tail of the liability score distribution, resulting in a nonnormal distribution of the liability score in the study (Fig. 1*C*).

Lee et al. (10) acknowledge the nonnormality induced by case–control sampling and propose the following analog to Eq. **7** to account for this issue in transforming REML estimates from the observed to the liability scale:*K* and *P* are the prevalence of the disease in the population and the study respectively, and **7**, accounting for the nonnormality of the liability caused by the oversampling of cases. In a random sampling scheme, we have **8** boils down to Eq. **7** as expected.

### Problem 2: Case–Control Sampling Causes “Induced” GxE Interactions.

Although Lee et al. (10) consider the issue of the nonnormality of the liability score induced by the case–control sampling, their analysis misses a subtle but important problem. It turns out that case–control sampling creates an induced positive correlation between the genetic and environmental effects for the samples in the study, as can be seen in Fig. 1*C*. Although there is no GxE interaction in the population, there is an obvious interaction between *g* and *e* under case–control sampling.

### REML Underestimates the Heritability Explained.

Just as it was demonstrated in ref. 2 that the presence of GxG interactions leads to underestimation of the fraction of heritability explained, we suspected that the presence of such induced GxE interactions might result in underestimation of the heritability. To test this idea, we simulated the entire generative process of a case–control study of a disease.

In each simulation run, we generated data for millions of individuals using a liability threshold model corresponding to the desired population prevalence *K*. For each individual, we generated genotypes at 10,000 independent loci, with minor allele frequencies between 0.05 and 0.5. The effect of each SNP was drawn from a normal distribution, as in Yang et al. (9). The liability was computed according to the polygenic model of Eq. **1**, with the variance of the genetic effect set to achieve the desired heritability. Each individual then was classified as affected or unaffected according to the appropriate threshold. Finally, all affected individuals were sampled for the study as cases, whereas unaffected individuals were chosen with a probability set to achieve the desired proportion of cases *P* in expectation. This process was repeated until 4,000 individuals (with the desired proportion of cases) were collected for the study. We note that the choice of simulating SNPs at linkage equilibrium was motivated by a theoretical result from Patterson et al. (22), which shows that the resulting distribution of correlation matrices is equivalent to the distribution obtained from a larger number of SNPs in linkage disequilibrium.

Our simulations confirm our expectation: heritability estimates obtained by applying REML and correcting using Eq. **8** are strongly downward biased (Fig. 2*A*). The magnitude of the bias increases when (*i*) the disease is rarer, (*ii*) the proportion of cases is closer to half, and (*iii*) the heritability is higher. Indeed, these circumstances each increase the induced GxE interaction.

To illustrate the magnitude of the bias, consider a situation representative of a balanced case–control study of a disease in which true underlying heritability due to common SNPs is 50%. If the prevalence is 0.1% (comparable to the frequencies of MS or Crohn’s disease), then the REML method yields an estimated heritability of only 29.4% on average. The bias decreases for more common diseases. For prevalences of 0.5%, 1%, 5%, and 10%, the expected heritability estimate in a balanced case–control study is 34.9%, 37.9%, 43.7%, and 48.1%, respectively.

In addition to the factors described above, the bias of REML estimates depends on other factors—most importantly, it increases with the study size. To illustrate the effect of study size on the bias of REML estimates, we simulated case–control studies with a varying number of individuals (between 2,000 and 8,000) but kept all other parameters constant (we simulated 10,000 SNPs in linkage equilibrium, prevalence of disease in the population was set to 1%, average proportion of cases in the study was 30%, and the heritability was set to 50%). The average REML estimates of heritability decreased with study size, from 43.6% (SE: 0.7%) with 2,000 individuals to 35.3% (SE: 0.2%) with 8,000 individuals (Fig. 2*B*).

We note that Lee et al. (10) tested their method using simulations, which appeared to confirm that the REML method provides unbiased estimates. However, these simulations did not explicitly simulate genotypes. Instead, they proceeded as follows: they (*i*) generated individuals in batches of 100; (*ii*) assigned genetic correlations to all pairs of individuals, by assuming that individuals in the same batch have correlation 0.05 and individuals in separate batches have correlation 0; and (*iii*) simulated phenotypes, by generating liabilities according to Eq. **6** and comparing them to the threshold *t*. All affected individuals were retained for the simulation, together with an equal number of randomly selected unaffected individuals. The problem with this simulation scheme is that most batches contain few cases—indeed, often zero or one for a disease with low frequencies. Because there were few cases within each batch, the genetic correlation between most case pairs, most control pairs, and most case–control pairs was exactly 0 (with a small number of nonzero correlations set at 0.05). In some simulation scenarios, up to 99.98% of the correlations were set to 0. In short, the simulation did not resemble actual genetic correlations. Our simulations above overcome this problem by explicitly simulating genotypes (*SI Appendix*, section 5).

### PCGC Regression as a General Method for Estimating Heritability.

Given the serious bias in REML estimates for case–control studies, we sought to develop an alternative approach based on PCGC regression. Because PCGC regression estimates are moment estimators, they (in contrast to maximum-likelihood estimators) do not require assuming an entire probabilistic model to obtain unbiased estimates; this is a useful feature in situations in which the actual probabilistic setup is complex.

PCGC regression is based on the simple idea that the heritability of a trait controls the strength of the relationship between genotype and phenotype. In the general case, the relationship among genetic correlation (*f* depends on (*i*) the design of the study and (*ii*) the properties of the phenotype. Given the function *f*, we can estimate *i*) the individuals comprise a random sample from the population and (*ii*) the phenotype is an additive polygenic quantitative trait, in which case we have

PCGC regression may be extended to other study designs, although explicit expressions for *f* are harder to obtain. However, when focusing on studies of largely unrelated individuals, the values of *f* can be approximated by a Taylor series at *g* and *e* are normally distributed, we show (*SI Appendix*, section 1.3) that *c* that depends on the properties of the phenotype and the study, but not on the heritability.

In this situation, we can use linear regression to estimate *c*. We provide step-by-step calculations for determining *c* under several relevant study designs (*SI Appendix*, sections 1–4).

As an example, consider the situation of case–control studies, as above. We show (*SI Appendix*, section 1.3) that the value of *c* is given by**6**. [Notably, the REML correction is derived by Lee et al. (10) to correct for the problem of nonnormality (problem 1 above), whereas the use of regression in combination with this *c* addresses both the marginal nonnormality and the induced GxE correlations (problem 2) simultaneously.] As a test, we applied PCGC regression to the simulated case–control data. The results (Fig. 2*A*) confirmed that the approach indeed yields estimates that are unbiased and considerably more accurate than those achieved by the method of Lee et al. (10). We tested PCGC regression across many scenarios: simulating a wide range of disease prevalence (*SI Appendix*, Figs. S5 and S6), populations with cryptic relatedness (*SI Appendix*, Fig. S7); different numbers of SNPs (*SI Appendix*, Fig. S8), increasing study sizes (Fig. 2*B*), and alternative polygenic architectures (*SI Appendix*, Figs. S11 and S12); and estimating heritability in the presence of additional covariates (see below for more details, or see *SI Appendix*, section 5.6). In all scenarios, PCGC regression yielded unbiased estimates of heritability.

### Estimating the Heritability Due to Common SNPs.

So far, we have discussed PCGC regression as a general method for estimating heritability. Its input is a matrix

The commonly used estimate of the genetic correlation is the empirical correlation, computed over the set *A* of variants genotyped or sequenced:*SI Appendix*, section 7). We address alternative approaches in *Discussion*.

### Applying PCGC Regression to Case–Control Studies of Disease.

We applied PCGC regression to six case-control studies of disease: the WTCCC studies of Crohn’s disease, bipolar disorder, and type 1 diabetes investigated by Lee et al. (10); studies of MS and schizophrenia to which the same REML methodology was applied (11, 24, 25); and a study of early-onset MI (26). Where necessary, we applied stringent quality control, as suggested by Lee et al. (10), to mitigate batch effects (*SI Appendix*, section 10). We included sex as a covariate (*SI Appendix*, section 2) and removed the top 10 principal components of the correlation matrix to control for population structure (*SI Appendix*, section 8). To address the problem of imperfect LD between causal SNPs and genotyped SNPs (which, as noted above, results in underestimation of heritability), we applied the correction proposed by Yang et al. (9) (*SI Appendix*, section 7); we note that this correction also was applied by Lee et al. (10) for their REML estimates.

As expected, PCGC regression estimates were higher than REML estimates for all six diseases (Table 1). Specifically, the estimated heritability attributable to common variants increased from 39.3% to 47% for bipolar disorder, from 20.2% to 24.6% for Crohn’s disease, from 14.6% to 16.3% for type 1 diabetes (excluding chromosome 6 from the analysis), from 38.2% to 42.1% for schizophrenia, and from 33.3% to 38.2% for MI. The most notable increase was for the least frequent disease: for MS, the estimated heritability attributable to common variants increased from 33.5% to 45.3%.

We can infer the proportion of the overall heritability attributable to common variants by dividing these values by published estimates of the total heritability (derived from population studies of the phenotypic correlations among relatives; Table 1). Mindful of the considerable uncertainty in these published estimates, we estimate that the proportion of the overall heritability attributable to common variants for these diseases ranges from 41% to 68% (mean 60%).

### Extending PCGC Regression to Incorporate Covariates in Heritability Estimation.

PCGC regression similarly can deal with other important situations. Although genetic and environmental effects often are assumed to represent the sum of many small effects and thus to be distributed normally, some specific effects may be very large. For example, men are considerably taller than women, the prevalence of Alzheimer’s disease increases sharply with age, and lung cancer is more prevalent in smokers than nonsmokers. Such covariates (sex, age, smoking) also are referred to as “fixed effects” and must be accounted for in attempting to estimate the heritability due to common variants.

In the case of a randomly sampled continuous phenotype, REML methodology can be naturally extended to account for additional covariates, and this extension indeed is implemented in the popular GCTA software (27). In the scenario of case–control studies, Lee et al. (10) continue to treat the phenotype as quantitative, apply the extended REML approach to account for fixed effects, and use their correction (Eq. **6**) to transform the resulting estimates of heritability to the liability scale. However, the presence of fixed effects only aggravates the problems arising from case–control sampling, and as a result, the heritability estimates obtained in this manner are even more biased. The increased bias is seen in our simulations (Fig. 2*C* and *SI Appendix*, Fig. S13) and also is supported by theoretical arguments (28).

By contrast, PCGC regression can be extended readily to account for covariates while still yielding unbiased estimates. This is done by replacing the constant *c* from Eq. **7** with a set of constants *i* and *j*. Deriving the specific values of *SI Appendix*, section 2). Briefly, we let *i*, which includes all relevant additional data (smoking habits, sex, age, and so on). Rather than having studywide parameters denoting (*i*) the fraction *K* of cases in the population, (*ii*) the liability threshold *t*, and (*iii*) the probability *P* that an individual in the study is affected, we have individual-specific parameters, where each individual has (*i*) an individual-specific probability *ii*) a corresponding individual-specific liability threshold *iii*) a corresponding individual-specific probability

Using these definitions, we show (*SI Appendix*, section 2.2) that*C*).

The relatively complex formulas for PCGC regression in the fixed-effects scenario shed light on a key difference between REML-based methods and PCGC regression. When no fixed effects are involved, both the REML method of Lee et al. (10) and PCGC regression appear rather similar: both may be viewed as a two-step procedure, in which the first step entails applying a simple recipe appropriate for quantitative traits to a disease trait coded as 0/1 and the second step entails applying a “correction” to the result, to account for issues overlooked in the first step.

Although Lee et al. (10) use the same approach when fixed effects are involved (i.e., they apply a standard REML procedure and an ex post facto correction), PCGC regression no longer may be viewed in this manner. Accounting for the fixed effects is an intrinsic part of the estimation process, rather than an ex post correction. We no longer regress the product of the phenotypes onto

Thus, although REML-based methods and PCGC regression may appear similar, they actually are fundamentally different. PCGC regression constructs the estimates from first principles, which is why its estimates are unbiased in the presence of both case–control sampling and covariates.

### Applying PCGC Regression to Extreme-Phenotype Studies of Quantitative Traits.

PCGC regression also can be applied to other study designs in which unbiased REML estimates are not available because of the complexity of the situation. A particularly important design is an “extreme-phenotypes” study, in which only individuals with extremely high or low values of the phenotype are selected for genotyping. Lander and Botstein (29) showed that such designs are efficient for genetic mapping, because most of the power resides in individuals with extreme phenotypes.

Extreme-phenotype studies pose challenges similar to those of case–control studies for REML-based estimation of heritability, because the assumptions of the normality and GxE independence no longer hold (*SI Appendix*, Fig. S15). As a result, REML estimates will be biased.

Our PCGC regression framework can deal with extreme-phenotype studies in a fashion analogous to that of case–control studies. For example, when sampling individuals whose phenotypes are either below the *K*th quantile, or above the *c* is given by (*SI Appendix*, section 4)*SI Appendix*, section 4). Simulations confirm that the PCGC estimates are unbiased (*SI Appendix*, Fig. S16).

To study the potential benefit of extreme-phenotype studies, we used simulations to measure the improvement in the accuracy of heritability estimates obtained by extreme-phenotype sampling (*SI Appendix*, section 4). We find, for example, that the accuracy obtained by randomly sampling *N* individuals from the population can be achieved by sampling approximately *N*/8 individuals from each of the top and bottom deciles (*SI Appendix*, Fig. S17). We observe similar accuracy benefits when comparing the SEs of heritability estimates of HDL levels in GWASs based on random-sampling vs. GWASs based on extreme-phenotype sampling (*SI Appendix*, section 10.3).

## Discussion

The genetic architecture of most common traits and diseases is complex, involving contributions from genetic variants at many genes and, potentially, interactions among them. These genetic variants likely span the range of allele frequencies, from common variants in the population to rare variants present at extremely low frequencies.

CVASs (typically called GWASs) of traits and diseases already have uncovered numerous statistically significant associations with common variants at individual genetic loci. However, statistical analyses suggest that many more associated variants lurk below the surface—falling short of statistical significance because of inadequate sample sizes. Accordingly, it would be valuable to have reliable methods to infer the overall contribution of common variants without the need to identify each individual locus.

### REML Methods for Estimating Heritability.

Visscher and coworkers (9) pioneered the random-effects approach for using CVAS data to estimate the narrow-sense heritability of a quantitative trait due to common variants. By modeling effect sizes as random variables, this method elegantly circumvents the need to estimate the effect size of each SNP. Instead, it focuses on estimating the overall heritability due to the entire set of common variants. Applying this REML methodology to a wide range of quantitative phenotypes indicates that for many phenotypes, common genetic variants account for a substantial portion of the overall heritability—much more than the portion explained by the individual loci that so far have attained genome-wide significance.

Visscher and coworkers (10) subsequently sought to extend the REML methodology to disease phenotypes. Disease studies typically involve case–control designs, wherein cases are considerably oversampled relative to their frequency in the population; this sampling design violates various assumes of the REML framework. Lee et al. (10) attempted to account for these issues by applying a post hoc correction. However, as demonstrated by our extensive simulations (and by actual genetic studies of six diseases), their method yields strongly downwardly biased estimates of the heritability due to common variants in a variety of interesting and relevant scenarios. The bias depends on properties of the disease, including the prevalence in the population and the true underlying heritability. Most troublingly, the bias increases with study size and with the proportion of cases in the sample. (The bias also depends on the number of SNPs actually genotyped, although this is of secondary importance.) We conclude that REML methods do not provide a suitable framework for estimating heritability for disease phenotypes.

### PCGC Regression.

To solve this problem, we developed PCGC regression, which provides a powerful framework for estimating the heritability due to common variants in a wide range of scenarios. Extensive simulations show that PCGC regression yields heritability estimates that are unbiased (as expected mathematically) and more accurate than the REML approach, when applied to case–control studies.

PCGC regression is a general framework for heritability estimation. In the case of an unascertained quantitative phenotype, it boils down to the well-known regression method of Haseman and Elston. When dealing with case–control studies, PCGC regression allows unbiased estimation of heritability, even in the presence of covariates. As such, it subsumes several recent anecdotal observations (14, 15), provides a theoretical foundation, and, importantly, allows for the incorporation of covariates. In addition to case–control studies, the PCGC regression framework can readily accommodate other complex study designs. One important application is extreme-phenotype studies for quantitative traits, which may be more cost-effective than random sampling studies for the purposes of identifying causal loci and estimating heritability.

One limitation of PCGC regression is that it is based on a first-order approximation of the relationship between phenotypic similarity and genetic correlation. This approximation is expected to be accurate when individuals in the study largely are unrelated. However, in populations with a high degree of cryptic relatedness, this might not be the case. In such cases, PCGC regression can be extended by using second-order, or higher-order, approximations (*SI Appendix*, section 1.4). We note, however, that for the WTCCC population, using first- or second-order approximations yielded very little difference.

### Design of Simulations.

Our paper highlights a critical issue concerning the design of genetic simulations. Lee et al. (10) did not observe the serious bias inherent in REML methods for case–control studies because they used a highly incomplete simulation. In particular, they did not explicitly generate genotypes and phenotypes for each individual, but rather arbitrarily assigned genetic-correlation values to pairs of individuals. This shortcut eliminated critical correlations between genotype and phenotype. By contrast, our simulations use a generative approach: they explicitly (*i*) assign effect sizes to each variant, (*ii*) generate genotypes for each individual, and (*iii*) ascertain cases and controls based on phenotype, by selecting or rejecting individuals. These simulations readily revealed the large downward bias in REML estimates.

One issue with the generative approach is the considerable running time of each simulation, which greatly limits the possible number of individuals and SNPs that can be simulated, as well as the prevalence of the disease simulated. To overcome this problem, we developed a dynamic programming approach, allowing direct sampling of genotypes of cases (*SI Appendix*, section 5.8). We implemented this approach in a software package called simCC, which is freely available from our website. Simulations using simCC are considerably faster; therefore, we could expand our simulations from 10,000 SNPs to 100,000 SNPs, with qualitatively similar results (*SI Appendix*, Fig. S18).

We note that our simulations fail to be fully realistic because they do not model linkage disequilibrium among variants. Although the mathematical result by ref. 22 means that this limitation has little impact on the results in the context of heritability estimation, it is interesting to ask how linkage disequilibrium might be included for the benefit of other endeavors. Several authors have used an approach that takes actual genotypes from an existing study (e.g., WTCCC), assigns effect sizes, and then calculates phenotypes (e.g., refs. 13, 23, 30). However, as currently implemented, this approach has the serious flaw that it does not impose selection to obtain cases and controls and, accordingly, cannot yield a realistic correlation structure for a case–control study (such as the striking correlation effects in Fig. 1*C*). One could solve this problem by using a vastly larger collection of actual genotypes, such that one could obtain cases by imposing stringent selection (i.e., discarding 90–99% of samples, depending on the disease frequency); however, current datasets are too small for his approach. Alternatively, it may be possible to use programs such as HapGen (31) to simulate realistic genotypes from a smaller sample of actual genotypes. Whether this approach is feasible at the required scale is a topic for further study.

### Improved Estimates of Genetic Correlation.

We demonstrate that PCGC regression yields unbiased heritability estimates, given knowledge of the genetic correlation matrix *G.* However, *G* is unknown, and estimates of *G* are used by most heritability estimation methods. As pointed out by Yang et al. (9), replacing the true value of *G* with a noisy estimate results in underestimation of the heritability. This effect is not unique to heritability estimation and is known as “diluted” regression (or “errors-in-variables”): when regressing a dependent variable *y* onto a noisy measurement of an explanatory variable *x*, the estimated slope is attenuated. It is important to stress that this bias is not a result of the heritability estimation method, but rather to the use of noisy estimates of genetic correlations instead of the true correlations.

Yang et al. (9) overcome this problem by quantifying the bias via simulations, and correcting for it post hoc. This eliminates the bias but increases the variance of the estimate. We adopt their method when estimating the heritability due to common SNPs using real data.

Several recent papers (13, 21, 30, 32⇓⇓–35) focused on improving the estimation of the genetic correlation matrices, by accounting for sparsity (33, 34), LD and LD-dependent genetic architecture (30), prior information regarding effect sizes of different SNPs (35), or the relationship between minor allele frequency and effect size (30). As better methods for estimating the genetic correlation matrix are developed, they may be used in PCGC regression in place of the correction method of Yang et al. (9). More accurate estimates of the genetic correlation matrix are expected to increase the estimated proportion of heritability due to common SNPs even further.

### Deviating from the Probabilistic Assumptions of the Model.

A key assumption throughout most works addressing the problem of heritability estimation in general, and estimating heritability using case–control GWAS in particular, is that the genetic and environmental effects follow a normal distribution. The assumption of normality typically is justified by the central limit theorem, as both the genetic and environmental effects are assumed to be the sum of many small contributions, resulting in an approximately normal distribution.

These assumptions might prove invalid when some covariates have a considerable effect on the phenotype: for example, the effect of sex on height or coronary artery disease. When the problematic covariate is known and observed, it can easily be accounted for directly by including it as a fixed effect in the model. However, not all important covariates may be known or observed. When the phenotype is quantitative, normality assumptions can be tested; therefore, situations in which the normality assumption is invalid can, at the very least, be identified. On the other hand, for disease phenotypes, the liability is unobserved, and it is unclear how to test the normality assumptions, as well as what the effect of deviating from these assumptions would be. This question remains to be addressed in future work.

We assume throughout that the underlying genetic architecture is additive. However, one common speculation is that GxG interactions play a considerable role in human disease (36). In this case, one still might attempt to estimate the additive heritability explained by common SNPs by means described here. However, the proper interpretation of the result might depend on the exact genetic architecture.

Another key assumption is that the selection of individuals for the study is guided only by their phenotypes. Although this often is the situation with case–control designs, other, more complicated selection schemes exist. Two important examples are covariate-driven sampling [in which the selection is guided by both the phenotype and by a risk factor of interest, e.g., type 2 diabetes patients with a low body mass index (BMI) and controls with a high BMI (37)] and case–control matching (in which, for each case in the study, an effort is made to recruit a control with similar characteristics). Such designs pose an interesting challenge for future research.

### Role of Common Variants in Common Disease.

Beyond the methodological results in this paper, our key biological finding is that the heritability of disease traits attributable to common genetic variants is even higher than current estimates. For the diseases analyzed above, the heritability estimated by PCGC regression is 9–34% (mean 19%) higher than the estimates produced by REML.

The phenotypic variance attributable to common variants is 25%, 38%, 42.1%, 45%, and 47% for Crohn’s disease, early-onset MI, schizophrenia, MS, and bipolar disorder, respectively. For type 1 diabetes, our estimate is 16.3% when we exclude the large effects due to the major histocompatibility complex (MHC), and 51.3% when we include it. [The contribution of the MHC can be estimated from family studies (38, 39), as well as by considering the contribution of the individual common variants as fixed effects (*SI Appendix*, section 11).]

We can estimate the proportion of heritability explained by common variants by dividing (*i*) the proportion of phenotypic variance explained by common variants by (*ii*) the proportion of the phenotypic variance explained by additive genetic factors—that is, the total heritability—based on phenotypic similarity among relatives (e.g., monozygotic and dizygotic twins). We note that estimates of the total heritability involve substantial uncertainty [due to small study size and potential artifacts resulting from the underlying genetic architecture (2)] and may vary considerably across studies; where multiple estimates had been reported, we used the largest value. The proportions are 41%, 58%, 60%, 64%, 66%, 66%, and 68% for Crohn’s disease, type 1 diabetes, MS, schizophrenia, bipolar disorder, and early-onset MI, respectively, with a mean value of 60% (Table 1).

Our improved estimates still may underestimate the true proportion of heritability explained by common variants for three reasons. First, we calculated the proportion of heritability explained by using the largest estimate of the total heritability, when multiple estimates had been reported; this yields a conservative estimate. Second, as noted above, uncertainty about the genetic correlation matrix decreases the heritability explained. Improved estimates of the genetic correlations likely will increase the estimated heritability explained by common SNPs. Third, our analysis assumes that the contributions of variants are drawn from a uniform prior distribution, regardless of biological context. Instead, variants might be categorized based on biological annotation (e.g., those within or near coding regions, regulatory elements, and so on). A distinct correlation matrix might be estimated for each set and the matrices used simultaneously for heritability estimation, a task that can be accommodated readily by PCGC regression (*SI Appendix*, section 3). A refined correlation structure should provide a better model of the genetic architecture of disease, and thus would be expected to yield higher estimates of the heritability explained by common variants, as well as to provide useful scientific insights.

Our results suggest that larger CVASs will identify many additional common variants related to common diseases, although many additional common variants likely still will have effect sizes that fall below the limits of detection given practically achievable sample sizes. Still, common variants clearly will not explain all heritability. As discussed in the first two papers in this series (2, 3), rare genetic variants and genetic interactions likely will make important contributions as well. Fortunately, advances in DNA sequencing technology should make it possible in the coming years to carry out comprehensive studies of both common and rare genetic variants in tens (and possibly hundreds) of thousands of cases and controls, resulting in a fuller picture of the genetic architecture of common diseases.

## Acknowledgments

We are grateful to the Wellcome Trust Case Control Consortium for making data on multiple diseases available (www.wtccc.org.uk), the ANZgene Consortium for sharing the MS data, Sekar Kathiresan for sharing the MI data, and Dan Rader for sharing the HDL data. D.G. was supported by a Colton fellowship and a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. E.S.L. was supported in part by NIH HG003067 and by funds from the Broad Institute. S.R. was supported in part by Israeli Science Foundation Grant 1487/12.

## Footnotes

↵

^{1}Present address: Department of Genetics, Stanford University, Stanford, CA 94305.- ↵
^{2}To whom correspondence may be addressed. Email: lander{at}broadinstitute.org or saharon{at}post.tau.ac.il.

Author contributions: D.G., E.S.L., and S.R. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1419064111/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵
- ↵.
- Zuk O,
- Hechter E,
- Sunyaev SR,
- Lander ES

- ↵.
- Zuk O, et al.

- ↵.
- Welter D, et al.

- ↵
- ↵
- ↵.
- Lettre G, et al., Diabetes Genetics Initiative; FUSION; KORA; Prostate, Lung Colorectal and Ovarian Cancer Screening Trial; Nurses’ Health Study; SardiNIA

- ↵
- ↵
- ↵
- ↵.
- Lee SH, et al., ANZGene Consortium; International Endogene Consortium; Genetic and Environmental Risk for Alzheimer’s disease Consortium

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Galton F

- ↵
- ↵.
- Dempster ER,
- Lerner IM

- ↵
- ↵
- ↵.
- Lee SH, et al., Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC-SCZ); International Schizophrenia Consortium (ISC); Molecular Genetics of Schizophrenia Collaboration (MGS)

- ↵
- ↵.
- Ripke S, et al., Multicenter Genetic Studies of Schizophrenia Consortium; Psychosis Endophenotypes International Consortium; Wellcome Trust Case Control Consortium 2

- ↵
- ↵
- ↵
- ↵.
- Lander ES,
- Botstein D

- ↵
- ↵
- ↵
- ↵.
- Golan D,
- Rosset S

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

## Article Classifications

- Biological Sciences
- Genetics

### See related content:

- Downward bias in heritability estimates is real- Sep 28, 2015