## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Linear mixed model for heritability estimation that explicitly addresses environmental variation

Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved April 8, 2016 (received for review July 9, 2015)

## Abstract

The linear mixed model (LMM) is now routinely used to estimate heritability. Unfortunately, as we demonstrate, LMM estimates of heritability can be inflated when using a standard model. To help reduce this inflation, we used a more general LMM with two random effects—one based on genomic variants and one based on easily measured spatial location as a proxy for environmental effects. We investigated this approach with simulated data and with data from a Uganda cohort of 4,778 individuals for 34 phenotypes including anthropometric indices, blood factors, glycemic control, blood pressure, lipid tests, and liver function tests. For the genomic random effect, we used identity-by-descent estimates from accurately phased genome-wide data. For the environmental random effect, we constructed a covariance matrix based on a Gaussian radial basis function. Across the simulated and Ugandan data, narrow-sense heritability estimates were lower using the more general model. Thus, our approach addresses, in part, the issue of “missing heritability” in the sense that much of the heritability previously thought to be missing was fictional. Software is available at https://github.com/MicrosoftGenomics/FaST-LMM.

- heritability estimation
- linear mixed model
- environment
- Gaussian radial basis function
- model misspecification

An important causal question comes from the age-old debate about nature versus nurture. For any phenotype such as height or intelligence quotient, how much of the phenotype is inherited and how much is determined by environment? This question was made precise by Fisher (1) and Wright (2) almost a century ago: Given observations of a phenotype from a population of individuals, what is the fraction of variance of the phenotype that is caused by inherited factors relative to the total variance of the phenotype due to both inherited and environmental factors? This fraction, termed “heritability,” has been the subject of intense study across various phenotypes and populations since it was defined. Note that, in contrast to how some interpret the informal question around the nature-versus-nurture debate, heritability is not an absolute quantity but rather a quantity relative to a given population. For example, a phenotype in a population where environmental factors have large variation will have a smaller heritability than in an otherwise similar population where environmental factors have a small variation.

Over the years, many approaches have been developed to estimate heritability from data (3, 4). Here, we concentrate on an approach made possible by the recent ability to sequence genomes at a modest cost (5, 6). The approach uses a linear mixed model (LMM), a form of multivariate regression of the genomic and environmental factors on the phenotype, which we examine in detail in the next section.

In the standard LMM approach, the effects of environmental factors on the phenotype are modeled as noise. Specifically, the phenotype of each individual is assumed to be the sum of two random effects, one based on genomic factors and one based on environmental factors, where the latter is assumed to be mutually independent across individuals. As we shall see, this model for environmental effects can lead to inflated estimates of heritability.

To avoid this inflation, we could measure and model environmental effect explicitly (e.g., ref. 7). Unfortunately, in most circumstances there are many environmental factors to be measured. Furthermore, some environmental factors may be unrecognized and consequently are unmeasurable. In this work, we investigate the use of an easy-to-measure surrogate for environmental factors—namely, spatial location. We show how this surrogate can be incorporated into the LMM as an additional random effect. We investigate our more general model with simulated data and with data from a Ugandan cohort of about 5,000 individuals.

## Results

### Heritability Estimation.

First, let us consider a standard approach for estimating heritability using an LMM (6, 8). The estimate is based on observations consisting of ** y**, an

*N*individuals, and

**X,**an

*N*individuals and

*M*variants. Note that it is customary to normalize the causal variants so that each one has a mean of zero and an SD of one across individuals. Given these observations, we model

**as a multivariate linear regression on**

*y***X:**

**2**into Eq.

**1**and integrating out

**3**is known as a linear mixed model with a random effect having the covariance matrix

*i,j*of

Narrow sense heritability, denoted **3**. The term “heritability” without the modifier “narrow sense” is typically reserved for the quantity that includes all genomic effects. Herein, for simplicity, we will concentrate on the estimation of narrow-sense heritability although, as we mention later, our approach can be extended to estimate more general quantities.

In practice, we do not know which genomic variants are causal, so we use an approximation for *i,j*) is the fraction of the genome shared identical by descent (IBD) among individuals *i* and *j* (6). That is, we use the model*i* and *j*, and *α* is a scaling parameter. Intuitively, the inclusion of *α* can be thought of as the spatial range of the environmental effect. The larger the value for *α*, the larger the range or extent of the effect. As in the standard case, we fit all parameters, now including *α,* with REML.

Recall that the standard LMM follows from modeling the phenotype as a regression on genomic variants. Similarly, Eq. **5** can be interpreted as the result of modeling the phenotype as a multivariate regression on both genomic variants and spatial location. In particular, Mercer’s theorem (10, 11) states that, if **K(***z*_{i},*z*_{j}**)** is a continuous symmetric positive semidefinite function to *z*_{i} and *z*_{j} each in a compact Hausdorff space, then there exists a set of functions *ϕ*_{k}(*z*), *k* = 1,…,∞, such that **K(***z*_{i},*z*_{j}**)** is equal to the dot product *z*_{i} as the spatial location of individual *i* and **K(***z*_{i},*z*_{j}**)** as element *i,j* in **5** is equivalent to conditioning on spatial features *ϕ*_{k}(*z*_{i}), *i* = 1,…,*N*, *k* = 1,…,∞. We note that the Gaussian radial basis function is guaranteed to be positive semidefinite.

Finally, let us consider nonlinear interactions between genomic and environmental components. We can model some of these interactions by introducing a third random effect to the LMM:*i,j* is the product of elements *i,j* from *i,j* of**X**_{ik} *ϕ*_{l}(*z*_{i}), *i* = 1,…,*N*, *k* = 1,…,*M*, *l* = 1,…,∞. Product features such as these are often used to model nonlinear interactions (12). In our analysis of the Ugandan cohort, we use this instance of

We note that the standard model given by Eq. **3** is nested in the model given by Eq. **5**, which in turn is nested in the model given by Eq. **8**.

### Heritability Analysis on Simulated Data.

We first applied our approach to the analysis of simulated data. We generated from the Balding–Nichols model (13) with a 50:50 population ratio, a baseline minor allele frequency (MAF) sampled uniformly from [0.05, 0.5], and a value for Wright’s *F*_{ST} equal to 0.1. We generated a spatial location for each individual by sampling randomly from one of two spherical Gaussian distributions with SD 625,000 and separation between Gaussian centers equal to 4 × 625,000. This procedure produced a distribution of spatial locations similar to the real data and satisfied an assumption underlying Eqs. **6** and **7** that genomic variants and spatial locations are independent. We next generated the phenotype using Eq. **5** with *α* = 4 × 625,000. We generated 50 datasets and then, for each one, computed uncorrected and corrected heritability estimates, based on Eqs. **3** and **5**, respectively, with

The estimates of

As mentioned, Eqs. **6** and **7** were derived under an assumption that genomic and spatial factors are independent. In practice, however, this assumption may not hold. To investigate the robustness of these estimates to nonindependence, we modified the above data-generation procedure to create a dependence between genomic and spatial variation. In particular, the spatial locations of all individuals from the same Balding–Nichols population were drawn from the same spherical Gaussian. Despite this relatively strong dependence, estimates of

Sample code for these experiments in the form of an iPython notebook can be found in *SI Appendix*.

### Heritability Analysis on the Ugandan Cohort.

We next applied our approach to an analysis of a Ugandan cohort (*Methods*) across 34 phenotypes including anthropometric indices, blood factors, glycemic control, blood pressure, lipid tests, and liver function tests. A description of phenotypes can be found in Table 1. Not unexpectedly, heritability estimates varied widely, with corrected estimates ranging from 0.55 for mean platelet volume (MPV) to 0.10 for levels of alkaline phosphatase (Table 2).

Consistent with our studies on simulated data, uncorrected heritability estimates were inflated. The inflation was significant for 14 of the 34 phenotypes (Fig. 1). In addition, 23 phenotypes had a value for

For each phenotype, we were also able to determine the geographical range of the environmental effect (i.e., the optimized value of the scaling parameter *α*), which varied by more than three orders of magnitude across the phenotypes (Table 2).

Interestingly, corrections were most substantial for anthropometric indices, lipid tests, and measures of liver function. This pattern, which may or may not be real, is under investigation. We are also working to identify specific environmental effects responsible for the large heritability corrections. For the phenotype mean corpuscular hemoglobin concentration (MCHC), which had the largest correction, we ruled out several factors as substantial sources of environmental effects. In particular, elevation was excluded because the terrain of the study is essentially flat. Also, the heritability corrections were about the same for males and females, which we would not expect if iron was a contributor. In addition, heritability correction remained high when MCHC was adjusted for primary occupation and alcohol consumption. On examining this spatial effect in more detail, we found that MCHC varied substantially from one village to the next. Previously, we have shown substantial variation in urbanicity indicators among these villages (14), consistent with this observation. We further explored the possibility of this spatial variation arising because villages were sampled in batches over time. We observed significant improvement in the model on inclusion of sampling date as a covariate. Nonetheless, even with this inclusion, the ratio of uncorrected to corrected

Finally, we estimated the variance of the phenotypes due to interactions between genomic and environmental components, fitting the three random effects corresponding to

## Discussion

We have introduced an LMM approach that includes explicit representation of spatial location in estimates of narrow-sense heritability. Spatial location is presumably a surrogate for (some) environmental effects and, unlike many environmental variables, is easy to measure. On simulated data, we have shown that estimates of heritability based on the more general model seem to be unbiased, whereas the estimates based on the standard model are inflated for some phenotypes. Similarly, in an analysis of 34 phenotypes in a Ugandan cohort, we have found that the uncorrected estimates of heritability based on the standard model are inflated relative to corrected estimates based on the new model. Furthermore, on simulated data, we have shown that the degree of bias is not substantially influenced by the absence or presence of dependence between genomic and environmental factors. Overall, we have demonstrated that estimates of heritability can depend on the nature of environmental variation.

The corrections were substantial, emphasizing the importance of explicitly modeling environmental effects in the estimation of heritability. Furthermore, the amount of inflation varied considerably across the 34 phenotypes, being the greatest for anthropometric indices, lipid tests, and measures of liver function. Presumably in this study, and perhaps in others, spatially related environmental factors affect these phenotypes more. A better understanding of differential bias across traits will require further exploration in cohorts with the relevant environmental phenotypic data.

Our approach has been applied only to the analysis of simulated data and the Ugandan cohort. Nonetheless, if the inflation seen here is typical, then this work offers a new interpretation of results of genome-wide association studies (GWAS). In particular, GWAS studies have so far revealed consistent “missing heritability,” where the variability explained by SNPs identified as associated with a phenotype has been far less than the variability identified in heritability estimates (15). This work suggests that much of this missing heritability was not missing in the first place.

An important lesson from this work is that model misspecification can lead to substantial bias in heritability estimates. Because our use of the Gaussian radial basis function to quantify spatial similarity is itself likely to be misspecified, our corrected estimates of heritability on the Ugandan cohort may remain biased to some degree. Thus, we are investigating alternative similarity functions and methods to select the best one based on data.

We are also investigating modifications to the model beyond the form of the similarity function. One potential modification is based on the well-known fact that narrow-sense heritability estimates will be inflated when the data contains closely related individuals due to effects of dominance and epistasis (3, 4). This inflation could be mitigated by including variance components reflecting IBD2 (both alleles shared) (e.g., see ref. 16) and epistasis (e.g., ref. 17). In addition, one could include a variance component based on whether individuals are in the same household or a variance component based whether individuals are in the same village. As another example, one could include a variance component based on social connectivity, which is known to affect various phenotypes, including obesity (18).

Finally, in addition to heritability estimation, LMMs are also commonly used for identifying associations between genomic variants and phenotypes (e.g., genome-wide association studies) and for prediction. The LMM models described in this work could be applied to these applications as well.

## Methods

We collected data for 5,000 individuals from nine ethnolinguistic groups from the General Population Cohort (GPC), Uganda (19). The GPC is a population-based open cohort study established in 1989 by the Medical Research Council in collaboration with the Uganda Virus Research Institute (UVRI) to examine trends in prevalence and incidence of HIV infection and their determinants. Samples were collected from individuals during a survey from the study area located in southwestern Uganda in Kyamulibwa subcounty of Kalungu district, ∼120 km from Entebbe town. The study area is divided into villages defined by administrative boundaries varying in size from 300 to 1,500 residents and includes families living within households. Data on health and lifestyle were collected using a standard individual questionnaire, blood samples obtained, and biophysical measurements taken, when necessary, as described previously (19). Spatial location was recorded in Global Positioning System coordinates. The measurements were translated and scaled to mitigate privacy concerns.

The GPC study was approved by the Uganda Virus Research Institute, Science and Ethics Committee (Ref. GC/127/10/10/25), the Uganda National Council for Science and Technology (Ref. HS 870), and the U.K. National Research Ethics Service, Research Ethics Committee (Ref. 11/H0305/5). Care was taken to obtain genuine informed consent from participants, including the use of reliable intermediaries as appropriate to ensure that the implications of participation were fully understood. Consent forms were translated from English into Luganda and checked for accuracy. The Lugandan translation was given to participants to read themselves, or was read out aloud to them by study staff. Participants could choose to consent to all, or just selected parts, of the survey. The informed consent of participants was obtained with a signature on the consent forms or a thumb print if the participant was unable to write. For participants aged 13–17 y, parental consent as well as child formal assent were collected. The immediate counter signature of a witness was then obtained. The APCDR committees are responsible for curation, storage, and sharing of the data under managed access. The genomic data have been deposited at the European Genome-phenome Archive (EGA, https://www.ebi.ac.uk/ega/) under accession number EGAS00001001558. Requests for access to phenotype data may be directed to data@apcdr.org.

We genotyped 5,000 samples from the Ugandan Survey on the Illumina HumanOmni 2.5M BeadChip array at the Wellcome Trust Sanger Institute. Sequenom quality control and gender checks were carried out before genotyping. A total of 2,314,174 autosomal and 55,208 X-chromosome markers were genotyped on the HumanOmni2.5–8 chip. Of these, 39,368 autosomal markers were excluded because they did not pass the quality thresholds for the SNP called proportion (<97%, 25,037 SNPs) and Hardy–Weinberg equilibrium (HWE) (*P* < 10^{−8}, 14,331 SNPs). HWE testing was only carried out on the founders for autosomes, and female unrelated individuals for the X chromosome defined by an IBD threshold <0.10 as estimated by PLINK. A total of 91 samples were dropped during sample quality control because they did not pass the quality thresholds for proportion of samples called (>97%) or heterozygosity (outliers: mean ± 3 SD), or the gender inferred from the X-chromosome data did not match the supplied gender. Three additional samples were dropped because of high relatedness (i.e., IBD >0.90). Principal component analysis was carried out on unrelated individuals projecting onto related individuals, for SNPs LD pruned at an r^{2} threshold of 0.2, with a MAF threshold of >5%. No samples were identified as population/ancestry outliers based on this analysis.

To generate the phased dataset, we first mapped pedigrees within our dataset based on relationships provided in the data. To detect any errors in these pedigrees, we ran KING (20) on each cohort and also used the results to identify any cryptic first-degree relationships that had not been mapped. We further removed pedigrees where age information was inconsistent with the pedigree specified. In addition to the quality control described, we also removed SNPs with a minor allele frequency in the founders less than 5%, or with more than 1% Mendelian errors. We set all remaining Mendelian errors to missing, as well as any genotypes flagged as unlikely by the detection algorithm Merlin (21). SNPs with more than 1% missingness were then removed. We phased this curated dataset of 1,340,101 SNPs using SHAPEIT2 (22), first phasing the samples ignoring family information, and then running a hidden Markov model on every parent–child duo. This procedure corrects phasing errors inconsistent with the pedigree structure, further improving phasing accuracy. We have previously shown this method produces highly accurate results in our cohort with negligible switch error rates (22). To construct

Phenotypes were transformed before analysis. Residuals were obtained following regression of the trait on age, age squared, and sex. Residuals were then inverse-normally transformed for analysis. For HbA1c, regression was carried out on age, age squared, sex, and month of sample collection (as an indicator variable) to account for seasonal trends in HbA1C that have been described previously (24).

Heritability estimation was performed with the FaST-LMM toolset available at https://github.com/MicrosoftGenomics/FaST-LMM. To determine a *P* value for the null hypothesis *P* value for the null hypothesis

## Acknowledgments

We thank Christoph Lippert for discussions about Mercer’s theorem, Noah Zaitlen for discussions about more general models for heritability estimation, Ashish Kapoor for discussions on how best to fit the scaling parameters for radial basis functions, and Johanna Riha for discussions regarding the sources of spatial variance for some phenotypes. We thank the African Partnership for Chronic Disease Research for providing a network to support this study as well as a repository for deposition of curated data. We also thank all study participants who contributed to this study and the National Institute of Health Research Cambridge Biomedical Research Centre for data collection and phenotype analysis. This work was funded by the Wellcome Trust, Wellcome Trust Sanger Institute Grant WT098051, Medical Research Council Grants G0901213-92157, G0801566, and MR/K013491/1, and the Medical Research Council/Uganda Virus Research Institute Uganda Research Unit on AIDS core funding.

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: heckerma{at}microsoft.com.

Author contributions: D.H., D.G., C.K., C.W., and M.S.S. designed research; D.H., D.G., C.K., C.P., T.C., H.M., K.E., R.N.N., G.S., A.K., P.K., C.W., and M.S.S. performed research; D.H., C.K., and C.W. contributed new reagents/analytic tools; D.H., D.G., C.K., C.W., and M.S.S. analyzed data; and D.H., D.G., C.K., C.W., and M.S.S. wrote the paper.

Conflict of interest statement: D.H., C.K., and C.W. were employees of Microsoft Research while performing this research.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Drawing Causal Inference from Big Data,” held March 26–27, 2015, at the National Academies of Sciences in Washington, DC. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/Big-data.

This article is a PNAS Direct Submission.

Data deposition: The genomic data have been deposited at the European Genome-phenome Archive (EGA, https://www.ebi.ac.uk/ega/) (accession no. EGAS00001001558).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510497113/-/DCSupplemental.

## References

- ↵
- ↵.
- Wright S

- ↵.
- Falconer DS,
- Mackay TFC

- ↵.
- Lynch M,
- Walsh B

- ↵National Genome Human Research Institute (2015) National Genome Human Research Institute. Available at https://www.genome.gov/..
- ↵
- ↵.
- Valdar W, et al.

- ↵
- ↵
- ↵.
- Bernhard S,
- Smola AJ

- ↵.
- Rasmussen CE,
- Williams CKI

- ↵.
- Cordell HJ

- ↵
- ↵
- ↵
- ↵
- ↵.
- Stern MP, et al.

- ↵
- ↵.
- Asiki G, et al., GPC team

- ↵.
- Manichaikul A, et al.

- ↵
- ↵
- ↵
- ↵.
- Tseng CL, et al.

## Citation Manager Formats

### More Articles of This Classification

### Related Content

- No related articles found.

### Cited by...

- Heritability of Vascular Structure and Function: A Parent-Child Study
- The Genomic Architecture of Interactions Between Natural Genetic Polymorphisms and Environments in Yeast Growth
- Leaf Growth Response to Mild Drought: Natural Variation in Arabidopsis Sheds Light on Trait Architecture
- Drawing causal inference from Big Data