## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Massively expedited genome-wide heritability analysis (MEGHA)

Edited by C. Thomas Caskey, Baylor College of Medicine, Houston, TX, and approved January 15, 2015 (received for review August 12, 2014)

## Significance

Practical tools for high-dimensional heritability-based screening are invaluable for prioritizing phenotypes for genetic studies with the dramatic expansion of available phenotypic data. Classical estimates of heritability require twin or pedigree data, which can be costly and difficult to acquire. Alternative methods based on whole-genome data from unrelated individuals exist but are computationally expensive. Here we present a novel, fast, and accurate statistical method for massively expedited genome-wide heritability analysis, making heritability-based prioritization of millions of phenotypes based on data from unrelated individuals tractable for the first time to our knowledge. We apply our method to large-scale heritability analyses of brain imaging measurements and demonstrate its potential for facilitating phenome-wide analyses and characterizing the genetic architecture of complex traits.

## Abstract

The discovery and prioritization of heritable phenotypes is a computational challenge in a variety of settings, including neuroimaging genetics and analyses of the vast phenotypic repositories in electronic health record systems and population-based biobanks. Classical estimates of heritability require twin or pedigree data, which can be costly and difficult to acquire. Genome-wide complex trait analysis is an alternative tool to compute heritability estimates from unrelated individuals, using genome-wide data that are increasingly ubiquitous, but is computationally demanding and becomes difficult to apply in evaluating very large numbers of phenotypes. Here we present a fast and accurate statistical method for high-dimensional heritability analysis using genome-wide SNP data from unrelated individuals, termed massively expedited genome-wide heritability analysis (MEGHA) and accompanying nonparametric sampling techniques that enable flexible inferences for arbitrary statistics of interest. MEGHA produces estimates and significance measures of heritability with several orders of magnitude less computational time than existing methods, making heritability-based prioritization of millions of phenotypes based on data from unrelated individuals tractable for the first time to our knowledge. As a demonstration of application, we conducted heritability analyses on global and local morphometric measurements derived from brain structural MRI scans, using genome-wide SNP data from 1,320 unrelated young healthy adults of non-Hispanic European ancestry. We also computed surface maps of heritability for cortical thickness measures and empirically localized cortical regions where thickness measures were significantly heritable. Our analyses demonstrate the unique capability of MEGHA for large-scale heritability-based screening and high-dimensional heritability profile construction.

In quantitative genetics, the variance of a phenotype is commonly attributed to genetic components, environmental factors, and their interactions (1). The proportion of phenotypic variance captured by total additive (allelic) genetic effects is conceptualized as narrow-sense heritability. With the rapid expansion of comprehensive phenotypic data, practical tools to estimate heritability are invaluable as they can be used to prioritize high-dimensional phenotypes for genetic studies.

Classical estimates of narrow-sense heritability require twin or pedigree data (2⇓–4), which can be costly and difficult to acquire. As genome-wide data became widely available, genome-wide complex trait analysis (GCTA) (5, 6) was developed, which assesses aggregate effects of common SNPs spanning the genome on phenotypes and thus provides an SNP-based heritability estimate, a lower bound for narrow-sense heritability. This method has been successfully applied to the heritability analyses of several complex traits and mental disorders (5, 7, 8) and has been used to investigate the puzzle of “missing heritability” (5, 9, 10). However, GCTA is a computationally expensive procedure. The use of a time-consuming iterative optimization procedure in the fitting of variance component models makes it prohibitive to use for evaluating very large numbers of phenotypes or with nonparametric sampling techniques, such as permutation tests. More practical and computationally efficient methods are needed to facilitate the identification of phenotypes that are most appropriate for genetic studies especially in instances where the complexity of the phenotype provides thousands or even millions of options.

Here we present a fast and accurate statistical method for heritability analysis using genome-wide SNP data from unrelated individuals, which we call massively expedited genome-wide heritability analysis (MEGHA). MEGHA largely falls in the kernel machines framework (11), which subsumes the GCTA model as a special case and uses a variance component score test (12), known as sequence kernel association test (SKAT) (13⇓–15), for efficient statistical inferences. MEGHA provides both magnitude estimates and significance measures of heritability with orders of magnitude less computational effort relative to GCTA, making it possible to analyze millions of phenotypes and develop sampling techniques that produce accurate inferences for arbitrary statistics of interest and accommodate complex correlation structures within phenotypic data.

As a demonstration of application, MEGHA was applied to brain structural MRI and genome-wide SNP data from 1,320 unrelated subjects, as part of the Harvard/Massachusetts General Hospital (MGH) Brain Genomics Superstruct Project (GSP) (16). Brain imaging data are a prototype case where a vast array of potentially relevant phenotypes are routinely collected and phenotypic complexity has grown exponentially as new tools to analyze high-resolution structure and point-to-point connectivity have emerged. A wide range of volume-, surface-, and connection-based measures are of potential interest in analyzing the relationship between genetic and brain data in the context of clinical conditions (17⇓⇓⇓–21). Although, in principle, any measure computable from different brain imaging modalities can be used as phenotypes in genetic studies, ideal candidate imaging traits should be heritable intermediate (or endo-) phenotypes (22⇓–24), to uncover the genetic underpinnings of various neuropsychiatric disorders or biological processes of interest (25). However, due to the computational complexity and inordinate options of brain imaging measurements, few tools exist to enable efficient heritability-based screening of these phenotypes (26, 27), and the exploration of their genetic basis has been limited to a small subset of the search space. All high-dimensional (whole-brain, voxel-/vertex-wise) heritability maps computed to date have relied on twin or pedigree data (28⇓⇓⇓–32). MEGHA may thus offer a powerful method for large-scale heritability screening and high-resolution heritability profile construction in imaging genetics.

## Results

Table 1 shows the SNP-based heritability estimates of a number of global morphometric measurements, including intracranial volume (ICV; i.e., head size), total brain volume, left/right hemispheric cortical gray matter (GM) volume, total cortical GM volume, total subcortical GM volume, total GM volume, left/right hemispheric white matter (WM) volume, total WM volume, left/right hemispheric mean cortical thickness, overall mean cortical thickness, left/right hemispheric total surface area, and total surface area. The test–retest reliability of these measurements measured by correlation coefficient was computed using 42 individuals that had repeated brain MRI scans on separate days. All measurements show high test–retest reliability. ICV, total brain volume, and mean cortical thickness measures are highly heritable, with familywise error corrected (FWEc) significant *P* values computed by the proposed permutation procedure (*Materials and Methods*). Cortical GM volumes are also heritable, with uncorrected significant *P* values. Subcortical GM volume, WM volumes, and surface area measures show moderate heritability. The proposed permutation procedure implicitly accounts for the correlation structure among measurements and provides more accurate FWEc *P* values (based on one million permutations) than Bonferroni-corrected GCTA *P* values. MEGHA estimates of heritability magnitude are tightly correlated with GCTA results (Fig. 1).

We next applied both MEGHA and GCTA to the heritability analyses of average cortical thickness measures in 68 regions of interest (ROIs; 34 ROIs per hemisphere) defined by the Desikan–Killiany atlas (33), producing SNP-based heritability estimates and significance measures (Table S1). The MEGHA heritability estimates, *P* values, and permutation *P* values (based on one million permutations) show excellent concordance with the GCTA results (Fig. 2 and Fig. S1). Four brain regions—the bilateral superior parietal cortex, the left precuneus cortex, and the left rostral anterior cingulate cortex—are significantly heritable after multiple testing corrections over all of the ROIs. The right precuneus cortex and the right supramarginal gyrus are marginally significant with FWEc (*P* < 0.10).

As shown in Table 2, an analysis that involves 50 or 100 phenotypes would be easily handled by both MEGHA and GCTA, although MEGHA is hundreds of times faster (cases 1 and 2). For example, it took ∼400 s for GCTA to compute the *P* values for all of the 68 ROIs, whereas MEGHA required less than 1 s with a MATLAB implementation on a MacBook Pro with 8 GB of memory and a 2.4-GHz Intel Core i7 processor. The dramatic improvement of MEGHA in computational efficiency makes it possible for high-dimensional heritability screening and mapping (case 3), for inferences on arbitrary statistics of interest based on thousands of permutations (case 4), and even for a combination of both (case 5). Using GCTA in any of these analyses would require months, years, or even decades of computational time, and could be prohibitively slow even if parallel computation is used.

As a demonstration of the usefulness and flexibility of MEGHA in high-dimensional heritability analyses, we conducted vertex-wise MEGHA of cortical thickness measures to produce high-resolution surface maps for SNP-based heritability estimates (Fig. S2) and significance (Fig. 3). (Also see Fig. S2 for spatial heritability maps of sulcal depth, curvature, and cortical surface area measures.) We then performed surface-based clustering on the significance map, using *P* = 0.01 as a cluster-forming threshold, to localize heritable regions of cortical thickness. These empirically identified clusters are typically not aligned with sulcal/gyral patterns or predefined anatomical/functional ROIs. Cluster inferences (18, 34, 35) using the proposed permutation procedure identified five clusters (white outlined and annotated in Fig. 3) with FWEc significance over the entire cortical surface based on 1,000 permutations. Cluster 1, the largest cluster identified comprising 6,518 vertices with a FWEc *P* < 0.001, spans the left superior parietal cortex, cuneus, precuneus, and the left posterior cingulate cortex. Cluster 2 (FWEc, *P* = 0.003) and cluster 3 (FWEc, *P* = 0.015) largely overlap with the left precentral/postcentral cortex and the left superior temporal cortex, respectively. Clusters 4 and 5 are located on the right hemisphere. Specifically, cluster 4 (FWEc, *P* = 0.004) spans the right supramarginal cortex and the lateral part of the precentral/postcentral cortex. Last, cluster 5, which comprises 4,523 vertices with a FWEc *P* < 0.001, extends from the right superior parietal cortex to the right cuneus and precuneus.

## Discussion

In this paper, we present MEGHA, a fast and accurate statistical method for heritability analysis using genome-wide SNP data from unrelated individuals. Our method has excellent concordance with GCTA, but is thousands of times faster. This computational efficiency allows for examination of complex phenotypes that have millions of combinations, and the development of nonparametric sampling techniques such as permutation tests and Jackknife resampling that can produce accurate and flexible inferences for arbitrary statistics of interest. As a case study of its application, MEGHA was used to prioritize brain structural MRI phenotypes based on heritability. We conducted global, regional, and vertex-wise heritability analyses of cortical thickness measures, which empirically identified significantly heritable regions in superior parietal, precuneus, precentral/postcentral, superior temporal, and visual cortex. Prior studies have also published heritability maps of brain image derived phenotypes, where cortical thickness measures are under substantial genetic influences (28⇓⇓–31). These studies have relied on twin or pedigree samples, each spanning a different age range. In particular, twin-modeling results reported in ref. 30 are the most pertinent to our analyses, as the study sample consisted of young adults with an age span similar to our sample. The heritability maps of thickness presented in ref. 30 are similar to the results observed here based on genotypic data. Several differences do emerge such as the present results emphasizing significantly heritable clusters in bilateral association regions of the parietal cortex extending into precuneus. One explanation for these differences might be that MEGHA only captures SNP-based heritability due to common variation, whereas twin or pedigree based analyses can capture components of heritability due to rare variation.

### Methodological Assessment.

Although MEGHA provides estimates of heritability magnitude, the values need to be interpreted with caution. The reason is that when searching over a large number of phenotypes, the heritability estimates ranked at the top are highly likely to be inflated by noise. Reporting heritability magnitude extracted from significantly heritable brain regions is also invalid, representing a general problem of double dipping (36, 37). For this reason, although we demonstrate the usefulness of MEGHA to screen heritability of large numbers of phenotypes, we recommend deriving unbiased estimates of the magnitude of heritability in independent, replicate datasets.

We also note that heritability estimates and significant measures can be affected by the reliability of extracted measurements. Caution is thus needed when comparing heritability estimates across different measurements that are computed using different techniques. Although it appears from Fig. S2 that cortical thickness measures are more heritable than other morphometric features, this may be partly due to the fact that cortical surface area measures have much lower vertex-wise test–retest reliability than cortical thickness and sulcal depth measures in this particular data set.

A completely empirical permutation procedure to assess heritability significance would have to break the association between phenotypes and genotypes while retaining the observed phenotypic correlation structure. To the best of our knowledge, no strategy exists to achieve this requirement. The permutation procedure designed in this paper relies on the assumption that the linear mixed effects model is a good description of the data, making the score test valid and accurate. For example, the permutation inference is only valid when the model residuals under the null hypothesis (no additive genetic effects) can be approximately treated as independent and identically Gaussian distributed. Therefore, the assumptions underlying classical GCTA analyses (e.g., common environmental effects are ignorable across individuals) remain important to our permutation procedure.

### Potential Applications and Extensions.

Although we demonstrate the capability of MEGHA using brain imaging data, it can be potentially applied to many types of high-dimensional phenotypes. In recent years, the study of complex diseases is shifting from the investigation of an isolated outcome variable to a complete and systematic characterization of the “phenome” (38, 39)—the full set of phenotypes of an individual—to unveil disease etiology and accommodate heterogeneity across individuals. The availability of high-dimensional phenotypic resources contained in electronic health records and population-based biobanks has spurred interest in phenome-wide association studies (PheWAS) (40, 41). The ability to identify or prioritize heritable traits in such large-scale repositories could facilitate important biomedical discoveries.

In addition to the capability of handling extremely high-dimensional phenotypic data, the flexibility of the kernel machines framework, which MEGHA is built on, offers multiple choices on kernel functions and SNP grouping strategies. This opens opportunities to examine different sources of genetic contributions (e.g., additive vs. epistatic effects) and dissect the genetic architecture of complex traits (42). Specifically, as shown in *Materials and Methods*, using a linear kernel function to combine all SNPs spanning the genome essentially produces an equivalent model to GCTA and assesses total additive genetic effects from common variants on phenotypic variables. Other kernels, such as a polynomial kernel or the identity-by-state (IBS) kernel (13, 43, 44), may capture complex genetic interactions (epistasis) and facilitate the analysis of broader-sense heritability. Alternatively, grouping SNPs based on genes, pathways, findings of previous genome-wide association studies (GWASs) or other biologically informative information, and using different weighting strategies when combining SNPs, enable the investigation of genetic contributions from specific collections of SNPs.

## Materials and Methods

### MEGHA.

MEGHA makes use of the semiparametric kernel machines model

where *N* is the total number of subjects, *i*, *i* (e.g., age, sex, and top principal components to adjust for population stratification), *L* SNP markers for subject *i*, ** K** that can be estimated from SNP data, and

where ** g** is an

*L*SNPs combined, and

**is an identity matrix. Using a linear kernel function to combine all of the SNPs spanning the genome assesses the total additive genetic effects from common variants on phenotypes and essentially creates a linear mixed effects model equivalent to the one used in GCTA, which is useful for narrow-sense heritability analyses. The flexibility of the modeling framework allows for the use of other kernel functions (e.g., the quadratic kernel and the IBS kernel) and various SNP grouping strategies (e.g., based on genes, pathways, and previous GWAS findings), making it possible to model different sources of genetic contributions (e.g., additive vs. epistatic effects) from a specific collection of SNPs to phenotypes.**

*I*GCTA uses the iterative restricted maximum likelihood (ReML) algorithm to estimate the variance components

where ** y** and follows a mixture of χ

^{2}s under the null. We use the Satterthwaite method to approximate the distribution of

^{2}distribution

*κ*is the scale parameter and

*ν*is the degrees of freedom that captures the effective number of independent SNPs combined by the kernel function. The two parameters are calculated by matching the first two moments, mean and variance, of

Solving the two equations gives *ρ* by *P* value of an observed score statistic ^{2} distribution *v*. If all phenotypes share the same covariate matrix ** X**, the projection matrix

To obtain a point estimate of the SNP-based heritability, we consider the Wald test statistic for the null hypothesis

a half-half mixture of the χ^{2} distribution ^{2} distribution with 1 degrees of freedom *P* value is identical to the score test *P* value, which can be converted into a Wald test statistic *T* using the mixture of χ^{2}s in Eq. **5**. Then, assuming that the phenotypic variance

### Permutation Procedure.

The efficient computation of the test statistic allows for the use of standard permutation procedures. However, with the presence of covariates ** X**, shuffling the rows and columns of the GRM

**in Eq.**

*K***3**produces inaccurate inferences. Inspired by the ideas of the Huh–Jhun permutation (47), we propose a permutation procedure that involves a transformation that projects the data from

*N*dimensional space onto an

**is a diagonal matrix with**

*D**p*zeros on the diagonal. Without loss of generality, we assume that the first

*p*columns of

**and denote the resulting**

*U*Because

Shuffling the rows and columns of the transformed GRM

Permutation allows us to consider arbitrary statistics of interest and offers great flexibility in making inferences. For example, to obtain an FWEc *P* value of heritability for each region in an ROI-based analysis, using a predetermined anatomical or functional atlas with a total of *R* regions, we compute the permuted score test statistic *r*, and record the maximal statistic over the *R* ROIs, *r*th ROI, the FWEc *P* value can be computed as (48)

This permutation procedure implicitly accounts for the correlation structure among the ROIs and therefore produces more accurate FWEc *P* values than Bonferroni-type corrections, which treats each measure as independent. As a second example, we consider cluster inferences on voxel-/vertex-wise significance maps of heritability, as commonly used in the neuroimaging literature (18, 34, 35). Clusters are defined by contiguous voxels/vertices with test statistics above a predefined threshold (or equivalently, *P* values below a threshold). For each permutation *v*, threshold the statistical map, and record the maximal cluster size *c*, the FWEc *P* value is (48)

Inferences on other statistics, e.g., a weighted voxel-/vertex-wise average of test statistics to summarize heritability into a single number and provide an overall significance, can be easily made following similar procedures.

### The Brain GSP.

The Harvard/MGH Brain GSP is a neuroimaging and genetics study of brain and behavioral phenotypes. More than 3,500 native English-speaking adults with normal or corrected-to-normal vision were recruited from Harvard University, MGH, and the surrounding Boston communities. To avoid spurious effects resulting from population stratification, we restricted our analyses to 1,320 young adults (18–35 y old) of non-Hispanic European ancestry with no history of psychiatric illnesses or major health problems (age, 21.54 ± 3.19 y old; female, 53.18%; right-handedness, 91.74%). All participants provided written informed consent in accordance with guidelines set by the Partners Health Care Institutional Review Board or the Harvard University Committee on the Use of Human Subjects in Research. For further details about the recruitment process and participants, and imaging data acquisition, we refer the reader to ref. 16.

### Imaging Data Processing.

We used FreeSurfer (freesurfer.net) (49), version 4.5.0, a freely available, widely used, and extensively validated brain MRI analysis software package, to process the structural brain MRI scans and compute global and regional morphological measurements. For vertex-wise heritability analyses of cortical thickness, sulcal depth, curvature, and cortical surface area, we resampled subject-specific measures onto FreeSurfer’s *fsaverage* representation, which consists of more than 300,000 vertices across the two hemispheres with an intervertex distance of ∼1 mm, and smoothed the coregistered surface maps using a Gaussian kernel with full width at half maximum (FWHM) 20 mm. We also defined a neighborhood structure on the surface mesh for surface-based clustering.

### Genetic Analysis.

We used PLINK (pngu.mgh.harvard.edu/purcell/plink/) (50), version 1.07, to preprocess the GSP genome-wide SNP data. Major procedures included sex discrepancy check, removing population outliers, spuriously related subjects, and subjects with low genotype call rate (*P* ≥ 1 × 10^{−6}; 580,479 SNPs remained for analysis after quality control. We performed a complete linkage clustering of individuals and a multidimensional scaling (MDS) analysis (Fig. S3), based on autosomal genome-wide SNP data in PLINK, to ensure that no clear population stratification and outliers exist in the sample. We used the GCTA toolbox (6), version 1.24.4 (www.complextraitgenomics.com/software/gcta/download.html), to estimate the GRM used in the heritability analyses from all genotyped autosomal SNPs.

### Heritability Analyses of Brain Imaging Measurements.

For all MEGHA and GCTA analyses of global, regional, and vertex-wise brain imaging measurements, we included age, sex, handedness, scanner group, console group, and coil type as covariates. To account for population substratification, the top five principal components (PCs) of the GRM were also included in the model as nuisance variables. We adjusted for ICV in all of the analyses of cortical/subcortical gray/white matter volume measures, and sulcal depth, curvature, and cortical surface area measures, but not in the cortical thickness analyses because cortical thickness is not correlated with ICV.

### Availability.

A MATLAB implementation of MEGHA is available for download at www.massgeneral.org/psychiatry/research/pngu_software.aspx.

## Acknowledgments

This research was carried out in whole or in part at the Athinoula A. Martinos Center for Biomedical Imaging at Massachusetts General Hospital (MGH), using resources provided by the Center for Functional Neuroimaging Technologies, P41EB015896, a P41 Biotechnology Resource Grant supported by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), National Institutes of Health (NIH). Data were provided by the Brain Genomics Superstruct Project (GSP) of Harvard University and MGH, with support from the Center for Brain Science Neuroinformatics Research Group, Athinoula A. Martinos Center for Biomedical Imaging, Center for Human Genetic Research, and Stanley Center for Psychiatric Research. Twenty individual investigators at Harvard and MGH generously contributed data to the overall project. This research was also funded in part by NIH Grants R01 EB015611-01 and U54 MH091657-03 (to T.E.N.); K99MH101367 (to P.H.L.); K01MH099232 (to A.J.H.); R01 NS083534, R01 NS070963, and NIH NIBIB 1K25EB013649-01 (to M.R.S.); K24MH094614 and R01 MH101486 (to J.W.S.); Wellcome Trust Grants 100309/Z/12/Z and 098369/Z/12/Z (to T.E.N.); and BrightFocus Grant AHAF-A2012333 (to M.R.S.). J.W.S. is a Tepper Family MGH Research Scholar.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: tge1{at}mgh.harvard.edu or jsmoller{at}hms.harvard.edu. ↵

^{2}M.R.S. and J.W.S. contributed equally to this work.

Author contributions: T.G., J.L.R., R.L.B., M.R.S., and J.W.S. designed research; T.G., P.H.L., A.J.H., J.L.R., R.L.B., M.R.S., and J.W.S. performed research; T.G. and T.E.N. contributed new reagents/analytic tools; T.G., P.H.L., and A.J.H. analyzed data; and T.G., R.L.B., M.R.S., and J.W.S. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1415603112/-/DCSupplemental.

Freely available online through the PNAS open access option.

## References

- ↵.
- Falconer D

- ↵
- ↵.
- Neale M,
- Cardon L

- ↵
- ↵
- ↵
- ↵
- ↵.
- Lee SH, et al., Schizophrenia Psychiatric Genome-Wide Association Study Consortium (PGC-SCZ), International Schizophrenia Consortium (ISC), Molecular Genetics of Schizophrenia Collaboration (MGS)

- ↵
- ↵
- ↵
- ↵.
- Lin X

- ↵
- ↵
- ↵
- ↵.
- Holmes AJ, et al.

- ↵
- ↵
- ↵
- ↵.
- Glahn DC, et al.

- ↵
- ↵.
- Gottesman I,
- Shields J

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Vul E,
- Harris C,
- Winkielman P,
- Pashler H

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Zuk O,
- Hechter E,
- Sunyaev SR,
- Lander ES

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Westfall P,
- Young SS

- ↵
- ↵

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Biological Sciences
- Genetics

- Physical Sciences
- Statistics