## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations

Edited by Elizabeth A. Thompson, University of Washington, Seattle, WA, and approved December 9, 2014 (received for review April 3, 2014)

## Significance

High-throughput DNA sequencing provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, sequencing a large number of subjects is prohibitively expensive. It is common to select subjects for sequencing from the cohorts that have collected genotyping array data. We impute the sequencing data from the array data for the cohort members who are not selected for sequencing and perform gene-level association tests for rare variants by properly combining the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects. This integrative analysis is substantially more powerful than the use of sequencing data alone and can accelerate the search for disease-causing mutations.

## Abstract

In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available.

- data integration
- gene-level association tests
- genotype imputation
- linkage disequilibrium
- whole-exome sequencing

Recent technological advances have made it possible to conduct high-throughput DNA sequencing studies on rare variants, which have a stronger impact on complex diseases and traits than common variants (1). However, it is still economically infeasible to sequence all subjects in a large cohort, and, therefore, only a subset of cohort members can be selected for sequencing. A cost-effective sampling strategy is to preferentially select subjects in the extremes of a quantitative trait distribution or those with a specific disease (2, 3). For case−control studies, an equal number of cases and controls provides more power than other case−control ratios. For quantitative traits, the power increases as more extreme values are sampled (2).

Trait-dependent sampling has been adopted in many sequencing studies, including the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) resequencing project. The NHLBI ESP consists of three studies that sequenced subjects with the largest and smallest values of body mass index (BMI), low-density lipoprotein, and blood pressure, one case−control study on myocardial infarction, and one case-only study on stroke (2). The CHARGE resequencing project selected subjects with the highest values of 14 quantitative traits, as well as a random sample (4).

The cohorts from which subjects are drawn for sequencing often have collected genotyping array data on all or most cohort members through genome-wide association studies (GWAS). This is certainly the case with the cohorts used in the NHLBI ESP and CHARGE projects. If we impute the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, we will dramatically increase the number of subjects with information on rare variants. Indeed, such imputation is being carried out in the Women’s Health Initiative (WHI) (5), which is a major component of the NHLBI ESP, and other cohorts (6). The algorithms for rare variant imputation include MaCH (Markov chain based haplotyper) (7), IMPUTE2 (8), BEAGLE (9), and minimac (10), all of which impute rare variants by leveraging the linkage disequilibrium (LD) between sequenced and GWAS variants.

The aforementioned imputation algorithms have been routinely used to impute untyped common single nucleotide polymorphisms (SNPs) (i.e., variants not present on the genotyping array) in GWAS (11, 12). Common SNPs can be imputed with high degrees of accuracy, and single-SNP association tests treating imputed genotype values as observed quantities have reasonable control of the type I error (13, 14).

There are major differences between imputing common SNPs in GWAS and imputing rare variants in sequencing studies. First, the former uses an external reference panel, such as the HapMap (15) or the 1,000 Genomes (16), whereas the latter relies primarily on an internal reference panel, i.e., the subset of GWAS subjects who are sequenced. Second, rare variants cannot be imputed very accurately because of the low minor allele count and weak LD. Third, untyped common SNPs are unobserved and imputed for all subjects in GWAS whereas rare variants are observed in a subset of the study subjects (i.e., sequenced subjects) and imputed for the rest (i.e., nonsequenced subjects) in sequencing studies, creating within-study differential quality in genotype data. If the selection of subjects for sequencing depends on the trait values, then the variance of the trait for sequenced subjects is generally different from that of nonsequenced subjects. The combination of differential genotype quality and differential trait variability between the sequenced and nonsequenced samples will cause inflation of the type I error in association testing, as will be explained in *Methods* section.

To reduce within-study differential quality in genotype data, one may use a postimputation quality control (QC) procedure (6) to filter out variants that have been imputed with high degrees of uncertainty. As stated, the imputation accuracy for rare variants is typically low; therefore, the use of any reasonable QC filter will remove a large number of rare variants. The removal of variants results in loss of important information because the observed genotype data for sequenced subjects cannot be used if a variant is excluded due to low imputation quality for nonsequenced subjects and because the imputed genotypes, even for a variant that cannot be imputed accurately, may still contain valuable information about the association.

In this article, we show how to perform valid and efficient association tests when rare variants are imputed for nonsequenced subjects. Because single-variant tests and commonly used gene-level tests, such as burden (17, 18), variable threshold (VT) (19, 20), and sequence kernel association test (SKAT) (21), are all based on the score statistic for testing the associations between the genotypes of individual variants and the trait of interest, we investigate the properties of the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects. We find that the score statistic is unbiased (provided that there is no strong population stratification). We show that the standard variance estimator for the score statistic is valid if a random subset of the GWAS subjects is selected for sequencing but is invalid if the selection depends on the trait values and the imputation is inaccurate. In addition, we derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests are guaranteed to have correct type I error. We show, in realistic simulation studies, that the proposed approach is substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We further demonstrate the advantages of the new methodology in an application to empirical data from the WHI.

## Methods

We first consider single-variant analysis without covariates. Let *G* denote the genotype (i.e., number of minor alleles) at the variant site, and let *Y* denote the trait of interest. Suppose that a total of *N* unrelated cohort members are measured on *Y* and GWAS SNPs and that a subset of *n* subjects is selected for sequencing and thus measured on *G*. The selection may be completely random or trait dependent. For example, when *Y* is quantitative, one may select subjects with the largest and smallest values of *Y*, and when *Y* is binary, one may draw a case−control sample with an equal number of cases (i.e.,

We infer the unknown values of *G* for the *G* by the expected count of the minor allele (i.e., dosage). The imputation is performed by any of the commonly used algorithms (7⇓⇓–10) without considering the trait information (e.g., by combining cases and controls in a case-control study). Let *G* for the nonsequenced subject and the observed value of *G* for the sequenced subject.

We relate quantitative *Y* to *G* through the linear regression model

and binary *Y* to *G* through the logistic regression model

where *β* is the association parameter, and *ϵ* is normal with mean zero and variance

where *U* based on the Fisher information is

where

Let us order the data such that the first *n* subjects are the sequenced ones. By some simple algebra, *Y* is independent of **S2** of *SI Appendix, SI Method A*, the means of *U* is zero regardless of the imputation quality. However, the standard variance estimator *U* with imputed rare variants, as explained below. First, the imputed values are less variable than the observed values when the imputation accuracy is not sufficiently high. Second, the variance of *Y* may be different between sequenced and nonsequenced subjects when the selection for sequencing depends on *Y*. Thus, the variance of *Y* may be related to the variance of *Y* and the variance of *U*; a formal proof is provided in *SI Appendix, SI Method B*.

The following robust variance estimator fully captures the variability of *U* under any sampling scheme by properly adjusting for the imputation accuracy:

where *ρ* is the Pearson correlation coefficient between the true and imputed genotypes, and Rsq (7) is the ratio of the variance of the imputed genotype to the variance of the true genotype. Eq. **1** is a special case of Eq. **S13** derived in *SI Appendix, SI Method C*. Note that **1**, the residuals of sequenced subjects are adjusted by *SI Appendix, SI Method A* that Rsq is equivalent to

If the imputation is so accurate that

which is the empirical variance estimator based on efficient score functions (22). Moreover, if **2** is equivalent to *Y*, so the standard analysis is valid.

In *SI Appendix, SI Method C*, we extend our methodology in several directions. First, we consider multiple variants in a gene and derive a robust estimator for the variance−covariance matrix of the (vector-valued) score statistic, from which all commonly used gene-level tests (e.g., burden, VT, and SKAT) can be constructed (23). Second, we incorporate covariates (e.g., demographic variables and principal components for ancestry) into the regression model and show that the score statistic continues to be unbiased as long as covariates are independent of genotypes. (This implies that the score statistic is approximately unbiased when there is no strong population stratification.) We also provide the corresponding robust variance estimator in Eq. **S13** in *SI Appendix*. Third, we accommodate other types of traits such as ordinal and count data by adopting the class of generalized linear models. Fourth, we modify the robust variance estimator for binary traits to improve numerical stability; see Eq. **S14** in *SI Appendix*. Finally, we show how to perform meta-analysis of multiple studies.

## Results

### Simulation Studies.

We conducted extensive simulation studies to evaluate the performance of the proposed methods in realistic settings. We chose one gene, *NPHS2* (nephrosis 2, idiopathic, steroid-resistant) (accession *NM_014625*), on chromosome 1 and restricted our analysis to variants with MAFs *NPHS2*, with MAFs of 0.002, 0.004, 0.001, 0.022, and 0.005 for snp.98398, snp.98400, snp.98401, snp.98418, and snp.98419, respectively. The number of variants and the total MAF (i.e., sum of MAFs over variant sites) for *NPHS2* equal the median number of variants and the median total MAF for all genes, respectively, making this gene a good representation. We used GWAsimulator (24) to generate genotype data for the variants identified by sequencing and for the flanking GWAS SNPs by mimicking the MAFs and LD patterns observed in the WHI genotype data.

We considered a cohort of 5,000 subjects. We generated quantitative traits from the linear regression model *S* is the total number of mutations the subject carries in the gene, *X* is a normal random variable with mean *ϵ* is an independent standard normal variable, and *X* is a potential confounder that may represent population stratification. We selected a subset of cohort members for sequencing. For quantitative traits, we considered five sampling schemes: (Q1) a random sample of 500 subjects; (Q2) 250 subjects with the largest values of *Y* and 250 with the smallest values; (Q3) 500 subjects with the largest values of *Y* and 250 with the smallest values; (Q4) 250 subjects with the largest values of *Y* plus a random sample of 250 from the remaining cohort; and (Q5) 250 subjects with the largest values of *Y* plus a random sample of 1,000 from the remaining cohort. For binary traits, we considered five disease rates: 50%, 30%, 20%, 10%, and 5%, to be referred to as B1, B2, B3, B4, and B5, respectively, and we selected 250 cases and 250 controls regardless of the disease rate. (Note that B1 is equivalent to the situation in which a random subset of subjects from the parent case−control GWAS with the same case−control proportion is sequenced.) For subjects that were not selected, we masked the genotypes for the variants identified by sequencing and used minimac (10) (a low-memory, computationally efficient variant of the MaCH algorithm for haplotype-to-haplotype imputation) to impute them. Due to the computational burden of phasing all replicates, we avoided phasing the reference panel and the target sample but retained the haplotype information generated from the simulation. The average Rsq is 0.22 for snp.98418, which is relatively common; it is less than 0.1 for the other variants.

We constructed the burden, VT, and SKAT tests based on

We first considered the situation of no confounding (i.e., *U*. Consequently, T5-rob always has correct control of the type I error. Under random sampling (i.e., Q1 and B1), *U*, and the type I error rate can be 50 times the nominal level. The results for VT and SKAT exhibit the same patterns as those of T5.

Fig. 1 and *SI Appendix*, Fig. S1 compare the power of various T5 tests for quantitative and binary traits, respectively. The results for sampling scheme Q3 are not included in Fig. 1 because they are similar to those of Q5. The results for B3 are intermediate between those of B2 and B4 and thus omitted from *SI Appendix*, Fig. S1. It is clear that T5-rob is uniformly more powerful than T5-seq, demonstrating the benefit of integrating sequencing and GWAS data. Under sampling scheme Q1, T5-std is slightly more powerful than T5-rob. Under B1, the two tests have the same power. Under other schemes, T5-std has inflated type I error (Table 1), so the power comparisons would not be meaningful. To make fair comparisons, we calculated the power of T5-std by resetting the critical values to attain correct type I error. The power of T5-std so calculated is similar to or much lower than the power of T5-rob.

We examined the robustness of the proposed methods to confounding (i.e., *SI Appendix*, Fig. S2, however, the type I error of T5-rob is still reasonable, especially for binary traits.

As mentioned previously, a common practice is to use a postimputation QC procedure to filter out inaccurately imputed variants before association analysis. For example, Auer et al. (6) excluded variants if their Rsq values were less than the thresholds chosen such that within each MAF category, variants passing the threshold had an average Rsq of 0.8 or higher. In particular, they chose Rsq thresholds of 0.3, 0.6, 0.8, and 0.9 for variants with MAFs 3–5%, 1–3%, 0.5–1%, and 0.1–0.5%, respectively, and excluded all variants with MAFs *NPHS2* entirely since the largest Rsq of the variants was merely 0.22. By contrast, our integrative analysis not only allowed association tests for this poorly imputed gene but also gained substantial power over the analysis of sequenced subjects only.

To further demonstrate the benefits of using inaccurately imputed variants, we considered a second gene, *OR10J3* (olfactory receptor, family 10, subfamily J, member 3) (accession *NM_001004467*), on chromosome 1 that has seven variants, snp.88226, snp.88228, snp.88232, snp.88236, snp.88240, snp.88241, and snp.88244, with MAFs of 0.016, 0.01, 0.008, 0.002, 0.001, 0.008, and 0.002, respectively. The Rsq is 0.67 for snp.88226 and almost zero for the other variants. With the use of a QC procedure (6), only snp.88226 would be retained. As shown in Fig. 2 and *SI Appendix*, Fig. S3, the T5-rob test based on pre-QC variants is substantially more powerful than the T5-std test based on post-QC variants.

The aforementioned simulation studies were based on two specific genes. We also considered all genes on chromosome 1 and generated a binary trait from the logistic regression model *S* is the total number of mutations the subject carries in five genes, and *SI Appendix*, Fig. S4, the test based on

### WHI Data.

The WHI was established by the National Institutes of Health in 1991 to address major health issues causing morbidity and mortality among postmenopausal women (5). We focused on the BMI values for the African American participants of the WHI cohort. Among the 8,142 African American participants who were genotyped by the Affymetrix 6.0 arrays, 360 with BMI values > 40 or < 25 were selected for whole-exome sequencing in the NHLBI ESP (2). The distribution of the BMI values is displayed in *SI Appendix*, Fig. S5.

We used the 360 sequenced subjects as an internal reference panel to impute sequencing data from the GWAS data for the nonsequenced subjects. Specifically, we used MaCH (7) to construct a reference panel of 720 phased haplotypes consisting of both the variants discovered by exome sequencing and the SNPs on the GWAS arrays. We also prephased haplotypes at the GWAS SNPs for the remaining cohort members. We then used minimac (10) to impute genotypes at the variants discovered by exome sequencing for the nonsequenced subjects. We restricted our attention to nonsense, missense, and splice site variants with MAFs *SI Appendix*, Fig. S6 shows the Rsq values for the variants whose MAFs are lower than the 5% and 1% thresholds. The variants with MAFs ≤ 1% account for a majority (83.3%) of all variants with MAFs

We used the log-transformed BMI value as the quantitative trait and included age and the proportion of African ancestry estimated by FRAPPE (frequentist approach for estimating individual ancestry proportion) (25) as covariates. The Pearson correlation coefficients between the ancestry variable and the burden scores of the variants with MAFs

The quantile−quantile plots are displayed in Fig. 3 and *SI Appendix*, Fig. S7. For all tests based on *P* values agree very well with the global null hypothesis of no association, except at the extreme right tails. By contrast, the observed *P* values for all tests associated with *P* values than their counterparts in the integrative analysis based on

The top 10 genes identified by T5-rob are listed in Table 2. The top gene *ODF2L* (accession *NM_020729*) is ranked the second by VT-rob and the fourth by SKAT-rob, and its imputation accuracy is quite high, with an average Rsq of 0.685. The common SNPs in the seventh gene *BDNF* were previously found to be associated with BMI in several GWAS (26, 27). This gene is not in the top 10 list by any test with

For any type of test, the version based on *SI Appendix*, Fig. S7. A gene identified by a test based on *TPSG1* (accession *NM_012467*) identified by T5-std is not among the top 10 genes by T5-rob and has an average Rsq of only 0.028. Because the tests based on

Applying the postimputation QC procedure of Auer et al. (6) to the WHI data, we found that only 17.1% of variants and 47.9% of genes passed the QC criteria. To be specific, 92.8% of variants with MAFs 3–5% passed QC, 69.6% with MAFs 1–3%, 29.5% with MAFs 0.5–1%, and only 3.5% with MAFs 0.1–0.5%. We repeated our association analysis for the post-QC variants and genes. As shown in *SI Appendix*, Figs. S8 and S9, the tests with *P* values are less extreme than those of *SI Appendix*, Table S1, which can be compared with Table 2. We see that the association signal for *BDNF* is weaker after QC. None of the other genes in *SI Appendix*, Table S1 has previously been found to be associated with BMI.

## Discussion

This article provides a valid and efficient approach to integrative analysis of sequencing and GWAS data. The approach is very general in several aspects: (*i*) It can handle any type of trait and any sampling scheme; (*ii*) it encompasses all commonly used gene-level tests for rare variants; (*iii*) it includes single-variant tests as a special case; and (*iv*) it allows for covariates. The computation is the same as the usual gene-level tests except for the replacement of the standard variance estimator with the robust one. The proposed methods have been implemented in the software program SEQGWAS (integrative analysis of SEQuencing and GWAS data), which is freely available at web1.sph.emory.edu/users/yhu30/software.html. It took ∼2 h on an IBM HS22 machine to analyze the WHI data.

For binary traits, the GWAS subjects may come from a cohort or case-control study. The standard variance estimator is valid if the case−control ratios are the same between the sequencing study and the parent study. This condition typically holds when the parent study is a case−control study but likely fails when the parent study is a cohort study.

It is also desirable to integrate other types of genetic data. For example, a subset of the WHI African Americans was genotyped on Metabochip (28) and used as the reference panel to impute Metabochip variants for the remaining African Americans with GWAS data (29). In addition, the GWAS genotyping arrays can be replaced by the Metabochip (28) or the Exomechip (30), which can then be imputed against sequencing data (28). Our approach can be readily used to analyze such mixtures of observed and imputed data.

Our approach is built upon existing imputation algorithms and focused on properly analyzing a mixture of observed and imputed genotype data. Thus, we can take full advantage of newly developed imputation programs to improve imputation accuracy and computational efficiency. Although imputed data may pertain to the genotype dosage, the most likely genotype, or the genotype probabilities, our approach is tailored to the dosage data. In practice, the dosage is the most commonly used because, unlike the most likely genotype, it partially accounts for the uncertainty in the imputed value and because the dosage is computationally more tractable than the genotype probabilities while still retaining most information.

The postimputation QC process can alleviate the inflation of the type I error caused by the use of

## Acknowledgments

The authors thank the WHI investigators and staff for their dedication and the study participants for making the program possible. The authors thank two referees for their helpful comments. Data from the Exome Sequencing Project were supported through NHLBI RC2 HL-102924, RC2 HL-102925, and RC2 HL-102926. This work was supported by NIH Grants R01CA082659, P01CA142538, R37GM047845, R01HG006292, and R01HG006703. The WHI program is funded by the NHLBI (HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C, and HHSN271201100004C).

## Footnotes

- ↵
^{1}To whom correspondence should be addressed. Email: lin{at}bios.unc.edu.

Author contributions: Y.-J.H. and D.-Y.L. designed research; Y.-J.H. and D.-Y.L. performed research; Y.-J.H., Y.L., and P.L.A. analyzed data; and Y.-J.H. and D.-Y.L. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1406143112/-/DCSupplemental.

## References

- ↵
- ↵.
- Lin DY,
- Zeng D,
- Tang ZZ

- ↵.
- Prentice RL,
- Pyke R

- ↵.
- Lin H, et al.

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵.
- Li C,
- Li M

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵

## Citation Manager Formats

### More Articles of This Classification

### Physical Sciences

### Statistics

### Biological Sciences

### Related Content

- No related articles found.

### Cited by...

- No citing articles found.