## New Research In

### Physical Sciences

### Social Sciences

#### Featured Portals

#### Articles by Topic

### Biological Sciences

#### Featured Portals

#### Articles by Topic

- Agricultural Sciences
- Anthropology
- Applied Biological Sciences
- Biochemistry
- Biophysics and Computational Biology
- Cell Biology
- Developmental Biology
- Ecology
- Environmental Sciences
- Evolution
- Genetics
- Immunology and Inflammation
- Medical Sciences
- Microbiology
- Neuroscience
- Pharmacology
- Physiology
- Plant Biology
- Population Biology
- Psychological and Cognitive Sciences
- Sustainability Science
- Systems Biology

# A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk

Contributed by Kathryn Roeder, April 27, 2020 (sent for review October 28, 2019; reviewed by William Fithian and Matthew Stephens)

## Significance

Variation is rampant throughout human genomes: some of it affects disease risk, and most does not; to separate the two requires a plethora of hypothesis tests. This challenge of multiple testing—limiting false positives while maximizing power—arises in many “omics” studies and sciences. One approach is to control the false discovery rate (FDR), and a recent selective inference method for controlling FDR, adaptive *P*-value thresholding (AdaPT), facilitates incorporation of auxiliary information (covariates) related to each hypothesis test. How AdaPT performs on data is an open question. We apply AdaPT to results from genomic association studies and include many covariates. This adaptive search discovers a more complex and interpretable model with far greater power than classic multiple testing procedures.

## Abstract

To correct for a large number of hypothesis tests, most researchers rely on simple multiple testing corrections. Yet, new methodologies of selective inference could potentially improve power while retaining statistical guarantees, especially those that enable exploration of test statistics using auxiliary information (covariates) to weight hypothesis tests for association. We explore one such method, adaptive *P*-value thresholding (AdaPT), in the framework of genome-wide association studies (GWAS) and gene expression/coexpression studies, with particular emphasis on schizophrenia (SCZ). Selected SCZ GWAS association *P* values play the role of the primary data for AdaPT; single-nucleotide polymorphisms (SNPs) are selected because they are gene expression quantitative trait loci (eQTLs). This natural pairing of SNPs and genes allow us to map the following covariate values to these pairs: GWAS statistics from genetically correlated bipolar disorder, the effect size of SNP genotypes on gene expression, and gene–gene coexpression, captured by subnetwork (module) membership. In all, 24 covariates per SNP/gene pair were included in the AdaPT analysis using flexible gradient boosted trees. We demonstrate a substantial increase in power to detect SCZ associations using gene expression information from the developing human prefrontal cortex. We interpret these results in light of recent theories about the polygenic nature of SCZ. Importantly, our entire process for identifying enrichment and creating features with independent complementary data sources can be implemented in many different high-throughput settings to ultimately improve power.

Large-scale experiments, such as scanning the human genome for variation affecting a phenotype, typically result in a plethora of hypothesis tests. To overcome the multiple testing challenge, one needs corrections to limit false positives while maximizing power. Introduced in ref. 1, false discovery rate (FDR) control has become a popular approach to improve power for detecting weak effects by limiting the expected false discovery proportion (FDP) instead of the more classic family-wise error rate. The Benjamini–Hochberg (BH) procedure was the first method to control FDR at target level α using a step-up procedure that is adaptive to the set of *P* values for the hypotheses of interest (1). Other methods for FDR control have led to improvements in power over BH by incorporating prior information, such as by the use of *P*-value weights (2). In the “omics” world—genomics, epigenomics, proteomics, and so on—the challenge of multiple testing is burgeoning, in part because our ability to characterize omics features grows continually and in part because of the realization that multiple omics are required for describing phenotypic variation. One might imagine merging complementary omics data and tests using a priori hypothesis weights to improve power; however, until recently, it was not clear how to choose these weights in a data-driven manner.

Recent methodologies have been proposed to account for covariates or auxiliary information while maintaining FDR control (3⇓⇓⇓–7). We implement a selective inference approach, called adaptive *P*-value thresholding [AdaPT (8)], to explore prior auxiliary information while maintaining guaranteed finite-sample FDR control. A recent review compared the performance of AdaPT with other covariate-informed methods for FDR control with off-the-shelf one-dimensional and two-dimensional covariate examples (9). One of the weaknesses they ascribe to AdaPT is the unintuitive modeling framework for incorporating covariates; however, AdaPT is not a specific algorithm that one can simply apply to a dataset but rather, a metaalgorithm for marrying machine learning methods to multiple testing problems without compromising FDR control. We fully embrace AdaPT’s flexibility via gradient boosted trees in a much richer, high-dimensional setting. Our boosting implementation of AdaPT easily scales with more covariates, enabling practitioners to capture interactions and nonlinear effects from the rich resources of prior information available.

In this manuscript, we demonstrate our gradient boosted trees implementation of AdaPT on results from genome-wide association studies (GWAS), incorporating covariates constructed from independent GWAS and gene expression studies. Specifically, we apply AdaPT to GWAS for detecting single-nucleotide polymorphisms (SNPs) associated with schizophrenia (SCZ) using bipolar disorder (BD) GWAS results from an independent dataset as a covariate. Additionally, we incorporate results from the recent BrainVar study to identify a set of expression single-nucleotide polymorphisms (eSNPs) based on 176 neurotypical brains, sampled from pre- and postnatal tissue from the human dorsolateral prefrontal cortex (10). Along with the genetically correlated BD *z* statistics, we create additional features from this complementary data source by summarizing the associated developmental gene expression quantitative trait loci (eQTL) slopes and membership in gene coexpression networks. We demonstrate that this process of identifying an enriched set of eSNPs and applying AdaPT with covariates summarizing gene expression from the developing human prefrontal cortex yield substantial improvement in power with each additional piece of information from the BrainVar study. Furthermore, we validate the replication of our results using more recent, independent SCZ studies.

This study had two goals: to explore the use of AdaPT in a realistic high-dimensional multiomics setting and to determine what can be learned about the neurobiology of SCZ by this exploration. Our results revealed the power of incorporating auxiliary information with flexible gradient boosted trees. While each covariate independently provided at best a modest increase in power, our adaptive search discovered a more complex model with far greater power. These discoveries also led to increasing support for the polygenic basis of SCZ, complementing recent findings and suggesting that there are many physiological avenues to its underlying neurobiology. We emphasize that the process and analysis undertaken with this implementation of AdaPT can be extended to a variety of omics and other settings to utilize the rich contextual information that is often ignored by standard multiple testing corrections. We highlight this feature by analyzing two other sets of GWAS studies, type 2 diabetes (T2D) and body mass index (BMI), using results from these analyses to interpret findings from SCZ.

## Results

### Methodology Overview.

AdaPT is an iterative search procedure, introduced in ref. 8, for determining a set of discoveries/rejections, R, with guaranteed finite-sample FDR control at target level α under conditions outlined below. We apply AdaPT to the collection of *P* values and auxiliary information, *i*’s association with the phenotype of interest (e.g., SCZ). The covariates from some feature space,

For each step *P* values above the “mirror estimator” of

If *P* values determining

The algorithm proceeds by sequentially updating the threshold

This procedure guarantees finite-sample FDR control under independence of the null *P* values and as long as the null distribution of *P* values is mirror conservative (i.e., the large “mirror” counterparts *P* values *Data* and additionally provide simulations in *SI Appendix* showing that AdaPT appears to maintain FDR control in relevant positive dependence settings. However, one practical limitation we encounter with the FDP estimate in Eq. **1** is observing *P* values exactly equal to one. While this can understandably occur with publicly available GWAS summary statistics, *P* values equal to one will always contribute to the estimated number of false discoveries *SI Appendix* an adjustment to the *P* values for T2D and BMI GWAS applications that alleviates this problem, although future work should explore modifications to the FDP estimator itself.

The modeling step of AdaPT estimates conditional local fdr with an expectation–maximization (EM) algorithm. In this context, we use gradient boosted trees, which construct a flexible predictive function as a weighted sum of many simple trees, fit using a gradient descent procedure that minimizes a specified objective function. The two objective functions considered correspond to estimating the probability of a test being nonnull and the distribution of the effect size for nonnull tests. The advantage of this approach to function fitting is that it is invariant to monotonic variable transformations, automatically incorporates important variable interactions, and is able to handle a large number of covariates without degrading significantly in performance due to the high dimensionality. In contrast, less effective methods might fail to capture useful information because the covariates are poorly transformed for a linear model, because the important information is only revealed through a combination of covariates, or because the important signal is simply swamped by the number of possible predictors to search through. Our choice of method gives the flexibility to include many potentially useful covariates without being overly concerned about the functional form with which they enter or their marginal utility. In our implementation, we employ the XGBoost library (12) to capitalize on its computational advantages. Fig. 1 displays the full pipeline of our implementation of AdaPT to GWAS summary statistics for SNPs using eQTL to select the SNPs under investigation.

### Data.

Our investigation includes AdaPT analyses of published GWAS *P* values, *z* statistics from BD GWAS,

After matching alleles from both 2014-only and all-2018 studies and limiting SNPs to those with imputation score

The second source was the Genotype-Tissue Expression (GTEx) V7 project dataset (21) with adult samples from 53 tissues. As the first winnowing step, we identified the set of GTEx eQTLs for any of the available tissues at target FDR level *SI Appendix*).

For each eSNP i, we created a vector of covariates *P* values from GWAS studies of related phenotypes, and relationships inferred from gene expression studies. First, we utilize the mapping of eSNPs to genes derived from eQTLs assessed in a relevant tissue type r. Although the majority of observed eSNPs have one unique cis-eQTL gene pairing, 14% of SNPs in BrainVar were eQTL for multiple genes. Let

For the

### AdaPT Discoveries.

As noted elsewhere (24), eSNPs are more likely to be associated with a GWAS phenotype than are randomly chosen SNPs. This is true for the eSNP from BrainVar too, when evaluated in light of the SCZ GWAS *P* values (Fig. 2*A*). To evaluate the performance of the AdaPT search algorithm using the eSNP data, we compare the fitted full covariate model with results from its intercept-only version (Fig. 2*B* vs. Fig. 2*C*). As expected, the intercept-only analysis performs better than BH, with all 269 BH discoveries contained within the intercept-only discoveries because it incorporates an estimate for the proportion of nonnull tests. The full model rejects *z* statistics, then 2) eQTL slope summaries, and then 3) the WGCNA indicators (Fig. 2 *D* and *E*).

The largest number of discoveries occurs when all 24 covariates are fitted (Fig. 2*D*), highlighting that all three types of information together are required. Notably, only 540 associations are discovered using all covariates without interactions, fewer discoveries than only using module-based covariates with interactions. This highlights the improvement in AdaPT’s performance from modeling the interactions between covariates via gradient boosted trees. As might be expected from their counts of discoveries (Fig. 2*D*), the greatest overlap with the full model occurs by fitting all covariates, but without interactions, or by fitting the module-based covariates (Fig. 2*E*).

Additional discoveries are of little interest if they consist primarily of SNPs in linkage disequilibrium (LD) with SNPs already discovered using a simpler model, such as the logit model typically used for SCZ GWAS. For context, however, of the initial 25,076 eSNPs we analyzed, only 4 have *P* values *SI Appendix*). This thinning results in roughly 3,960 eSNPs to be analyzed by the different models (Fig. 2). (Ties in *q* values add or subtract a few SNPs to this 3,960 count, depending on the model analyzed.) When AdaPT is fit to these independent SNPs, we obtain analogous improvements in performance compared with the larger set of SNPs (*SI Appendix*, Figs. S2 and S3): the full AdaPT model discovers 95 independent loci, while the intercept-only model discovers only 42 loci. Likewise, the full model is the best model, and interactions remain important. Finally, no location in the genome exerts unusual influence on the results, which is also the case for the analyses of 25,076 eSNPs.

As described previously, we performed similar analyses of T2D and BMI GWAS *P* values. All results for these analyses, as well as more details regarding analyses of SCZ, are available in Dataset S1 and *SI Appendix*.

### Variable Importance and Relationships.

We examine the variable importance and partial dependence plots from the gradient boosted models to provide insight into the relationships between each of the covariates and SCZ associations. Fig. 3*A* displays the change in variable importance for the probability of being nonnull (*z* statistics are estimated as the most important for each *P* values.

Fig. 3*B* displays the partial dependence plot (25) at each AdaPT model fitting iteration for the estimated marginal relationship between the BD *z* statistics and the probability of being nonnull, evaluated at the *Methods*) is to order the remaining masked *P* values, the *P* values: as the rejection threshold *P* values are more likely nonnull (assuming there is signal). However, for each model iteration, Fig. 3*B* reveals an increasing likelihood for nonnull results as the BD *z* statistics grow in magnitude from zero, as well as a diminished impact of BD *z* statistics on the estimated *C* displays the clear enrichment for eSNPs with cis-eQTL genes that are members of the salmon WGCNA module reported by ref. 10, which was the most important WGCNA module indicator in the first model fitting step. This differs from the unassigned gray module variable: it is predictive of SNPs that are classified as null, rather than associated with the phenotype. Taken together, Fig. 3 emphasizes the use of all covariates across different steps of the AdaPT search. *SI Appendix* has more analyses highlighting the advantages of accounting for interactions between covariates.

### Replication in Independent Studies.

Next, we examine the replicability of the 2014-only SCZ AdaPT results using independent 2018-only studies. We find (Fig. 4) an increasing smoothing spline relationship between these sets of values, with noticeably increasing evidence indicated by the 2018-only *P* values for the set of AdaPT discoveries at *P* values *SI Appendix* has supporting simulations).

### Gene Ontology Comparison.

Using the SNP discoveries, which span the genome, we next sought biological insights. We applied gene ontology enrichment analysis (26, 27) to the 136 genes obtained from the eQTL variant–gene pairs associated with the 843 discoveries. This analysis produced no clear signal, yielding only a minor enrichment for biological processes related to peptide antigen assembly. Several explanations are plausible; we explore two: either AdaPT is discovering SNPs of such small effect that the discoveries are not meaningful, or SCZ is a highly complex disorder with a large number of biological processes involved. For comparison, we applied our full pipeline to GWAS summary statistics for T2D (14). This comparison is of interest because T2D is a disease with a well-understood functional basis and this is a well-powered study (74,124 T2D cases and 824,006 controls). We restricted our analysis to 176,246 eSNPs based on eQTLs obtained using GTEx data. Next, we created eQTL-based covariates using pancreas, liver, and adipose tissue samples (*SI Appendix* has more details). After creating a vector of covariates from GTEx, AdaPT returned 14,920 eSNPs at *SI Appendix*).

### Pipeline Results for All-2018 Studies.

In addition to applying the pipeline to SCZ *P* values from the 2014-only studies in ref. 15, we also modeled *P* values from all-2018 studies. The latter yields far more discoveries due to smaller SEs from increased study sizes, even though the covariates were the same: for *P* values for the most up-to-date set of studies vs. 843 for the 2014-only studies. Notably, the intercept-only version of AdaPT returned 1,865 discoveries at *SI Appendix*). Simply accounting for more auxiliary information does not guarantee an improvement in power, and the advantages thereof diminish as power increases, as witnessed by results for all-2018 studies for SCZ and the large-scale BMI GWAS. Additionally, the larger number of discoveries for the SCZ all-2018 studies, 2,228, maps onto 382 genes. Despite this increase, these genes did not reveal any clear signal from the gene ontology enrichment analysis, comporting with results from the 2014-only results.

## Discussion

Our goals in this study were to explore the use of AdaPT for high-dimensional multiomics settings and investigate the neurobiology of SCZ in the process. AdaPT was used to analyze a selected set of GWAS summary statistics for SNPs, together with numerous covariates. Specifically, SNPs were selected if they were documented to affect gene expression; these SNP–gene pairs were dubbed eSNPs. Covariates for these eSNPs included GWAS test statistics from a genetically correlated phenotype, BD, which were mapped to eSNPs through SNP identity, as well as features of gene expression and coexpression networks, which were mapped to eSNPs through genes. By coupling flexible gradient boosted trees with the AdaPT procedure, relationships among eSNP GWAS test statistics and covariates were uncovered, and more SNPs were found to be associated with SCZ, while maintaining guaranteed finite-sample FDR control. The tree-based handling of covariates addresses a perceived weakness of AdaPT, namely the unintuitive modeling framework for incorporating covariates (9). Moreover, it is worth noting that the original approach implemented by ref. 8, a generalized linear model with spline bases, yields similar results (361 discoveries at target *z* statistics. This is an even more straightforward implementation for handling covariates without interactions. The pipeline we built should be simple to mimic for a wide variety of omics and other analyses.

The results shed light on the level of complexity underlying the neurobiology of SCZ. If the origins of SCZ arose by perturbations of one or a few pathways, we would expect to converge on those pathways as we accrue more and more genetic associations. On the other hand, if the ways to generate vulnerability to SCZ were myriad—even if there is a single ultimate cause shared across all cases—then we might expect no such convergence, at least with regards to the common variation assessed through GWAS. Gene ontology analysis of associated discovery genes from either the 2014-only or all-2018 studies reveals no enrichment for biological processes for SCZ. There are many possible explanations for these null findings, one of which is simply a lack of power or specificity of our results. However, the result stands in stark contrast to the results for T2D, for which the gene ontology analysis converges nicely on accepted pathways to T2D risk; yet, they comport with those for BMI, which is known to have myriad genetic and environmental origins. Therefore, our results are consistent with myriad pathways to vulnerability for SCZ, although it is impossible to rule out other explanations: for example, the possibility that we understand so little about brain functions that gene ontology analyses lack specificity. In any case, our results are consistent with two recent theories underlying the genetics of SCZ, namely extreme polygenicity (29) and “omnigenic” origins (30).

Although the examples considered in this manuscript pertain to omics data, this process can be adapted for a large variety of settings. We demonstrate in *SI Appendix* simulations showing that AdaPT appears to maintain FDR control in positive dependence settings emulating LD block structure underlying GWAS results. There is a clear need, however, for future work to explore AdaPT’s properties and computational challenges under various dependence regimes. The growing abundance of contextual information available in omics settings provides ample opportunity to improve power for detecting associations, using a flexible approach such as AdaPT, when addressing the multiple testing challenge.

## Methods

### Two-Groups Model.

The most critical step in the AdaPT algorithm (8) involves updating the rejection threshold *P* values are modeled as uniform [*P*-value density with a beta distribution density parametrized by

There are two categories of missing values in these regression problems: *P* values for tests *SI Appendix* that reflect the approach taken in the R adaptMT package by the same authors, which differs slightly from ref. 1.

During the E step of the

The M step consists of estimating

We follow the procedure detailed in section 4.3 of ref. 8 to estimate the conditional local fdr for each

### AdaPT Gradient Boosted Trees with Cross-Validation Steps.

As a flexible approach for modeling the conditional local fdr, we use gradient boosted trees (25) via the open-source XGBoost implementation (12). Gradient boosted trees are an ensemble of many small tree models that jointly contribute to predictions. Let

To tune the variety of parameters for gradient boosted trees within AdaPT, such as the number of trees P and maximum depth of each tree, we use the cross-validation (CV) approach recommended in ref. 8. If we are considering M different options of boosting parameters, then we evaluate each of the M choices during the modeling phase of the AdaPT search. At step t, we divide the data into K folds, preserving the relative proportions of masked and unmasked hypotheses. Then, for each set of boosting parameters

### Computational Aspects of AdaPT.

Practical decisions are necessary to implement the AdaPT search. In addition to the covariates and *P* values *P* values using this flexible multiple testing correction. Additionally, by lowering the starting threshold, more true information is available to the gradient boosted trees at the start of the AdaPT search. For instance, with the set of BrainVar eSNPs, 21,248 true *P* values are immediately revealed with *SI Appendix* show that on average our choice for using a lower threshold results in higher power.

The most computationally intensive part of the procedure is updating the rejection threshold via the EM algorithm. Instead of updating the model for estimating *SI Appendix*. Additionally, one needs to choose the potential M model parameter choices. Technically, unique combinations can be used for both models, *SI Appendix* showing how extensively increasing the number of trees P leads to decreasing power but maintains valid FDR control.

### Code Availability.

We provide a modified version of the adaptMT R package to implement the AdaPT CV tuning steps with XGBoost models at https://github.com/ryurko/adaptMT and provide all code used to generate the manuscript’s results at https://github.com/ryurko/AdaPT-GWAS-manuscript-code.

## Acknowledgments

This work was supported by National Institute of Mental Health Grant R37MH057881, Simons Foundation Grant SFARI SF575097, and NSF Grant DMS 1613202. We are grateful for help from Lambertus Klei regarding datasets and their interpretation.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: roeder{at}andrew.cmu.edu.

Author contributions: R.Y., M.G., K.R., and B.D. designed research; R.Y., M.G., K.R., and B.D. performed research; R.Y., M.G., K.R., and B.D. contributed new reagents/analytic tools; R.Y. analyzed data; and R.Y., M.G., K.R., and B.D. wrote the paper.

Reviewers: W.F., University of California, Berkeley; and M.S., The University of Chicago.

The authors declare no competing interest.

Data deposition: A modified version of the adaptMT R package to implement the AdaPT CV tuning steps with XGBoost models is available at GitHub (https://github.com/ryurko/adaptMT). All code used to generate the manuscript’s results is also available at GitHub (https://github.com/ryurko/AdaPT-GWAS-manuscript-code).

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1918862117/-/DCSupplemental.

- Copyright © 2020 the Author(s). Published by PNAS.

This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

## References

- ↵
- Y. Benjamini,
- Y. Hochberg

- ↵
- ↵
- J. G. Scott,
- R. C. Kelly,
- M. A. Smith,
- P. Zhou,
- R. E. Kass

- ↵
- ↵
- S. M. Boca,
- J. T. Leek

- ↵
- A. Li,
- R. F. Barber

- ↵
- M. J. Zhang,
- F. Xia,
- J. Zou

- ↵
- L. Lei,
- W. Fithian

- ↵
- ↵
- D. M. Werling et al.

- ↵
- ↵
- T. Chen,
- C. Guestrin

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- GTEx Consortium

- ↵
- B. Zhang,
- S. Horvath

- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- ↵
- L. J. O’Connor et al.

- ↵

## Citation Manager Formats

## Sign up for Article Alerts

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology