The harmonic mean p-value and model averaging by mean maximum likelihood

Analysis of ‘big data’ frequently involves statistical comparison of millions of competing hypotheses to discover hidden processes underlying observed patterns of data, for example in the search for genetic determinants of disease in genome-wide association studies (GWAS). Model averaging is a valuable technique for evaluating the combined evidence of groups of hypotheses, simultaneously testing multiple levels of groupings, and determining post hoc the optimal trade-off between group composition versus significance. Here I introduce the harmonic mean p-value (HMP) for assessing model-averaged fit, which arises from a new method for model averaging by mean maximum likelihood (MAMML), underpinned by the generalized central limit theorem. Through a human GWAS for neuroticism and a joint human-pathogen GWAS for hepatitis C viral load, I show how the HMP easily combines information to detect statistically significant signals among groups of individually non-significant hypotheses, enhancing the potential for scientific discovery. The HMP and MAMML have broad implications for the analysis of large datasets by enabling model averaging for classical statistics.

The family-wise error rate (FWER) is the probability of falsely rejecting a null in favour of an alternative hypothesis in one or more of the tests performed. Controlling the FWER when some subset of the alternative hypotheses tested might be true is considered the strongest form of protection against false positives.
However, the simple and widely-used Bonferroni method for controlling the FWER tends to be conservative, especially when the individual tests are positively correlated, as often occurs when alternative hypotheses are compared against the same data. In practice, the conservative nature of Bonferroni correction exacerbates the stringent criterion of controlling the FWER, jeopardizing sensitivity to detect true signals.
Alternatives to controlling the FWER have been proposed based on arguments for less stringency. Controlling the false discovery rate (FDR) guarantees that among the significant tests, the proportion in which the null hypothesis is incorrectly rejected in favour of the alternative is limited [4]. The widely used Benjamini-Hochberg procedure [4] for controlling the FDR shares with the Bonferroni method a robustness to positive correlation between individual tests [5], but does not share the consequent problem of becoming overly conservative. These advantages have increased the popularity of FDR control, but necessitate the acceptance of a less rigorous standard of control than the FWER, which in practice can produce large numbers of false positives.
Bayesian statistics has an answer to these problems. While the posterior odds of any individual hypothesis test are inevitably decreased by increasing the number of alternative hypotheses, model averaging allows alternative hypotheses to be combined, so that comparing a group of alternatives against a common null may rule out the null hypothesis collectively. In the case of GWAS, even if no individual variant shows sufficiently strong evidence of association in a region, the model-averaged signal may still achieve sufficiently strong posterior odds [6,7]. Since, in the absence of whole-genome sequencing (WGS), associations at typed variants must be interpreted merely as markers of regional signals, it is clear that analysing variation exhaustively can only improve the prospects for discovery.
However, there is no general method for combining evidence across hypotheses by model averaging in classical statistics. While Bayesian arguments might advocate abandoning classical statistics altogether [8], p-values from likelihood-based inference have been shown to be mathematically closely related to Bayesian quantities [9,10]. Pragmatically, factors including the difficulty of specifying prior information, typically slower methods and simple inertia mean that uptake of Bayesian methods still lags behind classical approaches in many settings, including GWAS. Here I show that hypotheses can be model-averaged quickly and easily through the harmonic mean p-value, improving the prospects for scientific discovery using classical statistics, and prompting a re-evaluation of the issue of controlling false positive rates in analyses of big data.

Results
The harmonic mean p-value

For observed data X consider L mutually exclusive alternative hypotheses M_i, i = 1 … L, all with the same nested null hypothesis M_0. Suppose each alternative has been tested against the null to produce a p-value, p_i.
The main result of this paper is that the weighted harmonic mean p-value (HMP) of any subset R containing |R| of the p-values, defined by

\mathring{p}_R = \frac{\sum_{i \in R} w_i}{\sum_{i \in R} w_i / p_i}, \qquad (1)

where the weights satisfy \sum_{i=1}^{L} w_i = 1, combines the evidence in favour of the group of alternative hypotheses R against the common null, and is an approximately well-calibrated p-value for small values. Moreover, the test defined by

\text{reject } M_0 \text{ in favour of } M_R \text{ when } \mathring{p}_R \le \alpha \, w_R, \qquad (2)

where w_R = \sum_{i \in R} w_i, controls the strong-sense family-wise error rate (FWER) at level approximately α for α ≤ 0.05, no matter how many subsets are tested. Further, the generalized central limit theorem (see e.g. ref. [11]) can be used to obtain a p-value that becomes exact for large |R| because 1/\mathring{p}_R tends towards a Landau distribution [12], which has probability density function

f(x \mid \mu, \sigma) = \frac{1}{\pi \sigma} \int_0^\infty e^{-t (x - \mu)/\sigma - \frac{2}{\pi} t \log t} \sin(2t) \, \mathrm{d}t.

This allows tables of significance thresholds to be computed for direct interpretation of the HMP (Table 1), and direct computation of a better-calibrated p-value using the HMP as a test statistic:

p_{\mathring{p}_R} = \Pr\left( X \ge 1/\mathring{p}_R \right), \qquad (3)

where, for equal weights, X ∼ Landau(log |R| + 0.874, π/2). The HMP outperforms Bonferroni and Simes correction when the tests are mutually exclusive. The HMP complements Fisher's method for combining p-values, the HMP being more appropriate when (i) rejecting the null implies that only one alternative hypothesis may be true, not all of them, and (ii) the p-values might be positively correlated and cannot be assumed independent.
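As a concrete illustration, Equation 1 and the direct-interpretation test of Equation 2 can be sketched in a few lines of code (a minimal sketch, not the published implementation; equal weights summing to one are assumed, and the p-values are hypothetical):

```python
import numpy as np

def harmonic_mean_p(p, w=None):
    """Weighted harmonic mean p-value (Equation 1)."""
    p = np.asarray(p, dtype=float)
    if w is None:
        w = np.full(p.size, 1.0 / p.size)  # equal weights summing to 1
    w = np.asarray(w, dtype=float)
    return w.sum() / np.sum(w / p)

# Ten modestly small p-values, none Bonferroni-significant at alpha = 0.05
p = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.04, 0.01, 0.02, 0.05, 0.03])
hmp = harmonic_mean_p(p)          # ~0.018
print(p.min() <= 0.05 / p.size)   # Bonferroni on the minimum: False
print(hmp <= 0.05)                # direct HMP interpretation: True
```

Note that the direct comparison against α is the approximate shortcut; exact control compares the HMP against the tabulated threshold of Table 1.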
In the next section the theory giving rise to the HMP is explained. Readers most interested in application of the HMP can skip to the following sections.

Model averaging by mean maximum likelihood
A classical analogue of the Bayes factor is the maximized likelihood ratio, which measures the evidence for the alternative hypothesis against the null:

R_i = \frac{\max_{\theta \in \Theta_{M_i}} f(X \mid \theta)}{\max_{\theta \in \Theta_{M_0}} f(X \mid \theta)}.

In a likelihood ratio test (LRT), the p-value is calculated as the probability of obtaining an R_i as or more extreme if the null hypothesis were true:

p_i = \Pr(R_i \ge r_i \mid M_0).

For nested hypotheses (\Theta_{M_0} \subseteq \Theta_{M_i}), Wilks' theorem [13] approximates the null distribution of R_i as LogGamma(ν/2, 1) when there are ν degrees of freedom. The idea of this paper is to develop a classical analogue to the model-averaged Bayes factor by deriving the null distribution of the mean maximized likelihood ratio,

\bar{R} = \sum_{i=1}^{L} w_i R_i,

where the weights could represent prior evidence for each alternative hypothesis. Formally this means the model is treated as a random effect. The distribution of \bar{R} cannot be approximated by the central limit theorem because the LogGamma distribution is heavy-tailed, with undefined variance. Instead the generalized central limit theorem can be used [11], which states that for equal weights (w_i = 1/L) and independent and identically distributed R_i's,

\bar{R} \xrightarrow{d} a_L S + b_L \quad \text{as } L \to \infty,

where a_L and b_L are constants and S follows a Stable distribution with tail index 1, the heavy-tail index of the LogGamma(ν/2, 1) distribution. When ν = 2, the specific form of the Stable distribution is the Landau. The assumptions of equal weights, independence and identical degrees of freedom can be relaxed. Full details of the Stable distribution approximation are in the Methods.
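The failure of the ordinary central limit theorem is easy to see by simulation. Under the null with ν = 2, each p_i is uniform and R_i = 1/p_i has infinite mean, so the sample mean never settles down; instead its typical value drifts upward like log L, consistent with the Landau limit (a Monte Carlo sketch, assuming independent tests with equal weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def median_Rbar(L, reps=500):
    """Median over replicates of the mean maximized likelihood ratio
    R-bar, simulated under the null with nu = 2 degrees of freedom
    (so R_i = 1/p_i with p_i ~ Uniform(0, 1))."""
    R = 1.0 / rng.uniform(size=(reps, L))
    return np.median(R.mean(axis=1))

m1 = median_Rbar(100)
m2 = median_Rbar(10_000)
# The typical R-bar grows by about log(10_000 / 100) ~ 4.6 rather than
# converging, as a law of large numbers would require.
print(m1, m2, m2 - m1)
```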
Notably, when ν = 2 and the assumptions of Wilks' theorem are reasonable, the p-value equals the inverse maximized likelihood ratio:

p_i = \Pr(R_i \ge r_i \mid M_0) = 1/r_i,

so the mean maximized likelihood ratio equals the inverse HMP:

\bar{R} = \sum_{i=1}^{L} w_i / p_i = 1/\mathring{p}.

Under these conditions, interpreting \bar{R} and the HMP are exactly equivalent. This equivalence motivates use of the HMP more generally because:

1. The HMP will capture similar information to \bar{R} regardless of the degrees of freedom.
2. The Landau distribution gives an excellent approximation for \bar{R} when ν = 2, and therefore for 1/\mathring{p}.
3. Combining the p_i's rather than the R_i's automatically accounts for differences in degrees of freedom.
Further, the HMP is approximately well calibrated because the LogGamma cumulative distribution function is regularly varying, meaning that the model-averaged p-value (Equation 3) is approximated by the HMP itself (see e.g. ref. [14]):

p_{\mathring{p}_R} \approx \mathring{p}_R \quad \text{for small } \mathring{p}_R.

Directly interpreting the HMP using Equation 2 constitutes a multilevel test in the sense that any significant subset of hypotheses implies the HMP of the superset will also be significant. However, this is only approximate because the exact significance threshold varies with the number of hypotheses combined (Table 1). The formal definition of the multilevel test that controls the strong-sense FWER at level α and outperforms the power of Bonferroni and Simes correction is

\text{reject } M_0 \text{ in favour of } M_R \text{ when } \mathring{p}_R \le \alpha_L \, w_R,

where α_L is the exact significance threshold for L tests (Table 1). So Equation 2 is formally a shortcut procedure that guarantees either superior power over Bonferroni and Simes or strong-sense FWER control, depending on whether α or α_L is employed. In practice, the less stringent threshold α could be used as a first pass, and the significance of important results could then be checked against the exact threshold from Table 1.
The key insights here are that model averaging yields more powerful tests, arbitrary combinations of hypotheses can be tested for the same cost, and yet a non-significant overall HMP implies no subsets are likely to be significant.
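To make the shortcut procedure concrete, here is a small sketch (hypothetical numbers; equal weights w_i = 1/L; the approximate threshold α rather than the tabulated exact threshold):

```python
import numpy as np

def hmp_subset_test(p_all, subset, alpha=0.05):
    """Shortcut multilevel test (Equation 2) for a subset R of L
    hypotheses with equal weights: reject the null in favour of the
    subset if its weighted HMP is <= alpha * w_R."""
    L = len(p_all)
    p_R = np.asarray(p_all, dtype=float)[list(subset)]
    w_R = len(p_R) / L                       # sum of weights over R
    hmp_R = w_R / np.sum((1.0 / L) / p_R)    # Equation 1
    return hmp_R, hmp_R <= alpha * w_R

rng = np.random.default_rng(1)
p_all = rng.uniform(size=1000)   # 980 null tests...
p_all[:20] = 1e-4                # ...plus 20 modest signals, each above
                                 # the Bonferroni threshold 0.05/1000 = 5e-5

hmp_R, significant = hmp_subset_test(p_all, range(20))
print(hmp_R, significant)        # 1e-4 <= 0.05 * 0.02 = 1e-3: True
```

None of the 20 signals passes Bonferroni correction individually, yet the subset is collectively significant; and because the threshold scales with the subset's total weight w_R, any superset containing it remains significant too.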

HMP enables adaptive multiple testing correction by combining p-values
That the Bonferroni method for controlling the FWER can be overly stringent, especially when the tests are non-independent, has long been recognized. In Bonferroni correction, a p-value is deemed significant if p ≤ α/L, which becomes more stringent as the number of tests L increases. Since human GWAS began routinely testing millions of variants by statistically imputing untyped variants, a new convention has been adopted in which a p-value is deemed significant if p ≤ 5 × 10⁻⁸, a rule that implies the effective number of tests is no more than L = 10⁶. Several lines of argument have been used to justify this ad hoc threshold [16,17], most applicable only to human GWAS.
In contrast, the HMP affords strong control of the FWER while avoiding both ad hoc rules and the undue stringency of Bonferroni correction, an advantage that increases when tests are non-independent. To show how the HMP can recover significant associations among groups of tests that are individually non-significant, I reanalysed a GWAS of neuroticism [15], defined as a tendency towards intense or frequent negative emotions and thoughts [18]. Genotypes were imputed for L = 6,524,432 variants across 170,911 individuals. I used the HMP to perform model-averaged tests of association between neuroticism and variants within contiguous regions of 10, 100 and 1000 kilobases (kb), 10 megabases (Mb), entire chromosomes and the whole genome, assuming equal weights across variants. Figure 1 shows the p-value from Equation 3 for each region R, adjusted by a factor w_R⁻¹ to enable direct comparison to the significance threshold α = 0.05. Similar results were obtained from direct interpretation of the HMP (Figure S1). Model averaging tends to make significant and near-significant adjusted p-values more significant. For example, for every variant significant after Bonferroni correction, the model-averaged p-value for the corresponding chromosome was found to be at least as significant.
Model averaging increases significance more when combining a group of comparably significant p-values, e.g. the top hits in chromosome 9. The least improvement is seen when one p-value is much more significant than the others, e.g. the top hit in chromosome 3. This behaviour is predicted by the tendency of harmonic means to be dominated by the smallest values. In the extreme case that one p-value dominates the significance of all others, the HMP test becomes equivalent to Bonferroni correction. This implies that Bonferroni correction might not be improved upon for 'needle-in-a-haystack' problems. Conversely, dependency among tests actually improves the sensitivity of the HMP because one significant test may be accompanied by other correlated tests that collectively reduce the harmonic mean p-value.
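This limiting behaviour can be verified numerically: with one dominant p-value the HMP behaves like the Bonferroni-adjusted minimum, while a group of comparable p-values is combined without that penalty (a toy sketch with equal weights and invented p-values):

```python
import numpy as np

def hmp(p):
    p = np.asarray(p, dtype=float)
    return p.size / np.sum(1.0 / p)  # equal-weights HMP (Equation 1)

L = 100
# 'Needle in a haystack': one dominant p-value among null-like ones.
needle = np.concatenate(([1e-6], np.full(L - 1, 0.5)))
# Comparable signals: one hundred similarly small p-values.
cluster = np.full(L, 1e-3)

# With one dominant value, the HMP is close to L * p_min, i.e. the
# Bonferroni-adjusted minimum p-value.
print(hmp(needle), L * needle.min())
# With comparable values, the HMP equals their common value, whereas
# Bonferroni adjustment of the minimum would give 0.1.
print(hmp(cluster), L * cluster.min())
```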
In some cases, the HMP found significant regions where none of the individual variants were significant. For example, no variants on chromosome 12 were significant by Bonferroni correction nor by the ad hoc genome-wide significance threshold of 5 × 10⁻⁸. However, the HMP found significant 10 Mb regions spanning several peaks of non-significant individual p-values. One of those peaks contained variant rs7973260, which showed an individual p-value for association with neuroticism of 2.4 × 10⁻⁷ and had been reported as also associated with depressive symptoms (p = 1.8 × 10⁻⁹).
In chromosome 3, individual variants were found to be significant by the ad hoc threshold of 5 × 10⁻⁸, but neither Bonferroni correction nor the HMP agreed that those variants or regions were significant at an FWER of α = 0.05. Indeed, the HMP found chromosome 3 non-significant as a whole. Variant rs35688236, which had the smallest p-value on chromosome 3 at 2.4 × 10⁻⁸, had failed to validate when tested in a quasi-replication exercise that involved testing variants associated with neuroticism for association with subjective wellbeing or depressive symptoms [15]. These observations illustrate that the HMP adaptively combines information among groups of similarly significant tests where possible, while leaving lone significant tests subject to Bonferroni-like stringency, providing a general approach to combining p-values that does not require specific knowledge of the dependency structure between tests.

HMP allows large-scale testing for higher-order interactions without punitive thresholds
Scientific discovery is currently hindered by the avoidance of large-scale exploratory hypothesis testing, for fear of attracting multiple testing correction thresholds that render signals found by more limited testing no longer significant. A good example is the approach to testing for pairwise or higher-order interactions between variants in GWAS. Testing all pairwise interactions among L variants invites a Bonferroni threshold (L + 1)/2 times more stringent than the threshold for testing variants individually, and strictly speaking this must be applied to every test, even though it is highly conservative because of the dependency between tests. The alternative of controlling the FDR risks a high probability of falsely detecting an association of some sort. Therefore interactions are not usually tested for.
To show how model averaging using the HMP greatly alleviates this problem, I reanalysed human and pathogen genetic variants from a GWAS of pre-treatment viral load in hepatitis C virus (HCV)-infected patients [19]. Jointly analysing the influence of human and pathogen variation on infection is an area of great interest, but requires a Bonferroni threshold of α/(L_H L_P) when there are L_H and L_P variants in the human and pathogen genomes respectively, compared with α/(L_H + L_P) if testing the human and pathogen variants separately. In this example, L_H = 399,420 and L_P = 827.
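The thresholds in this example follow directly from the variant counts (a quick arithmetic check at α = 0.05):

```python
# Bonferroni thresholds for the joint human-pathogen HCV GWAS example.
alpha = 0.05
L_H, L_P = 399_420, 827          # human and pathogen variant counts

n_pairs = L_H * L_P              # ~3.3e8 pairwise interaction tests
separate = alpha / (L_H + L_P)   # ~1.2e-7: testing variants separately
pairwise = alpha / n_pairs       # ~1.5e-10: testing all interactions
print(n_pairs, separate, pairwise)
```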
In the original study, a known association with viral load was replicated at human chromosome 19 variant rs12979860 in IFNL4 (p = 5.9 × 10⁻¹⁰), below the Bonferroni threshold of 1.3 × 10⁻⁷. The most significant pairwise interaction I found, assuming equal weights, involved the adjacent variant, rs8099917, with p = 2.2 × 10⁻¹⁰. However, this did not fall below the more stringent Bonferroni threshold of 1.5 × 10⁻¹⁰ (Figure 2A). Had the original study's authors performed and reported all 330 million pairwise tests, they could have been compelled to declare the marginal association in IFNL4 non-significant, despite what intuitively appears to be a clear signal.
Model averaging using the HMP reduces this disincentive to perform additional related tests. Figure 2B shows that despite there being no individually significant pairwise tests involving rs8099917, model averaging recovered a combined p-value of 3.7 × 10⁻⁸, below the multiple testing threshold of 1.3 × 10⁻⁷. Additionally, two viral variants produced statistically significant model-averaged p-values of 5.5 × 10⁻⁵ and 4.8 × 10⁻⁵ at polyprotein positions 10 and 2,061, in the capsid and NS5a zinc finger domain respectively (GenBank AQW44528), below the multiple testing threshold of 6.0 × 10⁻⁵.
These results show how model-averaging using the HMP can enhance scientific discovery by (i) encouraging tests for higher order interactions when they otherwise would not be attempted and (ii) recovering lost signals of marginal associations after performing an 'excessive' number of tests.
Untangling the signals driving significant model-averaged p-values

When more than one alternative hypothesis is found to be significant, either individually or as part of a group, it is desirable to quantify the relative strength of evidence in favour of the competing alternatives. This is particularly true when disentangling the contributions of a group of individually non-significant alternatives that are significant only in combination.
Sellke, Bayarri and Berger [9] proposed a conversion from p-values into Bayes factors which, when combined with the prior information contained in the model weights, produces posterior model probabilities and credible sets of alternative hypotheses. Because the conversion is a regularly varying function of the p-value, Bayes factors for similarly-favoured alternatives are approximately proportional to the inverse p-value. This proportionality mirrors the HMP itself, whose inverse is an arithmetic mean of the inverse p-values.
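A minimal sketch of this kind of conversion, using the Sellke-Bayarri-Berger calibration 1/(−e · p · log p), an upper bound on the Bayes factor valid for p < 1/e, with equal prior weights and illustrative p-values rather than those from the study:

```python
import numpy as np

def sbb_bayes_factor(p):
    """Sellke-Bayarri-Berger calibration: an upper bound on the Bayes
    factor in favour of the alternative, 1/(-e * p * log(p)), p < 1/e."""
    p = np.asarray(p, dtype=float)
    return 1.0 / (-np.e * p * np.log(p))

# Hypothetical p-values for four competing alternative hypotheses
p = np.array([1e-8, 5e-7, 2e-6, 1e-4])
bf = sbb_bayes_factor(p)
post = bf / bf.sum()   # approximate model probabilities, equal weights
print(np.round(post, 3))
```

For similarly small p-values the Bayes factors scale roughly like 1/p (up to the slowly varying log term), which is why normalizing them mirrors the arithmetic of the inverse HMP.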
After conditioning on rejection of the null hypothesis by normalizing the approximate model probabilities to sum to 100%, the probability that the association involved human variant rs8099917 was 54.4%. This signal was driven primarily by the three viral variants with the highest probability of interacting with rs8099917 in their effect on pre-treatment viral load: position 10 in the capsid (10.9%), position 669 in the E2 envelope (8.7%) and position 2,061 in the NS5a zinc finger domain (11.4%) (Figure 3). Even though the model-averaged p-value for the envelope variant was not itself significant, this revealed a plausible interaction between it and the most significant human variant, rs8099917.

Discussion
The HMP provides a way to calculate model-averaged p-values, offering a powerful and general method for combining tests while controlling the strong-sense FWER. It provides an alternative both to the overly conservative Bonferroni control of the FWER and to the lower stringency of FDR control. The HMP allows the incorporation of prior information through the model weights, and is robust to positive dependency between the p-values. The HMP is approximately well-calibrated for small values, while a null distribution, derived from the generalized central limit theorem, is easily computed. When the HMP is not significant, neither is any subset of the constituent tests.
The HMP is more appropriate for combining p-values than Fisher's method when the alternative hypotheses are mutually exclusive, as in model comparison. When the alternative hypotheses all have the same nested null hypothesis, the HMP is interpreted in terms of a model-averaged likelihood ratio test. However, the HMP can be used more generally to combine tests that are not necessarily mutually exclusive, but which may have positive dependency. It can be used alone or in combination, for example with Fisher's method to combine model-averaged p-values between groups of independent data.
The theory underlying the HMP provides a new way to think about controlling the FWER through multiple testing correction. The stringency of the Bonferroni threshold increases linearly with the number of tests, whereas the HMP is the reciprocal of the mean inverse p-value. To maintain significance under Bonferroni correction, the minimum p-value must decrease linearly as the number of tests increases. This strongly penalizes exploratory analyses. In contrast, to maintain significance with the HMP, the mean inverse p-value must remain constant as the number of tests increases. This does not penalize exploratory analyses so long as the 'quality' of the additional hypotheses tested, measured by the inverse p-value, does not decline.
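The contrast can be made concrete with a toy example: repeating a batch of tests of similar 'quality' leaves the HMP unchanged, while the Bonferroni criterion keeps tightening (equal weights; the approximate threshold α is used for direct interpretation; invented numbers):

```python
import numpy as np

def hmp(p):
    p = np.asarray(p, dtype=float)
    return p.size / np.sum(1.0 / p)  # equal-weights HMP (Equation 1)

alpha = 0.05
base = [2e-3, 5e-3, 1e-2]   # an initial batch of three tests
more = base * 100           # 100x more tests of the same quality

for p in (base, more):
    L = len(p)
    # HMP verdict is unchanged; the Bonferroni verdict flips to False.
    print(L, hmp(p) <= alpha, min(p) <= alpha / L)
```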
Through example applications to GWAS, I have shown that the HMP combines tests adaptively, producing Bonferroni-like adjusted p-values for 'needle-in-a-haystack' problems when one test dominates, while capitalizing on numerous strongly significant tests to produce smaller adjusted p-values when warranted. I have shown how model averaging using the HMP encourages exploratory analysis and can recover signals of significance among groups of individually non-significant tests, properties that have the potential to enhance the scientific discovery process.