The harmonic mean p-value for combining dependent tests

Significance. The widespread use of Bonferroni correction encumbers the scientific process and wastes opportunities for discovery presented by big data, because it discourages exploratory analyses by over-penalizing the total number of statistical tests performed. In this paper, I introduce the harmonic mean p-value (HMP), a simple-to-use and widely applicable alternative to Bonferroni correction, motivated by Bayesian model averaging, that greatly improves statistical power while maintaining control of the gold-standard false positive rate. The HMP has a range of desirable properties and offers a different way to think about large-scale exploratory data analysis in classical statistics.


Derivation of the null distributions for the model-averaged mean maximized likelihood ratio and harmonic mean p-value
A. Generalized central limit theorem approximates the distribution of the mean maximized likelihood ratio. The motivating idea of the paper was to develop a classical model-averaged mean maximized likelihood (MAMML) test, analogous to the model-averaged Bayes factor, by seeking a null distribution for the mean maximized likelihood ratio R̄ defined in Equation 1. When the assumptions of Wilks' theorem (1) are met, the null distribution of the maximized likelihood ratio is

R_i ~ LogGamma(α = ν/2, β = 1), [9]

where α and β are the shape and rate parameters of the LogGamma distribution and ν is the difference in the number of parameters between the alternative and nested null hypotheses. The LogGamma distribution has probability density function

f(x | α, β) = (β^α / Γ(α)) (log x)^(α−1) x^−(β+1), x ≥ 1. [10]

When ν = 2, interpreting the mean maximized likelihood ratio is equivalent to interpreting the harmonic mean p-value (HMP) because, from Equation 3, R̄ = 1/p̊. This means that the results for R̄ when ν = 2 also apply to 1/p̊, except that the assumptions of Wilks' theorem no longer need to hold; only the individual p-values need follow the uniform distribution under the null hypothesis, so that 1/p_i follows a LogGamma(1, 1) distribution. In what follows, the derivation for 1/p̊ is therefore the same as for R̄ with ν = 2. The central limit theorem cannot be used to approximate the null distribution of R̄ (or 1/p̊) because the mean and variance of the LogGamma distribution are undefined for β ≤ 1 and β ≤ 2 respectively. However, when the R_i s are independent and identically distributed with regularly varying cumulative distribution function, the generalized central limit theorem (2) states that

R̄ → b_L R_λ + a_L in distribution as L → ∞, [11]

where a_L and b_L are constants, λ is the heavy-tail index of the R_i s and R_λ is a Stable distribution with tail index λ. The LogGamma distribution can be shown to be regularly varying, and its cumulative distribution function can be approximated in order to apply the generalized central limit theorem.
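Since R̄ = 1/p̊ when ν = 2 and the weights are equal, a small sketch can make the relationship concrete. The function name `hmp` and the example p-values are mine; the weighted form sum(w) / sum(w/p) is assumed from the definition of the weighted harmonic mean:

```python
import numpy as np

def hmp(p, w=None):
    """Weighted harmonic mean p-value: sum(w) / sum(w / p)."""
    p = np.asarray(p, dtype=float)
    w = np.ones_like(p) if w is None else np.asarray(w, dtype=float)
    return w.sum() / (w / p).sum()

p = [0.01, 0.5, 0.5, 0.5]
print(hmp(p))        # 4 / (100 + 2 + 2 + 2), roughly 0.0377
print(1.0 / hmp(p))  # equals the mean of the reciprocal p-values, here 26.5
```

With equal weights, 1/p̊ is exactly the mean of the 1/p_i s, i.e. the mean of LogGamma(1, 1) variates under the null, which is the quantity whose null distribution the derivation below seeks.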
A positive measurable function f(x) is defined to be regularly varying (3, 4) if, for all t > 0,

lim_{x→∞} f(t x)/f(x) = t^−λ, [12]

and slowly varying if λ = 0. Any regularly varying function f(x) can be written in terms of a slowly varying function S(x) as

f(x) = S(x) x^−λ. [13]

The probability density function of the LogGamma distribution is regularly varying with heavy-tail index λ = β + 1 and slowly varying function

S(x) = (β^α / Γ(α)) (log x)^(α−1). [14]

The cumulative distribution function of the LogGamma distribution is thus regularly varying with a different heavy-tail index λ = β because, by Karamata's theorem (see e.g. (4)),

∫_x^∞ S(u) u^−(λ+1) du ~ S(x) x^−λ / λ, x → ∞, [15]

for λ > 0, so the upper tail can be approximated by

Pr(R_i > x) ≈ (β^(α−1) / Γ(α)) (log x)^(α−1) x^−β, x → ∞. [16]

B. Parameterization of the Stable distribution approximation for ν = 2. To utilize the generalized central limit theorem it is necessary to find the parameters of the Stable distribution R_λ: the location constant a_L and the scale constant b_L. This can be achieved (2) by approximating the cumulative distribution function of the R_i s as

Pr(R_i > x) ≈ c x^−λ and Pr(R_i < −x) ≈ d x^−λ, x → ∞. [17]

The Stable distribution R_λ then has tail index parameter λ and skewness parameter (c − d)/(c + d). Since a LogGamma-distributed random variable cannot take a value less than one, d = 0, so the skewness parameter takes its maximum value of one, producing an Extremal Stable distribution. When ν = 2, the LogGamma(ν/2, 1) distribution for the R_i s simplifies to a Pareto distribution with tail distribution function

Pr(R_i > x) = x^−1, x ≥ 1, [18]

and the representation in Equation 17 becomes exact. So for ν = 2, c = 1. When λ = 1, as here, because λ equals the rate parameter β of the LogGamma distribution, (5) defines the location constant to be (with a less precise form in (2))

a_L = c (log L + 1 − C_γ + log(π/2)) = c (log L + 0.874367), [19]

where C_γ ≈ 0.5772157 is the Euler–Mascheroni constant, and the scale constant to be (2)

b_L = c π/2. [20]

So R̄ is said to follow an Extremal Stable distribution which, in Nolan's S0 parameterization (6) of the Stable distribution, S(α, β, γ, δ; 0), can be written as

R̄ ∼ S(1, 1, c π/2, c (log L + 0.874367); 0).
[21]

Therefore the combined p-value for the mean maximized likelihood ratio is

p_R̄ = 1 − F_Stable(R̄ | 1, 1, c π/2, c (log L + 0.874367); 0), [22]

which can be computed via the pEstable function of the FMStable R package (7) as

1 - pEstable(Rbar, setParam(alpha = 1, location = c * (log(L) + 0.874367), logscale = log(c * pi/2), pm = 0))

An explicit form for the probability density function of R̄ can be obtained by noting the equivalence of the Extremal Stable distribution with heavy-tail index λ = 1 and the Landau distribution (8, 9), as can be seen from a comparison of the characteristic functions (10, 11). In Nolan's notation, the characteristic function of the Stable S(1, 1, γ, δ; 0) distribution (6) is

E[exp(i u R̄)] = exp(i u δ − |u γ| (1 + (2/π) i log|u γ| sign(u))), [23]

and the characteristic function for a two-parameter Landau(µ, σ) distribution can be written identically but with location parameter µ = δ and scale parameter σ = γ. (Landau's original distribution, which describes the random loss of energy of fast charged particles as they cross a thin layer, is retrieved under this parameterization when µ = log(π/2) and σ = π/2.) Expressed as a Landau distribution,

R̄ ∼ Landau(µ = c (log L + 0.874367), σ = c π/2). [24]

The advantage of this form is that the probability density function can be written explicitly (12) as

f_Landau(x | µ, σ) = (1/(π σ)) ∫_0^∞ exp(−t (x − µ)/σ − (2/π) t log t) sin(2 t) dt, [25]

which means that the combined p-value can be computed numerically without the specialist package, if necessary, as

p_R̄ = ∫_R̄^∞ f_Landau(x | µ, σ) dx, [26]

and likewise for the HMP, substituting R̄ = 1/p̊.

C. Parameterization of the Stable distribution approximation for ν ≠ 2. When ν ≠ 2, the LogGamma(ν/2, 1) distribution no longer simplifies to a Pareto distribution, and the representation of Equation 17 is no longer exact. This means that

c(x) = (log x)^(ν/2−1) / Γ(ν/2) [27]

is no longer constant with respect to x, so the choice of x at which to evaluate it will affect the performance of the Stable distribution approximation.
Since the null distribution of the maximized likelihood ratios is heavy-tailed, the mean is likely to be dominated by the largest value, i.e. R_(L) ≈ L R̄, where R_(i) is the ith smallest value. Conversely, if every ratio contributed equally to the mean, all values would be R_i = R̄. These observations suggest that the relevant value of x at which to evaluate c lies between R̄ and L R̄, with the upper end of the range of most relevance. For simplicity I investigated the performance of the Stable distribution approximation at both endpoints of the range, defining, with λ = 1,

c_1(R̄) = c(R̄) = (log R̄)^(ν/2−1) / Γ(ν/2), [28]
c_L(R̄) = c(L R̄) = (log L R̄)^(ν/2−1) / Γ(ν/2). [29]

The combined p-value for the mean maximized likelihood ratio is approximated as per Equation 22 but with c = c_1(R̄) or c = c_L(R̄) for the two approximations. This can be computed, for example, using the pEstable function of the FMStable R package (7) as above, or using a general-purpose numerical integration tool via the Landau distribution in Equation 26. The next section shows that c_1 should be used when ν < 2 and c_L should be used when ν > 2.
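For readers without the FMStable package, the route via the Landau distribution in Equation 26 can be sketched in Python. Interchanging the order of integration in the tail integral of the Landau density yields a single integral for the upper tail probability; this rearrangement is my own, and the sketch assumes c = 1, i.e. the ν = 2 (HMP) case:

```python
import numpy as np
from scipy.integrate import quad

def hmp_pvalue(hmp, L):
    """Approximate combined p-value for an observed HMP via the Landau form
    of the Extremal Stable null distribution, with c = 1 (the nu = 2 case)."""
    mu = np.log(L) + 0.874367          # location: c * (log L + 0.874367)
    sigma = np.pi / 2.0                # scale: c * pi / 2
    x = 1.0 / hmp                      # Rbar = 1 / HMP
    # Integrating the Landau density over (x, inf) and swapping the order of
    # integration gives:
    # Pr(Rbar > x) = (1/pi) Int_0^inf exp(-t(x-mu)/sigma - (2/pi) t log t) sin(2t)/t dt
    integrand = lambda t: np.exp(-t * (x - mu) / sigma
                                 - (2.0 / np.pi) * t * np.log(t)) * np.sin(2.0 * t) / t
    value, _ = quad(integrand, 0.0, np.inf, limit=200)
    return value / np.pi

print(hmp_pvalue(0.04, L=10))   # roughly 0.05
print(hmp_pvalue(0.01, L=10))   # smaller HMP gives a smaller combined p-value
```

The near-equality of the HMP and the combined p-value in the first line illustrates why directly interpreting the HMP is approximately, though not exactly, valid.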

The null distributions of MAMML and the HMP are robust to the number of tests, unequal weights, unequal degrees of freedom and dependency between the tests
The derivation above assumes equal prior weights for the alternative hypotheses, equal degrees of freedom and independence between the tests. However, the generalized central limit theorem is robust to departures from these assumptions. I investigated the performance of the Stable distribution approximation across a range of scenarios. These simulations demonstrated key properties of the null distribution of R̄ as the mean of a large number of regularly varying random variables (the R_i s): (1) the upper tail of R̄ behaves like the upper tail of the individual R_i s; (2) the null distribution is insensitive to the exact number of tests, L; (3) it is robust to unequal weights; (4) it is robust to positive dependency between the tests; and (5) when the degrees of freedom differ between tests, those with the largest degrees of freedom dominate. In particular, the simulations demonstrated the superior performance of the approximation when ν = 2. This superior performance, and the equivalence of the mean maximized likelihood ratio when ν = 2 with the intuitive representation as a harmonic mean p-value, motivated the focus on the HMP in the rest of the paper.
A. Performance of the approximation under favourable assumptions. Property 1, that the upper tail of R̄ behaves like the upper tail of the R_i s, is an inherent property of the generalized central limit theorem (4), and is expressed by the approximation in Equation 5. Property 2, an insensitivity to the exact number of tests, L, is revealed by the parameterization of the Stable distribution approximation for the null distribution of R̄ (Equations 21 and 22). For ν = 2 degrees of freedom, when the scaling of the tail probability c equals one, the location parameter of the Stable distribution increases in proportion to log L, i.e. more slowly than linearly, and the scale parameter does not change at all with L. The heavy-tail index and skewness parameters are also unaffected by L regardless of ν. When ν ≠ 2, the scale for the tail probability, c, is only approximately locally constant with respect to R̄, and may depend on L. This weak dependency on L is represented in the c_L approximation of the scale for the tail probability (Equation 29). The simulations demonstrated these properties empirically. I performed 100 000 simulations each of L = 10^1, 10^1.5, 10^2, 10^2.5 and 10^3 independent maximized likelihood ratio test statistics, {R_i, i = 1 ... L}, and for each simulation calculated R̄. I repeated the simulations for three degrees of freedom: ν = 1, 2 and 5. The rainbow-coloured lines in Figure S2 show the empirical cumulative distribution functions (top panels) and empirical 'significance' or −log10 tail probabilities (bottom panels) from L = 10^1 (red) to L = 10^3 (purple).
The top panels in Figure S2 show that the total number of tests, L, has a substantial effect on the null distribution of R̄ below the 95th percentile (indicated by the grey horizontal dashed line labelled α = 0.05). In contrast, the bottom panels show that the upper tails of the distribution above the 95th percentile are almost identical irrespective of L, and converge to the distribution of an individual R_i (grey solid lines), demonstrating Properties 1 and 2. These observations are consistent with much slower convergence of the left tail of the distribution to the Stable distribution limit. Nevertheless, the utility of the null distribution approximation is to calculate p-values, i.e. upper tail probabilities, for which the simulations demonstrate the desired insensitivity of the upper tail to L.
As anticipated, the insensitivity to L is superior for ν = 2. Notably, the direction of the effect of L on the relative magnitude of the significance (−log10 tail probabilities) switches between ν < 2 and ν > 2. For ν = 1, the significance of R̄ is greater for larger L. For ν = 5, the significance of R̄ is greater for smaller L. This behaviour has an important effect on whether the Stable distribution approximation is conservative or anti-conservative when ν ≠ 2.
The Stable distribution approximations c_1 and c_L for the null distribution of R̄ (Equations 28 and 29) are shown by the solid and dashed black lines respectively. Both approximations were calculated only for the maximum value of L = 10^3 to give an indication of the robustness of the tests to model misspecification, in particular over-estimation of the number of independent tests. For ν = 2, the scale parameter of the tail probability is c = 1 for both approximations, so the two lines are overlaid. The Stable approximation is extremely close to the simulated null distributions of R̄ above the 95th percentile. For ν = 1, the c_1 approximation is conservative in the sense of always underestimating the significance of R̄, and the c_L approximation is anti-conservative in the sense of always overestimating the significance of R̄ in the upper tail. For ν = 5, the pattern is reversed. This behaviour indicates that the c_1 and c_L approximations do indeed represent the two endpoints for the choice of the tail scaling constant c, as intended, and that the c_1 approximation should be used when ν < 2 and the c_L approximation should be used when ν > 2 to ensure the test errs on the side of being conservative.
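The endpoint scalings of Equations 28 and 29 can be tabulated directly, which reproduces the directions just described. A small sketch; the function names and example values are mine:

```python
from math import gamma, log

def c_tail(x, nu):
    # tail scaling c(x) = (log x)^(nu/2 - 1) / Gamma(nu/2)
    return log(x) ** (nu / 2.0 - 1.0) / gamma(nu / 2.0)

def c_endpoints(Rbar, L, nu):
    # evaluate c at the two endpoints Rbar and L * Rbar
    return c_tail(Rbar, nu), c_tail(L * Rbar, nu)

print(c_endpoints(50.0, 1000, nu=2))   # (1.0, 1.0): for nu = 2, c == 1 exactly
print(c_endpoints(50.0, 1000, nu=1))   # c_1 > c_L, so c_1 is the conservative choice
print(c_endpoints(50.0, 1000, nu=5))   # c_1 < c_L, so c_L is the conservative choice
```

The larger of the two endpoint values inflates the location and scale of the Stable approximation, making the tail probability larger and the test conservative, which matches the recommendation to use c_1 for ν < 2 and c_L for ν > 2.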
B. Robustness of the approximation to unequal prior model weights. The approximation of Equation 5 shows that the convergence of the upper tail of the null distribution of R̄, the mean of a large number of regularly varying random variables with heavy-tail index λ = 1, is robust to unequal weights; this property, Property 3, is shared with the generalized central limit theorem (4). To demonstrate this empirically, I simulated unequal model weights independently for each simulation from a symmetric Dirichlet distribution with parameter 1/L, and used these weights to calculate R̄ using Equation 1. Figure S3 shows that the null distributions of R̄ with unequal model weights are extremely similar to the results with equal model weights (Figure S2), particularly in the upper tails of the distribution (lower panels) but also in the lower tails (upper panels), and even with a small number of tests, L = 10^1 (red lines). The Stable distribution approximations performed as well as under equal model weights, particularly for ν = 2, with the c_1 approximation conservative for ν < 2 and the c_L approximation conservative for ν > 2, as before.

[Figure S4. Robustness of the Stable approximation to extreme dependency between tests. The simulations were conducted as for Figure S2, except that a resampling strategy produced large numbers of identical tests within each simulation, an extreme form of dependency, i.e. identity. This involved simulating weights from a symmetric multinomial distribution, so that on average a proportion e^−1 ≈ 0.368 of tests were replaced by others.]
C. Robustness of the approximation to extreme dependency between tests. The robustness of the Stable distribution approximation to unequal weights implies a degree of robustness to positive dependency between tests, Property 4. This is because upweighting some tests is equivalent to observing identical test statistics multiple times. Convergence of various specific forms of dependency to Stable distributions has been investigated explicitly (13), suggesting that the assumption of independence between tests might be relaxed to one of exchangeability. I simulated a form of extreme dependency between tests by simulating weights from a symmetric multinomial distribution, independently for each simulation. This is equivalent to duplicating, one or more times, some tests and completely removing others. Figure S4 shows again that the null distribution of R̄ in the presence of this positive dependency between tests is very similar to the distribution under independence and equal model weights (Figure S2).

D. Robustness of the approximation to unequal degrees of freedom among tests.
When the individual test statistics, the R_i s, differ in their degrees of freedom, those with the largest degrees of freedom are expected to dominate the null distribution of R̄ (Property 5) because their tails are the heaviest. This suggests that when, in practice, the degrees of freedom are unequal between tests, the Stable distribution approximation should be calculated using the maximum value of ν. To investigate the performance of this strategy, I simulated test statistics with two different degrees of freedom, ν = 1 and ν = 5, in varying proportions of 1:9, 1:1 and 9:1. Figure S5 shows that use of the c_L Stable approximation is always conservative in estimating upper tail probabilities when there is variation in the degrees of freedom, but when the proportion of tests with the maximum degrees of freedom is small, it can be substantially over-conservative. For example, when mixing tests with ν = 1 and ν = 5 in proportion 9:1, the c_L Stable distribution approximation under-estimated significance by an order of magnitude. As expected, the simulated distribution of R̄ and the c_L approximation were still very similar in the upper tail, indicating that the disparity is caused by a discrepancy in the lower tail of the distribution. For such practical cases, it would be worthwhile approximating the lower tail of the null distribution of R̄ by simulation, so that

F(x) ≈ F_sim(x) if F_sim(x) < p_switch, and F(x) ≈ F_Stable(x) otherwise,

where F_sim(x) is the empirical cumulative distribution function of a large number of simulations and the approximation switches to the Stable distribution at a large percentile such as the 99th, so that p_switch = 0.99.
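The switching rule just described can be sketched as a piecewise CDF. All names are mine; the simulated values and the Stable CDF must be supplied by the user:

```python
import numpy as np

def hybrid_cdf(x, sims, stable_cdf, p_switch=0.99):
    """Null CDF approximation: empirical CDF of simulated Rbar values below
    the p_switch quantile, the Stable approximation above it."""
    sims = np.sort(np.asarray(sims, dtype=float))
    F_sim = np.searchsorted(sims, x, side="right") / len(sims)
    return F_sim if F_sim < p_switch else stable_cdf(x)

# toy example: simulated values 1..1000 and a Pareto-like analytic tail
sims = np.arange(1.0, 1001.0)
tail = lambda x: 1.0 - 1.0 / x
print(hybrid_cdf(500.0, sims, tail))   # empirical region
print(hybrid_cdf(995.5, sims, tail))   # tail region, handled analytically
```

Because only the extreme upper tail is delegated to the Stable approximation, the number of simulations needed is modest: the empirical CDF carries the region where the Stable limit converges slowly.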
E. Performance of the harmonic mean p-value. The simulations above show that the Stable distribution approximation to the null distribution of R̄ is superior for ν = 2, whereas for ν ≠ 2 the Stable distribution approximation is more sensitive to the number of tests L, and therefore conservative tests are required to avoid over-estimating the significance of the observed R̄. These observations motivate the use of the HMP, p̊, instead of R̄, particularly when there may be variability in the degrees of freedom between tests. Interpreting the HMP is equivalent to interpreting R̄ when ν = 2 and the assumptions of Wilks' theorem are met. When ν ≠ 2, interpreting the HMP is not exactly equivalent to interpreting R̄, but it is expected to capture similar information. Partly, this is because the conversion between a maximized likelihood ratio statistic R_i and a p-value p_i is approximately locally linear for large R_i, i.e. small p_i. By Karamata's theorem, Equation 16 shows that

p_i ≈ ((log R_i)^(ν/2−1) / Γ(ν/2)) R_i^−1,

which is a regularly varying function, meaning that (log R_i)^(ν/2−1) varies only slowly with respect to R_i and is approximately locally constant. Therefore the conversion from R_i to p_i is approximately linear even for ν ≠ 2. Moreover, the null distribution of the HMP is more robust to departures from the assumptions of Wilks' theorem, differences in degrees of freedom among tests, and the total number of tests, the latter two properties demonstrated by Figure S6. As a type of mean, it has a more intuitive form for combining tests. In section §4 I show how it also parallels the construction of the test statistic in Fisher's method for combining p-values. For these reasons, the focus of the applications of these results is on the HMP.

Controlling the strong-sense family-wise error rate of the HMP test
Limiting the probability with which a combined test falsely rejects the null hypotheses when they are all true is known as weak-sense control of the family-wise error rate (FWER). Strong-sense control of the FWER additionally entails limiting the probability with which any of the true null hypotheses is rejected when one or more null hypotheses are false. A procedure that controls the strong-sense FWER can be constructed from any method that controls the weak-sense FWER using the closure principle (14). Consider a situation in which a series of not necessarily independent binary hypothesis tests are conducted, each one between a null hypothesis O and an alternative A. Define U_i to be the hypothesis that the null O_i is true [32], so that, by implication, rejecting U_i favours the alternative A_i [33]. The null hypothesis associated with an individual U_i is called an elementary hypothesis, while I will call intersections of null hypotheses such as U_i ∩ U_j compound hypotheses. Each test is associated with a p-value assumed to follow a Uniform(0, 1) distribution if the elementary or compound hypothesis is true. The closure of the elementary hypotheses is the set of all possible intersections involving one or more elementary hypotheses. For example, for the elementary hypotheses {U_1, U_2, U_3}, the closure is {U_1, U_2, U_3, U_1 ∩ U_2, U_1 ∩ U_3, U_2 ∩ U_3, U_1 ∩ U_2 ∩ U_3}. In a closed testing procedure (CTP) (14), an elementary hypothesis U_i is only rejected if every intersection hypothesis involving U_i is also rejected. For example, U_1 is rejected only if U_1, U_1 ∩ U_2, U_1 ∩ U_3 and U_1 ∩ U_2 ∩ U_3 are all rejected. Suppose R ⊆ {1, 2, ..., L} is a set indexing some or all of the L elementary hypotheses U_1 ... U_L, ψ_R is a test that controls the weak-sense FWER at level α for the compound hypothesis U_R = ∩_{i∈R} U_i, and p_ψ_R is the corresponding p-value. Then the following test controls the strong-sense FWER at level α: reject U_R only if p_ψ_R' ≤ α for every R' ⊇ R. [34]

A. Derivation of the closed testing procedure. Suppose R_0 is the unknown set indexing only the correct elementary hypotheses; this implies the existence of a (possibly empty) complementary set, R_0^c, indexing the incorrect elementary hypotheses.
Define the following events:
A: one or more of the correct elementary hypotheses U_i, i ∈ R_0, is rejected by the CTP.
B: the compound hypothesis U_R_0 = ∩_{i∈R_0} U_i is rejected by ψ_R_0.
Under the CTP, event A cannot occur without event B, because rejecting any correct elementary hypothesis requires rejecting every intersection hypothesis involving it, including U_R_0. Therefore Pr(A) ≤ Pr(B) ≤ α. This is the standard argument for control of the strong-sense FWER, and it is robust to non-independence between tests within the set of true null hypotheses (R_0). However, its first assumption, that the p-value of ψ_R_0 follows a Uniform(0, 1) distribution when U_R_0 is true, could be violated by correlation between the p-values of the true null hypotheses and true alternative hypotheses. If this assumption is invalid then control of the weak-sense FWER, and therefore the strong-sense FWER, cannot be guaranteed. This caveat would seem to apply to any CTP.
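The closure principle can be spelled out in a few lines of code, here with an unweighted Bonferroni test for each intersection hypothesis. This is a sketch; all names and the example p-values are mine:

```python
from itertools import combinations

def closure(indices):
    """All non-empty intersections of the elementary hypotheses, as index sets."""
    return [frozenset(c) for r in range(1, len(indices) + 1)
            for c in combinations(indices, r)]

def closed_test(p, alpha, intersection_test):
    """Reject elementary hypothesis i iff every intersection containing i is rejected."""
    idx = list(range(len(p)))
    rejected = {R for R in closure(idx) if intersection_test(p, R, alpha)}
    return {i for i in idx
            if all(R in rejected for R in closure(idx) if i in R)}

# Bonferroni intersection test (equal weights): reject U_R if min_{i in R} p_i <= alpha/|R|
bonf = lambda p, R, alpha: min(p[i] for i in R) <= alpha / len(R)

print(closed_test([0.001, 0.2, 0.3], 0.05, bonf))   # only the first hypothesis survives closure
```

The naive implementation enumerates all 2^L − 1 intersections, which is exactly the cost that the shortcut procedures discussed below avoid.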

B. Bonferroni achieves strong-sense FWER control.
Suppose that ψ_R records the result of a single weighted Bonferroni test controlling the weak-sense FWER for the intersection of elementary hypotheses indexed by set R:

ψ_R: reject U_R if p_i ≤ w_i α / w_R for any i ∈ R, [35]

where w_R = Σ_{i∈R} w_i. A naive implementation of the CTP involves (2^L − 1) of these tests. But in practically useful implementations of the CTP, shortcuts are found which mean not all tests need performing. In the case of Bonferroni, the test is a shortcut procedure that satisfies the CTP because its rejection implies that all intersection hypotheses involving the elementary hypotheses indexed by R will also be rejected, a property that is easier to see by rewriting the test, cancelling the w_R term on both sides. This means that only the L elementary hypotheses need directly testing to control the strong-sense FWER at level α using the Bonferroni test.
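The shortcut can be checked against the naive closure on a small example: rejecting the elementary test p_i ≤ w_i α, with weights summing to one, never rejects more than the full closed procedure built from Equation 35. A sketch, names mine:

```python
from itertools import combinations

def weighted_bonf(p, w, R, alpha):
    # weighted Bonferroni test of the intersection hypothesis indexed by R
    w_R = sum(w[i] for i in R)
    return any(p[i] <= w[i] * alpha / w_R for i in R)

def shortcut(p, w, alpha):
    # shortcut: reject elementary hypothesis i directly when p_i <= w_i * alpha
    return {i for i in range(len(p)) if p[i] <= w[i] * alpha}

p = [0.002, 0.04, 0.5, 0.9]
w = [0.25, 0.25, 0.25, 0.25]      # weights sum to one
alpha = 0.05

subsets = [frozenset(c) for r in range(1, 5) for c in combinations(range(4), r)]
rejected = {R for R in subsets if weighted_bonf(p, w, R, alpha)}
full = {i for i in range(4) if all(R in rejected for R in subsets if i in R)}

print(shortcut(p, w, alpha), full)   # the shortcut never exceeds the closure
```

Because w_R ≤ 1 for every subset R, the elementary threshold w_i α is the most stringent one encountered in the closure, which is why the shortcut is valid.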

C. A multilevel CTP for the HMP.
By the same argument, testing significance by directly interpreting the HMP as per Equation 6 is a multilevel CTP controlling the strong-sense FWER at level approximately α, as shown by Equation 7. By this test, any set that contains a significant subset is itself significant and, conversely, if the set of all hypotheses is not significant, no subset is significant. However, this shortcut procedure is only approximate because the exact significance threshold varies with the number of hypotheses combined (Table 1). The full CTP that controls the strong-sense FWER without sacrificing power is derived from Equation 34 (Equation 37). The disadvantage of this CTP is that many p-values must be considered for each test. Shortcut procedures can be defined that improve on a naive implementation of Equation 37, but instead one can apply weighted Bonferroni correction to make a simple adjustment to Equation 6 by substituting α_|R| for α. (Equivalently, one can apply weighted Bonferroni correction to Equation 4.) This produces a more conservative multilevel CTP than Equation 37 that is nevertheless guaranteed to control the strong-sense FWER, up to the accuracy of the Stable distribution approximation.
In practice, control of the strong-sense FWER can be achieved by a two-pass procedure in which the more lenient threshold α is used initially to discover candidate significant hypotheses, which are then confirmed using the exact threshold α_|R|.
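The two-pass idea can be sketched as follows. The lenient screen p̊_R ≤ α w_R follows the headline HMP procedure, and the exact per-size thresholds α_|R| must be supplied by the user, e.g. from Table 1 (they are not reproduced here); all names are mine:

```python
def hmp_subset(p, w, R):
    # weighted HMP of the subset R
    return sum(w[i] for i in R) / sum(w[i] / p[i] for i in R)

def significant(p, w, R, alpha, alpha_exact=None):
    """Two-pass test of subset R: screen at the lenient level alpha, then
    confirm with the size-adjusted threshold alpha_|R| (user-supplied)."""
    w_R = sum(w[i] for i in R)
    if hmp_subset(p, w, R) > alpha * w_R:
        return False                              # fails even the lenient screen
    a = alpha if alpha_exact is None else alpha_exact(len(R))
    return hmp_subset(p, w, R) <= a * w_R

p = [0.0001, 0.2, 0.3, 0.8]
w = [0.25] * 4
print(significant(p, w, [0], 0.05))        # strong single hypothesis
print(significant(p, w, [1, 2], 0.05))     # weak pair fails the screen
```

Since α_|R| is slightly below α, the first pass can only over-report candidates, never miss one, so the confirmation step loses no true discoveries.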

The relationship between the HMP and other combined tests
A. The HMP is closely related to but more powerful than Simes' procedure. In this section I show that the HMP is closely related to Simes' test (15). This provides a new, intuitive interpretation of Simes' test as an approximation to the harmonic mean p-value, by showing that it provides the smallest upper bound on the HMP based on a single p-value. Simes' test, as extended by Hochberg and Liberman (16) to include unequal weights, rejects when the smallest of the rank-adjusted p-values falls below α, where p*_(i) is the ith smallest value of the p*_j s and w_R = Σ_{i∈R} w_i. This test controls the weak-sense FWER at level α when max_i w_i / w_R ≤ 1/(α |R|), and is conservative otherwise. The following modified test, in which Σ_{i=1}^L w_i = 1, defines a shortcut multilevel CTP that controls the strong-sense FWER. Like the Bonferroni and HMP shortcut procedures, a set of alternative hypotheses will be significant by this modified Simes' test if any subset of those alternative hypotheses is significant. This is because when the rank of the p*_j defining the minimum test statistic in set R is i, its rank in any superset must be at least i as well. However, the converse is not certain.
To facilitate the comparison between Simes' test statistic and the HMP, it can be rewritten in terms of 'b-values' (inverse p-values), where b*_[i] denotes the ith largest value of the b*_j s. Compare this to the inverse HMP, b_R. Whether b_Simes is attained at the largest, second largest, third largest or any other of the b*_j s, each term of the maximum is bounded above by the weighted sum defining b_R, so b_Simes ≤ b_R. By extension, the Simes test statistic p_Simes = 1/b_Simes approximates the harmonic mean p-value p̊_R = 1/b_R from above, providing the smallest upper bound on the HMP possible using a single p-value. This means the behaviour of the two tests should be very similar. Indeed, not only has Simes' procedure been extended to unequal weights (16), it has also been shown to be robust to positive dependence between tests (17). Like the HMP, it combines tests adaptively, producing a Bonferroni-like p-value for 'needle-in-a-haystack' problems when one test dominates, but capitalizing on numerous strongly significant or suggestive tests when warranted to produce a smaller combined p-value. However, it makes less use of the data and does not share the property of the HMP that when the test is not significant, neither is any subset of the constituent tests, suggesting it may be less statistically efficient, and therefore less powerful. To investigate this, I conducted power simulations comparing the Bonferroni, Simes and HMP tests. Figure S7 shows that except for the 'needle-in-a-haystack' scenario, in which the alternative hypothesis is true for only a single test, the HMP substantially outperformed both the Bonferroni and Simes tests. Under a 'mixture of signals' scenario, in which the alternative hypothesis was true for 10% of tests, the HMP outperformed Simes and Bonferroni by 14% and 29% respectively. Under a 'subtle pervasive signal' scenario, in which the alternative hypothesis was true for all tests, but differed only subtly from the null hypothesis, the HMP outperformed Simes and Bonferroni by 45% and 75% respectively.
Even under the needle-in-a-haystack scenario, where Simes' and the HMP test performed similarly, they slightly outperformed Bonferroni correction by 1.2% and 1.4% respectively. The power of all tests was adversely affected by dependency and poorly-chosen, unequal weights, but the relative performance characteristics of the tests remained similar.
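The upper-bound relationship between the two statistics can be checked numerically: with equal weights, p̊ ≤ p_Simes for any vector of p-values, because each term i·b_[i] of the Simes maximum is bounded by the full sum of b-values. A sketch, equal weights assumed:

```python
import numpy as np

def hmp(p):
    p = np.asarray(p, dtype=float)
    return len(p) / np.sum(1.0 / p)

def simes(p):
    p = np.sort(np.asarray(p, dtype=float))
    ranks = np.arange(1, len(p) + 1)
    return np.min(len(p) * p / ranks)

pvals = [0.003, 0.02, 0.04, 0.3, 0.7]
print(hmp(pvals), simes(pvals))   # the HMP is never larger than the Simes p-value
```

When one p-value dominates, the two statistics nearly coincide, which matches the similar needle-in-a-haystack performance reported above; when several p-values are small, the HMP pools them and falls further below the Simes bound.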

B. Complementarity to Fisher's method for combining p-values.
Fisher's method (18, p. 103) provides another procedure for combining p-values that is often very powerful, but makes a strong assumption of independence between the tests. As I will show, Fisher's method and the HMP can be seen as complementary methods for combining tests when the alternative hypotheses are independent or mutually exclusive, respectively. The test statistic for Fisher's method can be written as the sum of the 'significances', expressed as the −log_e p-values, of the individual tests:

X_R = Σ_{i∈R} −log p_i. [42]

Under the null hypothesis, this quantity follows a Gamma(α = |R|, β = 1) distribution (18). Fisher's method can therefore be expressed as

p_Fisher = 1 − F_Gamma(X_R | |R|, 1). [43]

As for the HMP, Fisher's method can be interpreted in terms of a likelihood ratio test, an interpretation that is exact when the individual p-values are derived from likelihood ratio tests with ν = 2 degrees of freedom. In this form, it becomes apparent that Fisher's method combines a series of tests by testing the grand alternative hypothesis, ∩_{i∈R} A_i, against the grand null, ∩_{i∈R} O_i, where O_i and A_i represent the nested null and alternative hypothesis associated with likelihood ratio test i. In contrast, the HMP can be interpreted in the same circumstances as testing a model-averaged alternative hypothesis against the grand null, in which each individual alternative hypothesis is treated as mutually exclusive from the others. [44] Here the model weights, the w_i s, are interpreted as proportional to prior model probabilities on the mutually exclusive alternative hypotheses. A more detailed consideration of the role of the weights is provided in section §5C. Thus, the interpretation of what it means to reject a compound hypothesis (Equation 33) depends on the test.
For example, if U_i represents the acceptance of the null O_i over the alternative A_i, then rejection of the compound hypothesis U_i ∩ U_j corresponds most closely, in the case of Fisher's method, to acceptance of the grand alternative A_i and A_j together, whereas in the case of the HMP it corresponds most closely to acceptance of one of the mutually exclusive alternatives A_i or A_j. While the data must be independent for Fisher's method to be valid, in the case of the HMP the data do not need to be independent. In fact the same data can appear in multiple tests, in which case terms in Equation 44 will cancel, so that where the same data and same null hypothesis are used for every test, the interpretation simplifies to a test of the model-averaged alternative against the common null, as originally intended in Equation 1. This flexibility formalizes the idea that the HMP can be applied not only to p-values constructed from a series of likelihood ratio tests for the same data, same nested null hypothesis and different alternative hypotheses, but more widely as a general-purpose method for combining p-values. Indeed, there may be cases in which imposing the constraint that the alternative hypotheses are mutually exclusive is more powerful than testing the grand alternative against the grand null, as in Fisher's method. In this respect, the HMP is a complementary tool to Fisher's method.
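The contrast between the two interpretations shows up directly in the combined p-values. A sketch comparing both methods on illustrative inputs, with Fisher computed from its Gamma(|R|, 1) null distribution; the example p-values are arbitrary:

```python
import numpy as np
from scipy.stats import gamma

def fisher(p):
    """Fisher's method: -sum(log p) ~ Gamma(shape=|R|, rate=1) under the null."""
    p = np.asarray(p, dtype=float)
    return gamma.sf(-np.log(p).sum(), a=len(p))

def hmp(p):
    # equal-weights harmonic mean p-value
    p = np.asarray(p, dtype=float)
    return len(p) / np.sum(1.0 / p)

# pervasive moderate signal: Fisher accumulates evidence across all tests
pervasive = [0.06, 0.04, 0.07, 0.05, 0.06]
# needle in a haystack: one strong signal among nulls favours the HMP
needle = [1e-6, 0.5, 0.6, 0.7, 0.8]
print(fisher(pervasive), hmp(pervasive))
print(fisher(needle), hmp(needle))
```

Fisher's grand-alternative framing rewards many moderate signals, while the HMP's model-averaged framing rewards a single strong one, mirroring the power results of the simulation study.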
To investigate the power of the HMP relative to Fisher's method, and to illustrate situations in which the HMP is appropriate but Fisher's method is not, I repeated the simulation study from the previous subsection, including Fisher's method for combining p-values. For needle-in-a-haystack problems, Fisher's method showed considerably less power than Bonferroni, Simes' and the HMP tests ( Figure S8). This can be understood through the interpretation of Fisher's test as comparing the grand alternative to the grand null. When only 0.1% of tests are drawn from the alternative hypothesis, the distribution of p-values resembles the grand null more closely than the grand alternative, resulting in a 66% reduction in power compared to the HMP.
In the two other scenarios, where there were a mixture of signals or a subtle pervasive signal, the power of Fisher's method was 60% and 40% greater than the HMP respectively. However, in the presence of non-independence between tests, the superior performance of Fisher's test was counterbalanced by an extremely high false positive rate of 12%, around 2.5 times higher than the specified false positive rate of α = 5%. This reveals a trade-off between power and robustness: the assumption of independence between tests contributes to considerably higher power for Fisher's method, but the test is not robust when that assumption is violated, leading to inadmissibly high false positive rates.

[Figure S8. Simulations were conducted as per Figure S7, for both the false positive rate (left) and power (right). To calculate false positive rates, the parameter θ was set to one throughout; the three scenarios are therefore equivalent for this purpose. Fisher's method does not account for weights, so these were ignored.]

C. Position within Loughin's classification of methods for combining p-values.
Loughin (19) classified methods for combining p-values from independent tests into two broad categories. In the first, each p-value is transformed via an inverse cumulative distribution function into a variable that is then summed to produce the combined test statistic. Loughin's examples include Fisher's method, which utilizes the chi-squared distribution with ν = 2 degrees of freedom (18); Lancaster's, which allows variable degrees of freedom (20); Lipták's, which utilizes the standard normal distribution (21); Edgington's, which utilizes the uniform distribution (22); and Mudholkar and George's, which utilizes the logistic distribution (23). The second broad category of methods defines the test statistic in terms of an order statistic of the observed p-values. Again, this is compared against a null distribution to determine whether it is significantly extreme. Examples include the minimum and maximum p-values. With respect to this classification, the HMP appears to represent a novel method belonging to the first broad category, in which the transforming inverse cumulative distribution function is that of the Pareto distribution, with F⁻¹(1 − p_i) = 1/p_i.
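To make the distinction concrete, the two transforms can be sketched in a few lines (a minimal illustration of my own; the function names are not from the paper):

```python
import math

def fisher_statistic(pvalues):
    """Fisher's method: transform each p-value by -2*log(p), i.e. the inverse
    chi-squared(2) survival function, and sum. The sum is referred to a
    chi-squared distribution with 2L degrees of freedom (independent tests only)."""
    return sum(-2.0 * math.log(p) for p in pvalues)

def harmonic_mean_p(pvalues):
    """HMP with equal weights: the transform is the inverse Pareto CDF,
    F^{-1}(1 - p) = 1/p; the statistic is the harmonic mean of the p-values."""
    L = len(pvalues)
    return L / sum(1.0 / p for p in pvalues)
```

Because the harmonic mean of identical values is that value, harmonic_mean_p([0.05, 0.05, 0.05]) returns 0.05, whereas Fisher's statistic grows with the number of tests.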

Bayesian connections and comparing competing alternative hypotheses
A. Similarity between Bayesian model-averaged significance testing and the HMP. It is widely recounted that Bonferroni correction for multiple testing is conservative, especially when the tests are not independent, but comparison to Bayesian procedures suggests this view may be overstated in a specific sense. As a means of combining tests, §4 shows that the HMP and Simes' test comfortably outperform Bonferroni, except for 'needle-in-a-haystack' problems. However, as a means of testing whether a specific individual alternative hypothesis is significantly better than the null hypothesis, taking into account the total number of alternative hypotheses, the Bonferroni procedure is appropriate. The Bonferroni procedure utilizes Boole's inequality to control the family-wise error rate (FWER) at level α, or some more stringent level below α: with weights w_i summing to one, FWER = Pr(⋃_{i=1}^L {p_i ≤ α w_i}) ≤ Σ_{i=1}^L Pr(p_i ≤ α w_i) = α. The FWER is controlled at exactly level α if only one p-value can be significant at once, which is not usually the case. While this explains why combined tests have been devised that control the FWER at exactly α, and are therefore more powerful, comparison to the Bayesian approach suggests that the Bonferroni procedure is not a fundamentally conservative approach to assessing the significance of a specific alternative hypothesis while correcting for the multiplicity of alternative hypotheses. This issue is relevant to the HMP short-cut procedure (Equation 6), which improves power by simultaneously performing combined tests at multiple levels, but in which the rejection of the null hypothesis in favour of a specific alternative hypothesis, or set of alternative hypotheses, R, is subject to the combined p-value satisfying p̊_R ≤ α_{|R|} w_R. For the purpose of comparison to the Bayesian approach, suppose that the alternative hypotheses are mutually exclusive. This assumption is not restrictive because the outcomes of any group of non-mutually exclusive tests can, in principle, be fully enumerated to form an exhaustive list of mutually exclusive outcomes.
Consider L mutually exclusive alternative hypotheses M_i, i = 1 ... L, and a common null hypothesis M_0. The HMP employs weights w_i with Σ_{i=1}^L w_i = 1. The two procedures reject the null hypothesis M_0 in favour of the model-averaged alternative hypothesis M_R when their respective (Bayesian and HMP) criteria are met. Since, for small α (e.g. α ≤ 0.05), α_{|R|} ≈ α, this shows there exists a combination of probability threshold α, weights {w_i, 1 ≤ i ≤ L}, prior odds μ_0, and prior probabilities {μ_i, 1 ≤ i ≤ L} for which the Bayesian procedure would reject the null in favour of the same set of alternative hypotheses and model-averaged alternative hypotheses as the HMP. Note, however, that the scales of the Bayes factor and the b-value are not identical. Nevertheless, this reiterates the observation that the inverse p-values combine approximately linearly, like Bayes factors, at least when they are large. This relationship is formalized in a Bayesian interpretation of the HMP below. The parallels between the Bayesian approach to model-averaged hypothesis testing and the HMP are instructive because they offer an alternative motivation for the underlying Bonferroni multiple-testing correction: one that does not view the procedure as fundamentally flawed because it is conservative, but rather as a natural approach to testing specific hypotheses or groups of hypotheses. For advocates of the false discovery rate (FDR), for whom even control of the FWER is conservative, the Bonferroni procedure is doubly conservative. Yet the Bayesian connection, which is developed further below, suggests that, contrary to this widespread critique, both Bonferroni correction and, by implication, the FWER are frequentist analogs of natural Bayesian procedures.

B. Parallels between Bayesian model-averaging and Simes' method.
Simes' procedure also parallels the Bayesian model-averaging approach, and this demonstrates the close connection between the HMP and Simes' procedure. The two procedures reject the null hypothesis M_0 in favour of the alternative set of hypotheses M_R when their respective (Bayesian and Simes') criteria are met. This reiterates that there exists a combination of significance threshold, weights and prior model probabilities for which the Bayesian procedure would reject the null in favour of the same set of alternative hypotheses as the classical procedure, in this case Simes'. It also reiterates the lower statistical efficiency of Simes' procedure compared to the HMP, because the Bayesian criterion analogous to Simes' procedure for rejecting the null hypothesis will always be smaller, and therefore less powerful, than the Bayesian criterion analogous to the HMP for the same data. In these tests, the ith largest value of the weighted Bayes factor or weighted b-value defines the minimum threshold α at which the null hypothesis would be rejected. This means that when the test is marginally significant, all of the top i alternative hypotheses are needed as a group to reject the null, and only groups of alternative hypotheses including them will be significant.
Simes also proposed the Benjamini-Hochberg (BH) procedure as an exploratory tool for prioritizing alternative hypotheses for further investigation (15). Benjamini and Hochberg subsequently showed (24) that this procedure controls the expected false discovery rate (FDR) at level α. In the BH procedure, with ordered p-values p_(1) ≤ ... ≤ p_(L), the null is rejected in favour of all alternative hypotheses M_i for which p_(i) ≤ i α / L, taking the largest such i. This seems to define the set of alternative hypotheses that, together with any (i − 1) alternatives with larger weighted b-values, would reject the null hypothesis in favour of the i alternatives. In this sense, the null is really rejected in favour of the alternatives as groups or sets of increasing size, rather than individually. That the HMP makes this interpretation explicit can be seen as another advantage over the BH procedure.
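As a concrete reference point, the BH step-up rule described above can be sketched as follows (a standard unweighted implementation, not code from the paper):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up: sort the p-values and find the largest rank i such that
    p_(i) <= i * alpha / L; reject the null in favour of the i alternatives
    with the smallest p-values."""
    L = len(pvalues)
    order = sorted(range(L), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / L:
            k = rank
    return sorted(order[:k])  # indices of rejected null hypotheses
```

Note the group character of the rule: a p-value may be rejected only because smaller p-values ahead of it cleared their thresholds, echoing the set-wise interpretation above.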
C. Interpretation of the harmonic mean p-value as a Bayesian procedure. The HMP can be interpreted as directly proportional to a Bayes factor, subject to the following assumptions: 1. The distribution of the p-values under the alternative hypotheses can be approximated by Beta distributions.
2. The power of the tests is good, so that the mean p-values are much less than 1 under the alternative hypotheses.
3. The model weights account for the prior model probabilities and the mean p-values under the alternatives.
The construction of a Bayes factor from the HMP proceeds as follows. Since the p-value can be thought of as a low-dimensional summary of the full data, a Bayes factor can be defined as BF_i = f_{A_i}(p_i) / f_{O_i}(p_i), where f_{O_i}(p) and f_{A_i}(p) represent the probability density functions of the p-value under the null and alternative hypotheses respectively. By definition, the p-value follows a Uniform(0, 1) distribution under the null, so f_{O_i}(p_i) = 1. Sellke, Bayarri and Berger (25) considered a Beta(ξ, 1) distribution for the p-value under the alternative, with 0 < ξ ≤ 1, so the Bayes factor can be written BF_i = ξ_i p_i^{ξ_i − 1}. This would imply that the power of the hypothesis test has the form Pr(p_i ≤ α) = α^{ξ_i}, and the mean p-value under the alternative hypothesis would be E[p_i] = ξ_i / (1 + ξ_i) ≈ ξ_i when ξ_i ≪ 1. [51] The model-averaged Bayes factor for the set of alternative hypotheses R versus the common null hypothesis would then be BF_R = Σ_{i∈R} μ_i BF_i ≈ ξ̄ Σ_{i∈R} w_i / p_i (using BF_i ≈ ξ_i / p_i for ξ_i ≪ 1), with weights w_i = μ_i ξ_i / ξ̄, [52] where μ_i is the prior probability of alternative hypothesis M_i, conditional on rejection of the null M_0, so that ξ̄ = Σ_{i=1}^L μ_i ξ_i is the mean p-value averaged across all alternative hypotheses. An important point to note is that more powerful tests are penalized in the calculation of the Bayes factor, because the model weight w_i incorporates ξ_i, the mean p-value under the alternative, which is smaller for more powerful tests. The same weighting is obtained by optimizing the weights to maximize the frequentist power of the HMP (see below).
The Bayesian interpretation of the HMP allows posterior model probabilities to be calculated. For example, in the event that the null hypothesis is rejected, the posterior probability of hypothesis M_i can be computed as Pr(M_i | data, rejection of M_0) ≈ (w_i / p_i) / Σ_{j=1}^L (w_j / p_j). In this way, 100(1 − α)% credibility intervals can be constructed by taking the smallest group of alternative hypotheses whose posterior probabilities, conditional on rejection of the null hypothesis, sum to at least 1 − α. The harmonic mean p-value can also be written in the form of a Bayesian Information Criterion (26, 27). [54] As above, this interpretation assumes that the weights are proportional to the product of the prior model probabilities and the mean p-values under the alternative hypotheses.
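A minimal sketch of this calculation, assuming the approximation Pr(M_i | rejection of M_0) ∝ w_i/p_i described above (the function names are mine):

```python
def posterior_probs(pvalues, weights):
    """Approximate posterior model probabilities, conditional on rejecting the
    null: Pr(M_i | data, not M_0) ~ (w_i / p_i) / sum_j (w_j / p_j)."""
    scores = [w / p for w, p in zip(weights, pvalues)]
    total = sum(scores)
    return [s / total for s in scores]

def credibility_set(posterior, alpha=0.05):
    """Smallest group of alternative hypotheses whose posterior probabilities
    sum to at least 1 - alpha (a 100(1 - alpha)% credibility set)."""
    order = sorted(range(len(posterior)), key=lambda i: -posterior[i])
    total, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        total += posterior[i]
        if total >= 1.0 - alpha:
            break
    return sorted(chosen)
```

With one very small p-value among many moderate ones, the credibility set collapses onto the corresponding alternative hypothesis.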

The Bayesian interpretation of the HMP raises three questions:
1. How accurate is the approximation?
2. What are the values of ξ?
3. What does use of equal prior weights imply?
Figure S9 addresses the first of these questions, showing that when the mean p-value under the alternative is ξ = 0.1, the approximation is already reasonably accurate, and for values below this it is very good. For context, when ξ = 0.1 the power is α^ξ = 0.74 at the conventional significance threshold of α = 0.05. This reiterates the requirement that the power must be good in order to interpret the HMP directly as a Bayes factor.
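Under the Beta(ξ, 1) model these quantities are available in closed form, which makes the figures above easy to check (a small sketch of my own):

```python
def beta_alternative_properties(xi, alpha=0.05):
    """Under the alternative, p ~ Beta(xi, 1) with density f(p) = xi * p**(xi - 1).
    Power is Pr(p <= alpha) = alpha**xi; the mean p-value is xi / (1 + xi),
    which is approximately xi when xi << 1."""
    power = alpha ** xi
    mean_p = xi / (1.0 + xi)
    return power, mean_p
```

For ξ = 0.1 and α = 0.05 this gives power 0.05^0.1 ≈ 0.74, the figure quoted above.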
The question of what values the ξ_i s take requires an assumption about the value, or distribution of values, of the parameters of the alternative hypothesis. Power is usually specified a priori, so this assumption amounts to a prior distribution, based on which power can be computed using a standard formula or numerically, for example by simulation. The prior distribution used for calculating power can be obtained in different ways, for example from previous data or by specifying a 'subjective' prior. Methods that specify an 'objective' prior that is improper can be used neither for Bayesian hypothesis testing nor for interpreting the HMP as a Bayes factor because, in the latter case, ξ would be undefined.
To take the multiple regression model used in the GWAS as an example, the power of each test can be calculated as follows. Suppose that only biallelic variants are considered, numerically encoded in an n × L matrix G as 0 (common allele) or 1 (rare allele). Suppose that the vector y records the phenotypes and X is an n × c matrix of covariates, the first column of which is a vector of ones, corresponding to the intercept term, and the last column of which contains the genotypes from G_{·i}. The other columns contain any other relevant covariates. The maximum likelihood estimate of the effect of genetic variant G_{·i} on the phenotype y is then the final element of (Xᵀ X)⁻¹ Xᵀ y, and the variance of the estimator is V(β̂_i) = σ² (Xᵀ X)⁻¹_{cc}, where σ² is the residual variance of the phenotype. For large samples the test statistic follows a Normal(0, 1) distribution under the null hypothesis M_0, in which β_i = β_0. If the parameter takes the value β_i = β_A under alternative hypothesis M_i and a priori it is assumed that β_A − β_0 ∼ Normal(0, σ²_β), then the test statistic S_i follows a Normal(0, 1 + σ²_β / V(β̂_i)) distribution under the alternative hypothesis M_i. The power of the test is Pr(S_i² ≥ Q_{χ²₁}(1 − α)), where Q_{χ²₁} is the quantile function of a χ² distribution with one degree of freedom. The mean p-value under M_i is ξ_i = E[1 − F_{χ²₁}(S_i²)], which can be computed numerically for any given value of σ²_β / V(β̂_i). Thus power is computable by making a prior assumption about the relative magnitude of the genetic effects on the phenotype compared to the unexplained variance of the phenotype. If the ratio σ²_β / σ² is assumed constant over all genetic variants, then power will be greater when (Xᵀ X)⁻¹_{cc} is smaller. The choice of σ²_β / σ² could be based on prior information or subjective belief. An alternative approach is to find an 'objective' way to choose the prior. One objective could be to ensure that the condition ξ_i ≪ 1 is met for all tests, thereby validating the approximation of the Bayes factor from the HMP (Equation 52).
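The power and mean p-value of a single test can be computed numerically from the ratio σ²_β/V(β̂_i) alone, since the test statistic is Normal(0, 1 + σ²_β/V) under the alternative. The sketch below (my own, stdlib-only, using the two-sided z-test formulation equivalent to the χ²₁ test) illustrates the calculation:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_and_mean_p(ratio, alpha=0.05, n_grid=20000):
    """ratio = sigma_beta^2 / V(beta_hat). Under the alternative the test
    statistic S is Normal(0, tau^2) with tau^2 = 1 + ratio; the two-sided
    p-value is p = 2*(1 - Phi(|S|)), equivalent to 1 - F_{chi2_1}(S^2)."""
    tau = math.sqrt(1.0 + ratio)
    # Critical value: z such that 2*(1 - Phi(z)) = alpha, found by bisection
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 2.0 * (1.0 - Phi(mid)) > alpha:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    power = 2.0 * (1.0 - Phi(z / tau))  # Pr(|S| > z) for S ~ Normal(0, tau^2)
    # Mean p-value xi = E[2*(1 - Phi(|S|))], by midpoint quadrature on [0, 8*tau]
    xi, width = 0.0, 8.0 * tau / n_grid
    for k in range(n_grid):
        s = (k + 0.5) * width
        half_normal = 2.0 * math.exp(-0.5 * (s / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        xi += 2.0 * (1.0 - Phi(s)) * half_normal * width
    return power, xi
```

When ratio = 0 the calculation recovers the null values, power ≈ α and ξ ≈ 0.5; power grows and ξ shrinks as the prior effect size increases relative to the estimator's variance.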
Thus, the desired power could be specified a priori, and the value of σ²_β / σ² required to achieve that power calculated per test. If all tests were assumed to attain the same power, that would imply larger effects are expected a priori for tests with inherently lower power. This is a defensible approach because it effectively penalizes inherently lower-powered tests by requiring a higher standard of evidence, in terms of effect size, for rejection of the null hypothesis. The penalty is imposed through the ξ_i term in the Bayes factor, which is smaller for tests that are assumed to be more powerful. In the context of GWAS for human disease, such priors are common and are advocated on the basis that risk alleles of larger effect would be maintained at lower frequency by natural selection, an assumption that can be tested in practice (28). Therefore the use of equal weights can be motivated by a uniform prior over the model probabilities for the alternative hypotheses, in conjunction with a prior that expects larger effect sizes for alternative hypotheses whose tests are inherently less powerful.
Optimal weights for the most powerful HMP. Interestingly, the weights suggested by the Bayesian interpretation of the HMP can be motivated by an argument based on optimizing the power of the HMP. The power of the HMP can be approximated roughly as Σ_{i=1}^L μ_i (α w_i)^{ξ_i}. [60] This expression can be maximized subject to the constraint that Σ_{i=1}^L w_i = 1 by writing the Lagrangian function and solving the derivative with respect to w_i, which for ξ_i ≪ 1 gives w_i = μ_i ξ_i / λ, where the constant λ imposes the constraint that Σ_{i=1}^L w_i = 1. So power appears to be optimized in the frequentist setting by (i) up-weighting the inverse p-values of alternative hypotheses that are more probable a priori and (ii) down-weighting the inverse p-values of more powerful tests, recapitulating the weights required for a Bayesian interpretation of the HMP.

D. Relaxing significance thresholds when multiple alternative hypotheses are true. Some procedures for controlling the FDR explicitly estimate the proportion of alternative hypotheses that are true, reducing the stringency of the threshold for calling a test significant when a larger proportion of alternatives is estimated to be true (see e.g. refs (29)(30)(31)(32)(33)). The question arises whether similar reasoning can be used to motivate a relaxation of the significance threshold in the model-averaging setting.
Consider, for example, a GWAS in bacteria for a trait of interest, such as antimicrobial resistance (34, 35). Antimicrobial resistance is often influenced by multiple loci, but the tests at different loci cannot be treated as independent because of strong dependency (i.e. linkage disequilibrium) between loci, caused by relatively low rates of genetic exchange and strong structuring of populations into sub-populations or 'strains' (35). To formally test for the involvement of multiple loci in the trait, it would be necessary to fit all higher-order models involving two or more loci. However, to test for the involvement of K loci would involve (L choose K) tests, which quickly becomes very large even for modest K.
Even in human GWAS, tests at different loci can never be truly independent because the same outcomes (the phenotypes) are shared between analyses. However, it is often considered appropriate to test associations between multiple loci and a single trait as if they were independent, particularly for loci that are physically distant or on different chromosomes (i.e. unlinked loci), and it is in this context that FDR arguments are applied to human GWAS (29). Since the control of the FDR in human GWAS carries an explicit assumption that multiple loci might be associated with the trait, it therefore carries an implicit assumption that the signal of association at any individual locus tested alone is in some sense averaged over all the other loci which might also be associated. In other words, the so-called 'marginal' (i.e. single locus) test for the association is implicitly assumed to approximate an equivalent model-averaged test.
Stating this explicitly, arguments in favour of controlling the FDR in GWAS assume, perhaps optimistically, that the Bayes factor or p-value for the marginal test approximates a model-averaged Bayes factor or p-value. Let M_{i•} denote the set of all models involving locus i and (K − 1) other loci, of which there are (L−1 choose K−1) combinations. The criterion for declaring a significant association involving a particular locus i, according to this approximation, is the corresponding Bayesian or HMP criterion applied to M_{i•}. In the case of equal weights among the (L choose K) hypotheses, w_{i•} = (L−1 choose K−1) / (L choose K) = K/L, which is a factor K bigger than w_i = 1/L. Therefore this approximation motivates relaxing the stringency of the significance threshold by a factor of K, where K is the number of loci involved in the trait, so that the threshold becomes a function of the proportion of alternative hypotheses expected to be true, rather than simply the number of tests. The obvious problem is that either a strong assumption regarding the true value of K has to be made, or K has to be estimated somehow, e.g. by explicitly evaluating the higher-order tests over all sets of K loci for different values of K. If this latter procedure were computationally feasible, the approximation would be unnecessary. Nevertheless, the effective relaxation of the multiple-testing threshold would still be gained, because w_{i•} would inevitably increase if more loci were involved in the trait.
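The weight identity is easy to verify directly (a quick check of my own using Python's math.comb):

```python
from math import comb

def marginal_set_weight(L, K):
    """Weight of the set M_{i.} of all models containing a given locus i,
    under equal weights across all (L choose K) K-locus models:
    (L-1 choose K-1) / (L choose K), which simplifies to K / L."""
    return comb(L - 1, K - 1) / comb(L, K)
```

For example, with L = 1000 loci and K = 5 causal loci, the weight is 5/1000, a K-fold relaxation relative to the single-model weight 1/L.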

GWAS of neuroticism
I downloaded the p-values for 6 524 432 directly assayed and imputed variants arising from a meta-analysis of neuroticism in 170 911 individuals (36), one of many human GWAS catalogued by LD Hub (37), whose summary statistics are freely publicly available (http://ssgac.org/documents/Neuroticism_Full.txt.gz). These p-values were calculated from a meta-analysis of numerous cohorts using METAL (38). METAL conducts a z-test, which is analogous to a likelihood ratio test of whether the effect of each variant on the trait differs from zero. The common null hypothesis is that no variant has an effect. I defined overlapping sliding windows of various sizes: 10 kb, 100 kb, 1000 kb, 10 Mb, entire chromosomes and the whole genome. I computed the HMP (Equation 8) and corresponding p-value (Equation 4) for each window. I computed adjusted values by dividing the HMP and corresponding p-value for the region R spanned by each window by w_R, the sum of the weights across all variants within the window. Since I assumed equal weights across variants, w_R equalled the proportion of all variants lying within the window. I then compared the adjusted HMPs and p-values directly to the significance threshold α = 0.05.
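The windowed analysis can be sketched as follows, assuming equal weights 1/L across all L variants genome-wide (a minimal illustration of my own, not the analysis code):

```python
def hmp_weighted(pvalues, weights):
    """Weighted harmonic mean p-value for a window R:
    hmp_R = sum_{i in R} w_i / sum_{i in R} (w_i / p_i)."""
    return sum(weights) / sum(w / p for w, p in zip(weights, pvalues))

def adjusted_window_hmp(window_pvalues, L_total):
    """Adjusted HMP for a window: divide by w_R, the window's share of the
    total weight, for direct comparison with the threshold alpha."""
    weights = [1.0 / L_total] * len(window_pvalues)
    w_R = len(window_pvalues) / L_total
    return hmp_weighted(window_pvalues, weights) / w_R
```

Dividing by w_R penalizes small windows, analogous to a multiple-testing correction, while leaving the whole-genome HMP (w_R = 1) unadjusted.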

GWAS of HCV pre-treatment viral load
I applied to the STOP-HCV consortium (http://www.stop-hcv.ox.ac.uk/data-access) for access to the 399 420 human genotypes and 410 viral polyprotein sequences from a GWAS of pre-treatment viral load in 410 white European patients with HCV genotype 3a (39). I imputed missing base calls and indels in the aligned HCV nucleotide sequences using ClonalFrameML (40), having estimated a maximum likelihood phylogeny using PhyML (41). I converted to a codon alignment and identified 827 variable codons in which the frequency of the consensus codon was less than 95%. I conducted 330 320 340 tests of association between log₁₀ pre-treatment viral load and the interaction between each pair of human and virus variants. I fitted a linear model with the following factors and covariates: human variant, virus variant, an interaction between human variant and virus variant, and, following (39), the leading three principal components (PCs) of human genetic variation, the leading ten PCs of virus genetic variation, patient sex and an indicator of sustained virological response (SVR). I assumed codominance between human alleles. Each p-value was produced by a likelihood ratio test between the fitted model and the common null hypothesis in which only the three human PCs, ten virus PCs, sex and SVR were included. To quantify the relative strength of evidence among the alternative hypotheses, having rejected the common null hypothesis, I calculated the approximate 'post-data' or 'posterior' probability of each particular alternative hypothesis M_i. This was possible because I treated the model formally as a random effect in which the model weights take into account a 'pre-data' or 'prior' probability distribution and the power of each test, analogous to, e.g., posterior decoding in a hidden Markov model: Pr(M_i | data, rejection of M_0) ≈ (w_i / p_i) / Σ_{j=1}^L (w_j / p_j). [64] This approximation is justified in §5. I took L = L_H L_P, the p_i s from each of the L likelihood ratio tests, and equal weights w_i = 1/(L_H L_P).