Higher criticism thresholding: Optimal feature selection when useful features are rare and weak
See allHide authors and affiliations

Contributed by David Donoho, August 2, 2008 (received for review July 8, 2008)
Abstract
In important application fields today—genomics and proteomics are examples—selecting a small subset of useful features is crucial for success of Linear Classification Analysis. We study feature selection by thresholding of feature Zscores and introduce a principle of threshold selection, based on the notion of higher criticism (HC). For i = 1, 2, …, p, let π_{i} denote the twosided Pvalue associated with the ith feature Zscore and π_{(i)} denote the ith order statistic of the collection of Pvalues. The HC threshold is the absolute Zscore corresponding to the Pvalue maximizing the HC objective (i/p − π_{(i)})/
The modern era of highthroughput data collection creates data in abundance; however, this data glut poses new challenges. Consider a simple model of linear classifier training. We have a set of labeled training samples (Y_{i}, X_{i}), i = 1, …, n, where each label Y_{i} is ±1 and each feature vector X_{i} ∈ R^{p}. For simplicity, we assume the training set contains equal numbers of 1's and −1's and that the feature vectors X_{i} ∈ R^{p} obey X_{i} ∼ N(Y_{i}μ, Σ), i = 1, …, n, for an unknown mean contrast vector μ ∈ R^{p}; here, Σ denotes the feature covariance matrix and n is the training set size. In this simple setting, one ordinarily uses linear classifiers, taking the general form L(X) = Σ_{j}^{p} = 1 w(j)X(j), for a sequence of “feature weights” w = (w(j): j = 1, …, p).
Classical theory going back to R. A. Fisher (1) shows that the optimal classifier has feature weights w ∝ Σ^{−1} μ; at first glance, linear classifier design seems straightforward and settled. However, in many of today's most active application areas, it is a major challenge to construct linear classifiers that work well.
In many ambitious modern applications—genomics and proteomics come to mind—measurements are automatically made on thousands of standard features, but in a given project, the number of observations, n, might be in the dozens or hundreds. In such settings, p ≫ n, which makes it difficult or impossible to estimate the feature covariance straightforwardly. In such settings one often ignores feature covariances. Working in standardized feature space where individual features have mean zero and variance one, a bynow standard choice uses weights w(j) ∝ Cov(Y, X(j)) ≡ μ(j) (2, 3). Even when this reduction makes sense, further challenges remain.
When Useful Features Are Rare and Weak
In some important applications, standard measurements generate many features automatically, few of which are likely to be useful in any specific project, but researchers do not know in advance which ones will be useful in a given project. Moreover, reported misclassification rates are relatively high. Hence, the dimension p of the feature vector is very large, and although there may be numerous useful features, they are relatively rare and individually quite weak.
Consider the following rare/weak feature model (RW feature model). We suppose the contrast vector μ to be nonzero in only k out of p elements, where ε = k/p is small, that is, close to zero. As an example, we might have p = 10,000, k = 100, and so ε = k/p = 0.01. In addition, we suppose that the nonzero elements of μ have common amplitude μ_{0}. Because the elements X(j) of the feature vector where μ(j) = 0 are entirely uninformative about the value of Y(j), only the k features where μ(j) = μ_{0} are useful. The problem is how to identify and benefit from those rare, weak features. Setting
Naïve application of the formula w ∝ Cov(Y, X) in the RW setting often leads to very poor results; the vector of empirical covariances (
Feature Selection by Thresholding
Feature selection, that is, working only with an empirically selected subset of features, is a standard response to data glut. Here, and below, we suppose that feature correlations can be ignored and that features are standardized to variance one. We consider subset selectors based on the vector of feature Zscores with components Z(j) = n^{−1/2} Σ_{i}Y_{i}X_{i}(j), j = 1, …, p. These are the Zscores of twosided tests of H_{0,j}: Cov(Y, X(j)) = 0. Under our assumptions Z ∼ N(θ, I_{p}), where θ =
Definition 1:
Let ★ ∈ {soft, hard, clip}. The threshold feature selection classifier makes its decision based on L_{t}^{★} <> 0, where L̂_{t}^{★}(X) = Σ_{j}^{p} = 1 ŵ_{t}^{★}(j) X(j), and ŵ_{t}^{★}(j) = η_{t}^{★} (Z(j)), j = 1, …, p.
In words, the classifier sums across features with large trainingset Zscores, using a simple function of the Zscore to weight the corresponding feature appropriately.
Well known methods for linear classification follow this approach: The Shrunken Centroids method (9) reduces, in our twoclass setting, to a variant of soft thresholding; the schemes discussed in refs. 10 and 11 are variants of hard thresholding.
Thresholding has been popular in estimation for more than a decade (4); it is known to be successful in “sparse” settings where the estimand has many coordinates, of which only a relatively few coordinates are significantly nonzero. Although classification is not the same as estimation, an appropriate theory for thresholding can be constructed (unpublished work) showing that threshold feature classifiers with ideally chosen thresholds work well and even optimally control the misclassification rate.
One crucial question remains: how to choose the threshold based on the data? Related proposals for threshold choice include crossvalidation (9), control of the false discovery rate (12–14), and control of the local false discovery rate (15).
Higher Criticism
We propose a method of threshold choice based on recent work in the field of multiple comparisons.
HC Testing.
Suppose we have a collection of N Pvalues π_{i}, which under the global null hypothesis are uniformly distributed: π_{i} ∼_{iid} U[0,1]. We perform the increasing rearrangement into order statistics: π_{(1)} ≤ π_{(2)} ≤ … ≤ π_{(N)}; and we note that, under the null hypothesis, these order statistics have the usual properties of uniform order statistics, including the asymptotic normality π_{(i)} ∼_{approx} Normal(i/N, i/N(1 − i/N)). The ordered Pvalues may be compared with such properties, leading to the following notion.
Definition 2 (HC testing) (7):
The higher criticism (HC) objective is Fix α_{0} ∈ (0, 1) (e.g., α_{0} = 1/10). The HC test statistic is HC* = max_{1 ≤ i ≤ α0N} HC(i; π_{(i)}).
In practice, HC* is typically insensitive to the selection of α, especially in rare/weak situations. The HCobjective function is the “Zscore of the Pvalue,” that is, a standardized quantity with asymptotic distribution N(0, 1) under the null hypothesis. In words, we look for the largest standardized discrepancy between the expected behavior of the π_{i} under uniformity and the observed behavior. When this is large, the whole collection of Pvalues is not consistent with the global null hypothesis. The phrase “higher criticism” reflects the shift in emphasis from single test results to the whole collection of tests (7). The HC test statistic was developed to detect the presence of a small fraction of nonnull hypotheses among many truly null hypotheses (7). Note: there are several variants of HC statistic; we discuss only one variant in this brief note; the main results of ref. 7 still apply to this variant. For full discussion, see ref. 7 and unpublished work by the authors.
HC Thresholding.
Return to the classification setting in previous sections. We have a vector of feature Zscores (Z(j), j = 1, …, p). We apply HC notions by translating the Zscores into twosided Pvalues, and maximizing the HC objective over index i in the appropriate range. Mixing standard HC notations with standard multivariate data notation requires a bit of care. Please recall that p always refers to the number of measured classifier features, whereas terms such as “Pvalue” and “π_{(i)}” refer to unrelated concepts in the HC setting. In an attempt to avoid notational confusion, let N ≡ p and sometimes use N in place of p. Define the feature Pvalues π_{i} = Prob{N(0, 1) > Z(i)}, i = 1, …, N; and define the increasing rearrangement π_{(i)}, the HC objective function HC(i; π_{(i)}), and the increasing rearrangement Z_{(i)} correspondingly. Here, is our proposal.
Definition 3 (HC thresholding):
Apply the HC test to the feature Pvalues. Let the maximum HC objective be achieved at index î. The higher criticism threshold (HCT) is the value t̂^{HC} = Z_{(î)}. The HC threshold feature selector selects features with Zscores exceeding t̂^{HC} in magnitude.
Fig. 1 illustrates the procedure. Fig. 1A shows a sample of Zscores, B shows a PP plot of the corresponding ordered Pvalues versus i/N, and C shows a standardized PP plot. The standardized PP plot has its largest deviation from zero at î, and this generates the threshold value.
Performance of HCT in RW Feature Model
In the RW(ε,τ) model, the feature Zscores vector Z ∼ N(θ, I_{p}), where θ is a sparse vector with fraction ε of entries all equal to τ and all other entries equal to 0.
Fig. 2 exhibits results from a collection of problems all with p = 1,000 features, of which only 50 are truly useful, that is, have θ nonzero in that coordinate, so that in each case the fraction of useful features is ε = 50/1,000 = 0.05. In this collection, the amplitude τ of nonzeros varies from 1 to 3. Here, the useful features are indeed weak: they have expected Zscores typically lower than some Zscores of useless coordinates.
We compare HC thresholding with three other thresholding rules: (i) FDRT(.5), thresholding with false feature discovery rate (FDR) control parameter q = 0.5; (ii) FDRT(.1), thresholding with false feature discovery rate control parameter q = 0.1; and (iii) Bonferroni, setting the threshold so that the expected number of false features is 1. These three rules illustrate what we believe to be today's orthodox opinion, which strives to ensure that most features in the classification rule are truly useful, and to strictly control the number of useless features present in the trained classifier. Local false discovery rate control shares the same philosophy. We generated 1,000 Monte Carlo realizations at each choice of parameters. We present results in terms of the dimensionless parameter τ, which is independent of n; if desired, the reader may choose to translate these results into the form μ_{0} = τ/
HCT Functional and Ideal Thresholding
We now develop connections between HCT and other important notions.
HCT Functional.
The HCT functional is, informally, the “threshold that HCT is trying to estimate.” More precisely, note that, in the RW(ε, τ) model, the empirical distribution function F_{n,p} of feature Zscores F_{n,p}(t) = Ave_{j}I{Z(j) ≤ _{t}}, approximates, for large p and n arbitrary, the theoretical CDF F_{ε,τ}(t) = (1 − ε) Φ(t) + εΦ(t − τ), t ∈ R, where Φ(t) = P{N(0, 1) ≤ t} is the standard normal distribution. The HCT functional is the result of the HCT recipe on systematically replacing F_{n,p}(t) by F_{ε,τ}(t).
We define the underlying true positive rate, TPR(t); the false positive rate, FPR(t); and the positive rate, PR(t), in the natural way as the expected proportions of, respectively, the useful, the useless, and of all features, having Zscores above threshold t. The HC objective functional can be rewritten (up to rescaling) as In the RW(ε,τ) model, we have TPR(t; ε, τ) = Φ(t − τ) + Φ(−t − τ), FPR(t; ε, τ) = 2Φ(−t), and PR(t; ε, τ) = (1 − ε) FPR(t) + ε·TPR(t). Let t_{0} = t_{0}(ε,τ) denote the threshold corresponding to the maximization limit α_{0} in Definition 2: PR(t_{0}; ε,τ) = α_{0}. The HCT functional solves a simple maximization in t: Rigorous justification of this formula is supplied in ref. 12, showing that in the RW(ε,τ) model, t̂_{n,p}^{HC} converges in probability to T_{HC}(F_{ε}, τ) as p goes to infinity with n either fixed or increasing; so indeed, this is what HCT is “trying to estimate.”
Ideal Threshold.
We now study the threshold that (if we only knew it!) would provide optimal classifier performance. Recall that, in our setting, the feature covariance is the identity Σ = I_{p}; the quantity Sep(w; μ) = w′μ/‖w‖_{2} is a fundamental measure of linear classifier performance. The misclassification rate of the trained weights ŵ on independent test data with true contrast vector μ obeys where again Φ is the standard normal N(0, 1) CDF. Hence, maximizing Sep is a proxy for minimizing misclassification rate (for more details, see unpublished work).
For a fixed threshold t, let Sep(w_{clip}^{t}; μ) denote the realized value of Sep on a specific realization. For large p and n arbitrary, this is approximately proportional to where TPR and PR are the true and positive rates defined earlier, and TSER(t) denotes the expected True Sign Error Rate TSER(t) ≡ P{Z(j) < 0μ(j) > 0}. We are in the RW model, so (t;ε,τ) can be written in terms of TPR(t; ε, τ), PR(ε, τ), and TSER(t; τ) = Φ(−t −τ). We define the ideal threshold functional Among all fixed thresholds, it achieves the largest separation for a given underlying instance of the RW(ε, τ) model.
Comparison.
How does the HCT functional compare with the ideal threshold functional, both in value and performance? They seem surprisingly close. Fig. 3 presents the values, FDR, MDR, and MCR for these functionals in cases with ε > 0 fixed and τ varying. The HCT functional quite closely approximates the ideal threshold, both in threshold values and in performance measures. In particular, we note that the behavior of the HCT rule that was so distinctive in the rare/weak features model—high falsefeature discovery rate and controlled missed detection rate—are actually behaviors seen in the ideal threshold classifier as well. The systematic discrepancy between the HCT and the ideal threshold at small τ is due to the constraint t > t_{0} in Eq. 3.
ROC Analysis.
The similarity between the HCT functional and the ideal threshold functional derives from the expressions for PR(t; ε, τ) and TPR(t; ε, τ) in the RW model. In the setting where very few features are selected, (1 − PR(t)) ≈ 1, and 2TSER(t) ≪ TPR(t), so we see by comparison of Eqs. 2 and 5 that (t; ε, τ) ≈
Consider two related expressions: Proxy_{1} = ε·TPR(t)/
The maximizer of Proxy_{2} has a very elegant characterization, as the point in t where the secant to the ROC curve is double the tangent to the ROC curve,
For comparative purposes, FDR thresholding finds a point on the ROC curve with prescribed secant:
Complements
Performance on Standard Datasets.
In recent literature on classification methodology, a collection of six datasets has been used frequently for illustrating empirical classifier performance (16). We have reservations about the use of such data to illustrate HCT, because no one can say whether any specific such dataset is an example of rare/weak feature model. However, such comparisons are sure to be requested, so we report them here.
Of the standard datasets reported in ref. 16, three involve twoclass problems of the kind considered here; these are the ALL (10), Colon (17), and Prostate (18) datasets. In ref. 17, 3fold random training test splits of these datasets were considered, and seven well known classification procedures were implemented: Bagboost (16), LogitBoost (19), SVM (20), Random Forests (21), PAM (9), and the classical methods DLDA and KNN. We applied HCT in a completely outofthebox way by using definitions standard in the literature. HCThard, which uses feature weights based on hard thresholding of feature Zscores, gave quite acceptable performance. For comparison, introduce the relative regret measure Regret(A) = [err(A) − min_{A′}err(A′)]/[max_{A′}err(A′) − min_{A′}err(A′)]. This compares the achieved error rate with the best and worst performance seen across algorithms. We report errors rates and regrets side by side in Table 1, where rows 2–7 are from Dettling (16), row 8 is provided by Tibshirani, and row 9 is the result of HCThard.
Additionally, column 5 is the maximum regret across three different datasets, and column 6 is the rank based on the maximum regret. In the randomsplit test, HCThard was the minimax regret procedure, always being within 29% of the best known performance, whereas every other procedure was worse in relative performance in at least some cases.
It is worth remarking that HCTbased feature selection classifiers are radically simpler than all of the other methods being considered in this competition, requiring no tuning or crossvalidation to achieve the presented results.
Comparison to Shrunken Centroids.
The well known “Shrunken Centroids” (SC) algorithm (9) bears an interesting comparison to the procedures discussed here. In the twoclass setting, SC amounts to linear classification with feature weights obtained from soft thresholding of feature Zscores. Consequently, HCTsoft can be viewed as a modification to SC, choosing thresholds by HCT rather than crossvalidation. We made a simulation study contrasting the performance of SC with HCThard, HCTsoft, and HCTclip in the rare/weak features model. We conducted 100 Monte Carlo simulations, where we chose p = 10.000, k = 100 (so ε = k/p = 0.01), n = 40, and τ ∈ [1,3]. Over this range, the best classification error rate ranged from nearly 50%—scarcely better than ignorant guessing—to <3%. Fig. 5 shows the results. Apparently, HCTsoft and SC behave similarly—with HCTsoft consistently better (here SC is implemented with a threshold picked by 10fold crossvalidations). However, HCTsoft and SC are not at all similar in computational cost at the training stage, as HCTsoft requires no crossvalidation or tuning. The similarity of the two classifiers is, of course, explainable by using discussions above. Crossvalidation is “trying” to estimate the ideal threshold, which the HCT functional also approximates. In Table 2, we tabulated the mean and standard deviation (SD) of HCT and crossvalidated threshold selection (CVT). We see that CVT is on average larger than the HCT in this range of parameters. We also see that CVT has a significantly higher variance than the HC threshold; presumably, this is why HCTsoft consistently outperforms SC. In fact, crossvalidation is generally inconsistent in the fixedn, largep limit, whereas HCT is consistent in the RW model; hence the empirical phenomenon visible in these graphs should apply much more generally.
Alternative Classifier by Using HC.
HC can be used directly for classification (22), without reference to linear discrimination and feature selection (for comparison with the method proposed here, see unpublished work).
Theoretical Optimality.
In a companion article (unpublished work), we developed a largep, fixedn asymptotic study and showed rigorously that HCT yields asymptotically optimal error rate classifiers in the RW model.
Reproducible Research.
All original figures and tables presented here are fully reproducible, consistent with the concept of Reproducible Research (23).
Acknowledgments
We thank the Newton Institute for hospitality during the program Statistical Challenges of HighDimensional Data, and in particular, D.M. Titterington for his leading role in organizing this program. Peter Hall gave helpful comments on the behavior of crossvalidation in the large p, small n case. This work was supported in part by National Science Foundation Grants DMS0505303 (to D.D.) and DMS0505423 and DMS0639980 (to J.J.).
Footnotes
 ^{‡}To whom correspondence should be addressed. Email: donoho{at}stat.standord.edu

Author contributions: D.L.D. and J.J. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

Freely available online through the PNAS open access option.
 © 2008 by The National Academy of Sciences of the USA
References
 ↵
 Anderson TW
 ↵
 ↵
 Fan J,
 Fan Y
 ↵
 Donoho D,
 Johnstone I
 ↵
 Donoho D,
 Johnstone I,
 Hoch JC,
 Stern AS
 ↵
 Ingster YI
 ↵
 ↵
 Jin J
 ↵
 Tibshirani R,
 Hastie T,
 Narasimhan B,
 Chu G
 ↵
 Golub T,
 et al.
 ↵
 ↵
 Benjamini Y,
 Hochberg Y
 ↵
 Abramovich F,
 Benjamini Y
 Antoniadis A,
 Oppenheim G
 ↵
 ↵
 ↵
 Dettling M
 ↵
 Alon U,
 et al.
 ↵
 ↵
 Dettling M,
 Bühlmann P
 ↵
 ↵
 Breiman L
 ↵
 Hall P,
 Pittelkow Y,
 Ghosh M
 ↵
 Donoho D,
 Maleki A,
 UrRahman I,
 Shahram M,
 Stodden V