Statistical learning and selective inference
Contributed by Robert J. Tibshirani, May 7, 2015 (sent for review March 2, 2015; reviewed by Rollin Brant and John D. Storey)

Significance
Most statistical analyses involve some kind of “selection”—searching through the data for the strongest associations. Measuring the strength of the resulting associations is a challenging task, because one must account for the effects of the selection. There are some new tools in selective inference for this task, and we illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
Abstract
We describe the problem of “selective inference.” This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have “cherry-picked”—searched for the strongest associations—means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
Statistical science has changed a great deal in the past 10–20 years, and is continuing to change, in response to technological advances in science and industry. The world is awash with big and complicated data, and researchers are trying to make sense out of it. Leading examples include data from “omic” assays in the biomedical sciences, financial forecasting from economic and business indicators, and the analysis of user click patterns to optimize ad placement on websites. This has led to an explosion of interest in the fields of statistics and machine learning and spawned a new field some call “data science.”
In the words of Yoav Benjamini, statistical methods have become “industrialized” in response to these changes. Whereas traditionally scientists fit a few statistical models by hand, now they use sophisticated computational tools to search through a large number of models, looking for meaningful patterns. Having done this search, the challenge is then to judge the strength of the apparent associations that have been found. For example, a correlation of 0.9 between two measurements A and B is probably noteworthy. However, suppose that I had arrived at A and B as follows: I actually started with 1,000 measurements and I searched among all pairs of measurements for the most correlated pair; these turn out to be A and B, with correlation 0.9. With this backstory, the finding is not nearly as impressive and could well have happened by chance, even if all 1,000 measurements were uncorrelated. Now, if I just reported to you that these two measures A and B have correlation 0.9, and did not tell you which of these two routes I used to obtain them, you would not have enough information to judge the strength of the apparent relationship. This statistical problem has become known as “selective inference,” the assessment of significance and effect sizes from a dataset after mining the same data to find these associations.
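To make the effect of this kind of search concrete, here is a small simulation sketch in Python (ours, not from the article; the sample size of 20 and the random seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 20, 1000                 # small sample, 1,000 measurements

# All measurements are generated independently, so every true correlation is zero.
X = rng.standard_normal((n_obs, n_vars))

C = np.corrcoef(X, rowvar=False)         # 1,000 x 1,000 sample correlation matrix
np.fill_diagonal(C, 0.0)                 # ignore the trivial self-correlations

print("correlation of one prespecified pair:", round(abs(C[0, 1]), 2))
print("best correlation over all pairs:     ", round(np.abs(C).max(), 2))
```

The cherry-picked maximum is routinely several times larger than the correlation of a single prespecified pair, even though nothing real is going on in these data.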
As another example, suppose that we have a quantitative value y, a measurement of the survival time of a patient after receiving either a standard treatment or a new experimental treatment. I give the old drug (1) or new drug (2) at random to a set of patients and compute the difference in mean outcome between the two groups, along with the usual P value for the hypothesis of no treatment difference. If, however, the new drug had been chosen for this comparison because it looked the most promising among many candidate treatments, that same P value would overstate the evidence for a real effect.
If not taken into account, the effects of selection can greatly exaggerate the apparent strengths of relationships. We feel that this is one of the causes of the current crisis in reproducibility in science (e.g., ref. 1). With increased competitiveness and pressure to publish, it is natural for researchers to exaggerate their claims, intentionally or otherwise. Journals are much more likely to publish studies with low P values, and we (the readers) never hear about the great number of studies that showed no effect and were filed away (the “file-drawer effect”). This makes it difficult to assess the strength of a reported P value of, say, 0.04.
The challenge of correcting for the effects of selection is a complex one, because the selective decisions can occur at many different stages in the analysis process. However, some exciting progress has recently been made in more limited problems, such as that of adaptive regression techniques for supervised learning. Here the selections are made in a well-defined way, so that we can exactly measure their effects on subsequent inferences. We describe these new techniques here, as applied to two widely used statistical methods: classic supervised learning, via forward stepwise regression, and modern sparse learning, via the “lasso.” Later, we indicate the broader scope of their potential applications, including principal components analysis.
Forward Stepwise Regression
Suppose that we have a dataset with N observations (x_i, y_i), i = 1, …, N, where each x_i is a vector of p predictor (feature) measurements and y_i is an outcome that we wish to predict from x_i.
With a moderate or large number of predictors, it is common to use forward stepwise regression to select a good subset of predictors. This procedure enters predictors one at a time, at each step choosing the predictor that most decreases the residual sum of squares.
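In code, the greedy search can be written in a few lines; the following is a bare-bones sketch of the idea (ours, not the authors' software), with no standardization, intercept, or efficiency considerations:

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Enter predictors one at a time, each time adding the column of X that
    most decreases the residual sum of squares of the least-squares fit."""
    n, p = X.shape
    active = []                                   # selected predictors, in entry order
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in set(range(p)) - set(active):
            Xa = X[:, active + [j]]
            beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
            rss = np.sum((y - Xa @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return active
```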
Fig. 1 quantifies this effect in a simulated example.
A simulation example: naive and selection-adjusted P values for predictors entered by forward stepwise regression.
The correct P values in Fig. 1 were estimated by simulating response vectors from the Gaussian distribution under the null hypothesis and recording the maximal value of the first-step statistic (the largest drop in residual sum of squares) over all candidate predictors; the observed statistic is then compared with this simulated null distribution.
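As a concrete illustration of that Monte Carlo calculation, here is a sketch for the first step only (ours, not the authors' code; it assumes standardized predictors and unit error variance, under which the largest absolute inner product between a predictor and the centered response is equivalent to the largest drop in residual sum of squares):

```python
import numpy as np

def first_step_adjusted_pvalue(X, y, n_sim=2000, seed=0):
    """Selection-adjusted P value for the first predictor entered by forward
    stepwise regression, obtained by simulating the maximal first-step
    statistic under the global null hypothesis."""
    rng = np.random.default_rng(seed)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize columns
    observed = np.abs(Xs.T @ (y - y.mean())).max()       # best predictor's statistic
    null_max = np.empty(n_sim)
    for b in range(n_sim):
        y0 = rng.standard_normal(len(y))                 # pure-noise response
        null_max[b] = np.abs(Xs.T @ (y0 - y0.mean())).max()
    return float(np.mean(null_max >= observed))          # adjusted P value
```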
As an example, Rhee et al. (2) studied six nucleoside reverse transcriptase inhibitors that are used to treat HIV-1. The target of these drugs can become resistant through mutation, and Rhee et al. (2) compared a collection of models for predicting the log susceptibility, a measure of drug resistance, based on the location of mutations. We chose one of the inhibitors for illustration; the predictors are indicators of mutations at individual sites.
P values for HIV data, as predictors are entered successively into the model by forward stepwise regression. The black points are the naive P values that do not correct for the selection effects, and the green points are selection-adjusted.
The naive P values suggest that there are six strong predictors (P value less than 0.05), but the more realistic selection-adjusted P values yield only two or three significant predictors.
How were these adjusted P values obtained? Trying to obtain them by naive Monte Carlo sampling would be tedious, and perhaps infeasible for a large number of predictors, because we would have to generate many samples and keep only those for which the sequence of selected variables is the same as that obtained in our original data. Fortunately, there is a simple mathematical construction that gives us these selection-adjusted P values in closed form.
Recall that our data vector y is assumed to be normally distributed with mean μ and covariance σ²I.
How do we adjust for the effects of selection? It turns out that for forward stepwise regression and many other procedures the selection events can be written in the “polyhedral” form {y : Ay ≤ b}, for a matrix A and vector b determined by the particular sequence of selections that occurred.
Let’s dig more deeply to understand how this works. The limits of truncation are the smallest and largest values that the statistic of interest could take while still leading to exactly the same sequence of selections; they can be computed explicitly from the constraint Ay ≤ b.
Fig. 3 depicts this situation, showing a truncated normal distribution with truncation limits determined by the selection event.
HIV data: truncated normal distribution for the coefficient of one of the selected mutation predictors.
Ignoring the selection effects, under the null hypothesis that the true coefficient is zero we would compute the P value from the full, untruncated normal distribution; the selection-adjusted P value is instead computed from the truncated distribution shown in the figure, and it is typically larger, reflecting the fact that this coefficient was singled out by the search.
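Once the truncation limits are in hand, the adjusted P value is just a ratio of Gaussian tail areas. A minimal sketch (ours; the limits vlo and vhi, the standard error sigma, and the observed statistic would come from the polyhedral computation described above and are simply passed in here):

```python
from scipy.stats import norm

def truncated_normal_pvalue(stat, sigma, vlo, vhi, mu0=0.0):
    """P(Z >= stat | vlo < Z < vhi) for Z ~ N(mu0, sigma^2): the one-sided
    selection-adjusted P value for an observed statistic `stat`."""
    lo = norm.cdf((vlo - mu0) / sigma)
    hi = norm.cdf((vhi - mu0) / sigma)
    obs = norm.cdf((stat - mu0) / sigma)
    return (hi - obs) / (hi - lo)
```

When the truncation limits are far from the observed statistic this reduces to the usual normal P value; when selection pushes the statistic close to a limit, the adjusted P value is pulled toward 1, reflecting how little surprise remains once the selection is accounted for.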
False Discovery Rates and a Sequential Stopping Rule
Looking at the sequence of adjusted P values (green points) in Fig. 2, a natural question arises: When should we stop adding variables? Should we stop at, say, two or three predictors, beyond which the P values are above 0.05? If we do stop there, what can we say about the resulting model? The notion of false discovery rate (FDR) is useful for this purpose. This concept has been an important one for large-scale testing, especially in the biomedical sciences (see ref. 5).
Let’s review this concept and other related, more traditional ones. Table 1 summarizes the theoretical outcomes of m hypothesis tests.
Possible outcomes from m hypothesis tests
The quantity V counts false-positive calls, and much of classical statistical testing focuses on controlling the probability of making even one such call (the familywise error rate). The FDR instead is the expected proportion of false positives among the hypotheses called significant (taken to be zero when no calls are made); controlling it allows a few false positives as long as they make up only a small fraction of the discoveries.
Now, it turns out that there is a simple stopping rule for sequences of P values as in Fig. 1 that gives us automatic control of the FDR of the resulting model (Eq. 6). Choosing a target FDR level of α and denoting the successive P values by p_1, p_2, …, p_m, the “ForwardStop” rule stops at the largest step k for which the average of −log(1 − p_i) over the first k steps is at most α; if no step qualifies, no variables are entered.
In the HIV example, applying this rule at a conventional FDR target stops after the first two or three predictors, in line with the selection-adjusted P values shown above.
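The computation behind this stopping rule is simple; here is a minimal sketch of ForwardStop in Python (our implementation; the FDR target of 0.10 is only an illustrative default):

```python
import numpy as np

def forward_stop(pvalues, alpha=0.10):
    """Return the number of steps to keep: the largest k for which the running
    mean of -log(1 - p_i), i = 1..k, is at most alpha (0 if none qualifies)."""
    p = np.asarray(pvalues, dtype=float)
    running_mean = np.cumsum(-np.log(1.0 - p)) / np.arange(1, len(p) + 1)
    ok = np.flatnonzero(running_mean <= alpha)
    return int(ok[-1] + 1) if ok.size else 0
```

For example, forward_stop([0.001, 0.02, 0.4, 0.7, 0.9]) returns 2, stopping the model after the first two predictors.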
The Lasso
We now broaden our discussion to a more modern approach to supervised learning, one that can be applied to large datasets. We give a brief description of the method and then show how selective inference can be applied in this more general framework.
The lasso, or ℓ1-regularized regression, estimates the weights (coefficients) β_1, …, β_p by solving a convex optimization problem‡:
minimize over β:  (1/2) Σ_i ( y_i − Σ_j x_ij β_j )²  +  λ Σ_j |β_j|,   [7]
where λ ≥ 0 is a tuning parameter that controls the tradeoff between the fit to the data and the total size of the coefficients.
Because of the absolute value function appearing in the penalty, over a range of λ values the solution to this problem is sparse, that is, many of the weights β_j are estimated to be exactly zero.
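The way exact zeros arise is easiest to see in code. Cyclic coordinate descent, a standard algorithm for Eq. 7, repeatedly soft-thresholds each coefficient, and the soft-threshold maps any coefficient whose signal falls below λ exactly to zero. The following is a generic teaching sketch of that algorithm (not the authors' software), with no intercept or convergence check:

```python
import numpy as np

def soft_threshold(u, lam):
    """Shrink u toward zero by lam; values within lam of zero become exactly 0."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * ||y - X @ beta||^2 + lam * sum(|beta_j|) by cyclic
    coordinate descent (assumes no all-zero columns)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                           # squared column norms
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]                      # remove j's contribution
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]                      # add back the updated one
    return beta
```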
For a fixed choice of the tuning parameter λ, the solution to Eq. 7 has nonzero values for only a subset of the coefficients; the corresponding predictors are the variables selected by the lasso. Fig. 4 shows the result of applying the lasso to the HIV data over a range of values of λ.
Paths of estimated coefficients from the lasso applied to the HIV data. Each path shows the estimated β for one mutation site, with the site numbers for some of the larger coefficients labeled on the right. The estimated optimal budget is indicated by the vertical green line.
Each profile corresponds to one predictor (mutation site). The different points on each profile correspond to different values of the penalty parameter λ, with this value decreasing as we move from left to right. For interpretability, we have plotted the profiles not against λ on the horizontal axis but against the “parameter budget” Σ_j |β_j|, the total absolute size of the fitted coefficients, which grows as λ shrinks.
Also shown on the plot is the optimal value of λ (green line) estimated using cross-validation. Cross-validation works as follows. (i) We divide the samples into roughly 10 equal-sized parts at random. (ii) For each part K, we leave out that part, fit the lasso to the other nine parts over a range of λ values, and record the prediction error on the left-out part. (iii) We repeat step ii for each of the 10 parts in turn, average the 10 prediction-error curves, and choose the value of λ (equivalently, the budget) that minimizes the average error.
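As a sketch of this recipe in code (ours; the fit and predict arguments are placeholders for any fitting routine, such as a lasso solver at a given λ, and squared error is our illustrative choice of loss):

```python
import numpy as np

def ten_fold_cv(X, y, lambdas, fit, predict, seed=0):
    """Estimate prediction error for each value of lambda by 10-fold
    cross-validation.  `fit(X, y, lam)` returns a fitted model and
    `predict(model, X)` returns predictions for new rows of X."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % 10           # random split into 10 parts
    errors = np.zeros(len(lambdas))
    for k in range(10):
        train, test = folds != k, folds == k
        for i, lam in enumerate(lambdas):
            model = fit(X[train], y[train], lam)
            errors[i] += np.sum((y[test] - predict(model, X[test])) ** 2)
    return errors / len(y)                         # average squared error per case
```

The value of λ (or, equivalently, the budget) that minimizes this averaged error is what the vertical green line in Fig. 4 marks.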
It turns out that this selection of predictors can again be described as a polyhedral region of the form {y : Ay ≤ b}, so the same machinery used for forward stepwise regression delivers selection-adjusted P values and confidence intervals for the coefficients of the selected variables.
Fig. 5 shows naive least squares confidence intervals and the selection-adjusted intervals for the nine underlying parameters. We see that some of the selection-adjusted intervals are much wider than their naive counterparts.
Selection intervals for the HIV data, from the fitted lasso model.
Principal Components and Beyond
Principal components analysis (PCA) is a classic statistical method developed in the early 1900s but now used more widely than ever. For example, PCA was a core method used by most leading competitors in the Netflix movie prediction competition (10). PCA is a method for unsupervised learning that seeks to discover the important correlation patterns among a set of features. It works by computing the sets of linear combinations of features that are maximally correlated. It can be thought of as a method for determining the K leading eigenvectors and eigenvalues of the sample correlation matrix of the features.
In a PCA one computes the leading eigenvectors and eigenvalues (one for each component) and then must decide on the number of components K that are “significant.” Traditionally, this is done through the so-called scree plot, which is simply a plot of the eigenvalues in order from largest to smallest. Fig. 6, Left shows an example from some simulated data with nine features. The underlying correlation matrix, from which the data were simulated, has two strong components.
Eigenvalue (scree) plots for two simulated data matrices, each with nine features. (Left) There are two strong leading components in the underlying population. (Right) The two leading components are only moderate in strength.
The scree plot shows a nice “elbow” and would lead us to choose the correct number of components (two). In Fig. 6, Right, we have reduced the signal-to-noise ratio, so that the population eigenvalues of the first two eigenvectors are only moderately larger than the others. Now the scree plot shows no elbow, and we would be hard pressed to guess the number of components.
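For readers who want to reproduce such a plot, here is a minimal sketch of the quantities displayed (our code; the pure-noise example is only for illustration):

```python
import numpy as np

def scree_values(X):
    """Eigenvalues of the sample correlation matrix of the columns of X,
    ordered from largest to smallest, as displayed in a scree plot."""
    R = np.corrcoef(X, rowvar=False)
    return np.linalg.eigvalsh(R)[::-1]     # eigvalsh returns ascending order

# Pure-noise data with nine features: the eigenvalues are roughly flat,
# with no elbow to suggest any real components.
X = np.random.default_rng(1).standard_normal((200, 9))
print(np.round(scree_values(X), 2))
```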
However, selective inference can come to our rescue. Choosing the leading eigenvectors of a covariance matrix is similar in spirit to forward stepwise regression, and with somewhat more complicated mathematics one can derive selection-adjusted P values for each successive increase in the rank (11). For the example in Fig. 6, Right, the adjusted P values are (0.030, 0.064, 0.222, 0.286, 0.197, 0.831, 0.510, 0.185, and 0.126), and hence this gives a moderately strong indication of the correct rank. Although this may seem too good to be true based on the scree plot, one must remember that the P value estimation procedure uses more information in the correlation matrix than just the eigenvalues shown here.
Discussion
Many, or most, modern statistical methods, in this era of big data, use some form of selection to choose a best-fitting model among a plethora of choices. We have seen two examples above: forward stepwise regression and the lasso. There are many more examples, including methods for time and spatial analysis, interpretation of images, and network analysis. For all of these methods, work is underway to develop tools for selective inference, so that the analyst can properly calibrate the strength of the observed patterns.
The methods described here are “closed form,” that is, the P values are derived from (complicated) formulae. There are also sampling-based methods under development, using Markov chain Monte Carlo and bootstrapping, that can provide improvements in power (12). We would be remiss if we did not mention a simpler, attractive approach to selective inference, namely sample splitting (see, e.g., ref. 13). Here we randomly divide the observations into, say, two parts: We do the model fitting and selection on one part and then estimate P values and confidence intervals on the second part. This is a valid method but can suffer from a loss of power and from difficulty in interpretation, because the model and results will change with a different random split. However, sample splitting may be more robust to assumptions about the form of the noise distribution of the data.
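A minimal sketch of the sample-splitting recipe (ours; the select argument stands in for any selection procedure, such as the lasso or forward stepwise, and intercepts and rank deficiencies are ignored for simplicity):

```python
import numpy as np
from scipy import stats

def split_sample_pvalues(X, y, select, seed=0):
    """Select variables on a random half of the data, then compute ordinary
    least-squares t-test P values for those variables on the other half."""
    rng = np.random.default_rng(seed)
    n = len(y)
    first_half = np.zeros(n, dtype=bool)
    first_half[rng.permutation(n)[: n // 2]] = True

    chosen = list(select(X[first_half], y[first_half]))      # selection half
    X2, y2 = X[~first_half][:, chosen], y[~first_half]       # inference half

    beta, *_ = np.linalg.lstsq(X2, y2, rcond=None)
    resid = y2 - X2 @ beta
    df = X2.shape[0] - X2.shape[1]
    sigma2 = resid @ resid / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X2.T @ X2)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)            # standard t-tests
    return dict(zip(chosen, pvals))
```

Because the second half of the data never participated in the selection, ordinary unadjusted P values are valid there; the price, as noted above, is that only half of the data contributes to each stage.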
The work here adjusts for the actual selection procedure that was applied to the dataset. An alternative approach (14) adjusts for any possible selection procedure.
This is an exciting time for statisticians, and judging by the burgeoning interest in statistical and machine learning courses this excitement is being shared by other scientists and analysts who use statistics and data science. We expect that many software packages will soon offer implementation of these new tools.
The work discussed here represents a joint collaboration with our colleagues, students, and former students. References include 3, 11, 12, and 15–20. Interested readers who want to learn more about sparse methods in general may consult the forthcoming book (9), in which chapter 6 covers most of the material here in some detail and also discusses the covariance test of ref. 13, a simple asymptotic version of a test presented here, based on the exponential distribution. This latter paper contains interesting general discussion of the selective inference problem by a number of researchers.
Acknowledgments
We would like to thank our many collaborators in this work, including Yun Jin Choi, Alexandra Chouldechova, Will Fithian, Max G'Sell, Jason Lee, Richard Lockhart, Dennis Sun, Yukai Sun, Ryan Tibshirani, and Stefan Wager. R.J.T. was supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Grant N01-HV-28183.
Footnotes
- 1To whom correspondence should be addressed. Email: tibs@stanford.edu.
This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2012.
Author contributions: J.T. and R.J.T. designed research; J.T. and R.J.T. performed research; and R.J.T. wrote the paper.
Reviewers: R.B., University of British Columbia; and J.D.S., Princeton University.
The authors declare no conflict of interest.
†By way of contrast, in the unsupervised learning problem we observe just the features x and no outcome variables y, and we try to uncover the correlations and other dependencies between these features. Principal components analysis is a leading example and is discussed later in this article.
‡A convex optimization problem is a standard and attractive form of numerical problem, involving minimization of a convex function (or maximization of a concave function) over a convex domain.
See QnAs on page 7621.
References
- Rhee S-Y, et al.
- Taylor JE, Lockhart R, Tibshirani RJ, Tibshirani R
- Lee JD, Sun DL, Sun Y, Taylor JE
- Storey JD, Tibshirani R
- G’Sell MG, Wager S, Chouldechova A, Tibshirani R
- Candès E
- Hastie T, Tibshirani R, Wainwright M
- Bennett J, Lanning S
- Choi Y, Taylor J, Tibshirani R
- Fithian W, Sun D, Taylor J
- Taylor J, Loftus J, Tibshirani R
- G’Sell MG, Taylor J, Tibshirani R
- Loftus JR, Taylor JE
- Reid S, Taylor J, Tibshirani R
- Lee JD, Taylor JE