# Statistical learning and selective inference


Contributed by Robert J. Tibshirani, May 7, 2015 (sent for review March 2, 2015; reviewed by Rollin Brant and John D. Storey)

## Significance

Most statistical analyses involve some kind of “selection”—searching through the data for the strongest associations. Measuring the strength of the resulting associations is a challenging task, because one must account for the effects of the selection. There are some new tools in selective inference for this task, and we illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

## Abstract

We describe the problem of “selective inference.” This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have “cherry-picked”—searched for the strongest associations—means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

Statistical science has changed a great deal in the past 10–20 years, and is continuing to change, in response to technological advances in science and industry. The world is awash with big and complicated data, and researchers are trying to make sense out of it. Leading examples include data from “omic” assays in the biomedical sciences, financial forecasting from economic and business indicators, and the analysis of user click patterns to optimize ad placement on websites. This has led to an explosion of interest in the fields of statistics and machine learning and spawned a new field some call “data science.”

In the words of Yoav Benjamini, statistical methods have become “industrialized” in response to these changes. Whereas traditionally scientists fit a few statistical models by hand, now they use sophisticated computational tools to search through a large number of models, looking for meaningful patterns. Having done this search, the challenge is then to judge the strength of the apparent associations that have been found. For example, a correlation of 0.9 between two measurements A and B is probably noteworthy. However, suppose that I had arrived at A and B as follows: I actually started with 1,000 measurements and I searched among all pairs of measurements for the most correlated pair; these turn out to be A and B, with correlation 0.9. With this backstory, the finding is not nearly as impressive and could well have happened by chance, even if all 1,000 measurements were uncorrelated. Now, if I just reported to you that these two measures A and B have correlation 0.9, and did not tell you which of these two routes I used to obtain them, you would not have enough information to judge the strength of the apparent relationship. This statistical problem has become known as “selective inference”: the assessment of significance and effect sizes from a dataset after mining the same data to find these associations.
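A small simulation makes the inflation from searching concrete. The sizes below (200 independent measurements, 30 observations each) are illustrative choices, not taken from the article; the point is only that the cherry-picked best pair looks far more correlated than a prespecified one, even under pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_meas = 30, 200                      # 30 observations of 200 independent measurements
X = rng.standard_normal((n_obs, n_meas))     # no true correlations anywhere

corr = np.corrcoef(X, rowvar=False)          # n_meas x n_meas sample correlation matrix
np.fill_diagonal(corr, 0.0)                  # ignore self-correlations

prespecified = abs(corr[0, 1])               # a pair chosen before looking at the data
best = np.max(np.abs(corr))                  # the pair cherry-picked by the search

print(f"prespecified pair: {prespecified:.2f}, best of all pairs: {best:.2f}")
```

The search over roughly 20,000 pairs reliably turns up a "strong" correlation, while the honest prespecified pair hovers near zero.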

As another example, suppose that we have a quantitative value *y*, a measurement of the survival time of a patient after receiving either a standard treatment or a new experimental treatment. I give the old drug (1) or new drug (2) at random to a set of patients and compute the standardized mean difference in the outcome, *z* = (*ȳ*_{2} − *ȳ*_{1})/*s*, where *s* is an estimate of the SD of the raw difference. Then I could approximate the distribution of *z* by a standard normal distribution, and hence if I reported to you a value of, say, *z* = 2.5, this would look like convincing evidence that the new treatment is better (the two-sided *P* value is about 1%). However, what if instead I had tried out many new treatments—say, 25—and reported to you only the one with the largest value of |*z*|? Even if none of the treatments had any real effect, the chance that the best of 25 exceeds 2.5 in absolute value is about 27%; hence it would only be fair to attach a *P* value of 0.27 to the finding, rather than 0.01.
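The arithmetic behind numbers like these can be checked directly. The count of 25 candidate treatments below is an assumption chosen to reproduce a selection-adjusted probability near 0.27; the calculation is simply the chance that the best of *m* independent null tests clears the naive cutoff.

```python
from scipy.stats import norm

z = 2.5
p_single = 2 * norm.sf(z)                 # two-sided P value for one prespecified test (~0.012)
m = 25                                    # assumed number of treatments tried
p_selected = 1 - (1 - p_single) ** m      # chance the best of m null tests looks this good

print(f"single test: {p_single:.3f}, best of {m}: {p_selected:.2f}")
```

So a result that appears significant at the 1% level, when it is the winner of 25 silent comparisons, deserves a *P* value closer to 0.27.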

If not taken into account, the effects of selection can greatly exaggerate the apparent strengths of relationships. We feel that this is one of the causes of the current crisis in reproducibility in science (e.g., ref. 1). With increased competitiveness and pressure to publish, it is natural for researchers to exaggerate their claims, intentionally or otherwise. Journals are much more likely to publish studies with low *P* values, and we (the readers) never hear about the great number of studies that showed no effect and were filed away (the “file-drawer effect”). This makes it difficult to assess the strength of a reported *P* value of, say, 0.04.

The challenge of correcting for the effects of selection is a complex one, because the selective decisions can occur at many different stages in the analysis process. However, some exciting progress has recently been made in more limited problems, such as that of adaptive regression techniques for supervised learning. Here the selections are made in a well-defined way, so that we can exactly measure their effects on subsequent inferences. We describe these new techniques here, as applied to two widely used statistical methods: classic supervised learning, via forward stepwise regression, and modern sparse learning, via the “lasso.” Later, we indicate the broader scope of their potential applications, including principal components analysis.

## Forward Stepwise Regression

Suppose that we have a dataset with *N* observations, each consisting of a quantitative outcome *y* and a vector of *p* predictors (or features). We wish to build a model for predicting *y* from the predictors; the outcome *y* supervises (or guides) the prediction process.^{†} The standard linear regression model has the form *y* = *β*_{0} + *β*_{1}*x*_{1} + … + *β*_{p}*x*_{p} + *ε*, where *β* is a vector of *p* unknown coefficients (weights) to be estimated.

With a moderate or large number of predictors, it is common to use forward stepwise regression to select a good subset of predictors. This procedure enters predictors one at a time, at each step choosing the predictor that most decreases the residual sum of squares. Classic statistical theory for assessing the strength of each predictor would go as follows. Defining RSS_{k} as the residual sum of squares of the model containing the first *k* predictors, we use the change in residual sum of squares to form a test statistic R_{k} = (RSS_{k−1} − RSS_{k})/*σ*^{2} (with the noise variance *σ*^{2} assumed known here) and compare it to a χ^{2} distribution with one degree of freedom. There is a big problem with this approach, however: This classic theory assumes that the models being compared were prespecified before seeing the data. That is clearly not the case here: We have “cherry-picked” the best predictor at each stage to maximize the test statistic.
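The first step of this procedure, with its naive χ² assessment, can be sketched in a few lines. The data below are simulated pure noise (sizes are illustrative assumptions), so any apparent significance is entirely an artifact of picking the winner.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)               # pure noise: no predictor is truly related to y
sigma2 = 1.0                             # noise variance, assumed known as in the text

rss0 = np.sum((y - y.mean()) ** 2)       # residual sum of squares of the intercept-only model
rss1 = np.array([
    np.sum((y - np.column_stack([np.ones(n), x]) @
            np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)[0]) ** 2)
    for x in X.T                         # RSS after adding each candidate predictor alone
])

stat = (rss0 - rss1.min()) / sigma2      # statistic for the cherry-picked best predictor
p_naive = chi2.sf(stat, df=1)            # naive P value, ignoring the selection

print(f"naive P value for the best of {p} predictors: {p_naive:.4f}")
```

The statistic for the winner is the maximum of *p* roughly χ²₁ draws, so comparing it to a single χ²₁ distribution systematically overstates the evidence.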

Fig. 1 quantifies this for a simulated setting in which no predictor is truly related to the outcome. The horizontal axis is the target (nominal) *P* value set by the statistician, and the vertical axis is the actual *P* value from this process, which properly accounts for the selection of the best predictor among the *p*. For every target level the actual *P* value is substantially larger: A statistician who believes she is testing at the 5% level is in fact operating at a much higher false-positive rate.

The correct *P* values in Fig. 1 were estimated by simulating null data from the Gaussian distribution and recording the maximal value of the test statistic across the *p* predictors. The selective inference tools described below give a proper *P* value not just for the first step of forward stepwise regression, but for all steps.
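A Monte Carlo check along these lines is straightforward. The sketch below (with assumed sizes chosen for speed) simulates null data repeatedly, records the maximal per-predictor statistic, and reports how often the winner clears the nominal 5% χ²₁ cutoff—the quantity plotted on the vertical axis of a figure like Fig. 1.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, p, B = 100, 10, 2000
cutoff = chi2.ppf(0.95, df=1)            # naive 5% cutoff for a prespecified predictor

max_stats = np.empty(B)
for b in range(B):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)           # null data: all true coefficients are zero
    # per-predictor chi-squared statistic: squared, normalized inner product with y
    stats = (X.T @ y) ** 2 / np.sum(X ** 2, axis=0)
    max_stats[b] = stats.max()           # forward stepwise keeps the winner

actual = np.mean(max_stats > cutoff)     # true false-positive rate at nominal 5%
print(f"nominal 5% cutoff -> actual false-positive rate {actual:.2f}")
```

With 10 null predictors the actual rate lands near 1 − 0.95¹⁰ ≈ 0.40, an eightfold inflation of the nominal level.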

As an example, Rhee et al. (2) studied six nucleoside reverse transcriptase inhibitors that are used to treat HIV-1. The target of these drugs can become resistant through mutation, and Rhee et al. (2) compared a collection of models for predicting log susceptibility, a measure of drug resistance, based on the location of mutations. We chose one of the inhibitors, with a total of 30 candidate predictors (mutation sites), and computed both naive and selection-adjusted *P* values for the variables entered by forward stepwise regression.

The naive *P* values suggest that there are six strong predictors (*P* value less than 0.05), but the more realistic selection-adjusted *P* values yield only two or three significant predictors.

How were these adjusted *P* values obtained? Trying to obtain them by naive Monte Carlo sampling would be tedious, and perhaps infeasible for a large number of predictors, because we would have to generate many samples and keep only those for which the sequence of selected variables is the same as that obtained in our original data. Fortunately, there is a simple mathematical construction that gives us these selection-adjusted *P* values in closed form.

Recall that our data vector *y* is normally distributed, with mean determined by the coefficient vector *β* and some variance. The variables entered by forward stepwise regression were selected by searching through *y* itself. In particular, the number of predictors from which we chose them should affect how we assess the apparent associations: If we had started with 300 rather than 30 predictors, we should have set a higher bar for calling an association significant.

How do we adjust for the effects of selection? It turns out that for forward stepwise regression and many other procedures the selection events can be written in the “polyhedral” form {*Ay* ≤ *b*} for some matrix *A* and vector *b* (3, 4). We can understand this statement as follows. Suppose that we consider any new vector of outcomes, say *y*′, in place of *y*. We run our forward stepwise procedure and keep track of the predictors entered at each stage. These will not likely be the same as our original list, because the data have changed. However, the polyhedral form says that the set of new data vectors yielding exactly the same list of selected predictors is {*y*′ : *Ay*′ ≤ *b*}, where *A* and *b* depend on the data and the selected variables. Roughly speaking, each stage of forward stepwise regression represents a competition among all *p* variables, and *A* and *b* simply reconstruct this competition and check whether the same winners emerge. The upshot is that the usual null distribution of the test statistic (Eq. **3**) is replaced by a truncated normal distribution (Eq. **4**), which exactly accounts for the selection process through its truncation limits. This construction produced the correct *P* values in Fig. 1.

Let’s dig more deeply to understand how this works. The truncation limits arise from the competition among predictors: Roughly, the statistic for the winning predictor must exceed that of every losing predictor, and with orthogonal predictors these competing values directly determine the lower and upper limits *c* and *d*, respectively. With correlated predictors—as in our data—the definitions of *c* and *d* are more complicated, but the idea is the same.

Fig. 3 depicts this situation, showing a truncated normal distribution with truncation limits *c* and *d*, together with the observed value of the test statistic.

Ignoring the selection effects, under the null hypothesis that the true coefficient for the selected predictor is zero, the observed value of 5.1 would be wildly significant. The proper reference distribution, however, is the normal distribution truncated to lie between *c* and *d* (4.3 and 6.3 in Fig. 3, respectively). Relative to that distribution, the value of 5.1 represents moderate but not overwhelming evidence that the coefficient is nonzero.
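Plugging the Fig. 3 numbers into a truncated normal tail calculation makes the contrast concrete. The sketch below assumes, for illustration, that the statistic is standard normal before truncation; the selection-adjusted *P* value is the conditional tail probability given that the statistic landed between the truncation limits.

```python
from scipy.stats import norm

z_obs, c, d = 5.1, 4.3, 6.3              # observed statistic and truncation limits (Fig. 3)

p_naive = norm.sf(z_obs)                 # ignores selection: astronomically significant
# Selection-adjusted P value: tail probability of the normal truncated to [c, d]
p_adj = (norm.sf(z_obs) - norm.sf(d)) / (norm.sf(c) - norm.sf(d))

print(f"naive P value: {p_naive:.1e}, selection-adjusted P value: {p_adj:.3f}")
```

The naive tail probability is on the order of 10⁻⁷, whereas the adjusted *P* value is around 0.02: still significant, but "moderate rather than overwhelming," exactly as the truncated distribution dictates.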

## False Discovery Rates and a Sequential Stopping Rule

Looking at the sequence of adjusted *P* values (green points) in Fig. 1, a natural question arises: When should we stop adding variables? Should we stop at, say, two or three predictors, beyond which the *P* values are above 0.05? If we do stop there, what can we say about the resulting model? The notion of false discovery rate (FDR) is useful for this purpose. This concept has been an important one for large-scale testing, especially in the biomedical sciences (see ref. 5).

Let’s review this concept and other related, more traditional ones. Table 1 summarizes the theoretical outcomes of *m* hypothesis tests.

The quantity *V* counts the false-positive calls, and much of classic statistical testing focuses on controlling the probability that *V* ≥ 1 (the familywise error rate). This makes sense when the number of tests *m* is not too large (say, <50). However, with a larger number of tests, common in today’s big data problems, we expect that *V* will be much greater than 1, and this approach is too stringent. Instead we shift our focus to the FDR, defined as FDR = E(*V*/*R*) (Eq. **6**): the expected proportion of false positives among the *R* tests that we reject, that is, effects that we call significant. (*V*/*R* is defined to be zero when *R* = 0.)

Now, it turns out that there is a simple stopping rule for sequences of *P* values as in Fig. 1 that gives us automatic control of the FDR of the resulting model (Eq. **6**). Choosing a target FDR level of *α*, and denoting the successive *P* values by *p*_{1}, *p*_{2}, …, we stop at the largest step *k* for which the average of the transformed *P* values up to that point is below *α*. This average, however, is taken on the (complementary) log scale: We average the quantities −log(1 − *p*_{i}) for *i* = 1, …, *k*. Using this ForwardStop rule, the final model contains the predictors entered at the first *k* steps, and its FDR is controlled at level *α*.
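The ForwardStop rule described above fits in a few lines. The sequence of *P* values below is an illustrative assumption, not data from the article; the function returns the number of steps (predictors) to keep.

```python
import numpy as np

def forward_stop(pvals, alpha):
    """Largest k such that the running average of -log(1 - p_i), i <= k, is below alpha."""
    y = -np.log(1.0 - np.asarray(pvals, dtype=float))   # complementary-log transform
    avg = np.cumsum(y) / np.arange(1, len(y) + 1)       # running averages
    below = np.nonzero(avg <= alpha)[0]
    return below[-1] + 1 if below.size else 0           # steps to keep (0 = stop at once)

# Illustrative sequence of selection-adjusted P values from successive stepwise steps
pvals = [0.001, 0.02, 0.30, 0.55, 0.80]
print(forward_stop(pvals, alpha=0.10))   # -> 2
```

With these *P* values and *α* = 0.10 the rule keeps the first two predictors; loosening the target to *α* = 0.20 admits a third.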

In the HIV example, for a typical target level such as *α* = 0.05, the rule stops after the first few predictors, in agreement with the two or three significant selection-adjusted *P* values discussed earlier.

## The Lasso

We now broaden our discussion to a more modern approach to supervised learning, one that can be applied to large datasets. We give a brief description of the method and then show how selective inference can be applied in this more general framework.

The lasso, or ℓ_{1}-regularized regression, is one such approach.^{‡} The lasso solves the problem minimize_{β} Σ_{i}(*y*_{i} − *β*_{0} − Σ_{j}*x*_{ij}*β*_{j})^{2} + *λ*Σ_{j}|*β*_{j}| (Eq. **7**), where the penalty parameter *λ* effectively balances the tradeoff between two objectives: the goodness of fit to the data (the sum of squares in the first term) and the complexity of the model (the second term). It turns out that this is equivalent to minimizing the sum of squares subject to a “budget” constraint Σ_{j}|*β*_{j}| ≤ *s*, where the budget *s* depends on *λ*: Larger values of *λ* imply a smaller budget *s*. The best value for *λ* or *s* depends on the data and is usually chosen by cross-validation (described below).
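The effect of the penalty can be seen in a small sketch. The proximal-gradient (ISTA) solver below is one generic way to solve a problem of the form of Eq. 7—it is not the algorithm behind the article's results—and the data are simulated, with only the first predictor truly relevant.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize 0.5*||y - X b||^2 + lam*||b||_1 by iterative soft-thresholding (ISTA)."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))  # gradient step on the squared error
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(3)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] * 2.0 + rng.standard_normal(n)        # only the first predictor matters

for lam in (0.5, 20.0, 500.0):
    nz = np.count_nonzero(lasso_ista(X, y, lam))
    print(f"lambda = {lam:6.1f}: {nz} nonzero coefficients")
```

As *λ* grows the budget shrinks and coefficients are squeezed exactly to zero; at a large enough *λ* the fitted model is empty, while small *λ* lets many noise predictors in.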

Because of the absolute value function appearing in the penalty, over a range of *λ* values the solution to this problem is sparse; that is, many of the estimated weights are exactly zero. Hence, the lasso performs model selection, choosing a subset of the *p* available predictors. However, because the problem has been cast as a convex optimization, the search through the possible models can be carried out much more efficiently than an exhaustive search. The lasso and related ℓ_{1}-penalized methods are now widely used throughout statistics and machine learning.

For a fixed choice of the tuning parameter *λ*, the solution to Eq. **7** has nonzero values for only a subset of the coefficients, called the active set. Fig. 4 shows the profiles of the lasso coefficient estimates for the HIV data.

Each profile corresponds to one predictor (mutation site). The different points on each profile correspond to different values of the penalty parameter *λ*, with this value decreasing as we move from left to right. For interpretability, we have plotted not *λ* itself on the horizontal axis, but the “parameter budget” *s*, which grows as *λ* shrinks. On the left, *λ* is very large and the effective budget is zero, so that all parameter estimates are zero. As we move to the right, *λ* decreases and the effective budget increases, so that more parameters become nonzero. The vertical dotted lines mark the places where a coefficient becomes nonzero as we move from left to right. At the right end, the value of *λ* is zero, so that there is no budget constraint on the parameters; hence the values there are the usual least squares estimates.

Also shown on the plot is the optimal value of *λ* (green line) estimated by cross-validation. Cross-validation works as follows. (*i*) Divide the samples into roughly 10 equal-sized parts at random. (*ii*) For each part in turn, leave out that part, fit the lasso to the other nine parts over a range of *λ* values, and record the prediction error on the left-out part. (*iii*) For each value of *λ*, compute the average prediction error over the 10 parts, and take as the estimate λ̂ the value with the smallest average error. For these data, the optimal *λ* yielded a model with nine predictors (Fig. 4).
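Steps *i*–*iii* above can be sketched directly. The data, the λ grid, and the ISTA solver below are illustrative assumptions (ISTA is a generic lasso solver, not the article's software); the structure of the loop is the point.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Generic lasso solver via iterative soft-thresholding."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)

lambdas = [0.1, 1.0, 10.0, 100.0, 1000.0]
folds = np.array_split(rng.permutation(n), 10)     # step i: 10 random parts

cv_err = []
for lam in lambdas:
    errs = []
    for test_idx in folds:                         # step ii: leave one part out
        train = np.setdiff1d(np.arange(n), test_idx)
        beta = lasso_ista(X[train], y[train], lam)
        errs.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    cv_err.append(np.mean(errs))                   # step iii: average over the 10 parts

best = lambdas[int(np.argmin(cv_err))]             # lambda with smallest average error
print(f"cross-validated lambda: {best}")
```

Heavily penalized fits (here *λ* = 1000, which zeroes everything) predict poorly, so cross-validation steers *λ* toward the moderate range where the true signal is retained.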

It turns out that this selection of predictors can again be described as a polyhedral region of the form {*Ay* ≤ *b*}: For a fixed *λ*, the set of response vectors *y* yielding a given active set satisfies a collection of linear inequalities, where *A* and *b* depend on the predictors, the active set, and *λ*, but not on *y*. This fact gives us what we need to construct selection-adjusted intervals for the parameters of the fitted model, using the same arguments as above for forward stepwise regression.

Fig. 5 shows naive least squares confidence intervals and the selection-adjusted intervals for the nine underlying parameters. We see that some of the selection-adjusted intervals are much wider than their naive counterparts.

## Principal Components and Beyond

Principal components analysis (PCA) is a classic statistical method developed in the early 1900s but now used more widely than ever. For example, PCA was a core method used by most leading competitors in the Netflix movie prediction competition (10). PCA is a method for unsupervised learning that seeks to discover the important correlation patterns among a set of features. It works by computing successive linear combinations of the features that capture maximal variance. It can be thought of as a method for determining the *K* leading eigenvectors and eigenvalues of the sample correlation matrix of the features.

In a PCA analysis one computes the leading eigenvectors and eigenvalues (one for each component) and then must decide on the number of components *K* that are “significant.” Traditionally, this is done through the so-called scree plot, which is simply a plot of the eigenvalues in order from largest to smallest. Fig. 6, *Left* shows an example from some simulated data with nine features. The underlying correlation matrix, from which the data were simulated, has two strong components.

The scree plot shows a nice “elbow” and would lead us to choose the correct number of components (two). In Fig. 6, *Right*, we have reduced the signal-to-noise ratio, so that the population eigenvalues of the first two eigenvectors are only moderately larger than the others. Now the scree plot shows no elbow, and we would be hard pressed to guess the number of components.
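A scree-plot setting of this kind is easy to simulate. The sizes and the factor strength below are illustrative assumptions (a strong two-component structure, analogous to Fig. 6, *Left*); the code computes the ordered eigenvalues of the sample correlation matrix, which is exactly what a scree plot displays.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 9
# Two latent factors drive the correlations among the nine features
factors = rng.standard_normal((n, 2))
loadings = rng.standard_normal((2, p))
X = factors @ loadings * 1.5 + rng.standard_normal((n, p))  # strong signal + noise

corr = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first, as in a scree plot
print(np.round(eigvals, 2))
```

With a strong signal the first two eigenvalues tower over the rest and the elbow is obvious; shrinking the factor scale toward zero reproduces the ambiguous right-hand panel, where the eigenvalues decay smoothly and no elbow appears.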

However, selective inference can come to our rescue. Choosing the leading eigenvectors of a covariance matrix is similar in spirit to forward stepwise regression, and with somewhat more complicated mathematics one can derive selection-adjusted *P* values for each successive increase in the rank (11). For the example in Fig. 6, *Right*, the adjusted *P* values are (0.030, 0.064, 0.222, 0.286, 0.197, 0.831, 0.510, 0.185, and 0.126), and hence this gives a moderately strong indication of the correct rank. Although this may seem too good to be true based on the scree plot, one must remember that the *P* value estimation procedure uses more information in the correlation matrix than just the eigenvalues shown here.

## Discussion

Many, or most, modern statistical methods, in this era of big data, use some form of selection to choose a best-fitting model among a plethora of choices. We have seen two examples above: forward stepwise regression and the lasso. There are many more examples, including methods for time and spatial analysis, interpretation of images, and network analysis. For all of these methods, work is underway to develop tools for selective inference, so that the analyst can properly calibrate the strength of the observed patterns.

The methods described here are “closed form”; that is, the *P* values are derived from (complicated) formulae. There are also sampling-based methods under development, using Markov chain Monte Carlo and bootstrapping, that can provide improvements in power (12). We would be remiss if we did not mention a simpler, attractive approach to selective inference, namely sample splitting (see, e.g., ref. 13). Here we randomly divide the observations into, say, two parts: We do the model fitting and selection on one part, and then estimate *P* values and confidence intervals on the second part. This is a valid method but can suffer from a loss of power and difficulty in interpretation, because the model and results will change with a different random split. However, sample splitting may be more robust to assumptions about the form of the noise distribution of the data.
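The sample-splitting recipe is easy to sketch. The correlation-screening "selection" step below is a simplified stand-in for a full model-fitting procedure, and the null data are simulated; the key point is that the held-out half yields an honest *P* value for the selected feature, while re-testing on the selection half does not.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(6)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                     # null data: no feature is truly associated

perm = rng.permutation(n)
a, b = perm[:n // 2], perm[n // 2:]            # random split into two halves

# Selection on the first half: pick the feature most correlated with y
corrs = [abs(pearsonr(X[a, j], y[a])[0]) for j in range(p)]
j_star = int(np.argmax(corrs))

# Inference: honest P value on the held-out half vs. biased P value on the same half
p_honest = pearsonr(X[b, j_star], y[b])[1]
p_inflated = pearsonr(X[a, j_star], y[a])[1]

print(f"feature {j_star}: selection-half P = {p_inflated:.4f}, held-out P = {p_honest:.4f}")
```

Under the null, the held-out *P* value is uniformly distributed (valid), whereas the selection-half *P* value is the smallest of 50 and is badly biased toward zero. Repeating with a different split typically selects a different feature, which is the interpretability cost mentioned above.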

The work here adjusts for the actual selection procedure that was applied to the dataset. An alternative approach (14) adjusts for any possible selection procedure.

This is an exciting time for statisticians, and judging by the burgeoning interest in statistical and machine learning courses this excitement is being shared by other scientists and analysts who use statistics and data science. We expect that many software packages will soon offer implementation of these new tools.

The work discussed here represents a joint collaboration with our colleagues, students, and former students. References include 3, 11, 12, and 15–20. Interested readers who want to learn more about sparse methods in general may consult the forthcoming book (9), in which chapter 6 covers most of the material here in some detail and also discusses the covariance test of ref. 13, a simple asymptotic version of a test presented here, based on the exponential distribution. The latter paper contains interesting general discussion of the selective inference problem by a number of researchers.

## Acknowledgments

We would like to thank our many collaborators in this work, including Yun Jin Choi, Alexandra Chouldechova, Will Fithian, Max G'Sell, Jason Lee, Richard Lockhart, Dennis Sun, Yukai Sun, Ryan Tibshirani, and Stefan Wager. R.J.T. was supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Grant N01-HV-28183.

## Footnotes

^{1}To whom correspondence should be addressed. Email: tibs{at}stanford.edu.

This contribution is part of the special series of Inaugural Articles by members of the National Academy of Sciences elected in 2012.

Author contributions: J.T. and R.J.T. designed research; J.T. and R.J.T. performed research; and R.J.T. wrote the paper.

Reviewers: R.B., University of British Columbia; and J.D.S., Princeton University.

The authors declare no conflict of interest.

^{†}By way of contrast, in the unsupervised learning problem we observe just the features *x* and no outcome variable *y*, and we try to uncover the correlations and other dependencies between these features. Principal components analysis is a leading example and is discussed later in this article.

^{‡}A convex optimization is a standard and attractive form for a numerical problem, involving minimization or maximization of a convex function over a convex domain.

See QnAs on page 7621.

## References

- Rhee S-Y, et al.
- Taylor JE, Lockhart R, Tibshirani RJ, Tibshirani R.
- Lee JD, Sun DL, Sun Y, Taylor JE.
- Storey JD, Tibshirani R.
- G’Sell MG, Wager S, Chouldechova A, Tibshirani R.
- Candès E. In *Proceedings of the International Congress of Mathematicians*, eds Sanz-Solé M, Soria J, Varona JL, Verdera J (European Mathematical Society, Helsinki), Vol 3.
- Hastie T, Tibshirani R, Wainwright M.
- Bennett J, Lanning S. In *Proceedings of the KDD Cup and Workshop 2007*. Available at www.cs.uic.edu/∼liub/KDD-cup-2007/proceedings/The-Netflix-Prize-Bennett.pdf.
- Choi Y, Taylor J, Tibshirani R.
- Fithian W, Sun D, Taylor J.
- Taylor J, Loftus J, Tibshirani R.
- G’Sell MG, Taylor J, Tibshirani R.
- Loftus JR, Taylor JE.
- Reid S, Taylor J, Tibshirani R.
- Lee JD, Taylor JE.