# Why significant variables aren’t automatically good predictors


Contributed by Herman Chernoff, September 17, 2015 (sent for review December 15, 2014)

## Significance

A recent puzzle in the big data scientific literature is that an increase in explanatory variables found to be significantly correlated with an outcome variable does not necessarily lead to improvements in prediction. This problem occurs in both simple and complex data. We offer explanations and statistical insights into why higher significance does not automatically imply stronger predictivity and why variables with strong predictivity sometimes fail to be significant. We suggest shifting the research agenda toward searching for a criterion to locate highly predictive variables rather than highly significant variables. We offer an alternative approach, the partition retention method, which was effective in reducing prediction error from 30% to 8% on a long-studied breast cancer data set.

## Abstract

Thus far, genome-wide association studies (GWAS) have been disappointing: investigators have been unable to use the identified, statistically significant variants in complex diseases to make predictions useful for personalized medicine. Why are significant variables not leading to good prediction of outcomes? We point out that this problem is prevalent in simple as well as complex data, in the sciences as well as the social sciences. We offer a brief explanation and some statistical insights on why higher significance cannot automatically imply stronger predictivity and illustrate through simulations and a real breast cancer example. We also demonstrate that highly predictive variables do not necessarily appear as highly significant, thus evading the researcher using significance-based methods. We point out that what makes variables good for prediction versus significance depends on different properties of the underlying distributions. If prediction is the goal, we must lay aside significance as the only selection standard. We suggest that progress in prediction requires efforts toward a new research agenda of searching for a novel criterion to retrieve highly predictive variables rather than highly significant variables. We offer an alternative approach that was not designed for significance, the partition retention method, which was very effective for prediction on a long-studied breast cancer data set, reducing the classification error rate from 30% to 8%.

An early 2013 *Nature Genetics* article (1), “Predicting the influence of common variants,” identified prediction as an important goal for current genome-wide association studies (GWAS). However, a puzzle that has recently arisen in the GWAS-related literature is that an increase in newly identified variants (variables) does not necessarily seem to lead to improvements in current predictive models. Although intuitively it would seem that the addition of information (more statistically significant variants) should increase predictive power, in recent models of prediction the power is not increased when adding more significant variants to classical significance test-based approaches (2–5). [We refer to “statistically significant” variables throughout this paper as simply “significant.”]

A typical GWAS study collects data on a sample of subjects: cases, who have a disease, and controls, who are disease-free. A very large list of single-nucleotide polymorphisms (SNPs) is evaluated for each individual where each SNP corresponds to a given locus on the genome, and can take on the value 0, 1, or 2 depending on how many copies of the “minor” allele show up. The SNPs are distributed over the whole genome. Typically the researcher wants to select a subgroup of the SNPs that is associated with the disease, so that she can study how the disease works. She may also be interested in predicting whether a new individual has the disease by analyzing the individual’s selected SNPs.

Whether or not an individual has the disease is regarded as the dependent variable. [Here we focus on discrete outcomes, as is common in GWAS studies that are case-control.] The SNP values are the explanatory variables. In a typical study there may be several thousand subjects and hundreds of thousands of SNPs. From the scientist’s point of view there are two basic problems, complicated by the large size of the data set. These are variable selection and prediction. For variable selection, we wish to find a relatively small set of SNPs associated with the disease. For prediction we wish to find how a small set of such variables can be used to predict whether the subject has the disease. The size of the data set is such that the typical approach to variable selection has been to see how well correlated each SNP value is with the disease, and to keep only those for which the statistical significance was very high. Only recently has there been serious consideration of the possible interactions among two or more SNPs by some investigators. The prediction problem has typically been approached by using some variation of linear regression based on the limited number of SNPs from the variable selection stage.

If predictivity is measured by how well the method works on the (training) data used to derive the predictions, we are almost bound to get overoptimistic results. Methods of cross-validation will result in more accurate estimates. Alternatively one may use a separate test sample, independent of the data used to produce the prediction model. Much of our discussion is also relevant to large data sets in other fields of study. Indeed, this problem is not unique to genetic data; we find cases of similar problems in the social sciences. For instance, significant explanatory variables for civil wars contribute almost nothing to predicting civil wars (6). Likewise, variables found to be significant for fluctuations in the stock market index carry no predictive power (7). This phenomenon is pervasive across different types of data as well as different sample sizes. Thus, the goal of this paper is to offer theoretical insight and illuminating examples to demonstrate precisely how finding highly significant variables is different from finding highly predictive ones, regardless of data type. For illustrative purposes, however, we use the lens of prediction for genetic data throughout.

One might ask why one method of variable selection that works perfectly well for a significance-based research question might not work so well for a classification-based research question. Fundamentally, the main difference is that what constitutes a good variable for classification and what constitutes a good variable for significance depend on different properties of the underlying distributions. The test for significance is a test of the null hypothesis that the distributions of *X* under the two states are the same, whereas the classification error is a test of whether *X* belongs to one state or the other. Different properties of the distributions are involved. The tests used also may or may not be efficient. In fact, significance was not originally designed for the purposes of prediction.

Some might also comment that perhaps it is clear and intuitive why it is that some significant variables do not appear as highly predictive. After all, variables may be significantly associated with the outcome simply for a small group of individuals in the population, thereby leading to poor prediction on the population. This is true to an extent. However, there is still a fair amount of research using significant variables to predict, perhaps because of a lack of obvious alternative options for variable selection. For instance, currently, prediction-oriented GWAS research uses genetic variants for constructing additive prediction models for estimating disease risk. A recent *New England Journal of Medicine* article illustrates one example of such an approach, whereby researchers constructed a model based on five genetic variants from GWAS results on prostate cancer; the researchers report that the variants do not increase predictive power (8). Likewise, Gränsbo et al. show that chromosome 9p21, although significantly associated with cardiovascular disease, does not improve risk prediction (9).

In addition, whereas the intuition behind significant variables not appearing predictive might be reasonably obvious, the fact that highly predictive variables do not appear necessarily as highly significant is perhaps less so. We discuss and then demonstrate this phenomenon with both a theoretical explanation and a series of examples. Finally, whereas superficially we might reason that indeed, significance cannot be the same as predictivity, why this is precisely so and what makes for their differences is also not quite so obvious.

With this in mind, we provide a short theoretical explanation for the differences between highly significant and highly predictive variables. We then demonstrate, with a series of artificial examples of increasing relevance, how and why seeking significance and prediction can lead to very different decisions in variable selection. These examples are artificial, partly because they assume that the underlying probabilities are known, whereas the scientist can only infer these from the data. In these examples we compare significance and prediction, and show how the relatively simple *I* score, defined in *Materials and Methods*, which we have used in our partition retention (PR) approach to variable selection (10–13), seems to correlate well with predictivity. We offer the *I* score as one possible useful tool in the study of increasing predictivity. We show a highly successful real application of the PR approach for increasing predictivity in the analysis of a longstanding data set on breast cancer, for which we show some results. Finally, some conclusions are offered to aid in the study of improving predictivity in GWAS research.

There is a long-established literature in statistics on classification with major applications to biology. In recent years the fields of pattern recognition, machine learning, and computer science became heavily involved, often with different terminology and new ideas adapted to the increasing size of the relevant data sets. In the *Supporting Information*, we present a very brief description of some of the techniques, approaches, and terminology.

## Highly Significant vs. Highly Predictive Variables

Data has substantially grown in recent years with both exponential increases in the number of variables and, in many cases, increases in sample sizes as well. This has served as stimulation for a large number of applications via the novel retooling of well-known concepts. Two popular concepts, statistical significance and prediction (including classification), serve as the focus of this article. Historically, significance has played a larger role in statistical inference whereas prediction has served more in identifying future data behavior. The retooling of significance has found a role in data dimension reduction for prediction, that of guiding the feature selection/variable selection step (14). We evaluate this retooling and consider how significance and predictivity are related in the goal of good prediction.

We have mentioned that a key difference between what makes a variable highly significant versus highly predictive lies in different properties of their underlying distributions. We elaborate on this point a bit more here.

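Suppose a statistician is given a variable set denoted by *X*. It is assumed that among control observations *X* follows a distribution $f_H$ and among case observations *X* follows a distribution $f_D$.

To carry out a test of the null hypothesis $f_D = f_H$ against the alternative $f_D \neq f_H$, the statistician chooses a test statistic, evaluates it on the observed values *x* of *X* for the *n* cases and *n* controls, calculates the corresponding *P* value, and rejects the null hypothesis if the *P* value is sufficiently small.

To decide whether *x*, the observed value of *X* for a single individual, comes from the distribution $f_D$ or $f_H$, the best rule under equal prior probabilities and equal costs of error predicts D when $f_D(x) > f_H(x)$ and H otherwise. Thus, we may write the correct prediction rate as

$$\text{prediction rate} = \frac{1}{2}\sum_{x}\max\{f_D(x),\, f_H(x)\}, \tag{1}$$

where *x* represents the possibly multivariate observation that can assume a finite number of values. [If *X* were continuous, Eq. **1** would be written with integrals rather than summations.] The prediction rate of Eq. **1** defined above requires the knowledge of the true probability distributions, whereas, in practice, the statistician can only infer such knowledge from the data.

The key difference between finding a subset of variables to be highly significant versus finding it to be highly predictive is that the former uses assumptions on, but no knowledge of, the exact distributions of the variables, whereas the latter, as shown in Eq. **1**, requires knowledge of both $f_D$ and $f_H$.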
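To make the dependence on the full distributions concrete, here is a minimal Python sketch of the best achievable prediction rate for a discrete variable. It assumes equal prior probabilities and equal costs of error, so that the rate is half the sum, over all values *x*, of the larger of the case and control probabilities. The genotype probabilities are hypothetical and the function name `bayes_prediction_rate` is introduced only for illustration.

```python
def bayes_prediction_rate(p_case, p_control):
    """Best achievable correct-prediction rate with equal priors and costs.

    For each value x the optimal rule predicts the class under which x is
    more probable, so the rate is 0.5 * sum_x max(p_case[x], p_control[x]).
    """
    if len(p_case) != len(p_control):
        raise ValueError("both distributions must cover the same values")
    return 0.5 * sum(max(d, h) for d, h in zip(p_case, p_control))


# Hypothetical genotype distributions over the SNP values 0, 1, 2.
p_control = [0.49, 0.42, 0.09]
p_case = [0.36, 0.48, 0.16]

rate = bayes_prediction_rate(p_case, p_control)
print(f"prediction rate: {rate:.3f}")
```

When the two distributions coincide the function returns 0.5, the rate of a coin flip. Nothing in the calculation involves a sample size or a test statistic; it needs only the two distributions themselves.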

Should the statistician still wish to pursue the significance route to identify variables that are highly predictive, he might wish to compare two subsets of explanatory variables to see which yields the smaller *P* value. Because of his limited knowledge of the underlying distributions he is restricted to tests that are not necessarily powerful enough. Often he is reduced to using a χ^{2} technique, recommended, for example, in the studies of complex diseases, which is not very powerful for the multiple variable cases. The suboptimality of the test procedure makes the significance level an unreliable basis for comparing subsets of variables and for assessing their usefulness in prediction. It is no surprise that searching for variables based on significance level and based on correct prediction rate can lead us in conflicting directions.

The statistician’s *P* value for the test is a random variable and here we have assigned the significance value to be the median of the *P* values, which we may calculate, knowing the probability distributions. The statistician sees only the *P* value. To make his prediction using *x*, in the case of equal sample sizes and equal costs of error, he can select for each observed value *x*, either D or H depending on whether there are more cases or controls in his samples corresponding to *x*. A naive estimate of the correct prediction rate, the training prediction rate, is obtained by simply using this method on the observed samples. It tends to be overoptimistic. Many sampling properties, such as the significance, the expected training prediction rate, and the median of the *I* score, can often be calculated conveniently by simulation.
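The overoptimism of the training prediction rate can be checked by simulation. The sketch below assumes a hypothetical variable with nine equally likely cells that carries no information about case or control status, so the true prediction rate is 0.5. The helper names, sample size, and number of replications are illustrative choices, not values from the paper.

```python
import random


def true_rate(p_case, p_control):
    """Prediction rate of the optimal rule under equal priors and costs."""
    return 0.5 * sum(max(d, h) for d, h in zip(p_case, p_control))


def training_rate(p_case, p_control, n, rng):
    """Draw n cases and n controls, then score the majority-vote rule
    (predict D in a cell when cases outnumber controls, else H)
    on the same data that were used to build it."""
    cells = list(range(len(p_case)))
    cases = rng.choices(cells, weights=p_case, k=n)
    controls = rng.choices(cells, weights=p_control, k=n)
    # In each cell the rule is correct for the larger of the two groups.
    correct = sum(max(cases.count(x), controls.count(x)) for x in cells)
    return correct / (2 * n)


rng = random.Random(0)
p_null = [1 / 9] * 9  # hypothetical: same distribution for cases and controls
rates = [training_rate(p_null, p_null, 50, rng) for _ in range(1000)]
mean_training = sum(rates) / len(rates)

print(f"true rate: {true_rate(p_null, p_null):.3f}")
print(f"mean training rate: {mean_training:.3f}")
```

Because the rule always sides with the larger group in each cell, the training rate can never fall below 0.5, and it exceeds 0.5 whenever any cell is unbalanced by chance. That is the source of the upward bias.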

Our next section uses artificial examples to illustrate how highly significant variables and highly predictive variables might differ.

## Three Examples

Although we are concerned with large data, our first few examples use only a few observations to cleanly illustrate the issues. The three examples are followed by comparisons based on a set of 546 more relevant and related examples, each involving six SNPs and many observations, summarized as example 4. These examples will show how and why significance and predictivity can differ and that the *I* score can serve as a useful sign of predictivity. They also show that the problems we run into in prioritizing significance instead of predictivity in our variable selection stage can grow with the complexity of the data. The comparisons in the last example require many simulations and are meant to demonstrate a complicated data scenario, more akin to a GWAS.

### Example 1.

For example 1, there is a single observation *X*, the distribution of which is normal with mean 0 and SD 1 under a hypothesis *H*, which can be thought of as health. But, there is an alternative hypothesis *K*, under which *X* has a normal distribution with mean 3 and SD 3. We wish to use *X* to determine whether *H* or *K* is the correct hypothesis. Our problem can be thought of as predicting or classifying the state of an individual yielding the observation *X*. It is a standard problem of testing the hypothesis *H* and we may regard large values of *X* as favoring *K* and suggesting rejection of *H*.

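Statistical theory tells us that the optimal test of *H* consists of rejecting *H* when the likelihood ratio is large. For any choice *c* of what constitutes large enough, we have two error probabilities: that of rejecting *H* when *H* is true and that of accepting *H* when *K* is true. Notice that if *c* increases it becomes harder to reject *H*, so the first error probability decreases and the second increases. The choice of *c* that minimizes the average of the two error probabilities gives the error rate we use to measure the predictivity of *X*.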

For this problem a plausible, if slightly suboptimal, test is to reject *H* when *X* is sufficiently large. For each possible value *x* of *X*, there is a probability, calculated under *H*, that *X* will be as large as *x* or larger. This probability is the *P* value when *X* = *x* is observed. Before observing *X*, we know that *X* and the *P* value are random variables. Under *H*, the *P* value is uniformly distributed between 0 and 1. If *X* is very good at discriminating between *H* and *K*, the *P* value will tend to be very small under *K*. We label the median value of the *P* value under *K* as the significance of *X*. In this case the median of *X* under *K* is 3, so the significance is the probability that a standard normal variable exceeds 3, about 0.0013. This is the problem of testing the null hypothesis *H* against the alternative *K*, and is related to the classification problem of deciding which of several (in this case two) situations applies. Thus, prediction, classification, and hypothesis testing are different names for the same problem.

Now suppose that there is another variable *Y* which is also normally distributed with mean 0 and SD 1 under *H*, but normally distributed with mean 0 and SD 0.05 under *K*. Here we can calculate the same two quantities. If we again reject *H* when *Y* is large, we obtain a significance (median *P* value under *K*) of 0.5, because the median of *Y* under *K* is 0. (A more sensible test would reject *H* when the absolute value of *Y* is too small.) Forgetting for the moment how silly the test is, let us consider the dilemma of the scientist who must decide, based on these numbers, whether to observe *X* or *Y*. He prefers *Y* if he decides on the basis of error rate or predictivity and *X* if the decision is based on significance. We refer to this situation where the preferred choice between *X* and *Y* depends on the use of significance or predictivity as a reversal.
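The two quantities in example 1 can be computed directly from the stated normal distributions. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the error rate to be the average of the two error probabilities of the likelihood-ratio rule (equal priors), evaluated by numerical integration, and takes the significance to be the median under *K* of the one-sided *P* value for the test that rejects *H* when the observation is large. The function names are hypothetical.

```python
from statistics import NormalDist


def bayes_error(dist_h, dist_k, lo=-20.0, hi=20.0, steps=80_000):
    """Average of the two error probabilities of the likelihood-ratio rule
    with equal priors: 0.5 * integral of min(f_H, f_K), trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * min(dist_h.pdf(x), dist_k.pdf(x))
    return 0.5 * total * width


def median_p_value(dist_h, dist_k):
    """Median under K of the one-sided P value P_H(X >= x).

    The P value decreases in x, so its median under K is the P value
    evaluated at the median of the observation under K."""
    return 1.0 - dist_h.cdf(dist_k.median)


h = NormalDist(0, 1)        # distribution of both X and Y under H
k_x = NormalDist(3, 3)      # X under K
k_y = NormalDist(0, 0.05)   # Y under K

err_x, err_y = bayes_error(h, k_x), bayes_error(h, k_y)
sig_x, sig_y = median_p_value(h, k_x), median_p_value(h, k_y)

print(f"X: error rate {err_x:.3f}, median P value {sig_x:.4f}")
print(f"Y: error rate {err_y:.3f}, median P value {sig_y:.4f}")
```

Under these assumptions *Y* has the lower error rate while *X* has the smaller median *P* value, which is the reversal described in the text.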

There are several explanations for the reversal. One is that there was some arbitrariness in our choices of measures of predictability and significance.

The following two examples, illustrated in Fig. 2, are more relevant and show the same sort of reversal under considerably more reasonable circumstances. They are also more conventional examples of obtaining significance for the test of a null hypothesis.

### Example 2.

In example 2 the outcome variable is case or control status. The explanatory variable *X* is the reading on one SNP for each of 500 cases and 500 controls, for which the probabilities under cases and controls are listed in the blue table in Fig. 2. In this case the minor allele frequency (MAF) is 0.5 and the odds ratio is close to 1 for each of the three possible observations 0, 1, and 2. For *Y*, based on the other SNP described in the red table in Fig. 2, the MAF is between 0.1 and 0.2 depending on what proportion of the population is healthy. For *Y*, the odds ratio varies from 4 to 1. In this example we use the χ^{2} test for the null hypothesis that the two distributions for case and control are the same. This yields a *P* value for each variable, summarized by its median. The figure also lists the median *I* score for both *X* and *Y*, which favors *X*, as does the prediction rate.

### Example 3.

Example 3 is also presented in Fig. 2. Here the variable *X* in the blue table consists of the outcome of two SNPs (two-way interaction effect). This outcome can fall in one of nine possible cells. The *I* score favors *X* as opposed to *Y* (in the red table), as does the prediction rate. Whereas the prediction rates are comparable, the median *P* values are wildly different. Note that in both plots of the distributions of the predictive variable sets (predictive VS) and significant variable sets (significant VS) in examples 1 and 2, the variable sets overlap, but large portions of the predictive variable sets are not significant and vice versa. In addition, in both examples the *I* score follows the preferred prediction rate and not the significance (median *P* values).

## Comparing Significance Tests with the *I* Score

Before drawing conclusions from the three examples, we present a more complex data simulation for example 4, which consists of a comparison of 546 related, more relevant cases with large numbers of subjects.

In these cases we deal with six independent but similar SNPs (encapsulating six-way interaction effects), and the observation for a given subject falls into one of 3^{6} = 729 possible cells. We compare the truth, that is, the theoretical prediction rate, with two measures in common use: the training prediction rate and the *P* value of the χ^{2} test. The latter two are medians of measures based on observed data and their calculation requires extensive simulations. The graphs show how poorly these correlate with truth until the number of subjects becomes very large. Whereas the *I* score and its median are also based on the data, Fig. 4 shows that it is very well correlated with the truth for modest sample sizes; at large sample sizes *I* is still better correlated with truth than are the training prediction rate and the χ^{2} test.

## Applying the *I* Score to Real Breast Cancer Data

To reinforce the previous section we turn to a brief examination of real disease data. As noted before, our research team has made heavy use of the *I* measure in a variable selection method called “partition retention.” This method, applied to real disease data, has not only been quite successful in finding possibly interacting influential variable sets but has also resulted in variable sets that are very predictive and do not necessarily show up as significant through traditional significance testing (10, 15, 16). Here “predictive” refers both to a high *I* score and to high correct prediction rates as determined by *k*-fold cross-validation. We present examples of some discovered variable sets found to be highly predictive for a real data set on breast cancer (17) that are not highly significant. When using these newly found variable sets, the team was able to reduce the error rate on prediction from the literature standard of about 30% to 8%.

In Table 1 we investigate the top five-variable module (subset of interacting variables) in the breast cancer data found to be predictive through both top *I* score and performance in prediction in cross-validation and an independent testing set in ref. 15. To find how significant these variables are, we calculate the marginal association of each variable with the outcome and report its marginal *P* value. When testing 1,000 variables having no effect, it is likely that some will have *P* values of around 0.001. Here, we have 4,918 variables and therefore require a correspondingly smaller *P* value for significance; none of the variables in the module attains a *P* value that is significant.

## Comments and Conclusion

In our exposition of the differences between highly predictive versus highly significant variable sets, we use artificial examples. We need to know the true relevant underlying probability distributions to treat the problem as one of testing a simple hypothesis against a known alternative for which statistical theory can calculate optimal tests and predictive rates. Our four simulated examples can demonstrate with clarity the reversals we see in choosing significant versus predictive variable sets. Real examples are more difficult because the researcher must rely on a limited number of individuals to infer the relevant distributions and the number of possible variables is huge. However, to demonstrate the potential usefulness of our proposed measure, we additionally provided the highly promising results of applying the *I* score to the real and well-known van’t Veer breast cancer data set (ccb.nki.nl/data/).

One may wonder whether the shortcoming of using significance is due to the custom of using marginal significance and not taking into account the possible interaction effects of groups of variables. In our examples the problem of reversals seems to increase when using significance-based measures on routine tests when dealing with groups of interacting variables. In example 4, six-way interactions are considered and traditional significance approaches do not capture predictive variable sets. However, using the PR approach based on the measure *I* for the variable selection stage does well for prediction. Finally, even when we can capture joint effects that are highly predictive, as in the case of the captured variable sets in the van’t Veer example, these groups of variables were not significant. Seeking highly predictive groups of variables through significance alone would not have retrieved these variable sets.

If that is the case, how did we manage to get good results in the breast cancer problem? We used the PR approach, relying heavily on the *I* score for the variable selection aspect. For reasons we only partly understand, the *I* score seems to correlate well with predictivity. Having selected the relatively small number of candidate “influential” variables, an intensive use of a variety of known techniques in classification was applied. These were more sophisticated than simple linear regressions.

The issue of obtaining high predictivity from large data demands study. We encourage exploration away from significance-based methodologies and toward prediction-oriented ones. We propose the *I* score and the PR method of variable selection as candidate tools for the latter.

## Materials and Methods

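The PR approach to variable selection depends heavily on the *I* score applied to small groups of explanatory variables. Suppose we have *n* observations on a disease phenotype *Y*. When dealing with a small group of *m* SNPs, each individual is represented by a value *Y* of the dependent variable and one of $m_1 = 3^m$ possible cells into which the *m* variables fall. Then the value of *I* is given by

$$I = \sum_{j=1}^{m_1} \frac{n_j}{n}\,\frac{(\bar{Y}_j - \bar{Y})^2}{s^2/n_j} = \frac{\sum_{j=1}^{m_1} n_j^2\,(\bar{Y}_j - \bar{Y})^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2},$$

where $Y_i$ is the *Y* value for the *i*th individual, $\bar{Y}$ is the mean of all *n Y* values, *s* is the SD of all *n Y* values (so that $n s^2 = \sum_i (Y_i - \bar{Y})^2$), $\bar{Y}_j$ is the mean of the *Y* values in cell *j*, $n_j$ is the number of individuals in cell *j*, and *n* is the total number of individuals. The measure *I* is a statistic which may be calculated from the observed data, and does not involve knowing the underlying distributions, as did truth in example 4.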
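The score is straightforward to compute from data. The following Python sketch assumes the form $I = \sum_j n_j^2(\bar{Y}_j - \bar{Y})^2 / \sum_i (Y_i - \bar{Y})^2$, which agrees with the symbol definitions given here; the function name `i_score` and the toy data are illustrative only.

```python
from collections import defaultdict


def i_score(y, cells):
    """I score of outcomes y over a partition given by per-subject cell labels.

    Assumed form: sum_j n_j^2 (Ybar_j - Ybar)^2 / sum_i (Y_i - Ybar)^2.
    """
    n = len(y)
    if n == 0 or n != len(cells):
        raise ValueError("y and cells must be non-empty and of equal length")
    y_bar = sum(y) / n
    total_ss = sum((v - y_bar) ** 2 for v in y)  # equals n * s^2
    if total_ss == 0:
        raise ValueError("the outcome must vary across subjects")
    sums, counts = defaultdict(float), defaultdict(int)
    for v, c in zip(y, cells):
        sums[c] += v
        counts[c] += 1
    numerator = sum(
        counts[c] ** 2 * (sums[c] / counts[c] - y_bar) ** 2 for c in counts
    )
    return numerator / total_ss


# Toy data: four subjects, two cells.
informative = i_score([1, 1, 0, 0], [0, 0, 1, 1])    # cells separate the outcomes
uninformative = i_score([1, 0, 1, 0], [0, 0, 1, 1])  # cell means equal the overall mean
print(informative, uninformative)
```

A cell label can be any hashable value, for example a tuple of SNP genotypes, so the same function applies to groups of several SNPs.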

The *I* score has several desirable properties. First, it does not require specification of a model for the joint effect of the *m* SNPs on *Y*. It is designed to capture the discrepancy between the conditional means of *Y* given the values of the SNPs and the overall mean of *Y*. Unlike ORs as a measure of effect, *I* captures and aggregates all discrepancy (signals) from all cells of the partition.

Second, under the null hypothesis that the subset has no effect on *Y*, the expected value of *I* remains nonincreasing when dropping variables from the subset. In other words, the *I* score is robust to changes to the number of SNPs, *m*. Moreover, *I* has the property that adjoining to the group another variable which is independent of *Y* will tend to decrease *I*; the PR method is based on selecting a group at random and sequentially eliminating those variables which diminish *I* the most, and retaining those for which *I* can no longer be diminished. Those variables that are retained most often from many randomly chosen groups are candidates for variable selection. The fact that *I* does not automatically increase as more variables are added to the group being measured is a good property of the *I* score.
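One reading of the elimination step can be sketched as a greedy backward search. This is an illustration under assumptions, not the authors' implementation: at each step it tentatively drops each variable, keeps the drop that leaves the highest *I*, continues until one variable remains, and returns the subset with the highest *I* seen along the way. The *I* score is assumed to have the form $\sum_j n_j^2(\bar{Y}_j - \bar{Y})^2 / \sum_i (Y_i - \bar{Y})^2$; `backward_drop` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def i_score(y, rows, subset):
    """I score of y over the cells formed by the variables in `subset`."""
    n = len(y)
    y_bar = sum(y) / n
    total_ss = sum((v - y_bar) ** 2 for v in y)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, row in zip(y, rows):
        cell = tuple(row[k] for k in subset)
        sums[cell] += v
        counts[cell] += 1
    # n_j * (Ybar_j - Ybar) equals (cell sum - n_j * Ybar).
    return sum((sums[c] - counts[c] * y_bar) ** 2 for c in counts) / total_ss


def backward_drop(y, rows, start):
    """Greedy backward elimination guided by the I score."""
    current = list(start)
    best_subset, best_score = list(current), i_score(y, rows, current)
    while len(current) > 1:
        # Tentatively drop each variable; keep the drop that leaves I highest.
        candidates = [[v for v in current if v != d] for d in current]
        current = max(candidates, key=lambda s: i_score(y, rows, s))
        score = i_score(y, rows, current)
        if score > best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy data: five binary variables; the outcome depends only on an
# interaction between variables 0 and 1 (exclusive or).
rows = [list(bits) for bits in product([0, 1], repeat=5)]
y = [r[0] ^ r[1] for r in rows]

retained, score = backward_drop(y, rows, start=[0, 1, 2, 3, 4])
print(sorted(retained), score)
```

In this toy case neither variable 0 nor variable 1 has any marginal association with the outcome, yet the search retains the pair because dropping an irrelevant variable raises *I* while dropping either interacting variable sends it to zero. In the PR method as described, such a search would be repeated from many randomly chosen starting groups.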

Finally, under the null hypothesis of no effect *I* acts like a weighted average of independent χ^{2}s with one degree of freedom. Therefore, *I* values substantially larger than 1 are worth noting.
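The null behavior can be checked by simulation. The sketch below assumes the form $I = \sum_j n_j^2(\bar{Y}_j - \bar{Y})^2 / \sum_i (Y_i - \bar{Y})^2$ and a hypothetical null setting in which each of 300 subjects (150 cases, 150 controls) falls at random into one of nine equally likely cells, independently of disease status.

```python
import random
from collections import defaultdict


def i_score(y, cells):
    """Assumed form: sum_j n_j^2 (Ybar_j - Ybar)^2 / sum_i (Y_i - Ybar)^2."""
    n = len(y)
    y_bar = sum(y) / n
    total_ss = sum((v - y_bar) ** 2 for v in y)
    sums, counts = defaultdict(float), defaultdict(int)
    for v, c in zip(y, cells):
        sums[c] += v
        counts[c] += 1
    # n_j * (Ybar_j - Ybar) equals (cell sum - n_j * Ybar).
    return sum((sums[c] - counts[c] * y_bar) ** 2 for c in counts) / total_ss


rng = random.Random(1)
n = 300
y = [1] * (n // 2) + [0] * (n // 2)  # 150 cases, 150 controls

null_scores = []
for _ in range(500):
    # Hypothetical null: cell membership is unrelated to case-control status.
    cells = [rng.randrange(9) for _ in range(n)]
    null_scores.append(i_score(y, cells))

mean_null = sum(null_scores) / len(null_scores)
print(f"mean I under the null: {mean_null:.2f}")
```

In this setting the average should come out somewhat below 1, so an observed *I* well above 1 stands out against the null behavior.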

## Online Supporting Materials I

### Simulation Details for Fig. 2 (Examples 2 and 3).

The prediction rate (proportion of correct predictions) of each variable set (of size 1 or 2) can be directly computed using the genotype frequencies specified.

Using sample sizes of 500 cases and 500 controls, we simulate *B* = 1,000 random case-control data sets by simulating genotype counts among cases and genotype counts among controls using the genotype frequencies specified. For each simulated data set for each variable set considered, we compute the statistic of the χ^{2} test of independence. For each variable set, we summarize these test statistics using histograms shown in Fig. 2, with a vertical bar indicating the corresponding 5% significance level.
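A minimal sketch of this kind of simulation for a single SNP follows. The genotype frequencies are hypothetical stand-ins, since the values from Fig. 2 are not reproduced in the text, and the function names are illustrative. The *P* value uses the fact that a χ^{2} variable with 2 degrees of freedom has upper tail probability exp(−*x*/2), which applies to a 2 × 3 table with no empty genotype column.

```python
import math
import random
import statistics


def chi2_statistic(case_counts, control_counts):
    """Pearson chi-square statistic for a 2 x k table of genotype counts."""
    n_case, n_control = sum(case_counts), sum(control_counts)
    total = n_case + n_control
    stat = 0.0
    for c, h in zip(case_counts, control_counts):
        column = c + h
        if column == 0:
            continue  # an empty genotype column contributes nothing
        for observed, row_total in ((c, n_case), (h, n_control)):
            expected = row_total * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


def simulate(p_case, p_control, n=500, b=1000, seed=0):
    """Simulate b case-control data sets with n cases and n controls each."""
    rng = random.Random(seed)
    genotypes = list(range(len(p_case)))
    stats = []
    for _ in range(b):
        cases = rng.choices(genotypes, weights=p_case, k=n)
        controls = rng.choices(genotypes, weights=p_control, k=n)
        stats.append(chi2_statistic(
            [cases.count(g) for g in genotypes],
            [controls.count(g) for g in genotypes],
        ))
    return stats


# Hypothetical genotype frequencies for SNP values 0, 1, 2.
p_control = [0.49, 0.42, 0.09]
p_case = [0.36, 0.48, 0.16]

critical = -2 * math.log(0.05)  # 5% critical value of chi-square, 2 df
alt_stats = simulate(p_case, p_control)
null_stats = simulate(p_control, p_control, seed=1)

median_p = math.exp(-statistics.median(alt_stats) / 2)
power = sum(s > critical for s in alt_stats) / len(alt_stats)
null_rejections = sum(s > critical for s in null_stats) / len(null_stats)

print(f"median P value: {median_p:.2e}")
print(f"share above the 5% critical value: {power:.3f} (null: {null_rejections:.3f})")
```

A histogram of `alt_stats` with a vertical line at `critical` would correspond to the kind of display described for Fig. 2.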

### Simulation Details for Figs. 3 and 4 (Example 4).

We generate VSs of size 6

To simulate case-control data of a complex disease, we generated a fixed base-level vector of ORs (denoted by *γ* is between 1 and 2. For a given simulated VS, the actual OR for the partition cells is *γ* between 1 and 2 were considered (denoted as “OR” in Figs. 3 and 4). In total, there were 26

For each specified VS, we first computed the theoretical Bayes rate, based on the population frequencies and ORs (Fig. 3). Using 2,000 independent simulations under each VS, given a sample size specification, we evaluated (*i*) the average training prediction error, (*ii*) *P* value from χ^{2} test of independence, and (*iii*) our proposed estimated prediction rate using PR’s *I*. (*i*) and (*ii*) are commonly used approaches in the current literature.

## Online Supporting Materials II

Variable selection or feature selection refers to the approach of selecting a subset of an original group of variables to construct a model. Often feature selection is used on data of large dimensionality with modest sample sizes (18). In the context of high-dimensional data, such as GWAS, with perhaps redundant or irrelevant information, this dimensionality reduction can be a very important step. Unlike projection- or compression-based approaches (such as principal component analysis or use of information theory), feature selection methods do not change the variables themselves.

The types of approaches and tools developed for feature selection are both diverse and varying in degrees of complexity; however, there is general agreement that three broad categories of feature selection methods exist. These are filter, wrapper, and embedded methods. Filter approaches tend to select variables through ranking them by various measures (correlation coefficients, entropy, information gains, χ^{2}, etc.). Wrapper methods use black-box learning machines to ascertain the predictivity of groups of variables; because wrapper methods train similar predictive models for each subset of variables, they can be computationally intensive. Embedded techniques search for optimal sets of variables via a built-in classifier construction. A popular example of an embedded approach is the least absolute shrinkage and selection operator (LASSO) method for constructing a linear model, which penalizes the regression coefficients, shrinking many to zero. Often cross-validation is used to evaluate the prediction rates. For a more comprehensive survey of the feature selection literature see, among others, refs. 14, 18–20.

Although a spectrum of feature selection approaches exists, many scientists have taken the approach of tackling prediction through the use of important and hard-to-discover influential variables found to be statistically significant in previous studies. In the context of high-dimensional data and in the spirit of further investigating variables known to be influential, it is reasonable to hope that these same variables can prove useful for predictive purposes as well. This approach is in some ways most similar to a univariate filter method, as it is independent of the classifier and has no cross-validation or prediction step for variable selection. Our purpose here is to address this popular approach of variable selection via statistical significance. We illustrate how and why the popular filter approach of variable selection through statistical significance might be different from variable selection through predictivity.

## Acknowledgments

This research is supported by National Science Foundation Grant DMS-1513408.

## Footnotes

^{1}To whom correspondence may be addressed. Email: slo@stat.columbia.edu or chernoff@stat.harvard.edu.

Author contributions: A.L., H.C., T.Z., and S.-H.L. designed research; A.L., H.C., T.Z., and S.-H.L. performed research; A.L., T.Z., and S.-H.L. analyzed data; and A.L., H.C., T.Z., and S.-H.L. wrote the paper.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1518285112/-/DCSupplemental.

## References

- 5. Janssens AC, van Duijn CM
- 6. Ward MD, Greenhill BD, Bakke KM
- 7. Welch I, Goyal A
- 10. Chernoff H, Lo SH, Zheng T
- 11. Lo SH, Zheng T
- 12. Zheng T, Chernoff H, Hu I, Ionita-Laza I, Lo SH (eds Lu HHS, Scholkopf B, Zhao H)
- 15. Wang H, Lo SH, Zheng T, Hu I
- 16. Lo SH, Chernoff H, Cong L, Ding Y, Zheng T
- 18. Saeys Y, Inza I, Larrañaga P

## Article Classifications

- Biological Sciences
- Biophysics and Computational Biology

- Physical Sciences
- Statistics
