Research Article

Framework for making better predictions by directly estimating variables’ predictivity

Adeline Lo, Herman Chernoff, Tian Zheng, and Shaw-Hwa Lo
  1. aDepartment of Politics, Princeton University, Princeton, NJ 08540;
  2. bDepartment of Statistics, Harvard University, Cambridge, MA 02138;
  3. cDepartment of Statistics, Columbia University, New York, NY 10027


PNAS first published November 29, 2016; https://doi.org/10.1073/pnas.1616647113
To whom correspondence may be addressed. Email: slo@stat.columbia.edu, chernoff@stat.harvard.edu, or tz33@columbia.edu.
Contributed by Herman Chernoff, October 13, 2016 (sent for review June 4, 2016; reviewed by David L. Banks and Ming Yuan)


Significance

Good prediction, especially in the context of big data, is important. Common approaches to prediction include using a significance-based criterion for evaluating variables to use in models and evaluating variables and models simultaneously for prediction using cross-validation or independent test data. The first approach can lead to choosing less-predictive variables, because significance does not imply predictivity. The second approach can be improved through considering a variable’s predictivity as a parameter to be estimated. The literature currently lacks measures that do this. We suggest a measure that evaluates variables’ abilities to predict, the I-score. The I-score is effective in differentiating between noisy and predictive variables in big data and can be related to a lower bound for the correct prediction rate.

Abstract

We propose approaching prediction from a framework grounded in the theoretical correct prediction rate of a variable set as a parameter of interest. This framework allows us to define a measure of predictivity that enables assessing variable sets for (preferably high) predictivity. We first define the prediction rate for a variable set and consider, and ultimately reject, the naive estimator, a statistic based on the observed sample data, due to its inflated bias for moderate sample size and its sensitivity to noisy useless variables. We demonstrate that the I-score of the partition retention (PR) method of variable selection (VS) yields a relatively unbiased estimate of a parameter that is not sensitive to noisy variables and is a lower bound to the parameter of interest. Thus, the PR method using the I-score provides an effective approach to selecting highly predictive variables. We offer simulations and an application of the I-score on real data to demonstrate the statistic’s predictive performance on sample data. We conjecture that using partition retention and the I-score can aid in finding variable sets with promising prediction rates; however, further research in the avenue of sample-based measures of predictivity is much desired.

  • prediction
  • variable selection
  • high-dimensional data
  • predictivity

Prediction is a highly important goal for many scientists and has become increasingly difficult as the quantity and complexity of available data have grown. Complex and high-dimensional data particularly demand attention. However, the literature on prediction does not yet have a clear theoretical framework that allows for characterizing a variable’s predictivity directly [see A Brief Literature Review on VS for a brief review on the literature of variable selection (VS)]. Rather, VS for variable sets in the context of prediction is currently conducted in two common ways. The first is VS through identification of variables correlated with the outcome, measured through tests of statistical significance—such as the chi-square test. The second is through VS of variables that seem to do well in an independent set of test data, as measured through testing sample error rates. The first approach is still very much in use for predicting health outcomes (see ref. 1, among others) but its prediction performance has been disappointing (e.g., refs. 1 and 2). We show in our related work (3) how and why the popular filter approach of VS through statistical significance does not serve the purpose of prediction well. For an intuitive illustration of the relationship between predictive and significant sets of variables, see Fig. 1. Under a significance-test-based search setting, the set of variables found to be significant expands as the sample size grows (Fig. 1, widening orange dotted ovals). However, the set of predictive variables (Fig. 1, blue circle) is not susceptible to sample-size changes in the same way—because predictivity is a population parameter—and overlaps, but is not perfectly aligned with, significant sets. It is easy to see that in this scenario targeting significant sets may miss the goal of prediction entirely. Instead, we suggest that emphasis must be placed on designing measures that directly evaluate variable sets’ predictivity.

Fig. 1. Illustration of the relationship between predictive and significant sets of variable sets. Rectangular space denotes all candidate variable sets. Significant sets are identified through traditional significance tests.

We show in ref. 3 that the first approach suffers from the problem that significant variables are not necessarily predictive, and vice versa, so targeting significant variables might miss the goal of VS for higher predictivity. This problem is prevalent in simple as well as complex data. The second approach to VS sets aside testing (or validation) data to see how well selected predictors might do on “new data.” However, as in the case of genome-wide association study (GWAS) data, researchers frequently lack sample sizes large enough for this approach to be efficient. Reuse of training data in the form of cross-validation is often adopted in practice.

We suggest that an alternative, and perhaps logical, approach to prediction should start with defining the theoretical prediction rate of a set of variables as a parameter of interest. It would be productive, then, to create statistics designed to estimate such a parameter directly, rather than relying on prediction rates estimated by cross-validation. We call such an approach “variable set assessment,” or VSA. We hope that designing measures that directly estimate a variable set’s true ability to predict may prove both fruitful and efficient in the use of sample data for good prediction. Here, we propose such a prediction-based framework. Grounded in statistical theory, we highlight an avenue of research toward creating sensible measures that target highly predictive variable sets through assessing their predictivity directly. We emphasize genetic data, although we will show that the methods proposed are easily tailored to other high-dimensional data in the natural and social sciences.

A Brief Literature Review on VS

A related and extremely important literature is that of VS or feature selection, which refers to the practice of selecting a subset of an original group of variables that is later used to construct a model. Often VS is used on data of large dimensionality with modest sample sizes (7). In the context of high-dimensional data, such as GWAS, this dimensionality reduction can be a crucial step. VS approaches are commonly proposed to efficiently search for the best variable sets according to a specified criterion. Most performance measures are developed to maximize the probability of selecting the truly important variables but are not direct measures of predictivity. Therefore, popular VS approaches do not return reliable assessment of the predictivity of variable sets. In contrast, we will propose considering VSA through a reliable, model-free measure used to assess the potential predictivity of a variable set. Unlike projection- or compression-based approaches (such as principal component analysis or use of information theory), VSA methods do not change the variables themselves.

The types of approaches and tools developed for feature selection are both diverse and varying in degrees of complexity. However, there is general agreement that three broad categories of feature selection methods exist: filter, wrapper, and embedded methods. Filter approaches tend to select variables through ranking them by various measures (correlation coefficients, entropy, information gains, chi-square, etc.). Wrapper methods use “black box” learning machines to ascertain the predictivity of groups of variables; because wrapper methods often involve retraining prediction models for different variable sets considered, they can be computationally intensive. Embedded techniques search for optimal sets of variables via a built-in classifier construction. A popular example of an embedded approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients, shrinking many to zero. Often cross-validation is used to evaluate the prediction rates.
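To make the embedded category concrete, here is a minimal sketch (ours, not from the original text) of LASSO-style selection using scikit-learn; the simulated data, the penalty strength, and all names are illustrative assumptions.

```python
# Illustrative embedded VS: an L1 (LASSO) penalty shrinks many coefficients
# exactly to zero, so selection happens as a by-product of fitting the model.
# Data and penalty strength are made up for illustration only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Only the first three variables carry signal.
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + rng.normal(scale=0.5, size=n)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("columns kept by the L1 penalty:", selected)  # typically includes 0, 1, 2
```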

Often, though not always, the goal of these approaches is statistical inference: the researcher is interested in understanding the mechanism relating the explanatory variables to a response. Although inference is clearly important, prediction is an important objective as well. For inference, the goal of these VS approaches is to infer the membership of variables in the “important set.” Various numerical criteria have been proposed to identify such variables [e.g., Akaike information criterion (AIC) and Bayesian information criterion (BIC), among others; see chapter 7 in ref. 8 for a review], which are associated with predictive performance under model assumptions made for the derivation of these criteria. However, these criteria were not designed to specifically correlate with predictivity. Indeed, we are unaware of a measure that directly attempts to evaluate a variable set’s theoretical level of predictivity. This paper proposes a model-free parameter for predictivity and its sample estimate. For a more comprehensive survey of the feature/VS literature see, among others, refs. 7, 9, 10, and 11.

Although a spectrum of VS approaches exists, many scientists have taken the approach of tackling prediction through the use of important and hard-to-discover influential variables found to be statistically significant in previous studies. When these efforts are in the context of high-dimensional data and alongside work investigating variables known to be influential, it might seem reasonable to hope that variables found to be significant can prove useful for predictive purposes as well. This approach is in some ways most similar to a univariate filter method, because it is independent of the classifier and has no cross-validation or prediction step for VS. As discussed in the introduction and in our related work (3), however, VS through statistical significance does not serve the purpose of prediction well: the set of significant variables expands with sample size, whereas the set of predictive variables is a population quantity that only partially overlaps it (Fig. 1). Targeting significant sets may therefore miss the goal of prediction entirely, and emphasis should instead be placed on designing measures that directly evaluate variable sets’ predictivity.

Many methods also use out-of-sample testing error rates or cross-validation to ascertain whether prediction is done well. This approach was not designed to specifically find a theoretically correct prediction rate for a given variable set; rather, it is simply a performance evaluation of future predictions from a pattern recognition technique on selected variable sets (trained on training data). Sometimes the variable sets in the training data are selected through statistics such as the adjusted R squared, AIC, or BIC. When p≫n (or even when p>n), a standard situation in big data, however, these statistics can fail to be useful.∗ Again, these criteria were not designed to be directly correlated with a given variable set’s predictivity. Out-of-sample testing and/or cross-validation techniques additionally either require setting aside valuable sample data, to ensure that the variable sets selected on the training set are indeed highly predictive rather than overfitting the data, or are computationally burdensome. It becomes important, then, to have a good screening mechanism for removing noisy variables (and thus finding predictive ones) when conducting VSA, even with constrained amounts of sample data. We show in our simulations how poorly training-set prediction rates serve VSA compared with out-of-sample testing prediction rates (with “infinite” future testing data—a mostly unattainable, but ideal, scenario). An ideal measure for predictivity (or a good VSA measure) reflects a variable set’s predictivity. In doing so, it would also guide VSA by screening out noisy variables, and it should correlate well with the out-of-sample correct prediction rate. We present a potential candidate measure, the I-score, for evaluating the predictivity of a given variable set in this paper.

Toy Example

To highlight some of our key issues, consider a small artificial example. Suppose an observed variable $Y$ is defined as
$$Y=\begin{cases}X_1+X_2\ (\text{modulo }2) & \text{with prob. } 1/2,\\ X_2+X_3+X_4\ (\text{modulo }2) & \text{with prob. } 1/2,\end{cases}\tag{1}$$
where $X_1,X_2,X_3$, and $X_4$ are 4 of 50 observed and potentially influential variables $\{X_i;1\le i\le 50\}$. Each $X_i$ can take values 0 and 1. A collection of discrete variables $S$ may be regarded as a discrete variable that takes on a finite number of values. Each value defined by $S$ constitutes a cell. The collection of all cells forms a partition, $\Pi_S$, based on the discrete variables in $S$. We also assume that the $X_i$ were selected independently to be 1 with probability 0.5, again the simplest case without affecting the general results. Clearly, none of the individual $X_i$ has a marginal effect on $Y$.

Scenario I. A statistician knows the model and wishes to compute which variable sets are predictive of $Y$, and how predictive, when $\mathbf{X}=(X_1,X_2,\dots,X_{50})$ is given. Because $Y$ depends only on the first four $X$ variables, it is obvious there are two clusters of variable sets, $S_1=\{X_1,X_2\}$ and $S_2=\{X_2,X_3,X_4\}$, that are potentially useful in his prediction. We treat the highest correct prediction rate possible for a given variable set as an important parameter and call this predictivity ($\theta_c$). Using the knowledge of the model, we can compute the predictivity for $S_1$ as $\theta_c(S_1)=0.75$. The predictivity for $S_2$ is $\theta_c(S_2)=0.75$ also. Incidentally, the predictivity of the union of $S_1$ and $S_2$, $\theta_c(S_1\cup S_2)$, is also 0.75.

The statistician realizes that using variable sets $S_1$ and $S_2$ he can predict $Y$ correctly 75% of the time. This is indeed the case because, for instance, upon observing $\mathbf{X}=(X_1,\dots,X_{50})$ the statistician predicts $\hat{Y}=X_1+X_2\ (\text{modulo }2)$.

It is easy to verify that the strategy of predicting with S1 returns a 75% prediction accuracy in expectation. This is also the highest percent accuracy S1 can theoretically achieve. We discuss this in depth shortly. This result extends to S2 as well.

Scenario II. In practice, the statistician rarely has knowledge of the model and instead observes only the data. We suggest that the statistician use the partition retention (PR) approach and its corresponding I-score (which we present formally in Alternative Measure: I-Score; see ref. 4 for the original presentation of the approach or see Eq. S2 for $I_{\Pi_{\mathbf{X}}}$) to identify the influential variable sets. Suppose with 400 observations the researcher wishes to identify variable sets with high predictivity and to infer their abilities to predict. Using the PR approach he can use the I-score to screen for variable sets with high potential predictivity. In this example, $S_1$ and $S_2$ are consistently returned with the highest I-scores (23.71 and 12.79) in simulations. Using the inequality in Eq. 7, which we derive in the following section, the lower bounds for the predictivities $\theta_c(S_1)$ and $\theta_c(S_2)$ are calculated to be 67% and 62%, respectively. Eq. 7 does not require knowledge of the true model as defined in Eq. 1.
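The numbers in Scenarios I and II can be checked by simulation. The sketch below (ours; it assumes numpy, and the sample size and seed are arbitrary) generates data from Eq. 1 and estimates the prediction rate of the rule based on $S_1$.

```python
# Simulate the toy model of Eq. 1 and estimate the predictivity of S1 = {X1, X2}
# by scoring the rule Y_hat = X1 + X2 (mod 2) on a large sample.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.integers(0, 2, size=(n, 50))             # 50 binary variables, P(Xi = 1) = 0.5
use_first_branch = rng.random(n) < 0.5           # which line of Eq. 1 generates Y
Y = np.where(use_first_branch,
             (X[:, 0] + X[:, 1]) % 2,            # X1 + X2 (mod 2)
             (X[:, 1] + X[:, 2] + X[:, 3]) % 2)  # X2 + X3 + X4 (mod 2)

Y_hat = (X[:, 0] + X[:, 1]) % 2                  # predict using S1 only
print("estimated correct prediction rate:", (Y_hat == Y).mean())  # close to 0.75
```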

Theoretical Prediction Rates

We contribute to the prediction literature by introducing the prediction rate as a parameter to be directly estimated. We show that the PR method’s I-score, a sample-based statistic, can be used to construct an asymptotically consistent lower bound for the prediction rate.

We deal here with the special case of case-control studies where the explanatory variables are discrete and the outcome variable takes only two values, case or control. These results are easily generalized for classification problems, where the dependent variable can take on a finite number of possible values. Consider GWAS data of the usual type, with cases and controls. Assume that there are $n_d$ cases and $n_u$ controls. Using the traditional Bayesian binary classification setting, we ideally have a prior probability, $\pi(w=d)$, that the state of the next individual, $w$, is a disease case, $d$, and $\pi(w=u)=1-\pi(w=d)$ that the next individual is a control, $u$. In the following we shall assume that both $d$ and $u$ are equally likely and that the cost of an incorrect classification is the same for both possibilities. We generalize to different cost functions and priors for $d$ and $u$ in Generalization to Arbitrary Priors and Generalization to Different Loss and Cost Functions. Let the joint distribution of the feature value $\mathbf{X}$ and $w$ be $P(\mathbf{x},w)$. The joint distribution can be expressed as $P(w,\mathbf{x})=\pi(w|\mathbf{x})\cdot P(\mathbf{x})=P(\mathbf{x}|w)\cdot\pi(w)$, where $\pi(w|\mathbf{x})$ is the posterior distribution and $\pi(w)$ is the prior. It is easy to see that the best classification rule can be derived by Bayes’ decision rule for minimizing the posterior probability of error: $d$ if $\pi(d|\mathbf{x})>\pi(u|\mathbf{x})$, otherwise $u$. Here the variable set is $\mathbf{X}=(X_1,X_2,\dots,X_m)$, with each $X_i$ taking one of the values in $\{0,1,2\}$, corresponding to the three possible genotypes for each SNP. In this way, $\mathbf{X}$ forms a partition, denoted by $\Pi_{\mathbf{X}}$, with $3^m=m_1$ elements: $\Pi_{\mathbf{X}}=\{\mathbf{X}=\mathbf{x}_j,\ j=1,\dots,m_1:\ \mathbf{x}_j=(x_{j1},x_{j2},\dots,x_{jm}),\ x_{jk}\in\{0,1,2\},\ 1\le k\le m\}$.

Assuming equal priors, that is, $\pi(d)=\pi(u)=\tfrac12$, the correct prediction rate $\theta_c$ on $\mathbf{X}$ using the full Bayes’ decision rule can be calculated as
$$\theta_c(\mathbf{X})=\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12\sum_{\mathbf{x}\in\Pi_{\mathbf{X}}}\max\{p_{\mathbf{X}}^d(\mathbf{x}),p_{\mathbf{X}}^u(\mathbf{x})\},$$
where $p_{\mathbf{X}}^d(\mathbf{x})$ and $p_{\mathbf{X}}^u(\mathbf{x})$ stand for $P(\mathbf{x}|w=d)$ and $P(\mathbf{x}|w=u)$, respectively. We can easily derive (see Technical Notes, Technical Note 1)
$$\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12+\frac14\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)-P(j|u)\bigr|.\tag{2}$$

This suggests that we can achieve better prediction rates by choosing variable sets corresponding to the probability pairs that lead to large values of $\sum_{j\in\Pi_{\mathbf{X}}}|P(j|d)-P(j|u)|$. In this theoretical setting, it is easy to show that $\theta_c$ increases or stays the same when another variable is added to the current variable set. This means adding many noisy variables leads to maintaining the same $\theta_c$. Therefore, when sample size is no constraint, we are never hurt in our search for highly predictive variables by simply adding explanatory variables to our current set. However, in the realistic world of sample-size constraints, a direct search for a variable set with a larger sample estimate of $\theta_c$ will fail; we offer a heuristic explanation as to why in the following section. We refer to this direct search of $\theta_c$ with sample data as the sample analog throughout.
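As a concrete illustration of Eq. 2 (ours, not from the original text; the four cell probabilities below are made up), the theoretical rate can be computed directly once $P(j|d)$ and $P(j|u)$ are known.

```python
# Direct evaluation of Eq. 2 under equal priors, using made-up cell probabilities.
import numpy as np

P_d = np.array([0.40, 0.30, 0.20, 0.10])   # P(j | d), sums to 1
P_u = np.array([0.10, 0.20, 0.30, 0.40])   # P(j | u), sums to 1

theta_c = 0.5 + 0.25 * np.abs(P_d - P_u).sum()
theta_c_alt = 0.5 * np.maximum(P_d, P_u).sum()   # equivalent "sum of cell maxima" form
print(theta_c, theta_c_alt)                      # both equal 0.70 here
```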

Problems with the Sample Analog.

The value of $\theta_c$ is unknown and must be estimated. We may naturally turn to the naive sample estimate of its true theoretical value, which is sometimes referred to as the training rate. However, this estimated value of $\theta_c$ (where the cell probabilities are replaced by the observed proportions) is nondecreasing with the addition of more variables to a given variable set under evaluation. As the partition becomes increasingly finer, we reach a point where there is at most a single observation within each partition cell and a 100% correct sample prediction rate is attained. This is true regardless of the true prediction rate. The final estimated prediction rate is then 100%, rendering it useless as a method for finding predictive variable sets and screening out noisy ones. This is a direct result of a sparsity problem that does not occur in our theoretical world but certainly plagues the sample-size-constrained real world (see Technical Notes, Technical Note 2 for a more detailed explanation). We need instead a sample-based measure that can discern adding noisy versus influential variables and identify variable sets with large prediction rates for a given moderate sample size.
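A small simulation (ours; assumes numpy, with purely noisy binary variables) makes the sparsity problem concrete: the training estimate of $\theta_c$ climbs toward 100% as the partition is refined, even though the true rate stays at 50%.

```python
# Demonstration of the sample-analog problem: the naive training estimate of
# theta_c increases toward 1 as purely noisy variables refine the partition.
import numpy as np

rng = np.random.default_rng(2)
n_d = n_u = 200
y = np.r_[np.ones(n_d), np.zeros(n_u)]
X = rng.integers(0, 2, size=(n_d + n_u, 12))   # 12 noisy binary variables

def training_rate(X_sub, y):
    # Naive estimate: 0.5 * sum_j max(p_hat(j|d), p_hat(j|u)) over observed cells.
    cells = {}
    for x, yi in zip(map(tuple, X_sub), y):
        d, u = cells.get(x, (0, 0))
        cells[x] = (d + yi, u + (1 - yi))
    return 0.5 * sum(max(d / y.sum(), u / (len(y) - y.sum())) for d, u in cells.values())

for m in range(1, 13):
    print(m, round(training_rate(X[:, :m], y), 3))  # creeps toward 1.0 as m grows
```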

Alternative Measure: I-Score.

We consider this obstacle and suggest an alternative measure, a lower bound to $\theta_c$, which we estimate using the I-score of the PR method (4) in sample data. The I-score converges asymptotically to a constant multiple of
$$\theta_I(\Pi_{\mathbf{X}})=\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)-P(j|u)\bigr]^2.\tag{3}$$

To relate θI to θc defined in Eq. 2, we first examine the following Lemma 1, which is derived in Technical Notes, Technical Note 3.

Lemma 1. For $K$ real values $\{z_j;1\le j\le K\}$ with $\sum_{j=1}^K z_j=a$ and $\sum_{j=1}^K |z_j|=b$, we have
$$\sum_{j=1}^K z_j^2\le\frac{a^2+b^2}{2}.\tag{4}$$

In the case of $z_j=P(j|d)-P(j|u)$ for $j\in\Pi_{\mathbf{X}}$, we have $a=0$. It then follows that
$$\sqrt{2\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)-P(j|u)\bigr]^2}\ \le\ \sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)-P(j|u)\bigr|.$$

This suggests that a strategy seeking variable sets with larger values of θI can have the parallel effect of encouraging selection of variable sets with larger values of θc, yielding better predictors. In the following, we present Theorem 1 and Corollary 2 (see Technical Notes, Technical Note 5 and Technical Note 6 for proofs).

Theorem 1. Under the assumptions that $n_d/n\to\lambda$, a value strictly between 0 and 1, and $\pi(d)=\pi(u)=1/2$, then
$$\lim_{n\to\infty}\frac{s_n^2\,I_{\Pi_{\mathbf{X}}}}{n}\ \overset{\mathcal{P}}{=}\ \lambda^2(1-\lambda)^2\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)-P(j|u)\bigr]^2,\tag{5}$$
where $\overset{\mathcal{P}}{=}$ indicates that the left-hand side converges in probability to the right-hand side and $s_n^2=n_d n_u/n^2$ (see Technical Notes, Technical Note 5 for more detail).

We now show that θI defined in Eq. 3 is a parameter relevant to θc(𝐗). Together with Lemma 1, we can use the I-score to derive a useful asymptotic lower bound to the prediction rate of a variable set 𝐗, θc(𝐗), as presented in Corollary 2.

Corollary 2. Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:
$$\theta_c(\mathbf{X})\ \overset{\mathcal{P}}{\ge}\ \frac12+\frac14\sqrt{2\lim_{n\to\infty}\frac{I_{\Pi_{\mathbf{X}}}}{n\,\lambda(1-\lambda)}}.\tag{6}$$

Using sample data, the estimated lower bound for $\theta_c$ is then
$$\frac12+\frac14\sqrt{\frac{2\,I_{\Pi_{\mathbf{X}}}}{n\,\lambda(1-\lambda)}}.\tag{7}$$

The lower bounds presented in the toy example were obtained using the above Eq. 7.
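For instance, the toy-example bounds can be reproduced from Eq. 7 with a one-line calculation (a sketch, ours; it assumes the I-scores reported in Scenario II, n = 400, and λ = 1/2).

```python
# Estimated lower bound of Eq. 7: 1/2 + (1/4) * sqrt(2 * I / (n * lam * (1 - lam))).
import math

def i_score_lower_bound(i_score, n, lam):
    return 0.5 + 0.25 * math.sqrt(2.0 * i_score / (n * lam * (1.0 - lam)))

# Toy-example values from Scenario II (n = 400, lambda = 1/2):
print(round(i_score_lower_bound(23.71, 400, 0.5), 2))  # ~0.67 for S1
print(round(i_score_lower_bound(12.79, 400, 0.5), 2))  # ~0.63 for S2 (reported as 62%)
```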

We extend to an arbitrary prior in Corollary 3 (see Generalization to Arbitrary Priors for discussion and proof).

Corollary 3. Under the assumptions of an arbitrary prior $\pi(d)$ and $n_d/n\to\lambda$ as $n\to\infty$, the correct prediction rate is
$$\theta_c^*[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12+\frac12\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)\pi(d)-P(j|u)\pi(u)\bigr|.\tag{8}$$

The last generalization of the proposed framework accounts for incurring different costs (or losses) when making incorrect predictions (see Generalization to Different Loss and Cost Functions for discussion). Note that searching for 𝐗 with larger I-scores is asymptotically equivalent to searching for larger values of the lower bound in Eq. 6 which is closely related to the correct predictivity of a given variable set 𝐗, θc(𝐗). For example, if a variable set 𝐗 has a large I-score (substantially larger than 1; see ref. 4), it is a strong indication that 𝐗 itself could be a variable set with high predictivity. This stands in contrast to many current approaches to prediction [e.g., random forest and least absolute shrinkage and selection operator (LASSO)] that are evaluated for predictivity via cross-validation, which is computer-intensive.

Desirable Properties of the I-Score.

We note that the I-score is one possible approach to approximating the prediction rate in the sample analog form, and that the search for other potential scores is desirable and needed. Nevertheless, several properties of I are particularly appealing.

First, I requires no specification of a model for the joint effect of {X1,X2,…,Xm} on Y because it is designed to capture the discrepancy between the conditional means of Y on {X1,X2,…,Xm} and the mean of Y. Second, as mentioned earlier, the I-score does not monotonically increase with the addition of any and all variables as would the sample analog form of θc. Rather, given a variable set of size m with m−1 truly influential variables, the I-score is typically higher under the influential m−1 variables than under all m variables. If m−1 variables are influential in the sense that any smaller subset of variables is less influential, then removal of a variable to size m−2 will decrease the I-score in expectation. This natural tendency of the I-score to “peak” at variable set(s) that lead to high predictivity in the face of noisy variables under the current sample size is crucial.

Most important to note, we showed that the I-score can help find variables with high $\theta_c$ by identifying variables that have high values of $\theta_I$ (recall $\theta_I=\sum_{j\in\Pi_{\mathbf{X}}}[P(j|d)-P(j|u)]^2$), which is related to the lower bound of $\theta_c$. An important step to finding these highly predictive variable sets and discarding noisy ones through high I-scores is the backward dropping algorithm (BDA) developed in ref. 4. The algorithm requires drawing many starting sets of variables and recursively dropping variables while calculating I-scores. For more information, see ref. 4 or BDA.

Generalization to Arbitrary Priors

A problem that emerges when dealing with case-control data such as GWAS is that prior information on observing the next person as a disease case is unknown and not easily estimated from empirical data. Priors are defined by circumstances and contexts within which the case-control data are sampled—each dataset requires its own unique and unknown prior at that point in time.

Corollary 3. Under the assumptions of an arbitrary prior $\pi(d)$ and $n_d/n\to\lambda$ as $n\to\infty$, the correct prediction rate can be easily seen to be
$$\theta_c^*[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12+\frac12\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)\pi(d)-P(j|u)\pi(u)\bigr|.$$

Let the modified score $I_{\Pi_n}^*$ be defined as
$$n\,s_n^2\,I_{\Pi_n}^*=\frac14\sum_{j\in\Pi_{\mathbf{X}}}n_j^2\left[\bar{y}_j\left(\frac{\pi(d)}{\lambda}\right)-(1-\bar{y}_j)\left(\frac{\pi(u)}{1-\lambda}\right)\right]^2.$$

Then we have
$$\lim_{n\to\infty}\frac{s_n^2\,I_{\Pi_n}^*}{n}\ \overset{\mathcal{P}}{=}\ \frac14\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)\pi(d)-P(j|u)\pi(u)\bigr]^2.\tag{S5}$$

Lower bounds similar to Corollary 2 can then be derived as
$$\theta_c^*[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12+\frac12\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)\pi(d)-P(j|u)\pi(u)\bigr|\ \ge\ \frac12+\frac12\lim_{n\to\infty}\sqrt{8\,\lambda(1-\lambda)\frac{I_{\Pi_n}^*}{n}-a^2},\tag{S6}$$
where $a=\sum_{j\in\Pi_{\mathbf{X}}}\bigl(P(j|d)\pi(d)-P(j|u)\pi(u)\bigr)=\pi(d)-\pi(u)$.

Similar to Corollary 2, Eq. S6 is a direct consequence of Eq. S5 and Lemma 1 (with $z_j$ replaced by $P(j|d)\pi(d)-P(j|u)\pi(u)$).

Generalization to Different Loss and Cost Functions

Thus far we have used a 0–1 loss on the binary classification problem. The 0–1 loss treats false negatives and false positives equally. In real applications, the scientist may wish to weigh the costs of different incorrect predictions differently. For instance, failing to detect a cancer patient may be deemed a more costly mistake than misclassifying a healthy patient, because ameliorating the former mistake later on can be more difficult. The different costs in making a loan decision are another example: the cost of lending to a defaulter may be seen as greater than the loss-of-business cost of declining a loan to a nondefaulter, due to some positive level of risk aversion. Let the loss function $L$ be defined as
$$L(d,u)=l_d,\qquad L(u,d)=l_u\tag{S7}$$
and
$$L(d,d)=L(u,u)=0,\tag{S8}$$
where $l_d$ and $l_u$ are the prices paid (or losses incurred) for misclassifying a diseased individual to the healthy class or a healthy person to the diseased class, respectively. We can derive the optimum Bayes’ solution by minimizing the expected predicted loss, that is, by assigning future observations to the class with the smaller expected loss, given the cell $j$. We simply assign a test sample with partition cell (predictor) $j$ to $d$ if
$$P(j|d)\,\pi(d)\,L(d,u)>P(j|u)\,\pi(u)\,L(u,d),$$
and otherwise assign it to $u$. Equivalently, choose $d$ if
$$P(j|d)\,\pi(d)\,l_d>P(j|u)\,\pi(u)\,l_u,$$
otherwise $u$. The expected loss of adopting this rule is thus
$$e_l=\frac12\sum_{j\in\Pi_{\mathbf{X}}}\min\{a_j,b_j\},$$
where $a_j=P(j|d)\pi(d)l_d$ and $b_j=P(j|u)\pi(u)l_u$. The random rule of classifying an individual to the healthy class or the disease class has an expected loss of
$$\gamma=\frac12\sum_j(a_j+b_j)=\frac12\bigl(\pi(d)l_d+\pi(u)l_u\bigr),$$
a constant independent of the partition $\Pi_{\mathbf{X}}$. The “gain” $\theta_c^l$ (interpreted as $\gamma$ less the expected loss of Bayes’ rule) can be defined as
$$\theta_c^l=\frac12\sum_{j\in\Pi_{\mathbf{X}}}\max\{a_j,b_j\}=\frac12\sum_{j\in\Pi_{\mathbf{X}}}(a_j+b_j)-e_l=\gamma-e_l.$$

Because $\gamma$ is independent of $\mathbf{X}$ and $\Pi_{\mathbf{X}}$, it is desirable to search for $\mathbf{X}$ with larger $\theta_c^l$ to achieve better “gains.” Again we have
$$\theta_c^l=\frac{\gamma}{2}+\frac{\theta_c^l-e_l}{2}=\frac{\gamma}{2}+\frac14\sum_{j\in\Pi_{\mathbf{X}}}|a_j-b_j|.$$

After standardizing by $\gamma$, we obtain the improved prediction rate as
$$\theta_c=\frac{\theta_c^l}{\gamma}=\frac12+\frac{1}{4\gamma}\sum_{j\in\Pi_{\mathbf{X}}}|a_j-b_j|.$$

Collecting the above discussion together, let the cost-based I-score $I_{\Pi_{\mathbf{X}}}^c$ be defined as
$$n\,s_n^2\,I_{\Pi_{\mathbf{X}}}^c=\frac{1}{4\gamma}\sum_{j\in\Pi_{\mathbf{X}}}n_j^2\left[\bar{y}_j\left(\frac{\pi(d)}{\lambda}\right)l_d-(1-\bar{y}_j)\left(\frac{\pi(u)}{1-\lambda}\right)l_u\right]^2\approx\frac{n^2}{4\gamma}\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)\pi(d)l_d-P(j|u)\pi(u)l_u\bigr]^2.\tag{S9}$$

We present the following lower bound in Corollary 4. Let
$$\sum_{j\in\Pi_{\mathbf{X}}}\bigl(P(j|d)\pi(d)l_d-P(j|u)\pi(u)l_u\bigr)=\pi(d)l_d-\pi(u)l_u=a.$$

Corollary 4. Under the assumptions of Corollary 2 and using the loss function $L$ described in Eqs. S7 and S8, then
$$\lim_{n\to\infty}\frac{s_n^2\,I_{\Pi_{\mathbf{X}}}^c}{n}\ \overset{\mathcal{P}}{=}\ \frac{1}{4\gamma}\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)\pi(d)l_d-P(j|u)\pi(u)l_u\bigr]^2.\tag{S10}$$

Furthermore, one can derive a similar lower bound for the correct prediction rate $\theta_c$ as
$$\theta_c=\frac12+\frac{1}{4\gamma}\sum_{j\in\Pi_{\mathbf{X}}}|a_j-b_j|\ \overset{\mathcal{P}}{\ge}\ \lim_{n\to\infty}\left(\frac12+\frac{1}{4\gamma}\sqrt{8\gamma\,\lambda(1-\lambda)\frac{I_{\Pi_{\mathbf{X}}}^c}{n}-a^2}\right)=\frac12+\frac{1}{4\gamma}\lim_{n\to\infty}\sqrt{8\gamma\,\lambda(1-\lambda)\frac{I_{\Pi_{\mathbf{X}}}^c}{n}-a^2}.\tag{S11}$$

The proofs for Eqs. S10 and S11 are quite similar to that for Corollary 3 given above; we shall omit them.
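As an illustration of the cost-sensitive rule above (ours; the cell probabilities, priors, and losses are made up), each cell is assigned to the class with the smaller expected loss by comparing $a_j$ with $b_j$.

```python
# Cost-sensitive Bayes rule per partition cell: predict d when a_j > b_j,
# where a_j = P(j|d) * pi(d) * l_d and b_j = P(j|u) * pi(u) * l_u.
# All numbers below are illustrative.
import numpy as np

P_d = np.array([0.50, 0.30, 0.20])   # P(j | d)
P_u = np.array([0.20, 0.30, 0.50])   # P(j | u)
pi_d, pi_u = 0.5, 0.5                # priors
l_d, l_u = 5.0, 1.0                  # missing a case is five times as costly

a = P_d * pi_d * l_d
b = P_u * pi_u * l_u
predict_d = a > b                                   # per-cell decisions
expected_loss = np.where(predict_d, b, a).sum()     # loss incurred by the chosen class
print(predict_d, expected_loss)
```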

Technical Notes

Technical Note 1: Alternative Formulation of the Theoretical Prediction Rate.

Recall that the expected error of adopting the above Bayes’ decision rule (under a 0–1 loss) is
$$\theta_e[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12\sum_{\mathbf{x}\in\Pi_{\mathbf{X}}}\min\{p_{\mathbf{X}}^d(\mathbf{x}),p_{\mathbf{X}}^u(\mathbf{x})\}.$$

The correct prediction rate $\theta_c$ on $\mathbf{X}$ is defined as
$$\theta_c(\mathbf{X})=\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=1-\theta_e[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12\sum_{\mathbf{x}\in\Pi_{\mathbf{X}}}\max\{p_{\mathbf{X}}^d(\mathbf{x}),p_{\mathbf{X}}^u(\mathbf{x})\},$$
where $\theta_e$ is the error rate. For simplicity of presentation, we can represent the above as
$$\theta_c=\frac12\sum_{j\in\Pi_{\mathbf{X}}}\max\{P(j|d),P(j|u)\},$$
where $j$ is short for $\mathbf{x}_j$, a cell in the partition $\Pi_{\mathbf{X}}$ formed by the variables $\mathbf{X}$.

It is easy to show that
$$\frac12\bigl\{\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]-\theta_e[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]\bigr\}=\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]-\frac12=\frac14\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)-P(j|u)\bigr|.$$

Therefore,
$$\theta_c[p_{\mathbf{X}}^d,p_{\mathbf{X}}^u]=\frac12+\frac14\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)-P(j|u)\bigr|.$$

Technical Note 2: Issue with Sample Analog of θc.

Suppose $\mathbf{X}_m=\{X_1,\dots,X_m\}$ and $\mathbf{X}_{m+1}=\{X_1,\dots,X_m,X_{m+1}\}$. The partition formed by $\mathbf{X}_m$ is
$$\Pi_{\mathbf{X}_m}=\{A_1,\dots,A_{m_1}\},$$
whereas the partition formed by $\mathbf{X}_{m+1}$ is
$$\Pi_{\mathbf{X}_{m+1}}=\{A_1\cap B,\dots,A_{m_1}\cap B,\ A_1\cap B^c,\dots,A_{m_1}\cap B^c\}=\{\Pi_{\mathbf{X}_m}\cap B,\ \Pi_{\mathbf{X}_m}\cap B^c\},$$
where $B=\{X_{m+1}=1\}$. Let
$$\Pi_{\mathbf{X}_m}^1=\Pi_{\mathbf{X}_m}\cap\{X_{m+1}=1\}\quad\text{and}\quad\Pi_{\mathbf{X}_m}^0=\Pi_{\mathbf{X}_m}\cap\{X_{m+1}=0\},$$
where $\Pi_{\mathbf{X}_m}^1$ and $\Pi_{\mathbf{X}_m}^0$ form two subpartitions of $\Pi_{\mathbf{X}_{m+1}}$, i.e., $\Pi_{\mathbf{X}_{m+1}}=\Pi_{\mathbf{X}_m}^0\cup\Pi_{\mathbf{X}_m}^1$. Then
$$\bigl|\hat{p}_{\Pi_{\mathbf{X}_m}}(d)-\hat{p}_{\Pi_{\mathbf{X}_m}}(u)\bigr|\le\bigl|\hat{p}_{\Pi_{\mathbf{X}_m}^0}(d)-\hat{p}_{\Pi_{\mathbf{X}_m}^0}(u)\bigr|+\bigl|\hat{p}_{\Pi_{\mathbf{X}_m}^1}(d)-\hat{p}_{\Pi_{\mathbf{X}_m}^1}(u)\bigr|,$$
where $\hat{p}(\cdot)$ is the sample estimator. We see that the sample analog inherently favors an increase in the number of partition cells (i.e., adding more variables).

Technical Note 3: Proof of Lemma 1.

It is obvious that $|a|\le b$. Let $S_1$ be the sum of the positive values of $z_j$ and $S_2$ the sum of the negative values. Let $T_1$ be the sum of the squares of the positive values and $T_2$ the sum of the squares of the negative values. It follows that $S_1+S_2=a$ and $S_1-S_2=b$, and thus $S_1=(a+b)/2$ and $S_2=(a-b)/2$. Then clearly $T_1\le S_1^2$ and $T_2\le S_2^2$. Consequently,
$$\sum_{j=1}^K z_j^2=T_1+T_2\le S_1^2+S_2^2=\frac{a^2+b^2}{2},\tag{S1}$$
which is equivalent to the inequality in Eq. 4; equality is attained when there are at most one positive and one negative component if $|a|<b$.
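A quick numerical spot check of Lemma 1 (ours; assumes numpy) on random vectors:

```python
# Numerical spot-check of Lemma 1: sum(z_j^2) <= (a^2 + b^2) / 2.
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    z = rng.normal(size=rng.integers(2, 20))
    a, b = z.sum(), np.abs(z).sum()
    assert (z ** 2).sum() <= (a ** 2 + b ** 2) / 2 + 1e-12
print("Lemma 1 held on all random draws")
```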

Technical Note 4: Technical Details on I-Score.

The influential score (I-score) is a statistic derived from the PR method. Several forms and variations were associated with the PR method before it was finally coined with this name in 2009 (4). We introduce the PR method and the I-score briefly here.†

Consider a set of $n$ observations of a disease phenotype $Y$ (dichotomous or continuous) and a large number $S$ of SNPs, $X_1,X_2,\dots,X_S$. Randomly select a small group, $m$, of the SNPs. Following the same notation as in previous sections, we call this small group $\mathbf{X}=\{X_k,k=1,\dots,m\}$. Recall that $X_k$ takes values 0, 1, and 2 (corresponding to the three genotypes for a SNP locus: A/A, A/B, and B/B). There are then $m_1=3^m$ possible values for $\mathbf{X}$. The $n$ observations are partitioned into $m_1$ cells according to the values of the $m$ SNPs ($X_k$’s in $\mathbf{X}$), with $n_j$ observations in the $j$th cell. We refer to this partition as $\Pi_{\mathbf{X}}$. The proposed I-score (denoted by $I_{\Pi_{\mathbf{X}}}$) is designed to place greater weight on cells that hold more observations:
$$I_{\Pi_{\mathbf{X}}}=\sum_{j=1}^{m_1}\frac{n_j}{n}\cdot\frac{(\bar{Y}_j-\bar{Y})^2}{s_n^2/n_j}=\frac{\sum_{j=1}^{m_1}n_j^2(\bar{Y}_j-\bar{Y})^2}{\sum_{i=1}^{n}(Y_i-\bar{Y})^2},\tag{S2}$$
where $s_n^2=\frac1n\sum_{i=1}^n(Y_i-\bar{Y})^2$. We note that the I-score is designed to capture the discrepancy between the conditional means of $Y$ on $\{X_1,X_2,\dots,X_m\}$ and the mean of $Y$.

In this paper, we consider the special problem of a case-control experiment where there are $n_d$ cases and $n_u$ controls and the variable $Y$ is 1 for a case and 0 for a control. Then $s_n^2=n_d n_u/n^2$, where $n=n_d+n_u$.
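A minimal implementation of Eq. S2 (ours, not the authors’ code; assumes numpy, with Y coded 0/1 as in the case-control setting) is sketched below.

```python
# I-score of Eq. S2 for a discrete variable set: partition the n observations
# by the observed joint values of X and weight squared cell-mean deviations by n_j^2.
import numpy as np

def i_score(X, y):
    """X: (n, m) array of discrete predictors; y: (n,) array of 0/1 outcomes."""
    X = np.asarray(X)
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    denom = ((y - y_bar) ** 2).sum()            # equals n * s_n^2
    # group rows by their joint cell in the partition formed by X
    _, cell_ids = np.unique(X, axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell_ids):
        yj = y[cell_ids == j]
        score += len(yj) ** 2 * (yj.mean() - y_bar) ** 2
    return score / denom

# Example: two informative binary variables and one noisy one
rng = np.random.default_rng(4)
x = rng.integers(0, 2, size=(500, 3))
y = (x[:, 0] + x[:, 1]) % 2                     # depends on X1 and X2 only
print(i_score(x[:, :2], y), i_score(x, y))      # the first score is larger
```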

Technical Note 5: Proof of Theorem 1.

We prove that the I-score approaches a constant multiple of θI asymptotically.

Under the null hypothesis of no association between $\mathbf{X}=\{X_k,k=1,\dots,m\}$ and $Y$, $I_{\Pi_{\mathbf{X}}}$ can be asymptotically expressed as $\sum_{j=1}^{m_1}\lambda_j\chi_j^2$ (a weighted average), where each $\lambda_j$ is between 0 and 1 and $\sum_{j=1}^{m_1}\lambda_j$ equals $1-\sum_{j=1}^{m_1}p_j^2$, where $p_j$ is cell $j$’s probability. The $\{\chi_j^2\}$ are $m_1$ chi-squares, each with one degree of freedom (see ref. 4).

Furthermore, the above formulation and properties of $I_{\Pi_{\mathbf{X}}}$ apply to the specified $Y$ model with a case-control study (where $Y=1$ designates case and $Y=0$ designates control), as demonstrated in ref. 4. More specifically, in a case-control study with $n_d$ cases and $n_u$ controls (letting $n=n_d+n_u$), $n s_n^2 I_{\Pi_{\mathbf{X}}}$ can be expressed as follows:
$$n\,s_n^2\,I_{\Pi_{\mathbf{X}}}=\sum_{j\in\Pi_{\mathbf{X}}}n_j^2(\bar{Y}_j-\bar{Y})^2=\sum_{j\in\Pi_{\mathbf{X}}}\bigl(n_{d,j}^m+n_{u,j}^m\bigr)^2\left(\frac{n_{d,j}^m}{n_{d,j}^m+n_{u,j}^m}-\frac{n_d}{n_d+n_u}\right)^2=\left(\frac{n_d n_u}{n_d+n_u}\right)^2\sum_{j\in\Pi_{\mathbf{X}}}\left(\frac{n_{d,j}^m}{n_d}-\frac{n_{u,j}^m}{n_u}\right)^2,$$
where $n_{d,j}^m$ and $n_{u,j}^m$ denote the numbers of cases and controls falling in the $j$th cell, and $\Pi_{\mathbf{X}}$ stands for the partition formed by the $m$ variables in $\mathbf{X}$. Since the PR method‡ seeks the partition that yields larger I-scores, one can decompose the following:
$$n\,s_n^2\,I_{\Pi_{\mathbf{X}}}=\sum_{j\in\Pi_{\mathbf{X}}}n_j^2(\bar{Y}_j-\bar{Y})^2=A_n+B_n+C_n,$$
where $A_n=\sum_{j\in\Pi_{\mathbf{X}}}n_j^2(\bar{Y}_j-\mu_j)^2$, $B_n=\sum_{j\in\Pi_{\mathbf{X}}}n_j^2(\bar{Y}-\mu_j)^2$, and $C_n=-2\sum_{j\in\Pi_{\mathbf{X}}}n_j^2(\bar{Y}_j-\mu_j)(\bar{Y}-\mu_j)$. Here, $\mu_j$ and $\mu$ are the local and grand means of $Y$, that is, $E(\bar{Y}_j)=\mu_j$ and $\bar{Y}=\mu=\frac{n_d}{n_d+n_u}$ for fixed $n$. It is easy to see that both terms $A_n$ and $C_n$, when divided by $n^2$, converge to 0 in probability as $n\to\infty$. We turn to the final term, $B_n$. Note that
$$\lim_n\frac{B_n}{n^2}\ \overset{\mathcal{P}}{=}\ \lim_n\sum_{j\in\Pi_{\mathbf{X}}}\left(\frac{n_j^2}{n^2}\right)(\mu_j-\mu)^2.$$

In a case-control study, we have
$$\mu_j=\frac{n_d P(j|d)}{n_d P(j|d)+n_u P(j|u)}\quad\text{and}\quad\mu=\frac{n_d}{n_d+n_u}.$$

Because for every $j$, $\frac{n_j}{n}$ converges (in probability) to $p_j=\lambda P(j|d)+(1-\lambda)P(j|u)$ as $n\to\infty$, if $\lim_n\frac{n_d}{n}=\lambda$, a fixed constant between 0 and 1, it follows that, as $n\to\infty$,
$$\frac{B_n}{n^2}=\sum_{j\in\Pi_{\mathbf{X}}}\left(\frac{n_j^2}{n^2}\right)(\mu_j-\mu)^2\ \overset{\mathcal{P}}{\to}\ \sum_{j\in\Pi_{\mathbf{X}}}p_j^2\left(\frac{\lambda P(j|d)}{\lambda P(j|d)+(1-\lambda)P(j|u)}-\lambda\right)^2=\sum_{j\in\Pi_{\mathbf{X}}}\bigl\{\lambda P(j|d)-\lambda[\lambda P(j|d)+(1-\lambda)P(j|u)]\bigr\}^2=\sum_{j\in\Pi_{\mathbf{X}}}\bigl\{\lambda(1-\lambda)P(j|d)-\lambda(1-\lambda)P(j|u)\bigr\}^2=\lambda^2(1-\lambda)^2\sum_{j\in\Pi_{\mathbf{X}}}\bigl[P(j|d)-P(j|u)\bigr]^2.$$

Thus, ignoring the constant term in the above equation, the I-score can guide a search for $\mathbf{X}$ partitions, which will lead to finding larger values of the summation term $\sum_{j\in\Pi_{\mathbf{X}}}[P(j|d)-P(j|u)]^2$. We have proven Theorem 1.

Technical Note 6: Proof of Corollary 2.

Under the assumptions in Theorem 1, the following is an asymptotic lower bound for the correct prediction rate:
$$\theta_c(\mathbf{X})\ \overset{\mathcal{P}}{\ge}\ \frac12+\frac14\sqrt{2\lim_{n\to\infty}\frac{I_{\Pi_{\mathbf{X}}}}{n\,\lambda(1-\lambda)}}.\tag{S3}$$

Proof: From Eq. 2,
$$\theta_c(\mathbf{X})=\frac12+\frac14\sum_{j\in\Pi_{\mathbf{X}}}\bigl|P(j|d)-P(j|u)\bigr|\ \overset{\text{(Lemma 1)}}{\ge}\ \frac12+\frac14\sqrt{2\sum_{j\in\Pi_{\mathbf{X}}}\bigl(P(j|d)-P(j|u)\bigr)^2}=\frac12+\frac14\sqrt{2\,\theta_I(\Pi_{\mathbf{X}})}\ \overset{\text{(Theorem 1)}}{\underset{\mathcal{P}}{=}}\ \frac12+\frac14\sqrt{2\lim_{n\to\infty}\frac{s_n^2\,I_{\Pi_{\mathbf{X}}}}{n\,\lambda^2(1-\lambda)^2}}\ \overset{\mathcal{P}}{=}\ \frac12+\frac14\sqrt{2\lim_{n\to\infty}\frac{I_{\Pi_{\mathbf{X}}}}{n\,\lambda(1-\lambda)}}.\tag{S4}\ \blacksquare$$

The asymptotic lower bound on the rate in Eq. 2 is a simple consequence of Lemma 1 and Theorem 1. In theory, the above corollary allows us to apply a useful lower bound for identifying good variable sets with large I-scores. In practice, however, once the variable sets are found (through their large I-scores), the true prediction rates can be greater than the identified lower bounds. Theorem 1 provides a simple asymptotic behavior of the I-score under some strict assumptions. We offer similar derivations, under two levels of relaxation of these constraints, in Generalization to Arbitrary Priors and Generalization to Different Loss and Cost Functions.

We remark that, with additional work, one can show that the convergence given in Theorem 1 can be extended to hold uniformly over all partitions $\{\Pi\}$ with a bounded number of cells and for all $\lambda$ that stay away from 0 and 1.

BDA

The BDA§ is a greedy algorithm to search for the variable subset that maximizes the I-score through stepwise elimination of variables from an initial subset sampled in some way from the variable space. The details are as follows.

  • i) Training set: Consider a training set $\{(y_1,x_1),\dots,(y_n,x_n)\}$ of $n$ observations, where $x_i=(x_{1i},\dots,x_{pi})$ is a $p$-dimensional vector of explanatory variables. Typically $p$ is very large. All explanatory variables are discrete.

  • ii) Sampling from variable space: Select an initial subset of $k$ explanatory variables $\mathbf{X}_b=\{X_{b_1},\dots,X_{b_k}\}$, $b=1,\dots,B$.

  • iii) Compute I-score: $I(\mathbf{X}_b)=\sum_{j\in\Pi_{\mathbf{X}_b}}n_j^2(\bar{Y}_j-\bar{Y})^2$.

  • iv) Drop variables: Tentatively drop each variable in 𝐗b and recalculate the I-score with one variable less. Then drop the one that gives the highest I-score. Call this new subset 𝐗′b, which has one variable less than 𝐗b.

  • v) Return set: Continue the next round of dropping on 𝐗′b until only one variable is left. Keep the subset that yields the highest I-score in the whole dropping process. Refer to this subset as the return set 𝐑b. Keep it for future use.

If no variable in the initial subset has influence on Y, then the values of I will not change much in the dropping process. However, when influential variables are included in the subset then the I-score will increase (decrease) rapidly before (after) reaching the maximum.
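The steps above can be sketched compactly as follows (our illustration, not the authors’ implementation; it assumes numpy and uses the unnormalized I-score of step iii).

```python
# Backward dropping algorithm (BDA) sketch: from a starting subset, repeatedly
# drop the variable whose removal gives the highest I-score, and keep the best
# subset seen anywhere along the dropping path.
import numpy as np

def raw_i_score(X, y):
    # Unnormalized I-score of step iii: sum_j n_j^2 * (Ybar_j - Ybar)^2.
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    _, cell_ids = np.unique(np.asarray(X), axis=0, return_inverse=True)
    return sum((cell_ids == j).sum() ** 2 * (y[cell_ids == j].mean() - y_bar) ** 2
               for j in np.unique(cell_ids))

def bda(X, y, start_vars):
    current = list(start_vars)
    best_vars, best_score = current[:], raw_i_score(X[:, current], y)
    while len(current) > 1:
        # tentatively drop each variable; keep the drop that yields the highest score
        scores = [(raw_i_score(X[:, [v for v in current if v != drop]], y), drop)
                  for drop in current]
        top_score, drop = max(scores)
        current.remove(drop)
        if top_score > best_score:
            best_vars, best_score = current[:], top_score
    return best_vars, best_score

# Example: variables 0 and 1 jointly determine y; 2, 3, and 4 are noise.
rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(400, 5))
y = (X[:, 0] + X[:, 1]) % 2
print(bda(X, y, start_vars=[0, 1, 2, 3, 4]))    # typically returns ([0, 1], ...)
```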

Using the I-Score in Sample-Constrained Settings

We have shown that I/n asymptotically approaches a constant multiple of θI (which is related to a lower bound of θc) and has several desirable properties. We take this opportunity to explore and illustrate an application of the I-score measuring predictivity with sample data. To provide additional evidence of the I-score’s ability to measure true predictivity, we consider a set of simulations for which we know the “true” levels of predictivity for all variable sets. We also provide a real data application on breast cancer for which the I-score approach has done very well in predicting.

We take a moment to comment that evaluating a variable set for predictivity, what we have called here VSA, is different from evaluating a given classifier, which is the prediction stage, usually following or in conjunction with VS. The latter considers evaluating f(𝐱), a special function f(⋅) applied to a particular set of explanatory variables 𝐱, for a given outcome variable y, whereas the former considers the potential predictivity of the set of explanatory variables 𝐱 for that outcome y for all possible f(⋅). Our work here focuses simply on VSA. Variable sets assessed as highly predictive in our framework can then be flexibly used in various models for prediction purposes as pleases the researcher.

We are now in an odd situation where we have identified variable sets that could not have been found using conventional approaches and yet we wish to evaluate the predictivity of our identified variable sets against these conventional approaches. Nevertheless, we endeavor to do so. A couple of options arise for comparison: the training prediction rate and the out-of-sample testing prediction rate. We will show that the I-score-based measure provides a useful and meaningful estimated lower bound to the correct prediction rate and correlates well with the out-of-sample test rate, whereas the training rate statistic, the sample analog of θc, does not. As such, our approach has an important benefit to prediction research: Compared with methods such as cross-validation of error rates, the I-score is efficient in the use of sample data, in the sense that it uses all observations instead of separating data into testing and training sets.

Simulations.

We offer simulations to illustrate how (i) the I-score can serve as a lower bound to the true predictivity of a given variable set even as noisy variables are adjoined, (ii) it can thereby serve as a screening mechanism, and (iii) seeking the maximum I-score when conducting a BDA leads to the variable set with the highest corresponding level of predictivity. BDA reduces a variable set one variable at a time, eliminating the weakest element until I reaches a peak.

We consider a module of three important variables $\{X_1,X_2,X_3\}$ (see Fig. 2 for the disease model used) among six unimportant variables $\{X_7,\dots,X_{12}\}$, using sample sizes of 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. (See Simulation Details for more detailed model setting and simulation details.) We demonstrated that $\frac{I}{n\lambda(1-\lambda)}$ estimates* $\theta_I$, which is related to an asymptotic lower bound (Eq. 6) for $\theta_c$, as $n\to\infty$. It would be helpful to see how $I$ performs at fixed, reasonable sample sizes. We compare the I-score-derived predictivity lower bound against the Bayes’ theoretical prediction rate in our simulations to illustrate this. The out-of-sample correct prediction rate is presented in the simulations here as a further benchmark against which the I-score can be compared when data are limited, as is the case in real-world applications. The out-of-sample correct prediction rate is derived from the most optimistic context achievable in the real world, whereby future testing data are infinite. In all of the simulations, the I-score of a set of influential variables drops when a noisy variable is added. This drop is subsequently seen in the I-score-derived bound for the correct prediction rate. The I-score can screen out noisy variables, which makes it useful in practical data applications.

Fig. 2. A three-SNP disease model.

To illustrate how these statistics fare in accurately capturing the level of predictivity of each variable set under consideration, we consider their performance given already having found X2 and X3 as important. We then add X1, which should ideally correspond with an increase in the statistic. We continue adding the remaining noisy variables one at a time to this “good” set of variables and observe how the statistics evaluate the new, larger set of variables for predictivity. In Fig. 3, violin plots show distributions of training rate, the I-score lower bound, and the ideal out-of-sample prediction rate under each setting across the simulations. Theoretical Bayes’ rate is also plotted as a reference, which remains flat when noisy variables are added. This is because the Bayes’ rate is defined purely by the partition formed from the informative variables and does not change when adjoining noisy variables (X7,…,X12) and creating finer partitions.

Fig. 3. Variable set size 3: Comparison of the training rate and the lower bound based on the I-score against the out-of-sample prediction rate. We compare two statistics, the I-score lower bound and the training-set prediction rate, against the out-of-sample prediction rate. The lower bound from the I-score is shown in red, the training-set prediction rate in dark blue, and the out-of-sample prediction rate in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (labeled in red for important variables and black for noisy ones) and all y axes correspond to the (correct) prediction rate. There are three important variables in this example, X1, X2, and X3. The top row of graphs compares the (red) I-score statistic against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training-set prediction rate against the (light blue) out-of-sample prediction rate. From left to right, the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

Several patterns emerge in these simulations. First, and most importantly, the I-score-derived prediction rate seems to be a reasonable lower bound to the Bayes’ rate. This holds even in moderate sample sizes.

The second pattern is that the estimated I-score lower bound peaks at the variable set that is inclusive of all influential variables (X1, X2, and X3) and no additional noisy variables. This is a characteristic of the out-of-sample correct prediction rate as well. For instance, if we consider the top row of Fig. 3 and start from the right of the x axes in each of the three plots with the largest set of variables inclusive of both influential and noisy variables (X1,X2,X3,X7,…,X12), continual removal of the noisy variables (sliding to the left of the x axis) until we reach the variable set (X1,X2, X3) results in higher predictivity as measured by the I-score lower bound. We can note that the I-score lower bounds drop upon further removing the influential X1 variable from the set (X1,X2,X3). Thus, the variable set that appears with the maximum I-score derived lower bound here both identifies the largest possible variable set of influential variables with no noisy variables and is also reflective of a conservative lower bound of the correct prediction rate for that variable set. We note that once we have found the variable sets with the highest I-scores and calculated the corresponding lower bound of the correct prediction rate, we can adjust this lower bound rate for its bias to derive an improved estimate of the correct prediction rate.

A third pattern is that the training rate suffers from overfitting when adjoining noisy variables even when the variable set includes a true influential subset of variables. If the variable set is irreducible, however, the training rate estimator reflects the Bayes’ correct prediction rate well; thus, the training rate estimator can perform reasonably well conditional on already identifying (X1,X2,X3). The training rate estimator cannot be used to screen to that variable set first, however.

Finally, and as we might expect, the training-set rate explodes due to overfitting in high dimensions as noisy variables are adjoined to the partition formed by the informative variables (X1, X2, X3). Although the training-set prediction rate seems to improve as the sample size increases, it cannot be used to screen out noisy variables and is therefore difficult to use as a statistic for selecting highly predictive variable sets. The predictivity rates found through this statistic also depart dramatically from the out-of-sample testing rate: the training rate ever-optimistically evaluates variable sets for their future predictions even when noisy variables are added. This stands in stark contrast to the out-of-sample prediction rate, which decreases as useless variables are added. We also notice that the I-score-derived prediction rate does not remain flat. It increases when a noisy variable is removed and the set is reduced to only influential variables, indicating an additional advantage of the I-score as a lower bound: the I-score prefers a simpler model even when the Bayes’ rate remains the same, selecting more parsimonious partitions that attain the Bayes’ rate and thereby more closely reflecting the out-of-sample prediction rate.

Recall that the correct prediction rate is based on an absolute difference of probabilities summed over all cells of the partition formed by the Xs. Suppose we start with influential variables only, with correct prediction rate θc∗, the highest we can attain out of all possible variable sets. Adding noisy variables to this set, variables that add no signal but simply create a finer partition, still returns θc∗. When estimating the correct prediction rate using sample data, though, the training estimate of θc generally keeps increasing as noisy variables are added; the researcher does not know when to stop the search for influential variables, making selection of highly predictive variables difficult. Ideally, we would like to “punish” the addition of such noisy variables to our variable set, so it is important to have a measure that balances favoring coarser partitions against recognizing genuinely new variables with strong enough signals (non-noisy variables). The I-score seems to support such an effect—preferring coarser partitions unless an additional variable (and therefore a finer partition) provides enough signal in the data to justify keeping it.

Noisy variables in sample data may be indicative of actually noisy variables or of influential variables with weak signals given the sample size. Thus, we note there are cases where the I-score might not recognize such variables, when their signals would require unrealistically large sample sizes to be detected through the measure. An example would be a good predictor that is highly complex (perhaps a combination of very many variables) with observations that are sparse in the partition. Because the I-score places greater weight on where the data tend to appear (note the $n_j^2$ term in the score), when most of the partition cells contain no observations or at most one observation, such a predictor can often look like noise.

The main draw of the I-score is its ability to screen for influential variable sets. Variable sets consisting of the three influential variables (X1, X2, and X3) alone display the highest I-scores. Searching for variable sets with the highest I-scores thus tends to return highly influential variables only. Using the training prediction rate as a guiding measure for screening, however, would continually seek ever-larger variable sets, regardless of whether they include noisy variables or not.

Real Data Application: van’t Veer Breast Cancer Data.

To reinforce the previous sections, we briefly analyze real disease data. As noted before, part of this research team has found that applying the PR approach to real disease data has not only been quite successful in finding variable sets (thus encompassing higher-order interactions, traditionally rather tricky in big data), but has also resulted in finding variable sets that are very predictive† yet do not necessarily show up as significant through traditional significance testing. We present one discovered variable set (a total of 18 variable sets were found in ref. 5) that is highly predictive for a breast cancer dataset but is not highly significant using a chi-square test.‡ In Table 1 we investigate the top five-variable set (in this case five genes) found to be predictive in ref. 5 through both a top I-score and prediction performance in cross-validation and an independent testing set. To gauge how significant these variables are, we calculate the individual, marginal association of each variable, reported as the marginal P value. Given the familywise P value threshold of 6.98 × 10−5, none of these variables appears statistically significant. The joint influence of all five variables is not significant either. Using the variable sets (all 18 in ref. 5) with the highest I-scores to predict on this dataset resulted in an out-of-sample testing error rate of 8%, in direct comparison with the literature’s best error rates of 30%. Using only the variable set displayed in Table 1 and the lower bound in Eq. 6, we can calculate the asymptotic lower bound of the correct prediction rate for this variable set as 59%. Thus, using this variable set alone, we can achieve at least a 59% correct classification rate. For details on the final predictors, see ref. 5.

Table 1. Real data example: van’t Veer breast cancer data (6)

Simulation Details

The simulation is based on a six-SNP disease model. The six SNPs are organized into two three-SNP modules, (X1, X2, X3) and (X4, X5, X6). Six additional variables (X7, …, X12) are simulated to be noisy and unrelated to the disease. The frequencies for the minor allele of each SNP are all 0.5. The risk of the disease for an individual depends on the two three-SNP genotypes of these two modules. Each module defines two sets of genotypes, high-risk genotypes and low-risk genotypes, defined identically for both modules as depicted in Fig. 2. If an individual has two low-risk genotypes, he has odds of 1/60 for having the disease. Here, odds is the ratio of the probability of an event occurring (disease) over the probability of the event not occurring (no disease). For an individual with one of the low-risk genotypes and one of the high-risk genotypes, the odds are increased to 1/10. If an individual has high-risk genotypes for both modules, the odds become 1. In this section, we present results for the first module (X1, X2, and X3). In Fig. S1 we present results for both modules, or all six SNPs, together.
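A data-generating sketch consistent with this description is given below (ours; because the high-risk genotype sets are specified only in Fig. 2, the “at least four minor alleles per module” rule used here is a placeholder assumption, as are the sample size and seed).

```python
# Illustrative data generator for the six-SNP disease model described above.
# The true high/low-risk genotype sets are given in Fig. 2; the rule used here
# (a module is "high risk" when its three SNPs carry at least 4 minor alleles)
# is only a placeholder standing in for that figure.
import numpy as np

rng = np.random.default_rng(6)

def simulate(n_individuals, maf=0.5):
    # genotypes 0/1/2 from two independent alleles with minor-allele frequency maf
    G = rng.binomial(2, maf, size=(n_individuals, 12))
    high1 = G[:, 0:3].sum(axis=1) >= 4          # module 1: X1, X2, X3 (placeholder rule)
    high2 = G[:, 3:6].sum(axis=1) >= 4          # module 2: X4, X5, X6 (placeholder rule)
    odds = np.select([high1 & high2, high1 ^ high2], [1.0, 1 / 10], default=1 / 60)
    p_disease = odds / (1 + odds)               # convert odds to probability
    y = rng.binomial(1, p_disease)
    return G, y                                 # columns 6-11 play the role of X7..X12

G, y = simulate(5000)
print("disease prevalence in simulated population:", y.mean())
# Case-control samples of fixed n_d and n_u would then be drawn from G, y.
```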

Fig. S1. Variable set size 6: Comparison of the training rate and I-score against the out-of-sample prediction rate. Again we compare two statistics, the I-score lower bound and the training-set prediction rate, against the out-of-sample prediction rate. The lower bound from the I-score is shown in red, the training-set prediction rate in dark blue, and the out-of-sample prediction rate in light blue. The thick black line in all six graphs is the true Bayes’ rate. All x axes correspond to variable sets (labeled in red for important variables and black for noisy ones) and all y axes correspond to the (correct) prediction rate. There are six important variables in this example, X1, X2, X3, X4, X5, and X6. The top row of graphs compares the (red) I-score statistic against the (light blue) out-of-sample prediction rate. The lower row of graphs compares the (dark blue) training-set prediction rate against the (light blue) out-of-sample prediction rate. From left to right, the graphs increase in sample size from 250 cases and 250 controls, to 500 cases and 500 controls, to 1,000 cases and 1,000 controls.

The data can take on three sample-size levels: 250 cases/250 controls, 500 cases/500 controls, and 1,000 cases/1,000 controls. For each possible variable set we create a partition $\Pi$ and calculate $\hat{p}_j^d$ and $\hat{p}_j^u$ (the estimated probabilities of cell $j$ among cases and among controls), respectively $\frac{n_{jd}}{n_d}$ and $\frac{n_{ju}}{n_u}$, where $j=1,\dots,m_1$ and $m_1=|\Pi|$ is the size of the partition $\Pi$. We conducted 300 simulations and evaluated a set of statistics on each of the variable sets for each simulation: the training prediction rate, Bayes’ prediction rate, out-of-sample prediction rate, and the I-score-derived lower bound estimate of the predictivity rate; see Fig. 3. Throughout, we assume a prior probability of (0.5, 0.5) for case and control. The statistics are detailed below, and a short code sketch after the list illustrates how they can be computed:

  • i) Training prediction rate, defined as $\frac{1}{2}\sum_{j=1}^{m}\max\left(\hat{p}_{jd},\,\hat{p}_{ju}\right)$.

  • ii) Bayes’ rate: Recall that this rate is constant across all variable sets that include the truly influential variables, regardless of how many noisy variables are also included. This is the best predictivity one can achieve if knowledge of the influential variables is available. It is defined as $\frac{1}{2}\sum_{j=1}^{m}\max\left(p_{jd},\,p_{ju}\right)$.

  • iii) Out-of-sample prediction rate: This is computed on the “infinite” future data to obtain $p_{jd}$ and $p_{ju}$ for the rate. Such “infinite” future data are often unrealistic with real data, but we present this rate for the purposes of this simulation, to provide a clear gold standard against which to compare. With $\hat{Y}_j$ denoting the training-based prediction for cell $j$ ($\hat{Y}_j = 1$ if the cell is predicted to be a case and $0$ otherwise), it is defined as $\frac{1}{2}\sum_{j=1}^{m}\left[p_{jd}\,\hat{Y}_j + p_{ju}\left(1-\hat{Y}_j\right)\right]$.

  • iv) I-score lower bound predictivity rate, as defined in Eq. 7.
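The following is a minimal sketch, in Python, of how statistics i–iii might be computed for a single variable set’s partition under the equal (0.5, 0.5) priors assumed above. The cell counts and true cell probabilities in the example are hypothetical, and the I-score-derived lower bound of statistic iv is omitted because it depends on Eq. 7, which is not restated here.

```python
import numpy as np

def prediction_rates(n_jd, n_ju, p_jd, p_ju):
    """Statistics i)-iii) for one variable set's partition, with the
    (0.5, 0.5) case/control priors assumed throughout this section.

    n_jd, n_ju : training counts of cases / controls in each cell j
    p_jd, p_ju : true probabilities that a case / control falls in cell j
    """
    phat_jd = n_jd / n_jd.sum()          # estimated P(cell j | case)    = n_jd / n_d
    phat_ju = n_ju / n_ju.sum()          # estimated P(cell j | control) = n_ju / n_u

    training_rate = 0.5 * np.maximum(phat_jd, phat_ju).sum()   # statistic i)
    bayes_rate = 0.5 * np.maximum(p_jd, p_ju).sum()            # statistic ii)

    # training-based rule: predict "case" in cell j iff phat_jd >= phat_ju
    y_hat = (phat_jd >= phat_ju).astype(float)
    out_of_sample = 0.5 * (p_jd * y_hat + p_ju * (1.0 - y_hat)).sum()  # statistic iii)

    return training_rate, bayes_rate, out_of_sample

# Toy four-cell partition with hypothetical counts and true cell probabilities
n_jd = np.array([120.0, 60.0, 40.0, 30.0])
n_ju = np.array([40.0, 60.0, 70.0, 80.0])
p_jd = np.array([0.45, 0.25, 0.18, 0.12])
p_ju = np.array([0.15, 0.25, 0.27, 0.33])
print(prediction_rates(n_jd, n_ju, p_jd, p_ju))
```

In the simulation itself, the true cell probabilities come from the generative disease model, whereas the cell counts are tabulated from the training cases and controls falling in each cell of $\Pi$.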

Concluding Remarks

Prediction has become more important in recent decades and, with it, the need for tools suited to good prediction. A first step can be to assess variable sets for predictivity, which we call VSA. We show in other work (3) that assessing variables for prediction by a statistical significance criterion is not ideal. A currently popular alternative is to select variables via sample-based, out-of-sample testing error rates. This approach is ad hoc in nature, sample based, and does not measure any theoretical underlying level of predictivity for a given variable set. Often, validation of selected candidate variable sets requires setting aside valuable sample data for out-of-sample testing or cross-validation. Sometimes the sample size may not suffice for validating variable sets larger than one or two variables, as is often the case in big data such as GWAS. Cross-validation avoids the need to set aside sample data as independent test sets but is computationally difficult in big data. As such, prediction research would benefit from a theoretical framework that directly defines a variable set’s predictivity as a parameter of interest to estimate. We believe the work here is a preliminary but important effort in that direction, considering what theoretically highly predictive variable sets are and how we might find them. Indeed, measures such as the I-score could mark an important new direction in the prediction literature, because the I-score neither uses the training-sample prediction rate nor requires an artificial or ad hoc regularization choice.

We identify the equation for the theoretical correct predictivity of variable sets ($\theta_c$) in Eq. 2 and then demonstrate that, unfortunately, the training estimate of it is of little use. We therefore offer an alternative measure. We show that $I_n$ asymptotically approaches a lower bound to the $\theta_c$ of Eq. 2 and is thus correlated with the correct predictivity rate of a given variable set. Importantly, we show that the I-score has a natural tendency to discard noisy variables, keep influential ones, and asymptotically approach this lower bound to $\theta_c$. The I-score does well in identifying predictive variable sets in both our complex simulations and our real data application.

We note that other measures with such desirable properties may also exist, and we encourage rigorous research in this direction. As a new field of inquiry, the search for measures that maximize predictivity may do much to advance the prediction of outcomes of interest, such as disease status. In some ways, this work is motivated by a practical consideration of finite samples. As noted in the setup of our framework, in a theoretical world of limitless data we can in fact find the variable sets with the highest values of $\theta_c$. Our real world of finite sample sizes, however, requires sample-appropriate measures that may approximate but not attain that $\theta_c$. In other words, based on the available sample size, the I-score (and any other such measure) detects not necessarily the maximum $\theta_c$ but some $\theta^{H}_{c,n}$, the largest correct prediction rate $\theta_c$ for which the corresponding X variables can be selected given $n$. Consider a situation in which the true set of variables $\mathbf{X}^{*}$ providing the theoretical maximum $\theta_c^{*}$ is very large, and suppose we have a sample of data that is quite modest. Selecting all variables in $\mathbf{X}^{*}$ is not possible at this sample size $n$ (too many of the cell frequencies are small or zero), so a measure such as the I-score retrieves a set $\mathbf{X}'$ that provides potentially the largest $\theta_c$ achievable under the sample constraint. This in some ways mirrors the familiar failure, in statistical significance testing, to detect true effects when the sample size is too small.

We leave the important question of how to combine identified predictive variable sets into final prediction models outside the scope of this paper.

Simulation Results for Important Variable Set of Size 6

Here we present simulation results in Fig. S1 for the six SNPs according to the model described in the main text. All other simulation parameters were the same as the three-SNP example.

Acknowledgments

This research is supported by National Science Foundation Grant DMS-1513408.

Footnotes

  • ¹To whom correspondence may be addressed. Email: slo@stat.columbia.edu, chernoff@stat.harvard.edu, or tz33@columbia.edu.
  • Author contributions: S.-H.L. initiated and oversaw the project; A.L., H.C., T.Z., and S.-H.L. designed research; A.L., H.C., T.Z., and S.-H.L. performed research; A.L., T.Z., and S.-H.L. analyzed data; and A.L., H.C., T.Z., and S.-H.L. wrote the paper.

  • Reviewers: D.L.B., Duke University; and M.Y., University of Wisconsin–Madison.

  • The authors declare no conflict of interest.

  • *This assumes that $s_n^2 \rightarrow \lambda(1-\lambda)$ as $n \rightarrow \infty$.

  • †Here “predictive” refers both to a high I-score and to high correct prediction rates in k-fold cross-validation testing.

  • ‡We note an inherent difficulty in presenting the reverse situation: finding the most significant variable sets in the breast cancer data and determining their predictivity rates. This is precisely because the PR approach allows for higher-order interaction searches, which are more difficult with current common approaches. Although it is possible to use common approaches to discover marginally significant variables, or possibly two-way interactions, and then determine their predictivity rates, capturing up to five-way interactions (as shown in our presentation here using the PR approach) is not yet feasible with current common approaches as of this writing.

  • *“Unfortunately, the Cp, AIC, and BIC approaches are not appropriate in the high-dimensional setting, because estimating σ^2 (variance) is problematic. Similarly, problems arise in the application of the adjusted R2 in the high-dimensional setting, because one can easily obtain a model with an adjusted R2 value of 1” (12).

  • †We use GWAS data to motivate our presentation of the I-score and PR method, but the approach applies to any data with discrete explanatory variables.

  • ‡The PR method encompasses a BDA that is introduced in ref. 5; we directly cite and present the BDA in Supporting Information.

  • §The presentation of the BDA is taken directly from section 2.2 of ref. 5. For further details, see ref. 5.

  • This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1616647113/-/DCSupplemental.

References

  1. Gransbo K, et al. (2013) Chromosome 9p21 genetic variation explains 13% of cardiovascular disease incidence but does not improve risk prediction. J Intern Med 274:233–240.
  2. Zheng SL, et al. (2008) Cumulative association of five genetic variants with prostate cancer. N Engl J Med 358:910–919.
  3. Lo A, Chernoff H, Zheng T, Lo SH (2015) Why significant variables aren’t automatically good predictors. Proc Natl Acad Sci USA 112(45):13892–13897.
  4. Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: A method of partitions. Ann Appl Stat 3(4):1335–1369.
  5. Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21):2834–2842.
  6. van’t Veer LJ, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871):530–536.
  7. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517.
  8. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, New York), 2nd Ed.
  9. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182.
  10. Hua J, Tembe WD, Dougherty ER (2009) Performance of feature-selection methods in the classification of high-dimension data. Pattern Recogn 42(3):409–424.
  11. Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2013) A review of feature selection methods on synthetic data. Knowl Inform Syst 34(3):483–519.
  12. James G, Witten D, Hastie T, Tibshirani R (2014) An Introduction to Statistical Learning: With Applications in R (Springer, New York).