Distributional conformal prediction

Edited by Emmanuel J. Candès, Stanford University, Stanford, CA, and approved October 5, 2021 (received for review April 24, 2021)
November 23, 2021
118 (48) e2107794118

Significance

Prediction problems are important in many contexts. Examples include cross-sectional prediction, time series forecasting, counterfactual prediction and synthetic controls, and individual treatment effect prediction. We develop a prediction method that works in conjunction with many powerful classical methods (e.g., conventional quantile regression) as well as modern high-dimensional methods for estimating conditional distributions (e.g., quantile neural networks). Unlike many existing prediction approaches, our method is valid conditional on the observed predictors and efficient under some conditions. Importantly, our method is also robust; it exhibits unconditional coverage guarantees under model misspecification, under overfitting, and with time series data.

Abstract

We propose a robust method for constructing conditionally valid prediction intervals based on models for conditional distributions such as quantile and distribution regression. Our approach can be applied to important prediction problems, including cross-sectional prediction, k–step-ahead forecasts, synthetic controls and counterfactual prediction, and individual treatment effects prediction. Our method exploits the probability integral transform and relies on permuting estimated ranks. Unlike regression residuals, ranks are independent of the predictors, allowing us to construct conditionally valid prediction intervals under heteroskedasticity. We establish approximate conditional validity under consistent estimation and provide approximate unconditional validity under model misspecification, under overfitting, and with time series data. We also propose a simple “shape” adjustment of our baseline method that yields optimal prediction intervals.
We develop a robust approach for constructing prediction intervals based on models for conditional distributions. The proposed method is generic and can be implemented using a great variety of flexible and powerful methods, including conventional quantile regression (QR) (1) and distribution regression (DR) (e.g., refs. 2 and 3), as well as nonparametric and high-dimensional machine learning methods such as quantile neural networks (e.g., ref. 4) and quantile trees and random forests (e.g., refs. 5 and 6).
We observe data $\{(Y_t, X_t)\}_{t=1}^T$, where $Y_t$ is a continuous outcome of interest and $X_t$ is a $p \times 1$ vector of predictors. Our task is to predict $Y_{T+1}$ given knowledge of $X_{T+1}$. This setting encompasses many classical cross-sectional and time series prediction problems. Moreover, our approach can be applied to synthetic control settings where the goal is to predict counterfactuals in the absence of a policy intervention (e.g., refs. 7 and 8) and to the problem of predicting individual treatment effects (e.g., refs. 9 and 10).
With independent and identically distributed (iid) (or exchangeable) data, standard conformal prediction methods, which are based on modeling the conditional mean, yield prediction intervals $\hat{C}_{1-\alpha}$ that satisfy
P(Y_{T+1} \in \hat{C}_{1-\alpha}(X_{T+1})) \geq 1 - \alpha
[1]
for a given miscoverage level $\alpha \in (0,1)$. A prediction interval satisfying this property is said to be unconditionally valid. Unconditionally valid prediction intervals guarantee accurate coverage on average, treating $(Y_{T+1}, X_{T+1})$ and $\{(Y_t, X_t)\}_{t=1}^T$ as random.
However, in many applications, unconditional validity may be unsatisfactory. Let us consider three examples; refs. 11 and 12 have further examples and discussions. First, from a fairness perspective, data-driven recommendation systems should guarantee equalized coverage across protected groups, in which case the goal is to construct prediction intervals that are valid conditional on a protected attribute such as race or gender (11). Second, as in Predicting Stock Market Returns, consider the problem of predicting stock returns given the realized volatility. Since the distribution of returns is more dispersed when the variance is higher, a natural prediction algorithm should yield wider prediction intervals for higher values of volatility. That is, the prediction interval should be valid conditional on the known value of realized volatility rather than on average. Third, as in Predicting Wages Using CPS Data, suppose our goal is to predict wages based on an individual’s education and experience. An unconditionally valid prediction interval exhibits coverage 90% on average across all individuals but may contain the true wage of high school dropouts with no work experience with probability zero. A more useful prediction interval should exhibit correct coverage conditional on an individual’s observed education and experience and contain the true wage with 90% probability for every single individual.
Motivated by this discussion, we develop a distributional conformal prediction (DCP) method for constructing prediction intervals that are approximately valid conditional on the full vector of predictors $X_{T+1}$ while treating $Y_{T+1}$ and $\{(Y_t, X_t)\}_{t=1}^T$ as random:
P(Y_{T+1} \in \hat{C}_{1-\alpha}(X_{T+1}) \mid X_{T+1}) \geq 1 - \alpha + o_P(1).
[2]
A prediction interval satisfying property Eq. 2 as $T \to \infty$ is said to be approximately conditionally valid.*
While the requirement in Eq. 2 is natural in many applications, there are also other notions of conditional validity. Instead of conditioning on $X_{T+1}$ (object conditional), one can also study the conditional coverage probability given the training sample $\{(Y_t, X_t)\}_{t=1}^T$ (training conditional) or given $Y_{T+1}$ (label conditional) or combinations of them; ref. 15 has a detailed discussion. By proposition 2 of ref. 15, inductive conformal predictions (also known as split-sample conformal predictions) automatically achieve training conditional validity as long as the training sample is large enough. In classification problems (the support of $Y_{T+1}$ is a finite set), label conditional validity is often of great interest as it is important to know the error rates for different categories and provides useful information on false-positive and false-negative rates (15). In ref. 15, label conditional validity is achieved by forming the conformity score within each category. Both training and label conditional validity can be achieved in a distribution-free way (i.e., for a given procedure, the conditional validity holds for any distribution of the data).
However, object conditional validity in the sense of Eq. 2 cannot be achieved in a distribution-free way for nontrivial predictions. By refs. 12, 13, and 15, any prediction set satisfying Eq. 2 for every probability distribution of $(X_t, Y_t)$ has infinite Lebesgue measure with nontrivial probability. Therefore, we only aim to achieve Eq. 2 for a limited class of probability distributions. The construction of the proposed prediction set $\hat{C}_{1-\alpha}$ relies on learning the conditional distribution of $Y_t \mid X_t$, and we only hope for conditional validity in Eq. 2 within the class of distributions that can be learned well. In particular, this class consists of the distributions satisfying our regularity conditions.
Our empirical results demonstrate the importance of using DCP instead of standard conformal prediction methods based on modeling the conditional mean. When predicting daily stock returns in Predicting Stock Market Returns, the coverage probability of the 90% mean-based conformal prediction interval can drop to around 50% when the realized volatility is high. By contrast, DCP provides a coverage probability close to 90% for all values of realized volatility. This finding is important since volatility tends to be high during periods of crisis when accurate risk assessments are most needed. When predicting wages in Predicting Wages Using CPS Data, we find that the DCP prediction intervals contain the true wage with probability close to 90% for most individuals, whereas standard mean-based conformal prediction intervals either substantially under- or overcover.
To motivate DCP, note that a conditionally valid prediction interval is given by
[Q(\alpha/2, x),\; Q(1-\alpha/2, x)],
[3]
where $Q(\tau, x)$ is the $\tau$ quantile of $Y_t$ given $X_t = x$. To implement the prediction interval Eq. 3, a plug-in approach would replace $Q$ with a consistent estimator $\hat{Q}$:
[\hat{Q}(\alpha/2, x),\; \hat{Q}(1-\alpha/2, x)].
[4]
This approach has two well-known drawbacks. First, it often undercovers in finite samples (e.g., ref. 16). Second, it is neither conditionally nor unconditionally valid under misspecification.
We build upon conformal prediction (17, 18) and use the conditional ranking as a conformity score. This choice is particularly useful when working with regression models for conditional distributions such as QR and DR. Our method is conditionally valid under correct specification, while the construction of the procedure as a conformal prediction method guarantees the unconditional validity under misspecification. Let $F(y, x) = P(Y_t \leq y \mid X_t = x)$ denote the conditional cumulative distribution function (CDF) of $Y_t$ given $X_t = x$. Throughout the paper, we assume that $F(\cdot, X_t)$ is a continuous function almost surely. Our method is based on the probability integral transform, which states that the conditional rank, $U_t \equiv F(Y_t, X_t)$, has the uniform distribution on (0, 1) and is independent of $X_t$.
To construct the prediction interval, we test the plausibility of each $y \in \mathbb{R}$. By the probability integral transform, conditional on $X_{T+1}$, $F(Y_{T+1}, X_{T+1})$ belongs to $[\alpha/2, 1-\alpha/2]$ with probability $1-\alpha$. Thus, collecting all values $y \in \mathbb{R}$ satisfying $F(y, X_{T+1}) \in [\alpha/2, 1-\alpha/2]$ yields a conditionally valid prediction interval in the sense of Eq. 2. We operationalize this idea by proposing a conformal prediction procedure based on the estimated ranks, $\hat{U}_t(y) \equiv \hat{F}^{(y)}(Y_t, X_t)$. For each $y \in \mathbb{R}$, $\hat{F}^{(y)}$ is an estimator of $F$ obtained based on the augmented data $\{(Y_t, X_t)\}_{t=1}^{T+1}$, where $Y_{T+1} = y$. Data augmentation is a key feature of conformal prediction. It implies model-free unconditional exact finite-sample validity with iid (or exchangeable) data and thus guards against model misspecification and overfitting. Without data augmentation, the resulting prediction intervals are not exactly valid, not even with correct specification and iid data.
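To make the rank construction concrete, the following minimal R sketch (our illustration under an assumed location-scale design, not the paper's replication code) verifies the two properties of the probability integral transform used above and recovers the oracle interval of Eq. 3 by collecting the test values y with F(y, x) in [α/2, 1−α/2]:

set.seed(1)
T_obs <- 5000; alpha <- 0.1
x <- runif(T_obs)
y <- x + x * rnorm(T_obs)                     # assumed location-scale design (cf. Motivating Example)
F_cond <- function(y, x) pnorm((y - x) / x)   # oracle conditional CDF of Y_t given X_t = x

u <- F_cond(y, x)                             # oracle ranks U_t = F(Y_t, X_t)
ks.test(u, "punif")                           # approximately Uniform(0, 1)
cor(u, x)                                     # approximately 0: ranks are independent of X_t

# oracle conditionally valid interval at x0: all y with F(y, x0) in [alpha/2, 1 - alpha/2]
x0 <- 0.8
y_trial <- seq(-3, 3, length.out = 2001)
keep <- F_cond(y_trial, x0) >= alpha / 2 & F_cond(y_trial, x0) <= 1 - alpha / 2
range(y_trial[keep])                          # approximately [Q(alpha/2, x0), Q(1 - alpha/2, x0)]

In practice F is unknown, which is exactly where the estimated ranks and the data augmentation above come in.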
Our baseline method asymptotically coincides with the oracle interval in Eq. 3. This oracle interval may not be the shortest possible prediction interval in general. Therefore, we also develop a simple and easy to implement adjustment of our baseline method for improving efficiency, which we refer to as optimal DCP. In Predicting Wages Using CPS Data, we show empirically that optimal DCP yields shorter prediction intervals than baseline DCP when the conditional distribution is skewed.
We establish the following theoretical performance guarantees for the baseline and optimal DCP.
1.
Asymptotic conditional validity under consistent estimation of the conditional CDF
2.
Unconditional validity under model misspecification:
Finite-sample validity with iid (or exchangeable) data
Asymptotic validity with time series data
3.
For optimal DCP:
Under weak conditions: asymptotic conditional validity and optimality (shortest length)
Under strong conditions: asymptotic convergence to the optimal prediction interval

Motivating Example

We illustrate the advantages of DCP relative to mean-based conformal prediction (e.g., ref. 20) based on the following simple analytical example:
Y_t = X_t + X_t \varepsilon_t, \quad X_t \overset{iid}{\sim} \mathrm{Uniform}(0,1), \quad \varepsilon_t \overset{iid}{\sim} N(0,1).
[5]
Our motivating example draws on refs. 16 and 20–22, which illustrate the importance of accounting for heteroskedasticity. We focus on the population conformal prediction (or oracle) problem under correct specification and abstract from finite-sample issues.
Mean-based conformal prediction is based on the residuals $R_t = Y_t - E(Y_t \mid X_t) = Y_t - X_t = X_t \varepsilon_t$. The mean-based prediction interval is
C^{reg}_{1-\alpha}(x) = [x - Q_{|R|}(1-\alpha),\; x + Q_{|R|}(1-\alpha)],
[6]
where $Q_{|R|}(1-\alpha)$ is the $(1-\alpha)$ quantile of the distribution of $|R_t|$. An important property and drawback of $C^{reg}_{1-\alpha}$ is that its length, $2 \cdot Q_{|R|}(1-\alpha)$, is fixed and does not depend on $X_{T+1} = x$ (16, 20). This feature implies that $C^{reg}_{1-\alpha}$ is not adaptive to the heteroskedasticity in the location-scale model Eq. 5 and not conditionally valid.
DCP is based on the ranks $U_t = \Phi(\varepsilon_t)$, where $\Phi(\cdot)$ is the CDF of $N(0, 1)$. The DCP prediction interval is
C^{dcp}_{1-\alpha}(x) = [x - x \cdot Q_{|\varepsilon|}(1-\alpha),\; x + x \cdot Q_{|\varepsilon|}(1-\alpha)],
[7]
where $Q_{|\varepsilon|}(1-\alpha) = \Phi^{-1}(1-\alpha/2)$ is the $(1-\alpha)$ quantile of $|\varepsilon_t|$. Unlike $C^{reg}_{1-\alpha}$, the length of $C^{dcp}_{1-\alpha}$, $2x \cdot Q_{|\varepsilon|}(1-\alpha)$, depends on $X_{T+1} = x$. Our construction automatically adapts to the heteroskedasticity in model Eq. 5 and is conditionally valid.
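As a concrete check of these formulas, the following short R sketch (our illustration, not the paper's replication code) evaluates the two population intervals in Eqs. 6 and 7 on a grid of x values and reports their conditional length and coverage:

set.seed(1)
alpha    <- 0.1
q_absR   <- quantile(abs(runif(1e6) * rnorm(1e6)), 1 - alpha)  # (1 - alpha) quantile of |R_t| = |X_t * eps_t|
q_absEps <- qnorm(1 - alpha / 2)                               # (1 - alpha) quantile of |eps_t|

x_grid  <- seq(0.05, 0.95, by = 0.05)
len_reg <- rep(2 * q_absR, length(x_grid))                     # constant length of C_reg, Eq. 6
len_dcp <- 2 * x_grid * q_absEps                               # DCP length grows with x, Eq. 7
cov_reg <- pnorm(q_absR / x_grid) - pnorm(-q_absR / x_grid)    # P(|x * eps_t| <= q_absR): over-, then undercoverage
cov_dcp <- rep(1 - alpha, length(x_grid))                      # DCP is conditionally valid by construction
round(cbind(x = x_grid, len_reg, len_dcp, cov_reg, cov_dcp), 3)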
Fig. 1 provides an illustration. Fig. 1A shows that the conditional length of $C^{reg}_{0.9}$ is constant, whereas the length of $C^{dcp}_{0.9}$ varies as a function of $x$: $C^{dcp}_{0.9}$ is shorter than $C^{reg}_{0.9}$ for low values of $x$ and wider for high values of $x$. Fig. 1B shows that $C^{dcp}_{0.9}$ is valid for all $x$, whereas $C^{reg}_{0.9}$ overcovers for low values and undercovers for high values of $x$. Fig. 1 illustrates the advantage of our method: for predictor values where the conditional variance is low, it yields shorter prediction intervals, while for values where the conditional dispersion is large, it ensures conditional coverage by suitably enlarging the prediction interval.
Fig. 1
Motivating example. (A) Conditional length 90% prediction interval. (B) Conditional coverage 90% prediction interval.

Related Literature

We build on and contribute to the literature on conformal prediction (e.g., refs. 13, 15–18, 20, 23, and 24) and the literature on model-free prediction (19, 25), as well as the literature on quantile prediction methods (e.g., ref. 26 has a review).
Within the conformal prediction literature, our paper is most closely related to refs. 13, 16, and 20. Ref. 13 proposes conditionally valid and asymptotically efficient conformal prediction intervals based on estimators of the conditional density. We take a different and complementary approach, allowing researchers to leverage powerful regression methods for modeling conditional distributions, including QR and DR approaches. Ref. 20 develops conformal prediction methods based on regression models for conditional expectations. However, as discussed in Motivating Example, this approach is not conditionally valid under heteroskedasticity. They also propose a locally weighted conformal prediction approach, where the regression residuals are weighted by the inverse of a measure of their variability. This approach can alleviate some of the limitations of mean-based conformal prediction but is motivated by and based on restrictive location-scale models. By contrast, our approach is generic and exploits flexible and substantially more general models for the whole conditional distribution.
Ref. 16 proposes a split conformal approach based on QR models, which they call conformalized quantile regression (CQR). Refs. 14 and 27 have related approaches, and ref. 28 has a general approach to adaptive conformal prediction. CQR is based on splitting the data into two subsets, $\mathcal{T}_1$ and $\mathcal{T}_2$. Based on $\mathcal{T}_1$, they estimate two separate quantile functions $\hat{Q}(\alpha/2, x)$ and $\hat{Q}(1-\alpha/2, x)$ and construct the prediction intervals as
[\hat{Q}(\alpha/2, x) - Q_E(1-\alpha),\; \hat{Q}(1-\alpha/2, x) + Q_E(1-\alpha)],
where $Q_E(1-\alpha)$ is the $(1-\alpha)(1+1/|\mathcal{T}_2|)$th empirical quantile of
E_t = \max\{\hat{Q}(\alpha/2, X_t) - Y_t,\; Y_t - \hat{Q}(1-\alpha/2, X_t)\}
in $\mathcal{T}_2$. Constructing prediction intervals based on deviations from quantile estimates is similar to working with deviations from mean estimates, as the deviations are measured in absolute levels. By contrast, exploiting the probability integral transform, our approach is generic and relies on permuting ranks, which naturally have the same scaling on (0, 1). Note, however, that our paper was inspired by ref. 16, and we view our proposal as a (fully quantile rank–based) refinement of ref. 16. The value of this refinement is especially apparent in the second empirical example. In addition, we also give quantile-based optimal prediction intervals.
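To make the comparison concrete, here is a minimal split-sample sketch of the CQR construction just described (our code, assuming the quantreg package; an illustration rather than the implementation of ref. 16):

library(quantreg)  # assumed available; provides rq() for the two quantile fits

# dat: data.frame with outcome column y; newdat: data.frame of test predictor values
cqr_interval <- function(dat, newdat, alpha = 0.1) {
  n  <- nrow(dat)
  i1 <- seq_len(floor(n / 2)); i2 <- setdiff(seq_len(n), i1)     # split into T1 and T2
  fit_lo <- rq(y ~ ., tau = alpha / 2,     data = dat[i1, ])
  fit_hi <- rq(y ~ ., tau = 1 - alpha / 2, data = dat[i1, ])
  # conformity score E_t = max{Qhat(alpha/2, X_t) - Y_t, Y_t - Qhat(1 - alpha/2, X_t)} on T2
  e  <- pmax(predict(fit_lo, newdata = dat[i2, ]) - dat$y[i2],
             dat$y[i2] - predict(fit_hi, newdata = dat[i2, ]))
  k  <- ceiling((1 - alpha) * (length(i2) + 1))
  qe <- sort(e)[min(k, length(e))]                               # (1-alpha)(1+1/|T2|) empirical quantile
  cbind(lower = predict(fit_lo, newdata = newdat) - qe,
        upper = predict(fit_hi, newdata = newdat) + qe)
}

The DCP scores introduced below replace these absolute-level deviations e by functions of the estimated ranks.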
Our adjustment for constructing efficient prediction intervals is related to and inspired by the conformal prediction literature on minimum-volume prediction sets based on density estimators (e.g., refs. 13, 23, and 29–31) and nearest-neighbor estimators (32). It is most closely related to, and can be viewed as an alternative to, conformal histogram regression (33). The main differences between our approach and conformal histogram regression are the following. First, our method is based on an optimization problem formulated in terms of estimated quantile functions and does not require estimating a conditional density or histogram. Second, we do not work with nested sets but instead, use a simple adjustment of our baseline conformity score. Finally, our approach works for general outcome distributions and does not rely on assuming unimodal distributions.
Conceptually, our paper is further related to the transformation-based model-free prediction approach developed in refs. 19 and 25 in that we rely on transformations of the original setup into one that is easier to work with (i.e., ranks that are uniformly distributed) and study the properties of our approach in a model-free setting. An important difference is the implementation of the resulting procedure. The transformation-based approach is based on the bootstrap, whereas our approach is based on permuting ranks. Permuting ranks estimated based on the augmented data guarantees the model-free finite-sample validity of our method with exchangeable data. To our knowledge, no exact finite-sample validity results have been developed for the bootstrap-based approach.

DCP

Here, we introduce DCP. We present a full and a split-sample version of our method.

Full DCP

Let $y$ denote a test value for $Y_{T+1}$. We test the plausibility of each value $y \in \mathbb{R}$, collect all plausible values, and report them as the prediction set. In practice, we consider a grid of test values $\mathcal{Y}_{trial}$. Define the augmented data $Z(y) = \{Z_t(y)\}_{t=1}^{T+1}$, where
Z_t(y) = \begin{cases} (Y_t, X_t) & \text{if } 1 \leq t \leq T \\ (y, X_{T+1}) & \text{if } t = T+1. \end{cases}
[8]
Based on the augmented dataset $Z(y)$, we estimate the conditional CDF using a suitable method such as QR and DR, which are discussed in more detail in SI Appendix. Let $\hat{F}^{(y)}$ denote the estimator for $F$ based on the augmented sample. If the original estimate is not monotonic, we rearrange it (e.g., refs. 35 and 36) so that $\hat{F}^{(y)}(\cdot, x)$ is always monotonic. To simplify the exposition, we keep these rearrangements implicit.
We compute the ranks $\{\hat{U}_t(y)\}_{t=1}^{T+1}$, where
\hat{U}_t(y) = \begin{cases} \hat{F}^{(y)}(Y_t, X_t) & \text{if } 1 \leq t \leq T \\ \hat{F}^{(y)}(y, X_{T+1}) & \text{if } t = T+1, \end{cases}
[9]
and obtain P values as
\hat{p}(y) = \frac{1}{T+1} \sum_{t=1}^{T+1} 1\{\hat{V}_t(y) \geq \hat{V}_{T+1}(y)\},
[10]
where $\hat{V}_t(y) \equiv \psi(\hat{U}_t(y))$ and $\psi(\cdot)$ is a deterministic function. For our baseline method, we use $\psi(x) = |x - 1/2|$. In Extension: Optimal DCP, we show how to choose $\psi$ optimally to ensure efficiency. Prediction intervals are computed as $\hat{C}^{full}_{1-\alpha}(X_{T+1}) = \{y \in \mathcal{Y}_{trial} : \hat{p}(y) > \alpha\}$.§ We summarize our approach in Algorithm 1.
Algorithm 1: (Full DCP).
Input: Data $\{(Y_t, X_t)\}_{t=1}^T$, miscoverage level $\alpha \in (0,1)$, a point $X_{T+1}$, test values $\mathcal{Y}_{trial}$
Process: For $y \in \mathcal{Y}_{trial}$,
1)
define the augmented data $Z(y)$ as in Eq. 8 and the ranks $\{\hat{U}_t(y)\}_{t=1}^{T+1}$ as in Eq. 9
2)
compute $\hat{p}(y)$ as in Eq. 10
Output: Return $(1-\alpha)$ prediction set $\hat{C}^{full}_{1-\alpha}(X_{T+1}) = \{y \in \mathcal{Y}_{trial} : \hat{p}(y) > \alpha\}$
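A direct, if computationally heavy, translation of Algorithm 1 into R might look as follows; the conditional-CDF estimator is left interchangeable (the Gaussian location-scale fit below is only an illustrative placeholder for QR, DR, or other flexible methods), and all function and variable names are ours:

# fit_cdf(dat) returns a function Fhat(y, X) estimating P(Y <= y | X); dat is a
# data.frame with outcome column y. This placeholder estimator is an assumption of the sketch.
fit_cdf <- function(dat) {
  xvars <- setdiff(names(dat), "y")
  m <- lm(y ~ ., data = dat)                            # conditional mean
  s <- lm(abs(residuals(m)) ~ ., data = dat[xvars])     # crude conditional scale
  function(y, X) pnorm((y - predict(m, newdata = X)) / pmax(predict(s, newdata = X), 1e-6))
}

# Full DCP (Algorithm 1): x_new is a one-row data.frame of predictors, y_trial a grid of test values.
dcp_full <- function(dat, x_new, y_trial, alpha = 0.1, psi = function(u) abs(u - 0.5)) {
  Tn <- nrow(dat)
  plausible <- vapply(y_trial, function(y) {
    aug  <- rbind(dat, cbind(y = y, x_new))             # augmented data Z(y), Eq. 8
    Fhat <- fit_cdf(aug)                                # reestimate the conditional CDF on Z(y)
    v    <- psi(Fhat(aug$y, aug))                       # scores psi(U_t(y)), with ranks as in Eq. 9
    mean(v >= v[Tn + 1]) > alpha                        # keep y if the p value of Eq. 10 exceeds alpha
  }, logical(1))
  range(y_trial[plausible])                             # report the closed interval (cf. footnote §)
}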

Split DCP

An important drawback of full DCP (Algorithm 1) is its computational burden due to the grid search. Since $\hat{F}^{(y)}$ is obtained based on the augmented data, one has to choose $\mathcal{Y}_{trial}$ and reestimate the entire conditional distribution for all $y \in \mathcal{Y}_{trial}$. Therefore, we propose a split conformal procedure that exploits sample splitting, avoids grid search, and only requires estimating $F$ once. Sample splitting is a popular approach for improving the computational performance of conformal prediction methods (e.g., refs. 16 and 20).
Algorithm 2: (Split DCP).
Input: Data $\{(Y_t, X_t)\}_{t=1}^T$, miscoverage level $\alpha \in (0,1)$, point $X_{T+1}$
Process:
1)
Split $\{1, \ldots, T\}$ into $\mathcal{T}_1 \equiv \{1, \ldots, T_0\}$ and $\mathcal{T}_2 \equiv \{T_0+1, \ldots, T\}$
2)
Obtain $\hat{F}$ based on $\{Z_t\}_{t \in \mathcal{T}_1}$
3)
Compute $\{\hat{V}_t\}_{t \in \mathcal{T}_2} = \{\psi(\hat{U}_t)\}_{t \in \mathcal{T}_2}$, where $\hat{U}_t = \hat{F}(Y_t, X_t)$
4)
Compute $\hat{Q}_{\mathcal{T}_2}$, the $(1-\alpha)(1+1/|\mathcal{T}_2|)$ empirical quantile of $\{\hat{V}_t\}_{t \in \mathcal{T}_2}$
Output: Return $(1-\alpha)$ prediction set $\hat{C}^{split}_{1-\alpha}(X_{T+1}) = \{y : \psi(\hat{F}(y, X_{T+1})) \leq \hat{Q}_{\mathcal{T}_2}\}$
(Since $\hat{F}(\cdot, X_{T+1})$ is monotonic, $\hat{C}^{split}_{1-\alpha}(X_{T+1})$ is an interval)
In Algorithm 2, we split $\{1, \ldots, T\}$ into $\{1, \ldots, T_0\}$ and $\{T_0+1, \ldots, T\}$. With iid data, one can also consider random splits.
Split DCP lends itself naturally to simple in-sample validity checks with both cross-sectional and time series data as illustrated in Empirical Applications.
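Because $\hat{F}$ is estimated only once, the split version is straightforward to code. A minimal R sketch of Algorithm 2, reusing the interchangeable fit_cdf placeholder from the full DCP sketch above (all names are ours), is:

# Split DCP (Algorithm 2): dat is a data.frame with outcome column y, x_new a one-row data.frame.
dcp_split <- function(dat, x_new, alpha = 0.1, psi = function(u) abs(u - 0.5)) {
  Tn <- nrow(dat)
  i1 <- seq_len(floor(Tn / 2)); i2 <- setdiff(seq_len(Tn), i1)   # step 1: T1 and T2 (equal split)
  Fhat <- fit_cdf(dat[i1, ])                                     # step 2: estimate F once, on T1
  v    <- psi(Fhat(dat$y[i2], dat[i2, ]))                        # step 3: scores V_t on T2
  k    <- ceiling((1 - alpha) * (length(i2) + 1))
  qhat <- sort(v)[min(k, length(v))]                             # step 4: (1-alpha)(1+1/|T2|) quantile
  y_grid <- seq(min(dat$y), max(dat$y), length.out = 2000)       # invert psi(Fhat(y, x)) <= qhat on a grid
  range(y_grid[psi(Fhat(y_grid, x_new)) <= qhat])                # an interval, since Fhat is monotone in y
}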

Theoretical Performance Guarantees

In this section, we establish the theoretical properties of our procedure. We focus on full-sample DCP (Algorithm 1). For the split-sample approach (Algorithm 2), we provide a modified version (SI Appendix, Algorithm S1) in SI Appendix and present its theoretical properties in Extension: Optimal DCP.
When the data are iid (or exchangeable), our method achieves finite-sample unconditional validity in a model-free manner, as a consequence of general results on conformal inference and permutation inference more generally (e.g., refs. 17 and 37).
Theorem 1 (Finite-sample unconditional validity).
Suppose that the data are iid or exchangeable and that the estimator of the conditional distribution is invariant to permutations of the data. Then,
P(Y_{T+1} \in \hat{C}^{full}_{1-\alpha}(X_{T+1})) \geq 1 - \alpha.
The proof of Theorem 1 is standard and omitted. Theorem 1 highlights the strengths and drawbacks of conformal prediction methods. Most commonly used estimators of the conditional CDF such as QR and DR are invariant to permutations of the data. As a result, Theorem 1 provides a model-free unconditional performance guarantee in finite samples, allowing for arbitrary misspecification of the model of the conditional CDF. On the other hand, it has a major theoretical drawback. Even with iid data, it provides no guarantee at all on conditional validity.
Our next theoretical results provide a remedy. We impose the following weak regularity conditions.
Assumption 1.
Suppose that there exists a nonrandom function $F^*(\cdot,\cdot)$ such that the following conditions hold as $T \to \infty$. Define $V_t \equiv \psi(F^*(Y_t, X_t))$ for $1 \leq t \leq T+1$.
1)
There exists a strictly increasing continuous function $\phi: [0, \infty) \to [0, \infty)$ such that $\phi(0) = 0$, $(T+1)^{-1} \sum_{t=1}^{T+1} \phi(|\hat{V}_t - V_t|) = o_P(1)$, and $\hat{V}_{T+1} = V_{T+1} + o_P(1)$, where $\hat{V}_t \equiv \hat{V}_t(Y_{T+1}) = \psi(\hat{F}^{(Y_{T+1})}(Y_t, X_t))$ for $1 \leq t \leq T+1$.
2)
$\sup_{v \in \mathbb{R}} |\tilde{G}(v) - G(v)| = o_P(1)$, where $\tilde{G}(v) = (T+1)^{-1} \sum_{t=1}^{T+1} 1\{V_t < v\}$ and $G(\cdot)$ is the distribution function of $V_{T+1}$.
3)
$\sup_{x_1 \neq x_2} |G(x_1) - G(x_2)| / |x_1 - x_2|$ is bounded.
Assumption 1 allows for some flexibility with respect to the model estimator. Here, we only require $F^*$ to be a nonrandom function, which may or may not be $F$. The interpretation is straightforward when $F^* = F$ since this simply means that the estimator $\hat{F}$ is consistent for $F$. We discuss the case of $F^* \neq F$ after Theorem 2 below. Note that we can replace the consistency requirement in Assumption 1 with a stronger uniform consistency requirement, $\sup_{x,y} |\hat{F}(y, x) - F(y, x)| = o_P(1)$.
We also notice that the quantities $\hat{V}_t$ and $V_t$ are defined under the true $Y_{T+1}$. This means that $\hat{F}^{(y)}$ uses $y = Y_{T+1}$. In other words, the estimator $\hat{F}$ based on the sample $\{(X_t, Y_t)\}_{t=1}^{T+1}$ would be consistent for some $F^*$ if $Y_{T+1}$ were observed. Since the goal of Assumption 1 is to guarantee the coverage probability for $Y_{T+1}$, the conditions in Assumption 1 only need to hold for $y = Y_{T+1}$.
Notice that $\hat{F}$ is consistent for $F^*$ under a very weak norm, and no rate condition is required. When $\psi(x) = |x - 1/2|$, a simple example of $\phi(\cdot)$ in Assumption 1 is $\phi(x) = x^q$ for some $q > 0$; in other words, a sufficient condition is $(T+1)^{-1} \sum_{t=1}^{T+1} |\hat{F}(Y_t, X_t) - F(Y_t, X_t)|^q = o_P(1)$, which can be verified for many existing estimators with $q = 2$.
The following lemma gives the basic consistency result.
Lemma 1.
Let Assumption 1 hold. Then, $\hat{G}(\hat{V}_{T+1}) = G(V_{T+1}) + o_P(1)$, where $\hat{G}(v) = (T+1)^{-1} \sum_{t=1}^{T+1} 1\{\hat{V}_t < v\}$.
By Assumption 1, $G(\cdot)$ is uniformly continuous and thus, continuous. Since $G(\cdot)$ is the distribution function of $V_{T+1}$, we have that $G(V_{T+1})$ has the uniform distribution on (0, 1) [i.e., $P(G(V_{T+1}) \leq \alpha) = \alpha$]. This implies the unconditional asymptotic validity.
Theorem 2 (Asymptotic unconditional validity).
Let Assumption 1 hold. Then,
P(Y_{T+1} \in \hat{C}^{full}_{1-\alpha}(X_{T+1})) = 1 - \alpha + o(1).
Theorem 2 establishes the asymptotic unconditional validity of the procedure. Since Theorem 1 already establishes the unconditional validity in finite samples for iid or exchangeable data without assuming any consistency of $\hat{F}$, the main purpose of Theorem 2 is to address the case of nonexchangeable data (e.g., time series data with ergodicity), especially when the model is misspecified (i.e., if $F^* \neq F$).
To illustrate model misspecification, consider the popular linear QR model, which assumes $Q(\tau, x) = x'\beta(\tau)$ and thus, $F(y, x) = F(y, x; \beta) = \int_0^1 1\{x'\beta(\tau) \leq y\}\, d\tau$. This model is typically estimated by $\hat{\beta}(\tau) = \arg\min_\beta \sum_{t=1}^{T+1} \rho_\tau(Y_t - X_t'\beta)$ with $\rho_\tau(a) = a(\tau - 1\{a < 0\})$. Under misspecification [$Q(\tau, x) \neq x'\beta(\tau)$], $\hat{\beta}(\tau)$ is still estimating $\beta^*(\tau) = \arg\min_\beta \sum_{t=1}^{T+1} E\rho_\tau(Y_t - X_t'\beta)$, and $F^*$ is defined using $\beta^*(\cdot)$ [e.g., $F^*(y, x) = \int_0^1 1\{x'\beta^*(\tau) \leq y\}\, d\tau$]. For parametric models, $F^*$ is usually the probability limit of $\hat{F}$. In general, we can consider a model $\mathcal{F}$ and minimize the empirical risk $\hat{F} = \arg\min_{g \in \mathcal{F}} \sum_{t=1}^{T+1} L(Y_t, X_t, g)$ for some loss function $L$. Even if the model is misspecified ($F \notin \mathcal{F}$), it is still possible to show that $\hat{F}$ is close (in some norm) to $F^* = \arg\min_{g \in \mathcal{F}} \sum_{t=1}^{T+1} E[L(Y_t, X_t, g)]$. In SI Appendix, we provide a more detailed discussion of this and some theoretical results verifying the consistency requirement in Assumption 1 for the time series case; ref. 24 has a general discussion of conformal prediction in time series settings.
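To make $\hat{F}$ and its probability limit concrete for the linear QR model, the following R sketch (ours, assuming the quantreg package and a data frame dat with outcome column y) estimates the QR process on a grid of quantile levels and converts the fitted quantile curves into a conditional CDF estimate; under misspecification this object targets $F^*$ rather than $F$:

library(quantreg)  # assumed available

taus <- seq(0.01, 0.99, by = 0.01)             # grid of quantile levels
fit  <- rq(y ~ ., tau = taus, data = dat)      # linear QR process (possibly misspecified)

# Fhat(y, x) = integral over tau of 1{x'beta_hat(tau) <= y}, approximated by the
# fraction of fitted quantiles lying below y (scalar y, one value per row of newdat).
Fhat_qr <- function(y, newdat) {
  Q <- rbind(predict(fit, newdata = newdat))   # fitted quantiles, one column per tau
  Q <- t(apply(Q, 1, sort))                    # monotone rearrangement (refs. 35 and 36)
  rowMeans(Q <= y)
}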
The cost of allowing for misspecification is that one cannot guarantee conditional validity when $F^* \neq F$. On the other hand, Lemma 1 implies that the prediction intervals are conditionally valid when $F^* = F$.
Theorem 3 (Asymptotic conditional validity).
Let Assumption 1 hold with $F^* = F$. Then,
P(Y_{T+1} \in \hat{C}^{full}_{1-\alpha}(X_{T+1}) \mid X_{T+1}) = 1 - \alpha + o_P(1).
Theorems 2 and 3 establish the asymptotic validity of our procedure under weak and easy to verify conditions. They formalize the key intuition that conditional validity hinges on the quality of the estimator $\hat{F}$ of the conditional CDF.#

Extension: Optimal DCP

In Theoretical Performance Guarantees, we have seen that a generic conformity score $\psi(y, x) = |F(y, x) - 1/2|$ leads to conditional validity if the conditional distribution $F$ can be estimated consistently. We now characterize an optimal choice of conformity score that results in the shortest prediction interval. Detailed implementation algorithms, technical assumptions, and proofs are provided in SI Appendix.
Let $\mathcal{Z}$ and $\mathcal{X}$ denote the supports of $Z_t = (Y_t, X_t)$ and $X_t$, respectively. The optimal prediction interval is
C^{opt}_{1-\alpha}(x) = [r_1(x, \alpha),\; r_2(x, \alpha)],
[11]
where the functions $r_1(\cdot,\cdot)$ and $r_2(\cdot,\cdot)$ satisfy that for any $x \in \mathcal{X}$,
r_2(x, \alpha) - r_1(x, \alpha) = \min_{F(z_2, x) - F(z_1, x) \geq 1-\alpha} (z_2 - z_1).
[12]
The question is whether it is possible to design a conformity score that achieves the above optimal prediction interval. To answer this question formally, we consider a generic conformity score ψ(y,x), which might contain components that need to be estimated.
Permuting a large number of values of $\{\psi(Y_t, X_t)\}$ in conformal predictions leads to taking the sample $(1-\alpha)$ quantile of $\psi(Y_t, X_t)$ as the output. For example, following Algorithm 2, one would output the $(1-\alpha)(1+1/|\mathcal{T}_2|)$ empirical quantile of $\{\psi(Y_t, X_t)\}$. Assuming a law of large numbers, this empirical quantile would be close to the population $(1-\alpha)$ quantile of $\psi(Y_t, X_t)$, leading to the asymptotic conformal prediction interval for $Y_{T+1}$:
C^{conf}_{1-\alpha}(X_{T+1}) = \{y : \psi(y, X_{T+1}) \leq Q_\psi(1-\alpha)\},
[13]
where $Q_\psi(1-\alpha)$ is the $(1-\alpha)$ quantile of $\psi(Y_t, X_t)$. The following result shows how to construct the optimal conformity score $\psi$.
Lemma 2.
Let $\psi^*(y, x) = |F(y, x) - b(x, \alpha) - (1-\alpha)/2|$, where $b(\cdot,\cdot)$ is a function satisfying that for any $x \in \mathcal{X}$,
b(x, \alpha) \in \arg\min_{z \in [0, \alpha]} \{Q(z + 1 - \alpha, x) - Q(z, x)\}.
[14]
Let $C^{conf}_{1-\alpha}(X_{T+1})$ be defined as in Eq. 13 with the above conformity score $\psi^*$. Assume that $F(\cdot, x)$ is a continuous function for any $x \in \mathcal{X}$. Then, $Q_\psi(1-\alpha) = (1-\alpha)/2$ and
\mu(C^{opt}_{1-\alpha}(X_{T+1})) = \mu(C^{conf}_{1-\alpha}(X_{T+1})) \quad \text{almost surely},
where $\mu(\cdot)$ denotes the Lebesgue measure. If the optimization problem in Eq. 11 has a unique solution for any $x \in \mathcal{X}$, then
C^{opt}_{1-\alpha}(X_{T+1}) = C^{conf}_{1-\alpha}(X_{T+1}) \quad \text{almost surely}.
Lemma 2 motivates conformity scores of the form $\psi^*(y, x) = |F(y, x) - [b(x, \alpha) + (1-\alpha)/2]|$, where $b(\cdot,\cdot)$ solves Eq. 14. Compared with the choice of $\psi(y, x) = |F(y, x) - 1/2|$ mentioned in Theoretical Performance Guarantees, we can view $\psi^*$ as having a “shape” adjustment $b(x, \alpha) - \alpha/2$. Since $F(Y_t, X_t)$ is independent of $X_t$, the optimal conformity score measures the distance between two independent components: $F(Y_t, X_t)$ and $1/2 + (b(X_t, \alpha) - \alpha/2)$. Hence, by Lemma 2, in order to take into account the shape of the conditional distribution $F(\cdot, x)$, it suffices to consider the scalar quantity $1/2 + (b(x, \alpha) - \alpha/2)$.
In some special cases, the shape adjustment can be shown to be zero [i.e., $b(x, \alpha) = \alpha/2$]. One typical example is when $F(\cdot, x)$ is a symmetric unimodal distribution with a well-defined conditional density. Therefore, the choice of $\psi(y, x) = |F(y, x) - 1/2|$ mentioned in Theoretical Performance Guarantees is optimal in these cases. However, Lemma 2 provides a construction that achieves optimality more generally. By the definition of $\psi^*$ and $Q_\psi(1-\alpha) = (1-\alpha)/2$, the prediction interval is
C^{conf}_{1-\alpha}(x) = [Q(b(x, \alpha), x),\; Q(b(x, \alpha) + 1 - \alpha, x)].
[15]
We illustrate this in Fig. 2 with α=0.1. Eq. 15 implies that b(x,α) is the quantile index of the lower bound of the interval. For the symmetric distribution in Fig. 2, Top, we see b(x,α)=0.05, which is α/2. For the asymmetric distribution in Fig. 2, Bottom, we see that b(x,α)=0.007, which is far away from α/2=0.05.
Fig. 2
Optimal prediction intervals. (Top) Symmetric distribution. (Bottom) Asymmetric distribution.
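In practice, $b(x, \alpha)$ can be obtained from any estimated conditional quantile function by a one-dimensional search over $z \in [0, \alpha]$. The sketch below is a hedged illustration of Eqs. 14 and 15 in R (Qhat and the exponential toy example are our own choices, not the procedure of SI Appendix, Algorithm S1):

# Shape adjustment of Eq. 14: pick the lower quantile index b in [0, alpha] that
# minimizes the interval length Qhat(b + 1 - alpha, x) - Qhat(b, x).
shape_adjustment <- function(Qhat, x, alpha = 0.1, grid_size = 100) {
  b_grid <- seq(0, alpha, length.out = grid_size)
  len    <- sapply(b_grid, function(b) Qhat(b + 1 - alpha, x) - Qhat(b, x))
  b      <- b_grid[which.min(len)]
  c(lower = Qhat(b, x), upper = Qhat(b + 1 - alpha, x), b = b)   # optimal interval of Eq. 15
}

# Skewed toy example: exponential conditional distribution with mean x
Qhat <- function(tau, x) qexp(tau, rate = 1 / x)
shape_adjustment(Qhat, x = 2, alpha = 0.1)   # here b = 0, far from alpha/2 = 0.05, as in Fig. 2, Bottom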
The first result in Lemma 2 is general and allows for the lack of uniqueness of the optimal prediction interval. For example, if $F$ is the uniform distribution on a certain interval, then all conditionally valid prediction intervals have the same length. Clearly, in this case, achieving the optimal length is the only goal one can hope for. When we can uniquely define the optimal prediction interval, Lemma 2 implies that the conformal procedure can recover the uniquely defined optimal interval, not just achieve the optimal length.
Lemma 2 also confirms the insight of ref. 13: the optimal confidence set for $X_{T+1} = x$ should take the form $\{y : f(y, x) \geq c(x)\}$ for some $c(x) > 0$, where $f(y, x) = \partial F(y, x)/\partial y$. Assume that $F(\cdot, x)$ is a unimodal distribution and $f(\cdot, x)$ is a continuous function for any $x \in \mathcal{X}$. Then, this confidence set is an interval. This means that $\{y : f(y, x) \geq c(x)\} = [c_1(x), c_2(x)]$ and $f(c_1(x), x) = f(c_2(x), x) = c(x)$. We notice that $c_1(x), c_2(x)$ are related to our results in that $c_1(x) = Q(b(x, \alpha), x)$ and $c_2(x) = Q(b(x, \alpha) + 1 - \alpha, x)$. To see this, simply observe that the first-order condition of the optimization problem in Eq. 14 is $1/f(Q(z + 1 - \alpha, x), x) - 1/f(Q(z, x), x) = 0$, which implies that
f(Q(b(x, \alpha) + 1 - \alpha, x), x) = f(Q(b(x, \alpha), x), x).
To make the procedure operational, we provide the conformal prediction interval $\hat{C}^{conf}_{1-\alpha}(X_{T+1})$ defined in SI Appendix, Algorithm S1. We can provide the following guarantee.
Theorem 4.
Let SI Appendix, Assumption S1 hold. Then,
P(Y_{T+1} \in \hat{C}^{conf}_{1-\alpha}(X_{T+1}) \mid X_{T+1}) = 1 - \alpha + o_P(1)
and
\mu(\hat{C}^{conf}_{1-\alpha}(X_{T+1})) \leq \mu(C^{opt}_{1-\alpha}(X_{T+1})) + o_P(1).
The main requirements in SI Appendix, Assumption S1 are consistency of $\hat{F}$ and that the density $f$ is bounded below on its support. This is quite mild in the sense that it does not imply that the optimal prediction interval in Eq. 11 is uniquely defined. For example, it allows $f$ to be the density of a uniform distribution. Therefore, as discussed above, the conformal prediction interval would have approximately the shortest length but might not converge to $C^{opt}_{1-\alpha}(X_{T+1})$ in Eq. 11.
The following theorem provides a stronger result about $\hat{C}^{conf}_{1-\alpha}(X_{T+1})$ based on stronger assumptions.
Theorem 5.
Let SI Appendix, Assumption S2 hold. Consider the conformal prediction interval $\hat{C}^{conf}_{1-\alpha}(X_{T+1})$ defined in SI Appendix, Algorithm S1. Then,
\mu(\hat{C}^{conf}_{1-\alpha}(X_{T+1}) \,\triangle\, C^{opt}_{1-\alpha}(X_{T+1})) = o_P(1),
where $\triangle$ denotes the symmetric difference of sets [i.e., $A \triangle B = (A \setminus B) \cup (B \setminus A)$] and $C^{opt}_{1-\alpha}(X_{T+1})$ is defined in Eq. 11.
The key component of SI Appendix, Assumption S2 is consistent estimation of $b$. Theorem 5 shows that $\hat{C}^{conf}_{1-\alpha}(X_{T+1})$ is close to $C^{opt}_{1-\alpha}(X_{T+1})$ in the sense that the symmetric difference between these two sets has vanishing Lebesgue measure.

Empirical Applications

We illustrate the performance of DCP in two empirical applications and provide a comparison with alternative approaches. These examples, especially the second, illustrate the value of our proposal. We consider eight different conformal prediction methods.
1)
DCP-QR: DCP with QR (Algorithm 2)
2)
DCP-QR*: Optimal DCP with QR (SI Appendix, Algorithm S1)
3)
DCP-DR: DCP with DR (Algorithm 2)
4)
CQR: CQR with QR (16)
5)
CQR-m: CQR variant (14, 27) with QR
6)
CQR-r: CQR variant (14) with QR
7)
CP-OLS: Mean-based split conformal prediction (CP) with Ordinary Least Squares (OLS)
8)
CP-loc: Locally weighted conformal prediction (20) with OLS
All computations were carried out in R (39). Code and data for replicating the empirical results are deposited in GitHub (https://github.com/kwuthrich/Replication_DCP).

Predicting Stock Market Returns

Here, we consider the problem of predicting stock market returns, which are known to exhibit substantial heteroskedasticity (see chapter 13 of ref. 40 and references therein for a recent review). We use data on daily returns of the market portfolio (Center for Research in Security Prices value-weighted portfolio) from 1 July 1926 to 30 June 2021.** We use lagged realized volatility $X_t$ to predict the present return $Y_t$.†† Daily returns are not iid and exhibit time series dependence. In SI Appendix, we show that the key conditions underlying our theoretical results hold when the data are β-mixing. Several stochastic volatility models for asset returns, including the popular generalized autoregressive conditional heteroskedasticity models, can be shown to be β-mixing (e.g., refs. 42–44).
We evaluate the performance of the different methods by splitting the data into a training and a test sample. To account for the dependence in the data, we present results averaged over five consecutive prediction exercises. In the first exercise, we apply split conformal prediction with an equal split ($|\mathcal{T}_1| = |\mathcal{T}_2|$) to the first 50% of observations and use the next 10% for testing. In the second exercise, we drop the first 10% of the observations, apply split conformal prediction to the next 50% of observations, and use the next 10% for testing, and so on.
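Schematically, this rolling evaluation can be coded along the following lines; the sketch reuses the dcp_split function from the earlier sketch, takes a plain vector of daily returns ret, and follows the footnote definition of realized volatility (all names are ours, and this is not the replication code):

# Lagged realized volatility: for day t, the square root of the sum of squared
# returns over the previous 22 trading days.
real_vol <- function(ret, window = 22) {
  sapply(seq_along(ret), function(t)
    if (t <= window) NA else sqrt(sum(ret[(t - window):(t - 1)]^2)))
}

rolling_coverage <- function(ret, alpha = 0.1, n_exercises = 5) {
  dat  <- na.omit(data.frame(y = ret, x = real_vol(ret)))
  n    <- nrow(dat); step <- floor(0.1 * n); train <- floor(0.5 * n)
  sapply(seq_len(n_exercises), function(k) {
    start <- (k - 1) * step + 1
    tr <- dat[start:(start + train - 1), ]                    # 50% window for split conformal
    te <- dat[(start + train):(start + train + step - 1), ]   # next 10% for testing
    ci <- t(sapply(seq_len(nrow(te)), function(i)
      dcp_split(tr, te[i, "x", drop = FALSE], alpha)))        # split DCP interval per test day
    mean(te$y >= ci[, 1] & te$y <= ci[, 2])                   # empirical coverage in this exercise
  })
}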
Fig. 3 plots the empirical coverage probabilities for 20 bins obtained by dividing up the support of $X_t$ based on equally spaced quantiles. DCP-QR and DCP-QR* yield prediction intervals with coverage levels that are almost constant across all bins and close to the nominal level. They outperform DCP-DR, which undercovers in high-volatility regimes. The conditional coverage properties of DCP-QR and DCP-QR* are very similar to those of CQR, CQR-m, CQR-r, and CP-loc. This suggests that location-scale models, which are nested by QR, provide a good approximation of the conditional distribution. CP-OLS exhibits overcoverage under low-volatility regimes and substantial undercoverage under high-volatility regimes. This finding has important practical implications since the volatility tends to be high during periods of crisis, which is precisely when accurate risk assessments are most needed.
Fig. 3
Conditional coverage 90% prediction intervals by realized volatility.
Fig. 4 shows the conditional length of the prediction intervals. DCP-QR, DCP-QR*, CQR, CQR-m, CQR-r, and CP-loc yield prediction intervals of similar length. The DCP-DR prediction intervals are somewhat shorter than those of the QR-based methods at the upper tail. Finally, CP-OLS yields prediction intervals that are almost constant across all values of realized volatility; they are longer than the DCP intervals at the lower tail and shorter at the upper tail.‡‡
Fig. 4
Conditional length 90% prediction intervals by realized volatility.

Predicting Wages Using CPS Data

We consider the problem of predicting wages using individual characteristics. We use the 2012 Current Population Survey (CPS) data provided in the R package hdm (45), which contain information on N = 29,217 observations. Here, we use the index $i$ instead of $t$. To illustrate the impact of skewness on the performance of the different prediction methods, we use the hourly wage as our dependent variable $Y_i$.§§ Predictors $X_i$ include indicators for gender, marital status, educational attainment, region, experience, experience squared, and all two-way interactions such that $\dim(X_i) = 100$ after removing constant variables.
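The predictor construction can be mimicked with base R's formula machinery; the sketch below uses hypothetical column names of our own choosing (it is not tied to the variable names in the hdm data) and expands the raw characteristics into main effects plus all two-way interactions before dropping constant columns:

# dat_raw: hypothetical data.frame with columns wage, gender, marital, education, region, exper
X <- model.matrix(~ (gender + marital + education + region + exper + I(exper^2))^2,
                  data = dat_raw)[, -1]                     # main effects plus all two-way interactions
X <- X[, apply(X, 2, function(v) length(unique(v)) > 1)]    # remove constant columns
dat <- data.frame(y = dat_raw$wage, X)                      # outcome plus the resulting predictors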
Following refs. 14 and 16, we evaluate the performance of the different methods by randomly holding out 20% of the data for testing, $I_{test}$, and applying split conformal prediction with an equal split to the remaining 80% of the data. We repeat the whole experiment 20 times.
Table 1 shows that all conformal prediction methods exhibit excellent unconditional coverage properties, confirming the theoretical finite-sample guarantees. To assess and compare the conditional coverage properties, for each method, we compute conditional coverage probabilities as the predictions from logistic regressions of $\{1\{Y_i \in \hat{C}^{split}_{1-\alpha}(X_i)\}\}_{i \in I_{test}}$ on $\{X_i\}_{i \in I_{test}}$, where $\hat{C}^{split}_{1-\alpha}$ is the split conformal prediction interval obtained by the corresponding method. The less dispersed the predicted coverage probabilities are around the nominal level $1 - \alpha = 0.9$, the better the overall conditional coverage properties of a method. Table 1 reports the SD of the predicted coverage probabilities.¶¶ DCP-QR* yields the lowest dispersion of all methods. The predicted coverage probabilities based on DCP-QR are less dispersed than those obtained from CQR, CQR-m, and CQR-r. CP-loc yields a higher dispersion than the methods based on QR and DR, which demonstrates the value added of using flexible models of the conditional distribution. Overall, DCP performs much better than CP-OLS, for which the predicted coverage probabilities exhibit a very high dispersion. SI Appendix, Fig. S1 plots histograms of the predicted coverage probabilities.
Table 1
Coverage 90% prediction intervals

                                              DCP-QR   DCP-QR*   DCP-DR   CQR     CQR-m   CQR-r   CP-OLS   CP-loc
Unconditional coverage                         0.90     0.90      0.90     0.90    0.90    0.90    0.90     0.90
SD of predicted conditional coverage (×100)    1.80     1.71      3.08     2.21    2.36    2.30    11.13    4.11
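The conditional-coverage diagnostic behind the SD row of Table 1 can be sketched as follows (our code, with names of our own choosing): regress the coverage indicator on the predictors with a logistic link and summarize the dispersion of the predicted coverage probabilities.

# covered: logical vector, covered[i] = TRUE if Y_i lies in the prediction interval;
# X_test: data.frame of predictors for the held-out observations in I_test.
conditional_coverage_sd <- function(covered, X_test) {
  fit  <- glm(covered ~ ., family = binomial(),
              data = data.frame(covered = covered, X_test))
  phat <- predict(fit, type = "response")     # predicted conditional coverage per observation
  c(unconditional = mean(covered), sd_x100 = 100 * sd(phat))
}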
Table 2 shows the average length of the prediction intervals. DCP-QR* produces the shortest prediction intervals among all methods. This demonstrates the practical advantage of the shape adjustment when the conditional distribution is skewed. The results also suggest a trade-off between conditional coverage accuracy and average length. For example, CP-OLS and CP-loc, which both exhibit poor conditional coverage properties, yield shorter prediction intervals than DCP-QR.
Table 2
Average length 90% prediction intervals

DCP-QR   DCP-QR*   DCP-DR   CQR     CQR-m   CQR-r   CP-OLS   CP-loc
34.22    29.61     33.69    34.52   34.84   34.63   33.84    32.66

Notes

*See, for example, refs. 12–14 for a further discussion of the difference between conditional and unconditional validity.
This transformation is also very useful in other prediction problems (e.g., ref. 19).
For example, we can choose $\mathcal{Y}_{trial}$ to be a fine grid between $-\max_{1 \leq t \leq T} |Y_t|$ and $\max_{1 \leq t \leq T} |Y_t|$. This choice has a theoretical justification since under exchangeability, $P(|Y_{T+1}| > \max_{1 \leq t \leq T} |Y_t|) \leq 1/(1+T)$ (34) (a discussion is in the conformal inference R package; https://github.com/ryantibs/conformal).
§
Instead of $\hat{C}^{full}_{1-\alpha}(X_{T+1})$, we typically report the closed interval $\tilde{C}^{full}_{1-\alpha}(X_{T+1}) = [\min(\hat{C}^{full}_{1-\alpha}(X_{T+1})),\; \max(\hat{C}^{full}_{1-\alpha}(X_{T+1}))]$.
This is not really much different from assuming that $\hat{F}$ based on the sample $\{(X_t, Y_t)\}_{t=1}^T$ is consistent for some $F^*$.
#
In Theorem 3, we assume $F^* = F$. Since the first version of this paper was posted, ref. 38 has provided more general results where $F^* \neq F$.
In this case, $Q(1/2 + \delta, x) - Q(1/2, x) = Q(1/2, x) - Q(1/2 - \delta, x)$, and the conditional density is increasing on $(-\infty, Q(1/2, x))$ and decreasing on $(Q(1/2, x), \infty)$. One can show $b(x, \alpha) = \alpha/2$ by taking the first-order derivative of the optimization problem in Eq. 14 and setting it to zero.
**The data are constructed from the Fama/French Three Factors data (41) available from Kenneth R. French’s data library (accessed 17 August 2021).
††
We compute realized volatility as the square root of the sum of squared returns over the last 22 d.
‡‡
The CP-OLS prediction intervals are not exactly constant because we are reporting results averaged over five experiments.
§§
We obtain the hourly wage by exponentiating the log hourly wage provided in the dataset.
¶¶
Using $\frac{1}{|I_{test}|} \sum_{i \in I_{test}} (\widehat{\mathrm{Coverage}}_i - 0.9)^2$, where $\widehat{\mathrm{Coverage}}_i$ is the predicted coverage probability, instead of the SD yields very similar results.

Data Availability

Data and computer codes to replicate all the results in this paper have been deposited in GitHub (https://github.com/kwuthrich/Replication_DCP). All data are referenced in the main text.

Acknowledgments

We thank the editor, two anonymous referees, Dimitris Politis, and Allan Timmermann for valuable comments. V.C. acknowledges funding from the NSF. All remaining errors are our own.

Supporting Information

Appendix 01 (PDF)

References

1
R. Koenker, G. Bassett, Regression quantiles. Econometrica 46, 33–50 (1978).
2
S. Foresi, F. Peracchi, The conditional distribution of excess returns: An empirical analysis. J. Am. Stat. Assoc. 90, 451–466 (1995).
3
V. Chernozhukov, I. Fernandez-Val, B. Melly, Inference on counterfactual distributions. Econometrica 81, 2205–2268 (2013).
4
J. W. Taylor, A quantile regression neural network approach to estimating the conditional density of multiperiod returns. J. Forecast. 19, 299–311 (2000).
5
P. Chaudhuri, W. Y. Loh, Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli 8, 561–576 (2002).
6
N. Meinshausen, Quantile regression forests. J. Mach. Learn. Res. 7, 983–999 (2006).
7
M. D. Cattaneo, Y. Feng, R. Titiunik, Prediction intervals for synthetic control methods. arXiv [Preprint] (2019). https://arxiv.org/abs/1912.07120 (Accessed 20 August 2021).
8
V. Chernozhukov, K. Wüthrich, Y. Zhu, An exact and robust conformal inference method for counterfactual and synthetic controls. J. Am. Stat. Assoc. (2021).
9
D. Kivaranovic, R. Ristl, M. Posch, H. L. Leeb, Conformal prediction intervals for the individual treatment effect. arXiv [Preprint] (2020). https://arxiv.org/abs/2006.01474 (Accessed 20 August 2020).
10
L. Lei, E. J. Candès, Conformal inference of counterfactuals and individual treatment effects. arXiv [Preprint] (2020). https://arxiv.org/abs/2006.06138 (Accessed 20 August 2021).
11
Y. Romano, R. F. Barber, C. Sabatti, E. J. Candes, With malice towards none: Assessing uncertainty via equalized coverage. arXiv [Preprint] (2019). https://arxiv.org/abs/1908.05428 (Accessed 20 August 2021).
12
R. Foygel Barber, E. J. Candès, A. Ramdas, R. J. Tibshirani, The limits of distribution-free conditional predictive inference. J. IMA 10, 455–482 (2021).
13
J. Lei, L. Wasserman, Distribution-free prediction bands for non-parametric regression. J. Royal Stat. Soc. Ser. B. Stat. Methodol. 76, 71–96 (2014).
14
M. Sesia, E. J. Candes, A comparison of some conformal quantile regression methods. Stat 9, e261 (2020).
15
V. Vovk, “Conditional validity of inductive conformal predictors” in Proceedings of the Asian Conference on Machine Learning, S. C. H. Hoi, W. Buntine, Eds. (PMLR, Singapore Management University, Singapore), vol. 25, pp. 475–490 (2012).
16
Y. Romano, E. Patterson, E. Candes, Conformalized quantile regression. Adv. Neural Inf. Process. Syst., 32, 3543–3553 (2019).
17
V. Vovk, A. Gammerman, G. Shafer, Algorithmic Learning in a Random World (Springer Science & Business Media, 2005).
18
V. Vovk, I. Nouretdinov, A. Gammerman, On-line predictive linear regression. Ann. Stat. 37, 1566–1590 (2009).
19
D. N. Politis, Model-Free Prediction and Regression: A Transformation-Based Approach to Inference (Springer, New York, NY, 2015).
20
J. Lei, M. G. Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference for regression. J. Am. Stat. Assoc. 113, 1094–1111 (2018).
21
R. Koenker, G. Bassett, Robust tests for heteroscedasticity based on regression quantiles. Econometrica 50, 43–61 (1982).
22
R. Koenker, Quantile Regression, Econometric Society Monographs (Cambridge University Press, 2005).
23
J. Lei, J. Robins, L. Wasserman, Distribution free prediction sets. J. Am. Stat. Assoc. 108, 278–287 (2013).
24
V. Chernozhukov, K. Wüthrich, Z. Yinchu, “Exact and robust conformal inference methods for predictive machine learning with dependent data” in Proceedings of the 31st Conference on Learning Theory, S. Bubeck, V. Perchet, P. Rigollet, Eds. (PMLR, Cambridge, MA, 2018), vol. 75, pp. 732–749.
25
D. N. Politis, Model-free model-fitting and predictive distributions. Test 22, 183–221 (2013).
26
I. Komunjer, “Chapter 17 - Quantile prediction” in Handbook of Economic Forecasting, G. Elliott, A. Timmermann, Eds. (Elsevier, 2013), pp. 961–994.
27
D. Kivaranovic, K. D. Johnson, H. Leeb, “Adaptive, distribution-free prediction intervals for deep networks” in Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, S. Chiappa, R. Calandra, Eds. (PMLR, Cambridge, MA, 2020), vol. 108, pp. 4346–4356.
28
V. Vovk et al., “Conformal calibrators” in Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications, A. Gammerman, V. Vovk, Z. Luo, E. Smirnov, G. Cherubin, Eds. (PMLR, Cambridge, MA, 2020), vol. 128, pp. 84–99.
29
D. J. Eck, F. W. Crawford, Efficient and minimal length parametric conformal prediction regions. arXiv [Preprint] (2019). https://arxiv.org/abs/1905.03657 (Accessed 20 August 2021).
30
R. Izbicki, G. T. Shimizu, R. B. Stern, Flexible distribution-free conditional predictive bands using density estimators. arXiv [Preprint] (2019). https://arxiv.org/abs/1910.05575 (Accessed 20 August 2021).
31
R. Izbicki, G. Shimizu, R. B. Stern, CD-split and HPD-split: Efficient conformal regions in high dimensions. arXiv [Preprint] (2020). https://arxiv.org/abs/2007.12778 (Accessed 20 August 2021).
32
L. Gyorfi, H. Walk, “Nearest neighbor based conformal prediction” (Rep. Stuttgarter Mathematische Berichte 2020-002, Universität Stuttgart, Stuttgart, Germany, 2020).
33
M. Sesia, Y. Romano, Conformal histogram regression. arXiv [Preprint] (2021). https://arxiv.org/abs/2105.08747 (Accessed 20 August 2021).
34
W. Chen, Z. Wang, W. Ha, R. F. Barber, Trimmed conformal prediction for high-dimensional models. arXiv [Preprint] (2016). https://arxiv.org/abs/1611.09933 (Accessed 20 August 2021).
35
V. Chernozhukov, I. Fernandez-Val, A. Galichon, Improving point and interval estimators of monotone functions by rearrangement. Biometrika 96, 559–575 (2009).
36
V. Chernozhukov, I. Fernández-Val, A. Galichon, Quantile and probability curves without crossing. Econometrica 78, 1093–1125 (2010).
37
W. Hoeffding, The large-sample power of tests based on permutations of observations. Ann. Math. Stat. 23, 169–192 (1952).
38
E. J. Candès, L. Lei, Z. Ren, Conformalized survival analysis. arXiv [Preprint] (2021). https://arxiv.org/abs/2103.09763 (Accessed 20 August 2021).
39
R Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria, 2021).
40
G. Elliott, A. Timmermann, Economic Forecasting (Princeton University Press, 2016).
41
R. K. French, Kenneth French Data Library. Fama/French 3 Factors [Daily] Data (2021). http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. Accessed 17 August 2021.
42
F. Boussama, “Ergodicité, mélange et estimation dans les modeles GARCH,” PhD thesis, Paris 7, Paris, France (1998).
43
M. Carrasco, X. Chen, Mixing and moment properties of various GARCH and stochastic volatility models. Econom. Theory 18, 17–39 (2002).
44
C. Francq, J. M. Zakoïan, Mixing properties of a general class of GARCH (1,1) models without moment assumptions on the observed process. Econom. Theory 22, 815–834 (2006).
45
V. Chernozhukov, C. Hansen, M. Spindler, hdm: High-dimensional metrics. R J. 8, 185–199 (2016).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 118 | No. 48
November 30, 2021
PubMed: 34819368

Submission history

Accepted: September 24, 2021
Published online: November 23, 2021
Published in issue: November 30, 2021

Keywords

  1. prediction intervals
  2. conditional validity
  3. model-free validity
  4. quantile regression
  5. distribution regression

Notes

This article is a PNAS Direct Submission.
Published under the PNAS license.

Authors

Affiliations

Department of Economics, Massachusetts Institute of Technology, Cambridge, MA 02142;
Center for Statistics and Data Science, Massachusetts Institute of Technology, Cambridge, MA 02142;
Department of Economics, University of California San Diego, La Jolla, CA 92093;
CESifo, 81679 Munich, Germany;
ifo Center for Public Finance and Political Economy, ifo Institute, 81679 Munich, Germany;
Department of Economics, Brandeis University, Waltham, MA 02453;
International Business School, Brandeis University, Waltham, MA 02453

Notes

2
To whom correspondence may be addressed. Email: [email protected].
Author contributions: V.C., K.W., and Y.Z. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.
1
V.C., K.W., and Y.Z. contributed equally to this work.

Competing Interests

The authors declare no competing interest.
