Cross-prediction-powered inference

Significance

Machine learning is increasingly used as an efficient substitute for traditional data collection when the latter is challenging. For example, predictions of conditions such as poverty, deforestation, and population density based on satellite imagery are used to supplement accurate survey data, which requires significant time and resources to collect. However, predictions are imperfect and potentially biased, calling into question the validity of conclusions drawn from such data. This manuscript introduces a method for valid inference powered by machine learning. The method enables researchers to draw more reliable and accurate conclusions from machine learning predictions.


Introduction
As data-driven decisions fuel progress across science and technology, ensuring that such decisions are reliable is of critical importance. The reliability of data-driven decision-making rests on having access to high-quality data on the one hand, and on properly accounting for uncertainty on the other.
One frequently discussed issue is that acquiring high-quality data often involves laborious human labeling, slow and expensive scientific measurements, or overcoming privacy concerns when human subjects are involved. Machine learning offers a promising alternative: sophisticated techniques such as generative modeling and deep neural networks are being used to cheaply produce large amounts of data that would otherwise be too expensive or time-consuming to collect. For example, tools to predict protein structure are supporting wide-ranging research in biology [10,25,30,48]; large language models are being used to generate difficult-to-aggregate information about materials that can be used to fight climate change [58]; predictions of socioeconomic and environmental conditions based on satellite imagery are being used for downstream policy decisions [5,23,38,46]. This increasingly common practice, marked by supplementing high-quality data with machine learning outputs, calls for new principles of uncertainty quantification.
In this work we study this problem in the semi-supervised context, where labels are scarce but features are abundant. For example, precise measurements of environmental conditions are difficult to come by, but satellite imagery is abundant. Due to its volume, satellite imagery is routinely used in combination with computer vision algorithms to predict a range of factors on a global scale, including deforestation [21], poverty rates [23], and population densities [36]. These predictions provide a compelling substitute for resource-intensive ground-based measurements and surveys. However, it is crucial to acknowledge that, while promising, the predictions are not infallible. Consequently, downstream inferences that uncritically treat them as ground truth will be invalid.

Figure 1: Examples of GEE satellite imagery used in the deforestation analysis of Bullock et al. [13].
We introduce cross-prediction: a broadly applicable method for semi-supervised inference that leverages the power of machine learning while retaining validity. Assume a researcher holds both a small labeled dataset and a large unlabeled dataset, and they seek inference (i.e., a p-value or a confidence interval) about a population-level quantity such as the mean outcome or a regression coefficient. Cross-prediction carefully leverages black-box machine learning to impute the missing labels, resulting in both valid and powerful inferences. The validity is a result of a particular debiasing step; the power is a result of using sophisticated predictive techniques such as deep learning or random forests. We show that the use of black-box predictions on the unlabeled data can lead to a massive improvement in statistical power compared to using the labeled data alone.
Cross-prediction builds upon the recent proposal of prediction-powered inference [1]. Unlike prediction-powered inference, we do not assume that our researcher already has access to a predictive model for imputing the labels. Rather, to apply prediction-powered inference, the researcher would need to use a portion of the labeled data to either train a model from scratch or fine-tune an off-the-shelf model. We show that this leads to a suboptimal solution. Consider the following example studied by Angelopoulos et al. [1]. The goal is to form a confidence interval for the fraction of the Amazon rainforest that was lost between 2000 and 2015. A small number of "gold-standard" deforestation labels for certain parcels of land are available, having been collected through field visits [13]. In addition, satellite imagery is available for the entire Amazon; see Figure 1 for Google Earth Engine (GEE) examples used in the deforestation study of Bullock et al. [13]. Angelopoulos et al. apply prediction-powered inference after using a portion of the labeled data and a gradient-boosted tree to fine-tune a regression-tree-based predictor of forest cover [44]. Our work offers an alternative: we can avoid data splitting and instead apply cross-prediction, still with a gradient-boosted tree, to perform the fine-tuning. By doing so, we significantly reduce the size of the confidence interval, as seen in Figure 2. This trend will be consistent throughout our experiments: cross-prediction is more efficient than prediction-powered inference with data splitting. Figure 2 also shows that cross-prediction outperforms "classical" inference, which forms a confidence interval based on gold-standard labels only and simply ignores the unlabeled data. Additional details about these experiments can be found in Section 7.2.
Another important takeaway from Figure 2 is that cross-prediction gives more stable inferences: the confidence intervals have lower variability than the intervals computed via baseline approaches. Intuitively, classical inference has higher variability due to the smaller sample size, while prediction-powered inference has higher variability due to the arbitrariness in the data split. We will quantify the stability of cross-prediction in our experiments in Section 7, showcasing its superiority across a range of examples; see Table 4.
Our work is also related to the literature known as semi-supervised inference [56]. The main difference between existing approaches and our work is that our proposal leverages black-box machine learning methods, allowing for more complicated data modalities (such as high-dimensional imagery) and more sophisticated ways of leveraging the unlabeled data. We elaborate on the relationship to prior work in Section 2.1.

Problem setup
We study statistical inference in a semi-supervised setting, where collecting high-quality labels is challenging but feature observations are abundant. Formally, we have a dataset consisting of n i.i.d. feature-label pairs, {(X_1, Y_1), ..., (X_n, Y_n)} ~ P^n. In addition, we have a dataset consisting of N unlabeled data points, {X̃_1, ..., X̃_N} ~ P_X^N, where P_X denotes the marginal distribution over features according to P. We are most interested in the regime where N >> n, as in the case where feature collection is far cheaper than label collection.
Our goal is to perform inference on a property θ*(P) of the data-generating distribution P, such as the mean outcome, a quantile of the outcome distribution, or a regression coefficient. Our proposal handles all estimands defined as a solution to an M-estimation problem:

θ*(P) ∈ argmin_θ E[ℓ_θ(X, Y)],    (1)

for a convex loss function ℓ_θ. Here and throughout, (X, Y) denotes a generic sample from P independent of everything else. All of the aforementioned estimands can be cast in the form (1). For example, if the target of inference is the mean outcome, θ*(P) = E[Y], then θ*(P) minimizes the squared loss:

θ*(P) = argmin_θ E[(θ − Y)^2].    (2)

Note that the estimand (and thus the loss) will sometimes only depend on a subset of the features X or only on the outcome Y, as in Eq. (2). Also note that this manuscript focuses on θ*(P) ∈ R^d for a fixed d. Studying high-dimensional settings, for example understanding what scaling of d is permitted by the theory, is a valuable direction for future work. Below, we write θ* = θ*(P) for short.
The main question we address is this: how should we leverage the unlabeled data to achieve both valid and powerful inference? Validity alone is an easy target: we can simply dispense with the unlabeled data and find the classical estimator θ̂_class, defined as

θ̂_class ∈ argmin_θ (1/n) Σ_{i=1}^n ℓ_θ(X_i, Y_i).    (3)

For all standard estimands defined via M-estimation, including means, quantiles, and linear regression coefficients, there are off-the-shelf confidence intervals around θ̂_class that cover θ* with a desired probability in the large-sample limit [see, e.g., 28,49]. The classical estimator and the corresponding confidence intervals shall be the main comparison points used to evaluate the performance of cross-prediction.

Related work
We discuss the relationship between our work and the most closely related technical scholarship.
Semi-supervised inference. Our work falls within the literature known as semi-supervised inference [56]. Most existing work develops methods specialized to particular estimation problems, such as mean estimation [56,57], quantile estimation [14], or linear regression [4,47]. One exception is the recent work of Song et al. [45], who also study general M-estimation. Their approach uses a projection-based correction to the classical loss (3) based on simple statistics from the unlabeled data, such as averages of low-degree polynomials of the features. Unlike existing proposals, our approach is based on imputing the missing labels using black-box machine learning methods, allowing for more complicated data modalities and more intricate ways of leveraging the unlabeled data. For example, it is unclear how to apply existing methods when the features X_i are high-dimensional images. We also note that the semi-supervised observation model has long been studied in semi-supervised learning [50,59,60]. However, in this literature the goal is prediction, rather than inference.
Prediction-powered inference. The core idea in this paper, correcting imputed predictions, derives from the proposal of prediction-powered inference [1]. However, a key assumption in prediction-powered inference is that, in addition to a labeled and an unlabeled dataset, the analyst is given a good pre-trained machine learning model. We make no such assumption. To apply the theory of prediction-powered inference, our setting would require using a portion of the labeled data for model training and leaving the rest for inference. In contrast, cross-prediction leverages each labeled data point for both model training and inference, leading to a boost in statistical power. The distinction between having and not having a pre-trained model makes a difference even when comparing prediction-powered inference and the classical approach. Angelopoulos et al. [1] do not take into account the data used for model training when comparing the two baselines, because the model is assumed to have been trained before the analysis takes place. This makes sense when considering off-the-shelf models such as AlphaFold. In our comparisons we do take the training data into account. Angelopoulos et al. [2] show a central limit theorem for the prediction-powered estimator, allowing for computational and statistical improvements over the original methods for prediction-powered inference. Our inferences will be based on a similar central limit theorem for cross-prediction.
Wang et al. [51] similarly study inferences based on machine learning predictions. They propose using the labeled data to train a predictor of true outcomes from predicted ones, and then applying this predictor to debias the predictions on the unlabeled data. This algorithm does not come with a formal validity guarantee. Motwani and Witten [31] conduct a detailed empirical comparison of the method of Wang et al. and prediction-powered inference.
Theory of cross-validation. Cross-prediction is based on a form of cross-fitting. Consequently, our analysis is related to the theoretical studies of cross-validation [3,7,8,20,26]. In particular, our theory borrows from the analysis of Bayle et al. [8], who prove a central limit theorem and study inference on the cross-validation test error. Our goal, however, is entirely different; we aim to provide inferential guarantees for an estimand θ*, as defined in Eq. (1), in a semi-supervised setting.
Semiparametric inference. Our work is also related to the rich literature on semiparametric inference [9,16,22,27,29,32,34,37], where the goal is estimation in the presence of a high-dimensional nuisance parameter. Our debiasing strategy closely resembles doubly-robust estimators [6], such as the AIPW estimator [35,39], and one-step estimators [33]. In this literature the estimand is typically an expected value, such as the average treatment effect. One exception is the work of Jin and Rothenhäusler [24], who study general M-estimators through a semiparametric lens. The use of cross-fitting is common in that literature as well [16,17,18]. While the technical arguments used in our work bear resemblance to those classically used in semiparametric inference, our motivation is different. Our focus is on showcasing how a theoretically principled use of black-box predictors (neural networks, random forests, and so on) on massive amounts of unlabeled data can boost inference. Since the practice of leveraging unlabeled data through predictions is already prevalent in domains such as remote sensing, our goal is to ground it in statistical theory.
Inference with missing data. Semi-supervised inference can be seen as a special case of the problem of inference with missing data [41], where the missing information concerns the labels. Our proposed method bears similarities to multiple imputation [40,42,43] since, at least at a high level, it is based on "averaging out" multiple imputed predictions for the labels. However, our method is substantially different from multiple imputation, most notably because it incorporates a particular form of debiasing to mitigate prediction inaccuracies.
Inference under model misspecification. Finally, our work relates to a large body of work on inference under model misspecification [e.g., 11,12,52,53]. In particular, we do not assume that our predictions follow any "true" statistical model, and for parameters θ* defined as a regression solution, we do not assume that the regression model is correct. For example, if θ* is the solution to a linear regression, we do not assume that the data truly follow a linear model. As in classical M-estimation, we will show asymptotic normality of our estimator despite the misspecification.

Cross-prediction
We propose cross-prediction, an estimation technique based on a combination of cross-fitting and prediction. The basic idea is to impute labels for the unlabeled data points, and then remove the bias arising from the inaccuracies in the predictions using the labeled data. We give a step-by-step outline of the construction of the cross-prediction estimator. In the following sections we will show how to perform inference with this estimator; that is, how to perform hypothesis tests or construct confidence intervals for θ*.

Cross-prediction for mean estimation
Before discussing the general case, we consider the problem of mean estimation to gain intuition; the object of inference is simply θ* = E[Y]. We begin by partitioning the labeled dataset into K chunks, I_1 = {1, ..., n/K}, I_2 = {n/K + 1, ..., 2n/K}, and so on (we assume for simplicity that n is divisible by K). Here, K is a user-specified number of folds, e.g. K = 10. Then, as in cross-validation, we train a machine learning model K times, each time training on all data except one fold. Let A_train denote a possibly randomized training algorithm, which takes as input a dataset of arbitrary size and outputs a predictor of labels from features. Then, for each j ∈ [K], let f^(j) be the model obtained by training on all folds but I_j; that is, f^(j) = A_train({(X_i, Y_i) : i ∉ I_j}). We note that A_train can be quite general; it may or may not treat the training data points symmetrically, and f^(j) need not come from a well-defined family of predictors. Rather, f^(j) can be any black-box model, e.g. a random forest, a gradient-boosted tree, a neural network, and so on. Moreover, f^(j) can be trained from scratch or obtained by fine-tuning an off-the-shelf model. Finally, we use the trained models to impute predictions and compute the cross-prediction estimator, defined as:

θ̂+ = (1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K f^(j)(X̃_i) + (1/n) Σ_{j=1}^K Σ_{i∈I_j} (Y_i − f^(j)(X_i)).    (4)

Intuitively, the first term in Eq. (4) is an empirical approximation of the population mean if we treated the predictions as true labels. The second term in Eq. (4) serves to debias this heuristic: it subtracts an estimate of the bias between the predicted labels and the true labels. We note that the estimator (4) coincides with the mean estimator of Zhang and Bradic [57] in the special case where f^(j) are linear models, that is, f^(j)(x) = x^⊤ β_j for some β_j. Our analysis applies more broadly, allowing for complex high-dimensional models (e.g., image classifiers).
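To make the construction concrete, here is a minimal sketch of the cross-prediction mean estimator of Eq. (4). The function name `cross_prediction_mean` and its arguments are our own, and the gradient-boosted default is just a placeholder for any black-box training algorithm A_train:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def cross_prediction_mean(X, Y, X_unlabeled, K=10, train_fn=None):
    """Cross-prediction estimate of E[Y] (sketch of Eq. (4)).

    train_fn takes (X_train, Y_train) and returns a fitted model with a
    .predict method; gradient boosting is only a placeholder default.
    """
    if train_fn is None:
        train_fn = lambda Xtr, Ytr: GradientBoostingRegressor().fit(Xtr, Ytr)
    n = len(Y)
    imputed = np.zeros(len(X_unlabeled))  # running sum of model predictions on unlabeled data
    debias = 0.0                          # running sum of held-out residuals Y_i - f^(j)(X_i)
    for train_idx, heldout_idx in KFold(n_splits=K).split(X):
        model = train_fn(X[train_idx], Y[train_idx])  # f^(j): trained with fold I_j held out
        imputed += model.predict(X_unlabeled)
        debias += np.sum(Y[heldout_idx] - model.predict(X[heldout_idx]))
    # first term: mean of the averaged imputed labels; second term: bias correction
    return imputed.mean() / K + debias / n
```

Any model with a `fit`/`predict` interface can be swapped in via `train_fn`, which mirrors the point that A_train is fully black-box.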
Observe that the cross-prediction estimator is unbiased, i.e., E[θ̂+] = θ*. Indeed, since i ∈ I_j is not used to train model f^(j), we have E[f^(j)(X_i)] = E[f^(j)(X̃_i)], so the imputed term and the debiasing term cancel in expectation, leaving E[θ̂+] = E[Y] = θ*. The classical estimator is of course the sample mean,

θ̂_class = (1/n) Σ_{i=1}^n Y_i,    (5)

which is also unbiased. Given that both the cross-prediction estimator and the classical estimator are unbiased, it makes sense to ask which one has a lower variance. The main benefit of cross-prediction is that, if the trained models f^(j) are reasonably accurate, we expect the variance of the cross-prediction estimator to be lower. To see this, first recall that, typically, N >> n. This means that the first term in θ̂+ should have a vanishing variance due to the magnitude of N. Therefore,

Var(θ̂+) ≈ Var((1/n) Σ_{j=1}^K Σ_{i∈I_j} (Y_i − f^(j)(X_i))).

As the sample mean, the remaining term is an average of n terms. However, when the models are accurate, i.e. f^(j)(X_i) ≈ Y_i, the residuals Y_i − f^(j)(X_i) are nearly zero and hence have far smaller variance than the raw outcomes Y_i. The closest alternative to the cross-prediction estimator is the prediction-powered estimator [1], that is, its straightforward adaptation to the setup without a pre-trained model. As discussed earlier, prediction-powered inference relies on having a pre-trained model f. We can reduce our setting to this case by introducing data splitting: we use the first n_tr ≤ n data points from the labeled dataset to train a model f and the rest of the labeled data to compute the prediction-powered estimator:

θ̂_PP = (1/N) Σ_{i=1}^N f(X̃_i) + (1/(n − n_tr)) Σ_{i=n_tr+1}^n (Y_i − f(X_i)).    (6)

The prediction-powered estimator is also unbiased: E[θ̂_PP] = θ*. However, this strategy is potentially wasteful because, after f is trained, the training data is thrown away and not subsequently used for estimation. Cross-prediction uses the data more efficiently, by leveraging each data point for both training and estimation.

General cross-prediction
To introduce the cross-prediction estimator in full generality, recall that we are considering all estimands of the form (1). As in the case of mean estimation, we split the labeled data into K folds and train a predictive model f^(j) on all folds but fold j ∈ [K]. The proposed cross-prediction estimator is defined as:

θ̂+ ∈ argmin_θ L+(θ),  where  L+(θ) = (1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K ℓ_θ^{f^(j)}(X̃_i) + (1/n) Σ_{j=1}^K Σ_{i∈I_j} (ℓ_θ(X_i, Y_i) − ℓ_θ^{f^(j)}(X_i)).    (7)

Here, we use the short-hand notation ℓ_θ^f(x) = ℓ_θ(x, f(x)) for the loss evaluated at an imputed label. The intuition remains the same as before: the first term is an empirical approximation of the population loss if we treated the predictions as true labels, and the second term aims to debias this heuristic. One can verify that the mean estimator in Eq. (4) is indeed a special case of the general estimator in Eq. (7), by taking ℓ_θ to be the squared loss, as per Eq. (2).
The cross-prediction estimator optimizes an unbiased objective, since E[L+(θ)] = E[ℓ_θ(X, Y)] = L(θ), seeing that i ∈ I_j is not used to train model f^(j). Furthermore, by the same argument as before, we expect L+(θ) to have a lower variance than the classical objective in Eq. (3) if N is large and the trained predictors are reasonably accurate. We note that L+(θ) may not be a convex function in general, but solving for θ̂+ is tractable in many cases of interest. For example, in the case of means and generalized linear models, L+(θ) is convex.
The prediction-powered estimator is similar to the cross-prediction estimator, but it requires data splitting and does not average over multiple model fits. It is equal to

θ̂_PP ∈ argmin_θ { (1/N) Σ_{i=1}^N ℓ_θ^f(X̃_i) + (1/(n − n_tr)) Σ_{i=n_tr+1}^n (ℓ_θ(X_i, Y_i) − ℓ_θ^f(X_i)) },

where, as before, f is trained on the first n_tr labeled data points. The fact that cross-prediction averages the results of multiple model fits allows it to achieve more stable inference. Indeed, in our experiments we will observe that cross-prediction is more stable than prediction-powered inference throughout.
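The general estimator of Eq. (7) can also be sketched generically. In this illustration (all names ours), a derivative-free optimizer stands in for whatever problem-specific solver is appropriate, and the models and fold indices are assumed to have been produced by the cross-fitting step described above:

```python
import numpy as np
from scipy.optimize import minimize

def cross_prediction_m_estimate(loss, X, Y, X_unl, models, folds, theta0):
    """Minimize a sketch of the cross-prediction loss L+(θ) from Eq. (7).

    loss(theta, x_batch, y_batch) returns per-sample losses; models[j] is
    assumed to have been trained with fold folds[j] held out.
    """
    n = len(Y)

    def L_plus(theta):
        # term 1: imputed loss on the unlabeled data, averaged over the K models
        term1 = np.mean([loss(theta, X_unl, m.predict(X_unl)).mean() for m in models])
        # term 2: debiasing, loss at true labels minus loss at imputed labels, per fold
        term2 = sum(
            (loss(theta, X[idx], Y[idx]) - loss(theta, X[idx], m.predict(X[idx]))).sum()
            for m, idx in zip(models, folds)
        ) / n
        return term1 + term2

    # Nelder-Mead is a generic stand-in; convex losses admit faster solvers
    return minimize(L_plus, theta0, method="Nelder-Mead").x
```

For the squared loss this recovers the mean estimator of Eq. (4), which provides a simple sanity check.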

Inference for the mean
We now discuss inference with the cross-prediction estimator. For simplicity we first look at mean estimation, where θ* = E[Y]. We will see that much of the discussion carries over to general M-estimation problems.
Inference with the cross-prediction estimator in Eq. (4) is difficult because the terms being averaged are all dependent through the labeled data. In contrast, the classical estimator in Eq. (5) averages independent terms, allowing for confidence intervals based on the central limit theorem. The prediction-powered estimator in Eq. (6) is similarly amenable to inference based on the central limit theorem, seeing that all the terms are independent conditional on f. In this section we show that, under a relatively mild regularity condition, the cross-prediction estimator likewise satisfies a central limit theorem. This will in turn immediately allow constructing confidence intervals and hypothesis tests for θ*.
The central limit theorem will require that, as the sample size grows, the predictions concentrate sufficiently rapidly around their expectation. Intuitively, one can think of the condition as requiring that the predictions are sufficiently stable. While the stability property is difficult to verify for complex black-box models, we empirically observe that inference based on the resulting central limit theorem nevertheless provides the correct coverage. We observe this across different estimation problems, data modalities, sample sizes, and so on.
Our analysis based on stability is inspired by the work of Bayle et al. [8], who study inference on the cross-validation test error, since the inferential challenges in cross-prediction are similar to those in cross-validation. The ultimate goals of the two analyses are, however, entirely different.
Below we state the stability condition. For all x, we define f̄(x) := E[f^(1)(x)]; the "average" model f̄ is the predictor we would obtain if we could train many models on independent datasets of size n − n/K and average out their predictions.

Assumption 1. We say that the predictions are stable if, as n → ∞,

√K · Var(f^(1)(X) − f̄(X) | f^(1)) → 0,

where the convergence is in mean. Assumption 1 requires that the models f^(j) converge to their "average" model f̄, but there is no assumption that f̄ is by any means well-specified. If the number of folds is fixed (e.g., K = 10), as we will typically assume, then Assumption 1 is satisfied if the variance of the difference between the learned predictions f^(1)(X) and the average predictions f̄(X) vanishes at any rate, Var(f^(1)(X) − f̄(X)) → 0. We expect that any reasonably stable learning algorithm A_train should satisfy Assumption 1 (intuitively, any algorithm not too sensitive to perturbing a single data point). Violations of the assumption might arise if the number of folds is allowed to grow, e.g. as in the case of leave-one-out cross-fitting, since then the variance has to tend to zero sufficiently rapidly.
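To see what stability looks like empirically, one can approximate f̄ by averaging models trained on independent datasets and measure how far individual models stray from that average. The following toy sketch (function name and data-generating process ours, with ordinary least squares standing in for A_train) computes such a Monte-Carlo proxy; the deviation should shrink as n grows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def stability_proxy(n, n_reps=50, rng=None):
    """Monte-Carlo proxy for Var(f^(1)(X) - f̄(X)) (illustrative sketch):
    train n_reps models on independent datasets of size n and measure the
    average squared deviation of their predictions from the average model."""
    rng = np.random.default_rng(rng)
    X_eval = rng.normal(size=(1000, 2))  # fixed evaluation points
    preds = np.zeros((n_reps, 1000))
    for r in range(n_reps):
        X = rng.normal(size=(n, 2))
        Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
        preds[r] = LinearRegression().fit(X, Y).predict(X_eval)
    f_bar = preds.mean(axis=0)           # empirical stand-in for the average model f̄
    return np.mean((preds - f_bar) ** 2)
```

For a stable algorithm such as least squares, the proxy decays roughly like 1/n, consistent with the intuition behind Assumption 1.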
Equipped with Assumption 1, we can now state the central limit theorem for cross-prediction.
Theorem 1 (Cross-prediction CLT for the mean). Let θ* be the mean outcome, θ* = E[Y]. Suppose that the predictions are stable (Ass. 1). Further, assume that n/N has a limit r, and that σ̃^2 = Var(f̄(X)) and σ_∆^2 = Var(f̄(X) − Y) are finite. Then,

√n (θ̂+ − θ*) →_d N(0, r·σ̃^2 + σ_∆^2).

With this, inference on θ* is now straightforward as long as we can estimate the asymptotic variance consistently. We will discuss strategies for doing so in Section 6.
Corollary 1 (Inference for the mean via cross-prediction). Let θ* be the mean outcome, θ* = E[Y]. Assume the conditions of Theorem 1, and suppose that we have consistent estimators σ̂̃^2 →_p σ̃^2 and σ̂_∆^2 →_p σ_∆^2. Then, the confidence interval

C_α = ( θ̂+ ± z_{1−α/2} · √(((n/N)·σ̂̃^2 + σ̂_∆^2) / n) )

satisfies lim inf_{n→∞} P(θ* ∈ C_α) ≥ 1 − α. Per standard notation, z_{1−α/2} denotes the (1 − α/2)-quantile of the standard normal distribution. Corollary 1 follows by a direct application of Theorem 1, together with Slutsky's theorem.
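Once the two variance components have been estimated, the interval construction itself is mechanical. A minimal sketch (function and argument names ours):

```python
import numpy as np
from scipy.stats import norm

def cross_prediction_ci_mean(theta_plus, sigma_tilde2, sigma_delta2, n, N, alpha=0.1):
    """Corollary-1-style interval (sketch): θ̂+ ± z_{1-α/2}·√((n/N)·σ̃² + σ²_Δ)/√n.

    sigma_tilde2 and sigma_delta2 are consistent estimates of σ̃² and σ²_Δ.
    """
    z = norm.ppf(1 - alpha / 2)  # (1 - α/2)-quantile of the standard normal
    halfwidth = z * np.sqrt((n / N) * sigma_tilde2 + sigma_delta2) / np.sqrt(n)
    return theta_plus - halfwidth, theta_plus + halfwidth
```

Note that when N >> n, the (n/N)·σ̃² term is negligible and the width is driven almost entirely by the variance of the debiasing residuals.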

Inference for general M-estimation
We generalize the principle introduced in Section 4 to handle arbitrary M-estimation problems. Indeed, the results presented in this section will strictly subsume the results of Section 4.
As in the case of the mean, we will require that the predictions are "stable" in an appropriate sense. Naturally, the notion of stability will depend on the loss function used to define the M-estimator.

Assumption 2. With f̄(·) as before, we say that the predictions are stable if for all θ, as n → ∞,

√K · Var(∇ℓ_θ(X, f^(1)(X)) − ∇ℓ_θ(X, f̄(X)) | f^(1)) → 0.

Here, Var(· | f^(1)) denotes the covariance matrix conditional on f^(1). Also, for vectors and matrices, by "→ 0" we mean convergence in mean to zero element-wise. Notice that by setting ℓ_θ(y) = (θ − y)^2 to be the squared loss, Assumption 2 reduces to Assumption 1 in the case of mean estimation. As in the case of Assumption 1, Assumption 2 should be interpreted as a stability requirement on A_train. Moreover, there is again no requirement of correct specification of f̄.

We will provide two approaches to inference in this section; which one is more appropriate will depend on the inference problem at hand. One approach will be based on the characterization of θ* as a zero of the gradient of the expected loss, ∇L(θ*) = E[∇ℓ_{θ*}(X, Y)] = 0, which follows by the convexity of the loss. In particular, we will construct a confidence set for θ* by finding all θ accepted by a valid test for the null hypothesis that ∇L(θ) = 0. Since the test is valid and θ* satisfies the null condition, the true solution θ* will be excluded with small probability. The hypothesis test for the population gradient ∇L(θ) will follow from a central limit theorem for the gradient of the cross-prediction loss,

∇L+(θ) = (1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K ∇ℓ_θ^{f^(j)}(X̃_i) + (1/n) Σ_{j=1}^K Σ_{i∈I_j} (∇ℓ_θ(X_i, Y_i) − ∇ℓ_θ^{f^(j)}(X_i)).

The other approach will be based on showing asymptotic normality of the cross-prediction estimator. For this, we build on the proof of asymptotic normality of the prediction-powered estimator (with a pre-trained model) [2], which in turn builds on the classical asymptotic normality of M-estimators [49]. The asymptotic normality will allow forming standard CLT intervals around θ̂+.
We implicitly assume mild regularity on the losses ℓ_θ(x, y) and ℓ_θ(x, f^(j)(x)), in particular that they are differentiable and locally Lipschitz around θ* for all possible f^(j) (see Def. A.1 in [2]). Our second inference approach will require the usual condition that θ̂+ is consistent, θ̂+ →_p θ*; this is satisfied quite broadly, e.g. when the parameter space is compact or when L+(θ) is convex. The latter holds for all generalized linear models, for example. See [2,49] for further discussion.
Theorem 2 states the main technical result of this section, which extends Theorem 1 to general M-estimation problems.

Theorem 2 (Cross-prediction CLT). Suppose that the predictions are stable (Ass. 2) and that n/N has a limit. Then, for any fixed θ,

√n (∇L+(θ) − ∇L(θ)) →_d N(0, V_θ),  where  V_θ = (n/N)·Σ̃_θ + Σ_{∆,θ},

for Σ̃_θ = Var(∇ℓ_θ^{f̄}(X)) and Σ_{∆,θ} = Var(∇ℓ_θ^{f̄}(X) − ∇ℓ_θ(X, Y)). Moreover, if θ̂+ is consistent, then √n (θ̂+ − θ*) →_d N(0, Σ), where Σ = H_{θ*}^{−1} V_{θ*} H_{θ*}^{−1} and H_{θ*} = ∇^2 L(θ*).
Corollary 2 (Inference via cross-prediction). Suppose that we have estimators Σ̂ →_p Σ and V̂_θ →_p V_θ, for all θ. Then, assuming the conditions of Theorem 2, for either

C_α = {θ : n · ∇L+(θ)^⊤ V̂_θ^{−1} ∇L+(θ) ≤ χ^2_{d,1−α}}  or  C_α = ×_{i=1}^d ( θ̂+_i ± z_{1−α/(2d)} √(Σ̂_{ii}/n) ),

we have lim inf_{n→∞} P(θ* ∈ C_α) ≥ 1 − α. Above, χ^2_{d,1−α} is the (1 − α)-quantile of the chi-squared distribution with d degrees of freedom; when d = 1 (as in the case of mean estimation), χ_{d,1−α} is equal to z_{1−α/2}. Note also that in the case of mean estimation, the two confidence sets are identical and reduce to the set from Corollary 1. In the second confidence set we apply a Bonferroni correction over the d coordinates of the estimand for simplicity and clarity of exposition; we can obtain an asymptotically exact (1 − α)-confidence set as {θ : n (θ̂+ − θ)^⊤ Σ̂^{−1} (θ̂+ − θ) ≤ χ^2_{d,1−α}}. Next, we apply Theorem 2 and Corollary 2 to concrete problems (quantile estimation, linear regression, and generalized linear models) to get explicit confidence interval constructions.

Example: quantile estimation
Assume we are interested in a quantile of Y,

θ* = inf{θ : P(Y ≤ θ) ≥ q},

for a level q ∈ (0, 1). The quantile can equivalently be written as any minimizer of the pinball loss, ℓ_θ(y) = (1{y ≤ θ} − q)(θ − y). The subgradient of the pinball loss is equal to ∇ℓ_θ(y) = −q·1{y > θ} + (1 − q)·1{y ≤ θ} = −q + 1{y ≤ θ}. Plugging this expression into the first confidence set from Corollary 2 yields

C_α = {θ : |F̄+(θ) + ∆̂+(θ) − q| ≤ z_{1−α/2} √(((n/N)·σ̂_θ^2 + σ̂_{∆,θ}^2)/n)},

where F̄+(θ) = (1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K 1{f^(j)(X̃_i) ≤ θ} is the average empirical CDF of the predictions on the unlabeled data, and ∆̂+(θ) = (1/n) Σ_{j=1}^K Σ_{i∈I_j} (1{Y_i ≤ θ} − 1{f^(j)(X_i) ≤ θ}) is the difference between the empirical CDFs of the true outcomes and the predictions on the labeled data. The standard errors are equal to σ_θ^2 = Var(1{f̄(X) ≤ θ}) and σ_{∆,θ}^2 = Var(1{f̄(X) ≤ θ} − 1{Y ≤ θ}). The confidence set C_α thus consists of all values θ such that the average predicted CDF F̄+(θ), corrected by the bias ∆̂+(θ), is close to the target level q.
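In practice the quantile confidence set can be computed by scanning a grid of candidate values. The sketch below (names ours) assumes the fold-wise predictions have already been computed, and uses simple plug-in variance estimates in place of the bootstrap strategy described later:

```python
import numpy as np
from scipy.stats import norm

def quantile_confidence_set(grid, Y_folds, preds_labeled, preds_unlabeled, q, alpha=0.1):
    """Grid-scan sketch of the quantile confidence set.

    preds_unlabeled: (K, N) array, model j's predictions on the unlabeled data;
    preds_labeled, Y_folds: held-out predictions and labels over the labeled data.
    """
    K, N = preds_unlabeled.shape
    n = len(Y_folds)
    z = norm.ppf(1 - alpha / 2)
    accepted = []
    for theta in grid:
        F_plus = (preds_unlabeled <= theta).mean()  # average predicted CDF at theta
        # bias correction: true-label CDF minus predicted CDF on the labeled data
        delta = (Y_folds <= theta).mean() - (preds_labeled <= theta).mean()
        s_theta2 = np.var(preds_unlabeled <= theta)
        s_delta2 = np.var((preds_labeled <= theta).astype(float) - (Y_folds <= theta))
        halfwidth = z * np.sqrt((n / N) * s_theta2 + s_delta2) / np.sqrt(n)
        if abs(F_plus + delta - q) <= halfwidth:
            accepted.append(theta)
    return accepted
```

When the loss is non-smooth, as here, scanning a grid of nulls is a natural way to invert the gradient test; the grid resolution controls the granularity of the resulting set.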

Example: linear regression
In linear regression, the target of inference is defined by

θ* = argmin_θ E[(Y − X^⊤θ)^2].    (8)

In this case, the cross-prediction estimator, equal to the solution to ∇L+(θ̂+) = 0, has a closed-form expression. Letting X̃ ∈ R^{N×d} (resp. X ∈ R^{n×d}) be the unlabeled (resp. labeled) data matrix, and Y ∈ R^n the vector of labeled outcomes, the solution is given by

θ̂+ = (X̃^⊤X̃)^{−1} ( X̃^⊤ f_avg(K)(X̃) − (N/n)·X^⊤(f_{1:K}(X) − Y) ),

where f_avg(K)(X̃) = (1/K) Σ_{j=1}^K f^(j)(X̃) is the vector of average predictions on the unlabeled data, and f_{1:K}(X) is the vector of fold-wise predictions on the labeled data: f_{1:K}(X) = (f^(1)(X_1), ..., f^(1)(X_{n/K}), f^(2)(X_{n/K+1}), ..., f^(K)(X_n))^⊤. We see that θ̂+ resembles the usual least-squares estimator with f_avg(K)(X̃) as the response, except for the extra debiasing factor, (N/n)·X^⊤(f_{1:K}(X) − Y), that takes into account the prediction inaccuracies. Instantiating the relevant terms, Theorem 2 implies that θ̂+ is asymptotically normal with covariance equal to Σ_OLS = H^{−1} V_{θ*} H^{−1}, where H = E[XX^⊤] and V_{θ*} = (n/N)·Σ̃_{θ*} + Σ_∆, for Σ̃_{θ*} = Var((f̄(X) − X^⊤θ*)X) and Σ_∆ = Var((f̄(X) − Y)X).
For a given coordinate of interest i, a confidence interval for θ*_i can therefore be obtained as

θ̂+_i ± z_{1−α/2} √((Σ̂_OLS)_{ii} / n),

given an estimate Σ̂_OLS of Σ_OLS.
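The closed-form estimator translates directly into a few lines of linear algebra. A sketch (function and argument names ours) that assumes the averaged and fold-wise predictions have already been computed:

```python
import numpy as np

def cross_prediction_ols(X_unl, f_avg_unl, X_lab, f_foldwise_lab, Y_lab):
    """Closed-form cross-prediction OLS estimate (sketch).

    f_avg_unl: the K models' predictions on the unlabeled data, averaged per point;
    f_foldwise_lab[i]: prediction of the model whose held-out fold contains point i.
    """
    n, N = len(Y_lab), len(X_unl)
    # normal equations: (X̃ᵀX̃) θ = X̃ᵀ f_avg(X̃) − (N/n)·Xᵀ(f_{1:K}(X) − Y)
    rhs = X_unl.T @ f_avg_unl - (N / n) * (X_lab.T @ (f_foldwise_lab - Y_lab))
    return np.linalg.solve(X_unl.T @ X_unl, rhs)
```

When the fold-wise predictions match the labels exactly, the debiasing term vanishes and the estimator reduces to ordinary least squares of the imputed responses on the unlabeled features, which makes for a convenient sanity check.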

Example: generalized linear models
We can generalize the previous example by considering all generalized linear models (GLMs). In particular, we consider targets of inference given by

θ* = argmin_θ E[−log p_θ(Y | X)] = argmin_θ E[ψ(X^⊤θ) − Y·X^⊤θ],    (9)

where p_θ(y|x) = exp(y·x^⊤θ − ψ(x^⊤θ)) is the probability density of the outcome given the features and the log-partition function ψ is convex. The objective (9) recovers the linear-regression objective (8) by setting ψ(s) = s^2/2. It captures logistic regression by choosing ψ(s) = log(1 + e^s).
The asymptotic covariance given by Theorem 2 evaluates to Σ_GLM = H_{θ*}^{−1} V_{θ*} H_{θ*}^{−1}, where H_{θ*} = E[ψ′′(X^⊤θ*)·XX^⊤] and V_{θ*} = (n/N)·Σ̃_{θ*} + Σ_{∆,θ*}, for Σ̃_{θ*} = Var((ψ′(X^⊤θ*) − f̄(X))X) and Σ_{∆,θ*} = Var((f̄(X) − Y)X). Therefore, analogously to the OLS case, given an estimate Σ̂_GLM of Σ_GLM, we can construct a confidence interval for θ*_i as

θ̂+_i ± z_{1−α/2} √((Σ̂_GLM)_{ii} / n).

Variance estimation via bootstrapping
The previous inference results rely on being able to estimate the asymptotic covariance of θ̂+. We now provide an explicit estimation strategy that we use in our experiments.
Recall that the asymptotic covariance is equal to Σ = H_{θ*}^{−1} V_{θ*} H_{θ*}^{−1}, where V_θ = (n/N)·Σ̃_θ + Σ_{∆,θ}, for Σ̃_θ = Var(∇ℓ_θ^{f̄}(X)) and Σ_{∆,θ} = Var(∇ℓ_θ^{f̄}(X) − ∇ℓ_θ(X, Y)). Estimating the Hessian H_θ is easy via plug-in estimation; V_θ, on the other hand, depends on the average model f̄. If the average model f̄ were known, one could compute estimates of Σ̃_θ and Σ_{∆,θ} by replacing the true covariances with their empirical counterparts. Thus, the challenge is to approximate f̄. To achieve this, we apply the bootstrap to simulate multiple model training runs, and at the end we average the predictions of all the learned models.
Finally, we approximate the covariance by (n/N)·Σ̂̃_θ + Σ̂_{∆,θ}. In computing Σ̂_{∆,θ}, we technically do not average out the bootstrapped models, because we want to make sure that every point (X_i, Y_i) used to compute the gradient bias is independent of its corresponding model f^(b)_boot. Intuitively, as per the discussion following Assumption 1, if A_train is stable we expect f^(b)_boot to be a good approximation of the average model f̄, which in turn means that the bootstrap covariance estimates should be consistent per conventional wisdom. We show empirically that the covariance estimates lead to valid coverage across a range of applications. To give one concrete example, consider mean estimation: we estimate σ̃^2 by the empirical variance of the averaged bootstrap predictions on the unlabeled data, estimate σ_∆^2 by the empirical variance of the residuals Y_i − f^(b)_boot(X_i) on the labeled data, with each labeled point paired with a bootstrap model rather than the average, and form the final interval as in Corollary 1 with these estimates plugged in.
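For mean estimation, the bootstrap strategy can be sketched as follows. All names are ours, and pairing every labeled point with each bootstrap model in turn (rather than only with models that did not see that point) is a simplification on our part relative to the independence caveat discussed above:

```python
import numpy as np

def bootstrap_variance_mean(X, Y, X_unl, train_fn, B=30, rng=None):
    """Bootstrap approximation of the variance components σ̃² and σ²_Δ for
    mean estimation (simplified sketch of the Section 6 strategy)."""
    rng = np.random.default_rng(rng)
    n = len(Y)
    preds_unl = np.zeros((B, len(X_unl)))
    preds_lab = np.zeros((B, n))
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the labeled data
        model = train_fn(X[idx], Y[idx])
        preds_unl[b] = model.predict(X_unl)
        preds_lab[b] = model.predict(X)
    # σ̃²: variance of the averaged predictions, approximating Var(f̄(X))
    sigma_tilde2 = np.var(preds_unl.mean(axis=0))
    # σ²_Δ: variance of the debiasing residuals, averaged over the individual
    # bootstrap models (the models are not averaged out here)
    sigma_delta2 = np.mean([np.var(Y - preds_lab[b]) for b in range(B)])
    return sigma_tilde2, sigma_delta2
```

The two returned estimates can then be plugged into the Corollary 1 interval.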

Experiments
We evaluate cross-prediction and compare it to baseline approaches on several datasets. The baselines are the classical inference method, which uses only the labeled data, and prediction-powered inference with a data-splitting step to train a predictive model. Code for reproducing the experiments is available at: https://github.com/tijana-zrnic/cross-ppi.
For each experimental setting, we plot the coverage and confidence interval width estimated over 100 trials for all baselines.We also show the constructed confidence intervals for five randomly chosen trials.
Finally, to quantify the stability of inferences, we report the standard deviation of the lower and upper endpoints of the confidence intervals for each method.
We begin with proof-of-concept experiments on synthetic data. Then, we move on to more complex real datasets.

Proof-of-concept experiments on synthetic data
To build intuition, we begin with simple experiments on synthetic data. The purpose is to confirm what we expect in theory: (a) as it becomes easier to predict labels from features, cross-prediction and prediction-powered inference become more powerful and increasingly outperform the classical approach; (b) cross-prediction uses the data more efficiently than prediction-powered inference, yielding smaller intervals; (c) cross-prediction gives more stable inferences than the baselines when the predictions are useful; (d) all three approaches achieve satisfactory coverage.
In all of the following experiments, we have N = 10,000 unlabeled data points, and we vary the size of the labeled dataset n between 100 and 1000, in 100-point increments. We apply cross-prediction with K = 10 folds. We estimate the variance using the bootstrap approach described in Section 6, with B = 30 bootstrap samples. For prediction-powered inference, we use half of the labeled data for model training. To illustrate the point that cross-prediction can be applied with any black-box model, we train gradient-boosted trees via XGBoost [15] to obtain the models f^(j); we use the same model-fitting strategy for prediction-powered inference. We fix the target error level at α = 0.1 and average the results over 100 trials.
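The cross-prediction mean estimator itself can be sketched in a few lines. This toy version uses a simple least-squares fit in place of XGBoost and a plug-in variance in place of the bootstrap of Section 6; the data-generating process and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, K = 1000, 10_000, 10

X = rng.normal(size=n)
Y = X + rng.normal(scale=0.5, size=n)  # labeled data; E[Y] = 0
X_unl = rng.normal(size=N)             # unlabeled features

def train(x, y):
    """Stand-in for XGBoost: least-squares slope through the origin."""
    slope = x @ y / (x @ x)
    return lambda z: slope * z

folds = np.array_split(rng.permutation(n), K)
pred_unl = np.zeros(N)  # accumulates (1/K) sum_j f^(j) on unlabeled data
resid = np.zeros(n)     # out-of-fold errors f^(j)(X_i) - Y_i
for fold in folds:
    mask = np.ones(n, dtype=bool)
    mask[fold] = False
    f_j = train(X[mask], Y[mask])  # model trained without fold j
    pred_unl += f_j(X_unl) / K
    resid[fold] = f_j(X[fold]) - Y[fold]

theta_plus = pred_unl.mean() - resid.mean()  # debiased cross-prediction estimate
se = np.sqrt(np.var(pred_unl) / N + np.var(resid) / n)
ci = (theta_plus - 1.645 * se, theta_plus + 1.645 * se)  # alpha = 0.1
```

Each labeled point contributes a residual from a model that never saw it during training, which is what licenses the debiasing term.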
Mean estimation. For given parameters R² and σ²_Y, the data-generating distribution is constructed so that the features explain a fraction R² of the variance of the outcome, while Var(Y) = σ²_Y is held fixed; we vary R² ∈ {0, 0.5, 1}. The idea is to vary the degree to which the outcomes can be explained through the features: when R² = 0, the outcome is independent of the features and we do not expect predictions to help, while when R² = 1, the outcome can be perfectly explained through the features and we expect predictions to be helpful. Given that the variance of Y is kept constant regardless of R², classical inference has the same distribution of widths across R². The target of inference is the mean outcome, θ* = E[Y].

In Figure 3 we plot the coverage and interval width of cross-prediction, classical inference, and prediction-powered inference, as well as five example intervals. All three methods approximately achieve the target coverage, and cross-prediction dominates prediction-powered inference throughout. Further, we see that the classical approach dominates the alternatives when the features are independent of the outcomes, while the alternatives become more powerful as R² increases. To evaluate stability, in Table 1 we report the standard deviation of the lower and upper endpoints of the confidence intervals from Figure 3, for n = 100. We observe that the classical approach is the most stable method when R² = 0, which makes sense because the predictions can only introduce noise. When R² = 0.5, cross-prediction and classical inference have a similar degree of stability, while when R² = 1 cross-prediction is significantly more stable. Moreover, cross-prediction is significantly more stable than prediction-powered inference throughout. These trends hold across different values of n; we include only the results for n = 100 for simplicity of exposition.
Quantile estimation. We adopt the same data-generating process as for mean estimation, changing only the target of inference θ* to be the 75th percentile of the outcome distribution.
In Figure 4 we plot the coverage and interval width of cross-prediction, classical inference, and prediction-powered inference, as well as five example intervals. We observe a qualitatively similar comparison as in the case of mean estimation: all three methods approximately achieve the target coverage, and cross-prediction dominates prediction-powered inference throughout. As before, the classical approach dominates the alternatives when the features are independent of the outcomes, and the alternatives become increasingly powerful as R² increases. In Table 2 we evaluate the stability of the methods by reporting the standard deviation of the lower and upper endpoints of the confidence intervals from Figure 4, for n = 100. As before, Table 2 shows that cross-prediction is more stable than prediction-powered inference for all values of R², and when R² = 0 classical inference is the most stable option. When R² = 0.5, cross-prediction has a slightly more stable upper endpoint than classical inference, while classical inference has a more stable lower endpoint. When R² = 1, cross-prediction is by far the most stable method. Again, these trends are largely consistent across different values of n; we include only the results for n = 100 for simplicity.
Linear regression. Finally, we look at linear regression. For robustness and interpretability, it is common to include only a subset of the available features in the regression; the process of deciding which variables to include is known as model selection. The variables that are not included in the model may still be predictive of the outcome of interest; we demonstrate that, as such, they can be useful for inference.
The data-generating distribution is defined as follows: we generate Gaussian features X = (X₁, X₂, X₃) and construct the outcome so that a fraction R²₀ of its residual variance, beyond what is explained by (X₁, X₂), is attributable to X₃. Again, the idea is to vary how much of the outcome can be explained through prediction versus how much of it is exogenous randomness. We fix σ²_Y = 4 and vary R²₀ ∈ {0, 0.5, 1}. The target of inference is defined as the first coordinate of the least-squares solution when regressing Y on (X₁, X₂). In this case, this is simply equal to θ* = β₁ = 1.
In Figure 5 we plot the coverage and interval width of cross-prediction, classical inference, and prediction-powered inference. When R²₀ = 0, the classical approach outperforms the prediction-based approaches; as R²₀ grows, meaning more of the randomness of the outcome can be attributed to X₃, the prediction-based approaches dominate the classical one. As before, cross-prediction yields smaller intervals than prediction-powered inference. We remark that, even though the inference problem posits a linear model, the prediction-based approaches still use XGBoost for model training. As in the previous two examples, we report on the stability of the three methods in Table 3, again fixing n = 100 for simplicity. Cross-prediction is far more stable than prediction-powered inference throughout, and it is more stable than classical inference for nonzero values of R²₀.

Table 3: Standard deviation of the lower (σ_L) and upper (σ_U) endpoints of the confidence intervals in the linear regression problem from Figure 5, for n = 100. The minimum value in each column is in bold.

Estimating deforestation from satellite imagery
We briefly revisit the problem of deforestation analysis from Section 1. As we saw in Figure 2, cross-prediction gave tighter confidence intervals for the deforestation rate than using gold-standard measurements of deforestation alone. In other words, cross-prediction can enable a reduction in the number of necessary field visits to measure deforestation. Moreover, we saw that cross-prediction outperformed prediction-powered inference.
Here we highlight another benefit of cross-prediction in this problem: it is a more stable solution than the baselines. Table 4 shows the standard deviation of the endpoints of the confidence intervals constructed by cross-prediction and its competitors. Cross-prediction has significantly lower variability of the endpoints than both classical inference and prediction-powered inference, while the latter two exhibit similar variability.
Finally, we provide the experimental details that were omitted in Section 1 for brevity. We have n_all = 3192 data points with gold-standard labels in total. To simulate having unlabeled images, in each trial we randomly split the data into n points to serve as the labeled data, for varying n ∈ {0.1n_all, 0.2n_all, 0.3n_all}, and treat the remaining points as unlabeled. The target of inference is the fraction of deforested areas across the locations contained in the sample. We apply cross-prediction with K = 10 folds. For prediction-powered inference, we use n_tr = 0.1n points for model tuning. We average the results over 100 trials.

Estimating relationships between socioeconomic covariates in survey data
We evaluate cross-prediction on the American Community Survey (ACS) Public Use Microdata Sample (PUMS). We investigate the relationship between age, sex, and income in survey data collected in California in 2019 (n_all = 377,575 people in total). High-quality survey data is generally difficult and time-consuming to collect. With this experiment we hope to demonstrate how, by imputing missing information based on the available covariates, cross-prediction can achieve both powerful and valid inferences while reducing the requisite amount of survey data. See Figure 6 for a subset of the available covariates in the ACS PUMS data.
We use the Folktables [19] interface to download the PUMS data, including income, age, sex, and 15 other demographic covariates. In each trial, we randomly sample n data points to serve as the labeled data, for varying n ∈ {0.1n_all, 0.2n_all, 0.3n_all}, and treat the remaining data points as the unlabeled data. The target of inference is the linear regression coefficient when regressing income on age and sex, θ^OLS = argmin_θ E[(Y − X_ols^⊤θ)²], where Y is income and X_ols encodes age and sex, X_ols = (X_age, X_sex). For the purpose of evaluating coverage, the corresponding coefficient computed on the whole dataset is taken as the ground-truth value of the estimand. To obtain the models f^(j), we train gradient-boosted trees via XGBoost [15]. Note that the predictors use all 17 covariates to predict the missing labels, even though the target of inference is only defined with respect to two covariates. We apply cross-prediction with K = 5 folds. For prediction-powered inference, we use n_tr = 0.2n points for model training, and we also train gradient-boosted trees. The target error level is α = 0.1 and we average the results over 100 trials.
In Figure 7 we plot the coverage and interval width for the three baselines, together with five example intervals. All three methods cover the true target with the desired probability. Moreover, as before, cross-prediction outperforms prediction-powered inference. In this example, the predictive power of the trained models is not high enough for prediction-powered inference to outperform the classical approach; cross-prediction, however, outperforms both. In Table 4 we report on the stability of the three methods for n = 0.1n_all. We observe that cross-prediction is more stable than both alternatives. We also observe that prediction-powered inference has more stable intervals than the classical approach, despite the fact that they are wider on average.

Estimating the prevalence of spiral galaxies from galaxy images
We next look at galaxy data from the Galaxy Zoo 2 dataset [54], consisting of human-annotated images of galaxies from the Sloan Digital Sky Survey [55]. Of particular interest are galaxies with spiral arms, which are correlated with star formation in the discs of low-redshift galaxies and thus contribute to the understanding of star formation. See Figure 8 for example images of a spiral and a nonspiral galaxy. We show that, by leveraging the massive amounts of unlabeled galaxy imagery together with machine learning, cross-prediction can decrease the requisite number of human annotations for inference on galaxy demographics.
We have 167,434 annotated galaxy images. In each trial, we randomly split them into n points to serve as the labeled data, for n ∈ {10000, 20000, 30000}, and treat the remaining data points as unlabeled.
The target of inference is the fraction of spiral galaxies in the dataset, equal to about 26.22%. To compute predictions, we fine-tune all layers of a pre-trained ResNet50. We apply cross-prediction with K = 3 folds. For prediction-powered inference, we use n_tr = 0.1n points for model training. The target error rate is α = 0.1 and we average the results over 100 trials.
In Figure 9 we plot the coverage and interval width of the three methods, as well as the intervals for five randomly chosen trials. Both cross-prediction and prediction-powered inference yield smaller intervals than the classical approach. Moreover, cross-prediction dominates prediction-powered inference. We observe satisfactory coverage for all three procedures. In Table 4 we evaluate the stability of the procedures for n = 10,000. Cross-prediction is significantly more stable than classical inference and prediction-powered inference. The latter two achieve a similar degree of stability.

Evaluating heuristics
In Figure 2, we saw that cross-prediction gave tighter confidence intervals than the baseline approaches for the problem of deforestation analysis. In this section, we test two heuristic ways of reducing the variance of the classical approach and prediction-powered inference, and compare these heuristics to cross-prediction.
The first heuristic removes the debiasing from the cross-prediction estimator and simply averages the predictions on the large unlabeled dataset:

θ̃_avg = (1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K f^(j)(X̃_i).

This is akin to computing the classical estimator while pretending that the predicted labels are the ground truth. The second heuristic trains a model on all the labeled data, f_all = A_train({(X_i, Y_i)}_{i=1}^n), and computes

θ̃_all = (1/N) Σ_{i=1}^N f_all(X̃_i) − (1/n) Σ_{i=1}^n (f_all(X_i) − Y_i).

This estimator is akin to the prediction-powered estimator if we treated f_all as fixed and independent of the labeled dataset.
For both heuristics, we form confidence intervals based on the usual central limit theorem that assumes i.i.d. sampling. For the first heuristic this is done conditional on the trained models f^(j), since the terms are indeed conditionally independent given f^(1), ..., f^(K). Since the second heuristic proceeds under the assumption that f_all can be seen as independent of the labeled data, we apply the central limit theorem to the two sums separately, as if f_all were fixed. We see in Figure 10 that removing the debiasing is detrimental to coverage; removing the folds has a more moderate effect that vanishes with n, but it is nevertheless significant. Cross-prediction yields wider intervals than both heuristics, and by doing so it maintains correct coverage.
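To see why the debiasing matters, consider a toy setting in which the model class is misspecified, so the predictions are systematically biased. The sketch below compares the undebiased average of predictions with a debiased, prediction-powered-style correction; the quadratic design, the simple linear stand-in model, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 300, 10_000
X = rng.normal(size=n)
Y = X ** 2                  # E[Y] = E[X^2] = 1
X_unl = rng.normal(size=N)  # unlabeled features

def train(x, y):
    """A linear fit through the origin -- misspecified for Y = X^2."""
    slope = x @ y / (x @ x)
    return lambda z: slope * z

f_all = train(X, Y)  # model trained on all labeled data, as in the second heuristic

# No debiasing: just average predictions on the unlabeled data.
theta_naive = f_all(X_unl).mean()
# Debiased: correct by the average prediction error on the labeled data.
theta_debiased = theta_naive - (f_all(X) - Y).mean()
```

The linear fit predicts roughly zero on average, so `theta_naive` misses the true mean of 1 badly, while the debiased estimate recovers it; the folds (omitted here for brevity) additionally protect the correction term from overfitting bias.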

A Proofs
A.1 Proof of Theorem 1 (CLT for mean estimation)

The proof builds on two key technical lemmas, which leverage the notion of stability in Assumption 1 to show that we can "replace" the models f^(j) in the definition of the cross-prediction estimator (4) with the "average" model f̄. Since f̄ is a nonrandom model, we can proceed with a standard CLT analysis of the two terms comprising the estimator.
We begin by stating and proving the technical lemmas, which are inspired by the analysis of cross-validation due to Bayle et al. [8]. To simplify notation, we use E_X and Var_X to denote the expectation and variance conditional on everything but X.
Lemma A.1. Suppose that the predictions are stable (Ass. 1). Then the first error term of the cross-prediction estimator vanishes in probability.

Proof. Let ψ(x) = min(1, x). We will use the fact that (1/K) Σ_{j=1}^K F_j →p 0 if and only if E[ψ(|(1/K) Σ_{j=1}^K F_j|)] → 0 (Fact 1). Notice that ψ(x) is also concave; therefore, by Jensen's inequality, we can bound the expectation of interest by a term controlled by the stability condition. Invoking the stability condition shows that the right-hand side converges to zero. Note that, technically, the stability condition is stronger than what is needed for the expression above to converge to zero. In particular, stability ensures that E[K · Var_X(f^(1)(X) − f̄(X))] → 0, while for the expression above to vanish it would suffice that E[Var_X(f^(1)(X) − f̄(X))] → 0. The stronger condition will be used in the next technical lemma, which handles the second term in the cross-prediction estimator.
Putting everything together, we get that (1/K) Σ_{j=1}^K F_j →p 0, as desired.
Lemma A.2. Suppose that the predictions are stable (Ass. 1). Then the second error term of the cross-prediction estimator vanishes in probability.

As before, by Jensen's inequality, the relevant expectation is bounded by a term to which the stability condition applies; invoking the stability condition shows that the right-hand side converges to zero.

With Lemma A.1 and Lemma A.2 in hand, we can now prove Theorem 1. As alluded to earlier, the idea is to use Lemma A.1 and Lemma A.2 to replace the models f^(j) in the definition of the cross-prediction estimator (4) with the average model f̄.

Asymptotic normality of θ^+. We follow an argument similar to the classical proof of asymptotic normality of M-estimators (see Theorem 5.23 in [49]). The differentiability and local Lipschitzness of the loss at θ* imply that the usual quadratic expansion holds for every (possibly random) sequence h_n = O_P(1).
By essentially the same argument as in the proof of the asymptotic normality of ∇L^+(θ), we can substitute the average model f̄ for the models f^(j) via Lemma A.4, thus obtaining the desired limit.

Figure 2 :
Figure 2: Estimating the deforestation rate in the Amazon from satellite imagery. Left: Example intervals constructed by cross-prediction, classical inference, and prediction-powered inference (PPI), for five random splits into labeled and unlabeled data and a fixed number of gold-standard deforestation labels, n = 319. Middle and right: Coverage and interval width averaged over 100 random splits into labeled and unlabeled data, for n ∈ {319, 638, 957}. The target of inference is the fraction of the Amazon rainforest lost between 2000 and 2015 (gray line in left panel). The target coverage is 90% (gray line in middle panel).
In more detail, for each b ∈ {1, 2, ..., B} = [B], we sample n − n/K data points uniformly at random from the labeled dataset, and denote the indices of the samples by I^b_boot. Then, we use the sampled data points to train a model f^(b)_boot using the same model-fitting strategy as for the cross-prediction models f^(j). To estimate Σ_θ, we compute Σ̂_θ = V̂ar(∇ℓ_θ(X̃_i, f̄_boot(X̃_i)), i ∈ [N]), where f̄_boot = (1/B) Σ_b f^(b)_boot averages the bootstrapped models and V̂ar denotes the empirical covariance. To estimate Σ_Δ,θ, we compute Σ̂_Δ,θ as the empirical covariance of the differences ∇ℓ_θ(X_i, f^(b(i))_boot(X_i)) − ∇ℓ_θ(X_i, Y_i) over the labeled points, pairing each point i with a bootstrap model whose training sample excludes it.

Figure 3 :
Figure 3: Mean estimation. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a mean estimation problem.

Figure 4 :
Figure 4: Quantile estimation. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a quantile estimation problem. The target is the 75th percentile.

Figure 5 :
Figure 5: Linear regression. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a linear regression problem.

Figure 6 :
Figure 6: Subset of the variables available in the ACS PUMS data.

Figure 7 :
Figure 7: Coverage and interval width of cross-prediction, classical inference, and prediction-powered inference (PPI) on the ACS PUMS data, together with five example intervals.

Figure 8 :
Figure 8: Example images of a spiral galaxy (left) and a nonspiral galaxy (right).

Figure 10 :
Figure 10: Estimating the deforestation rate in the Amazon from satellite imagery (revisited). Coverage and average interval width of cross-prediction, classical inference, and prediction-powered inference (PPI), as well as two heuristics related to cross-fitting: one that removes the debiasing and one that trains on all labeled data instead of forming folds. The experimental setup is the same as in Figure 2.

Fact 1 .
Let X_n be a sequence of random variables. Then, X_n →p 0 if and only if E[min(1, |X_n|)] → 0.

The proof follows a similar principle as the proof of Lemma A.1. As before, we let ψ(x) = min(1, x) and use the fact that Σ_{j=1}^K F_j →p 0 if and only if E[ψ(|Σ_{j=1}^K F_j|)] → 0 (see Fact 1). We also use the fact that ψ(Σ_{j=1}^K x_j) ≤ Σ_{j=1}^K ψ(x_j) for non-negative x_j; this yields a bound in terms of Σ_{j=1}^K E[ψ(|F_j|)].

Table 1 :
Table 1: Standard deviation of the lower (σ_L) and upper (σ_U) endpoints of the confidence intervals in the mean estimation problem from Figure 3, for n = 100. The minimum value in each column is in bold.

Table 2 :
Table 2: Standard deviation of the lower (σ_L) and upper (σ_U) endpoints of the confidence intervals in the quantile estimation problem from Figure 4, for n = 100. The minimum value in each column is in bold.

Figure 9: Estimating the prevalence of spiral galaxies from galaxy images. Intervals from five randomly chosen trials (left), coverage (middle), and average interval width (right) of cross-prediction, classical inference, and prediction-powered inference (PPI) in a mean estimation problem on galaxy image data. The target θ* is the fraction of spiral galaxies.

Table 4 :
Table 4: Standard deviation of the lower (σ_L) and upper (σ_U) endpoints of the confidence intervals in the problems studied in Figure 2, Figure 7, and Figure 9. For each problem we take n to be the smallest labeled dataset size in the considered range. The minimum value in each column is in bold.