# Lasso adjustments of treatment effect estimates in randomized experiments

Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 1, 2015 (received for review June 3, 2015)

## Abstract

We provide a principled way for investigators to analyze randomized experiments when the number of covariates is large. Investigators often use linear multivariate regression to analyze randomized experiments instead of simply reporting the difference of means between treatment and control groups. Their aim is to reduce the variance of the estimated treatment effect by adjusting for covariates. If there are a large number of covariates relative to the number of observations, regression may perform poorly because of overfitting. In such cases, the least absolute shrinkage and selection operator (Lasso) may be helpful. We study the resulting Lasso-based treatment effect estimator under the Neyman–Rubin model of randomized experiments. We present theoretical conditions that guarantee that the estimator is more efficient than the simple difference-of-means estimator, and we provide a conservative estimator of the asymptotic variance, which can yield tighter confidence intervals than the difference-of-means estimator. Simulation and data examples show that Lasso-based adjustment can be advantageous even when the number of covariates is less than the number of observations. Specifically, a variant using Lasso for selection and ordinary least squares (OLS) for estimation performs particularly well, and it chooses a smoothing parameter based on combined performance of Lasso and OLS.

Randomized experiments are widely used to measure the efficacy of treatments. Randomization ensures that treatment assignment is not influenced by any potential confounding factors, both observed and unobserved. Experiments are particularly useful when there is no rigorous theory of a system’s dynamics, and full identification of confounders would be impossible. This advantage was cast elegantly in mathematical terms in the early 20th century by Jerzy Neyman, who introduced a simple model for randomized experiments, which showed that the difference of average outcomes in the treatment and control groups is statistically unbiased for the average treatment effect (ATE) over the experimental sample (1).

However, no experiment occurs in a vacuum of scientific knowledge. Often, baseline covariate information is collected about individuals in an experiment. Even when treatment assignment is not related to these covariates, analyses of experimental outcomes often take them into account with the goal of improving the accuracy of treatment effect estimates. In modern randomized experiments, the number of covariates can be very large—sometimes even larger than the number of individuals in the study. In clinical trials overseen by regulatory bodies like the Food and Drug Administration and the Medicines and Healthcare products Regulatory Agency, demographic and genetic information may be recorded about each patient. In applications in the tech industry, where randomization is often called A/B testing, there is often a huge amount of behavioral data collected on each user. However, in this “big data” setting, much of these data may be irrelevant to the outcome being studied or there may be more potential covariates than observations, especially once interactions are taken into account. In these cases, selection of important covariates or some form of regularization is necessary for effective regression adjustment.

To ground our discussion, we examine a randomized trial of the pulmonary artery catheter (PAC) that was carried out in 65 intensive care units in the United Kingdom between 2001 and 2004, called PAC-man (2). The PAC is a monitoring device commonly inserted into critically ill patients after admission to intensive care, and it provides a continuous measurement of several indicators of cardiac activity. However, insertion of PAC is an invasive procedure that carries some risk of complications (including death), and it involves significant expenditure both in equipment costs and personnel (3). Controversy over its use came to a head when an observational study found that PAC had an adverse effect on patient survival and led to increased cost of care (4). This led to several large-scale randomized trials, including PAC-man.

In the PAC-man trial, randomization of treatment was largely successful, and a number of covariates were measured about each patient in the study. If covariate interactions are included, the number of covariates exceeds the number of individuals in the study; however, few of them are predictive of the patient’s outcome. As it turned out, the (pretreatment) estimated probability of death was imbalanced between the treatment and control groups (*P* = 0.005, Wilcoxon rank sum test). Because the control group had, on average, a slightly higher risk of death, the unadjusted difference-in-means estimator may overestimate the benefits of receiving a PAC. Adjustment for this imbalance seems advantageous in this case, because the pretreatment probability of death is clearly predictive of health outcomes posttreatment.

In this paper, we study regression-based adjustment, using the least absolute shrinkage and selection operator (Lasso) to select relevant covariates. Standard linear regression based on ordinary least squares (OLS) suffers from overfitting if a large number of covariates and interaction terms are included in the model. In such cases, researchers sometimes perform model selection by observing which covariates are unbalanced given the realized randomization; this generally leads to misleading inferences because of incorrect test levels (5). The Lasso (6) provides researchers with an alternative that can mitigate these problems while still performing model selection. The resulting estimator is defined in *Treatment Effect Estimation* below.

In the theoretical analysis in this paper, instead of assuming that the standard linear model is the true data-generating mechanism, we work under the aforementioned nonparametric model of randomization introduced by Neyman (1) and popularized by Donald Rubin (9). In this model, the outcomes and covariates are fixed quantities, and the treatment group is assumed to be sampled without replacement from a finite population. The treatment indicator, rather than an error term, is the source of randomness, and it determines which of two potential outcomes is revealed to the experimenter. Unlike the standard linear model, the Neyman–Rubin model makes few assumptions not guaranteed by the randomization itself. The setup of the model does rely on the stable unit treatment value assumption, which states that there is only one version of treatment, and that the potential outcome of one unit should be unaffected by the particular assignment of treatments to the other units; however, it makes no assumptions of linearity or exogeneity of error terms. OLS (7, 10, 11), logistic regression (12), and poststratification (13) are among the adjustment methods that have been studied under this model.

To be useful to practitioners, the Lasso-based treatment effect estimator must be consistent and yield a method to construct valid confidence intervals. We outline conditions on the covariates and potential outcomes that will guarantee these properties. We show that an upper bound for the asymptotic variance can be estimated from the model residuals, yielding asymptotically conservative confidence intervals for the ATE, which can be substantially narrower than the unadjusted confidence intervals. Simulation studies are provided to show the advantage of the Lasso-adjusted estimator and to show situations where it breaks down. We apply the estimator to the PAC-man data, and compare the estimates and confidence intervals derived from the unadjusted, OLS-adjusted, and Lasso-adjusted methods. We also compare different methods of selecting the Lasso tuning parameter on these data.

## Framework and Definitions

We give a brief outline of the Neyman–Rubin model for a randomized experiment; the reader is urged to consult refs. 1, 9, and 14 for more details. We follow the notation introduced in refs. 7 and 10. For concreteness, we illustrate the model in the context of the PAC-man trial.

For each individual in the study, the model assumes that there exists a pair of quantities representing his/her health outcomes under the possibilities of receiving and not receiving the catheter. These are called the potential outcomes under treatment and control, and are denoted $a_i$ and $b_i$, respectively. The treatment effect for individual $i$ is defined, in theory, to be $a_i - b_i$.

In the mathematical specification of this model, we consider the potential outcomes to be fixed, nonrandom quantities, even though they are not all observable. The only randomness in the model comes from the assignment of treatment, which is controlled by the experimenter. We define random treatment indicators $T_i$, with $T_i = 1$ if individual $i$ is assigned to treatment and $T_i = 0$ otherwise; the observed outcome for individual $i$ is then $T_i a_i + (1 - T_i) b_i$.
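A minimal simulation makes this sampling framework concrete (all numbers below are illustrative, not from the PAC-man data): the potential outcomes are held fixed, and only the assignment vector, drawn without replacement, is random. Averaging the difference-in-means over many random assignments recovers the ATE, illustrating Neyman's unbiasedness result.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_treated = 100, 50

# Fixed potential outcomes for every unit (never both observed in practice).
a = rng.normal(1.0, 1.0, size=n)   # outcome if treated
b = rng.normal(0.0, 1.0, size=n)   # outcome if untreated
true_ate = np.mean(a - b)

def difference_in_means(a, b, treated):
    """Observed outcome is a[i] if treated, b[i] otherwise."""
    y = np.where(treated, a, b)
    return y[treated].mean() - y[~treated].mean()

# Randomize many times: only the assignment vector changes across draws.
estimates = []
for _ in range(2000):
    idx = rng.choice(n, size=n_treated, replace=False)  # sample w/o replacement
    treated = np.zeros(n, dtype=bool)
    treated[idx] = True
    estimates.append(difference_in_means(a, b, treated))

print(abs(np.mean(estimates) - true_ate))  # small: unbiasedness over assignments
```

Note that `a` and `b` are generated once and then treated as fixed; the only random quantity inside the loop is the assignment, exactly as in the Neyman–Rubin model.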

Note that the model does not incorporate any covariate information about the individuals in the study, such as physiological characteristics or health history. However, we will assume we have measured a vector of baseline, preexperimental covariates for each individual. These might include, for example, age, gender, and genetic makeup. We denote the covariates for individual $i$ as the column vector $\mathbf{x}_i \in \mathbb{R}^p$. In *Theoretical Results*, we will assume that there is a correlational relationship between an individual’s potential outcomes and covariates, but we will not assume a generative statistical model.

Define the set of treated individuals as $A$ and the set of control individuals as $B$, and the number of treated and control individuals as $n_A$ and $n_B$. We use the subscript $A$ or $B$ to label averages over the treatment or control group. Thus, for example, the average values of the potential outcomes and the covariates in the treatment group are $\bar{a}_A = \frac{1}{n_A}\sum_{i \in A} a_i$ and $\bar{\mathbf{x}}_A = \frac{1}{n_A}\sum_{i \in A} \mathbf{x}_i$. Note that the set $A$ is determined by the random treatment assignment. Averages over the whole population are denoted $\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i$ and $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$.

## Treatment Effect Estimation

Our main inferential goal will be the average effect of the treatment over the whole population in the study. In a trial such as PAC-man, this represents the difference between the average outcome if everyone had received the catheter and the average outcome if no one had received it. This is defined as follows:

$$\text{ATE} = \bar{a} - \bar{b} = \frac{1}{n}\sum_{i=1}^{n}(a_i - b_i).$$

Although the ATE is not directly observable, because each individual reveals only one potential outcome, the simple difference-in-means estimator $\widehat{\text{ATE}}_{\text{unadj}} = \bar{a}_A - \bar{b}_B$ is unbiased for it under randomization (1).

In practice, the “ideal” linear adjustment vectors, those leading to a minimum-variance estimator of this form, are unknown and must be estimated from the data, e.g., by OLS. Ref. 7 showed that, for fixed *p*, under regularity conditions, the inclusion of treatment-by-covariate interaction terms guarantees that the OLS-adjusted estimator never has higher asymptotic variance than the unadjusted estimator, and asymptotically conservative confidence intervals for the true parameter can be constructed.
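As a concrete sketch of this interacted OLS adjustment (our own illustrative code, not the paper's implementation; the function and variable names are ours), one can regress the observed outcome on an intercept, the treatment indicator, centered covariates, and treatment-by-covariate interactions; the coefficient on the treatment indicator is then the adjusted ATE estimate.

```python
import numpy as np

def ols_interact_ate(y, t, x):
    """ATE estimate via OLS with full treatment-by-covariate interactions.

    With covariates centered at the full-sample mean, this fit is
    equivalent to running separate OLS regressions within each arm and
    differencing their predictions at the mean covariate value; that
    difference is the coefficient on the treatment indicator.
    """
    xc = x - x.mean(axis=0)                       # center covariates
    design = np.column_stack([np.ones(len(y)), t, xc, t[:, None] * xc])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]                                # coefficient on treatment
```

Centering the covariates is essential here: without it, the treatment coefficient would estimate the effect at $\mathbf{x} = 0$ rather than at the population mean.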

In modern randomized trials, where a large number of covariates are recorded for each individual, *p* may be comparable to or even larger than *n*. In this case, OLS regression can overfit the data badly, or may even be ill posed, leading to estimators with large finite-sample variance. To remedy this, we propose estimating the adjustment vectors within each arm using the Lasso (6), which penalizes the least-squares objective by the $\ell_1$ norm of the coefficients; e.g., for the treatment group, $\hat{\beta}^{(a)} = \arg\min_{\beta} \frac{1}{2 n_A} \sum_{i \in A} \big( a_i - \bar{a}_A - (\mathbf{x}_i - \bar{\mathbf{x}}_A)^\top \beta \big)^2 + \lambda_A \|\beta\|_1$, with $\hat{\beta}^{(b)}$ defined analogously for the control group. Here, $\lambda_A$ and $\lambda_B$ are tuning parameters controlling the amount of shrinkage.
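One natural form of the resulting Lasso-adjusted estimator corrects each arm's mean outcome for chance covariate imbalance using arm-specific Lasso coefficients. The sketch below is ours (scikit-learn's `Lasso` stands in for glmnet, and `lam_a`/`lam_b` are assumed fixed tuning parameters); the paper's exact estimator and tuning procedure may differ in details.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_adjusted_ate(y, t, x, lam_a=0.1, lam_b=0.1):
    """Sketch of a Lasso-adjusted ATE estimate.

    Each arm mean is corrected for the deviation of that arm's covariate
    average from the full-sample covariate average, using coefficients
    from an arm-specific Lasso fit. lam_a/lam_b would be chosen by
    cross-validation in practice.
    """
    xbar = x.mean(axis=0)                      # full-sample covariate mean
    t = t.astype(bool)
    beta = {}
    for arm, mask, lam in (("a", t, lam_a), ("b", ~t, lam_b)):
        model = Lasso(alpha=lam)               # l1-penalized regression
        model.fit(x[mask] - xbar, y[mask])
        beta[arm] = model.coef_
    adj_a = y[t].mean() - (x[t].mean(axis=0) - xbar) @ beta["a"]
    adj_b = y[~t].mean() - (x[~t].mean(axis=0) - xbar) @ beta["b"]
    return adj_a - adj_b
```

When both coefficient vectors are zero (very large penalties), this reduces exactly to the unadjusted difference-in-means estimator.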

It is worth noting that, when two different adjustments are made for the treatment and control groups as in ref. 7 and here, the covariates do not have to be the same for the two groups. However, when they are not the same, the Lasso- or OLS-adjusted estimators are no longer guaranteed to have smaller or equal asymptotic variance than the unadjusted one, even in the case of fixed *p*. In practice, one may still choose between the adjusted and unadjusted estimators based on the widths of the corresponding confidence intervals.

## Theoretical Results

### Notation.

For a vector $v$, let $v_j$ denote its $j$th component. For an index set $S$, let $|S|$ denote its cardinality and $v_S$ the subvector of $v$ with components indexed by $S$. For any column vector $D = (D_1, \dots, D_n)^\top$ of unit-level quantities, let $\bar{D}_A$ and $\bar{D}_B$ denote its averages over the treatment and control groups, respectively, and $\bar{D}$ its average over the whole population.

### Decomposition of the Potential Outcomes.

The Neyman–Rubin model does not assume a linear relationship between the potential outcomes and the covariates. To study the properties of adjustment under this model, we decompose the potential outcomes into a term linear in the covariates and an error term. Given vectors of coefficients $\beta^{(a)}, \beta^{(b)} \in \mathbb{R}^p$, define

$$a_i = \bar{a} + (\mathbf{x}_i - \bar{\mathbf{x}})^\top \beta^{(a)} + e_i^{(a)}, \qquad [\mathbf{3}]$$

$$b_i = \bar{b} + (\mathbf{x}_i - \bar{\mathbf{x}})^\top \beta^{(b)} + e_i^{(b)}. \qquad [\mathbf{4}]$$

Note that we have not added any assumptions to the model; we have simply defined unit-level residuals. Because the potential outcomes and covariates are fixed, the residuals in [**3**] and [**4**] are fixed, deterministic numbers. It is easy to verify that the residuals average to zero over the full population.
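A quick numerical check (with arbitrary made-up numbers, purely for illustration) confirms that this decomposition adds no assumptions: for any choice of coefficient vector whatsoever, residuals defined this way average to zero over the population by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
x = rng.normal(size=(n, p))            # fixed covariates
a = rng.normal(size=n)                 # any fixed potential outcomes

beta = rng.normal(size=p)              # ANY coefficient vector works here
# Residuals of the decomposition a_i = a_bar + (x_i - x_bar)' beta + e_i
e = a - a.mean() - (x - x.mean(axis=0)) @ beta

print(np.isclose(e.mean(), 0.0))       # True: centered by construction
```

The population mean of the linear term vanishes because the covariates are centered, so the residual mean is forced to zero regardless of `beta`; the choice of coefficients affects only how much variance the linear term explains.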

### Conditions.

We will need the following conditions to hold for both the treatment and control potential outcomes. The first set of assumptions (*Conditions 1*–*3*) is similar to those found in ref. 7.

##### Condition 1:

Stability of treatment assignment probability.

##### Condition 2:

The centered moment conditions. There exists a fixed constant

##### Condition 3:

The means

Because we consider the high-dimensional setting where *p* is allowed to be much larger than *n*, we need additional assumptions to ensure that the Lasso is consistent for estimating the adjustment vectors $\beta^{(a)}$ and $\beta^{(b)}$.

##### Definition 1:

Given *n*, although the notation does not explicitly show this.

##### Definition 2:

Define

##### Condition 4:

Decay and scaling. Let

##### Condition 5:

Cone invertibility factor. Define the Gram matrix as *n*, such that

##### Condition 6:

Let

### Theorem 1.

*Assume* *Conditions 1*–*6* *hold*. *Then* the Lasso-adjusted estimator is consistent and asymptotically normal; the precise statement and the proof of *Theorem 1* are given in *SI Appendix*. It is easy to show, as in the following corollary of *Theorem 1*, that the asymptotic variance can be expressed in terms of the coefficients of the least-squares regressions of *a* and *b* on the covariates in the subset *J*, with intercept, respectively.

### Corollary 1.

*For the adjustment vectors defined in* [**18**], *and under* *Conditions 1*–*6*, *the asymptotic variance of the Lasso-adjusted estimator is no greater than that of the unadjusted difference-in-means estimator*.

##### Remark 1:

If, instead of *Condition 6*, we assume that the covariates are uniformly bounded, then the moment condition in [**7**] can be weakened to a second moment condition. Although we do not prove the necessity of any of our conditions, our simulation studies show that the distributions of the unadjusted and the Lasso-adjusted estimators may be nonnormal when (*i*) the covariates are generated from Gaussian distributions but the error terms do not satisfy the second moment condition, e.g., when they are generated from a *t* distribution with one degree of freedom; or (*ii*) the covariates do not have bounded fourth moments, e.g., when they are generated from a *t* distribution with three degrees of freedom. See the histograms in Fig. 1; in both cases, the corresponding *p* values of Kolmogorov–Smirnov tests for normality are below conventional significance levels.

##### Remark 2:

Statement [**11**], typically required in debiasing the Lasso (15), is stronger by a factor of

##### Remark 3:

*Condition 5* is slightly weaker than the typical restricted eigenvalue condition for analyzing the Lasso.

##### Remark 4:

If we assume **10**], then *Condition 6* requires that the tuning parameters are proportional to

##### Remark 5:

For fixed *p* and tuning parameters set as in [**9**], *Condition 4* holds automatically, and *Condition 5* holds when the smallest eigenvalue of the Gram matrix is bounded away from zero; *Corollary 1* then reverts to corollary 1.1 in ref. 7.

### Neyman-Type Conservative Variance Estimate.

We note that the asymptotic variance in *Theorem 1* involves a cross-product of the treatment and control error terms, which cannot be estimated consistently because the two potential outcomes are never jointly observed for the same individual.

We will show in *SI Appendix*, *Theorem S1*, that the limit of a Neyman-type variance estimate, computed from the within-group residuals, is an upper bound on the true asymptotic variance, so that the resulting confidence intervals are asymptotically conservative.
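The classic Neyman-style construction behind this can be sketched as follows (our own illustrative code with hypothetical names; the paper's actual estimator, given in *SI Appendix*, applies the same idea to the residuals of the Lasso fits): sum the scaled within-arm residual variances, omitting the term that involves both potential outcomes, which yields an interval that is wide enough asymptotically.

```python
import numpy as np
from scipy import stats

def neyman_conservative_ci(resid_a, resid_b, point_estimate, level=0.95):
    """Conservative normal-approximation CI for a treatment effect.

    The variance term that involves both potential outcomes jointly is
    not estimable (they are never observed together), so it is bounded
    rather than estimated; the resulting variance estimate is biased
    upward, making the interval asymptotically conservative.
    """
    n_a, n_b = len(resid_a), len(resid_b)
    var_hat = resid_a.var(ddof=1) / n_a + resid_b.var(ddof=1) / n_b
    half = stats.norm.ppf(0.5 + level / 2) * np.sqrt(var_hat)
    return point_estimate - half, point_estimate + half
```

For the unadjusted estimator, `resid_a` and `resid_b` would simply be the centered outcomes within each arm; for an adjusted estimator, they are the residuals after the covariate adjustment, which is why adjustment can shorten the interval.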

### Related Work.

The Lasso has already made several appearances in the literature on treatment effect estimation. In the context of observational studies, ref. 15 constructs confidence intervals for preconceived effects or their contrasts by debiasing the Lasso-adjusted regression, ref. 16 employs the Lasso as a formal method for selecting adjustment variables via a two-stage procedure that concatenates features from models for treatment and outcome, and similarly, ref. 17 gives very general results for estimating a wide range of treatment effect parameters, including the case of instrumental variables estimation. In addition to the Lasso, ref. 18 considers nonparametric adjustments in the estimation of ATE. In works such as these, which deal with observational studies, confounding is the major issue. With confounding, the naive difference-in-means estimator is biased for the true treatment effect, and adjustment is used to form an unbiased estimator. However, in our work, which focuses on a randomized trial, the difference-in-means estimator is already unbiased; adjustment reduces the variance while, in fact, introducing a small amount of finite-sample bias. Another major difference between this prior work and ours is the sampling framework: we operate within the Neyman–Rubin model with fixed potential outcomes for a finite population, where the treatment group is sampled without replacement, whereas these papers assume independent sampling from a probability distribution with random error terms.

Our work is related to the estimation of heterogeneous or subgroup-specific treatment effects, including interaction terms to allow the imputed individual-level treatment effects to vary according to some linear combination of covariates. This is pursued in the high-dimensional setting in ref. 19; this work advocates solving the Lasso on a reduced set of modified covariates, rather than the full set of covariate by treatment interactions, and includes extensions to binary outcomes and survival data. The recent work in ref. 20 considers the problem of designing multiple-testing procedures for detecting subgroup-specific treatment effects; they pose this as an optimization over testing procedures where constraints are added to enforce guarantees on type I error rate and power to detect effects. Again, the sampling framework in these works is distinct from ours; they do not use the Neyman–Rubin model as a basis for designing the methods or investigating their properties.

### PAC Data Illustration and Simulations.

We now return to the PAC-man study introduced earlier. We examine the data in more detail and explore the results of several adjustment procedures. There were 1,013 patients in the PAC-man study: 506 treated (managed with PAC) and 507 control (managed without PAC, but retaining the option of using alternative devices). The outcome variable is quality-adjusted life years (QALYs). One QALY represents 1 year of life in full health; in-hospital death corresponds to a QALY of zero. We have 59 covariates for each individual in the study; including all main effects as well as 1,113 two-way interactions gives a design matrix with 1,172 columns. See *SI Appendix* for more details on the design matrix.
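A design matrix of this main-effects-plus-interactions form can be sketched with scikit-learn (illustrative random data; note that expanding all pairwise products of 59 covariates would give 59 + 1,711 = 1,770 columns, so the paper's 1,113 retained interactions evidently reflect additional preprocessing described in *SI Appendix*):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.normal(size=(1013, 59))          # placeholder for the 59 covariates

# interaction_only=True: main effects plus pairwise products, no squares.
expander = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False)
design = expander.fit_transform(x)
print(design.shape)                      # (1013, 1770): 59 + C(59, 2) columns
```

With dummy-coded categorical covariates, many of these product columns are constant or duplicated and would be dropped, which is one plausible reason the paper's interaction count is smaller than the full 1,711.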

The assumptions that underpin our theoretical guarantees cannot be verified exactly on the data (*SI Appendix*, Fig. S9). The covariates with the largest two fourth moments (37.3 and 34.9, respectively) are quadratic terms. The active set *S* is unknown, and moreover, even if it were known, calculating the cone invertibility factor would involve an infeasible optimization; this is a general issue in the theory of sparse linear high-dimensional estimation. To approximate these conditions, we use the bootstrap to estimate the active set of covariates *S* and the error terms; see *SI Appendix* for more details. Our estimated *S* contains 16 covariates. To check *Condition 5*, we examine the largest and smallest eigenvalues of the sub-Gram matrix restricted to the estimated active set; the quantity relevant to *Condition 5* seems reasonably bounded away from zero.

We now estimate the ATE using the unadjusted estimator, the Lasso-adjusted estimator, and the OLS-adjusted estimator, which is computed on a subdesign matrix containing only the 59 main effects. We also present results for the two-step Lasso+OLS estimator, which uses the Lasso for covariate selection and OLS for coefficient estimation. In *SI Appendix*, *Algorithm 1*, we show how we adapt the cross-validation (CV) procedure to select the tuning parameter for Lasso+OLS.

We use the R package “glmnet” to compute the Lasso solution path and select the tuning parameters by CV (*SI Appendix*, *Algorithm 1*). The cv(Lasso+OLS) and cv(Lasso) procedures may therefore select different covariates for the adjustment. This type of CV requires more computation than CV based on just the Lasso estimator, because it must compute the OLS estimator for each fold and each candidate tuning parameter.
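The idea behind this CV variant can be sketched as follows (our own code using scikit-learn rather than glmnet, with hypothetical names; *SI Appendix*, *Algorithm 1* gives the authors' actual procedure): the CV error assigned to each candidate penalty is that of the Lasso-select-then-OLS-refit pipeline, not of the Lasso fit itself, so the chosen penalty reflects the combined performance of both steps.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold

def cv_lasso_ols_lambda(x, y, lambdas, n_folds=10, seed=0):
    """Pick a penalty by CV error of Lasso selection followed by OLS refit."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    cv_err = np.zeros(len(lambdas))
    for train, test in kf.split(x):
        for j, lam in enumerate(lambdas):
            # Step 1: Lasso on the training fold, used only for selection.
            support = Lasso(alpha=lam).fit(x[train], y[train]).coef_ != 0
            if support.any():
                # Step 2: OLS refit on the selected covariates.
                ols = LinearRegression().fit(x[train][:, support], y[train])
                pred = ols.predict(x[test][:, support])
            else:                                 # empty model: intercept only
                pred = np.full(len(test), y[train].mean())
            cv_err[j] += np.mean((y[test] - pred) ** 2)
    return lambdas[int(np.argmin(cv_err))]
```

Because the OLS refit undoes the Lasso's shrinkage, this criterion typically tolerates larger penalties, and hence smaller selected sets, than CV on the Lasso fit alone.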

Fig. 2 presents the ATE estimates along with the corresponding confidence intervals for each method.

However, it is interesting to note that, compared with the unadjusted estimator, the OLS-adjusted estimator decreases the ATE estimate (from −0.13 to −0.31) and shortens the confidence interval (*SI Appendix*, Fig. S8). We also note that these adjustments agree with the one performed in ref. 13, where the treatment effect was likewise adjusted downward.

The covariates selected by the Lasso for adjustment are shown in Table 1.

Because not all of the potential outcomes are observed, we cannot know the true gains of adjustment methods. However, we can estimate the gains by building a simulated set of potential outcomes, matching treated units to control units on observed covariates. We use the matching method described in ref. 21, which gives 1,013 observations with all potential outcomes imputed; we match on the 59 main effects only. In the constructed population, the true ATE is known, so the bias and variance of each estimator can be computed.

*SI Appendix*, Table S5, shows the results. For all of the methods, the bias is substantially smaller (by a factor of 100) than the SD. *SI Appendix*, Fig. S10, shows that the sampling distribution of the estimates is very close to normal.

We conduct additional simulation studies to evaluate the finite-sample performance of the estimators; details and results are given in *SI Appendix*.

## Discussion

We study the Lasso-adjusted ATE estimate under the Neyman–Rubin model for randomization. Our purpose in using the Neyman–Rubin model is to investigate the performance of the Lasso under a realistic sampling framework that does not impose strong assumptions on the data. We provide conditions that ensure asymptotic normality, and provide a Neyman-type estimate of the asymptotic variance that can be used to construct a conservative confidence interval for the ATE. Although we do not require an explicit generative linear model to hold, our theoretical analysis requires the existence of latent “adjustment vectors” such that moment conditions of the error terms are satisfied, and that the cone invertibility condition of the sample covariance matrix is satisfied in addition to moment conditions for OLS adjustment as in ref. 7. Both assumptions are difficult to check in practice. In our theory, we do not address whether these assumptions are necessary for our results to hold, although simulations indicate that the moment conditions cannot be substantially weakened. As a by-product of our analysis, we extend Massart’s concentration inequality for sampling without replacement, which is useful for theoretical analysis under the Neyman–Rubin model. Simulation studies and the real-data illustration show the advantage of the Lasso-adjusted estimator in terms of estimation accuracy and model interpretation. In practice, we recommend a variant of Lasso, cv(Lasso+OLS), to select covariates and perform the adjustment, because it gives similar coverage probability and confidence interval length compared with cv(Lasso), but with far fewer covariates selected. In future work, we plan to extend our analysis to other popular methods in high-dimensional statistics such as Elastic-Net and ridge regression, which may be more appropriate for estimating adjusted ATE under different assumptions.

The main goal of using the Lasso in this paper is to reduce the variance (and overall mean squared error) of ATE estimation. Another important task is to estimate heterogeneous treatment effects and provide conditional treatment effect estimates for subpopulations. When the Lasso models of treatment and control outcomes differ, both in variables selected and in coefficient values, this could be interpreted as modeling treatment effect heterogeneity in terms of covariates. However, reducing the variance of the ATE estimate and estimating heterogeneous treatment effects have completely different targets, and targeting heterogeneous treatment effects may result in more variable ATE estimates. Moreover, our simulations show that the set of covariates selected by the Lasso is unstable, which may cause problems when interpreting them as evidence of heterogeneous treatment effects. How best to estimate such effects is an open question that we would like to study in future research.

## Materials and Methods

We did not conduct the PAC-man experiment, and we are analyzing secondary data without any personal identifying information. As such, this study is exempt from human subjects review. The original experiments underwent human subjects review in the United Kingdom (2).

## Acknowledgments

We thank David Goldberg for helpful discussions, Rebecca Barter for copyediting and suggestions for clarifying the text, and Winston Lin for comments. We thank Richard Grieve [London School of Hygiene and Tropical Medicine (LSHTM)], Sheila Harvey (LSHTM), David Harrison [Intensive Care National Audit and Research Centre (ICNARC)], and Kathy Rowan (ICNARC) for access to data from the PAC-Man Cost Effectiveness Analysis and the ICNARC Case Mix Programme database. This research was partially supported by NSF Grants DMS-11-06753, DMS-12-09014, DMS-1107000, DMS-1129626, DMS-1209014, Computational and Data-Enabled Science and Engineering in Mathematical and Statistical Sciences 1228246, DMS-1160319 (Focused Research Group); AFOSR Grant FA9550-14-1-0016; NSA Grant H98230-15-1-0040; the Center for Science of Information, a US NSF Science and Technology Center, under Grant Agreement CCF-0939370; Department of Defense for Office of Naval Research Grant N00014-15-1-2367; and the National Defense Science and Engineering Graduate Fellowship Program.

## Footnotes

^{1}A.B. and H.L. contributed equally to this work.

^{2}To whom correspondence should be addressed. Email: binyu{at}stat.berkeley.edu.

Author contributions: A.B., H.L., C.-H.Z., J.S.S., and B.Y. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Drawing Causal Inference from Big Data,” held March 26–27, 2015, at the National Academies of Sciences in Washington, DC. The complete program and video recordings of most presentations are available on the NAS website at www.nasonline.org/Big-data.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1510506113/-/DCSupplemental.

## References

1. Splawa-Neyman J, Dabrowska DM, Speed TP (1990) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. *Stat Sci* 5(4):465–472.
2–5. [Entries not recoverable from the source.]
6. Tibshirani R (1996) Regression shrinkage and selection via the lasso. *J R Stat Soc Ser B* 58(1):267–288.
7. [Entry not recoverable from the source.]
8. Bühlmann P, van de Geer S (2011) *Statistics for High-Dimensional Data: Methods, Theory and Applications* (Springer, Heidelberg).
9–15. [Entries not recoverable from the source.]
16. Belloni A, Chernozhukov V, Hansen C (2014) Inference on treatment effects after selection among high-dimensional controls. *Rev Econ Stud* 81(2):608–650.
17. Belloni A, Chernozhukov V, Fernández-Val I, Hansen C. [Remaining details not recoverable from the source.]
18. [Entry not recoverable from the source.]
19. Tian L, Alizadeh A, Gentles A, Tibshirani R (2014) A simple method for estimating interactions between a treatment and a large number of covariates. *J Am Stat Assoc* 109(508):1517–1532.
20–21. [Entries not recoverable from the source.]