Sources of selection bias in evaluating social programs: An interpretation of conventional measures and evidence on the effectiveness of matching as a program evaluation method
This paper decomposes the conventional measure of selection bias in observational studies into three components. The first two components are due to differences in the distributions of characteristics between participant and nonparticipant (comparison) group members: the first arises from differences in the supports, and the second from differences in densities over the region of common support. The third component arises from selection bias precisely defined. Using data from a recent social experiment, we find that the component due to selection bias, precisely defined, is smaller than the first two components. However, selection bias still represents a substantial fraction of the experimental impact estimate. The empirical performance of matching methods of program evaluation is also examined. We find that matching based on the propensity score eliminates some but not all of the measured selection bias, with the remaining bias still a substantial fraction of the estimated impact. We find that the support of the distribution of propensity scores for the comparison group is typically only a small portion of the support for the participant group. For values outside the common support, it is impossible to reliably estimate the effect of program participation using matching methods. If the impact of participation depends on the propensity score, as we find in our data, the failure of the common support condition severely limits matching compared with random assignment as an evaluation estimator.
This paper uses data from a large-scale social experiment conducted on a prototypical job training program to decompose conventional measures of selection bias into a component corresponding to selection bias, precisely defined, and into components arising from failure of a common support condition and failure to weight the data appropriately. We demonstrate that a substantial fraction of the conventional measure of selection bias is not due to selection, precisely defined, and we conjecture that this is a general finding. We find that the conventional measure of selection bias is misleading. We also provide mixed evidence on the effectiveness of the matching methods widely used for evaluating programs. The selection bias remaining after matching is a substantial percentage—often over 100%—of the experimentally estimated impact of program participation.
Our analysis is based on the Roy (1) model of potential outcomes, which is identical to the Fisher (2) model for experiments and to the switching regression model of Quandt (3). This class of models has been popularized (and renamed) in statistics as the “Rubin” (4) model. In this model, there are two potential outcomes (Y_{0}, Y_{1}), where Y_{0} corresponds to the no-treatment state and Y_{1} corresponds to the treatment state. The indicator D equals 1 if a person participates in a program, and equals 0 otherwise. The probability that D = 1 given X, Pr(D = 1 | X), is sometimes called the propensity score in statistics [see Rosenbaum and Rubin (5)].
The parameter of interest considered in this paper is the mean effect of treatment on the treated. It is not always the parameter of interest in evaluating social programs [see Heckman and Robb (6), Heckman (7), Heckman and Smith (8) and Heckman et al. (9)], but it is commonly used. It gives the expected gain from treatment for those who receive it. For covariate vector X, it is defined as

Δ(X) = E(Y_{1} − Y_{0} | X, D = 1) = E(Y_{1} | X, D = 1) − E(Y_{0} | X, D = 1). [1]

Sometimes interest focuses on the average impact for X in some region K, e.g.,

Δ_{K} = ∫_{K} Δ(X) dF(X | D = 1) / ∫_{K} dF(X | D = 1),

where F(X | D = 1) is the distribution of X conditional on D = 1. The term E(Y_{1} | X, D = 1) in the definition of Δ(X) can be identified and consistently estimated from data on program participants. Missing from ordinary observational studies are the data required to estimate the counterfactual term E(Y_{0} | X, D = 1).
Many methods exist for constructing this counterfactual or an averaged version of it [see Heckman and Robb (6)]. One common method uses the outcomes of nonparticipants, E(Y_{0} | X, D = 0), to proxy for the outcomes that participants would have experienced had they not participated. The selection bias B(X) that results from using this proxy is defined as

B(X) = E(Y_{0} | X, D = 1) − E(Y_{0} | X, D = 0). [2]

We have data from a social experiment in which some persons are randomly denied treatment. Let R = 1 for persons randomized into the experimental treatment group and R = 0 for persons randomized into the experimental control group. Randomization is conditional on D = 1, where D = 1 now indicates that the person would have participated in the absence of random assignment. Assuming no randomization bias, as defined in Heckman (7) or Heckman and Smith (8), one can use the experimental control group to consistently estimate E(Y_{0} | X, D = 1, R = 0) = E(Y_{0} | X, D = 1) under standard conditions. In this paper, we use data on experimental controls and on a companion sample of eligible nonparticipants (persons for whom D = 0) to estimate B(X) in order to understand the sources of bias that arise in nonexperimental evaluation studies.
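As a minimal numerical sketch of B(X) for a discrete covariate (hypothetical data; `selection_bias_by_cell` is our illustrative name, not a procedure from the paper), the cell-by-cell difference in mean no-treatment outcomes can be computed as:

```python
import numpy as np

def selection_bias_by_cell(y0, d, x):
    """Estimate B(x) = E(Y0 | X = x, D = 1) - E(Y0 | X = x, D = 0)
    by cell means, for x values in the common support only."""
    bias = {}
    for cell in np.intersect1d(x[d == 1], x[d == 0]):  # common support of X
        bias[int(cell)] = float(y0[(d == 1) & (x == cell)].mean()
                                - y0[(d == 0) & (x == cell)].mean())
    return bias

# Experimental controls (D = 1, randomized out of treatment) and
# eligible nonparticipants (D = 0) both reveal Y0.
y0 = np.array([3.0, 5.0, 1.0, 2.0, 10.0])
d = np.array([1, 1, 0, 0, 1])
x = np.array([0, 0, 0, 0, 1])
# The cell x = 1 has no D = 0 counterpart, so it is excluded.
cell_bias = selection_bias_by_cell(y0, d, x)
```

Cells outside the common support are dropped rather than extrapolated, anticipating the support issues discussed below.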
The selection bias measure B(X) is rigorously defined only over the set of X values common to the D = 1 and D = 0 populations. Heckman and colleagues (10) report that for the data analyzed in this paper

S_{1X} ≠ S_{0X},

where S_{1X} is the support of X given D = 1 and S_{0X} is the support of X given D = 0. Unequal supports are also found for a particular scalar measure of X, P(X) = Pr(D = 1 | X), which plays an important role in many evaluation methods. We find that

S_{1P} ≠ S_{0P},

where S_{1P} and S_{0P} are the supports of P(X) given D = 1 and D = 0, respectively. Using the X distribution of participants, we define the mean selection bias B̄_{S_X} as

B̄_{S_X} = ∫_{S_X} B(X) dF(X | D = 1) / ∫_{S_X} dF(X | D = 1),

where S_{X} = S_{1X} ∩ S_{0X}, the set of X in the common support.
Decomposing the Conventional Measure of Bias
The conventional measure of selection bias B used, e.g., in LaLonde (11), does not condition on X and is defined as B = E(Y_{0} | D = 1) − E(Y_{0} | D = 0). It can be decomposed into a portion corresponding to a properly weighted average of B(X) and two other components. First note that

B = ∫_{S_{1X}} E(Y_{0} | X, D = 1) dF(X | D = 1) − ∫_{S_{0X}} E(Y_{0} | X, D = 0) dF(X | D = 0).

Further decomposition yields

B = B_{1} + B_{2} + B_{3}, [3]

where

B_{1} = ∫_{S_{1X}∖S_{X}} E(Y_{0} | X, D = 1) dF(X | D = 1) − ∫_{S_{0X}∖S_{X}} E(Y_{0} | X, D = 0) dF(X | D = 0),

B_{2} = ∫_{S_X} E(Y_{0} | X, D = 0) [dF(X | D = 1) − dF(X | D = 0)],

B_{3} = P_{X}B̄_{S_X},

and where P_{X} = ∫_{S_X} dF(X | D = 1) is the proportion of the density of X given D = 1 in the overlap set S_{X}, S_{1X}∖S_{X} is the support of X given D = 1 that is not in the overlap set S_{X}, and S_{0X}∖S_{X} is the support of X given D = 0 that is not in the overlap set S_{X}.
Term B_{1} in Eq. 3 does not arise from selection bias precisely defined but rather from the failure to find counterparts to E(Y_{0} | D = 1, X) in the set S_{0X}∖S_{X} and the failure to find counterparts to E(Y_{0} | D = 0, X) in the set S_{1X}∖S_{X}. Term B_{2} arises from the differential weighting of E(Y_{0} | D = 0, X) by the densities for X given D = 1 and D = 0 within the overlap set. Only the B_{3} term arises from selection bias as precisely defined. The “true” bias B̄_{S_X} may be of a different magnitude, and even a different sign, than the conventional bias B.
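The decomposition can be checked numerically. A sketch for a discrete covariate (hypothetical data; `decompose_bias` is our name) that reproduces B = B_{1} + B_{2} + B_{3} by construction:

```python
import numpy as np

def decompose_bias(y0, d, x):
    """Sample analog of Eq. 3 for a discrete covariate x.
    b1: nonoverlapping support; b2: density misweighting on the
    overlap; b3: P_X times the mean bias on the common support."""
    s1, s0 = set(x[d == 1].tolist()), set(x[d == 0].tolist())
    sx = s1 & s0                                   # common support
    n1, n0 = (d == 1).sum(), (d == 0).sum()
    m = lambda dd, c: y0[(d == dd) & (x == c)].mean()            # cell mean
    f = lambda dd, c: ((d == dd) & (x == c)).sum() / (n1 if dd else n0)
    b1 = (sum(m(1, c) * f(1, c) for c in s1 - sx)
          - sum(m(0, c) * f(0, c) for c in s0 - sx))
    b2 = sum(m(0, c) * (f(1, c) - f(0, c)) for c in sx)
    b3 = sum((m(1, c) - m(0, c)) * f(1, c) for c in sx)
    b = y0[d == 1].mean() - y0[d == 0].mean()
    return b, b1, b2, b3

# The x = 2 cell appears only in the D = 1 sample, so b1 is nonzero.
y0 = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 1.0, 1.0, 4.0, 4.0])
d = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0])
x = np.array([0, 0, 1, 1, 2, 0, 0, 1, 1])
# B = 2.9 splits into b1 = 1.8 (support), b2 = -0.5 (weighting),
# b3 = 1.6 (selection bias precisely defined).
b, b1, b2, b3 = decompose_bias(y0, d, x)
```

The example illustrates the point of this section: the conventional measure B and the true bias B̄ on the common support can differ sharply once the support and weighting terms are stripped out.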
Reducing the Dimension of the Conditioning Set and a Nonparametric Test of the Validity of Matching
For samples with only a few thousand observations, such as the one we use here, nonparametric estimation of E(Y_{0} | X, D = 1) and E(Y_{0} | X, D = 0) for high-dimensional X is impractical. Instead, we estimate conditional means as functions of P(X) using the orthogonal decomposition

E(Y_{0} | X, D = 1) = E(Y_{0} | P(X), D = 1) + V,

where E(V | P(X), D = 1) = 0. Heckman et al. (12) show that forming the mean conditional on P(X) permits consistent, but possibly inefficient, estimation of terms analogous to those in Eq. 3 but conditioned on P(X) rather than X and with the conditional means integrated against the empirical distributions for P(X), F(P(X) | D = 1) and F(P(X) | D = 0).
Another advantage of conditioning on P(X) in constructing the conditional means is that we can test the validity of matching as a method of evaluating programs. If

Y_{0} ⫫ D | X, [4]

meaning that Y_{0} is independent of D given X, then Y_{0} ⫫ D | P(X) for P(X) ∈ H ⊆ (0, 1), where H is some set in the unit interval [see Rosenbaum and Rubin (5)]. Two implications of Eq. 4 are that

E(Y_{0} | D = 1, P(X)) = E(Y_{0} | D = 0, P(X)) [5a]

and

F(Y_{0} | D = 1, P(X)) = F(Y_{0} | D = 0, P(X)), [5b]

so that B(P(X)) = E(Y_{0} | D = 1, P(X)) − E(Y_{0} | D = 0, P(X)) = 0 for all P(X) ∈ H and hence B̄_{S_P} = 0. A test that B(P(X)) = 0 for all P(X) ∈ H is a test of the validity of the matching method as an estimator of treatment effects in the region H.
Provided that condition 5a is met, matching is a very attractive method for estimating Δ conditional on P(X). Under the condition given by Eq. 4, or the weaker condition 5a, the difficulty of finding matches for high-dimensional X is avoided by conditioning only on P(X). Furthermore, matching methods using observations with common support eliminate two of the three sources of bias in Eq. 3. The bias arising from regions of nonoverlapping support, term B_{1} in Eq. 3, is eliminated by matching only over regions of common support. The bias due to different density weighting is eliminated because matching on participant propensity scores effectively reweights the nonparticipant data. Thus the term P_{X}B̄_{S_P} is the only component in Eq. 3 that is not necessarily eliminated by matching.
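To see why matching removes the support and weighting components, consider a minimal nearest-neighbor sketch (hypothetical data; `match_on_score` is our illustrative name, and the paper's actual estimator is regression-based rather than nearest-neighbor): it trims participants to the common support and reweights the comparison group by matching at each participant's score.

```python
import numpy as np

def match_on_score(y0_treated, p_treated, y0_comp, p_comp):
    """Bias left after nearest-neighbor matching on the propensity score:
    trim treated units to the common support, then compare each with the
    comparison-group outcome at the closest score."""
    lo = max(p_treated.min(), p_comp.min())    # common support bounds
    hi = min(p_treated.max(), p_comp.max())
    keep = (p_treated >= lo) & (p_treated <= hi)
    imputed = np.array([y0_comp[np.abs(p_comp - p).argmin()]
                        for p in p_treated[keep]])
    return (y0_treated[keep] - imputed).mean()

p_t = np.array([0.2, 0.4, 0.9])    # 0.9 lies outside the comparison support
y_t = np.array([1.0, 2.0, 10.0])
p_c = np.array([0.1, 0.2, 0.4])
y_c = np.array([0.0, 1.0, 1.0])
bias_after_matching = match_on_score(y_t, p_t, y_c, p_c)
```

Only the within-support outcome gap survives; the treated unit at P = 0.9, for which no comparison counterpart exists, is dropped rather than matched badly.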
Nonparametric estimates of each of the components in Eq. 3 are obtained from Eq. 6, below, where n_{1} denotes the size of the D = 1 sample and n_{0} denotes the size of the D = 0 sample. Let ^ indicate an estimate, let {D = 1} be the set of indices i for persons with D = 1 and {D = 0} the set of indices i for persons with D = 0, and let P_{i} = P(X_{i}) for person i. Then we may decompose B̂ into the sample analogs of the three terms in Eq. 3,

B̂ = B̂_{1} + B̂_{2} + B̂_{3}, [6]

where

B̂_{1} = (1/n_{1}) Σ_{i∈{D=1}, P_{i}∉S_{P}} Y_{0i} − (1/n_{0}) Σ_{i∈{D=0}, P_{i}∉S_{P}} Y_{0i},

B̂_{2} = (1/n_{1}) Σ_{i∈{D=1}, P_{i}∈S_{P}} Ê(Y_{0} | D = 0, P_{i}) − (1/n_{0}) Σ_{i∈{D=0}, P_{i}∈S_{P}} Y_{0i},

B̂_{3} = (1/n_{1}) Σ_{i∈{D=1}, P_{i}∈S_{P}} [Y_{0i} − Ê(Y_{0} | D = 0, P_{i})],

and where the imputed outcome in the no-treatment state for an observation with propensity score P_{i}, Ê(Y_{0} | D = 0, P_{i}), is estimated by a local linear regression of Y_{0} on P_{i} using data on persons for whom D = 0. We use the local linear regression methods of Fan (13) with optimal data-dependent bandwidths. Each term under the summations on the right-hand side of Eq. 6 is self-weighted by averaging over the empirical distribution of propensity scores in either the D = 1 or D = 0 sample. Heckman et al. (12) show that under random sampling each term is consistently estimated and that the square root of the sample size times each term centered around its probability limit is asymptotically normal. That work extends the analysis in Rosenbaum and Rubin (5) by presenting a rigorous asymptotic distribution theory for the matching estimator.
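The imputation step can be sketched in a simplified form (our own sketch: a fixed Gaussian-kernel bandwidth for brevity, whereas the paper uses Fan's data-dependent optimal bandwidths):

```python
import numpy as np

def local_linear(p_train, y_train, p_eval, h=0.1):
    """Local linear regression of y on p: at each evaluation point, fit a
    kernel-weighted line and report its intercept (the fit at that point)."""
    fitted = np.empty(len(p_eval))
    for j, p in enumerate(p_eval):
        # square roots of Gaussian kernel weights, folded into both sides
        # of the least-squares problem
        w = np.sqrt(np.exp(-0.5 * ((p_train - p) / h) ** 2))
        X = np.column_stack([np.ones_like(p_train), p_train - p])
        beta, *_ = np.linalg.lstsq(X * w[:, None], y_train * w, rcond=None)
        fitted[j] = beta[0]          # intercept = fitted value at p
    return fitted

# Impute E(Y0 | D = 0, P) at participants' scores from D = 0 data;
# a noiseless linear outcome is recovered exactly.
p0 = np.linspace(0.05, 0.95, 19)
y0 = 2.0 + 3.0 * p0
imputed = local_linear(p0, y0, np.array([0.25, 0.5]))
```

Because the regression is refit locally at every evaluation point, the imputed values inherit the self-weighting by the empirical score distribution described above.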
Failure of a Common Support Condition: A Major Component of Measured Selection Bias
A major finding reported in our research [see Heckman et al. (10, 12)] is that using a variety of conditioning variables, the support condition is not satisfied over large intervals of 0 ≤ P(X) ≤ 1 in our sample. Fig. 1 a and b present histograms showing on the same graph the distributions of the estimates of P(X) for the control and comparison groups for adult men and women, respectively. The propensity scores were estimated using the covariates X reported in Heckman et al. (10). These covariates are chosen to minimize classification error when P̂(X) > P_{c} is used to predict D = 1 and P̂(X) ≤ P_{c} is used to predict D = 0, where P_{c} is some cutoff value of P(X). Recent (last 6 months) unemployment and earnings histories turn out to be the key predictors of participation for both groups. We find that the set of X that is chosen is robust to wide variations in P_{c} around the (known) population mean of P_{i}, E(P(X)). Our estimation method corrects for the overrepresentation of the experimental control group (D = 1) relative to the eligible nonparticipants (D = 0) in the available data using ideas developed in the analysis of weighted distributions by Rao (14, 15). A universal finding in our research using a variety of covariates is the failure of the common support condition. For both male and female comparison groups, there are substantial stretches of the control group values of P for which there are no comparison group members. This is an essential and hitherto unnoticed source of selection bias as conventionally measured.
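The overlap evident in histograms like Fig. 1 can be summarized with a single number; a crude sketch (our own diagnostic, not the paper's procedure) counts the share of control-group scores falling in histogram bins that also contain comparison-group scores:

```python
import numpy as np

def common_support_share(p_treated, p_comp, bins=20):
    """Share of treated-group propensity scores lying in histogram bins
    that are also populated by the comparison group."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    h1, _ = np.histogram(p_treated, bins=edges)
    h0, _ = np.histogram(p_comp, bins=edges)
    overlap = (h1 > 0) & (h0 > 0)     # bins populated by both groups
    return h1[overlap].sum() / h1.sum()

# The treated score 0.95 sits where the comparison group has no mass.
share = common_support_share(np.array([0.05, 0.15, 0.95]),
                             np.array([0.05, 0.12]), bins=10)
```

A share well below 1, as the paper finds for both men and women, signals that matching can only ever recover the impact for the overlapping part of the participant distribution.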
Estimating the Components of the Conventional Measure of Selection Bias
Table 1 presents consistent and asymptotically normal estimates of the three components of the decomposition in Eq. 3 estimated using the formula in Eq. 6. The data are from the National Job Training Partnership Act (JTPA) Study (NJS), a recent experimental evaluation of the training programs funded under the JTPA [see Orr et al. (16)]. The JTPA program is the largest federal training program in the United States and is similar both to earlier federal training programs in the United States and to many other programs throughout the world. Lessons from our study are likely to apply to other training programs.
In the JTPA evaluation, accepted applicants were randomly assigned into treatment and control groups, with the control group prohibited from receiving JTPA services for 18 months. A sample of persons eligible for JTPA in the same localities as the experiment who chose not to participate in the program was collected as a nonexperimental comparison group. The same survey instrument was administered to the control and comparison groups.
In the notation defined earlier, the control group sample gives information on Y_{0} for those with D = 1 and the sample of eligible nonparticipants gives Y_{0} for those with D = 0. Following the experimental analysis, we use quarterly earnings and total earnings in the 18 months after random assignment as our outcome measures.
Table 1 reports estimates of the components of the decomposition in Eq. 3 with earnings as the outcome variable for the adult men and women in our data. The first column in each table indicates the quarter (3-month period) over which the estimates are constructed. These quarters are defined relative to the month of random assignment. Each row corresponds to one quarter, with the bottom row reporting totals over the first six quarters (18 months) after random assignment. The second column reports the estimated mean selection bias B̂. The next three columns report estimates of the components of the decomposition in Eq. 3. The top number in each cell is the estimate, the number in parentheses is the bootstrap standard error, and the number in square brackets is the percentage of B̂ for the row that is attributable to the given component. The first component, B̂_{1}, is presented in the third column of each table. The component arising from misweighting of the data, B̂_{2}, is given in the fourth column and the component due to true selection bias, B̂_{3}, appears in the fifth column. The sixth column presents B̄_{S_P}, the estimated selection bias for those in the overlap set S_{P}. The final column expresses B̄_{S_P} as a fraction of the experimental impact estimate. All of the values in Table 1 are reported as monthly dollars. Thus, the value of −418 in the first row and first column of Table 1 indicates a mean earnings difference of −$418 per month over the 3 months of the first quarter after random assignment. The percentages of controls and eligible nonparticipants (ENPs) in the common support region for P_{i} are reported in the notes to each table.
A remarkable feature of the tables is that for the overall 18-month earnings measure, terms B̂_{1} and B̂_{2} are generally substantially larger than the selection bias term B̂_{3} for both groups. For adult males, the selection bias is a tiny fraction (only 2%) of the conventional measure of selection bias and is not statistically significantly different from zero. This is surprising because a majority of both the control and comparison group samples are in the overlap set, S_{P}, for both groups. For adult women, selection bias is proportionately higher although the conventional measure B̂ is lower than for adult males. For them the bias measures B̂ and B̂_{3} are of the same order of magnitude. Results for male and female youth reported in Heckman et al. (12) are similar to those for adult women. These overall results appear to provide a strong endorsement for matching on the propensity score as a method of program evaluation, especially for males. However, the bias B̄_{S_P} that is not eliminated by matching on a common support is still large relative to the treatment effects, as is shown in the seventh column of Table 1.
The decompositions for quarterly earnings tell a somewhat different story. There is considerable evidence of selection bias for adult males in quarter t = 5, although even in this quarter the selection bias is still dwarfed by the other components of Eq. 3. However, expressed as a fraction of the experimental impact estimate, the bias is substantial in most quarters.
The evidence for the empirical importance of selection bias that is not removed by the matching estimator used in this paper is even stronger when we examine the bias at particular deciles of the P_{i} distribution. This is done in Table 2. For adult males, the bias tends to be large, negative, and statistically significant at the lowest decile, with a large positive bias in the upper deciles. For adult women, the pattern is U-shaped with the smallest bias at the lowest deciles. The apparent success of the matching method in eliminating selection bias in the overall estimates is a fortuitous circumstance that masks substantial bias within quarters and over particular subintervals of P_{i}. These patterns are found for many different specifications of P (see ref. 10).
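The decile-level pattern examined in Table 2 can be computed with a short sketch (hypothetical data; `bias_by_decile` is our name): split the participants' score distribution into deciles and difference the mean outcomes within each.

```python
import numpy as np

def bias_by_decile(y0, d, p):
    """B(P) within deciles of the D = 1 propensity-score distribution:
    mean Y0 difference between D = 1 and D = 0 units in each decile."""
    cuts = np.quantile(p[d == 1], np.linspace(0.0, 1.0, 11))
    out = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        in1 = (d == 1) & (p >= lo) & (p <= hi)
        in0 = (d == 0) & (p >= lo) & (p <= hi)
        out.append(float(y0[in1].mean() - y0[in0].mean())
                   if in1.any() and in0.any() else np.nan)
    return out

# Constant bias of 2.0 by construction, so every decile shows 2.0;
# real data, as in Table 2, instead show sign changes across deciles.
p = np.tile(np.linspace(0.05, 0.95, 50), 2)
d = np.repeat([1, 0], 50)
y0 = np.where(d == 1, 5.0, 3.0)
decile_bias = bias_by_decile(y0, d, p)
```

An overall bias near zero is compatible with large offsetting decile-level biases, which is exactly the averaging artifact the text warns about.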
The Failure of Matching to Estimate the Full Treatment Effect
Fig. 1 demonstrates that the support of P_{i} in the overlap set, S_{P}, is substantially different from the support of P_{i} for participants in the program, S_{1P}. This evidence implies that even if matching eliminates selection bias for P_{i} in the common support, the matching estimator cannot estimate the impact of participation over the entire set S_{1P}. In Heckman et al. (10), we report that the treatment effect varies with P_{i}; thus, failure of the common support condition S_{0P} = S_{1P} means that the matching estimator cannot identify the full treatment effect. At best, the matching estimator provides a partial description of the impact of participation on outcomes.
Acknowledgments
We thank Derek Neal and José Scheinkman for critical readings of this manuscript. We thank the Bradley Foundation, the Russell Sage Foundation, and the National Science Foundation (SBR9321048) for research support.
Footnotes
↵ To whom reprint requests should be addressed: James J. Heckman.

Abbreviation: JTPA, Job Training Partnership Act.
 Accepted July 25, 1996.
 Copyright © 1996, The National Academy of Sciences of the USA
References
1. Roy, A. D. (1951) Oxford Econ. Papers 3, 135–146.
2. Fisher, R. A. (1935) The Design of Experiments (Oliver & Boyd, Edinburgh).
3. Quandt, R. (1972) J. Am. Stat. Assoc. 67, 306–310.
4. Rubin, D. (1974) J. Educ. Psychol. 66, 688–701.
5. Rosenbaum, P. & Rubin, D. B. (1983) Biometrika 70, 41–55.
6. Heckman, J. & Robb, R. (1985) in Longitudinal Analysis of Labor Market Data, eds. Heckman, J. & Singer, B. (Cambridge Univ. Press, New York).
7. Heckman, J. (1992) in Evaluating Welfare and Training Programs, eds. Manski, C. & Garfinkel, I. (Harvard Univ. Press, Cambridge, MA).
8. Heckman, J. & Smith, J. (1995) J. Econ. Perspect. 9, 85–110.
9. Heckman, J., Smith, J. & Taber, C. (1996) Rev. Econ. Stat., in press.
10. Heckman, J., Ichimura, H., Smith, J. & Todd, P. (1996) Econometrica, in press.
11. LaLonde, R. (1986) Am. Econ. Rev. 76, 604–620.
12. Heckman, J., Ichimura, H. & Todd, P. (1996) Rev. Econ. Studies, in press.
13. Fan, J. (1992) J. Am. Stat. Assoc. 87, 998–1004.
14. Rao, C. R. (1965) in Classical and Contagious Discrete Distributions, ed. Patil, G. P. (Statistical Publishing Society, Calcutta).
15. Rao, C. R. (1985) in A Celebration of Statistics, eds. Atkinson, A. C. & Fienberg, S. E. (Springer, New York).
16. Orr, L., Bloom, H., Bell, S., Lin, W., Cave, G. & Doolittle, F. (1994) The National JTPA Study: Impacts, Benefits and Costs of Title II-A (Abt Associates, Bethesda, MD).