Local instrumental variables and latent variable models for identifying and bounding treatment effects

Contributed by James Joseph Heckman
Related Article: Models of treatment effects when responses are heterogeneous (June 8, 1999).
Abstract
This paper examines the relationship between various treatment parameters within a latent variable model when the effects of treatment depend on the recipient’s observed and unobserved characteristics. We show how this relationship can be used to identify the treatment parameters when they are identified and to bound the parameters when they are not identified.
This paper uses the latent variable or index model of econometrics and psychometrics to impose structure on the Neyman (1)–Fisher (2)–Cox (3)–Rubin (4) model of potential outcomes used to define treatment effects. We demonstrate how the local instrumental variable (LIV) parameter (5) can be used within the latent variable framework to generate the average treatment effect (ATE), the effect of treatment on the treated (TT), and the local ATE (LATE) of Imbens and Angrist (6), thereby establishing a relationship among these parameters. LIV can be used to estimate all of the conventional treatment effect parameters when the index condition holds and the parameters are identified. When they are not, LIV can be used to produce bounds on the parameters, with the width of the bounds depending on the width of the support for the index generating the choice of the observed potential outcome.
Models of Potential Outcomes in a Latent Variable Framework
For each person i, assume two potential outcomes (Y_{0i}, Y_{1i}) corresponding, respectively, to the potential outcomes in the untreated and treated states. Let D_{i} = 1 denote the receipt of treatment; D_{i} = 0 denotes nonreceipt. Let Y_{i} be the measured outcome variable, so that

Y_{i} = D_{i}Y_{1i} + (1 − D_{i})Y_{0i}.

This is the Neyman–Fisher–Cox–Rubin model of potential outcomes. It is also the switching regression model of Quandt (7) or the Roy model of income distribution (8, 9).
This paper assumes that a latent variable model generates the indicator variable D. Specifically, we assume that the assignment or decision rule for the indicator is generated by a latent variable D^{*}_{i}:

D^{*}_{i} = μ_{D}(Z_{i}) − U_{Di},  D_{i} = 1 if D^{*}_{i} ≥ 0, D_{i} = 0 otherwise,  [1]

where Z_{i} is a vector of observed random variables and U_{Di} is an unobserved random variable. D^{*}_{i} is the net utility or gain to the decision-maker from choosing state 1. The index structure underlies many models in econometrics (10) and in psychometrics (11).
The potential outcome equation for the participation state is Y_{1i} = μ_{1}(X_{i}, U_{1i}), and the potential outcome for the nonparticipation state is Y_{0i} = μ_{0}(X_{i}, U_{0i}), where X_{i} is a vector of observed random variables and (U_{1i}, U_{0i}) are unobserved random variables. It is assumed that Y_{0} and Y_{1} are defined for everyone and that these outcomes are independent across persons, so that there are no interactions among agents. Important special cases include models with (Y_{0}, Y_{1}) generated by latent variables: μ_{j}(X_{i}, U_{ji}) = μ_{j}(X_{i}) + U_{ji} if Y is continuous, and μ_{j}(X_{i}, U_{ji}) = 1(Xβ_{j} + U_{ji} ≥ 0) if Y is binary, where 1(A) is the indicator function that takes the value 1 if the event A is true and the value 0 otherwise. We do not restrict the (μ_{1}, μ_{0}) functions except through integrability condition iv given below.
We assume: (i) μ_{D}(Z) is a nondegenerate random variable conditional on X = x; (ii) (U_{D}, U_{1}) and (U_{D}, U_{0}) are absolutely continuous with respect to Lebesgue measure on ℜ^{2}; (iii) (U_{D}, U_{1}) and (U_{D}, U_{0}) are independent of (Z, X); (iv) Y_{1} and Y_{0} have finite first moments; and (v) Pr(D = 1) > 0.
Assumption i requires an exclusion restriction: There exists a variable that determines the treatment decision but does not directly affect the outcome. Let F_{U_D} be the distribution of U_{D}, with the analogous notation for the distributions of the other random variables. Let P(z) denote Pr(D = 1 | Z = z) = F_{U_D}(μ_{D}(z)). P(z) is sometimes called the “propensity score,” following ref. 12. Let Ũ_{D} denote the probability transform of U_{D}: Ũ_{D} = F_{U_D}(U_{D}). Note that, because U_{D} is absolutely continuous with respect to Lebesgue measure, Ũ_{D} ∼ Unif(0, 1). Let Δ_{i} denote the treatment effect for person i: Δ_{i} = Y_{1i} − Y_{0i}.
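As a sketch of the probability transform and the propensity score, the following simulation is our own illustration, not the paper's: it assumes a logistic U_{D} and a linear index μ_{D}(z) = 0.8z, and checks that the simulated participation rate under the latent-index rule matches P(z) = F_{U_D}(μ_{D}(z)).

```python
# Illustrative latent-index choice model D = 1(mu_D(Z) - U_D >= 0).
# The logistic U_D and the index mu_D(z) = 0.8*z are assumptions for this
# sketch only; the paper leaves F_{U_D} and mu_D unspecified.
import math
import random

def F_ud(u):
    """Logistic CDF of the unobservable U_D (illustrative assumption)."""
    return 1.0 / (1.0 + math.exp(-u))

def mu_D(z):
    """Linear index mu_D(z) = 0.8*z (illustrative assumption)."""
    return 0.8 * z

def propensity(z):
    """P(z) = Pr(D = 1 | Z = z) = F_{U_D}(mu_D(z))."""
    return F_ud(mu_D(z))

rng = random.Random(0)
z = 1.0
n = 200_000
draws = []
for _ in range(n):
    u_tilde = rng.random()                      # U~_D = F_{U_D}(U_D) ~ Unif(0, 1)
    u_d = math.log(u_tilde / (1.0 - u_tilde))   # inverse logistic CDF
    draws.append(1 if mu_D(z) - u_d >= 0 else 0)

# The sample participation rate should approximate P(z).
print(round(sum(draws) / n, 2), round(propensity(z), 2))
```

Note that D = 1(μ_{D}(z) − U_{D} ≥ 0) is equivalent to Ũ_{D} ≤ P(z), which is exactly why the uniformly distributed probability transform is the convenient conditioning variable in what follows.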
It is the index structure on D that plays the crucial role in this paper. An index structure on the potential outcomes (Y_{0}, Y_{1}) is not required, although it is both conventional and convenient in many applications.
Definition of Parameters
We examine four different mean parameters within this framework: the ATE, the effect of treatment on the treated (TT), the local ATE (LATE), and the LIV parameter. The average treatment effect is given by

Δ^{ATE}(x) = E(Δ | X = x).  [2]

From assumption iv, it follows that E(Δ | X = x) exists and is finite a.e. F_{X}. The expected effect of treatment on the treated is the most commonly estimated parameter for both observational data and social experiments (13, 14). It is defined as

Δ^{TT}(x, D = 1) = E(Δ | X = x, D = 1).  [3]

From iv, Δ^{TT}(x, D = 1) exists and is finite a.e. F_{X|D=1}, where F_{X|D=1} denotes the distribution of X conditional on D = 1. It will be useful to define a version of Δ^{TT}(X, D = 1) conditional on P(Z),

Δ^{TT}(x, P(z), D = 1) = E(Δ | X = x, P(Z) = P(z), D = 1),

so that

Δ^{TT}(x, D = 1) = E(Δ^{TT}(x, P(Z), D = 1) | X = x, D = 1).  [4]

From our assumptions, Δ^{TT}(x, P(z), D = 1) exists and is finite a.e. F_{X,P(Z)|D=1}. In the context of a latent variable model, the LATE parameter of Imbens and Angrist (6) using P(Z) as the instrument is

Δ^{LATE}(x, P(z), P(z′)) = [E(Y | X = x, P(Z) = P(z)) − E(Y | X = x, P(Z) = P(z′))]/[P(z) − P(z′)].  [5]

Without loss of generality, assume that P(z) > P(z′). From assumption iv, it follows that Δ^{LATE}(x, P(z), P(z′)) is well defined and is finite a.e. F_{X,P(Z)} × F_{X,P(Z)}. For interpretative reasons, Imbens and Angrist (6) also assume that P(z) is monotonic in z, a condition that we do not require. However, we do require that P(z) ≠ P(z′) for any (z, z′) where the parameter is defined.
The fourth parameter that we analyze is the LIV parameter introduced in ref. 5 and defined in the context of a latent variable model as

Δ^{LIV}(x, P(z)) = ∂E(Y | X = x, P(Z) = P(z))/∂P(z).  [6]

LIV is the limit form of the LATE parameter. In the next section, we demonstrate that Δ^{LIV}(x, P(z)) exists and is finite a.e. F_{X,P(Z)} under our maintained assumptions.
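To make Eq. 6 concrete, here is a small numerical sketch under functional forms of our own choosing (not the paper's): take E(Y_{1} | Ũ_{D} = u) = 2 − u and E(Y_{0} | Ũ_{D} = u) = u, so that E(Δ | Ũ_{D} = u) = 2 − 2u. The observed-outcome regression then has the closed form m(p) = 2p − p² + 1/2, and differentiating it in p recovers the LIV parameter.

```python
# Illustrative check of Eq. 6 (functional forms are assumptions of ours):
# with E(Y1 | U~_D = u) = 2 - u and E(Y0 | U~_D = u) = u,
#   m(p) = E(Y | X = x, P(Z) = p)
#        = int_0^p (2 - u) du + int_p^1 u du = 2p - p**2 + 0.5,
# and LIV is its derivative in p, here 2 - 2p.
def m(p):
    return 2.0 * p - p ** 2 + 0.5

def liv_numeric(p, h=1e-6):
    # Central difference approximating dE(Y | X = x, P(Z) = p)/dp,
    # i.e., LIV as the limit of LATE over a shrinking interval.
    return (m(p + h) - m(p - h)) / (2.0 * h)

print(round(liv_numeric(0.3), 4))  # analytic LIV at p = 0.3 is 2 - 2*0.3 = 1.4
```

Because m is quadratic here, the central difference reproduces the analytic derivative up to floating-point error; with real data, m would be estimated nonparametrically before differentiating.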
A more general framework defines the parameters in terms of Z. The latent variable or index structure implies that defining the parameters in terms of Z or of P(Z) results in equivalent expressions. In the index model, Z enters the model only through the μ_{D}(Z) index, so that, for any measurable set A,

E(Δ | X = x, Z ∈ A) = E(Δ | X = x, μ_{D}(Z) ∈ μ_{D}(A)).

Because any cumulative distribution function is right-continuous and nondecreasing, we have

E(Δ | X = x, μ_{D}(Z) ∈ μ_{D}(A)) = E(Δ | X = x, P(Z) ∈ F_{U_D}(μ_{D}(A))).
Relationship Between Parameters Using the Index Structure
Given the index structure, a simple relationship exists among the four parameters. From the definition, it is obvious that

Δ^{ATE}(x) = ∫_{0}^{1} E(Δ | X = x, Ũ_{D} = u)du.  [7]

Next, consider Δ^{LATE}(x, P(z), P(z′)). Note that

E(Y | X = x, P(Z) = P(z)) = ∫_{0}^{P(z)} E(Y_{1} | X = x, Ũ_{D} = u)du + ∫_{P(z)}^{1} E(Y_{0} | X = x, Ũ_{D} = u)du,  [8]

so that

E(Y | X = x, P(Z) = P(z)) − E(Y | X = x, P(Z) = P(z′)) = ∫_{P(z′)}^{P(z)} E(Δ | X = x, Ũ_{D} = u)du,

and thus

Δ^{LATE}(x, P(z), P(z′)) = (1/(P(z) − P(z′))) ∫_{P(z′)}^{P(z)} E(Δ | X = x, Ũ_{D} = u)du.  [9]

LIV is the limit of this expression as P(z) → P(z′). In Eq. 8, E(Y_{1} | X = x, Ũ_{D} = u) and E(Y_{0} | X = x, Ũ_{D} = u) are integrable with respect to dF_{Ũ_D} a.e. F_{X}. Thus, E(Y_{1} | X = x, P(Z) = P(z)) and E(Y_{0} | X = x, P(Z) = P(z)) are differentiable a.e. with respect to P(z), and thus E(Y | X = x, P(Z) = P(z)) is differentiable a.e. with respect to P(z), with derivative given by†

Δ^{LIV}(x, P(z)) = ∂E(Y | X = x, P(Z) = P(z))/∂P(z) = E(Δ | X = x, Ũ_{D} = P(z)).  [10]

From assumption iv, the derivative in Eq. 10 is finite a.e. F_{X,Ũ_D}. The same argument can be used to show that Δ^{LATE}(x, P(z), P(z′)) is continuous and differentiable in P(z) and P(z′).
We rewrite these relationships in succinct form in the following way:

Δ^{ATE}(x) = ∫_{0}^{1} Δ^{LIV}(x, u)du

and

Δ^{LATE}(x, P(z), P(z′)) = (1/(P(z) − P(z′))) ∫_{P(z′)}^{P(z)} Δ^{LIV}(x, u)du.  [11]

Each parameter is an average value of LIV, E(Δ | X = x, Ũ_{D} = u), but for values of U_{D} lying in different intervals. LIV defines the treatment effect more finely than do LATE, ATE, or TT.
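These averaging relationships can be checked numerically. Under the same illustrative choice Δ^{LIV}(x, u) = 2 − 2u (an assumption of ours, not the paper's), ATE, TT, and LATE are all integrals of this one function over different u-intervals:

```python
# Sketch of Eq. 11 under the illustrative assumption LIV(x, u) = 2 - 2u:
# each treatment parameter is an average of LIV over some interval of u.
def liv(u):
    return 2.0 - 2.0 * u

def integrate(f, a, b, n=10_000):
    # Midpoint rule; exact (up to float error) for linear integrands.
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

ate = integrate(liv, 0.0, 1.0)                       # average over all of (0, 1)

def tt(p):                                           # average over (0, p)
    return integrate(liv, 0.0, p) / p

def late(p, p_prime):                                # average over (p', p)
    return integrate(liv, p_prime, p) / (p - p_prime)

print(round(ate, 3), round(tt(0.5), 3), round(late(0.9, 0.1), 3))
# analytic values here: ATE = 1, TT(p) = 2 - p, LATE(p, p') = 2 - (p + p')
```

Because gains decline in u in this toy model, TT exceeds ATE for every p < 1, illustrating why the parameters differ whenever effects are heterogeneous.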
Δ^{LIV}(x, p) is the average effect for people who are just indifferent between participating or not at the given value of the instrument (i.e., for people who are indifferent at P(Z) = p). Δ^{LIV}(x, p) for values of p close to zero is the average effect for individuals with unobservable characteristics that make them the most inclined to participate, and Δ^{LIV}(x, p) for values of p close to one is the average treatment effect for individuals with unobservable characteristics that make them the least inclined to participate. ATE integrates Δ^{LIV}(x, p) over the entire support of Ũ_{D} (from p = 0 to p = 1); it is the average effect for an individual chosen at random. Δ^{TT}(x, P(z), D = 1) is the average treatment effect for persons who chose to participate at the given value of P(Z) = P(z); Δ^{TT}(x, P(z), D = 1) integrates Δ^{LIV}(x, p) up to p = P(z). As a result, it is primarily determined by the average effect for individuals whose unobserved characteristics make them the most inclined to participate in the program. LATE is the average treatment effect for someone who would not participate if P(Z) ≤ P(z′) and would participate if P(Z) ≥ P(z); Δ^{LATE}(x, P(z), P(z′)) integrates Δ^{LIV}(x, p) from p = P(z′) to p = P(z).
To derive TT, use Eq. 4 to obtain

Δ^{TT}(x, P(z), D = 1) = (1/P(z)) ∫_{0}^{P(z)} E(Δ | X = x, Ũ_{D} = u)du.  [12]

Using Bayes rule, one can show that

Pr(P(Z) = P(z) | X = x, D = 1) = Pr(D = 1 | X = x, P(Z) = P(z)) Pr(P(Z) = P(z) | X = x)/Pr(D = 1 | X = x).  [13]

Because Pr(D = 1 | X = x, P(Z) = P(z)) = P(z),

Δ^{TT}(x, D = 1) = (1/Pr(D = 1 | X = x)) ∫_{0}^{1} [∫_{0}^{p} E(Δ | X = x, Ũ_{D} = u)du] dF_{P(Z)|X=x}(p).  [14]

Note further that, because Pr(D = 1 | X = x) = E(P(Z) | X = x) = ∫_{0}^{1} (1 − F_{P(Z)|X=x}(t))dt, we can reinterpret Eq. 14 as a weighted average of LIV parameters in which the weighting is the same as that from a “length-biased,” “size-biased,” or “P-biased” sample:

Δ^{TT}(x, D = 1) = ∫_{0}^{1} E(Δ | X = x, Ũ_{D} = u) g_{x}(u)du,  [15]

where g_{x}(u) = (1 − F_{P(Z)|X=x}(u))/∫_{0}^{1} (1 − F_{P(Z)|X=x}(t))dt. Replacing P(Z) with length-of-spell, g_{x}(u) is the density of a length-biased sample of the sort that would be obtained from stock-biased sampling in duration analysis (16). Here we sample from the distribution of P(Z) conditional on D = 1 and obtain an analogous density used to weight up LIV. g_{x}(u) is a nonincreasing function of u. Δ^{LIV}(x, p) is given zero weight for p ≥ p^{max}(x).
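The equivalence of Eqs. 14 and 15 can be verified in the same toy setting. The assumptions below are ours, not the paper's: P(Z) | X ∼ Unif(0, 1), so F_{P(Z)|X=x}(u) = u, g_{x}(u) = 2(1 − u), Pr(D = 1 | X = x) = 1/2, and Δ^{LIV}(x, u) = 2 − 2u.

```python
# Check that TT computed via the length-biased weight g_x (Eq. 15) matches
# TT computed by integrating over the distribution of P(Z) (Eq. 14).
# Assumptions for this sketch: P(Z) | X ~ Unif(0, 1) and LIV(x, u) = 2 - 2u.
def liv(u):
    return 2.0 - 2.0 * u

def g(u):
    # (1 - F(u)) / int_0^1 (1 - F(t)) dt with F(u) = u gives 2*(1 - u),
    # a nonincreasing weight that favors those most inclined to participate.
    return 2.0 * (1.0 - u)

def integrate(f, a, b, n=500):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Eq. 15: TT as the g_x-weighted average of LIV.
tt_weighted = integrate(lambda u: liv(u) * g(u), 0.0, 1.0)

# Eq. 14: TT by integrating int_0^p LIV(u) du over dF_{P(Z)|X}, scaled by
# 1 / Pr(D = 1 | X = x) = 2 in this uniform example.
tt_direct = 2.0 * integrate(lambda p: integrate(liv, 0.0, p), 0.0, 1.0)

print(round(tt_weighted, 3), round(tt_direct, 3))  # both equal 4/3
```

Both routes give 4/3 here, above the ATE of 1 from the earlier sketch, which is the length-biased logic at work: participants oversample the low-u, high-gain part of the population.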
Identification of Treatment Parameters
Assume access to an infinite independently and identically distributed sample of (D, Y, X, Z) observations, so that the joint distribution of (D, Y, X, Z) is identified. Let 𝒫(x) denote the closure of the support of P(Z) conditional on X = x, and let 𝒫^{c}(x) = (0, 1)∖𝒫(x). Let p^{max}(x) and p^{min}(x) be the maximum and minimum values in 𝒫(x).
LATE and LIV are defined as functions of (Y, X, Z) and are thus straightforward to identify. Δ^{LATE}(x, P(z), P(z′)) is identified for any (P(z), P(z′)) ∈ 𝒫(x) × 𝒫(x). Δ^{LIV}(x, P(z)) is identified for any P(z) that is a limit point of 𝒫(x). The larger the support of P(Z) conditional on X = x, the bigger the set of LIV and LATE parameters that can be identified.
ATE and TT are not defined directly as functions of (Y, X, Z), so a more involved discussion of their identification is required. We can use LIV or LATE to identify ATE and TT under the appropriate support conditions: (i) If 𝒫(x) = [0, 1], then Δ^{ATE}(x) is identified from Δ^{LIV}. If {0, 1} ⊂ 𝒫(x), then Δ^{ATE}(x) is identified from Δ^{LATE}. (ii) If (0, P(z)) ⊂ 𝒫(x), then Δ^{TT}(x, P(z), D = 1) is identified from Δ^{LIV}. If {0, P(z)} ⊂ 𝒫(x), then Δ^{TT}(x, P(z), D = 1) is identified from Δ^{LATE}.
Note that TT is identified under weaker conditions than is ATE. To identify TT, one needs to observe P(Z) arbitrarily close to 0 (p^{min}(x) = 0) and to observe some positive P(Z) values, whereas to identify ATE, one needs to observe P(Z) arbitrarily close to 0 and arbitrarily close to 1 (p^{min}(x) = 0 and p^{max}(x) = 1). Note that the conditions involve the closure of the support of P(Z) conditional on X = x and not the support itself. For example, to identify Δ^{TT}(x, D = 1) from Δ^{LATE}, we do not require that 0 be in the support of P(Z) conditional on X = x, only that points arbitrarily close to 0 be in the support. This weaker requirement follows from Δ^{LIV}(x, P(z)) being a continuous function of P(z) and Δ^{LATE}(x, P(z), P(z′)) being a continuous function of P(z) and P(z′).
Without these support conditions, we can still construct bounds if Y_{1} and Y_{0} are known to be bounded with probability one. For ease of exposition and to simplify the notation, assume that Y_{1} and Y_{0} have the same bounds,‡ so that

Pr(y_{x}^{l} ≤ Y_{1} ≤ y_{x}^{u} | X = x) = 1

and

Pr(y_{x}^{l} ≤ Y_{0} ≤ y_{x}^{u} | X = x) = 1.

For example, if Y is an indicator variable, then the bounds are y_{x}^{l} = 0 and y_{x}^{u} = 1 for all x. For any P(z) ∈ 𝒫(x), we can identify

∫_{0}^{P(z)} E(Y_{1} | X = x, Ũ_{D} = u)du = E(YD | X = x, P(Z) = P(z))  [16]

and

∫_{P(z)}^{1} E(Y_{0} | X = x, Ũ_{D} = u)du = E(Y(1 − D) | X = x, P(Z) = P(z)).  [17]

In particular, we can evaluate Eq. 16 at P(z) = p^{max}(x) and can evaluate Eq. 17 at P(z) = p^{min}(x). The distribution of (D, Y, X, Z) contains no information on ∫_{p^{max}(x)}^{1} E(Y_{1} | X = x, Ũ_{D} = u)du and ∫_{0}^{p^{min}(x)} E(Y_{0} | X = x, Ũ_{D} = u)du, but we can bound these quantities:

(1 − p^{max}(x))y_{x}^{l} ≤ ∫_{p^{max}(x)}^{1} E(Y_{1} | X = x, Ũ_{D} = u)du ≤ (1 − p^{max}(x))y_{x}^{u}

and

p^{min}(x)y_{x}^{l} ≤ ∫_{0}^{p^{min}(x)} E(Y_{0} | X = x, Ũ_{D} = u)du ≤ p^{min}(x)y_{x}^{u}.  [18]

We thus can bound Δ^{ATE}(x) by§

Δ^{ATE}(x) ≤ E(YD | X = x, P(Z) = p^{max}(x)) + (1 − p^{max}(x))y_{x}^{u} − E(Y(1 − D) | X = x, P(Z) = p^{min}(x)) − p^{min}(x)y_{x}^{l}

and

Δ^{ATE}(x) ≥ E(YD | X = x, P(Z) = p^{max}(x)) + (1 − p^{max}(x))y_{x}^{l} − E(Y(1 − D) | X = x, P(Z) = p^{min}(x)) − p^{min}(x)y_{x}^{u}.

The width of the bounds is thus

(y_{x}^{u} − y_{x}^{l})[(1 − p^{max}(x)) + p^{min}(x)].

The width is linearly related to the distance between p^{max}(x) and 1 and the distance between p^{min}(x) and 0. These bounds are directly related to the “identification at infinity” results of refs. 9 and 18. Such identification-at-infinity results require the condition that μ_{D}(Z) takes arbitrarily large and arbitrarily small values if the support of U_{D} is unbounded. The condition is sometimes criticized as being not credible. However, as is made clear by the width of the above bounds, the proper metric for measuring how close one is to identification at infinity is the distance between p^{max}(x) and 1 and the distance between p^{min}(x) and 0. It is credible that these distances may be small. In practice, semiparametric methods that use identification-at-infinity arguments to identify ATE are implicitly extrapolating E(Y_{1} | X = x, Ũ_{D} = u) for u > p^{max}(x) and E(Y_{0} | X = x, Ũ_{D} = u) for u < p^{min}(x).
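For a binary outcome (y_{x}^{l} = 0, y_{x}^{u} = 1), the width of the ATE bounds is easy to tabulate; the (p^{min}, p^{max}) pairs below are illustrative values of our own, not from the paper:

```python
# Width of the ATE bounds, (y_u - y_l) * ((1 - p_max) + p_min), for a bounded
# outcome. Defaults correspond to binary Y; the support values are illustrative.
def ate_bound_width(p_min, p_max, y_l=0.0, y_u=1.0):
    return (y_u - y_l) * ((1.0 - p_max) + p_min)

print(round(ate_bound_width(0.05, 0.95), 2))  # wide propensity support: width 0.1
print(round(ate_bound_width(0.30, 0.70), 2))  # narrow support: width 0.6
print(round(ate_bound_width(0.00, 1.00), 2))  # full support: width 0, point identification
```

The last line is the "identification at infinity" case in the text: as the support of P(Z) fills (0, 1), the bounds collapse to a point.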
We can construct analogous bounds for Δ^{TT}(x, P(z), D = 1) for P(z) ∈ 𝒫(x):

Δ^{TT}(x, P(z), D = 1) ≤ (1/P(z))[E(YD | X = x, P(Z) = P(z)) − E(Y(1 − D) | X = x, P(Z) = p^{min}(x)) + E(Y(1 − D) | X = x, P(Z) = P(z)) − p^{min}(x)y_{x}^{l}]

and

Δ^{TT}(x, P(z), D = 1) ≥ (1/P(z))[E(YD | X = x, P(Z) = P(z)) − E(Y(1 − D) | X = x, P(Z) = p^{min}(x)) + E(Y(1 − D) | X = x, P(Z) = P(z)) − p^{min}(x)y_{x}^{u}].

The width of the bounds on Δ^{TT}(x, P(z), D = 1) is thus

(y_{x}^{u} − y_{x}^{l}) p^{min}(x)/P(z).

The width of the bounds is linearly related to the distance between p^{min}(x) and 0. Note that the bounds are tighter for larger P(z) evaluation points because the higher the P(z) evaluation point, the less weight is placed on the unidentified quantity ∫_{0}^{p^{min}(x)} E(Y_{0} | X = x, Ũ_{D} = u)du. In the extreme case, where P(z) = p^{min}(x), the width of the bounds simplifies to y_{x}^{u} − y_{x}^{l}.
We can integrate the bounds on Δ^{TT}(x, P(z), D = 1) over the distribution of P(Z) conditional on X = x and D = 1 to bound Δ^{TT}(x, D = 1). The width of the bounds on Δ^{TT}(x, D = 1) is thus

(y_{x}^{u} − y_{x}^{l}) p^{min}(x) E(1/P(Z) | X = x, D = 1).

Using Eq. 13, we have

(y_{x}^{u} − y_{x}^{l}) p^{min}(x)/Pr(D = 1 | X = x).  [19]

Unlike the bounds on ATE, the bounds on TT depend on the distribution of P(Z), in particular on Pr(D = 1 | X = x) = E(P(Z) | X = x). The width of the bounds is linearly related to the distance between p^{min}(x) and 0, holding Pr(D = 1 | X = x) constant. The larger Pr(D = 1 | X = x) is, the tighter the bounds, because the larger P(Z) is on average, the less probability weight is placed on the unidentified quantity ∫_{0}^{p^{min}(x)} E(Y_{0} | X = x, Ũ_{D} = u)du.
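The TT bound width in Eq. 19 can be sketched the same way for a binary outcome; the values of p^{min}(x) and Pr(D = 1 | X = x) below are illustrative, not from the paper:

```python
# Width of the TT bounds (Eq. 19) for a bounded outcome:
# (y_u - y_l) * p_min / Pr(D = 1 | X = x). Defaults correspond to binary Y.
def tt_bound_width(p_min, pr_d1, y_l=0.0, y_u=1.0):
    return (y_u - y_l) * p_min / pr_d1

print(round(tt_bound_width(0.05, 0.50), 2))  # width 0.1
print(round(tt_bound_width(0.05, 0.25), 2))  # halving Pr(D=1|X) doubles the width to 0.2
```

This makes the comparative static in the text explicit: for a fixed p^{min}(x), programs with higher participation rates yield tighter TT bounds.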
Conclusion
This paper uses an index model or latent variable model for the selection variable D to impose some structure on a model of potential outcomes that originates with Neyman (1), Fisher (2), and Cox (3). We introduce the LIV parameter as a device for unifying different treatment parameters. Different treatment effect parameters can be seen as averaged versions of the LIV parameter that differ according to how they weight the LIV parameter. ATE weights all LIV parameters equally. LATE gives equal weight to the LIV parameters within a given interval. TT gives a large weight to those LIV parameters corresponding to the treatment effect for individuals who are the most inclined to participate in the program. The weighting of P for LIV that produces TT is like that obtained in length-biased or size-biased samples.
Identification of LATE and LIV parameters depends on the support of the propensity score, P(Z). The larger the support of P(Z), the larger the set of LATE and LIV parameters that are identified. Identification of ATE depends on observing P(Z) values arbitrarily close to 1 and P(Z) values arbitrarily close to 0. When such P(Z) values are not observed, ATE can be bounded, and the width of the bounds is linearly related to the distance between 1 and the largest P(Z) value and the distance between 0 and the smallest P(Z) value. For TT, identification requires that one observe P(Z) values arbitrarily close to 0. If this condition does not hold, then the TT parameter can be bounded, and the width of the bounds will be linearly related to the distance between 0 and the smallest P(Z) value, holding Pr(D = 1 | X) constant.
Implementation of these methods through either parametric or nonparametric methods is straightforward. In joint work with Arild Aakvik of the University of Bergen (Bergen, Norway), we have developed the sampling theory for the LIV estimator and empirically estimated and bounded various treatment parameters for a Norwegian vocational rehabilitation program.
We conclude this paper with the observation that the index structure for D is not strictly required, nor is any monotonicity assumption necessary to produce results analogous to those presented in this paper. The index structure on D simplifies the derivations and yields the elegant relationships presented here. However, LIV can be defined without using an index structure (5); so can LATE. We can define LIV for different sets of regressors and produce relationships like those given in Eq. 11, defining the integrals over multidimensional sets instead of intervals. The bounds we present can be generalized to cover this case as well. The index structure for D arises in many psychometric and economic models in which the index represents net utilities or net preferences over states, and these are usually assumed to be continuous. In these cases, its application leads to the simple and concise relationships given in this paper.
Acknowledgments
We thank Arild Aakvik, Victor Aguirregabiria, Xiaohong Chen, Lars Hansen, and Justin Tobias for close readings of this manuscript. We also thank participants in the Canadian Econometric Studies Group (September 1998), the Midwest Econometrics Group (September 1998), the University of Uppsala (November 1998), the University of Chicago (December 1998), and University College London (December 1998). James J. Heckman is Henry Schultz Distinguished Service Professor of Economics at the University of Chicago and a Senior Fellow at the American Bar Foundation. Edward Vytlacil is a Sloan Fellow at the University of Chicago. This research was supported by National Institutes of Health Grants R01HD3495801 and R01HD3205803, National Science Foundation Grant 9709873, and the Donner Foundation.
Footnotes

* To whom reprint requests should be addressed at: 1126 East 59th Street, Chicago, IL 60637. E-mail: jjh@uchicago.edu.

† See, e.g., Kolmogorov and Fomin (15), Theorem 9.8, for one proof.

‡ The modifications required to handle the more general case are straightforward.

§ The following bounds on ATE also can be derived easily by applying Manski’s (17) bounds for “Level-Set Restrictions on the Outcome Regression.” The bounds for the other parameters discussed in this paper cannot be derived by applying his results.
ABBREVIATIONS: LIV, local instrumental variable; ATE, average treatment effect; TT, effect of treatment on the treated; LATE, local ATE.
 Accepted February 2, 1999.
 Copyright © 1999, The National Academy of Sciences
References

Neyman J
Fisher R A
Cox D R
Roy A
Maddala G S
Rosenbaum P, Rubin D
Heckman J, Singer B
Heckman J, Robb R
Ashenfelter O, Card D
Heckman J, Lalonde R, Smith J
Kolmogorov A N, Fomin S V
Feinberg S
Rao C R
Manski C
Heckman J