Does matching overcome LaLonde's critique of nonexperimental estimators?
Introduction
There is a long-standing debate in the literature over whether social programs can be reliably evaluated without a randomized experiment. Randomization has a key advantage over nonexperimental methods in generating a control group that has the same distributions of both observed and unobserved characteristics as the treatment group. At the same time, social experimentation also has some drawbacks, such as high cost, the potential to distort the operation of an ongoing program, the common problem of program sites refusing to participate in the experiment, and the problem of randomized-out controls seeking alternative forms of treatment. In contrast, evaluation methods that use nonexperimental data tend to be less costly and less intrusive. Also, for some questions of interest, they are the only alternative.
The major obstacle in implementing a nonexperimental evaluation strategy is choosing among the wide variety of estimation methods available in the literature. This choice is important given the accumulated evidence that impact estimates are often highly sensitive to the estimator chosen. A literature has arisen, starting with LaLonde (1986), that evaluates the performance of nonexperimental estimators using experimental data as a benchmark. Much of this literature implicitly frames the question as one of searching for “the” nonexperimental estimator that will always solve the selection bias problem inherent in nonexperimental evaluations. Two recent contributions to this literature by Dehejia and Wahba (DW) (1999, 2002) have drawn attention to a class of estimators called propensity score matching estimators. They apply these matching estimators to the same experimental data from the National Supported Work (NSW) Demonstration, and the same nonexperimental data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID), analyzed by LaLonde (1986) and find very low biases. Their findings have made propensity score matching the estimator du jour in the evaluation literature.
Dehejia and Wahba's (1999, 2002) finding of low bias from applying propensity score matching to LaLonde's (1986) data is surprising in light of the lessons learned from the analyses of Heckman, Ichimura and Todd (henceforth HIT) and Heckman, Ichimura, Smith and Todd (henceforth HIST) (Heckman et al., 1996, 1997a, 1998a) using the experimental data from the U.S. National Job Training Partnership Act (JTPA) Study. They conclude that in order for matching estimators to have low bias, it is important that the data include a rich set of variables related to program participation and labor market outcomes, that the nonexperimental comparison group be drawn from the same local labor markets as the participants, and that the dependent variable (typically earnings) be measured in the same way for participants and nonparticipants. All three of these conditions fail to hold in the NSW data analyzed by LaLonde (1986) and Dehejia and Wahba (1999, 2002).
In this paper, we reanalyze these data, applying both cross-sectional and longitudinal variants of propensity score matching. We find that the low bias estimates obtained by Dehejia and Wahba (1999, 2002) using various cross-sectional matching estimators are highly sensitive to their choice of a particular subsample of LaLonde's (1986) data for their analysis. We also find that changing the set of variables used to estimate the propensity scores strongly affects the estimated bias in LaLonde's original sample. At the same time, we find that difference-in-differences (DID) matching estimators exhibit better performance than the cross-sectional estimators. This is consistent with the evidence from the JTPA data in HIT (1997a) and HIST (1998a) on the importance of avoiding geographic mismatch and of measuring the dependent variable in the same way in the treatment and comparison groups. Both these sources of bias are likely to be relatively stable over time, and so should difference out. More generally, our findings make it clear that propensity score matching does not represent a “magic bullet” that solves the selection problem in every context. The implicit search for such an estimator in the literature cannot succeed. Instead, the optimal nonexperimental evaluation strategy in a given context depends critically on the available data and on the institutions governing selection into the program.
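The intuition for why stable sources of bias difference out can be made concrete in a small sketch. The function and data below are hypothetical illustrations, not the paper's implementation: each treated unit is paired with the comparison unit whose propensity score is closest, and the estimator averages treated-minus-comparison differences in before/after earnings changes.

```python
def did_match_att(y_pre_t, y_post_t, y_pre_c, y_post_c, p_t, p_c):
    """Difference-in-differences matching sketch: pair each treated unit
    with the comparison unit whose propensity score is closest (single
    nearest neighbor), then average the treated-minus-comparison
    difference in earnings *changes*.  Any time-invariant bias (e.g.
    geographic mismatch, earnings measured differently in the two
    samples) cancels in the before/after differencing."""
    effects = []
    for yp, ya, p in zip(y_pre_t, y_post_t, p_t):
        # nearest comparison unit by propensity score
        j = min(range(len(p_c)), key=lambda k: abs(p_c[k] - p))
        effects.append((ya - yp) - (y_post_c[j] - y_pre_c[j]))
    return sum(effects) / len(effects)
```

Because a fixed level difference in measured earnings enters both the pre- and post-program observations, it cancels in (post − pre) and never reaches the estimate.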
The plan of the paper is as follows. Section 2 reviews some key papers in the previous literature on the choice among alternative nonexperimental estimators. Section 3.1 lays out the evaluation problem and Section 3.2 briefly describes commonly used nonexperimental estimators. Section 3.3 describes the cross-sectional and difference-in-differences matching estimators that we focus on in our study. Sections 3.4 and 3.5 briefly address the issues of choice-based sampling and the bias that arises from incomplete matching, respectively. Section 3.6 explains how we use the experimental data to benchmark the performance of nonexperimental estimators. Section 4 describes the NSW program. Section 5 describes our analysis samples from the NSW data and the two comparison groups. Section 6 presents our estimated propensity scores and Section 7 discusses the “balancing tests” used in some recent studies to aid in selecting a propensity score specification. Sections 8 and 9 give the bias estimates obtained using matching and regression-based estimators, respectively. Section 10 displays evidence on the use of specification tests applied to our cross-sectional matching estimators and Section 11 concludes.
Previous research
Several previous papers use data from the National Supported Work Demonstration experiment to study the performance of econometric estimators. LaLonde (1986) was the first and the data we use come from his study. He arranged the NSW data into two samples: one of AFDC women and one of disadvantaged men. The comparison group subsamples were constructed from two national survey datasets: the CPS and the PSID. LaLonde (1986) applies a number of standard evaluation estimators, including simple
The evaluation problem
Assessing the impact of any intervention requires making an inference about the outcomes that would have been observed for program participants had they not participated. Denote by Y1 the outcome conditional on participation and by Y0 the outcome conditional on non-participation, so that the impact of participating in the program is Δ = Y1 − Y0. For each person, only Y1 or Y0 is observed, so Δ is not observed for anyone. This missing data problem lies at the heart of the evaluation problem.
Let D=1
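Restated in standard potential-outcomes notation (the textbook formulation, with D the participation indicator as above):

```latex
\Delta = Y_1 - Y_0, \qquad
Y = D\,Y_1 + (1 - D)\,Y_0, \qquad
\mathrm{ATT} = E[Y_1 - Y_0 \mid D = 1].
```

Since E[Y0 | D = 1] is never observed, every evaluation estimator amounts to a particular way of constructing that counterfactual mean from nonparticipants.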
The national supported work demonstration
The National Supported Work (NSW) Demonstration was a transitional, subsidized work experience program that operated for 4 years at 15 locations throughout the United States. It served four target groups: female long-term AFDC recipients, ex-drug addicts, ex-offenders, and young school dropouts. The program first provided trainees with work in a
Samples
In this study, we consider three experimental samples and two nonexperimental comparison groups. All of the samples are based on the male samples from LaLonde (1986). The experimental sample includes male respondents in the NSW's ex-addict, ex-offender and high school dropout target groups who had valid pre- and post-program earnings data.
The first experimental sample is the same
Propensity scores
We present matching estimates based on two alternative specifications of the propensity score, Pr(D=1|Z). The first specification is that employed in Dehejia and Wahba (1999, 2002); the second specification is based on LaLonde (1986). Although LaLonde does not consider matching estimators, he estimates a probability of participation in the course of implementing the classical selection estimator of Heckman (1979). In both cases, we use the logit model to estimate the scores.
The
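The logit functional form for the score can be sketched as follows. The coefficient vector here is a placeholder supplied by hand for illustration; in the paper the coefficients are estimated by maximum likelihood on the pooled participant/comparison sample.

```python
import math

def propensity_score(z, beta):
    """Logit propensity score P(D=1 | Z=z) = 1 / (1 + exp(-z'beta)).
    `z` includes a leading 1 for the intercept; `beta` is the coefficient
    vector (supplied directly here rather than estimated)."""
    index = sum(zk * bk for zk, bk in zip(z, beta))
    return 1.0 / (1.0 + math.exp(-index))
```

A zero index maps to a score of 0.5, and the score is monotonically increasing in any covariate with a positive coefficient.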
Variable selection and the balancing test
Under the conditional mean independence assumption required for application of propensity score matching, the outcome variable must be conditionally mean independent of treatment conditional on the propensity score, P(Z). Implementing matching requires choosing a set of variables Z that plausibly satisfy this condition. This set should include all of the key factors affecting both program participation and outcomes—that is, all the variables that affect both D and Y0. No mechanical algorithm
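One simple balance diagnostic in this spirit is the standardized difference in covariate means between treated and comparison observations (for example, within a propensity-score stratum). The helper below is an illustrative sketch, not the specific stratum-wise tests used in the studies discussed here.

```python
def standardized_difference(x_treat, x_comp):
    """Standardized difference in means of one covariate between treated
    and comparison groups: (mean difference) / pooled standard deviation.
    Values near zero indicate balance; large values flag covariates whose
    distributions differ even after conditioning on the score."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled_sd = ((var(x_treat) + var(x_comp)) / 2.0) ** 0.5
    return (mean(x_treat) - mean(x_comp)) / pooled_sd
```

A balance check like this cannot substitute for substantive knowledge of what drives D and Y0; it only diagnoses imbalance in the variables already chosen.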
Matching estimates
We now present our estimates of the bias obtained when we apply matching to the experimental NSW data and the two different nonexperimental comparison groups. Our estimation strategy differs somewhat from that of LaLonde (1986) and Dehejia and Wahba (1999, 2002) in that we obtain direct estimates of the bias by applying matching to the randomized-out control group and the nonexperimental comparison groups, whereas the other papers obtain the bias indirectly by applying matching
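The direct-bias logic can be sketched in a few lines (a hypothetical single-nearest-neighbor helper, not the paper's estimator): matching is applied with the experimental controls in the role of the "treated" group, and since controls received no treatment, any nonzero estimated "effect" is pure selection bias.

```python
def matching_bias(y_control, p_control, y_comp, p_comp):
    """Direct bias estimate: single-nearest-neighbor propensity-score
    matching of randomized-out experimental controls to nonexperimental
    comparison-group members.  The true treatment effect for controls is
    zero, so the average matched outcome gap measures selection bias."""
    total = 0.0
    for y, p in zip(y_control, p_control):
        # comparison member with the closest propensity score
        j = min(range(len(p_comp)), key=lambda k: abs(p_comp[k] - p))
        total += y - y_comp[j]
    return total / len(y_control)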
Regression-based estimates
We next present bias estimates obtained using a number of standard, regression-based impact estimators for each of the three experimental samples and both comparison groups. We seek answers to two questions. First, how well do these estimators perform in the different samples? We have argued that the DW sample may implicitly present a less difficult selection problem than the original LaLonde sample due to its inclusion of persons randomly assigned late in the experiment only if they had zero
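As one concrete instance of a regression-based estimator, the sketch below (a hypothetical helper with a single covariate, not the paper's specification) computes the coefficient on a group dummy after linear covariate adjustment, using the Frisch-Waugh-Lovell partialling-out steps.

```python
def ols_bias(y, d, x):
    """Coefficient on the dummy d in the regression y = a + b*d + c*x + e,
    computed by Frisch-Waugh-Lovell: residualize y and d on x (after
    demeaning), then regress the y-residuals on the d-residuals.  With d
    marking experimental controls versus comparison members, b is a
    regression-adjusted bias estimate."""
    n = len(y)
    def demean(v):
        m = sum(v) / n
        return [vi - m for vi in v]
    def resid(a, b):
        # residual of a after a no-intercept regression on b
        bb = sum(bi * bi for bi in b)
        slope = sum(ai * bi for ai, bi in zip(a, b)) / bb
        return [ai - slope * bi for ai, bi in zip(a, b)]
    yd, dd, xd = demean(y), demean(d), demean(x)
    ry, rd = resid(yd, xd), resid(dd, xd)
    return sum(a * b for a, b in zip(ry, rd)) / sum(b * b for b in rd)
```

Unlike matching, this imposes a linear functional form on the relation between the covariate and earnings, which is one reason its performance can differ across samples.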
Specification tests
As discussed in Section 2, Heckman and Hotz (1989) found that, when they applied two types of specification tests to the NSW data, they were able to rule out those estimators that implied a different qualitative conclusion than the experimental impact estimates. In this section, we apply one of the specification tests that they use to the cross-sectional matching estimators presented in Table 5. The test we apply is the pre-program alignment test, in which each candidate estimator is applied
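The logic of the pre-program alignment test can be summarized in a toy decision rule (the function name and the 1.96 critical value are illustrative assumptions): a candidate estimator is applied to earnings from a period before the program, where the true effect is zero by construction, and estimators with a significant pre-program "effect" are discarded.

```python
def passes_preprogram_test(pre_effect, std_error, critical=1.96):
    """Pre-program alignment test sketch: `pre_effect` is the candidate
    estimator's estimated 'impact' on pre-program earnings and
    `std_error` its standard error.  Since no one has yet been treated,
    a t-statistic beyond the critical value rejects the estimator."""
    t_stat = pre_effect / std_error
    return abs(t_stat) <= critical  # True means the estimator survives
```

Passing the test is necessary but not sufficient: an estimator can align pre-program earnings yet still be biased for post-program outcomes.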
Summary and conclusions
Our analysis of the data from the National Supported Work Demonstration yields three main conclusions. First, our evidence leads us to question recent claims in the literature by Dehejia and Wahba (1999, 2002) and others regarding the general effectiveness of matching estimators relative to more traditional econometric methods. While we are able to replicate the low bias estimates reported in the Dehejia and Wahba (1999, 2002) studies, we conclude that their
Acknowledgements
We thank Robert LaLonde for providing us with the data from his 1986 study. We thank Rajeev Dehejia for providing us with information helpful in reconstructing the samples used in the Dehejia and Wahba (1999, 2002) studies. We thank seminar participants at Boston College, the CEA meetings, CILN, the Department of Family and Community Services of Australia, Econometric Society (European meetings), the GAAC conference on the Evaluation of Active Labor Market Policies, IFS, IRP, IZA, Kentucky,
References
- et al. An evaluation of public employment programmes in the East German state of Sachsen-Anhalt. Labour Economics (2002)
- et al. Labour market training in Norway—effect on earnings. Labour Economics (2002)
- A nonexperimental evaluation of training programs for the unemployed in Sweden. Labour Economics (2002)
- Lifetime earnings and the Vietnam draft lottery: evidence from Social Security Administrative records. American Economic Review (1990)
- Estimating the labor market impact of voluntary military service using Social Security data on military applicants. Econometrica (1998)
- Angrist, J., Hahn, J., 1999. When to control for covariates? Panel asymptotics for estimates of treatment effects. NBER...
- Estimating the effect of training programs on earnings. Review of Economics and Statistics (1978)
- et al. Using the longitudinal structure of earnings to estimate the effect of training programs. Review of Economics and Statistics (1985)
- The impact of CETA programs on earnings: a review of the literature. Journal of Human Resources (1987)
- Estimating the effects of training programs with nonrandom selection. Review of Economics and Statistics (1984)
- Evaluation methods for non-experimental data. Fiscal Studies
- The case for randomized field trials in economic and policy research. Journal of Economic Perspectives
- Are classical experiments needed for manpower policy? Journal of Human Resources
- Measuring the effect of subsidized training programs on movements in and out of employment. Econometrica
- Controlling bias in observational studies. Sankhyā
- New evidence on the long-term effects of employment and training programs. Journal of Labor Economics
- Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. Journal of the American Statistical Association
- Propensity score matching methods for nonexperimental causal studies. Review of Economics and Statistics
- The impact of being offered and receiving classroom training on the employment histories of disadvantaged women: evidence from experimental data. Review of Economic Studies
- Design adaptive nonparametric regression. Journal of the American Statistical Association
- Local linear regression smoothers and their minimax efficiencies. The Annals of Statistics
- Local Polynomial Modelling and its Applications
- The adequacy of comparison group designs for evaluations of employment related programs. Journal of Human Resources
- Evaluating program evaluations: new evidence on commonly used nonexperimental methods. American Economic Review
- Finite-sample properties of propensity score matching and weighting estimators. Review of Economics and Statistics
- On the role of the propensity score in efficient estimation of average treatment effects. Econometrica
1 Affiliated with the National Bureau of Economic Research (NBER) and the IZA.