
Journal of Econometrics

Volume 125, Issues 1–2, March–April 2005, Pages 305-353

Does matching overcome LaLonde's critique of nonexperimental estimators?

https://doi.org/10.1016/j.jeconom.2004.04.011

Abstract

This paper applies cross-sectional and longitudinal propensity score matching estimators to data from the National Supported Work (NSW) Demonstration that have been previously analyzed by LaLonde (1986) and Dehejia and Wahba (1999, 2002). We find that estimates of the impact of NSW based on propensity score matching are highly sensitive to both the set of variables included in the scores and the particular analysis sample used in the estimation. Among the estimators we study, the difference-in-differences matching estimator performs the best. We attribute its performance to the fact that it eliminates potential sources of temporally invariant bias present in the NSW data, such as geographic mismatch between participants and nonparticipants and the use of a dependent variable measured in different ways for the two groups. Our analysis demonstrates that while propensity score matching is a potentially useful econometric tool, it does not represent a general solution to the evaluation problem.

Introduction

There is a long-standing debate in the literature over whether social programs can be reliably evaluated without a randomized experiment. Randomization has a key advantage over nonexperimental methods in generating a control group that has the same distributions of both observed and unobserved characteristics as the treatment group. At the same time, social experimentation also has some drawbacks, such as high cost, the potential to distort the operation of an ongoing program, the common problem of program sites refusing to participate in the experiment, and the problem of randomized-out controls seeking alternative forms of treatment. In contrast, evaluation methods that use nonexperimental data tend to be less costly and less intrusive. Also, for some questions of interest, they are the only alternative.

The major obstacle in implementing a nonexperimental evaluation strategy is choosing among the wide variety of estimation methods available in the literature. This choice is important given the accumulated evidence that impact estimates are often highly sensitive to the estimator chosen. A literature has arisen, starting with LaLonde (1986), that evaluates the performance of nonexperimental estimators using experimental data as a benchmark. Much of this literature implicitly frames the question as one of searching for “the” nonexperimental estimator that will always solve the selection bias problem inherent in nonexperimental evaluations. Two recent contributions to this literature by Dehejia and Wahba (DW) (1999, 2002) have drawn attention to a class of estimators called propensity score matching estimators. They apply these matching estimators to the same experimental data from the National Supported Work (NSW) Demonstration, and the same nonexperimental data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID), analyzed by LaLonde (1986) and find very low biases. Their findings have made propensity score matching the estimator du jour in the evaluation literature.

Dehejia and Wahba's (1999, 2002) finding of low bias from applying propensity score matching to LaLonde's (1986) data is surprising in light of the lessons learned from the analyses of Heckman, Ichimura and Todd and Heckman, Ichimura, Smith and Todd (Heckman et al., 1996, 1997a, 1998a; henceforth HIT and HIST) using the experimental data from the U.S. National Job Training Partnership Act (JTPA) Study. They conclude that in order for matching estimators to have low bias, it is important that the data include a rich set of variables related to program participation and labor market outcomes, that the nonexperimental comparison group be drawn from the same local labor markets as the participants, and that the dependent variable (typically earnings) be measured in the same way for participants and nonparticipants. All three of these conditions fail to hold in the NSW data analyzed by LaLonde (1986) and Dehejia and Wahba (1999, 2002).

In this paper, we reanalyze these data, applying both cross-sectional and longitudinal variants of propensity score matching. We find that the low bias estimates obtained by Dehejia and Wahba (1999, 2002) using various cross-sectional matching estimators are highly sensitive to their choice of a particular subsample of LaLonde's (1986) data for their analysis. We also find that changing the set of variables used to estimate the propensity scores strongly affects the estimated bias in LaLonde's original sample. At the same time, we find that difference-in-differences (DID) matching estimators exhibit better performance than the cross-sectional estimators. This is consistent with the evidence from the JTPA data in HIT (1997a) and HIST (1998a) on the importance of avoiding geographic mismatch and of measuring the dependent variable in the same way in the treatment and comparison groups. Both these sources of bias are likely to be relatively stable over time, and so should difference out. More generally, our findings make it clear that propensity score matching does not represent a “magic bullet” that solves the selection problem in every context. The implicit search for such an estimator in the literature cannot succeed. Instead, the optimal nonexperimental evaluation strategy in a given context depends critically on the available data and on the institutions governing selection into the program.
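The intuition for why DID matching removes temporally invariant bias can be illustrated with a minimal sketch. The arrays, propensity scores, and single-nearest-neighbor rule below are hypothetical stand-ins, not the kernel and local linear weighting schemes actually examined in the paper:

```python
import numpy as np

def did_nn_match(p_t, y_pre_t, y_post_t, p_c, y_pre_c, y_post_c):
    """Nearest-neighbor DID matching estimate of the mean impact on the treated."""
    effects = []
    for i in range(len(p_t)):
        j = int(np.argmin(np.abs(p_c - p_t[i])))      # closest comparison on the score
        effects.append((y_post_t[i] - y_pre_t[i])     # treated before-after change
                       - (y_post_c[j] - y_pre_c[j]))  # minus matched comparison change
    return float(np.mean(effects))
```

Any additive, time-invariant difference between the groups, such as a fixed gap arising from measuring earnings differently in the two datasets, enters both the pre- and post-program levels of each matched comparison and so cancels in the double difference.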

The plan of the paper is as follows. Section 2 reviews some key papers in the previous literature on the choice among alternative nonexperimental estimators. Section 3.1 lays out the evaluation problem and Section 3.2 briefly describes commonly used nonexperimental estimators. Section 3.3 describes the cross-sectional and difference-in-differences matching estimators that we focus on in our study. Sections 3.4 and 3.5 briefly address the issues of choice-based sampling and the bias that arises from incomplete matching, respectively. Section 3.6 explains how we use the experimental data to benchmark the performance of nonexperimental estimators. Section 4 describes the NSW program. Section 5 describes our analysis samples from the NSW data and the two comparison groups. Section 6 presents our estimated propensity scores and Section 7 discusses the “balancing tests” used in some recent studies to aid in selecting a propensity score specification. Sections 8 and 9 give the bias estimates obtained using matching and regression-based estimators, respectively. Section 10 displays evidence on the use of specification tests applied to our cross-sectional matching estimators and Section 11 concludes.

Section snippets

Previous research

Several previous papers use data from the National Supported Work Demonstration experiment to study the performance of econometric estimators. LaLonde (1986) was the first and the data we use come from his study. He arranged the NSW data into two samples: one of AFDC women and one of disadvantaged men. The comparison group subsamples were constructed from two national survey datasets: the CPS and the PSID. LaLonde (1986) applies a number of standard evaluation estimators, including simple

The evaluation problem

Assessing the impact of any intervention requires making an inference about the outcomes that would have been observed for program participants had they not participated. Denote by Y1 the outcome conditional on participation and by Y0 the outcome conditional on non-participation, so that the impact of participating in the program is Δ = Y1 − Y0. For each person, only Y1 or Y0 is observed, so Δ is not observed for anyone. This missing data problem lies at the heart of the evaluation problem.
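A tiny simulation with made-up numbers illustrates the missing-data structure: both potential outcomes exist in the simulated world, but the analyst observes only one per person.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y0 = rng.normal(10.0, 2.0, n)        # Y0: outcome without the program
y1 = y0 + 1.5                        # Y1: outcome with the program (true impact 1.5)
d = rng.integers(0, 2, n)            # D: treatment indicator

# Only one of Y1, Y0 is ever observed for a given person; Delta = Y1 - Y0
# is known here only because the data are simulated.
y_obs = np.where(d == 1, y1, y0)
naive_diff = y_obs[d == 1].mean() - y_obs[d == 0].mean()
```

Under random assignment, as in this simulation, the treatment-control mean difference `naive_diff` estimates the mean impact; with self-selected nonexperimental data the same difference would also absorb selection bias.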

Let D=1

The national supported work demonstration

The National Supported Work (NSW) Demonstration was a transitional, subsidized work experience program that operated for 4 years at 15 locations throughout the United States. It served four target groups: female long-term AFDC recipients, ex-drug addicts, ex-offenders, and young school dropouts. The program first provided trainees with work in a

Samples

In this study, we consider three experimental samples and two nonexperimental comparison groups. All of the samples are based on the male samples from LaLonde (1986). The experimental sample includes male respondents in the NSW's ex-addict, ex-offender and high school dropout target groups who had valid pre- and post-program earnings data.

The first experimental sample is the same

Propensity scores

We present matching estimates based on two alternative specifications of the propensity score, Pr(D=1|Z). The first specification is that employed in Dehejia and Wahba 1999, Dehejia and Wahba 2002; the second specification is based on LaLonde (1986). Although LaLonde does not consider matching estimators, he estimates a probability of participation in the course of implementing the classical selection estimator of Heckman (1979). In both cases, we use the logit model to estimate the scores.
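As a concrete sketch of the estimation step, the logit can be fit by Newton-Raphson on a simulated design; the two-column Z below is a hypothetical stand-in for the demographic and pre-program earnings variables that enter the actual score specifications.

```python
import numpy as np

def fit_logit(Z, d, iters=25):
    """Maximum-likelihood logit coefficients and fitted scores Pr(D=1|Z)."""
    X = np.column_stack([np.ones(len(d)), Z])     # add an intercept
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))       # fitted participation probabilities
        grad = X.T @ (d - p)                      # score of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)       # Newton step
    return beta, 1.0 / (1.0 + np.exp(-X @ beta))

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2))
d = (rng.random(500) < 1.0 / (1.0 + np.exp(-(0.5 + Z[:, 0])))).astype(float)
beta, scores = fit_logit(Z, d)
```

The fitted `scores` are the estimated propensity scores on which treated and comparison observations are subsequently matched.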

The

Variable selection and the balancing test

Under the conditional mean independence assumption required for application of propensity score matching, the outcome variable must be conditionally mean independent of treatment conditional on the propensity score, P(Z). Implementing matching requires choosing a set of variables Z that plausibly satisfy this condition. This set should include all of the key factors affecting both program participation and outcomes—that is, all the variables that affect both D and Y0. No mechanical algorithm
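One commonly used balancing diagnostic can be sketched as follows; the quantile stratification and the inputs are illustrative assumptions, not the exact procedure of any one study. Within strata of the estimated score, covariate means for the D=1 and D=0 groups are compared via standardized differences, and large values flag a score specification that fails to balance that covariate.

```python
import numpy as np

def standardized_diffs(z, d, scores, n_strata=5):
    """Standardized mean difference of covariate z within score strata."""
    edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))
    diffs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_s = (scores >= lo) & (scores <= hi)
        z1, z0 = z[in_s & (d == 1)], z[in_s & (d == 0)]
        if len(z1) and len(z0):
            pooled_sd = np.sqrt((z1.var() + z0.var()) / 2)
            diffs.append((z1.mean() - z0.mean()) / pooled_sd if pooled_sd > 0 else 0.0)
    return diffs

rng = np.random.default_rng(2)
z = rng.normal(size=2000)                   # a covariate unrelated to D here
d = rng.integers(0, 2, 2000).astype(float)
scores = rng.random(2000)                   # stand-in estimated scores
diffs = standardized_diffs(z, d, scores)
```

Because `z` is generated independently of `d` in this sketch, all within-stratum differences should be small; a covariate that drives participation but is omitted from the score would instead produce large differences.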

Matching estimates

We now present our estimates of the bias obtained when we apply matching to the experimental NSW data and the two different nonexperimental comparison groups. Our estimation strategy differs somewhat from that of LaLonde (1986) and Dehejia and Wahba (1999, 2002) in that we obtain direct estimates of the bias by applying matching to the randomized-out control group and the nonexperimental comparison groups, whereas the other papers obtain the bias indirectly by applying matching
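The direct bias calculation can be sketched with hypothetical inputs: because the randomized-out controls received no treatment, any nonzero "impact" obtained by matching them to the nonexperimental comparison group is itself an estimate of the bias.

```python
import numpy as np

def matching_bias(p_ctrl, y_ctrl, p_comp, y_comp):
    """Mean control-minus-matched-comparison outcome: a direct bias estimate."""
    matched = np.array([y_comp[int(np.argmin(np.abs(p_comp - p)))] for p in p_ctrl])
    return float(np.mean(y_ctrl - matched))
```

A bias estimate near zero indicates that, for these data and this score specification, matching successfully reproduces the experimental control group's outcomes from nonexperimental data.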

Regression-based estimates

We next present bias estimates obtained using a number of standard, regression-based impact estimators for each of the three experimental samples and both comparison groups. We seek answers to two questions. First, how well do these estimators perform in the different samples? We have argued that the DW sample may implicitly present a less difficult selection problem than the original LaLonde sample due to its inclusion of persons randomly assigned late in the experiment only if they had zero

Specification tests

As discussed in Section 2, Heckman and Hotz (1989) found that, when they applied two types of specification tests to the NSW data, they were able to rule out those estimators that implied a qualitative conclusion different from that of the experimental impact estimates. In this section, we apply one of the specification tests that they use to the cross-sectional matching estimators presented in Table 5. The test we apply is the pre-program alignment test, in which each candidate estimator is applied
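The logic of the pre-program alignment test can be sketched as follows (the arrays and the nearest-neighbor estimator are illustrative): apply the candidate estimator to earnings from a period before the program started; since the program cannot have affected earnings that predate it, an estimated "impact" significantly different from zero is evidence against that estimator.

```python
import numpy as np

def preprogram_alignment(p_t, y_pre_t, p_c, y_pre_c):
    """Estimated 'impact' on pre-program earnings and its t-statistic;
    a large t-statistic rejects the candidate matching estimator."""
    matched = np.array([y_pre_c[int(np.argmin(np.abs(p_c - p)))] for p in p_t])
    gaps = y_pre_t - matched
    t = gaps.mean() / (gaps.std(ddof=1) / np.sqrt(len(gaps)))
    return float(gaps.mean()), float(t)
```

An estimator that aligns pre-program earnings passes the test; one that leaves a systematic pre-program gap between matched groups fails it.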

Summary and conclusions

Our analysis of the data from the National Supported Work Demonstration yields three main conclusions. First, our evidence leads us to question recent claims in the literature by Dehejia and Wahba (1999, 2002) and others regarding the general effectiveness of matching estimators relative to more traditional econometric methods. While we are able to replicate the low bias estimates reported in the Dehejia and Wahba (1999, 2002) studies, we conclude that their

Acknowledgements

We thank Robert LaLonde for providing us with the data from his 1986 study. We thank Rajeev Dehejia for providing us with information helpful in reconstructing the samples used in the Dehejia and Wahba (1999, 2002) studies. We thank seminar participants at Boston College, the CEA meetings, CILN, the Department of Family and Community Services of Australia, Econometric Society (European meetings), the GAAC conference on the Evaluation of Active Labor Market Policies, IFS, IRP, IZA, Kentucky,

References (57)

  • M. Eichler et al.

    An evaluation of public employment programmes in the East German state of Sachsen-Anhalt

    Labour Economics

    (2002)
  • O. Raaum et al.

    Labour market training in Norway—effect on earnings

    Labour Economics

    (2002)
  • H. Regnér

    A nonexperimental evaluation of training programs for the unemployed in Sweden

    Labour Economics

    (2002)
  • J. Angrist

    Lifetime earnings and the Vietnam draft lottery: evidence from Social Security Administrative records

    American Economic Review

    (1990)
  • J. Angrist

    Estimating the labor market impact of voluntary military service using Social Security data on military applicants

    Econometrica

    (1998)
  • Angrist, J., Hahn, J., 1999. When to control for covariates? Panel asymptotics for estimates of treatment effects. NBER...
  • O. Ashenfelter

    Estimating the effect of training programs on earnings

    Review of Economics and Statistics

    (1978)
  • O. Ashenfelter et al.

    Using the longitudinal structure of earnings to estimate the effect of training programs

    Review of Economics and Statistics

    (1985)
  • B. Barnow

    The impact of CETA programs on earnings: a review of the literature

    Journal of Human Resources

    (1987)
  • L. Bassi

    Estimating the effects of training programs with nonrandom selection

    Review of Economics and Statistics

    (1984)
  • R. Blundell et al.

    Evaluation methods for non-experimental data

    Fiscal Studies

    (2000)
  • G. Burtless

    The case for randomized field trials in economic and policy research

    Journal of Economic Perspectives

    (1995)
  • G. Burtless et al.

    Are classical experiments needed for manpower policy?

    Journal of Human Resources

    (1986)
  • D. Card et al.

    Measuring the effect of subsidized training programs on movements in and out of employment

    Econometrica

    (1988)
  • W. Cochran et al.

    Controlling bias in observational studies

    Sankhyā

    (1973)
  • K. Couch

    New evidence on the long-term effects of employment and training programs

    Journal of Labor Economics

    (1992)
  • R. Dehejia et al.

    Causal effects in nonexperimental studies: reevaluating the evaluation of training programs

    Journal of the American Statistical Association

    (1999)
  • R. Dehejia et al.

    Propensity score matching methods for nonexperimental causal studies

    Review of Economics and Statistics

    (2002)
  • Devine, T., Heckman, J., 1996. The structure and consequences of eligibility rules for a social program: a study of the...
  • C. Eberwein et al.

    The impact of being offered and receiving classroom training on the employment histories of disadvantaged women: evidence from experimental data

    Review of Economic Studies

    (1997)
  • J. Fan

    Design-adaptive nonparametric regression

    Journal of the American Statistical Association

    (1992)
  • J. Fan

    Local linear regression smoothers and their minimax efficiencies

    The Annals of Statistics

    (1992)
  • J. Fan et al.

    Local Polynomial Modelling and its Applications

    (1996)
  • T. Fraker et al.

    The adequacy of comparison group designs for evaluations of employment related programs

    Journal of Human Resources

    (1987)
  • D. Friedlander et al.

    Evaluating program evaluations: new evidence on commonly used nonexperimental methods

    American Economic Review

    (1995)
  • M. Frölich

    Finite-sample properties of propensity score matching and weighting estimators

    Review of Economics and Statistics

    (2004)
  • J. Hahn

    On the role of the propensity score in efficient estimation of average treatment effects

    Econometrica

    (1998)
  • Ham, J., Li, X., Reagan, P., 2003. Propensity score matching, a distance-based measure of migration, and the wage...

    1 Affiliated with the National Bureau of Economic Research (NBER) and the IZA.
