
Journal of Economic Literature 48 (June 2010): 281–355

http://www.aeaweb.org/articles.php?doi=10.1257/jel.48.2.281

Regression Discontinuity Designs in Economics

David S. Lee and Thomas Lemieux*

This paper provides an introduction and “user guide” to Regression Discontinuity (RD) designs for empirical researchers. It presents the basic theory behind the research design, details when RD is likely to be valid or invalid given economic incentives, explains why it is considered a “quasi-experimental” design, and summarizes different ways (with their advantages and disadvantages) of estimating RD designs and the limitations of interpreting these estimates. Concepts are discussed using examples drawn from the growing body of empirical research using RD. (JEL C21, C31)

1.  Introduction

Regression Discontinuity (RD) designs were first introduced by Donald L. Thistlethwaite and Donald T. Campbell (1960) as a way of estimating treatment effects in a nonexperimental setting where treatment is determined by whether an observed “assignment” variable (also referred to in the literature as the “forcing” variable or the “running” variable) exceeds a known cutoff point. In their initial application of RD designs, Thistlethwaite and Campbell (1960) analyzed the impact of merit awards on future academic outcomes, using the fact that the allocation of these awards was based on an observed test score. The main idea behind the research design was that individuals with scores just below the cutoff (who did not receive the award) were good comparisons to those just above the cutoff (who did receive the award). Although this evaluation strategy has been around for almost fifty years, it did not attract much attention in economics until relatively recently.

Since the late 1990s, a growing number of studies have relied on RD designs to estimate program effects in a wide variety of economic contexts. Like Thistlethwaite and Campbell (1960), early studies by Wilbert van der Klaauw (2002) and Joshua D. Angrist and Victor Lavy (1999) exploited threshold rules often used by educational institutions to estimate the effect of financial aid and class size, respectively, on educational outcomes. Sandra E. Black (1999) exploited the presence of discontinuities at the geographical level (school district

* Lee: Princeton University and NBER. Lemieux: University of British Columbia and NBER. We thank David Autor, David Card, John DiNardo, Guido Imbens, and Justin McCrary for suggestions for this article, as well as for numerous illuminating discussions on the various topics we cover in this review. We also thank two anonymous referees for their helpful suggestions and comments, and Damon Clark, Mike Geruso, Andrew Marder, and Zhuan Pei for their careful reading of earlier drafts. Diane Alexander, Emily Buchsbaum, Elizabeth Debraggio, Enkeleda Gjeci, Ashley Hodgson, Yan Lau, Pauline Leung, and Xiaotong Niu provided excellent research assistance.


boundaries) to estimate the willingness to pay for good schools. Following these early papers in the area of education, the past five years have seen a rapidly growing literature using RD designs to examine a range of questions. Examples include the labor supply effect of welfare, unemployment insurance, and disability programs; the effects of Medicaid on health outcomes; the effect of remedial education programs on educational achievement; the empirical relevance of median voter models; and the effects of unionization on wages and employment.

One important impetus behind this recent flurry of research is a recognition, formalized by Jinyong Hahn, Petra Todd, and van der Klaauw (2001), that RD designs require seemingly mild assumptions compared to those needed for other nonexperimental approaches. Another reason for the recent wave of research is the belief that the RD design is not “just another” evaluation strategy, and that causal inferences from RD designs are potentially more credible than those from typical “natural experiment” strategies (e.g., difference-in-differences or instrumental variables), which have been heavily employed in applied research in recent decades. This notion has a theoretical justification: David S. Lee (2008) formally shows that one need not assume the RD design isolates treatment variation that is “as good as randomized”; instead, such randomized variation is a consequence of agents’ inability to precisely control the assignment variable near the known cutoff.

So while the RD approach was initially thought to be “just another” program evaluation method with relatively little general applicability outside of a few specific problems, recent work in economics has shown quite the opposite.1 In addition to providing a highly credible and transparent way of estimating program effects, RD designs can be used in a wide variety of contexts covering a large number of important economic questions. These two facts likely explain why the RD approach is rapidly becoming a major element in the toolkit of empirical economists.

Despite the growing importance of RD designs in economics, there is no single comprehensive summary of what is understood about RD designs—when they succeed, when they fail, and their strengths and weaknesses.2 Furthermore, the “nuts and bolts” of implementing RD designs in practice are not (yet) covered in standard econometrics texts, making it difficult for researchers interested in applying the approach to do so. Broadly speaking, the main goal of this paper is to fill these gaps by providing an up-to-date overview of RD designs in economics and creating a guide for researchers interested in applying the method.

A reading of the most recent research reveals a certain body of “folk wisdom” regarding the applicability, interpretation, and recommendations of practically implementing RD designs. This article represents our attempt at summarizing what we believe to be the most important pieces of this wisdom, while also dispelling misconceptions that could potentially (and understandably) arise for those new to the RD approach.

We will now briefly summarize the most important points about RD designs to set the stage for the rest of the paper where we systematically discuss identification, interpretation, and estimation issues. Here, and throughout the paper, we refer to the assignment variable as X. Treatment is, thus,

1 See Thomas D. Cook (2008) for an interesting history of the RD design in education research, psychology, statistics, and economics. Cook argues the resurgence of the RD design in economics is unique as it is still rarely used in other disciplines.

2 See, however, two recent overview papers by van der Klaauw (2008b) and Guido W. Imbens and Thomas Lemieux (2008) that have begun bridging this gap.

assigned to individuals (or “units”) with a value of X greater than or equal to a cutoff value c.

• RD designs can be invalid if individuals can precisely manipulate the “assignment variable.”

When there is a payoff or benefit to receiving a treatment, it is natural for an economist to consider how an individual may behave to obtain such benefits. For example, if students could effectively “choose” their test score X through effort, those who chose a score c (and hence received the merit award) could be somewhat different from those who chose scores just below c. The important lesson here is that the existence of a treatment being a discontinuous function of an assignment variable is not sufficient to justify the validity of an RD design. Indeed, if anything, discontinuous rules may generate incentives, causing behavior that would invalidate the RD approach.

• If individuals—even while having some influence—are unable to precisely manipulate the assignment variable, a consequence of this is that the variation in treatment near the threshold is randomized as though from a randomized experiment.

This is a crucial feature of the RD design, since it is the reason RD designs are often so compelling. Intuitively, when individuals have imprecise control over the assignment variable, even if some are especially likely to have values of X near the cutoff, every individual will have approximately the same probability of having an X that is just above (receiving the treatment) or just below (being denied the treatment) the cutoff—similar to a coin-flip experiment. This result clearly differentiates the RD and instrumental variables (IV) approaches. When using IV for causal inference, one must assume the instrument is exogenously generated as if by a coin-flip. Such an assumption is often difficult to justify (except when an actual lottery was run, as in Angrist (1990), or if there were some biological process, e.g., gender determination of a baby, mimicking a coin-flip). By contrast, the variation that RD designs isolate is randomized as a consequence of the assumption that individuals have imprecise control over the assignment variable.

• RD designs can be analyzed—and tested—like randomized experiments.

This is the key implication of the local randomization result. If variation in the treatment near the threshold is approximately randomized, then it follows that all “baseline characteristics”—all those variables determined prior to the realization of the assignment variable—should have the same distribution just above and just below the cutoff. If there is a discontinuity in these baseline covariates, then at a minimum, the underlying identifying assumption of individuals’ inability to precisely manipulate the assignment variable is unwarranted. Thus, the baseline covariates are used to test the validity of the RD design. By contrast, when employing an IV or a matching/regression-control strategy, assumptions typically need to be made about the relationship of these other covariates to the treatment and outcome variables.3

• Graphical presentation of an RD design is helpful and informative, but the visual presentation should not be

3 Typically, one assumes that, conditional on the covariates, the treatment (or instrument) is essentially “as good as” randomly assigned.
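The covariate-balance test described in the bullet above can be sketched in a few lines of Python. This is our illustration, not code from the paper: compare the mean of a predetermined covariate in a narrow window on each side of the cutoff, where a large gap would call the design into question.

```python
def local_means_gap(x, w, c, h):
    """Difference in the mean of a baseline covariate w between
    observations just above and just below the cutoff c,
    using a window of width h on each side."""
    below = [wi for xi, wi in zip(x, w) if c - h <= xi < c]
    above = [wi for xi, wi in zip(x, w) if c <= xi < c + h]
    if not below or not above:
        raise ValueError("no observations within the window")
    return sum(above) / len(above) - sum(below) / len(below)

# Toy check: a covariate that moves smoothly with x should be
# roughly balanced across the cutoff in a narrow window.
x = [1.0, 1.8, 1.9, 2.0, 2.1, 3.0]
w = [10.0, 10.8, 10.9, 11.0, 11.1, 12.0]   # smooth in x, no jump at c
gap = local_means_gap(x, w, c=2.0, h=0.5)  # small relative to the scale of w
```

In practice one would run this (or a regression analogue with standard errors) for every predetermined covariate available.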

tilted toward either finding an effect or finding no effect.

It has become standard to summarize RD analyses with a simple graph showing the relationship between the outcome and assignment variables. This has several advantages. The presentation of the “raw data” enhances the transparency of the research design. A graph can also give the reader a sense of whether the “jump” in the outcome variable at the cutoff is unusually large compared to the bumps in the regression curve away from the cutoff. Also, a graphical analysis can help identify why different functional forms give different answers, and can help identify outliers, which can be a problem in any empirical analysis. The problem with graphical presentations, however, is that there is some room for the researcher to construct graphs making it seem as though there are effects when there are none, or hiding effects that truly exist. We suggest later in the paper a number of methods to minimize such biases in presentation.

• Nonparametric estimation does not represent a “solution” to functional form issues raised by RD designs. It is therefore helpful to view it as a complement to—rather than a substitute for—parametric estimation.

When the analyst chooses a parametric functional form (say, a low-order polynomial) that is incorrect, the resulting estimator will, in general, be biased. When the analyst uses a nonparametric procedure such as local linear regression—essentially running a regression using only data points “close” to the cutoff—there will also be bias.4 With a finite sample, it is impossible to know which case has a smaller bias without knowing something about the true function. There will be some functions where a low-order polynomial is a very good approximation and produces little or no bias, and therefore it is efficient to use all data points—both “close to” and “far away” from the threshold. In other situations, a polynomial may be a bad approximation, and smaller biases will occur with a local linear regression. In practice, parametric and nonparametric approaches lead to the computation of the exact same statistic.5 For example, the procedure of regressing the outcome Y on X and a treatment dummy D can be viewed as a parametric regression (as discussed above), or as a local linear regression with a very large bandwidth. Similarly, if one wanted to exclude the influence of data points in the tails of the X distribution, one could call the exact same procedure “parametric” after trimming the tails, or “nonparametric” by viewing the restriction in the range of X as a result of using a smaller bandwidth.6 Our main suggestion in estimation is to not rely on one particular method or specification. In any empirical analysis, results that are stable across alternative

4 Unless the underlying function is exactly linear in the area being examined.

5 See section 1.2 of James L. Powell (1994), where it is argued that it is more helpful to view models rather than particular statistics as “parametric” or “nonparametric.” It is shown there how the same least squares estimator can simultaneously be viewed as a solution to parametric, semiparametric, and nonparametric problems.

6 The main difference, then, between a parametric and nonparametric approach is not in the actual estimation but rather in the discussion of the asymptotic behavior of the estimator as sample sizes tend to infinity. For example, standard nonparametric asymptotics considers what would happen if the bandwidth h—the width of the “window” of observations used for the regression—were allowed to shrink as the number of observations N tended to infinity. It turns out that if h → 0 and Nh → ∞ as N → ∞, the bias will tend to zero. By contrast, with a parametric approach, when one is not allowed to make the model more flexible with more data points, the bias would generally remain—even with infinite samples.

and equally plausible specifications are generally viewed as more reliable than those that are sensitive to minor changes in specification. RD is no exception in this regard.

• Goodness-of-fit and other statistical tests can help rule out overly restrictive specifications.

Often the consequence of trying many different specifications is that it may result in a wide range of estimates. Although there is no simple formula that works in all situations and contexts for weeding out inappropriate specifications, it seems reasonable, at a minimum, not to rely on an estimate resulting from a specification that can be rejected by the data when tested against a strictly more flexible specification. For example, it seems wise to place less confidence in results from a low-order polynomial model when it is rejected in favor of a less restrictive model (e.g., separate means for each discrete value of X). Similarly, there seems little reason to prefer a specification that uses all the data if using the same specification, but restricting to observations closer to the threshold, gives a substantially (and statistically) different answer.

Although we (and the applied literature) sometimes refer to the RD “method” or “approach,” the RD design should perhaps be viewed as more of a description of a particular data generating process. All other things (topic, question, and population of interest) equal, we as researchers might prefer data from a randomized experiment or from an RD design. But in reality, like the randomized experiment—which is also more appropriately viewed as a particular data generating process rather than a “method” of analysis—an RD design will simply not exist to answer a great number of questions. That said, as we show below, there has been an explosion of discoveries of RD designs that cover a wide range of interesting economic topics and questions.

The rest of the paper is organized as follows. In section 2, we discuss the origins of the RD design and show how it has recently been formalized in economics using the potential outcome framework. We also introduce an important theme that we stress throughout the paper, namely that RD designs are particularly compelling because they are close cousins of randomized experiments. This theme is more formally explored in section 3 where we discuss the conditions under which RD designs are “as good as a randomized experiment,” how RD estimates should be interpreted, and how they compare with other commonly used approaches in the program evaluation literature. Section 4 goes through the main “nuts and bolts” involved in implementing RD designs and provides a “guide to practice” for researchers interested in using the design. A summary “checklist” highlighting our key recommendations is provided at the end of this section. Implementation issues in several specific situations (discrete assignment variable, panel data, etc.) are covered in section 5. Based on a survey of the recent literature, section 6 shows that RD designs have turned out to be much more broadly applicable in economics than was originally thought. We conclude in section 7 by discussing recent progress and future prospects in using and interpreting RD designs in economics.

2.  Origins and Background

In this section, we set the stage for the rest of the paper by discussing the origins and the basic structure of the RD design, beginning with the classic work of Thistlethwaite and Campbell (1960) and moving to the recent interpretation of the design using modern tools of program evaluation in economics (potential outcomes framework). One of

the main virtues of the RD approach is that it can be naturally presented using simple graphs, which greatly enhances its credibility and transparency. In light of this, the majority of concepts introduced in this section are represented in graphical terms to help capture the intuition behind the RD design.

2.1 Origins

The RD design was first introduced by Thistlethwaite and Campbell (1960) in their study of the impact of merit awards on the future academic outcomes (career aspirations, enrollment in postgraduate programs, etc.) of students. Their study exploited the fact that these awards were allocated on the basis of an observed test score. Students with test scores X, greater than or equal to a cutoff value c, received the award, while those with scores below the cutoff were denied the award. This generated a sharp discontinuity in the “treatment” (receiving the award) as a function of the test score. Let the receipt of treatment be denoted by the dummy variable D ∈ {0, 1}, so that we have D = 1 if X ≥ c and D = 0 if X < c.

At the same time, there appears to be no reason, other than the merit award, for future academic outcomes, Y, to be a discontinuous function of the test score. This simple reasoning suggests attributing the discontinuous jump in Y at c to the causal effect of the merit award. Assuming that the relationship between Y and X is otherwise linear, a simple way of estimating the treatment effect τ is by fitting the linear regression

(1)  Y = α + Dτ + Xβ + ε,

where ε is the usual error term that can be viewed as a purely random error generating variation in the value of Y around the regression line α + Dτ + Xβ. This case is depicted in figure 1, which shows both the true underlying function and numerous realizations of ε.

Thistlethwaite and Campbell (1960) provide some graphical intuition for why the coefficient τ could be viewed as an estimate of the causal effect of the award. We illustrate their basic argument in figure 1. Consider an individual whose score X is exactly c. To get the causal effect for a person scoring c, we need guesses for what her Y would be with and without receiving the treatment.

If it is “reasonable” to assume that all factors (other than the award) are evolving “smoothly” with respect to X, then B′ would be a reasonable guess for the value of Y of an individual scoring c (and hence receiving the treatment). Similarly, A′′ would be a reasonable guess for that same individual in the counterfactual state of not having received the treatment. It follows that B′ − A′′ would be the causal estimate. This illustrates the intuition that the RD estimates should use observations “close” to the cutoff (e.g., in this case at points c′ and c′′).

There is, however, a limitation to the intuition that “the closer to c you examine, the better.” In practice, one cannot “only” use data close to the cutoff. The narrower the area that is examined, the less data there are. In this example, examining data any closer than c′ and c′′ will yield no observations at all! Thus, in order to produce a reasonable guess for the treated and untreated states at X = c with finite data, one has no choice but to use data away from the discontinuity.7 Indeed, if the underlying function is truly linear, we know that the best linear unbiased estimator of τ is the coefficient on D from OLS estimation (using all of the observations) of equation (1).

This simple heuristic presentation illustrates two important features of the RD

7 Interestingly, the very first application of the RD design by Thistlethwaite and Campbell (1960) was based on discrete data (interval data for test scores). As a result, their paper clearly points out that the RD design is fundamentally based on an extrapolation approach.

[Figure 1. Simple Linear RD Setup. The outcome variable (Y) is plotted against the assignment variable (X); the discontinuity gap τ at the cutoff c separates the points B′ and A″, with c″ and c′ marking points just below and just above the cutoff.]

design. First, in order for this approach to work, “all other factors” determining Y must be evolving “smoothly” with respect to X. If the other variables also jump at c, then the gap τ will potentially be biased for the treatment effect of interest. Second, since an RD estimate requires data away from the cutoff, the estimate will be dependent on the chosen functional form. In this example, if the slope β were (erroneously) restricted to equal zero, it is clear the resulting OLS coefficient on D would be a biased estimate of the true discontinuity gap.

2.2 RD Designs and the Potential Outcomes Framework

While the RD design was being imported into applied economic research by studies such as van der Klaauw (2002), Black (1999), and Angrist and Lavy (1999), the identification issues discussed above were formalized in the theoretical work of Hahn, Todd, and van der Klaauw (2001), who described the RD evaluation strategy using the language of the treatment effects literature. Hahn, Todd, and van der Klaauw (2001) noted the key assumption of a valid RD design was that “all other factors” were “continuous” with respect to X, and suggested a nonparametric procedure for estimating τ that did not assume underlying linearity, as we have in the simple example above.

The necessity of the continuity assumption is seen more formally using the “potential outcomes framework” of the treatment effects literature with the aid of a graph. It is typically imagined that, for each individual i, there exists a pair of “potential” outcomes: Yi(1) for what would occur if the unit were exposed to the treatment and Yi(0) if not exposed. The causal effect of the treatment is represented by the difference Yi(1) − Yi(0).

[Figure 2. Nonlinear RD. The underlying curves E[Y(1)|X] and E[Y(0)|X] are plotted against the assignment variable (X) over the range 0 to 4; only the segment of E[Y(1)|X] to the right of the cutoff and the segment of E[Y(0)|X] to the left are marked “Observed,” with labeled points A, A′, B, B′, D, E, and F and a discrete value Xd on the horizontal axis.]

The fundamental problem of causal inference is that we cannot observe the pair Yi(0) and Yi(1) simultaneously. We therefore typically focus on average effects of the treatment, that is, averages of Yi(1) − Yi(0) over (sub-)populations, rather than on unit-level effects.

In the RD setting, we can imagine there are two underlying relationships between average outcomes and X, represented by E[Yi(1) | X] and E[Yi(0) | X], as in figure 2. But by definition of the RD design, all individuals to the right of the cutoff (c = 2 in this example) are exposed to treatment and all those to the left are denied treatment. Therefore, we only observe E[Yi(1) | X] to the right of the cutoff and E[Yi(0) | X] to the left of the cutoff as indicated in the figure.

It is easy to see that with what is observable, we could try to estimate the quantity

B − A = lim_{ε↓0} E[Yi | Xi = c + ε] − lim_{ε↑0} E[Yi | Xi = c + ε],

which would equal

E[Yi(1) − Yi(0) | X = c].

This is the “average treatment effect” at the cutoff c.

This inference is possible because of the continuity of the underlying functions E[Yi(1) | X] and E[Yi(0) | X].8 In essence,

8 The continuity of both functions is not the minimum that is required, as pointed out in Hahn, Todd, and van der Klaauw (2001). For example, identification is still possible even if only E[Yi(0) | X] is continuous, and only continuous at c. Nevertheless, it may seem more natural to assume that the conditional expectations are continuous for all values of X, since cases where continuity holds at the cutoff point but not at other values of X seem peculiar.

this continuity condition enables us to use the average outcome of those right below the cutoff (who are denied the treatment) as a valid counterfactual for those right above the cutoff (who received the treatment).

Although the potential outcome framework is very useful for understanding how RD designs work in a framework applied economists are used to dealing with, it also introduces some difficulties in terms of interpretation. First, while the continuity assumption sounds generally plausible, it is not completely clear what it means from an economic point of view. The problem is that since continuity is not required in the more traditional applications used in economics (e.g., matching on observables), it is not obvious what assumptions about the behavior of economic agents are required to get continuity.

Second, RD designs are a fairly peculiar application of a “selection on observables” model. Indeed, the view in James J. Heckman, Robert J. Lalonde, and Jeffrey A. Smith (1999) was that “[r]egression discontinuity estimators constitute a special case of selection on observables,” and that the RD estimator is “a limit form of matching at one point.” In general, we need two crucial conditions for a matching/selection on observables approach to work. First, treatment must be randomly assigned conditional on observables (the ignorability or unconfoundedness assumption). In practice, this is typically viewed as a strong, and not particularly credible, assumption. For instance, in a standard regression framework this amounts to assuming that all relevant factors are controlled for, and that no omitted variables are correlated with the treatment dummy. In an RD design, however, this crucial assumption is trivially satisfied. When X ≥ c, the treatment dummy D is always equal to 1. When X < c, D is always equal to 0. Conditional on X, there is no variation left in D, so it cannot, therefore, be correlated with any other factor.9

At the same time, the other standard assumption of overlap is violated since, strictly speaking, it is not possible to observe units with either D = 0 or D = 1 for a given value of the assignment variable X. This is the reason the continuity assumption is required—to compensate for the failure of the overlap condition. So while we cannot observe treatment and nontreatment for the same value of X, we can observe the two outcomes for values of X around the cutoff point that are arbitrarily close to each other.

2.3 RD Design as a Local Randomized Experiment

When looking at RD designs in this way, one could get the impression that they require some assumptions to be satisfied, while other methods such as matching on observables and IV methods simply require other assumptions.10 From this point of view, it would seem that the assumptions for the RD design are just as arbitrary as those used for other methods. As we discuss throughout the paper, however, we do not believe this way of looking at RD designs does justice to their important advantages over most other existing methods. This point becomes much clearer once we compare the RD design to the “gold standard” of program evaluation methods, randomized experiments. We will show that the RD design is a much closer cousin of randomized experiments than other competing methods.

9 In technical terms, the treatment dummy D follows a degenerate (concentrated at D = 0 or D = 1), but nonetheless random distribution conditional on X. Ignorability is thus trivially satisfied.

10 For instance, in the survey of Angrist and Alan B. Krueger (1999), RD is viewed as an IV estimator, thus having essentially the same potential drawbacks and pitfalls.

[Figure 3. Randomized Experiment as a RD Design. The flat curves E[Y(1)|X], observed for the treatment group, and E[Y(0)|X], observed for the control group, are plotted against the assignment variable (a random number, X) over the range 0 to 4.]
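The special case depicted in figure 3 can be reproduced in a few lines (our sketch; the flat outcome levels 1.5 and 3.5 are illustrative values, not taken from any data): with flat potential-outcome curves, the simple difference in mean outcomes across the cutoff equals the average treatment effect.

```python
# Assignment by a "random number" on [0, 4] with cutoff c = 2,
# approximated here by an evenly spaced grid standing in for
# uniform draws.
c = 2.0
nu = [i / 100 for i in range(401)]       # stands in for nu ~ U[0, 4]
y0, y1 = 1.5, 3.5                        # flat curves: E[Y(0)|X], E[Y(1)|X]
y = [y1 if v >= c else y0 for v in nu]

treated = [yi for v, yi in zip(nu, y) if v >= c]
control = [yi for v, yi in zip(nu, y) if v < c]
ate = sum(treated) / len(treated) - sum(control) / len(control)
```

Regressing Y on X and the treatment indicator would deliver the same answer here, only less efficiently, since X is an irrelevant variable under successful randomization.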

In a randomized experiment, units are typically divided into treatment and control groups on the basis of a randomly generated number, ν. For example, if ν follows a uniform distribution over the range [0, 4], units with ν ≥ 2 are given the treatment while units with ν < 2 are denied treatment. So the randomized experiment can be thought of as an RD design where the assignment variable is X = ν and the cutoff is c = 2. Figure 3 shows this special case in the potential outcomes framework, just as in the more general RD design case of figure 2. The difference is that because the assignment variable X is now completely random, it is independent of the potential outcomes Yi(0) and Yi(1), and the curves E[Yi(1) | X] and E[Yi(0) | X] are flat. Since the curves are flat, it trivially follows that they are also continuous at the cutoff point X = c. In other words, continuity is a direct consequence of randomization.

The fact that the curves E[Yi(1) | X] and E[Yi(0) | X] are flat in a randomized experiment implies that, as is well known, the average treatment effect can be computed as the difference in the mean value of Y on the right and left hand side of the cutoff. One could also use an RD approach by running regressions of Y on X, but this would be less efficient since we know that if randomization were successful, then X is an irrelevant variable in this regression.

But now imagine that, for ethical reasons, people are compensated for having received a “bad draw” by getting a monetary compensation inversely proportional to the random number X. For example, the treatment could be job search assistance for the unemployed, and the outcome whether one found a job

within a month of receiving the treatment. If people with a larger monetary compensation can afford to take more time looking for a job, the potential outcome curves will no longer be flat and will slope upward. The reason is that having a higher random number, i.e., a lower monetary compensation, increases the probability of finding a job. So in this “smoothly contaminated” randomized experiment, the potential outcome curves will instead look like the classical RD design case depicted in figure 2.

Unlike a classical randomized experiment, in this contaminated experiment a simple comparison of means no longer yields a consistent estimate of the treatment effect. By focusing right around the threshold, however, an RD approach would still yield a consistent estimate of the treatment effect associated with job search assistance. The reason is that since people just above or below the cutoff receive (essentially) the same monetary compensation, we still have locally a randomized experiment around the cutoff point. Furthermore, as in a randomized experiment, it is possible to test whether randomization “worked” by comparing the local values of baseline covariates on the two sides of the cutoff value.

Of course, this particular example is highly artificial. Since we know the monetary compensation is a continuous function of X, we also know the continuity assumption required for the RD estimates of the treatment effect to be consistent is also satisfied. The important result, due to Lee (2008), that we will show in the next section is that the conditions under which we locally have a randomized experiment (and continuity) right around the cutoff point are remarkably weak. Furthermore, in addition to being weak, the conditions for local randomization are testable in the same way global randomization is testable in a randomized experiment by looking at whether baseline covariates are balanced. It is in this sense that the RD design is more closely related to randomized experiments than to other popular program evaluation methods such as matching on observables, difference-in-differences, and IV.

3.  Identification and Interpretation

This section discusses a number of issues of identification and interpretation that arise when considering an RD design. Specifically, the applied researcher may be interested in knowing the answers to the following questions:

1. How do I know whether an RD design is appropriate for my context? When are the identification assumptions plausible or implausible?

2. Is there any way I can test those assumptions?

3. To what extent are results from RD designs generalizable?

On the surface, the answers to these questions seem straightforward: (1) “An RD design will be appropriate if it is plausible that all other unobservable factors are “continuously” related to the assignment variable,” (2) “No, the continuity assumption is necessary, so there are no tests for the validity of the design,” and (3)
“The estimated treatment effect is applicable to the subpopulation whose treatment was affected by the instrument.” After all, who’s to say whether one untestable design is more “compelling” or “credible” than another untestable design? And it would seem that having a treatment effect for a vanishingly small subpopulation (those at the threshold, in the limit) is hardly more (and probably much less) useful than that for a population “affected by the instrument.”

As we describe below, however, a closer examination of the RD design reveals quite different answers to the above three questions:

1. “When there is a continuously distributed stochastic error component to the assignment variable—which can occur when optimizing agents do not have precise control over the assignment variable—then the variation in the treatment will be as good as randomized in a neighborhood around the discontinuity threshold.”

2. “Yes. As in a randomized experiment, the distribution of observed baseline covariates should not change discontinuously at the threshold.”

3. “The RD estimand can be interpreted as a weighted average treatment effect, where the weights are the relative ex ante probability that the value of an individual’s assignment variable will be in the neighborhood of the threshold.”

Thus, in many contexts, the RD design may have more in common with randomized experiments (or circumstances when an instrument is truly randomized)—in terms of their “internal validity” and how to implement them in practice—than with regression control or matching methods, instrumental variables, or panel data approaches. We will return to this point after first discussing the above three issues in greater detail.

3.1 Valid or Invalid RD?

Are individuals able to influence the assignment variable, and if so, what is the nature of this control? This is probably the most important question to ask when assessing whether a particular application should be analyzed as an RD design. If individuals have a great deal of control over the assignment variable and if there is a perceived benefit to a treatment, one would certainly expect individuals on one side of the threshold to be systematically different from those on the other side.

Consider the test-taking RD example. Suppose there are two types of students: A and B. Suppose type A students are more able than B types, and that A types are also keenly aware that passing the relevant threshold (50 percent) will give them a scholarship benefit, while B types are completely ignorant of the scholarship and the rule. Now suppose that 50 percent of the questions are trivial to answer correctly but, due to random chance, students will sometimes make careless errors when they initially answer the test questions, but would certainly correct the errors if they checked their work. In this scenario, only type A students will make sure to check their answers before turning in the exam, thereby assuring themselves of a passing score. Thus, while we would expect those who barely passed the exam to be a mixture of type A and type B students, those who barely failed would exclusively be type B students. In this example, it is clear that the marginal failing students do not represent a valid counterfactual for the marginal passing students. Analyzing this scenario within an RD framework would be inappropriate.

On the other hand, consider the same scenario, except assume that questions on the exam are not trivial; there are no guaranteed passes, no matter how many times the students check their answers before turning in the exam. In this case, it seems more
plausible that, among those scoring near the threshold, it is a matter of “luck” as to which side of the threshold they land. Type A students can exert more effort—because they know a scholarship is at stake—but they do not know the exact score they will obtain. In this scenario, it would be reasonable to argue that those who marginally failed and passed would be otherwise comparable, and that an RD analysis would be appropriate and would yield credible estimates of the impact of the scholarship.

These two examples make it clear that one must have some knowledge about the mechanism generating the assignment variable beyond knowing that, if it crosses the threshold, the treatment is “turned on.” It is “folk wisdom” in the literature to judge whether the RD is appropriate based on whether individuals could manipulate the assignment variable and precisely “sort” around the discontinuity threshold. The key word here is “precise” rather than “manipulate.” After all, in both examples above, individuals do exert some control over the test score. And indeed, in virtually every known application of the RD design, it is easy to tell a plausible story that the assignment variable is to some degree influenced by someone. But individuals will not always be able to have precise control over the assignment variable. It should perhaps seem obvious that it is necessary to rule out precise sorting to justify the use of an RD design. After all, individual self-selection into treatment or control regimes is exactly why simple comparison of means is unlikely to yield valid causal inferences. Precise sorting around the threshold is self-selection.

What is not obvious, however, is that, when one formalizes the notion of having imprecise control over the assignment variable, there is a striking consequence: the variation in the treatment in a neighborhood of the threshold is “as good as randomized.” We explain this below.

3.1.1 Randomized Experiments from Nonrandom Selection

To see how the inability to precisely control the assignment variable leads to a source of randomized variation in the treatment, consider a simplified formulation of the RD design:11

(2)  Y  =  Dτ  +  Wδ1  +  U
     D  =  1[X ≥ c]
     X  =  Wδ2  +  V,

where Y is the outcome of interest, D is the binary treatment indicator, and W is the vector of all predetermined and observable characteristics of the individual that might impact the outcome and/or the assignment variable X.

This model looks like a standard endogenous dummy variable set-up, except that we observe the assignment variable, X. This allows us to relax most of the other assumptions usually made in this type of model. First, we allow W to be endogenously determined as long as it is determined prior to V. Second, we take no stance as to whether some elements of δ1 or δ2 are zero (exclusion restrictions). Third, we make no assumptions about the correlations between W, U, and V.12

In this model, individual heterogeneity in the outcome is completely described by the pair of random variables (W, U); anyone with the same values of (W, U) will have one of two values for the outcome, depending on whether they receive treatment. Note that,

11 We use a simple linear endogenous dummy variable setup to describe the results in this section, but all of the results could be stated within the standard potential outcomes framework, as in Lee (2008).

12 This is much less restrictive than textbook descriptions of endogenous dummy variable systems. It is typically assumed that (U, V) is independent of W.
[Figure 4. Density of Assignment Variable Conditional on W = w, U = u. The figure plots density against x, showing three cases: “complete control” (a degenerate spike), “precise control” (a truncated density), and imprecise control (a smooth, untruncated density).]

since RD designs are implemented by running regressions of Y on X, equation (2) looks peculiar since X is not included with W and U on the right hand side of the equation. We could add a function of X to the outcome equation, but this would not make a difference since we have not made any assumptions about the joint distribution of W, U, and V. For example, our setup allows for the case where U = Xδ3 + U′, which yields the outcome equation Y = Dτ + Wδ1 + Xδ3 + U′. For the sake of simplicity, we work with the simple case where X is not included on the right hand side of the equation.13

Now consider the distribution of X, conditional on a particular pair of values W = w, U = u. It is equivalent (up to a translational shift) to the distribution of V conditional on W = w, U = u. If an individual has complete and exact control over X, we would model it as having a degenerate distribution, conditional on W = w, U = u. That is, in repeated trials, this individual would choose the same score. This is depicted in figure 4 as the thick line.

If there is some room for error but individuals can nevertheless have precise control about whether they will fail to receive the

13 When RD designs are implemented in practice, the estimated effect of X on Y can either reflect a true causal effect of X on Y or a spurious correlation between X and the unobservable term U. Since it is not possible to distinguish between these two effects in practice, we simplify the setup by implicitly assuming that X only comes into equation (2) indirectly through its (spurious) correlation with U.
treatment, then we would expect the density of X to be zero just below the threshold, but positive just above the threshold, as depicted in figure 4 as the truncated distribution. This density would be one way to model the first example described above for the type A students. Since type A students know about the scholarship, they will double-check their answers and make sure they answer the easy questions, which comprise 50 percent of the test. How high they score above the passing threshold will be determined by some randomness.

Finally, if there is stochastic error in the assignment variable and individuals do not have precise control over the assignment variable, we would expect the density of X (and hence V), conditional on W = w, U = u, to be continuous at the discontinuity threshold, as shown in figure 4 as the untruncated distribution.14 It is important to emphasize that, in this final scenario, the individual still has control over X: through her efforts, she can choose to shift the distribution to the right. This is the density for someone with W = w, U = u, but may well be different—with a different mean, variance, or shape of the density—for other individuals, with different levels of ability, who make different choices. We are assuming, however, that all individuals are unable to precisely control the score just around the threshold.

Definition: We say individuals have imprecise control over X when, conditional on W = w and U = u, the density of V (and hence X) is continuous.

When individuals have imprecise control over X, this leads to the striking implication that variation in treatment status will be randomized in a neighborhood of the threshold. To see this, note that by Bayes’ Rule, we have

(3)  Pr[W = w, U = u | X = x]
     =  f(x | W = w, U = u) Pr[W = w, U = u] / f(x),

where f(∙) and f(∙ | ∙) are marginal and conditional densities for X. So when f(x | W = w, U = u) is continuous in x, the right hand side will be continuous in x, which therefore means that the distribution of W, U conditional on X will be continuous in x.15 That is, all observed and unobserved predetermined characteristics will have identical distributions on either side of x = c, in the limit, as we examine smaller and smaller neighborhoods of the threshold.

In sum,

Local Randomization: If individuals have imprecise control over X as defined above, then Pr[W = w, U = u | X = x] is continuous in x: the treatment is “as good as” randomly assigned around the cutoff.

In other words, the behavioral assumption that individuals do not precisely manipulate X around the threshold has the prediction that treatment is locally randomized.

This is perhaps why RD designs can be so compelling. A deeper investigation into the real-world details of how X (and hence D) is determined can help assess whether it is plausible that individuals have precise or imprecise control over X. By contrast, with

14 For example, this would be plausible when X is a test score modeled as a sum of Bernoulli random variables, which is approximately normal by the central limit theorem.

15 Since the potential outcomes Y(0) and Y(1) are functions of W and U, it follows that the distribution of Y(0) and Y(1) conditional on X is also continuous in x when individuals have imprecise control over X. This implies that the conditions usually invoked for consistently estimating the treatment effect (the conditional means E[Y(0) | X = x] and E[Y(1) | X = x] being continuous in x) are also satisfied. See Lee (2008) for more detail.
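The covariate-balance prediction of local randomization can be illustrated in a simulated setting in the same spirit (again with invented parameters): globally, treated units have systematically higher W, but within a small window around the cutoff the mean of W should be nearly identical on the two sides.

```python
import random
import statistics

random.seed(1)

CUTOFF = 0.0
rows = []
for _ in range(200_000):
    w = random.gauss(0, 1)   # observed baseline covariate
    v = random.gauss(0, 1)   # continuous noise in the assignment variable
    x = w + v                # X = W*delta2 + V with delta2 = 1 (invented values)
    rows.append((w, x))

# Globally, W is badly unbalanced between treated (X >= c) and untreated units.
w_treated = statistics.mean(w for w, x in rows if x >= CUTOFF)
w_control = statistics.mean(w for w, x in rows if x < CUTOFF)

# Locally, the prediction that Pr[W = w | X = x] is continuous in x implies
# near-identical means of W just above and just below the cutoff.
h = 0.05
w_above = statistics.mean(w for w, x in rows if CUTOFF <= x < CUTOFF + h)
w_below = statistics.mean(w for w, x in rows if CUTOFF - h <= x < CUTOFF)

print(round(w_treated - w_control, 2), round(w_above - w_below, 2))
```

The global imbalance is large by construction, while the local difference shrinks toward zero as the window narrows, which is exactly what a balance test on observed covariates checks.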
most nonexperimental evaluation contexts, learning about how the treatment variable is determined will rarely lead one to conclude that it is “as good as” randomly assigned.

3.2 Consequences of Local Random Assignment

There are three practical implications of the above local random assignment result.

3.2.1 Identification of the Treatment Effect

First and foremost, it means that the discontinuity gap at the cutoff identifies the treatment effect of interest. Specifically, we have

  lim ε↓0 E[Y | X = c + ε]  −  lim ε↑0 E[Y | X = c + ε]
    =  τ  +  lim ε↓0 Σw,u (wδ1 + u) Pr[W = w, U = u | X = c + ε]
          −  lim ε↑0 Σw,u (wδ1 + u) Pr[W = w, U = u | X = c + ε]
    =  τ,

where the last line follows from the continuity of Pr[W = w, U = u | X = x].

As we mentioned earlier, nothing changes if we augment the model by adding a direct impact of X itself in the outcome equation, as long as the effect of X on Y does not jump at the cutoff. For example, in the example of Thistlethwaite and Campbell (1960), we can allow higher test scores to improve future academic outcomes (perhaps by raising the probability of admission to higher quality schools) as long as that probability does not jump at precisely the same cutoff used to award scholarships.

3.2.2 Testing the Validity of the RD Design

An almost equally important implication of the above local random assignment result is that it makes it possible to empirically assess the prediction that Pr[W = w, U = u | X = x] is continuous in x. Although it is impossible to test this directly—since U is unobserved—it is nevertheless possible to assess whether Pr[W = w | X = x] is continuous in x at the threshold. A discontinuity would indicate a failure of the identifying assumption.

This is akin to the tests performed to empirically assess whether the randomization was carried out properly in randomized experiments. It is standard in these analyses to demonstrate that treatment and control groups are similar in their observed baseline covariates. It is similarly impossible to test whether unobserved characteristics are balanced in the experimental context, so the most favorable statement that can be made about the experiment is that the data “failed to reject” the assumption of randomization.

Performing this kind of test is arguably more important in the RD design than in the experimental context. After all, the true nature of individuals’ control over the assignment variable—and whether it is precise or imprecise—may well be somewhat debatable even after a great deal of investigation into the exact treatment-assignment mechanism (which itself is always advisable to do). Imprecision of control will often be nothing more than a conjecture, but thankfully it has testable predictions.

There is a complementary, and arguably more direct and intuitive, test of the imprecision of control over the assignment variable: examination of the density of X itself, as suggested in Justin McCrary (2008). If the density of X for each individual is continuous, then the marginal density of X over the population should be continuous as well. A jump in the density at the threshold is probably the most direct evidence of some degree
of sorting around the threshold, and should provoke serious skepticism about the appropriateness of the RD design.16 Furthermore, one advantage of the test is that it can always be performed in an RD setting, while testing whether the covariates W are balanced at the threshold depends on the availability of data on these covariates.

This test is also a partial one. Whether each individual’s ex ante density of X is continuous is fundamentally untestable since, for each individual, we only observe one realization of X. Thus, in principle, at the threshold some individuals’ densities may jump up while others may sharply fall, so that in the aggregate, positives and negatives offset each other making the density appear continuous. In recent applications of RD such occurrences seem far-fetched. Even if this were the case, one would certainly expect to see, after stratifying by different values of the observable characteristics, some discontinuities in the density of X. These discontinuities could be detected by performing the local randomization test described above.

3.2.3 Irrelevance of Including Baseline Covariates

A consequence of a randomized experiment is that the assignment to treatment is, by construction, independent of the baseline covariates. As such, it is not necessary to include them to obtain consistent estimates of the treatment effect. In practice, however, researchers will include them in regressions, because doing so can reduce the sampling variability in the estimator. Arguably the greatest potential for this occurs when one of the baseline covariates is a pre-random-assignment observation on the dependent variable, which may likely be highly correlated with the post-assignment outcome variable of interest.

The local random assignment result allows us to apply these ideas to the RD context. For example, if the lagged value of the dependent variable was determined prior to the realization of X, then the local randomization result will imply that the lagged dependent variable will have a continuous relationship with X. Thus, performing an RD analysis on Y minus its lagged value should also yield the treatment effect of interest. The hope, however, is that the differenced outcome measure will have a sufficiently lower variance than the level of the outcome, so as to lower the variance in the RD estimator.

More formally, we have

  lim ε↓0 E[Y − Wπ | X = c + ε]  −  lim ε↑0 E[Y − Wπ | X = c + ε]
    =  τ  +  lim ε↓0 Σw,u (w(δ1 − π) + u) Pr[W = w, U = u | X = c + ε]
          −  lim ε↑0 Σw,u (w(δ1 − π) + u) Pr[W = w, U = u | X = c + ε]
    =  τ,

where Wπ is any linear function, and W can include a lagged dependent variable, for example. We return to how to implement this in practice in section 4.4.

16 Another possible source of discontinuity in the density of the assignment variable X is selective attrition. For example, John DiNardo and Lee (2004) look at the effect of unionization on wages several years after a union representation vote was taken. In principle, if firms that were unionized because of a majority vote are more likely to close down, then conditional on firm survival at a later date, there will be a discontinuity in X (the vote share) that could threaten the validity of the RD design for estimating the effect of unionization on wages (conditional on survival). In that setting, testing for a discontinuity in the density (conditional on survival) is similar to testing for selective attrition (linked to treatment status) in a standard randomized experiment.
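A deliberately crude version of the density check discussed in section 3.2.2, with a made-up sorting mechanism in the spirit of the type A students, can be sketched as follows. It compares raw counts in a single bin on each side of the cutoff; McCrary’s (2008) actual test is based on a smoothed local-linear density estimator, which this does not implement.

```python
import random

random.seed(2)

CUTOFF = 0.0
H = 0.1  # bin width on each side of the cutoff (invented)

def frac_above(xs):
    """Share of observations in [c, c + H) among those within H of the cutoff.
    Under a continuous density of X this should be close to one half."""
    near = [x for x in xs if CUTOFF - H <= x < CUTOFF + H]
    return sum(1 for x in near if x >= CUTOFF) / len(near)

# No manipulation: X drawn from a continuous distribution.
x_clean = [random.gauss(0, 1) for _ in range(100_000)]

# Sorting: anyone who would land barely below the cutoff nudges themselves
# just above it (like the type A students), piling up mass at the threshold.
x_sorted = [x + H if CUTOFF - H <= x < CUTOFF else x for x in x_clean]

print(round(frac_above(x_clean), 2), round(frac_above(x_sorted), 2))
```

Without manipulation the two bins hold roughly equal mass; with sorting, nearly all mass near the threshold sits on the treated side, the kind of jump in the density that should provoke skepticism about the design.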
3.3 Generalizability: The RD Gap as a Weighted Average Treatment Effect

In the presence of heterogeneous treatment effects, the discontinuity gap in an RD design can be interpreted as a weighted average treatment effect across all individuals. This is somewhat contrary to the temptation to conclude that the RD design only delivers a credible treatment effect for the subpopulation of individuals at the threshold and says nothing about the treatment effect “away from the threshold.” Depending on the context, this may be an overly simplistic and pessimistic assessment.

Consider the scholarship test example again, and define the “treatment” as “receiving a scholarship by scoring 50 percent or greater on the scholarship exam.” Recall that the pair W, U characterizes individual heterogeneity. We now let τ(w, u) denote the treatment effect for an individual with W = w and U = u, so that the outcome equation in (2) is instead given by

  Y  =  Dτ(W, U)  +  Wδ1  +  U.

This is essentially a model of completely unrestricted heterogeneity in the treatment effect. Following the same line of argument as above, we obtain

(5)  lim ε↓0 E[Y | X = c + ε]  −  lim ε↑0 E[Y | X = c + ε]
    =  Σw,u τ(w, u) Pr[W = w, U = u | X = c]
    =  Σw,u τ(w, u) [ f(c | W = w, U = u)/f(c) ] Pr[W = w, U = u],

where the second line follows from equation (3).

The discontinuity gap, then, is a particular kind of average treatment effect across all individuals. If not for the term f(c | W = w, U = u)/f(c), it would be the average treatment effect for the entire population. The presence of the ratio f(c | W = w, U = u)/f(c) implies the discontinuity is instead a weighted average treatment effect where the weights are directly proportional to the ex ante likelihood that an individual’s realization of X will be close to the threshold. All individuals could get some weight, and the similarity of the weights across individuals is ultimately untestable, since again we only observe one realization of X per person and do not know anything about the ex ante probability distribution of X for any one individual. The weights may be relatively similar across individuals, in which case the RD gap would be closer to the overall average treatment effect; but, if the weights are highly varied and also related to the magnitude of the treatment effect, then the RD gap would be very different from the overall average treatment effect. While it is not possible to know how close the RD gap is to the overall average treatment effect, it remains the case that the treatment effect estimated using an RD design is averaged over a larger population than one would have anticipated from a purely “cutoff” interpretation.

Of course, we do not observe the density of the assignment variable at the individual level, so we therefore do not know the weight for each individual. Indeed, if the signal to noise ratio of the test is extremely high, someone who scores 90 percent may have almost a zero chance of scoring near the threshold, implying that the RD gap is almost entirely dominated by those who score near 50 percent. But if the reliability is lower, then the RD gap applies to a relatively broader subpopulation. It remains to be seen whether or not and how information on the reliability, or a second test measurement, or other
covariates that can predict the assignment could be used in conjunction with the RD gap to learn about average treatment effects for the overall population. The understanding of the RD gap as a weighted average treatment effect serves to highlight that RD causal evidence is not somehow fundamentally disconnected from the average treatment effect that is often of interest to researchers.

It is important to emphasize that the RD gap is not informative about the treatment if it were defined as “receipt of a scholarship that is awarded by scoring 90 percent or higher on the scholarship exam.” This is not so much a “drawback” of the RD design as a limitation shared with even a carefully controlled randomized experiment. For example, if we randomly assigned financial aid awards to low-achieving students, whatever treatment effect we estimate may not be informative about the effect of financial aid for high-achieving students.

In some contexts, the treatment effect “away from the discontinuity threshold” may not make much practical sense. Consider the RD analysis of incumbency in congressional elections of Lee (2008). When the treatment is “being the incumbent party,” it is implicitly understood that incumbency entails winning the previous election by obtaining at least 50 percent of the vote.17 In the election context, the treatment “being the incumbent party by virtue of winning an election, whereby 90 percent of the vote is required to win” simply does not apply to any real-life situation. Thus, in this context, it is awkward to interpret the RD gap as “the effect of incumbency that exists at the 50 percent vote-share threshold” (as if there is an effect at a 90 percent threshold). Instead it is more natural to interpret the RD gap as estimating a weighted average treatment effect of incumbency across all districts, where more weight is given to those districts in which a close election race was expected.

3.4 Variations on the Regression Discontinuity Design

To this point, we have focused exclusively on the “classic” RD design introduced by Thistlethwaite and Campbell (1960), whereby there is a single binary treatment and the assignment variable perfectly predicts treatment receipt. We now discuss two variants of this base case: (1) when there is so-called “imperfect compliance” of the rule and (2) when the treatment of interest is a continuous variable.

In both cases, the notion that the RD design generates local variation in treatment that is “as good as randomly assigned” is helpful because we can apply known results for randomized instruments to the RD design, as we do below. The notion is also helpful for addressing other data problems, such as differential attrition or sample selection, whereby the treatment affects whether or not you observe the outcome of interest. The local random assignment result means that, in principle, one could extend the ideas of Joel L. Horowitz and Charles F. Manski (2000) or Lee (2009), for example, to provide bounds on the treatment effect, accounting for possible sample selection bias.

3.4.1 Imperfect Compliance: The “Fuzzy” RD

In many settings of economic interest, treatment is determined partly by whether the assignment variable crosses a cutoff point. This situation is very important in practice for a variety of reasons, including cases of imperfect take-up by program participants or when factors other than the threshold rule affect the probability of program participation. Starting with William M. K. Trochim (1984), this setting has been referred to as a “fuzzy” RD design. In the case we have discussed so far—the “sharp” RD design—the

17 For this example, consider the simplified case of a two-party system.
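The weighting in equation (5) can be illustrated with a small simulation (all numbers invented): two types with different treatment effects and different chances of landing near the cutoff. The RD gap should line up with the f(c | W = w)/f(c)-weighted average rather than with the unweighted population average.

```python
import math
import random
import statistics

random.seed(3)

CUTOFF = 2.0

def tau(w):
    """Heterogeneous treatment effect: type w = 1 benefits more (invented)."""
    return 1.0 + 2.0 * w

rows = []
for _ in range(400_000):
    w = random.choice([0, 1])      # two equally likely types
    x = w + random.gauss(0, 1)     # density of X given W = w is normal around w
    d = 1 if x >= CUTOFF else 0
    y = tau(w) * d + random.gauss(0, 0.1)  # no direct W effect, to keep it simple
    rows.append((w, x, y))

# Empirical RD gap: mean Y just above minus just below the cutoff.
h = 0.05
rd_gap = (statistics.mean(y for w, x, y in rows if CUTOFF <= x < CUTOFF + h)
          - statistics.mean(y for w, x, y in rows if CUTOFF - h <= x < CUTOFF))

# Equation (5): weights proportional to f(c | W = w), the chance of being near c.
phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
f0, f1 = phi(CUTOFF - 0), phi(CUTOFF - 1)
weighted_ate = (tau(0) * f0 + tau(1) * f1) / (f0 + f1)
overall_ate = 0.5 * tau(0) + 0.5 * tau(1)

print(round(rd_gap, 2), round(weighted_ate, 2), round(overall_ate, 2))
```

Because type w = 1 is far more likely to score near the cutoff, the RD gap is pulled toward that type’s larger effect and exceeds the unweighted average treatment effect, as the weighted-average interpretation predicts.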
probability of treatment jumps from 0 to 1 and “excludability” (i.e., X crossing the cutoff
when X crosses the threshold c. The fuzzy cannot impact Y except through impacting
RD design allows for a smaller jump in the receipt of treatment). When these assump-
probability of assignment to the treatment at tions are made, it follows that18
the threshold and only requires
τF = E[Y(1) − Y(0) |unit is complier, X = c],
​    ​ Pr (D = 1 | X = c + ε)
lim 
ε↓0
where “compliers” are units that receive the
≠ ​    ​ Pr (D = 1 | X = c + ε).
lim  treatment when they satisfy the cutoff rule
ε↑0
(Xi ≥ c), but would not otherwise receive it.
Since the probability of treatment jumps In summary, if there is local random
by less than one at the threshold, the jump in assignment (e.g., due to the plausibility of
the relationship between Y and X can no lon- individuals’ imprecise control over X), then
ger be interpreted as an average treatment we can simply apply all of what is known
effect. As in an instrumental variable setting about the assumptions and interpretability
however, the treatment effect can be recov- of instrumental variables. The difference
ered by dividing the jump in the relationship between the “sharp” and “fuzzy” RD design
between Y and X at c by the fraction induced is exactly parallel to the difference between
to take-up the treatment at the threshold— the randomized experiment with perfect
in other words, the discontinuity jump in the compliance and the case of imperfect com-
relation between D and X. In this setting, the pliance, when only the “intent to treat” is
treatment effect can be written as randomized.
limε↓0 E[Y | X = c + ε] − limε↑0 E[Y | X = c + ε]
For example, in the case of imperfect
___________________________
τF  = ​      
    ​ , compliance, even if a proposed binary instru-
lim  E[D | X = c + ε] − lim  E[D | X = c + ε]
ε↓0 ε↑0
ment Z is randomized, it is necessary to rule
where the subscript “F” refers to the fuzzy out the possibility that Z affects the outcome,
RD design. outside of its influence through treatment
There is a close analogy between how the receipt, D. Only then will the instrumental
treatment effect is defined in the fuzzy RD variables estimand—the ratio of the reduced
design and in the well-known “Wald” formu- form effects of Z on Y and of Z on D—be
lation of the treatment effect in an instru- properly interpreted as a causal effect of D
mental variables setting. Hahn, Todd and on Y. Similarly, supposing that individuals do
van der Klaauw (2001) were the first to show not have precise control over X, it is neces-
this important connection and to suggest sary to assume that whether X crosses the
estimating the treatment effect using two- threshold c (the instrument) has no impact
stage least-squares (TSLS) in this setting. on y except by influencing D. Only then will
We discuss estimation of fuzzy RD designs in the ratio of the two RD gaps in Y and D be
greater detail in section 4.3.3. properly interpreted as a causal effect of D
Hahn, Todd and van der Klaauw (2001) on Y.
furthermore pointed out that the interpreta- In the same way that it is important to
tion of this ratio as a causal effect requires verify a strong first-stage relationship in an
the same assumptions as in Imbens and IV design, it is equally important to verify
Angrist (1994). That is, one must assume
“monotonicity” (i.e., X crossing the cutoff
cannot simultaneously cause some units to 18  See Imbens and Lemieux (2008) for a more formal
take up and others to reject the treatment) exposition.
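The ratio τF above can be made concrete with a short simulation. This is an illustrative sketch only: the data-generating process, the bandwidth h, and all variable names are our own choices, not drawn from the paper, and simple local means in a fixed window stand in for the one-sided limits.

```python
import random

random.seed(0)
c = 0.0    # cutoff
h = 0.05   # small window on each side, standing in for the limits
tau = 2.0  # true treatment effect built into the simulation

# Fuzzy design: crossing the cutoff raises the probability of treatment
# from 0.2 to 0.8, rather than from 0 to 1 as in the sharp design.
data = []
for _ in range(50000):
    x = random.uniform(-1, 1)
    p = 0.8 if x >= c else 0.2
    d = 1 if random.random() < p else 0
    y = 1.0 + 0.5 * x + tau * d + random.gauss(0, 1)
    data.append((x, d, y))

def window_mean(col, lo, hi):
    """Mean of one column of (x, d, y) over observations with lo <= x < hi."""
    vals = [row[col] for row in data if lo <= row[0] < hi]
    return sum(vals) / len(vals)

jump_y = window_mean(2, c, c + h) - window_mean(2, c - h, c)  # gap in Y at c
jump_d = window_mean(1, c, c + h) - window_mean(1, c - h, c)  # gap in D at c
tau_f = jump_y / jump_d  # the fuzzy RD ratio of the two discontinuities
```

Because jump_d is roughly 0.6 rather than 1, the outcome gap alone understates the effect; dividing by the treatment gap rescales it, which is what TSLS with 1[X ≥ c] as the instrument would also deliver.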
Lee and Lemieux: Regression Discontinuity Designs in Economics 301

that a discontinuity exists in the relationship between D and X in a fuzzy RD design.

Furthermore, in this binary-treatment–binary-instrument context with unrestricted heterogeneity in treatment effects, the IV estimand is interpreted as the average treatment effect “for the subpopulation affected by the instrument” (or LATE). Analogously, the ratio of the RD gaps in Y and D (the “fuzzy design” estimand) can be interpreted as a weighted LATE, where the weights reflect the ex ante likelihood that the individual’s X is near the threshold. In both cases, the exclusion restriction and monotonicity condition must hold.

3.4.2 Continuous Endogenous Regressor

In a context where the “treatment” is a continuous variable—call it T—and there is a randomized binary instrument (that can additionally be excluded from the outcome equation), an IV approach is an obvious way of obtaining an estimate of the impact of T on Y. The IV estimand is the reduced-form impact of Z on Y divided by the first-stage impact of Z on T.

The same is true for an RD design when the regressor of interest is continuous. Again, the causal impact of interest will still be the ratio of the two RD gaps (i.e., the discontinuities in Y and T).

To see this more formally, consider the model

(6) Y = Tγ + Wδ1 + U1
    T = Dϕ + Wγ + U2
    D = 1[X ≥ c]
    X = Wδ2 + V,

which is the same set-up as before, except with the added second equation, allowing for imperfect compliance or other factors (observables W or unobservables U2) to impact the continuous regressor of interest T. If γ = 0 and U2 = 0, then the model collapses to a “sharp” RD design (with a continuous regressor).

Note that we make no additional assumptions about U2 (in terms of its correlation with W or V). We do continue to assume imprecise control over X (conditional on W and U1, the density of X is continuous).19

Given the discussion so far, it is easy to show that

(7) limε↓0 E[Y | X = c + ε] − limε↑0 E[Y | X = c + ε]
      = [limε↓0 E[T | X = c + ε] − limε↑0 E[T | X = c + ε]] γ.

The left hand side is simply the “reduced form” discontinuity in the relation between Y and X. The term preceding γ on the right hand side is the “first-stage” discontinuity in the relation between T and X, which is also estimable from the data. Thus, analogous to the exactly identified instrumental variable case, the ratio of the two discontinuities yields the parameter γ: the effect of T on Y. Again, because of the added notion of imperfect compliance, it is important to assume that D (X crossing the threshold) does not directly enter the outcome equation.

In some situations, more might be known about the rule determining T. For example, in Angrist and Lavy (1999) and Miguel Urquiola and Eric A. Verhoogen (2009), class size is an increasing function of total school enrollment, except for discontinuities at various enrollment thresholds. But

19 Although it would be unnecessary to do so for the identification of γ, it would probably be more accurate to describe the situation of imprecise control with the continuity of the density of X conditional on the three variables (W, U1, U2). This is because U2 is now another variable characterizing heterogeneity in individuals.
additional information about characteristics such as the slope and intercept of the underlying function (apart from the magnitude of the discontinuity) generally adds nothing to the identification strategy.

To see this, change the second equation in (6) to T = Dϕ + g(X), where g(∙) is any continuous function in the assignment variable. Equation (7) will remain the same and, thus, knowledge of the function g(∙) is irrelevant for identification.20

There is also no need for additional theoretical results in the case when there is individual-level heterogeneity in the causal effect of the continuous regressor T. The local random assignment result allows us to borrow from the existing IV literature and interpret the ratio of the RD gaps as in Angrist and Krueger (1999), except that we need to add the note that all averages are weighted by the ex ante relative likelihood that the individual’s X will land near the threshold.

3.5 Summary: A Comparison of RD and Other Evaluation Strategies

We conclude this section by comparing the RD design with other evaluation approaches. We believe it is helpful to view the RD design as a distinct approach rather than as a special case of either IV or matching/regression-control. Indeed, in important ways the RD design is more similar to a randomized experiment, which we illustrate below.

Consider a randomized experiment where subjects are assigned a random number X and are given the treatment if X ≥ c. By construction, X is independent and not systematically related to any observable or unobservable characteristic determined prior to the randomization. This situation is illustrated in panel A of figure 5. The first column shows the relationship between the treatment variable D and X, a step function, going from 0 to 1 at the X = c threshold. The second column shows the relationship between the observables W and X. This is flat because X is completely randomized. The same is true for the unobservable variable U, depicted in the third column. These three graphs capture the appeal of the randomized experiment: treatment varies while all other factors are kept constant (on average). And even though we cannot directly test whether there are no treatment-control differences in U, we can test whether there are such differences in the observable W.

Now consider an RD design (panel B of figure 5) where individuals have imprecise control over X. Both W and U may be systematically related to X, perhaps due to the actions taken by units to increase their probability of receiving treatment. Whatever the shape of the relation, as long as individuals have imprecise control over X, the relationship will be continuous. And therefore, as we examine Y near the X = c cutoff, we can be assured that, like an experiment, treatment varies (the first column) while other factors are kept constant (the second and third columns). And, like an experiment, we can test this prediction by assessing whether observables truly are continuous with respect to X (the second column).21

We now consider two other commonly used nonexperimental approaches, referring to the model (2):

Y = Dτ + Wδ1 + U
D = 1[X ≥ c]
X = Wδ2 + V.

20 As discussed in 3.2.1, the inclusion of a direct effect of X in the outcome equation will not change identification of τ.
21 We thank an anonymous referee for suggesting these illustrative graphs.
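The point of the comparisons that follow can be previewed by simulating model (2) with U and V correlated, i.e., selection on unobservables. The parameter values and names below are our own illustrative choices, not the paper’s. Because the confounders are continuously related to X, a local contrast at the cutoff still recovers τ even though a raw treated versus control comparison is badly biased:

```python
import random

random.seed(2)
tau = 1.0  # true treatment effect in the simulation

rows = []
for _ in range(100000):
    w = random.gauss(0, 1)            # observable
    u = random.gauss(0, 1)            # unobservable in the outcome equation
    v = 0.7 * u + random.gauss(0, 1)  # V correlated with U: selection on unobservables
    x = 0.5 * w + v                   # assignment variable, as in model (2)
    d = 1 if x >= 0 else 0
    y = tau * d + 0.5 * w + u
    rows.append((x, d, y))

def mean(vals):
    return sum(vals) / len(vals)

h = 0.1  # small window around the cutoff
rd_gap = (mean([y for x, d, y in rows if 0 <= x < h])
          - mean([y for x, d, y in rows if -h <= x < 0]))
naive = (mean([y for x, d, y in rows if d == 1])
         - mean([y for x, d, y in rows if d == 0]))
```

Here rd_gap comes out near τ = 1 (up to a small slope bias from the finite window), while naive is roughly twice as large because units with high U are more likely to cross the threshold.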
[Figure 5. Treatment, Observables, and Unobservables in Four Research Designs. Each panel plots three columns of graphs: the treatment, the observables, and the unobservables. Panel A. Randomized Experiment: E[D | X], E[W | X], and E[U | X] against X. Panel B. Regression Discontinuity Design: E[D | X], E[W | X], and E[U | X] against X. Panel C. Matching on Observables: E[D | W] against W, and E[U | W, D = 1] versus E[U | W, D = 0]. Panel D. Instrumental Variables: E[D | Z], E[W | Z], and E[U | Z] against Z.]
3.5.1 Selection on Observables: Matching/Regression Control

The basic idea of the “selection on observables” approach is to adjust for differences in the W’s between treated and control individuals. It is usually motivated by the fact that it seems “implausible” that the unconditional mean Y for the control group represents a valid counterfactual for the treatment group. So it is argued that, conditional on W, treatment-control contrasts may identify the (W-specific) treatment effect.

The underlying assumption is that conditional on W, U and V are independent. From this it is clear that

E[Y | D = 1, W = w] − E[Y | D = 0, W = w]
  = τ + E[U | W = w, V ≥ c − wδ2] − E[U | W = w, V < c − wδ2]
  = τ.

Two issues arise when implementing this approach. The first is one of functional form: how exactly to control for the W’s? When the W’s take on discrete values, one possibility is to compute treatment effects for each distinct value of W, and then average these effects across the constructed “cells.” This will not work, however, when W has continuous elements, in which case it is necessary to implement multivariate matching, propensity score, reweighting procedures, or nonparametric regressions.22

Regardless of the functional form issue, there is arguably a more fundamental question of which W’s to use in the analysis. While it is tempting to answer “all of them” and hope that more W’s will lead to less biased estimates, this is obviously not necessarily the case. For example, consider estimating the economic returns to graduating high school (versus dropping out). It seems natural to include variables like parents’ socioeconomic status, family income, year, and place of birth in the regression. Including more and more family-level W’s will ultimately lead to a “within-family” sibling analysis; extending it even further by including date of birth leads to a “within-twin-pair” analysis. And researchers have been critical—justifiably so—of this source of variation in education. The same reasons causing discomfort about the twin analyses should also cause skepticism about “kitchen sink” multivariate matching/propensity score/regression control analyses.23

It is also tempting to believe that, if the W’s do a “good job” in predicting D, the selection on observables approach will “work better.” But the opposite is true: in the extreme case when the W’s perfectly predict X (and hence D), it is impossible to construct a treatment-control contrast for virtually all observations. For each value of W, the individuals will either all be treated or all control. In other words, there will be literally no overlap in the support of the propensity score for the treated and control observations. The propensity score would take the values of either 1 or 0.

The “selection on observables” approach is illustrated in panel C of figure 5. Observables W can help predict the probability of treatment (first column), but ultimately one must assume that unobservable factors U must be the same for treated and control units for

22 See Hahn (1998) on including covariates directly with nonparametric regression.
23 Researchers question the twin analyses on the grounds that it is not clear why one twin ends up having more education than the other, and that the assumption that education differences among twins are purely random (as ignorability would imply) is viewed as far-fetched. We thank David Card for pointing out this connection between twin analyses and matching approaches.
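The cell-by-cell recipe for discrete W described above can be sketched as follows. The simulation, the names, and the choice of cell-size weights are our own illustrative assumptions:

```python
import random
from collections import defaultdict

random.seed(6)
tau = 1.0  # true treatment effect in the simulation

rows = []
for _ in range(20000):
    w = random.choice([0, 1, 2])                     # discrete covariate W
    d = 1 if random.random() < 0.2 + 0.2 * w else 0  # selection depends on W only
    y = tau * d + 0.5 * w + random.gauss(0, 1)       # outcome also depends on W
    rows.append((w, d, y))

def mean(vals):
    return sum(vals) / len(vals)

# Group outcomes by (W cell, treatment status).
cells = defaultdict(lambda: {0: [], 1: []})
for w, d, y in rows:
    cells[w][d].append(y)

# Treatment-control contrast within each cell, averaged with cell-size weights.
num, den = 0.0, 0
for groups in cells.values():
    if groups[0] and groups[1]:  # the cell must contain both treated and control
        contrast = mean(groups[1]) - mean(groups[0])
        n_cell = len(groups[0]) + len(groups[1])
        num += contrast * n_cell
        den += n_cell
cell_estimate = num / den

# For comparison: the unadjusted contrast is biased upward, because high-W
# units are both more likely to be treated and have higher outcomes.
naive = (mean([y for _, d, y in rows if d == 1])
         - mean([y for _, d, y in rows if d == 0]))
```

Under the conditional-independence assumption the within-cell contrasts are unbiased for τ, and the overlap check in the loop is exactly where the approach breaks down when the W’s predict treatment too well.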
every value of W. That is, the crucial assumption is that the two lines in the third column be on top of each other. Importantly, there is no comparable graph in the second column because there is no way to test the design since all the W’s are used for estimation.

3.5.2 Selection on Unobservables: Instrumental Variables and “Heckit”

A less restrictive modeling assumption is to allow U and V to be correlated, conditional on W. But because of the arguably “more realistic”/flexible data generating process, another assumption is needed to identify τ. One such assumption is that some elements of W (call them Z) enter the selection equation, but not the outcome equation, and are also uncorrelated with U. An instrumental variables approach utilizes the fact that

E[Y | W* = w*, Z = z]
  = E[D | W* = w*, Z = z]τ + w*γ + E[U | W* = w*, Z = z]
  = E[D | W* = w*, Z = z]τ + w*γ + E[U | W* = w*],

where W has been split up into W* and Z and γ is the corresponding coefficient for w*. Conditional on W* = w*, Y only varies with Z because of how D varies with Z. Thus, one identifies τ by “dividing” the reduced form quantity E[D | W* = w*, Z = z]τ (which can be obtained by examining the expectation of Y conditional on Z for a particular value w* of W*) by E[D | W* = w*, Z = z], which is also provided by the observed data. It is common to model the latter quantity as a linear function in Z, in which case the IV estimator is (conditional on W*) the ratio of coefficients from regressions of Y on Z and D on Z. When Z is binary, this appears to be the only way to identify τ without imposing further assumptions.

When Z is continuous, there is an additional approach to identifying τ. The “Heckit” approach uses the fact that

E[Y | W* = w*, Z = z, D = 1]
  = τ + w*γ + E[U | W* = w*, Z = z, V ≥ c − wδ2]

E[Y | W* = w*, Z = z, D = 0]
  = w*γ + E[U | W* = w*, Z = z, V < c − wδ2].

If we further assume a functional form for the joint distribution of U, V, conditional on W* and Z, then the “control function” terms E[U | W = w, V ≥ c − wδ2] and E[U | W = w, V < c − wδ2] are functions of observed variables, with the parameters then estimable from the data. It is then possible, for any value of W = w, to identify τ as

(8) (E[Y | W* = w*, Z = z, D = 1] − E[Y | W* = w*, Z = z, D = 0])
      − (E[U | W* = w*, Z = z, V ≥ c − wδ2] − E[U | W* = w*, Z = z, V < c − wδ2]).

Even if the joint distribution of U, V is unknown, in principle it is still possible to identify τ, if it were possible to choose two different values of Z such that c − wδ2 approaches −∞ and ∞. If so, the last two terms in (8) approach E[U | W* = w*] and, hence, cancel one another. This is known as “identification at infinity.”

Perhaps the most important assumption that any of these approaches require is the existence of a variable Z that is (conditional
on W*) independent of U.24 There does not seem to be any way of testing the validity of this assumption. Different, but equally “plausible” Z’s may lead to different answers, in the same way that including different sets of W’s may lead to different answers in the selection on observables approach.

Even when there is a mechanism that justifies an instrument Z as “plausible,” it is often unclear which covariates W* to include in the analysis. Again, when different sets of W* lead to different answers, the question becomes which is more plausible: Z is independent of U conditional on W*, or Z is independent of U conditional on a subset of the variables in W*? While there may be some situations where knowledge of the mechanism dictates which variables to include, in other contexts, it may not be obvious.

The situation is illustrated in panel D of figure 5. It is necessary that the instrument Z is related to the treatment (as in the first column). The crucial assumption is regarding the relation between Z and the unobservables U (the third column). In order for an IV or a “Heckit” approach to work, the function in the third column needs to be flat. Of course, we cannot observe whether this is true. Furthermore, in most cases, it is unclear how to interpret the relation between W and Z (second column). Some might argue the observed relation between W and Z should be flat if Z is truly exogenous, and that if Z is highly correlated with W, then it casts doubt on Z being uncorrelated with U. Others will argue that using the second graph as a test is only appropriate when Z is truly randomized, and that the assumption invoked is that Z is uncorrelated with U, conditional on W. In this latter case, the design seems fundamentally untestable, since all the remaining observable variables (the W’s) are being “used up” for identifying the treatment effect.

3.5.3 RD as “Design” not “Method”

RD designs can be valid under the more general “selection on unobservables” environment, allowing an arbitrary correlation among U, V, and W, but at the same time not requiring an instrument. As discussed above, all that is needed is that conditional on W, U, the density of V is continuous, and the local randomization result follows.

How is an RD design able to achieve this, given these weaker assumptions? The answer lies in what is absolutely necessary in an RD design: observability of the latent index X. Intuitively, given that both the “selection on observables” and “selection on unobservables” approaches rely heavily on modeling X and its components (e.g., which W’s to include, and the properties of the unobservable error V and its relation to other variables, such as an instrument Z), actually knowing the value of X ought to help.

In contrast to the “selection on observables” and “selection on unobservables” modeling approaches, with the RD design the researcher can avoid taking any strong stance about what W’s to include in the analysis, since the design predicts that the W’s are irrelevant and unnecessary for identification. Having data on W’s is, of course, of some use, as they allow testing of the underlying assumption (described in section 4.4).

For this reason, it may be more helpful to consider RD designs as a description of a particular data generating process, rather than a “method” or even an “approach.” In virtually any context with an outcome variable Y, treatment status D, and other observable variables W, in principle a researcher can construct a regression-control or instrumental variables

24 For IV, violation of this assumption essentially means that Z varies with Y for reasons other than its influence on D. For the textbook “Heckit” approach, it is typically assumed that U, V have the same distribution for any value of Z. It is also clear that the “identification at infinity” approach will only work if Z is uncorrelated with U; otherwise the last two terms in equation (8) would not cancel. See also the framework of Heckman and Edward Vytlacil (2005), which maintains the assumption of the independence of the error terms and Z, conditional on W*.
(after designating one of the W variables a valid instrument) estimator, and state that the identification assumptions needed are satisfied.

This is not so with an RD design. Either the situation is such that X is observed, or it is not. If not, then the RD design simply does not apply.25 If X is observed, then one has little choice but to attempt to estimate the expectation of Y conditional on X on either side of the cutoff. In this sense, the RD design forces the researcher to analyze it in a particular way, and there is little room for researcher discretion—at least from an identification standpoint. The design also predicts that the inclusion of W’s in the analysis should be irrelevant. Thus it naturally leads the researcher to examine the density of X or the distribution of W’s, conditional on X, for discontinuities as a test for validity.

The analogy of the truly randomized experiment is again helpful. Once the researcher is faced with what she thinks is a properly carried out randomized controlled trial, the analysis is quite straightforward. Even before running the experiment, most researchers agree it would be helpful to display the treatment-control contrasts in the W’s to test whether the randomization was carried out properly, then to show the simple mean comparisons, and finally to verify that the inclusion of the W’s makes little difference in the analysis, even if they might reduce sampling variability in the estimates.

4. Presentation, Estimation, and Inference

In this section, we systematically discuss the nuts and bolts of implementing RD designs in practice. An important virtue of RD designs is that they provide a very transparent way of graphically showing how the treatment effect is identified. We thus begin the section by discussing how to graph the data in an informative way. We then move to arguably the most important issue in implementing an RD design: the choice of the regression model. We address this by presenting the various possible specifications, discussing how to choose among them, and showing how to compute the standard errors.

Next, we discuss a number of other practical issues that often arise in RD designs. Examples of questions discussed include whether we should control for other covariates and what to do when the assignment variable is discrete. We discuss a number of tests to assess the validity of the RD designs, which examine whether covariates are “balanced” on the two sides of the threshold, and whether the density of the assignment variable is continuous at the threshold. Finally, we summarize our recommendations for implementing the RD design.

Throughout this section, we illustrate the various concepts using an empirical example from Lee (2008), who uses an RD design to estimate the causal effect of incumbency in U.S. House elections. We use a sample of 6,558 elections over the 1946–98 period (see Lee 2008 for more detail). The assignment variable in this setting is the fraction of votes awarded to Democrats in the previous election. When the fraction exceeds 50 percent, a Democrat is elected and the party becomes the incumbent party in the next election. Both the share of votes and the probability of winning the next election are considered as outcome variables.

4.1 Graphical Presentation

A major advantage of the RD design over competing methods is its transparency, which can be illustrated using graphical methods. A standard way of graphing the data is to divide the assignment variable into a number of bins, making sure there are two separate

25 Of course, sometimes it may seem at first that an RD design does not apply, but a closer inspection may reveal that it does. For example, see Per Pettersson (2000), which eventually became the RD analysis in Pettersson-Lidbom (2008b).
bins on each side of the cutoff point (to avoid having treated and untreated observations mixed together in the same bin). Then, the average value of the outcome variable can be computed for each bin and graphed against the mid-points of the bins.

More formally, for some bandwidth h, and for some number of bins K0 and K1 to the left and right of the cutoff value, respectively, the idea is to construct bins (bk, bk+1], for k = 1, . . . , K = K0 + K1, where

bk = c − (K0 − k + 1)h.

The average value of the outcome variable in the bin is

Ȳk = (1/Nk) Σi=1,…,N Yi 1{bk < Xi ≤ bk+1}.

It is also useful to calculate the number of observations in each bin,

Nk = Σi=1,…,N 1{bk < Xi ≤ bk+1},

to detect a possible discontinuity in the assignment variable at the threshold, which would suggest manipulation.

There are several important advantages in graphing the data this way before starting to run regressions to estimate the treatment effect. First, the graph provides a simple way of visualizing what the functional form of the regression function looks like on either side of the cutoff point. Since the mean of Y in a bin is, for nonparametric kernel regression estimators, evaluated at the bin mid-point using a rectangular kernel, the set of bin means literally represent nonparametric estimates of the regression function. Seeing what the nonparametric regression looks like can then provide useful guidance in choosing the functional form of the regression models.

A second advantage is that comparing the mean outcomes just to the left and right of the cutoff point provides an indication of the magnitude of the jump in the regression function at this point, i.e., of the treatment effect. Since an RD design is “as good as a randomized experiment” right around the cutoff point, the treatment effect could be computed by comparing the average outcomes in “small” bins just to the left and right of the cutoff point. If there is no visual evidence of a discontinuity in a simple graph, it is unlikely the formal regression methods discussed below will yield a significant treatment effect.

A third advantage is that the graph also shows whether there are unexpected comparable jumps at other points. If such evidence is clearly visible in the graph and cannot be explained on substantive grounds, this calls into question the interpretation of the jump at the cutoff point as the causal effect of the treatment. We discuss below several ways of testing explicitly for the existence of jumps at points other than the cutoff.

Note that the visual impact of the graph is typically enhanced by also plotting a relatively flexible regression model, such as a polynomial model, which is a simple way of smoothing the graph. The advantage of showing both the flexible regression line and the unrestricted bin means is that the regression line better illustrates the shape of the regression function and the size of the jump at the cutoff point, and laying this over the unrestricted means gives a sense of the underlying noise in the data.

Of course, if bins are too narrow the estimates will be highly imprecise. If they are too wide, the estimates may be biased as they fail to account for the slope in the regression line (negligible for very narrow bins). More importantly, wide bins make the comparisons on both sides of the cutoff less credible, as we are no longer comparing observations just to the left and right of the cutoff point.

This raises the question of how to choose the bandwidth (the width of the bin). In practice, this is typically done informally by trying to pick a bandwidth that makes the graphs look informative in the sense that bins
are wide enough to reduce the amount of noise, but narrow enough to compare observations “close enough” on both sides of the cutoff point. While it is certainly advisable to experiment with different bandwidths and see how the corresponding graphs look, it is also useful to have some formal guidance in the selection process.

One approach to bandwidth choice is based on the fact that, as discussed above, the mean outcomes by bin correspond to kernel regression estimates with a rectangular kernel. Since the standard kernel regression is a special case of a local linear regression where the slope term is equal to zero, the cross-validation procedure described in more detail in section 4.3.1 can also be used here by constraining the slope term to equal zero.26 For reasons we discuss below, however, one should not solely rely on this approach to select the bandwidth, since other reasonable subjective goals should be considered when choosing how to plot the data.

Furthermore, a range of bandwidths often yields similar values of the cross-validation function in practical applications (see below). A researcher may, therefore, want to use some discretion in choosing a bandwidth that provides a particularly compelling illustration of the RD design. An alternative approach is to choose a bandwidth based on a more heuristic visual inspection of the data, and then perform some tests to make sure this informal choice is not clearly rejected.

We suggest two such tests. Consider the case where one has decided to use K′ bins based on a visual inspection of the data. The first test is a standard F-test comparing the fit of a regression model with K′ bin dummies to one where we further divide each bin into two equal sized smaller bins, i.e., increase the number of bins to 2K′ (reduce the bandwidth from h′ to h′/2). Since the model with K′ bins is nested in the one with 2K′ bins, a standard F-test with K′ degrees of freedom can be used. If the null hypothesis is not rejected, this provides some evidence that we are not oversmoothing the data by using only K′ bins.

Another test is based on the idea that if the bins are “narrow enough,” then there should not be a systematic relationship between Y and X, which we capture using a simple regression of Y on X, within each bin. Otherwise, this suggests the bin is too wide and that the mean value of Y over the whole bin is not representative of the mean value of Y at the boundaries of the bin. In particular, when this happens in the two bins next to the cutoff point, a simple comparison of the two bin means yields a biased estimate of the treatment effect. A simple test for this consists of adding a set of interactions between the bin dummies and X to a base regression of Y on the set of bin dummies, and testing whether the interactions are jointly significant. The test statistic once again follows an F distribution with K′ degrees of freedom.

Figures 6–11 show the graphs for the share of Democrat vote in the next election and the probability of Democrats winning the next election, respectively. Three sets of graphs with different bandwidths are reported, using a bandwidth of 0.02 in figures 6 and 9, 0.01 in figures 7 and 10, and 0.005

26 In section 4.3.1, we consider the cross-validation function CVY(h) = (1/N) Σi=1,…,N (Yi − Ŷ(Xi))², where Ŷ(Xi) is the predicted value of Yi based on a regression using observations within a bin of width h on either the left (for observations on the left of the cutoff) or the right (for observations on the right of the cutoff) of observation i, but not including observation i itself. In the context of the graph discussed here, the only modification to the cross-validation function is that the predicted value Ŷ(Xi) is based only on a regression with a constant term, which means Ŷ(Xi) is the average value of Y among all observations in the bin (excluding observation i). Note that this is slightly different from the standard cross-validation procedure in kernel regressions, where the left-out observation is in the middle instead of the edge of the bin (see, for example, Richard Blundell and Alan Duncan 1998). Our suggested procedure is arguably better suited to the RD context since estimation of the treatment effect takes place at boundary points.
310 Journal of Economic Literature, Vol. XLVIII (June 2010)

Figure 6. Share of Vote in Next Election, Bandwidth of 0.02 (50 bins)
[figure omitted: vertical axis 0.10 to 0.80; horizontal axis −0.5 to 0.5]
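A graph like figure 6 can be built from raw data by computing local averages of the outcome within non-overlapping bins of the normalized winning margin. A minimal sketch on simulated data (the variable names and the simulated data-generating process are ours):

```python
import numpy as np

def binned_means(y, x, bandwidth, lo=-0.5, hi=0.5):
    """Mean of y within non-overlapping bins of the assignment variable x.
    Returns bin midpoints and bin means (the dots in an RD graph)."""
    edges = np.arange(lo, hi + bandwidth / 2, bandwidth)
    mids, means = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        w = (x >= a) & (x < b)
        if w.any():
            mids.append((a + b) / 2)
            means.append(y[w].mean())
    return np.array(mids), np.array(means)

rng = np.random.default_rng(1)
margin = rng.uniform(-0.5, 0.5, 20_000)         # Democratic winning margin, t
vote = 0.45 + 0.3 * margin + 0.08 * (margin >= 0) \
       + rng.normal(0, 0.05, 20_000)            # vote share, t + 1
mids, means = binned_means(vote, margin, 0.02)  # 50 bins, as in figure 6
print(len(mids))                                # → 50
```

Plotting `means` against `mids` reproduces the dots of the graph; the simulated discontinuity of 0.08 shows up as a visible jump at zero.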

Figure 7. Share of Vote in Next Election, Bandwidth of 0.01 (100 bins)
[figure omitted: vertical axis 0.10 to 0.80; horizontal axis −0.5 to 0.5]

Figure 8. Share of Vote in Next Election, Bandwidth of 0.005 (200 bins)
[figure omitted: vertical axis 0.10 to 0.80; horizontal axis −0.5 to 0.5]

Figure 9. Winning the Next Election, Bandwidth of 0.02 (50 bins)
[figure omitted: vertical axis 0.00 to 1.00; horizontal axis −0.5 to 0.5]

Figure 10. Winning the Next Election, Bandwidth of 0.01 (100 bins)
[figure omitted: vertical axis 0.00 to 1.00; horizontal axis −0.5 to 0.5]

Figure 11. Winning the Next Election, Bandwidth of 0.005 (200 bins)
[figure omitted: vertical axis 0.00 to 1.00; horizontal axis −0.5 to 0.5]

in figures 8 and 11. In all cases, we also show the fitted values from a quartic regression model estimated separately on each side of the cutoff point. Note that the assignment variable is normalized as the difference between the share of vote to Democrats and Republicans in the previous election. This means that a Democrat is the incumbent when the assignment variable exceeds zero. We also limit the range of the graphs to winning margins of 50 percent or less (in absolute terms), as data become relatively sparse for larger winning (or losing) margins.

All graphs show clear evidence of a discontinuity at the cutoff point. While the graphs are all quite informative, the ones with the smallest bandwidth (0.005, figures 8 and 11) are more noisy and likely provide too many data points (200) for optimal visual impact.

The results of the bandwidth selection procedures are presented in table 1. Panel A shows the cross-validation procedure always suggests using a bandwidth of 0.02 or more, which corresponds to similar or wider bins than those used in figures 6 and 9 (those with the largest bins). This is true irrespective of whether we pick a separate bandwidth on each side of the cutoff (first two rows of the panel), or pick the bandwidth that minimizes the cross-validation function for the entire data range on both the left and right sides of the cutoff. In the case where the outcome variable is winning the next election, the cross-validation procedure for the data to the right of the cutoff point and for the entire range suggests using a very wide bin (0.049) that would only yield about ten bins on each side of the cutoff.

As it turns out, the cross-validation function for the entire data range has two local minima, at 0.021 and 0.049, that correspond to the optimal bandwidths on the left and right hand side of the cutoff. This is illustrated in figure 12, which plots the cross-validation function as a function of the bandwidth. By contrast, the cross-validation function is better behaved and shows a global minimum around 0.020 when the outcome variable is the vote share (figure 13). For both outcome variables, the value of the cross-validation function grows quickly for bandwidths smaller than 0.02, suggesting that the graphs with narrower bins (figures 7, 8, 10, and 11) are too noisy.

Panel B of table 1 shows the results of our two suggested specification tests. The tests based on doubling the number of bins and running regressions within each bin yield remarkably similar results. Generally speaking, the results indicate that only fairly wide bins are rejected. Looking at both outcome variables, the tests systematically reject models with bandwidths of 0.05 or more (twenty bins over the −0.5 to 0.5 range). The models are never rejected for either outcome variable once we hit bandwidths of 0.02 (fifty bins) or less. In practice, the testing procedure rules out bins that are larger than those reported in figures 6–11.

At first glance, the results in the two panels of table 1 appear to be contradictory. The cross-validation procedure suggests bandwidths ranging from 0.02 to 0.05, while the bin and regression tests suggest that almost all bandwidths of less than 0.05 are acceptable. The reason for this discrepancy is that while the cross-validation procedure tries to balance precision and bias, the bin and regression tests only deal with the "bias" part of the equation by checking whether the value of Y is more or less constant within a given bin. Models with small bins easily pass this kind of test, although they may yield a very noisy graph. One alternative approach is to choose the largest possible bandwidth that passes the bin and the regression test, which turns out to be 0.033 in table 1, a bandwidth that is within the range of those suggested by the cross-validation procedure.

From a practical point of view, it seems to be the case that formal procedures, and in particular cross-validation, suggest bandwidths that are wider than those one would likely choose based on a simple visual examination

Table 1
Choice of Bandwidth in Graph for Voting Example

A. Optimal bandwidth selected by cross-validation

Side of cutoff    Share of vote    Win next election
Left                  0.021            0.049
Right                 0.026            0.021
Both                  0.021            0.049

B. P-values of tests for the numbers of bins in RD graph

                          Share of vote          Win next election
No. of bins  Bandwidth   Bin test  Regr. test   Bin test  Regr. test
 10          0.100       0.000     0.000        0.001     0.000
 20          0.050       0.000     0.000        0.026     0.049
 30          0.033       0.163     0.390        0.670     0.129
 40          0.025       0.157     0.296        0.024     0.020
 50          0.020       0.957     0.721        0.477     0.552
 60          0.017       0.159     0.367        0.247     0.131
 70          0.014       0.596     0.130        0.630     0.743
 80          0.013       0.526     0.740        0.516     0.222
 90          0.011       0.815     0.503        0.806     0.803
100          0.010       0.787     0.976        0.752     0.883

Notes: Estimated over the range of the forcing variable (Democrat to Republican difference in the share of vote in the previous election) ranging between −0.5 and 0.5. The "bin test" is computed by comparing the fit of a model with the number of bins indicated in the table to an alternative where each bin is split in 2. The "regression test" is a joint test of significance of bin-specific regression estimates of the outcome variable on the share of vote in the previous election.
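The two tests described in the notes to table 1 can be sketched as F-tests on simulated data (an illustrative implementation, not the authors' code; p-values would come from the F distribution with the indicated degrees of freedom). The bin test compares k bins against a model where each bin is split in two; the regression test adds bin-specific slopes and tests their joint significance.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def bin_dummies(x, k, lo=-0.5, hi=0.5):
    """n-by-k matrix of indicators for k equal-width bins over [lo, hi)."""
    edges = np.linspace(lo, hi, k + 1)
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, k - 1)
    return np.eye(k)[idx]

def bin_test(y, x, k):
    """F-statistic: k bins (restricted) against each bin split in two."""
    rss_r = rss(y, bin_dummies(x, k))
    rss_u = rss(y, bin_dummies(x, 2 * k))
    return ((rss_r - rss_u) / k) / (rss_u / (len(y) - 2 * k))

def regression_test(y, x, k):
    """F-statistic: joint significance of bin-specific slopes in x."""
    D = bin_dummies(x, k)
    rss_r = rss(y, D)
    rss_u = rss(y, np.hstack([D, D * x[:, None]]))
    return ((rss_r - rss_u) / k) / (rss_u / (len(y) - 2 * k))

rng = np.random.default_rng(3)
x = rng.uniform(-0.5, 0.5, 6000)
y = 0.5 + 0.3 * x + 0.1 * (x >= 0) + rng.normal(0, 0.05, 6000)
print(bin_test(y, x, 10), regression_test(y, x, 10))   # wide bins: large F
print(bin_test(y, x, 50), regression_test(y, x, 50))   # narrow bins: F near 1
```

With wide bins the within-bin slope is detectable and both statistics are large; with narrow bins the within-bin variation in Y is swamped by noise and the tests do not reject, matching the pattern in panel B.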

of the data. In particular, both figures 7 and 10 (bandwidth of 0.01) look visually acceptable but are clearly not recommended on the basis of the cross-validation procedure. This likely reflects the fact that one important goal of the graph is to show how the raw data look, and too much smoothing would defy the purpose of such a data illustration exercise. Furthermore, the regression estimates of the treatment effect accompanying the graphical results are a formal way of smoothing the data to get precise estimates. This suggests that there is probably little harm in undersmoothing (relative to what formal bandwidth selection procedures would suggest) to better illustrate the variation in the raw data when graphically illustrating an RD design.

4.2 Regression Methods

4.2.1 Parametric or Nonparametric Regressions?

When we introduced the RD design in section 2, we followed Thistlethwaite and Campbell (1960) in assuming that the

Figure 12. Cross-Validation Function for Choosing the Bandwidth in a RD Graph: Winning the Next Election
[figure omitted: vertical axis "Cross-validation function," 429 to 435; horizontal axis "Bandwidth," 0.00 to 0.08]

Figure 13. Cross-Validation Function for Choosing Bandwidth in a RD Graph: Share of Vote at Next Election
[figure omitted: vertical axis "Cross-validation function," 77.4 to 78.8; horizontal axis "Bandwidth," 0 to 0.07]
underlying regression model was linear in the assignment variable X:

Y  =  α  +  Dτ  +  Xβ  +  ε.

In general, as in any other setting, there is no particular reason to believe that the true model is linear. The consequences of using an incorrect functional form are more serious in the case of RD designs, however, since misspecification of the functional form typically generates a bias in the treatment effect, τ.27 This explains why, starting with Hahn, Todd, and van der Klaauw (2001), the estimation of RD designs has generally been viewed as a nonparametric estimation problem.

This being said, applied papers using the RD design often just report estimates from parametric models. Does this mean that these estimates are incorrect? Should all studies use nonparametric methods instead? As we pointed out in the introduction, we think that the distinction between parametric and nonparametric methods has sometimes been a source of confusion to practitioners. Before covering in detail the practical issues involved in the estimation of RD designs, we thus provide some background to help clarify the insights provided by nonparametric analysis, while also explaining why, in practice, RD designs can still be implemented using "parametric" methods.

Going beyond simple parametric linear regressions when the true functional form is unknown is a well-studied problem in econometrics and statistics. A number of nonparametric methods have been suggested to provide flexible estimates of the regression function. As it turns out, however, the RD setting poses a particular problem because we need to estimate regressions at the cutoff point. This results in a "boundary problem" that causes some complications for nonparametric methods.

From an applied perspective, a simple way of relaxing the linearity assumption is to include polynomial functions of X in the regression model. This corresponds to the series estimation approach often used in nonparametric analysis. A possible disadvantage of the approach, however, is that it provides global estimates of the regression function over all values of X, while the RD design depends instead on local estimates of the regression function at the cutoff point. The fact that polynomial regression models use data far away from the cutoff point to predict the value of Y at the cutoff point is not intuitively appealing. That said, trying more flexible specifications by adding polynomials in X as regressors is an important and useful way of assessing the robustness of the RD estimates of the treatment effect.

The other leading nonparametric approach is kernel regressions. Unlike series (polynomial) estimators, the kernel regression is fundamentally a local method well suited for estimating the regression function at a particular point. Unfortunately, this property does not help very much in the RD setting because the cutoff represents a boundary point where kernel regressions perform poorly.

These issues are illustrated in figure 2, which shows a situation where the relationship between Y and X (under treatment or control) is nonlinear. First, consider the point D located away from the cutoff point. The kernel estimate of the regression of Y on X at X = Xd is simply a local mean of Y for values of X close to Xd. The kernel function provides a way of computing this local average by putting more weight on observations with values of X close to Xd than on observations with values of X far away from Xd. Following Imbens

27 By contrast, when one runs a linear regression in a model where the true functional form is nonlinear, the estimated model can still be interpreted as a linear predictor that minimizes specification errors. But since specification errors are only minimized globally, we can still have large specification errors at specific points, including the cutoff point, and, therefore, a large bias in RD estimates of the treatment effect.
and Lemieux (2008), we focus on the convenient case of the rectangular kernel. In this setting, computing kernel regressions simply amounts to computing the average value of Y in the bin illustrated in figure 2. The resulting local average is depicted as the horizontal line EF, which is very close to the true value of Y evaluated at X = Xd on the regression line.

Applying this local averaging approach is problematic, however, for the RD design. Consider estimating the value of the regression function just on the right of the cutoff point. Clearly, only observations on the right of the cutoff point that receive the treatment should be used to compute mean outcomes on the right hand side. Similarly, only observations on the left of the cutoff point that do not receive the treatment should be used to compute mean outcomes on the left hand side. Otherwise, regression estimates would mix observations with and without the treatment, which would invalidate the RD approach.

In this setting, the best thing is to compute the average value of Y in the bin just to the right and just to the left of the cutoff point. These two bins are shown in figure 2. The RD estimate based on kernel regressions is then equal to B′ − A′. In this example where the regression lines are upward sloping, it is clear, however, that the estimate B′ − A′ overstates the true treatment effect represented as the difference B − A at the cutoff point. In other words, there is a systematic bias in kernel regression estimates of the treatment effect. Hahn, Todd, and van der Klaauw (2001) provide a more formal derivation of the bias (see also Imbens and Lemieux 2008 for a simpler exposition when the kernel is rectangular). In practical terms, the problem is that in finite samples the bandwidth has to be large enough to encompass enough observations to get a reasonable amount of precision in the estimated average values of Y. Otherwise, attempts to reduce the bias by shrinking the bandwidth will result in extremely noisy estimates of the treatment effect.28

As a solution to this problem, Hahn, Todd, and van der Klaauw (2001) suggest running local linear regressions to reduce the importance of the bias. In our setup with a rectangular kernel, this suggestion simply amounts to running standard linear regressions within the bins on both sides of the cutoff point to better predict the value of the regression function right at the cutoff point. In this example, the regression lines within the bins around the cutoff point are close to linear. It follows that the predicted values of the local linear regressions at the cutoff point are very close to the true values of A and B. Intuitively, this means that running local linear regressions instead of just computing averages within the bins reduces the bias by an order of magnitude. Indeed, Hahn, Todd, and van der Klaauw (2001) show that the remaining bias is of an order of magnitude lower, and is comparable to the usual bias in kernel estimation at interior points like D (the small difference between the horizontal line EF and the true value of the regression line evaluated at D).

In the literature on nonparametric estimation at boundary points, local linear regressions have been introduced as a means of reducing the bias in standard kernel regression methods.29 One of the several contributions of Hahn, Todd, and van der Klaauw (2001) is to show how the same bias-reducing

28 The trade-off between bias and precision is a fundamental feature of kernel regressions. A larger bandwidth yields more precise, but potentially biased, estimates of the regression. In an interior point like D, however, we see that the bias is of an order of magnitude lower than at the cutoff (boundary) point. In more technical terms, it can be shown (see Hahn, Todd, and van der Klaauw 2001 or Imbens and Lemieux 2008) that the usual bias is of order h² at interior points, but of order h at boundary points, where h is the bandwidth. In other words, the bias dies off much more quickly when h goes to zero when we are at interior, as opposed to boundary, points.
29 See Jianqing Fan and Irene Gijbels (1996).
procedure should also be applied to the RD design. We have shown here that, in practice, this simply amounts to applying the original insight of Thistlethwaite and Campbell (1960) to a narrower window of observations around the cutoff point. When one is concerned that the regression function is not linear over the whole range of X, a highly sensible procedure is, thus, to restrict the estimation range to values closer to the cutoff point where the linear approximation of the regression line is less likely to result in large biases in the RD estimates. In practice, many applied papers present RD estimates with varying window widths to illustrate the robustness (or lack thereof) of the RD estimates to specification issues. It is comforting to know that this common empirical practice can be justified on more formal econometric grounds like those presented by Hahn, Todd, and van der Klaauw (2001). The main conclusion we draw from this discussion of nonparametric methods is that it is essential to explore how RD estimates are robust to the inclusion of higher order polynomial terms (the series or polynomial estimation approach) and to changes in the window width around the cutoff point (the local linear regression approach).

4.3 Estimating the Regression

A simple way of implementing RD designs in practice is to estimate two separate regressions on each side of the cutoff point. In terms of computations, it is convenient to subtract the cutoff value from the covariate, i.e., transform X to X − c, so the intercepts of the two regressions yield the value of the regression functions at the cutoff point.

The regression model on the left hand side of the cutoff point (X < c) is

Y  =  αl  +  fl(X − c)  +  ε,

while the regression model on the right hand side of the cutoff point (X ≥ c) is

Y  =  αr  +  fr(X − c)  +  ε,

where fl(∙) and fr(∙) are functional forms that we discuss later. The treatment effect can then be computed as the difference between the two regression intercepts, αr and αl, on the two sides of the cutoff point. A more direct way of estimating the treatment effect is to run a pooled regression on both sides of the cutoff point:

Y  =  αl  +  τD  +  f(X − c)  +  ε,

where τ = αr − αl and f(X − c) = fl(X − c) + D[fr(X − c) − fl(X − c)]. One advantage of the pooled approach is that it directly yields estimates and standard errors of the treatment effect τ. Note, however, that it is recommended to let the regression function differ on both sides of the cutoff point by including interaction terms between D and X. For example, in the linear case where fl(X − c) = βl(X − c) and fr(X − c) = βr(X − c), the pooled regression would be

Y  =  αl  +  τD  +  βl(X − c)  +  (βr − βl)D(X − c)  +  ε.

The problem with constraining the slope of the regression lines to be the same on both sides of the cutoff (βr = βl) is best illustrated by going back to the separate regressions above. If we were to constrain the slope to be identical on both sides of the cutoff, this would amount to using data on the right hand side of the cutoff to estimate αl, and vice versa. Remember from section 2 that in an RD design, the treatment effect is obtained by comparing conditional expectations of Y when approaching from the left (αl = lim_{x↑c} E[Yi | Xi = x]) and from the right (αr = lim_{x↓c} E[Yi | Xi = x]) of the cutoff. Constraining the slope to be the same would thus be inconsistent with the spirit of
the RD design, as data from the right of the cutoff would be used to estimate αl, which is defined as a limit when approaching from the left of the cutoff, and vice versa.

In practice, however, estimates where the regression slope or, more generally, the regression function f(X − c) are constrained to be the same on both sides of the cutoff point are often reported. One possible justification for doing so is that if the functional form is indeed the same on both sides of the cutoff, then more efficient estimates of the treatment effect τ are obtained by imposing that constraint. Such a constrained specification should only be viewed, however, as an additional estimate to be reported for the sake of completeness. It should not form the core basis of the empirical approach.

4.3.1 Local Linear Regressions and Bandwidth Choice

As discussed above, local linear regressions provide a nonparametric way of consistently estimating the treatment effect in an RD design (Hahn, Todd, and van der Klaauw 2001; Jack Porter 2003). Following Imbens and Lemieux (2008), we focus on the case of a rectangular kernel, which amounts to estimating a standard regression over a window of width h on both sides of the cutoff point. While other kernels (triangular, Epanechnikov, etc.) could also be used, the choice of kernel typically has little impact in practice. As a result, the convenience of working with a rectangular kernel compensates for efficiency gains that could be achieved using more sophisticated kernels.30

The regression model on the left hand side of the cutoff point is

Y  =  αl  +  βl(X − c)  +  ε,  where c − h ≤ X < c,

while the regression model on the right hand side of the cutoff point is

Y  =  αr  +  βr(X − c)  +  ε,  where c ≤ X ≤ c + h.

As before, it is also convenient to estimate the pooled regression

Y  =  αl  +  τD  +  βl(X − c)  +  (βr − βl)D(X − c)  +  ε,  where c − h ≤ X ≤ c + h,

since the standard error of the estimated treatment effect can be directly obtained from the regression.

While it is straightforward to estimate the linear regressions within a given window of width h around the cutoff point, a more difficult question is how to choose this bandwidth. In general, choosing a bandwidth in nonparametric estimation involves finding an optimal balance between precision and bias. On the one hand, using a larger bandwidth yields more precise estimates as more observations are available to estimate the regression. On the other hand, the linear specification is less likely to be accurate

30 It has been shown in the statistics literature (Fan and Gijbels 1996) that a triangular kernel is optimal for estimating local linear regressions at the boundary. As it turns out, the only difference between regressions using a rectangular or a triangular kernel is that the latter puts more weight (in a linear way) on observations closer to the cutoff point. It thus involves estimating a weighted, as opposed to an unweighted, regression within a bin of width h. An arguably more transparent way of putting more weight on observations close to the cutoff is simply to reestimate a model with a rectangular kernel using a smaller bandwidth. In practice, it is therefore simpler and more transparent to just estimate standard linear regressions (rectangular kernel) with a variety of bandwidths, instead of trying out different kernels corresponding to particular weighted regressions that are more difficult to interpret.
when a larger bandwidth is used, which can bias the estimate of the treatment effect. If the underlying conditional expectation is not linear, the linear specification will provide a close approximation over a limited range of values of X (small bandwidth), but an increasingly bad approximation over a larger range of values of X (larger bandwidth).

As the number of observations available increases, it becomes possible to use an increasingly small bandwidth since linear regressions can be estimated relatively precisely over even a small range of values of X. As it turns out, Hahn, Todd, and van der Klaauw (2001) show the optimal bandwidth is proportional to N^(−1/5), which corresponds to a fairly slow rate of convergence to zero. For example, this suggests that the bandwidth should only be cut in half when the sample size increases by a factor of 32 (2⁵). For technical reasons, however, it would be preferable to undersmooth by shrinking the bandwidth at a faster rate, requiring that h ∝ N^(−δ) with 1/5 < δ < 2/5, in order to eliminate an asymptotic bias that would remain when δ = 1/5. In the presence of this bias, the usual formula for the variance of a standard least squares estimator would be invalid.31

In practice, however, knowing at what rate the bandwidth should shrink in the limit does not really help since only one actual sample with a given number of observations is available. The importance of undersmoothing only has to do with a thought experiment of how much the bandwidth should shrink if the sample size were larger so that one obtains asymptotically correct standard errors, and does not help one choose a particular bandwidth in a particular sample.32

In the econometrics and statistics literature, two procedures are generally considered for choosing bandwidths. The first procedure consists of characterizing the optimal bandwidth in terms of the unknown joint distribution of all variables. The relevant components of this distribution can then be estimated and plugged into the optimal bandwidth function.33 In the context of local linear regressions, Fan and Gijbels (1996) show this involves estimating a number of parameters including the curvature of the regression function. In practice, this can be done in two steps. In step one, a rule-of-thumb (ROT) bandwidth is estimated over the whole relevant data range. In step two, the ROT bandwidth is used to estimate the optimal bandwidth right at the cutoff point. For the rectangular kernel, the ROT bandwidth is given by:

hROT  =  2.702 [σ̃² R / ∑_{i=1}^N m̃″(xi)²]^(1/5),

31 See Hahn, Todd, and van der Klaauw (2001) and Imbens and Lemieux (2008) for more details.
32 The main purpose of asymptotic theory is to use the large sample properties of estimators to approximate the distribution of an estimator in the real sample being considered. The issue is a little more delicate in a nonparametric setting where one also has to think about how fast the bandwidth should shrink when the sample size approaches infinity. The point about undersmoothing is simply that one unpleasant property of the optimal bandwidth is that it does not yield the convenient least squares variance formula. But this can be fixed by shrinking the bandwidth a little faster as the sample size goes to infinity. Strictly speaking, this is only a technical issue with how to perform the thought experiment (what happens when the sample size goes to infinity?) required for using asymptotics to approximate the variance of the RD estimator in the actual sample. This does not say anything about what bandwidth should be chosen in the actual sample available for implementing the RD design.
33 A well known example of this procedure is the "rule-of-thumb" bandwidth selection formula in kernel density estimation where an estimate of the dispersion in the variable (standard deviation or the interquartile range), σ̂, is plugged into the formula 0.9 ∙ σ̂ ∙ N^(−1/5). Bernard W. Silverman (1986) shows that this formula is the closed form solution for the optimal bandwidth choice problem when both the actual density and the kernel are Gaussian. See also Imbens and Karthik Kalyanaraman (2009), who derive an optimal bandwidth for this RD setting, and propose a data-dependent method for choosing the bandwidth.
where m̃″(∙) is the second derivative (curvature) of an estimated regression of Y on X, σ̃ is the estimated standard error of the regression, R is the range of the assignment variable over which the regression is estimated, and the constant 2.702 is a number specific to the rectangular kernel. A similar formula can be used for the optimal bandwidth, except both the regression standard error and the average curvature of the regression function are estimated locally around the cutoff point. For the sake of simplicity, we only compute the ROT bandwidth in our empirical example. Following the common practice in studies using these bandwidth selection methods, we also use a quartic specification for the regression function.34

The second approach is based on a cross-validation procedure. In the case considered here, Jens Ludwig and Douglas Miller (2007) and Imbens and Lemieux (2008) have proposed a "leave one out" procedure aimed specifically at estimating the regression function at the boundary. The basic idea behind this procedure is the following. Consider an observation i. To see how well a linear regression with a bandwidth h fits the data, we run a regression with observation i left out and use the estimates to predict the value of Y at X = Xi. In order to mimic the fact that RD estimates are based on regression estimates at the boundary, the regression is estimated using only observations with values of X on the left of Xi (Xi − h ≤ X < Xi) for observations on the left of the cutoff point (Xi < c). For observations on the right of the cutoff point (Xi ≥ c), the regression is estimated using only observations with values of X on the right of Xi (Xi < X ≤ Xi + h).

Repeating the exercise for each and every observation, we get a whole set of predicted values of Y that can be compared to the actual values of Y. The optimal bandwidth can be picked by choosing the value of h that minimizes the mean square of the difference between the predicted and actual value of Y. More formally, let Ŷ(Xi) represent the predicted value of Y obtained using the regressions described above. The cross-validation criterion is defined as

(9)  CVY(h)  =  (1/N) ∑_{i=1}^N (Yi − Ŷ(Xi))²

with the corresponding cross-validation choice for the bandwidth

h_CV^opt  =  arg min_h CVY(h).

Imbens and Lemieux (2008) discuss this procedure in more detail and point out that since we are primarily interested in what happens around the cutoff, it may be advisable to only compute CVY(h) for a subset of observations with values of X close enough to the cutoff point. For instance, only observations with values of X between the median value of X to the left and right of the cutoff could be used to perform the cross-validation.

The second rows of tables 2 and 3 show the local linear regression estimates of the treatment effect for the two outcome variables (share of vote and winning the next election). We show the estimates for a wide range of bandwidths, going from the entire data range (bandwidth of 1 on each side of the cutoff) to a very small bandwidth of 0.01 (winning margins of one percent or less). As expected, the precision of the estimates declines quickly as we approach smaller and smaller bandwidths. Notice also that estimates based

34 See McCrary and Heather Royer (2003) for an example where the bandwidth is selected using the ROT procedure (with a triangular kernel), and Stephen L. DesJardins and Brian P. McCall (2008) for an example where the second step optimal bandwidth is computed (for the Epanechnikov kernel). Both papers use a quartic regression function m(x) = β0 + β1x + … + β4x⁴, which means that m″(x) = 2β2 + 6β3x + 12β4x². Note that the quartic regressions are estimated separately on both sides of the cutoff.

Table 2
RD Estimates of the Effect of Winning the Previous Election on the
Share of Votes in the Next Election

Bandwidth: 1.00 0.50 0.25 0.15 0.10 0.05 0.04 0.03 0.02 0.01

Polynomial of order:
Zero 0.347 0.257 0.179 0.143 0.125 0.096 0.080 0.073 0.077 0.088
(0.003) (0.004) (0.004) (0.005) (0.006) (0.009) (0.011) (0.012) (0.014) (0.015)
[0.000] [0.000] [0.000] [0.000] [0.003] [0.047] [0.778] [0.821] [0.687]
One 0.118 0.090 0.082 0.077 0.061 0.049 0.067 0.079 0.098 0.096
(0.006) (0.007) (0.008) (0.011) (0.013) (0.019) (0.022) (0.026) (0.029) (0.028)
[0.000] [0.332] [0.423] [0.216] [0.543] [0.168] [0.436] [0.254] [0.935]
Two 0.052 0.082 0.069 0.050 0.057 0.100 0.101 0.119 0.088 0.098
(0.008) (0.010) (0.013) (0.016) (0.020) (0.029) (0.033) (0.038) (0.044) (0.045)
[0.000] [0.335] [0.371] [0.385] [0.458] [0.650] [0.682] [0.272] [0.943]
Three 0.111 0.068 0.057 0.061 0.072 0.112 0.119 0.092 0.108 0.082
(0.011) (0.013) (0.017) (0.022) (0.028) (0.037) (0.043) (0.052) (0.062) (0.063)
[0.001] [0.335] [0.524] [0.421] [0.354] [0.603] [0.453] [0.324] [0.915]
Four 0.077 0.066 0.048 0.074 0.103 0.106 0.088 0.049 0.055 0.077
(0.013) (0.017) (0.022) (0.027) (0.033) (0.048) (0.056) (0.067) (0.079) (0.063)
[0.014] [0.325] [0.385] [0.425] [0.327] [0.560] [0.497] [0.044] [0.947]
Optimal order of 6 3 1 2 1 2 0 0 0 0
  the polynomial
Observations 6,558 4,900 2,763 1,765 1,209 610 483 355 231 106

Notes: Standard errors in parentheses. P-values from the goodness-of-fit test in square brackets. The goodness-of-fit
test is obtained by jointly testing the significance of a set of bin dummies included as additional regressors in the
model. The bin width used to construct the bin dummies is 0.01. The optimal order of the polynomial is chosen using
Akaike’s criterion (penalized cross-validation).

on very wide bandwidths (0.5 or 1) are systematically larger than those for the smaller bandwidths (in the 0.05 to 0.25 range) that are still large enough for the estimates to be reasonably precise. A closer examination of figures 6–11 also suggests that the estimates for very wide bandwidths are larger than what the graphical evidence would suggest.35 This is consistent with a substantial bias for these estimates linked to the fact that the linear approximation does not hold over a wide data range. This is particularly clear in the case of winning the next election where figures 9–11 show some clear curvature in the regression function.

35 In the case of the vote share, the quartic regression shown in figures 6–8 implies a treatment effect of 0.066, which is substantially smaller than the local linear regression estimates with a bandwidth of 0.5 (0.090) or 1 (0.118). Similarly, the quartic regression shown in figures 9–11 for winning the next election implies a treatment effect of 0.375, which is again smaller than the local linear regression estimates with a bandwidth of 0.5 (0.566) or 1 (0.689).

Table 4 shows the optimal bandwidth obtained using the ROT and cross-validation procedure. Consistent with the above
Lee and Lemieux: Regression Discontinuity Designs in Economics 323

Table 3
RD Estimates of the Effect of Winning the Previous Election on
Probability of Winning the Next Election

Bandwidth: 1.00 0.50 0.25 0.15 0.10 0.05 0.04 0.03 0.02 0.01

Polynomial of order:
Zero 0.814 0.777 0.687 0.604 0.550 0.479 0.428 0.423 0.459 0.533
(0.007) (0.009) (0.013) (0.018) (0.023) (0.035) (0.040) (0.047) (0.058) (0.082)
[0.000] [0.000] [0.000] [0.000] [0.011] [0.201] [0.852] [0.640] [0.479]
One 0.689 0.566 0.457 0.409 0.378 0.378 0.472 0.524 0.567 0.453
(0.011) (0.016) (0.026) (0.036) (0.047) (0.073) (0.083) (0.099) (0.116) (0.157)
[0.000] [0.000] [0.126] [0.269] [0.336] [0.155] [0.400] [0.243] [0.125]
Two 0.526 0.440 0.375 0.391 0.450 0.607 0.586 0.589 0.440 0.225
(0.016) (0.023) (0.039) (0.055) (0.072) (0.110) (0.124) (0.144) (0.177) (0.246)
[0.075] [0.145] [0.253] [0.192] [0.245] [0.485] [0.367] [0.191] [0.134]
Three 0.452 0.370 0.408 0.435 0.472 0.566 0.547 0.412 0.266 0.172
(0.021) (0.031) (0.052) (0.075) (0.096) (0.143) (0.166) (0.198) (0.247) (0.349)
[0.818] [0.277] [0.295] [0.115] [0.138] [0.536] [0.401] [0.234] [0.304]
Four 0.385 0.375 0.424 0.529 0.604 0.453 0.331 0.134 0.050 0.168
(0.026) (0.039) (0.066) (0.093) (0.119) (0.183) (0.214) (0.254) (0.316) (0.351)
[0.965] [0.200] [0.200] [0.173] [0.292] [0.593] [0.507] [0.150] [0.244]
Optimal order of 4 3 2 1 1 2 0 0 0 1
  the polynomial
Observations 6,558 4,900 2,763 1,765 1,209 610 483 355 231 106

Notes: Standard errors in parentheses. P-values from the goodness-of-fit test in square brackets. The goodness-of-fit
test is obtained by jointly testing the significance of a set of bin dummies included as additional regressors in the
model. The bin width used to construct the bin dummies is 0.01. The optimal order of the polynomial is chosen using
Akaike’s criterion (penalized cross-validation).

discussion, the suggested bandwidth ranges from 0.14 to 0.28, which is large enough to get precise estimates, but narrow enough to minimize the bias. Two interesting patterns can be observed in table 4. First, the bandwidth chosen by cross-validation tends to be a bit larger than the one based on the rule-of-thumb. Second, the bandwidth is generally smaller for winning the next election (second column) than for the vote share (first column). This is particularly clear when the optimal bandwidth is constrained to be the same on both sides of the cutoff point. This is consistent with the graphical evidence showing more curvature for winning the next election than the vote share, which calls for a smaller bandwidth to reduce the estimation bias linked to the linear approximation.

Figures 14 and 15 plot the value of the cross-validation function over a wide range of bandwidths. In the case of the vote share where the linearity assumption appears more accurate (figures 6–8), the cross-validation function is fairly flat over a sizable range of values for the bandwidth (from about 0.16 to 0.29). This range includes the optimal bandwidth suggested by cross-validation (0.282) at the upper end, and the ROT

Table 4
Optimal Bandwidth for Local Linear Regressions, Voting Example

                                                    Share of vote   Win next election
A. Rule-of-thumb bandwidth
   Left                                             0.162           0.164
   Right                                            0.208           0.130
   Both                                             0.180           0.141

B. Optimal bandwidth selected by cross-validation
   Left                                             0.192           0.247
   Right                                            0.282           0.141
   Both                                             0.282           0.172

Notes: Estimated over the range of the forcing variable (Democrat to Republican difference in the share of vote in the previous election) ranging between −0.5 and 0.5. See the text for a description of the rule-of-thumb and cross-validation procedures for choosing the optimal bandwidth.

bandwidth (0.180) at the lower end. In the case of winning the next election (figure 15), the cross-validation procedure yields a sharper suggestion of optimal bandwidth around 0.15, which is quite close to both the optimal cross-validation bandwidth (0.172) and the ROT bandwidth (0.141).

The main difference between the two outcome variables is that larger bandwidths start getting penalized more quickly in the case of winning the election (figure 15) than in the case of the vote share (figure 14). This is consistent with the graphical evidence in figures 6–11. Since the regression function looks fairly linear for the vote share, using larger bandwidths does not get penalized as much since they improve efficiency without generating much of a bias. But in the case of winning the election where the regression function exhibits quite a bit of curvature, larger bandwidths are quickly penalized for introducing an estimation bias. Since there is a real tradeoff between precision and bias, the cross-validation procedure is quite informative. By contrast, there is not much of a tradeoff when the regression function is more or less linear, which explains why the optimal bandwidth is larger in the case of the vote share.

This example also illustrates the importance of first graphing the data before running regressions and trying to choose the optimal bandwidth. When the graph shows a more or less linear relationship, it is natural to expect different bandwidths to yield similar results and the bandwidth selection procedure not to be terribly informative. But when the graph shows substantial curvature, it is natural to expect the results to be more sensitive to the choice of bandwidth and that bandwidth selection procedures will play a more important role in selecting an appropriate empirical specification.

4.3.2 Order of Polynomial in Local Polynomial Modeling

In the case of polynomial regressions, the equivalent to bandwidth choice is the choice of the order of the polynomial regressions. As in the case of local linear regressions, it is advisable to try and report a number of specifications to see to what extent the results are sensitive to the order of the polynomial. For the same reason

[Figure omitted: the cross-validation function (vertical axis, roughly 77.2 to 78.2) plotted against the bandwidth (horizontal axis, 0 to 0.45).]

Figure 14. Cross-Validation Function for Local Linear Regression: Share of Vote at Next Election

[Figure omitted: the cross-validation function (vertical axis, roughly 428 to 435) plotted against the bandwidth (horizontal axis, 0.00 to 0.45).]

Figure 15. Cross-Validation Function for Local Linear Regression: Winning the Next Election

mentioned earlier, it is also preferable to estimate separate regressions on the two sides of the cutoff point.

The simplest way of implementing polynomial regressions and computing standard errors is to run a pooled regression. For example, in the case of a third order polynomial regression, we would have

Y = αl + τD + βl1(X − c) + βl2(X − c)² + βl3(X − c)³
  + (βr1 − βl1)D(X − c) + (βr2 − βl2)D(X − c)² + (βr3 − βl3)D(X − c)³ + ε.

While it is important to report a number of specifications to illustrate the robustness of the results, it is often useful to have some more formal guidance on the choice of the order of the polynomial. Starting with van der Klaauw (2002), one approach has been to use a generalized cross-validation procedure suggested in the literature on nonparametric series estimators.36 One special case of generalized cross-validation (used by Dan A. Black, Jose Galdo, and Smith (2007a), for example), which we also use in our empirical example, is the well known Akaike information criterion (AIC) of model selection. In a regression context, the AIC is given by

AIC = N ln(σ̂²) + 2p,

where σ̂² is the mean squared error of the regression, and p is the number of parameters in the regression model (order of the polynomial plus one for the intercept).

36 See Blundell and Duncan (1998) for a more general discussion of series estimators.

One drawback of this approach is that it does not provide a very good sense of how a particular parametric model (say a cubic model) compares relative to a more general nonparametric alternative. In the context of the RD design, a natural nonparametric alternative is the set of unrestricted means of the outcome variable by bin used to graphically depict the data in section 4.1. Since one virtue of polynomial regressions is that they provide a smoothed version of the graph, it is natural to ask how well the polynomial model fits the unrestricted graph. A simple way of implementing the test is to add the set of bin dummies to the polynomial regression and jointly test the significance of the bin dummies. For example, in a first order polynomial model (linear regression), the test can be computed by including K − 2 bin dummies Bk, for k = 2 to K − 1, in the model

Y = αl + τD + βl1(X − c) + (βr1 − βl1)D(X − c) + Σ_{k=2}^{K−1} ϕk Bk + ε

and testing the null hypothesis that ϕ2 = ϕ3 = … = ϕK−1 = 0. Note that two of the dummies are excluded because of collinearity with the constant and the treatment dummy, D.37 In terms of specification choice procedure, the idea is to add a higher order term to the polynomial until the bin dummies are no longer jointly significant.

Another major advantage of this procedure is that testing whether the bin dummies are significant turns out to be a test for

37 While excluding dummies for the two bins next to the cutoff point yields more interpretable results (τ remains the treatment effect), the test is invariant to the excluded bin dummies, provided that one excluded dummy is on the left of the cutoff point and the other one is on the right (something standard regression packages will automatically do if all K dummies are included in the regression).
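The bin-dummy goodness-of-fit test can be sketched as follows. This is an illustrative implementation on simulated data, not the exact code behind tables 2 and 3: the data-generating process, bin width, and function names are our own assumptions, and we compute a conventional F-statistic from the restricted and unrestricted sums of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative RD data with a *quadratic* conditional mean, cutoff c = 0.
n = 4000
x = rng.uniform(-0.5, 0.5, n)
d = (x >= 0).astype(float)
y = 0.1 * d + 0.5 * x + 1.5 * x**2 + rng.normal(0, 0.1, n)

def design(order):
    """Polynomial in (X - c), fully interacted with the treatment dummy."""
    cols = [np.ones(n), d]
    for p in range(1, order + 1):
        cols += [x**p, d * x**p]
    return np.column_stack(cols)

def ssr(X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

# Bin dummies of width 0.01; drop the bin on each side of the cutoff to
# avoid collinearity with the constant and the treatment dummy.
edges = np.arange(-0.5, 0.5 + 1e-9, 0.01)
bins = np.digitize(x, edges)
drop = (np.digitize(-0.005, edges), np.digitize(0.005, edges))
B = np.column_stack([(bins == b).astype(float)
                     for b in np.unique(bins) if b not in drop])

def gof_fstat(order):
    """F-statistic for the joint significance of the bin dummies added to
    an order-`order` polynomial specification."""
    Xr = design(order)                    # restricted: polynomial only
    Xu = np.column_stack([Xr, B])         # unrestricted: polynomial + bins
    q = B.shape[1]
    s_r, s_u = ssr(Xr), ssr(Xu)
    return ((s_r - s_u) / q) / (s_u / (n - Xu.shape[1]))

f1, f2 = gof_fstat(1), gof_fstat(2)   # linear vs. quadratic specification
print(f1, f2)
```

Because the simulated conditional mean is quadratic, the bin dummies add substantial explanatory power to the linear specification but not to the quadratic one, which is exactly the pattern the test is designed to detect.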

the presence of discontinuities in the regression function at points other than the cutoff point. In that sense, it provides a falsification test of the RD design by examining whether there are other unexpected discontinuities in the regression function at randomly chosen points (the bin thresholds). To see this, rewrite Σ_{k=1}^{K} ϕk Bk as

Σ_{k=1}^{K} ϕk Bk = ϕ1 + Σ_{k=2}^{K} (ϕk − ϕk−1) Bk⁺,

where Bk⁺ = Σ_{j=k}^{K} Bj is a dummy variable indicating that the observation is in bin k or above, i.e., that the assignment variable X is above the bin cutoff bk. Testing whether all the ϕk − ϕk−1 are equal to zero is equivalent to testing that all the ϕk are the same (the above test), which amounts to testing that the regression line does not jump at the bin thresholds bk.

Tables 2 and 3 show the estimates of the treatment effect for the voting example. For the sake of completeness, a wide range of bandwidths and specifications are presented, along with the corresponding p-values for the goodness-of-fit test discussed above (a bandwidth of 0.01 is used for the bins used to construct the test). We also indicate at the bottom of the tables the order of the polynomial selected for each bandwidth using the AIC. Note that the estimates of the treatment effect for the “order zero” polynomials are just comparisons of means on the two sides of the cutoff point, while the estimates for the “order one” polynomials are based on (local) linear regressions.

Broadly speaking, the goodness-of-fit tests do a very good job ruling out clearly misspecified models, like the zero order polynomials with large bandwidths that yield upward biased estimates of the treatment effect. Estimates from models that pass the goodness-of-fit test mostly fall in the 0.05–0.10 range for the vote share (table 2) and 0.37–0.57 for the probability of winning (table 3). One set of models the goodness-of-fit test does not rule out, however, is higher order polynomial models with small bandwidths that tend to be imprecisely estimated as they “overfit” the data.

Looking informally at both the fit of the model (goodness-of-fit test) and the precision of the estimates (standard errors) suggests the following strategy: use higher order polynomials for large bandwidths of 0.50 and more, lower order polynomials for bandwidths between 0.05 and 0.50, and zero order polynomials (comparisons of means) for bandwidths of less than 0.05, since the latter specification passes the goodness-of-fit test for these very small bandwidths. Interestingly, this informal approach more or less corresponds to what is suggested by the AIC. In this specific example, it seems that given a specific bandwidth, the AIC provides reasonable suggestions on which order of the polynomial to use.

4.3.3 Estimation in the Fuzzy RD Design

As discussed earlier, in both the “sharp” and the “fuzzy” RD designs, the probability of treatment jumps discontinuously at the cutoff point. Unlike the case of the sharp RD where the probability of treatment jumps from 0 to 1 at the cutoff, in the fuzzy RD case, the probability jumps by less than one. In other words, treatment is not solely determined by the strict cutoff rule in the fuzzy RD design. For example, even if eligibility for a treatment solely depends on a cutoff rule, not all the eligibles may get the treatment because of imperfect compliance. Similarly, program eligibility may be extended in some cases even when the cutoff rule is not satisfied. For example, while Medicare eligibility is mostly determined by a cutoff rule (age 65 or older), some disabled individuals under the age of 65 are also eligible.

Since we have already discussed the interpretation of estimates of the treatment effect

in a fuzzy RD design in section 3.4.1, here we just focus on estimation and implementation issues. The key message to remember from the earlier discussion is that, as in a standard IV framework, the estimated treatment effect can be interpreted as a local average treatment effect, provided monotonicity holds.

In the fuzzy RD design, we can write the probability of treatment as

Pr(D = 1 | X = x) = γ + δT + g(x − c),

where T = 1[X ≥ c] indicates whether the assignment variable exceeds the eligibility threshold c.38 Note that the sharp RD is a special case where γ = 0, g(∙) = 0, and δ = 1. It is advisable to draw a graph for the treatment dummy D as a function of the assignment variable X using the same procedure discussed in section 4.1. This provides an informal way of seeing how large the jump in the treatment probability δ is at the cutoff point, and what the functional form g(∙) looks like.

38 Although the probability of treatment is modeled as a linear probability model here, this does not impose any restrictions on the probability model since g(x − c) is unrestricted on both sides of the cutoff c, while T is a dummy variable. So there is no need to write the model using a probit or logit formulation.

Since D = Pr(D = 1 | X = x) + ν, where ν is an error term independent of X, the fuzzy RD design can be described by the two equation system:

(10) Y = α + τD + f(X − c) + ε,

(11) D = γ + δT + g(X − c) + ν.

Looking at these equations suggests estimating the treatment effect τ by instrumenting the treatment dummy D with T. Note also that substituting the treatment determining equation into the outcome equation yields the reduced form

(12) Y = αr + τr T + fr(X − c) + εr,

where τr = τδ. In this setting, τr can be interpreted as an “intent-to-treat” effect.

Estimation in the fuzzy RD design can be performed using either the local linear regression approach or polynomial regressions. Since the model is exactly identified, 2SLS estimates are numerically identical to the ratio of reduced form coefficients τr/δ, provided that the same bandwidth is used for equations (11) and (12) in the local linear regression case, and that the same order of polynomial is used for g(∙) and f(∙) in the polynomial regression case.

In the case of the local linear regression, Imbens and Lemieux (2008) recommend using the same bandwidth in the treatment and outcome regression. When we are close to a sharp RD design, the function g(∙) is expected to be very flat and the optimal bandwidth to be very wide. In contrast, there is no particular reason to expect the function f(∙) in the outcome equation to be flat or linear, which suggests the optimal bandwidth would likely be less than the one for the treatment equation. As a result, Imbens and Lemieux (2008) suggest focusing on the outcome equation for selecting bandwidth, and then using the same bandwidth for the treatment equation.

While using a wider bandwidth for the treatment equation may be advisable on efficiency grounds, there are two practical reasons that suggest not doing so. First, using different bandwidths complicates the computation of standard errors since the outcome and treatment samples used for the estimation are no longer the same, meaning the usual 2SLS standard errors are no longer valid. Second, since it is advisable to explore the sensitivity of results to changes in the bandwidth, “trying out” separate bandwidths for each of the two equations would lead to a large and difficult-to-interpret number of specifications.
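The numerical equivalence between 2SLS and the ratio of reduced form coefficients τr/δ is easy to verify in a sketch. The simulation below is illustrative (the take-up probabilities, bandwidth, and variable names are our own assumptions); it uses a common bandwidth and the same controls in both equations, as recommended above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative fuzzy RD: crossing the cutoff raises treatment take-up by 0.6.
n = 5000
x = rng.uniform(-1, 1, n)
t = (x >= 0).astype(float)                                  # T = 1[X >= c]
d = (rng.uniform(0, 1, n) < 0.2 + 0.6 * t).astype(float)    # imperfect compliance
y = 1.0 * d + 0.4 * x + rng.normal(0, 0.5, n)               # true tau = 1.0

h = 0.5                              # one common bandwidth for both equations
w = np.abs(x) <= h
ones = np.ones(int(w.sum()))
controls = np.column_stack([ones, x[w], t[w] * x[w]])       # separate slopes

def reduced_form_jump(z):
    """Coefficient on T from a local linear regression of z on T and the
    slope controls: the jump in E[z | X] at the cutoff."""
    X = np.column_stack([t[w], controls])
    return np.linalg.lstsq(X, z[w], rcond=None)[0][0]

tau_r = reduced_form_jump(y)         # jump in the outcome (intent-to-treat)
delta = reduced_form_jump(d)         # jump in the treatment probability
tau_ratio = tau_r / delta

# Direct 2SLS: instrument D with T, same controls and same bandwidth.
Z = np.column_stack([t[w], controls])                       # instruments
Xs = np.column_stack([d[w], controls])                      # regressors
tau_2sls = np.linalg.solve(Z.T @ Xs, Z.T @ y[w])[0]

print(tau_ratio, tau_2sls)
```

With the same bandwidth and controls in both equations, tau_ratio and tau_2sls agree up to floating-point error; with different bandwidths the equivalence, and the usual 2SLS standard errors, would break down.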

The same broad arguments can be used in the case of local polynomial regressions. In principle, a lower order of polynomial could be used for the treatment equation (11) than for the outcome equation (12). In practice, however, it is simpler to use the same order of polynomial and just run 2SLS (and use 2SLS standard errors).

4.3.4 How to Compute Standard Errors?

As discussed above, for inference in the sharp RD case we can use standard least squares methods. As usual, it is recommended to use heteroskedasticity-robust standard errors (Halbert White 1980) instead of standard least squares standard errors. One additional reason for doing so in the RD case is to ensure the standard error of the treatment effect is the same when either a pooled regression or two separate regressions on each side of the cutoff are used to compute the standard errors. As we just discussed, it is also straightforward to compute standard errors in the fuzzy RD case using 2SLS methods, although robust standard errors should also be used in this case. Imbens and Lemieux (2008) propose an alternative way of computing standard errors in the fuzzy RD case, but nonetheless suggest using 2SLS standard errors readily available in econometric software packages.

One small complication that arises in the nonparametric case of local linear regressions is that the usual (robust) standard errors from least squares are only valid provided that h ∝ N^−δ with 1/5 < δ < 2/5. As we mentioned earlier, this is not a very important point in practice, and the usual standard errors can be used with local linear regressions.

4.4 Implementing Empirical Tests of RD Validity and Using Covariates

In this part of the section, we describe how to implement tests of the validity of the RD design and how to incorporate covariates in the analysis.

4.4.1 Inspection of the Histogram of the Assignment Variable

Recall that the underlying assumption that generates the local random assignment result is that each individual has imprecise control over the assignment variable, as defined in section 3.1.1. We cannot test this directly (since we will only observe one observation on the assignment variable per individual at a given point in time), but an intuitive test of this assumption is whether the aggregate distribution of the assignment variable is discontinuous, since a mixture of individual-level continuous densities is itself a continuous density.

McCrary (2008) proposes a simple two-step procedure for testing whether there is a discontinuity in the density of the assignment variable. In the first step, the assignment variable is partitioned into equally spaced bins and frequencies are computed within those bins. The second step treats the frequency counts as a dependent variable in a local linear regression. See McCrary (2008), who adopts the nonparametric framework for asymptotics, for details on this procedure for inference.

As McCrary (2008) points out, this test can fail to detect a violation of the RD identification condition if for some individuals there is a “jump” up in the density, offset by jumps “down” for others, making the aggregate density continuous at the threshold. McCrary (2008) also notes it is possible the RD estimate could remain unbiased, even when there is important manipulation of the assignment variable causing a jump in the density. It should be noted, however, that in order to rely upon the RD estimate as unbiased, one needs to invoke other identifying assumptions and cannot rely upon the mild conditions we focus on in this article.39

39 McCrary (2008) discusses an example where students who barely fail a test are given extra points so that they barely pass. The RD estimator can remain unbiased if one assumes that those who are given extra points were chosen randomly from those who barely failed.
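The two-step procedure can be sketched as follows for a simulated assignment variable with a smooth density. This is a simplified illustration: the bin width, second-step bandwidth, and triangular kernel are our own choices, and we omit the bias corrections and inference derived in McCrary (2008):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assignment variable with NO manipulation at the cutoff c = 0.
draws = rng.normal(0, 0.3, 20000)
x = draws[(draws > -0.5) & (draws < 0.5)]
c, binwidth = 0.0, 0.01

# Step 1: partition the assignment variable into equally spaced bins
# and compute the frequency in each bin.
edges = np.arange(-0.5, 0.5 + 1e-9, binwidth)
counts, _ = np.histogram(x, bins=edges)
mids = edges[:-1] + binwidth / 2

# Step 2: treat the frequency counts as the dependent variable in a
# local linear regression on each side, evaluated at the cutoff.
def density_at_cutoff(side_mask, h=0.2):
    u = mids[side_mask] - c
    k = np.clip(1 - np.abs(u) / h, 0, None)      # triangular kernel weights
    X = np.column_stack([np.ones(u.size), u])
    XtW = X.T * k                                # weighted X'
    beta = np.linalg.solve(XtW @ X, XtW @ counts[side_mask])
    return beta[0]                               # fitted frequency at the cutoff

f_left = density_at_cutoff(mids < c)
f_right = density_at_cutoff(mids >= c)
log_jump = np.log(f_right) - np.log(f_left)      # near 0 if density is smooth
print(f_left, f_right, log_jump)
```

A large value of log_jump relative to its sampling error would be evidence of a discontinuity in the density, and hence of possible manipulation of the assignment variable.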

[Figure omitted: the density of the forcing variable (vertical axis, 0.00 to 2.50) plotted against the forcing variable (horizontal axis, −0.5 to 0.5).]

Figure 16. Density of the Forcing Variable (Vote Share in Previous Election)

One of the examples McCrary uses for his test is the voting model of Lee (2008) that we used in the earlier empirical examples. Figure 16 shows a graph of the raw densities computed over bins with a bandwidth of 0.005 (200 bins in the graph), along with a smooth second order polynomial model. Consistent with McCrary (2008), the graph shows no evidence of discontinuity at the cutoff. McCrary also shows that a formal test fails to reject the null hypothesis of no discontinuity in the density at the cutoff.

4.4.2 Inspecting Baseline Covariates

An alternative approach for testing the validity of the RD design is to examine whether the observed baseline covariates are “locally” balanced on either side of the threshold, which should be the case if the treatment indicator is locally randomized. A natural thing to do is conduct both a graphical RD analysis and a formal estimation, replacing the dependent variable with each of the observed baseline covariates in W. A discontinuity would indicate a violation in the underlying assumption that predicts local random assignment. Intuitively, if the RD design is valid, we know that the treatment variable cannot influence variables determined prior to the realization of the assignment variable and treatment assignment; if we observe it does, something is wrong in the design.

If there are many covariates in W, even abstracting from the possibility of misspecification of the functional form, some discontinuities will be statistically significant by random chance. It is thus useful to combine the multiple tests into a single test statistic to see if the data are consistent with no discontinuities for any of the observed covariates. A simple way to do this is with a Seemingly Unrelated Regression (SUR) where each equation represents a different baseline

covariate, and then perform a χ² test for the discontinuity gaps in all equations being zero. For example, supposing the underlying functional form is linear, one would estimate the system

w1 = α1 + Dβ1 + Xγ1 + ε1
⋮
wK = αK + DβK + XγK + εK

and test the hypothesis that β1, …, βK are jointly equal to zero, where we allow the ε’s to be correlated across the K equations. Alternatively, one can simply use the OLS estimates of β1, …, βK obtained from a “stacked” regression where all the equations for each covariate are pooled together, while D and X are fully interacted with a set of K dummy variables (one for each covariate wk). Correlation in the error terms can then be captured by clustering the standard errors on individual observations (which appear in the stacked dataset K times). Under the null hypothesis of no discontinuities, the Wald test statistic Nβ̂′V̂⁻¹β̂ (where β̂ is the vector of estimates of β1, …, βK, and V̂ is the cluster-and-heteroskedasticity consistent estimate of the asymptotic variance of β̂) converges in distribution to a χ² with K degrees of freedom.

Of course, the importance of functional form for RD analysis means a rejection of the null hypothesis tells us either that the underlying assumptions for the RD design are invalid, or that at least some of the equations are sufficiently misspecified and too restrictive, so that nonzero discontinuities are being estimated, even though they do not exist in the population. One could use the parametric specification tests discussed earlier for each of the individual equations to see if misspecification of the functional form is an important problem. Alternatively, the test could be performed only for observations within a narrower window around the cutoff point, such as the one suggested by the bandwidth selection procedures discussed in section 4.3.1.

Figure 17 shows the RD graph for a baseline covariate, the Democratic vote share in the election prior to the one used for the assignment variable (four years prior to the current election). Consistent with Lee (2008), there is no indication of a discontinuity at the cutoff. The actual RD estimate using a quartic model is −0.004 with a standard error of 0.014. Very similar results are obtained using winning the election as the outcome variable instead (RD estimate of −0.003 with a standard error of 0.017).

4.5 Incorporating Covariates in Estimation

If the RD design is valid, the other use for the baseline covariates is to reduce the sampling variability in the RD estimates. We discuss two simple ways to do this. First, one can “residualize” the dependent variable—subtract from Y a prediction of Y based on the baseline covariates W—and then conduct an RD analysis on the residuals. Intuitively, this procedure nets out the portion of the variation in Y we could have predicted using the predetermined characteristics, making the question whether the treatment variable can explain the remaining residual variation in Y. The important thing to keep in mind is that if the RD design is valid, this procedure provides a consistent estimate of the same RD parameter of interest. Indeed, any combination of covariates can be used, and abstracting from functional form issues, the estimator will be consistent for the same parameter, as discussed above in equation (4). Importantly, this two-step approach also allows one to perform a graphical analysis of the residual.

To see this more formally in the parametric case, suppose one is willing to assume that the expectation of Y as a function of X is a polynomial, and the expectation of each

[Figure omitted: the share of vote in the prior election (vertical axis, 0.20 to 0.80) plotted against the forcing variable (horizontal axis, −0.5 to 0.5).]

Figure 17. Discontinuity in Baseline Covariate (Share of Vote in Prior Election)

element of W is also a polynomial function of X. This implies

(13) Y = Dτ + X̃γ̃ + ε
     W = X̃δ + u,

where X̃ is a vector of polynomial terms in X, δ and u are of conformable dimension, and ε and u are by construction orthogonal to D and X̃. It follows that

(14) Y − Wπ = Dτ + X̃γ̃ − Wπ + ε
            = Dτ + X̃(γ̃ − δπ) − uπ + ε
            = Dτ + X̃γ − uπ + ε.

This makes clear that a regression of Y − Wπ on D and X̃ will give consistent estimates of τ and γ. This is true no matter the value of π. Furthermore, as long as the specification in equation (13) is correct, in computing estimated standard errors in the second step, one can ignore the fact that the first step was estimated.40

40 The two-step procedure solves the sample analogue to the following set of moment equations:

E[(D, X̃)′(Y − Wπ0 − Dτ − X̃γ)] = 0
E[W′(Y − Wπ0)] = 0.

As noted above, the second-step estimator for τ is consistent for any value of π. Letting θ ≡ (τ, γ′)′, and using the notation of Whitney K. Newey and Daniel L. McFadden (1994), this means that the first row of ∇πθ(π0) = −Gθ⁻¹Gπ is a row of zeros. It follows from their theorem 6.1, with the 1,1 element of V being the asymptotic variance of the estimator for τ, that the 1,1 element of V is equal to the 1,1 element of Gθ⁻¹E[g(z)g(z)′]Gθ⁻¹′, which is the asymptotic covariance matrix of the second stage estimator ignoring estimation in the first step.

The second approach—which uses the same assumptions implicit in equation (13)—is to simply add W to the regression. While this may seem to impose linearity in how W

affects Y, it can be shown that the inclusion of these regressors will not affect the consistency of the estimator for τ.41 The advantage of this second approach is that under these functional form assumptions and with homoskedasticity, the estimator for τ is guaranteed to have a lower asymptotic variance.42 By contrast, the "residualizing" approach can in some cases raise standard errors.43

The disadvantage of solely relying upon this second approach, however, is that it does not help distinguish between an inappropriate functional form and discontinuities in W, as both could potentially cause the estimates of τ to change significantly when W is included.44 On the other hand, the "residualizing" approach allows one to examine how well the residuals fit the assumed order of polynomial (using, for example, the methods described in subsection 4.3.2). If it does not fit well, then it suggests that the use of that order of polynomial with the second approach is not justified. Overall, one sensible approach is to directly enter the covariates, but then to use the "residualizing" approach as an additional diagnostic check on whether the assumed order of the polynomial is justified.

As discussed earlier, an alternative approach to estimating the discontinuity involves limiting the estimation to a window of data around the threshold and using a linear specification within that window.45 We note that as the neighborhood shrinks, the true expectation of W conditional on X will become closer to being linear, and so equation (13) (with X̃ containing only the linear term) will become a better approximation.

For the voting example used throughout this paper, Lee (2008) shows that adding a set of covariates essentially has no impact on the RD estimates in the model where the outcome variable is winning the next election. Doing so does not have a large impact on the standard errors either, at least up to the third decimal. Using the procedure based on residuals instead actually slightly increases the second step standard errors—a possibility mentioned above. Therefore in this particular example, the main advantage of using baseline covariates is to help establish the validity of the RD design, as opposed to improving the efficiency of the estimators.

4.6 A Recommended "Checklist" for Implementation

Below is a brief summary of our recommendations for the analysis, presentation, and estimation of RD designs.

41 To see this, rewrite equation (13) as Y = Dτ + X̃γ̃ + Da + X̃b + Wc + μ, where a, b, c, and μ are linear projection coefficients and the residual from a population regression of ε on D, X̃, and W. If a = 0, then adding W will not affect the coefficient on D. This will be true—applying the Frisch–Waugh theorem—when the covariance between ε and D − X̃d − We (where d and e are coefficients from projecting D on X̃ and W) is zero. This will be true when e = 0, because ε is by assumption orthogonal to both D and X̃. Applying the Frisch–Waugh theorem again, e is the coefficient obtained by regressing D on W − X̃δ ≡ u; by assumption u and D are uncorrelated, so e = 0.
42 The asymptotic variance for the least squares estimator (without including W) of τ is given by the ratio V(ε)/V(D̃), where D̃ is the residual from the population regression of D on X̃. If W is included, then the least squares estimator has asymptotic variance of σ²/V(D − X̃d − We), where σ² is the variance of the error when W is included, and d and e are coefficients from projecting D on X̃ and W. σ² cannot exceed V(ε), and as shown in the footnote above, e = 0, and thus D − X̃d = D̃, implying that the denominator in the ratio does not change when W is included.
43 From equation (14), the regression error variance will increase if V(ε − uπ) > V(ε) ⇔ V(uπ) − 2C(ε, uπ) > 0, which will hold when, for example, ε is orthogonal to u and π is nonzero.
44 If the true equation for W contains more polynomial terms than X̃, then e, as defined in the preceding footnotes (the coefficient obtained by regressing D on the residual from projecting W on X̃), will not be zero. This implies that including W will generally lead to inconsistent estimates of τ, and may cause the asymptotic variance to increase (since V(D − X̃d − We) ≤ V(D̃)).
45 And we have noted that one can justify this by assuming that in that specified neighborhood, the underlying function is in fact linear, and make standard parametric inferences. Or one can conduct a nonparametric inference approach by making assumptions about the rate at which the bandwidth shrinks as the sample size grows.
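Both the local linear window estimator just described and the binned local averages used for graphing are short routines in practice. A sketch follows (the bin width, bandwidth, rectangular kernel, and all names are our own illustrative assumptions; the paper does not prescribe these particular choices):

```python
import numpy as np

def binned_means(x, y, cutoff=0.0, bin_width=0.05):
    """Means of y within fixed, nonoverlapping bins of x; bins are anchored
    at the cutoff so that no bin straddles it."""
    left = np.arange(cutoff, x.min() - bin_width, -bin_width)[::-1]
    right = np.arange(cutoff, x.max() + bin_width, bin_width)
    edges = np.concatenate([left[:-1], right])   # the cutoff appears once
    centers, means = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (x >= lo) & (x < hi)
        if in_bin.any():
            centers.append((lo + hi) / 2.0)
            means.append(y[in_bin].mean())
    return np.array(centers), np.array(means)

def llr_jump(x, v, h, cutoff=0.0):
    """Local linear estimate of the discontinuity in E[v|x] at the cutoff:
    separate linear fits within a window of width h on each side
    (rectangular kernel for simplicity)."""
    def boundary_intercept(side):
        Z = np.column_stack([np.ones(side.sum()), x[side] - cutoff])
        return np.linalg.lstsq(Z, v[side], rcond=None)[0][0]
    left = (x < cutoff) & (x >= cutoff - h)
    right = (x >= cutoff) & (x <= cutoff + h)
    return boundary_intercept(right) - boundary_intercept(left)
```

The same llr_jump function, applied to a predetermined covariate instead of the outcome, or evaluated over a grid of bandwidths, produces the balance checks and sensitivity plots recommended in the checklist that follows.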

  1. To assess the possibility of manipulation of the assignment variable, show its distribution. The most straightforward thing to do is to present a histogram of the assignment variable, using a fixed number of bins. The bin widths should be as small as possible, without compromising the ability to visually see the overall shape of the distribution. For an example, see figure 16. The bin-to-bin jumps in the frequencies can provide a sense in which any jump at the threshold is "unusual." For this reason, we recommend against plotting a smooth function comprised of kernel density estimates. A more formal test of a discontinuity in the density can be found in McCrary (2008).

  2. Present the main RD graph using binned local averages. As with the histogram, we recommend using a fixed number of nonoverlapping bins, as described in subsection 4.1. For examples, see figures 6–11. The nonoverlapping nature of the bins for the local averages is important; we recommend against simply presenting a continuum of nonparametric estimates (with a single break at the threshold), as this will naturally tend to give the impression of a discontinuity even if there does not exist one in the population. We recommend reporting bandwidths implied by cross-validation, as well as the range of widths that are not statistically rejected in favor of strictly less restrictive alternatives (for an example, see table 1). We recommend generally "undersmoothing," while at the same time avoiding "too narrow" bins that produce a scatter of data points, from which it is difficult to see the shape of the underlying function. Indeed, we recommend against simply plotting the raw data without a minimal amount of local averaging.

  3. Graph a benchmark polynomial specification. Superimpose onto the graph the predicted values from a low-order polynomial specification (see figures 6–11). One can often informally assess by comparing the two functions whether a simple polynomial specification is an adequate summary of the data. If the local averages represent the most flexible "nonparametric" representation of the function, the polynomial represents a "best case" scenario in terms of the variance of the RD estimate, since if the polynomial specification is correct, under certain conditions, the least squares estimator is efficient.

  4. Explore the sensitivity of the results to a range of bandwidths, and a range of orders to the polynomial. For an example, see tables 2 and 3. The tables should be supplemented with information on the implied rule-of-thumb bandwidth and cross-validation bandwidths for local linear regression (as in table 4), as well as the AIC-implied optimal order of the polynomial. The specification tests that involve adding bin dummies to the polynomial specifications can help rule out overly restrictive specifications. Among all the specifications that are not rejected by the bin-dummy tests, and among the polynomial orders recommended by the AIC, and the estimates given by both rule of thumb and CV bandwidths, report a "typical" point estimate and a range of point estimates. A useful graphical device for illustrating the sensitivity of the results to bandwidths is to plot the local linear discontinuity estimate against a continuum of bandwidths (within a range of bandwidths that are not ruled out by the above specification tests). For an example

Figure 18. Local Linear Regression with Varying Bandwidth: Share of Vote at Next Election
(Estimated treatment effect plotted against bandwidth, with 95 percent confidence bands; the LLR estimate of the treatment effect and a quadratic fit are shown.)

of such a presentation, see the online appendix to Card, Carlos Dobkin, and Nicole Maestas (2009), and figure 18.

  5. Conduct a parallel RD analysis on the baseline covariates. As discussed earlier, if the assumption that there is no precise manipulation or sorting of the assignment variable is valid, then there should be no discontinuities in variables that are determined prior to the assignment. See figure 17, for example.

  6. Explore the sensitivity of the results to the inclusion of baseline covariates. As discussed above, the inclusion of baseline covariates—no matter how highly correlated they are with the outcome—should not affect the estimated discontinuity, if the no-manipulation assumption holds. If the estimates do change in an important way, it may indicate a potential sorting of the assignment variable that may be reflected in a discontinuity in one or more of the baseline covariates. In terms of implementation, in subsection 4.5, we suggest simply including the covariates directly, after choosing a suitable order of polynomial. Significant changes in the estimated effect or increases in the standard errors may be an indication of a misspecified functional form. Another check is to perform the "residualizing" procedure suggested there, to see if that same order of polynomial provides a good fit for the residuals, using the specification tests from point 4.

We recognize that, due to space limitations, researchers may be unable to present every

permutation of presentation (e.g., points 2–4 for every one of 20 baseline covariates) within a published article. Nevertheless, we do believe that documenting the sensitivity of the results to this array of tests and alternative specifications—even if they only appear in unpublished, online appendices—is an important component of a thorough RD analysis.

5. Special Cases

In this section, we discuss how the RD design can be implemented in a number of specific cases beyond the one considered up to this point (that of a single cross-section with a continuous assignment variable).

5.1 Discrete Assignment Variable and Specification Errors

Up until now, we have assumed the assignment variable was continuous. In practice, however, X is often discrete. For example, age or date of birth are often only available at a monthly, quarterly, or annual frequency level. Studies relying on an age-based cutoff thus typically rely on discrete values of the age variable when implementing an RD design.

Lee and Card (2008) study this case in detail and make a number of important points. First, with a discrete assignment variable, it is not possible to compare outcomes in very narrow bins just to the right and left of the cutoff point. Consequently, one must use regressions to estimate the conditional expectation of the outcome variable at the cutoff point by extrapolation. As discussed in section 4, however, in practice we always extrapolate to some extent, even in the case of a continuous assignment variable. So the fact that we must do so in the case of a discrete assignment variable does not introduce particular complications from an econometric point of view, provided the discrete variable is not too coarsely distributed.

Additionally, the various estimation and graphing techniques discussed in section 4 can readily be used in the case of a discrete assignment variable. For instance, as with a continuous assignment variable, either local linear regressions or polynomial regressions can be used to estimate the jump in the regression function at the cutoff point. Furthermore, the discreteness of the assignment variable simplifies the problem of bandwidth choice when graphing the data since, in most cases, one can simply compute and graph the mean of the outcome variable for each value of the discrete assignment variable. The fact that the variable is discrete also provides a natural way of testing whether the regression model is well specified by comparing the fitted model to the raw dispersion in mean outcomes at each value of the assignment variable. Lee and Card (2008) show that, when errors are homoskedastic, the model specification can be tested using the standard goodness-of-fit statistic

G ≡ [(ESSR − ESSUR)/(J − K)] / [ESSUR/(N − J)],

where ESSR is the estimated sum of squares of the restricted model (e.g., a low order polynomial), while ESSUR is the estimated sum of squares of the unrestricted model in which a full set of dummies (one for each value of the assignment variable) is included. In this unrestricted model, the fitted regression corresponds to the mean outcome in each cell. G follows an F(J − K, N − J) distribution, where J is the number of values taken by the assignment variable and K is the number of parameters of the restricted model.

This test is similar to the test in section 4 where we suggested including a full set of bin dummies in the regression model and testing whether the bin dummies were jointly significant. The procedure is even

simpler here, as bin dummies are replaced by dummies for each value of the discrete assignment variable. In the presence of heteroskedasticity, the goodness-of-fit test can be computed by estimating the model and testing whether a set of dummies for each value of the discrete assignment variable are jointly significant. In that setting, the test statistic follows a chi-square distribution with J − K degrees of freedom.

In Lee and Card (2008), the difference between the true conditional expectation E[Y | X = x] and the estimated regression function forming the basis of the goodness-of-fit test is interpreted as a random specification error that introduces a group structure in the standard errors. One way of correcting the standard errors for group structure is to run the model on cell means.46 Another way is to "cluster" the standard errors. Note that in this setting, the goodness-of-fit test can also be interpreted as a test of whether standard errors should be adjusted for the group structure. In practice, it is nonetheless advisable to either group the data or cluster the standard errors in micro-data models irrespective of the results of the goodness-of-fit test. The main purpose of the test should be to help choose a reasonably accurate regression model.

Lee and Card (2008) also discuss a number of issues, including what to do when specification errors under treatment and control are correlated, and how to possibly adjust the RD estimates in the presence of specification errors. Since these issues are beyond the scope of this paper, interested readers should consult Lee and Card (2008) for more detail.

5.2 Panel Data and Fixed Effects

In some situations, the RD design will be embedded in a panel context, whereby period by period, the treatment variable is determined according to the realization of the assignment variable X. Again, it seems natural to propose the model

Yit = Ditτ + f(Xit; γ) + ai + εit

(where i and t denote the individuals and time, respectively), and simply estimate a fixed effects regression by including individual dummy variables to capture the unit-specific error component, ai. It is important to note, however, that including fixed effects is unnecessary for identification in an RD design. This sharply contrasts with a more traditional panel data setting where the error component ai is allowed to be correlated with the observed covariates, including the treatment variable Dit, in which case including fixed effects is essential for consistently estimating the treatment effect τ.

An alternative is to simply conduct the RD analysis for the entire pooled cross-section dataset, taking care to account for within-individual correlation of the errors over time using clustered standard errors. The source of identification is a comparison between those just below and above the threshold, and the analysis can be carried out with a single cross-section. Therefore, imposing a specific dynamic structure introduces more restrictions without any gain in identification.

Time dummies can also be treated like any other baseline covariate. This is apparent by applying the main RD identification result: conditional on what period it is, we are assuming the density of X is continuous at the threshold and, hence, conditional on X, the probability of an individual observation coming from a particular period is also continuous.

46 When the discrete assignment variable—and the "treatment" dummy solely dependent on this variable—is the only variable used in the regression model, standard OLS estimates will be numerically equivalent to those obtained by running a weighted regression on the cell means, where the weights are the number of observations (or the sum of individual weights) in each cell.
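The goodness-of-fit statistic above is straightforward to compute from the two residual sums of squares. The following sketch (simulated data and all names are our own illustration) fits a correctly specified restricted model with a treatment dummy to a discrete assignment variable and compares it to the cell means:

```python
import numpy as np

def goodness_of_fit_G(x, y, fitted_restricted, k):
    """Lee-Card goodness-of-fit statistic for a discrete assignment variable.
    Compares the restricted model's fit to cell means (the unrestricted model,
    one dummy per value of x); compare G to an F(J - k, N - J) critical value."""
    n = y.size
    values, inverse = np.unique(x, return_inverse=True)
    j = values.size
    # unrestricted fit: the mean outcome within each cell of the discrete x
    cell_means = np.bincount(inverse, weights=y) / np.bincount(inverse)
    ess_r = np.sum((y - fitted_restricted) ** 2)
    ess_ur = np.sum((y - cell_means[inverse]) ** 2)
    return ((ess_r - ess_ur) / (j - k)) / (ess_ur / (n - j))

# Illustration: a correctly specified linear model should give G near 1.
rng = np.random.default_rng(3)
x = np.repeat(np.arange(-10, 10), 200)                # 20 discrete values
d = (x >= 0).astype(float)                            # treatment dummy
y = 0.5 * d + 0.1 * x + rng.normal(0.0, 1.0, x.size)
Z = np.column_stack([np.ones(x.size), d, x.astype(float)])
fitted = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
G = goodness_of_fit_G(x, y, fitted, k=3)              # here J = 20, K = 3
```

In this illustration G would be compared with an F(17, 3980) critical value; under a correct specification it hovers around 1, while a large value signals that the restricted polynomial is too restrictive.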

We note that it becomes a little bit more awkward to use the justification proposed in subsection 4.5 for directly including dummies for individuals and time periods on the right hand side of the regression. This is because the assumption would have to be that the probability that an observation belonged to each individual (or the probability that an observation belonged to each time period) is a polynomial function in X and, strictly speaking, nontrivial polynomials are not bounded between 0 and 1.

A more practical concern is that inclusion of individual dummy variables may lead to an increase in the variance of the RD estimator for another reason. If there is little "within-unit" variability in treatment status, then the variation in the main variable of interest (treatment after partialling out the individual heterogeneity) may be quite small. Indeed, seeing standard errors rise when including fixed effects may be an indication of a misspecified functional form.47

Overall, since the RD design is still valid ignoring individual or time effects, the only rationale for including them is to reduce sampling variance. But there are other ways to reduce sampling variance by exploiting the structure of panel data. For instance, we can treat the lagged dependent variable Yit−1 as simply another baseline covariate in period t. In cases where Yit is highly persistent over time, Yit−1 may well be a very good predictor and has a very good chance of reducing the sampling error. As we have also discussed earlier, looking at possible discontinuities in baseline covariates is an important test of the validity of the RD design. In this particular case, since Yit can be highly correlated with Yit−1, finding a discontinuity in Yit but not in Yit−1 would be a strong piece of evidence supporting the validity of the RD design.

In summary, one can utilize the panel nature of the data by conducting an RD analysis on the entire dataset, using lagged variables as baseline covariates for inclusion as described in subsection 4.5. The primary caution in doing this is to ensure that, for each period, the included covariates are the variables determined prior to the present period's realization of Xit.

6. Applications of RD Designs in Economics

In what areas has the RD design been applied in economic research? Where do discontinuous rules come from and where might we expect to find them? In this section, we provide some answers to these questions by surveying the areas of applied economic research that have employed the RD design. Furthermore, we highlight some examples from the literature that illustrate what we believe to be the most important elements of a compelling, "state-of-the-art" implementation of RD.

6.1 Areas of Research Using RD

As we suggested in the introduction, the notion that the RD design has limited applicability to a few specific topics is inconsistent with our reading of existing applied research in economics. Table 5 summarizes our survey of empirical studies on economic topics that have utilized the RD design. In compiling this list, we searched economics journals as well as listings of working papers from economists, and chose any study that recognized the potential use of an RD design in its given setting. We also included some papers from non-economists when the research was closely related to economic work.

Even with our undoubtedly incomplete compilation of over sixty studies, table 5 illustrates that RD designs have been applied in many different contexts. Table 5 summarizes the context of the study, the outcome

47 See the discussion in section 4.5.
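The clustered standard errors recommended for the pooled approach can be computed with the standard Liang–Zeger sandwich formula, clustering on the individual identifier. A minimal sketch follows (our own illustration, without the usual finite-sample degrees-of-freedom correction); with one observation per cluster it collapses to the ordinary heteroskedasticity-robust (HC0) estimator:

```python
import numpy as np

def cluster_robust_se(Z, resid, cluster_ids):
    """Cluster-robust (sandwich) standard errors for OLS coefficients,
    summing scores within each cluster (e.g., within each individual i)."""
    bread = np.linalg.inv(Z.T @ Z)
    k = Z.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        in_g = cluster_ids == g
        score = Z[in_g].T @ resid[in_g]   # summed score for cluster g
        meat += np.outer(score, score)
    return np.sqrt(np.diag(bread @ meat @ bread))

# Usage: Z stacks the treatment dummy and polynomial terms over all (i, t)
# observations; cluster_ids is the individual identifier i.
```

Clustering on the individual leaves the point estimates unchanged relative to pooled OLS; only the standard errors are adjusted for within-individual correlation over time.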

Table 5
Regression Discontinuity Applications in Economics

Study Context Outcome(s) Treatment(s) Assignment variable(s)


Education
Angrist and Lavy (1999) Public Schools Test scores Class size Student enrollment
(Grades 3–5), Israel
Asadullah (2005) Secondary schools, Examination pass rate Class size Student enrollment
  Bangladesh
Bayer, Ferreira, Valuation of schools Housing prices, Inclusion in school Geographic location
  and McMillan (2007)   and neighborhoods,   school test scores,   attendance region
  Northern California   demographic
  characteristics
Black (1999) Valuation of school Housing prices Inclusion in school Geographic location
  quality, Massachusetts   attendance region
Canton and Blom (2004) Higher education, University enrollment, Student loan receipt Economic need index
  Mexico   GPA, Part-time
  employment, Career
  choice
Cascio and Lewis (2006) Teenagers, AFQT test scores Age at school entry Birthdate
  United States
Chay, McEwan, Elementary schools, Test scores Improved infrastructure, School averages of test
  and Urquiola (2005)   Chile   more resources   scores
Chiang (2009) School accountability, Test scores, education Threat of sanctions School’s assessment
  Florida   quality   score
Clark (2009) High schools, U.K. Examination pass rates “Grant maintained” Vote share
  school status
Ding and Lehrer (2007) Secondary school Academic achievement School assignment Entrance examination
  students, China (Test scores)   scores
Figlio and Kenny (2009) Elementary and middle Private donations to D or F grade in school Grading points
  schools, Florida   school   performance measure
Goodman (2008) College enrollment, School choice Scholarship offer Test scores
  Massachusetts
Goolsbee and Public schools, Internet access in E-Rate subsidy amount Proportion of students
  Guryan (2006)   California   classrooms, test scores   eligible for lunch
  program
Guryan (2001) State-level equalization: Spending on schools, State education aid Relative average
  elementary, middle   test scores   property values
  schools, Massachusetts
Hoxby (2000) Elementary schools, Test scores Class size Student enrollment
  Connecticut
Kane (2003) Higher education, College attendance Financial aid receipt Income, assets, GPA
  California
Lavy (2002) Secondary schools, Test scores, Performance based Frequency of school
  Israel   drop out rates   incentives for teachers   type in community
Lavy (2004) Secondary schools, Test scores Pay-for-performance School matriculation
  Israel   incentives   rates
Lavy (2006) Secondary schools, Dropout rates, School choice Geographic location
  Tel Aviv   test scores
Jacob and Lefgren (2004a) Elementary schools, Test scores Teacher training School averages on
  Chicago   test scores
Jacob and Lefgren (2004b) Elementary schools, Test scores Summer school Standardized test
  Chicago   attendance, grade   scores
  retention
Leuven, Lindahl, Primary schools, Test scores Extra funding Percent disadvantaged
  Oosterbeek, and   Netherlands   minority pupils
  Webbink (2007)
Matsudaira (2008) Elementary schools, Test scores Summer school, Test scores
  Northeastern United   grade promotion
  States
Urquiola (2006) Elementary schools, Test scores Class size Student enrollment
  Bolivia
Urquiola and Class size sorting- RD Test scores Class size Student enrollment
  Verhoogen (2009)   violations, Chile
Van der Klaauw College enrollment, Enrollment Financial Aid offer SAT scores, GPA
  (2002, 1997)   East Coast College
Van der Klaauw (2008a) Elementary/middle Test scores, Title I federal funding Poverty rates
  schools, New York   student attendance
  City
Labor Market
Battistin and Rettore Job training, Italy Employment rates Training program Attitudinal test score
  (2002)   (computer skills)
Behaghel, Crepon, Labor laws, France Hiring among age Tax exemption for Age of worker
  and Sedillot (2008)   groups   hiring firm
Black, Smith, Berger, and UI claimants, Kentucky Earnings, benefit Mandatory reemploy- Profiling score
  Noel (2003); Black,   receipt/duration   ment services (job   (expected benefit
  Galdo, and Smith (2007b)   search assistance)   duration)
Card, Chetty, Unemployment Unemployment Lump-sum severance Months employed,
  and Weber (2007)   benefits, Austria   duration   pay, extended UI   job tenure
  benefits
Chen and van der Klaauw Disability insurance Labor force Disability insurance Age at disability
(2008)   beneficiaries,   participation   benefits   decision
  United States
De Giorgi (2005) Welfare-to-work Re-employment Job search assistance, Age at end of
  program, United   probability   training, education   unemployment spell
  Kingdom
DiNardo and Lee (2004) Unionization, Wages, employment, Union victory in NLRB Vote share
  United States   output   election
Dobkin and Individuals, California Educational attainment, Age at school entry Birthdate
  Ferreira (2009)   and Texas   wages
Edmonds (2004) Child labor supply and Child labor supply, school Pension receipt of oldest Age
  school attendance, attendance family member
  South Africa
Hahn, Todd, and Discrimination, Minority employment Coverage of federal Number of employees
  van der Klaauw (1999)   United States   antidiscrimination law   at firm
Lalive (2008) Unemployment Unemployment Maximum benefit Age at start of
  Benefits, Austria   duration   duration   unemployment
  spell, geographic
  location

Lalive (2007) Unemployment, Unemployment duration, Benefits duration Age at start of
  Austria   duration of job search,   unemployment spell
  quality of post-
  unemployment jobs
Lalive, Van Ours, Unemployment, Unemployment Benefit replacement Pre-unemployment
  and Zweimller (2006)   Austria   duration   rate, potential benefit   income, age
  duration
Leuven and Oosterbeek Employers, Training, wages Business tax deduction, Age of employee
  (2004)   Netherlands   training
Lemieux and Milligan Welfare, Canada Employment, marital Cash benefit Age
  (2008)   status, living
  arrangements
Oreopoulos (2006) Returns to education, Earnings Coverage of compulsory Birth year
  U.K. schooling law
Political Economy
Albouy (2009) Congress, United States Federal expenditures Party control of seat Vote share in election
Albouy (2008) Senate, United States Roll call votes Incumbency Initial vote share
Ferreira and Gyourko Mayoral Elections, Local expenditures Incumbency Initial vote share
  (2009)   United States
Lee (2008, 2001) Congressional elections, Vote share in next Incumbency Initial vote share
  United States   election
Lee, Moretti, and Butler House of Representa- Roll call votes Incumbency Initial vote share
  (2004)   tives, United States
McCrary (2008) House of Representa- N/A Passing of resolution Share of roll call vote
  tives, United States   “Yeay”
Pettersson-Lidbom (2006) Local Governments, Expenditures, Number of council seats Population
  Sweden and Finland   tax revenues
Pettersson-Lidbom (2008) Local Governments, Expenditures, Left-, right-wing bloc Left-wing parties’
  Sweden   tax revenues   share
Health
Card and Shore-Sheppard Medicaid, Overall insurance Medicaid eligibility Birthdate
  (2004)   United States   coverage
Card, Dobkin, Medicare, Health care utilization Coverage under Age
  and Maestas (2008)   United States   Medicare
Card, Dobkin, Medicare, California Insurance coverage, Medicare coverage Age
  and Maestas (2009) Health services, Mortality
Carpenter and Dobkin Alcohol and mortality, Mortality Attaining minimum Age
  (2009)   United States   legal drinking age
Ludwig and Miller (2007) Head Start, Child mortality, Head Start funding County poverty rates
  United States   educational attainment
McCrary and Royer (2003) Maternal education, Infant health, fertility Age of school entry Birthdate
  United States,   timing
  California and Texas
Snyder and Evans (2006) Social Security Mortality Social security Birthdate
  recipients, United   payments ($)
  States

Crime
Berk and DeLeeuw (1999) Prisoner behavior in Inmate misconduct Prison security levels Classification score
  California
Berk and Rauma (1983) Ex-prisoners recidivism, Arrest, parole violation Unemployment Reported hours of
  California   insurance benefit   work
Study | Context | Outcome | Treatment | Assignment variable
Chen and Shapiro (2004) | Ex-prisoners recidivism, United States | Arrest rates | Prison security levels | Classification score
Lee and McCrary (2005) | Criminal offenders, Florida | Arrest rates | Severity of sanctions | Age at arrest
Hjalmarsson (2009) | Juvenile offenders, Washington State | Recidivism | Sentence length | Criminal history score

Environment
Chay and Greenstone (2003) | Health effects of pollution, United States | Infant mortality | Regulatory status | Pollution levels
Chay and Greenstone (2005) | Valuation of air quality, United States | Housing prices | Regulatory status | Pollution levels
Davis (2008) | Restricted driving policy, Mexico | Hourly air pollutant measures | Restricted automobile use | Time
Greenstone and Gallagher (2008) | Hazardous waste, United States | Housing prices | Superfund clean-up status | Ranking of level of hazard

Other
Battistin and Rettore (2008) | Mexican anti-poverty program (PROGRESA) | School attendance | Cash grants | Pre-assigned probability of being poor
Baum-Snow and Marion (2009) | Housing subsidies, United States | Residents' characteristics, new housing construction | Increased subsidies | Percentage of eligible households in area
Buddelmeyer and Skoufias (2004) | Mexican anti-poverty program (PROGRESA) | Child labor and school attendance | Cash grants | Pre-assigned probability of being poor
Buettner (2006) | Fiscal equalization across municipalities, Germany | Business tax rate | Implicit marginal tax rate on grants to localities | Tax base
Card, Mas, and Rothstein (2008) | Racial segregation, United States | Changes in census tract racial composition | Minority share exceeding "tipping" point | Initial minority share
Cole (2009) | Bank nationalization, India | Share of credit granted by public banks | Nationalization of private banks | Size of bank
Edmonds, Mammen, and Miller (2005) | Household structure, South Africa | Household composition | Pension receipt of oldest family member | Age
Ferreira (2007) | Residential mobility, California | Household mobility | Coverage of tax benefit | Age
Pence (2006) | Mortgage credit, United States | Size of loan | State mortgage credit laws | Geographic location
Pitt and Khandker (1998) | Poor households, Bangladesh | Labor supply, children school enrollment | Group-based credit program | Acreage of land
Pitt, Khandker, McKernan, and Latif (1999) | Poor households, Bangladesh | Contraceptive use, childbirth | Group-based credit program | Acreage of land
Lee and Lemieux: Regression Discontinuity Designs in Economics 343

variable, the treatment of interest, and the assignment variable employed.

While the categorization of the various studies into broad areas is rough and somewhat arbitrary, it does appear that a large share comes from the area of education, where the outcome of interest is often an achievement test score and the assignment variable is also a test score, either at the individual or group (school) level. The second clearly identifiable group consists of studies that deal with labor market issues and outcomes. This probably reflects that, within economics, the RD design has so far primarily been used by labor economists, and that the use of quasi-experiments and program evaluation methods in documenting causal relationships is more prevalent in labor economics research.

There is, of course, nothing in the structure of the RD design tying it specifically to labor economics applications. Indeed, as the rest of the table shows, the remaining half of the studies are in the areas of political economy, health, crime, environment, and other areas.

6.2 Sources of Discontinuous Rules

Where do discontinuous rules come from, and in what situations would we expect to encounter them? As table 5 shows, there is a wide variety of contexts where discontinuous rules determine treatments of interest. There are, nevertheless, some patterns that emerge. We organize the various discontinuous rules below.

Before doing so, we emphasize that a good RD analysis—as with any other approach to program evaluation—is careful in clearly spelling out exactly what the treatment is, and whether it is of any real salience, independent of whatever effect it might have on the outcome. For example, when a pretest score is the assignment variable, we could always define a "treatment" as being "having passed the exam" (with a test score of 50 percent or higher), but this is not a very interesting "treatment" to examine since it seems nothing more than an arbitrary label. On the other hand, if failing the exam meant not being able to advance to the next grade in school, the actual experience of treated and control individuals is observably different, no matter how large or small the impact on the outcome.

As another example, in the U.S. Congress, a Democrat obtaining the most votes in an election means something real—the Democratic candidate becomes a representative in Congress; otherwise, the Democrat has no official role in the government. But in a three-way electoral race, the treatment of the Democrat receiving the second-most number of votes (versus receiving the lowest number) is not likely a treatment of interest: only the first-place candidate is given any legislative authority. In principle, stories could be concocted about the psychological effect of placing second rather than third in an election, but this would be an example where the salience of the treatment is more speculative than when treatment is a concrete and observable event (e.g., a candidate becoming the sole representative of a constituency).

6.2.1 Necessary Discretization

Many discontinuous rules come about because resources cannot, for all practical purposes, be provided in a continuous manner. For example, a school can only have a whole number of classes per grade. For a fixed level of enrollment, the moment a school adds a single class, the average class size drops. As long as the number of classes is an increasing function of enrollment, there will be discontinuities at enrollments where a teacher is added. If there is a mandated maximum for the student to teacher ratio, this means that these discontinuities will be expected at enrollments that are exact multiples of the maximum. This is the
essence of the discontinuous rules used in the analyses of Angrist and Lavy (1999), M. Niaz Asadullah (2005), Caroline M. Hoxby (2000), Urquiola (2006), and Urquiola and Verhoogen (2009).

Another example of necessary discretization arises when children begin their schooling years. Although there are certainly exceptions, school districts typically follow a guideline that aims to group children together by age, leading to a grouping of children born in year-long intervals, determined by a single calendar date (e.g., September 1). This means children who are essentially of the same age (e.g., those born on August 31 and September 1) start school one year apart. This allocation of students to grade cohorts is used in Elizabeth U. Cascio and Ethan G. Lewis (2006), Dobkin and Fernando Ferreira (2009), and McCrary and Royer (2003).

Choosing a single representative by way of an election is yet another example. When the law or constitution calls for a single representative of some constituency and there are many competing candidates, the choice can be made via a "first-past-the-post" or "winner-take-all" election. This is the typical system for electing government officials at the local, state, and federal level in the United States. The resulting discontinuous relationship between win/loss status and the vote share is used in the context of the U.S. Congress in Lee (2001, 2008), Lee, Enrico Moretti and Matthew J. Butler (2004), David Albouy (2009), Albouy (2008), and in the context of mayoral elections in Ferreira and Joseph Gyourko (2009). The same idea is used in examining the impacts of union recognition, which is also decided by a secret ballot election (DiNardo and Lee 2004).

6.2.2 Intentional Discretization

Sometimes resources could potentially be allocated on a continuous scale but, in practice, are instead done in discrete levels. Among the studies we surveyed, we identified three broad motivations behind the use of these discontinuous rules.

First, a number of rules seem driven by a compensatory or equalizing motive. For example, in Kenneth Y. Chay, Patrick J. McEwan and Urquiola (2005), Edwin Leuven et al. (2007), and van der Klaauw (2008a), extra resources for schools were allocated to the neediest communities, either on the basis of school-average test scores, disadvantaged minority proportions, or poverty rates. Similarly, Ludwig and Miller (2007), Erich Battistin and Enrico Rettore (2008), and Hielke Buddelmeyer and Emmanuel Skoufias (2004) study programs designed to help poor communities, where the eligibility of a community is based on poverty rates. In each of these cases, one could imagine providing the most resources to the neediest and gradually phasing them out as the need index declines, but in practice this is not done, perhaps because it was impractical to provide very small levels of the treatment, given the fixed costs in administering the program.

A second motivation for having a discontinuous rule is to allocate treatments on the basis of some measure of merit. This was the motivation behind the merit award from the analysis of Thistlethwaite and Campbell (1960), as well as recent studies of the effect of financial aid awards on college enrollment, where the assignment variable is some measure of student achievement or test score, as in Thomas J. Kane (2003) and van der Klaauw (2002).

Finally, we have observed that a number of discontinuous rules are motivated by the need to most effectively target the treatment. For example, environmental regulations or clean-up efforts naturally will focus on the most polluted areas, as in Chay and Michael Greenstone (2003), Chay and Greenstone (2005), and Greenstone and Justin Gallagher (2008). In the context of criminal behavior, prison security levels are often assigned based on an underlying score that quantifies
potential security risks, and such rules were used in Richard A. Berk and Jan de Leeuw (1999) and M. Keith Chen and Jesse M. Shapiro (2004).

6.3 Nonrandomized Discontinuity Designs

Throughout this article, we have focused on regression discontinuity designs that follow a certain structure and timing in the assignment of treatment. First, individuals or communities—potentially in anticipation of the assignment of treatment—make decisions and act, potentially altering their probability of receiving treatment. Second, there is a stochastic shock due to "nature," reflecting that the units have incomplete control over the assignment variable. And finally, the treatment (or the intention to treat) is assigned on the basis of the assignment variable.

We have focused on this structure because in practice most RD analyses can be viewed along these lines, and also because of the similarity to the structure of a randomized experiment. That is, subjects of a randomized experiment may or may not make decisions in anticipation of participating in a randomized controlled trial (although their actions will ultimately have no influence on the probability of receiving treatment). Then the stochastic shock is realized (the randomization). Finally, the treatment is administered to one of the groups.

A number of the studies we surveyed, though, did not seem to fit the spirit or essence of a randomized experiment. Since it is difficult to think of the treatment as being locally randomized in these cases, we will refer to the two research designs we identified in this category as "nonrandomized" discontinuity designs.

6.3.1 Discontinuities in Age with Inevitable Treatment

Sometimes program status is turned on when an individual reaches a certain age. Receipt of pension benefits is typically tied to reaching a particular age (see Eric V. Edmonds 2004; Edmonds, Kristin Mammen, and Miller 2005) and, in the United States, eligibility for the Medicare program begins at age 65 (see Card, Dobkin, and Maestas 2008) and young adults reach the legal drinking age at 21 (see Christopher Carpenter and Dobkin 2009). Similarly, one is subject to the less punitive juvenile justice system until the age of majority (typically, 18) (see Lee and McCrary 2005).

These cases stand apart from the typical RD designs discussed above because here assignment to treatment is essentially inevitable, as all subjects will eventually age into the program (or, conversely, age out of the program). One cannot, therefore, draw any parallels with a randomized experiment, which necessarily involves some ex ante uncertainty about whether a unit ultimately receives treatment (or the intent to treat).

Another important difference is that the tests of smoothness in baseline characteristics will generally be uninformative. Indeed, if one follows a single cohort over time, all characteristics determined prior to reaching the relevant age threshold are by construction identical just before and after the cutoff.48 Note that in this case, time is the assignment variable, and therefore cannot be manipulated.

This design and the standard RD share the necessity of interpreting the discontinuity as the combined effect of all factors that switch on at the threshold. In the example of Thistlethwaite and Campbell (1960), if passing a scholarship exam provides the symbolic

48 There are exceptions to this. There could be attrition over time, so that in principle, the number of observations could discontinuously drop at the threshold, changing the composition of the remaining observations. Alternatively, when examining a cross-section of different birth cohorts at a given point in time, it is possible to have sharp changes in the characteristics of individuals with respect to birthdate.
honor of passing the exam as well as a monetary award, the true treatment is a package of the two components, and one cannot attribute any effect to only one of the two. Similarly, when considering an age-activated treatment, one must consider the possibility that the age of interest is causing eligibility for potentially many other programs, which could affect the outcome.

There are at least two new issues that are irrelevant for the standard RD but are important for the analysis of age discontinuities. First, even if there is truly an effect on the outcome, if the effect is not immediate, it generally will not generate a discontinuity in the outcome. For example, suppose the receipt of Social Security benefits has no immediate impact but does have a long-run impact on labor force participation. Examining the labor force behavior as a function of age will not yield a discontinuity at age 67 (the full retirement age for those born after 1960), even though there may be a long-run effect. It is infeasible to estimate long-run effects because by the time we examine outcomes five years after receiving the treatment, for example, those individuals who were initially just below and just above age 67 will be exposed to essentially the same length of time of treatment (e.g., five years).49

The second important issue is that, because treatment is inevitable with the passage of time, individuals may fully anticipate the change in the regime and, therefore, may behave in certain ways prior to the time when treatment is turned on. Optimizing behavior in anticipation of a sharp regime change may either accentuate or mute observed effects. For example, simple life-cycle theories, assuming no liquidity constraints, suggest that the path of consumption will exhibit no discontinuity at age 67, when Social Security benefits commence payment. On the other hand, some medical procedures are too expensive for an under-65-year-old but would be covered under Medicare upon turning 65. In this case, individuals' greater awareness of such a predicament will tend to increase the size of the discontinuity in utilization of medical procedures with respect to age (e.g., see Card, Dobkin, and Maestas 2008).

At this time we are unable to provide any more specific guidelines for analyzing these age/time discontinuities since it seems that how one models expectations, information, and behavior in anticipation of sharp changes in regimes will be highly context-dependent. But it does seem important to recognize these designs as being distinct from the standard RD design.

We conclude by emphasizing that when distinguishing between age-triggered treatments and a standard RD design, the involvement of age as an assignment variable is not as important as whether the receipt of treatment—or analogously, entering the control state—is inevitable. For example, on the surface, the analysis of the Medicaid expansions in Card and Lara D. Shore-Sheppard (2004) appears to be an age-based discontinuity since, effective July 1991, U.S. law requires states to cover children born after September 30, 1983, implying a discontinuous relationship between coverage and age, where the discontinuity in July 1991 was around 8 years of age. This design, however, actually fits quite easily into the standard RD framework we have discussed throughout this paper.

First, note that treatment receipt is not inevitable for those individuals born near the September 30, 1983, threshold. Those born strictly after that date were covered from July 1991 until their 18th birthday, while those born on or before the date received no such coverage. Second, the data generating process does follow the structure discussed

49 By contrast, there is no such limitation with the standard RD design. One can examine outcomes defined at an arbitrarily long time period after the assignment to treatment.
above. Parents do have some influence regarding when their children are born, but with only imprecise control over the exact date (and at any rate, it seems implausible that parents would have anticipated that such a Medicaid expansion would have occurred eight years in the future, with the particular birthdate cutoff chosen). Thus the treatment is assigned based on the assignment variable, which is the birthdate in this context.

Examples of other age-based discontinuities where neither the treatment nor control state is guaranteed with the passage of time that can also be viewed within the standard RD framework include studies by Cascio and Lewis (2006), McCrary and Royer (2003), Dobkin and Ferreira (2009), and Phillip Oreopoulos (2006).

6.3.2 Discontinuities in Geography

Another "nonrandomized" RD design is one involving the location of residences, where the discontinuity threshold is a boundary that demarcates regions. Black (1999) and Patrick Bayer, Ferreira, and Robert McMillan (2007) examine housing prices on either side of school attendance boundaries to estimate the implicit valuation of different schools. Lavy (2006) examines adjacent neighborhoods in different cities that are therefore subject to different rules regarding student busing. Rafael Lalive (2008) compares unemployment duration in regions in Austria receiving extended benefits to adjacent control regions. Karen M. Pence (2006) examines census tracts along state borders to examine the impact of more borrower-friendly laws on mortgage loan sizes.

In each of these cases, it is awkward to view either houses or families as locally randomly assigned. Indeed this is a case where economic agents have quite precise control over where to place a house or where to live. The location of houses will be planned in response to geographic features (rivers, lakes, hills) and in conjunction with the planning of streets, parks, commercial development, etc. In order for this to resemble a more standard RD design, one would have to imagine the relevant boundaries being set in a "random" way, so that it would be simply luck determining whether a house ended up on either side of the boundary. The concern over the endogeneity of boundaries is clearly recognized by Black (1999), who ". . . [b]ecause of concerns about neighborhood differences on opposite sides of an attendance district boundary, . . . was careful to omit boundaries from [her] sample if the two attendance districts were divided in ways that seemed to clearly divide neighborhoods; attendance districts divided by large rivers, parks, golf courses, or any large stretch of land were excluded." As one could imagine, the selection of which boundaries to include could quickly turn into more of an art than a science.

We have no uniform advice on how to analyze geographic discontinuities because it seems that the best approach would be particularly context-specific. It does, however, seem prudent for the analyst, in assessing the internal validity of the research design, to carefully consider three sets of questions. First, what is the process that led to the location of the boundaries? Which came first: the houses or the boundaries? Were the boundaries a response to some preexisting geographical or political constraint? Second, how might sorting of families or the endogenous location of houses affect the analysis? And third, what are all the things differing between the two regions other than the treatment of interest? An exemplary analysis and discussion of these latter two issues in the context of school attendance zones is found in Bayer, Ferreira, and McMillan (2007).

7. Concluding Remarks on RD Designs in Economics: Progress and Prospects

Our reading of the existing and active literature is that—after being largely ignored
by economists for almost forty years—there have been significant inroads made in understanding the properties, limitations, interpretability, and perhaps most importantly, in the useful application of RD designs to a wide variety of empirical questions in economics. These developments have, for the most part, occurred within a short period of time, beginning in the late 1990s.

Here we highlight what we believe are the most significant recent contributions of the economics literature to the understanding and application of RD designs. We believe these are helpful developments in guiding applied researchers who seek to implement RD designs, and we also illustrate them with a few examples from the literature.

• Sorting and Manipulation of the Assignment Variable: Economists consider how self-interested individuals or optimizing organizations may behave in response to rules that allocate resources. It is therefore unsurprising that the discussion of how endogenous sorting around the discontinuity threshold can invalidate the RD design has been found (to our knowledge, exclusively) in the economics literature. By contrast, textbook treatments outside economics on RD do not discuss this sorting or manipulation, and give the impression that the knowledge of the assignment rule is sufficient for the validity of the RD.50

We believe a "state-of-the-art" RD analysis today will consider carefully the possibility of endogenous sorting. A recent analysis that illustrates this standard is that of Urquiola and Verhoogen (2009), who examine the class size cap RD design pioneered by Angrist and Lavy (1999) in the context of Chile's highly liberalized market for primary schools. In a certain segment of the private market, schools receive a fixed payment per student from the government. However, each school faces a very high marginal cost (hiring one extra teacher) for crossing a multiple of the class size cap. Perhaps unsurprisingly, they find striking discontinuities in the histogram of the assignment variable (total enrollment in the grade), with an undeniable "stacking" of schools at the relevant class size cap cutoffs. They also provide evidence that those families in schools just to the left and right of the thresholds are systematically different in family income, suggesting some degree of sorting. For this reason, they conclude that an RD analysis in this particular context is most likely inappropriate.51 This study, as well as the analysis of Bayer, Ferreira, and McMillan (2007), reflects a heightened awareness of a sorting issue recognized since the beginning of the recent wave of RD applications in economics.52 From a practitioner's perspective, an important recent development

50 For example, Trochim (1984) characterizes the three central assumptions of the RD design as: (1) perfect adherence to the cutoff rule, (2) having the correct functional form, and (3) no other factors (other than the program of interest) cause the discontinuity. More recently, William R. Shadish, Cook, and Campbell (2002) claim on page 243 that the proof of the unbiasedness of RD primarily follows from the fact that treatment is known perfectly once the assignment variable is known. They go on to argue that this deterministic rule implies omitted variables will not pose a problem. But Hahn, Todd, and van der Klaauw (2001) make it clear that the existence of a deterministic rule for the assignment of treatment is not sufficient for unbiasedness, and it is necessary to assume the influence of all other factors (omitted variables) are the same on either side of the discontinuity threshold (i.e., their continuity assumption).

51 Urquiola and Verhoogen (2009) emphasize the sorting issues may well be specific to the liberalized nature of the Chilean primary school market, and that they may or may not be present in other countries.

52 See, for example, footnote 23 in van der Klaauw (1997) and page 549 in Angrist and Lavy (1999).
is the notion that we can empirically examine the degree of sorting, and one way of doing so is suggested in McCrary (2008).

• RD Designs as Locally Randomized Experiments: Economists are hesitant to apply methods that have not been rigorously formalized within an econometric framework, and where crucial identifying assumptions have not been clearly specified. This is perhaps one of the reasons why RD designs were underutilized by economists for so long, since it is only relatively recently that the underlying assumptions needed for the RD were formalized.53 In the recent literature, RD designs were initially viewed as a special case of matching (Heckman, Lalonde, and Smith 1999), or alternatively as a special case of IV (Angrist and Krueger 1999), and these perspectives may have provided empirical researchers a familiar econometric framework within which identifying assumptions could be more carefully discussed.

Today, RD is increasingly recognized in applied research as a distinct design that is a close relative to a randomized experiment. Formally shown in Lee (2008), even when individuals have some control over the assignment variable, as long as this control is imprecise—that is, the ex ante density of the assignment variable is continuous—the consequence will be local randomization of the treatment. So in a number of nonexperimental contexts where resources are allocated based on a sharp cutoff rule, there may indeed be a hidden randomized experiment to utilize. And furthermore, as in a randomized experiment, this implies that all observable baseline covariates will locally have the same distribution on either side of the discontinuity threshold—an empirically testable proposition.

We view the testing of the continuity of the baseline covariates as an important part of assessing the validity of any RD design—particularly in light of the incentives that can potentially generate sorting—and as something that truly sets RD apart from other evaluation strategies. Examples of this kind of testing of the RD design include Jordan D. Matsudaira (2008), Card, Raj Chetty and Andrea Weber (2007), DiNardo and Lee (2004), Lee, Moretti and Butler (2004), McCrary and Royer (2003), Greenstone and Gallagher (2008), and Urquiola and Verhoogen (2009).

53 An example of how economists'/econometricians' notion of a proof differs from that in other disciplines is found in Cook (2008), who views the discussion in Arthur S. Goldberger (1972a) and Goldberger (1972b) as the first "proof of the basic design," quoting the following passage in Goldberger (1972a) (brackets from Cook 2008): "The explanation for this serendipitous result [no bias when selection is on an observed pretest score] is not hard to locate. Recall that z [a binary variable representing the treatment contrast at the cutoff] is completely determined by pretest score x [an obtained ability score]. It cannot contain any information about x* [true ability] that is not contained within x. Consequently, when we control on x as in the multiple regression, z has no explanatory power with respect to y [the outcome measured with error]. More formally, the partial correlation of y and z controlling on x vanishes although the simple correlation of y and z is nonzero" (p. 647). After reading the article, an econometrician will recognize the discussion above not as a proof of the validity of the RD, but rather as a restatement of the consequence of z being an indicator variable determined by an observed variable x, in a specific parameterized example. Today we know the existence of such a rule is not sufficient for a valid RD design, and a crucial necessary assumption is the continuity of the influence of all other factors, as shown in Hahn, Todd, and van der Klaauw (2001). In Goldberger (1972a), the role of the continuity of omitted factors was not mentioned (although it is implicitly assumed in the stylized model of test scores involving normally distributed and independent errors). Indeed, apparently Goldberger himself later clarified that he did not set out to propose the RD design, and was instead interested in the issues related to selection on observables and unobservables (Cook 2008).
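The two validity checks discussed in this section, the test for bunching in the density of the assignment variable (McCrary 2008) and the test for continuity of predetermined covariates at the cutoff, can be illustrated with simple window-based comparisons. This is only a hedged sketch: the estimators actually used in the literature rely on local-linear smoothing rather than raw windows, and the function names, the simulated data, and the bandwidth `bw` here are illustrative choices, not part of the original analyses.

```python
import numpy as np

def density_jump(x, cutoff, bw):
    """Crude bunching check: log ratio of the share of observations in a
    window of width bw just right vs. just left of the cutoff. Roughly
    zero when the density is continuous. (McCrary 2008 uses a smoothed
    local-linear density estimator rather than raw windows.)"""
    left = np.mean((x >= cutoff - bw) & (x < cutoff))
    right = np.mean((x >= cutoff) & (x < cutoff + bw))
    return np.log(right / left)

def covariate_balance(x, w, cutoff, bw):
    """Local-randomization balance check: difference in means of a
    predetermined covariate w within bw of the cutoff, with a
    two-sample standard error."""
    wl = w[(x >= cutoff - bw) & (x < cutoff)]
    wr = w[(x >= cutoff) & (x < cutoff + bw)]
    diff = wr.mean() - wl.mean()
    se = np.sqrt(wl.var(ddof=1) / len(wl) + wr.var(ddof=1) / len(wr))
    return diff, diff / se

# Hypothetical data with no sorting: both diagnostics should be quiet.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200_000)          # assignment variable
w = 0.5 * x + rng.normal(0, 1, x.size)  # predetermined covariate
print(f"density log-jump: {density_jump(x, 0.5, bw=0.02):+.3f}")
diff, t = covariate_balance(x, w, 0.5, bw=0.02)
print(f"covariate diff near cutoff: {diff:+.4f} (t = {t:+.2f})")
```

In manipulated data, with mass shifted from just below to just above the cutoff, `density_jump` turns sharply positive, the window analogue of the "stacking" in enrollment histograms that Urquiola and Verhoogen (2009) document.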
• Graphical Analysis and Presentation: The graphical presentation of an RD analysis is not a contribution of economists,54 but it is safe to say that the body of work produced by economists has led to a kind of "industry standard" that the transparent identification strategy of the RD be accompanied by an equally transparent graph showing the empirical relation between the outcome and the assignment variable. Graphical presentations of RD are so prevalent in applied research, it is tempting to guess that studies not including the graphical evidence are ones where the graphs are not compelling or well-behaved.

In an RD analysis, the graph is indispensable because it can summarize a great deal of information in one picture. It can give a rough sense of the range of both the assignment variable and the outcome variable as well as the overall shape of the relationship between the two, thus indicating what functional forms are likely to make sense. It can also alert the researcher to potential outliers in both the assignment and outcome variables. A graph of the raw means—in nonoverlapping intervals, as discussed in section 4.1—also gives a rough sense of the likely sampling variability of the RD gap estimate itself, since one can compare the size of the jump at the discontinuity to natural "bumpiness" in the graph away from the discontinuity. Our reading of the literature is that the most informative graphs are ones that simultaneously allow the raw data "to speak for themselves" in revealing a discontinuity if there is one, yet at the same time treat data near the threshold the same as data away from the threshold.55 There are many examples that follow this general principle; recent ones include Matsudaira (2008), Card, Chetty and Weber (2007), Card, Dobkin, and Maestas (2009), McCrary and Royer (2003), Lee (2008), and Ferreira and Gyourko (2009).

• Applicability: Soon after the introduction of RD, in a chapter in a book on research methods, Campbell and Julian C. Stanley (1963) wrote that the RD design was "very limited in range of possible applications." The emerging body of research produced by economists in recent years has proven quite the opposite. Our survey of the literature suggests that there are many kinds of discontinuous rules that can help answer important questions in economics and related areas. Indeed, one may go so far as to guess that whenever a scarce resource is rationed for individual entities, if the political climate demands a transparent way of distributing that resource, it is a good bet there is an RD design lurking in the background. In addition, it seems that the approach of using changes in laws that disqualify older birth cohorts based on their date of birth (as in Card and Shore-Sheppard (2004) or Oreopoulos (2006)) may well have much wider applicability.

One way to understand both the applicability and limitations of the RD design is to recognize its relation to a standard econometric policy evaluation framework, where the main variable of interest is a potentially endogenous binary treatment variable (as considered in Heckman 1978 or more recently discussed in Heckman and Vytlacil

54 Indeed the original article of Thistlethwaite and Campbell (1960) included a graphical analysis of the data.

55 For example, graphing a smooth conditional expectation function everywhere except at the discontinuity threshold violates this principle.
2005). This selection model applies to a wide range of economic problems. As we pointed out in section 3, the RD design describes a situation where one is able to observe the latent variable that determines treatment. As long as the density of that variable is continuous for each individual, the benefit of observing the latent index is that one neither needs to make exclusion restrictions nor to assume that some variable (i.e., an instrument) is independent of the errors in the outcome equation.

From this perspective, for the class of problems that fit into the standard treatment evaluation problem, RD designs can be seen as a subset, since an institutional, index-based rule plays a role in determining treatment. Among this subset, the binding constraint of RD lies in obtaining the necessary data: readily available public-use household survey data, for example, will often only contain variables that are correlated with the true assignment variable (e.g., reported income in a survey, as opposed to the income used for the allocation of benefits), or are measured too coarsely (e.g., in years rather than months or weeks) to detect a discontinuity in the presence of a regression function with significant curvature. This is where there can be a significant payoff to investing in securing high-quality data, which is evident in most of the studies listed in table 5.

7.1 Extensions

We conclude by discussing two natural directions in which the RD approach can be extended. First, we have discussed the “fuzzy” RD design as an important departure from the “classic” RD design, where treatment is a deterministic function of the assignment variable, but there are other departures that could be practically relevant yet are not as well understood. For example, even if there is perfect compliance with the discontinuous rule, it may be that the researcher does not directly observe the assignment variable but instead possesses a slightly noisy measure of it. Understanding the effects of this kind of measurement error could further expand the applicability of RD. In addition, there may be situations where the researcher both suspects and statistically detects some degree of precise sorting around the threshold, but the sorting appears to be relatively minor, even if statistically significant (based on observed discontinuities in baseline characteristics). The challenge, then, is to specify the conditions under which one can correct for small amounts of this kind of contamination.

Second, so far we have discussed the sorting or manipulation issue as a potential problem or nuisance in the general program evaluation problem. But there is another way of viewing this sorting issue. The observed sorting may well be evidence of economic agents responding to incentives, and may help identify economically interesting phenomena. That is, economic behavior may be what is driving discontinuities in the frequency distribution of grade enrollment (as in Urquiola and Verhoogen 2009), in the distribution of roll call votes (as in McCrary 2008), or in the distribution of age at offense (as in Lee and McCrary 2005), and those behavioral responses may themselves be of interest. These cases, as well as the age/time and boundary discontinuities discussed above, do not fit into the “standard” RD framework, but they can nevertheless tell us something important about behavior, and they further expand the kinds of questions that can be addressed by exploiting discontinuous rules to identify meaningful economic parameters of interest.

References

Albouy, David. 2008. “Do Voters Affect or Elect Policies? A New Perspective With Evidence from the U.S. Senate.” Unpublished.
Albouy, David. 2009. “Partisan Representation in Congress and the Geographic Distribution of Federal Funds.” National Bureau of Economic Research Working Paper 15224.
Angrist, Joshua D. 1990. “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records.” American Economic Review, 80(3): 313–36.
Angrist, Joshua D., and Alan B. Krueger. 1999. “Empirical Strategies in Labor Economics.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1277–1366. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Angrist, Joshua D., and Victor Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement.” Quarterly Journal of Economics, 114(2): 533–75.
Asadullah, M. Niaz. 2005. “The Effect of Class Size on Student Achievement: Evidence from Bangladesh.” Applied Economics Letters, 12(4): 217–21.
Battistin, Erich, and Enrico Rettore. 2002. “Testing for Programme Effects in a Regression Discontinuity Design with Imperfect Compliance.” Journal of the Royal Statistical Society: Series A (Statistics in Society), 165(1): 39–57.
Battistin, Erich, and Enrico Rettore. 2008. “Ineligibles and Eligible Non-participants as a Double Comparison Group in Regression-Discontinuity Designs.” Journal of Econometrics, 142(2): 715–30.
Baum-Snow, Nathaniel, and Justin Marion. 2009. “The Effects of Low Income Housing Tax Credit Developments on Neighborhoods.” Journal of Public Economics, 93(5–6): 654–66.
Bayer, Patrick, Fernando Ferreira, and Robert McMillan. 2007. “A Unified Framework for Measuring Preferences for Schools and Neighborhoods.” Journal of Political Economy, 115(4): 588–638.
Behaghel, Luc, Bruno Crépon, and Béatrice Sédillot. 2008. “The Perverse Effects of Partial Employment Protection Reform: The Case of French Older Workers.” Journal of Public Economics, 92(3–4): 696–721.
Berk, Richard A., and Jan de Leeuw. 1999. “An Evaluation of California’s Inmate Classification System Using a Generalized Regression Discontinuity Design.” Journal of the American Statistical Association, 94(448): 1045–52.
Berk, Richard A., and David Rauma. 1983. “Capitalizing on Nonrandom Assignment to Treatments: A Regression-Discontinuity Evaluation of a Crime-Control Program.” Journal of the American Statistical Association, 78(381): 21–27.
Black, Dan A., Jose Galdo, and Jeffrey A. Smith. 2007a. “Evaluating the Regression Discontinuity Design Using Experimental Data.” Unpublished.
Black, Dan A., Jose Galdo, and Jeffrey A. Smith. 2007b. “Evaluating the Worker Profiling and Reemployment Services System Using a Regression Discontinuity Approach.” American Economic Review, 97(2): 104–07.
Black, Dan A., Jeffrey A. Smith, Mark C. Berger, and Brett J. Noel. 2003. “Is the Threat of Reemployment Services More Effective Than the Services Themselves? Evidence from Random Assignment in the UI System.” American Economic Review, 93(4): 1313–27.
Black, Sandra E. 1999. “Do Better Schools Matter? Parental Valuation of Elementary Education.” Quarterly Journal of Economics, 114(2): 577–99.
Blundell, Richard, and Alan Duncan. 1998. “Kernel Regression in Empirical Microeconomics.” Journal of Human Resources, 33(1): 62–87.
Buddelmeyer, Hielke, and Emmanuel Skoufias. 2004. “An Evaluation of the Performance of Regression Discontinuity Design on PROGRESA.” World Bank Policy Research Working Paper 3386.
Buettner, Thiess. 2006. “The Incentive Effect of Fiscal Equalization Transfers on Tax Policy.” Journal of Public Economics, 90(3): 477–97.
Campbell, Donald T., and Julian C. Stanley. 1963. “Experimental and Quasi-experimental Designs for Research on Teaching.” In Handbook of Research on Teaching, ed. N. L. Gage, 171–246. Chicago: Rand McNally.
Canton, Erik, and Andreas Blom. 2004. “Can Student Loans Improve Accessibility to Higher Education and Student Performance? An Impact Study of the Case of SOFES, Mexico.” World Bank Policy Research Working Paper 3425.
Card, David, Raj Chetty, and Andrea Weber. 2007. “Cash-on-Hand and Competing Models of Intertemporal Behavior: New Evidence from the Labor Market.” Quarterly Journal of Economics, 122(4): 1511–60.
Card, David, Carlos Dobkin, and Nicole Maestas. 2008. “The Impact of Nearly Universal Insurance Coverage on Health Care Utilization: Evidence from Medicare.” American Economic Review, 98(5): 2242–58.
Card, David, Carlos Dobkin, and Nicole Maestas. 2009. “Does Medicare Save Lives?” Quarterly Journal of Economics, 124(2): 597–636.
Card, David, Alexandre Mas, and Jesse Rothstein. 2008. “Tipping and the Dynamics of Segregation.” Quarterly Journal of Economics, 123(1): 177–218.
Card, David, and Lara D. Shore-Sheppard. 2004. “Using Discontinuous Eligibility Rules to Identify the Effects of the Federal Medicaid Expansions on Low-Income Children.” Review of Economics and Statistics, 86(3): 752–66.
Carpenter, Christopher, and Carlos Dobkin. 2009. “The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age.” American Economic Journal: Applied Economics, 1(1): 164–82.
Cascio, Elizabeth U., and Ethan G. Lewis. 2006. “Schooling and the Armed Forces Qualifying Test: Evidence from School-Entry Laws.” Journal of Human Resources, 41(2): 294–318.
Chay, Kenneth Y., and Michael Greenstone. 2003. “Air Quality, Infant Mortality, and the Clean Air Act of 1970.” National Bureau of Economic Research Working Paper 10053.
Chay, Kenneth Y., and Michael Greenstone. 2005. “Does Air Quality Matter? Evidence from the Housing Market.” Journal of Political Economy, 113(2): 376–424.
Chay, Kenneth Y., Patrick J. McEwan, and Miguel Urquiola. 2005. “The Central Role of Noise in Evaluating Interventions That Use Test Scores to Rank Schools.” American Economic Review, 95(4): 1237–58.
Chen, M. Keith, and Jesse M. Shapiro. 2004. “Does Prison Harden Inmates? A Discontinuity-Based Approach.” Yale University Cowles Foundation Discussion Paper 1450.
Chen, Susan, and Wilbert van der Klaauw. 2008. “The Work Disincentive Effects of the Disability Insurance Program in the 1990s.” Journal of Econometrics, 142(2): 757–84.
Chiang, Hanley. 2009. “How Accountability Pressure on Failing Schools Affects Student Achievement.” Journal of Public Economics, 93(9–10): 1045–57.
Clark, Damon. 2009. “The Performance and Competitive Effects of School Autonomy.” Journal of Political Economy, 117(4): 745–83.
Cole, Shawn. 2009. “Financial Development, Bank Ownership, and Growth: Or, Does Quantity Imply Quality?” Review of Economics and Statistics, 91(1): 33–51.
Cook, Thomas D. 2008. “‘Waiting for Life to Arrive’: A History of the Regression-Discontinuity Design in Psychology, Statistics and Economics.” Journal of Econometrics, 142(2): 636–54.
Davis, Lucas W. 2008. “The Effect of Driving Restrictions on Air Quality in Mexico City.” Journal of Political Economy, 116(1): 38–81.
De Giorgi, Giacomo. 2005. “Long-Term Effects of a Mandatory Multistage Program: The New Deal for Young People in the UK.” Institute for Fiscal Studies Working Paper 05/08.
DesJardins, Stephen L., and Brian P. McCall. 2008. “The Impact of the Gates Millennium Scholars Program on the Retention, College Finance- and Work-Related Choices, and Future Educational Aspirations of Low-Income Minority Students.” Unpublished.
DiNardo, John, and David S. Lee. 2004. “Economic Impacts of New Unionization on Private Sector Employers: 1984–2001.” Quarterly Journal of Economics, 119(4): 1383–1441.
Ding, Weili, and Steven F. Lehrer. 2007. “Do Peers Affect Student Achievement in China’s Secondary Schools?” Review of Economics and Statistics, 89(2): 300–312.
Dobkin, Carlos, and Fernando Ferreira. 2009. “Do School Entry Laws Affect Educational Attainment and Labor Market Outcomes?” National Bureau of Economic Research Working Paper 14945.
Edmonds, Eric V. 2004. “Does Illiquidity Alter Child Labor and Schooling Decisions? Evidence from Household Responses to Anticipated Cash Transfers in South Africa.” National Bureau of Economic Research Working Paper 10265.
Edmonds, Eric V., Kristin Mammen, and Douglas L. Miller. 2005. “Rearranging the Family? Income Support and Elderly Living Arrangements in a Low-Income Country.” Journal of Human Resources, 40(1): 186–207.
Fan, Jianqing, and Irene Gijbels. 1996. Local Polynomial Modelling and Its Applications. London; New York and Melbourne: Chapman and Hall.
Ferreira, Fernando. Forthcoming. “You Can Take It With You: Proposition 13 Tax Benefits, Residential Mobility, and Willingness to Pay for Housing Amenities.” Journal of Public Economics.
Ferreira, Fernando, and Joseph Gyourko. 2009. “Do Political Parties Matter? Evidence from U.S. Cities.” Quarterly Journal of Economics, 124(1): 399–422.
Figlio, David N., and Lawrence W. Kenny. 2009. “Public Sector Performance Measurement and Stakeholder Support.” Journal of Public Economics, 93(9–10): 1069–77.
Goldberger, Arthur S. 1972a. “Selection Bias in Evaluating Treatment Effects: Some Formal Illustrations.” Unpublished.
Goldberger, Arthur S. 1972b. “Selection Bias in Evaluating Treatment Effects: The Case of Interaction.” Unpublished.
Goodman, Joshua. 2008. “Who Merits Financial Aid?: Massachusetts’ Adams Scholarship.” Journal of Public Economics, 92(10–11): 2121–31.
Goolsbee, Austan, and Jonathan Guryan. 2006. “The Impact of Internet Subsidies in Public Schools.” Review of Economics and Statistics, 88(2): 336–47.
Greenstone, Michael, and Justin Gallagher. 2008. “Does Hazardous Waste Matter? Evidence from the Housing Market and the Superfund Program.” Quarterly Journal of Economics, 123(3): 951–1003.
Guryan, Jonathan. 2001. “Does Money Matter? Regression-Discontinuity Estimates from Education Finance Reform in Massachusetts.” National Bureau of Economic Research Working Paper 8269.
Hahn, Jinyong. 1998. “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average Treatment Effects.” Econometrica, 66(2): 315–31.
Hahn, Jinyong, Petra Todd, and Wilbert van der Klaauw. 1999. “Evaluating the Effect of an Antidiscrimination Law Using a Regression-Discontinuity Design.” National Bureau of Economic Research Working Paper 7131.
Hahn, Jinyong, Petra Todd, and Wilbert van der Klaauw. 2001. “Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design.” Econometrica, 69(1): 201–09.
Heckman, James J. 1978. “Dummy Endogenous Variables in a Simultaneous Equation System.” Econometrica, 46(4): 931–59.
Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith. 1999. “The Economics and Econometrics of Active Labor Market Programs.” In Handbook of Labor Economics, Volume 3A, ed. Orley Ashenfelter and David Card, 1865–2097. Amsterdam; New York and Oxford: Elsevier Science, North-Holland.
Heckman, James J., and Edward Vytlacil. 2005. “Structural Equations, Treatment Effects, and Econometric Policy Evaluation.” Econometrica, 73(3): 669–738.
Hjalmarsson, Randi. 2009. “Juvenile Jails: A Path to the Straight and Narrow or to Hardened Criminality?” Journal of Law and Economics, 52(4): 779–809.
Horowitz, Joel L., and Charles F. Manski. 2000. “Nonparametric Analysis of Randomized Experiments with Missing Covariate and Outcome Data.” Journal of the American Statistical Association, 95(449): 77–84.
Hoxby, Caroline M. 2000. “The Effects of Class Size on Student Achievement: New Evidence from Population Variation.” Quarterly Journal of Economics, 115(4): 1239–85.
Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62(2): 467–75.
Imbens, Guido W., and Karthik Kalyanaraman. 2009. “Optimal Bandwidth Choice for the Regression Discontinuity Estimator.” National Bureau of Economic Research Working Paper 14726.
Imbens, Guido W., and Thomas Lemieux. 2008. “Regression Discontinuity Designs: A Guide to Practice.” Journal of Econometrics, 142(2): 615–35.
Jacob, Brian A., and Lars Lefgren. 2004a. “The Impact of Teacher Training on Student Achievement: Quasi-experimental Evidence from School Reform Efforts in Chicago.” Journal of Human Resources, 39(1): 50–79.
Jacob, Brian A., and Lars Lefgren. 2004b. “Remedial Education and Student Achievement: A Regression-Discontinuity Analysis.” Review of Economics and Statistics, 86(1): 226–44.
Kane, Thomas J. 2003. “A Quasi-experimental Estimate of the Impact of Financial Aid on College-Going.” National Bureau of Economic Research Working Paper 9703.
Lalive, Rafael. 2007. “Unemployment Benefits, Unemployment Duration, and Post-unemployment Jobs: A Regression Discontinuity Approach.” American Economic Review, 97(2): 108–12.
Lalive, Rafael. 2008. “How Do Extended Benefits Affect Unemployment Duration? A Regression Discontinuity Approach.” Journal of Econometrics, 142(2): 785–806.
Lalive, Rafael, Jan C. van Ours, and Josef Zweimüller. 2006. “How Changes in Financial Incentives Affect the Duration of Unemployment.” Review of Economic Studies, 73(4): 1009–38.
Lavy, Victor. 2002. “Evaluating the Effect of Teachers’ Group Performance Incentives on Pupil Achievement.” Journal of Political Economy, 110(6): 1286–1317.
Lavy, Victor. 2004. “Performance Pay and Teachers’ Effort, Productivity and Grading Ethics.” National Bureau of Economic Research Working Paper 10622.
Lavy, Victor. 2006. “From Forced Busing to Free Choice in Public Schools: Quasi-Experimental Evidence of Individual and General Effects.” National Bureau of Economic Research Working Paper 11969.
Lee, David S. 2001. “The Electoral Advantage to Incumbency and Voters’ Valuation of Politicians’ Experience: A Regression Discontinuity Analysis of Close Elections.” University of California Berkeley Center for Labor Economics Working Paper 31.
Lee, David S. 2008. “Randomized Experiments from Non-random Selection in U.S. House Elections.” Journal of Econometrics, 142(2): 675–97.
Lee, David S. 2009. “Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects.” Review of Economic Studies, 76(3): 1071–1102.
Lee, David S., and David Card. 2008. “Regression Discontinuity Inference with Specification Error.” Journal of Econometrics, 142(2): 655–74.
Lee, David S., and Justin McCrary. 2005. “Crime, Punishment, and Myopia.” National Bureau of Economic Research Working Paper 11491.
Lee, David S., Enrico Moretti, and Matthew J. Butler. 2004. “Do Voters Affect or Elect Policies? Evidence from the U.S. House.” Quarterly Journal of Economics, 119(3): 807–59.
Lemieux, Thomas, and Kevin Milligan. 2008. “Incentive Effects of Social Assistance: A Regression Discontinuity Approach.” Journal of Econometrics, 142(2): 807–28.
Leuven, Edwin, Mikael Lindahl, Hessel Oosterbeek, and Dinand Webbink. 2007. “The Effect of Extra Funding for Disadvantaged Pupils on Achievement.” Review of Economics and Statistics, 89(4): 721–36.
Leuven, Edwin, and Hessel Oosterbeek. 2004. “Evaluating the Effect of Tax Deductions on Training.” Journal of Labor Economics, 22(2): 461–88.
Ludwig, Jens, and Douglas L. Miller. 2007. “Does Head Start Improve Children’s Life Chances? Evidence from a Regression Discontinuity Design.” Quarterly Journal of Economics, 122(1): 159–208.
Matsudaira, Jordan D. 2008. “Mandatory Summer School and Student Achievement.” Journal of Econometrics, 142(2): 829–50.
McCrary, Justin. 2008. “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test.” Journal of Econometrics, 142(2): 698–714.
McCrary, Justin, and Heather Royer. 2003. “Does Maternal Education Affect Infant Health? A Regression Discontinuity Approach Based on School Age Entry Laws.” Unpublished.
Newey, Whitney K., and Daniel L. McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, Volume 4, ed. Robert F. Engle and Daniel L. McFadden, 2111–2245. Amsterdam; London and New York: Elsevier, North-Holland.
Oreopoulos, Philip. 2006. “Estimating Average and Local Average Treatment Effects of Education When Compulsory Schooling Laws Really Matter.” American Economic Review, 96(1): 152–75.
Pence, Karen M. 2006. “Foreclosing on Opportunity: State Laws and Mortgage Credit.” Review of Economics and Statistics, 88(1): 177–82.
Pettersson, Per. 2000. “Do Parties Matter for Fiscal Policy Choices?” Econometric Society World Congress 2000 Contributed Paper 1373.
Pettersson-Lidbom, Per. 2008a. “Does the Size of the Legislature Affect the Size of Government? Evidence from Two Natural Experiments.” Unpublished.
Pettersson-Lidbom, Per. 2008b. “Do Parties Matter for Economic Outcomes? A Regression-Discontinuity Approach.” Journal of the European Economic Association, 6(5): 1037–56.
Pitt, Mark M., and Shahidur R. Khandker. 1998. “The Impact of Group-Based Credit Programs on Poor Households in Bangladesh: Does the Gender of Participants Matter?” Journal of Political Economy, 106(5): 958–96.
Pitt, Mark M., Shahidur R. Khandker, Signe-Mary McKernan, and M. Abdul Latif. 1999. “Credit Programs for the Poor and Reproductive Behavior in Low-Income Countries: Are the Reported Causal Relationships the Result of Heterogeneity Bias?” Demography, 36(1): 1–21.
Porter, Jack. 2003. “Estimation in the Regression Discontinuity Model.” Unpublished.
Powell, James L. 1994. “Estimation of Semiparametric Models.” In Handbook of Econometrics, Volume 4, ed. Robert F. Engle and Daniel L. McFadden, 2443–2521. Amsterdam; London and New York: Elsevier, North-Holland.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Silverman, Bernard W. 1986. Density Estimation for Statistics and Data Analysis. London and New York: Chapman and Hall.
Snyder, Stephen E., and William N. Evans. 2006. “The Effect of Income on Mortality: Evidence from the Social Security Notch.” Review of Economics and Statistics, 88(3): 482–95.
Thistlethwaite, Donald L., and Donald T. Campbell. 1960. “Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment.” Journal of Educational Psychology, 51(6): 309–17.
Trochim, William M. K. 1984. Research Design for Program Evaluation: The Regression-Discontinuity Approach. Beverly Hills: Sage Publications.
Urquiola, Miguel. 2006. “Identifying Class Size Effects in Developing Countries: Evidence from Rural Bolivia.” Review of Economics and Statistics, 88(1): 171–77.
Urquiola, Miguel, and Eric A. Verhoogen. 2009. “Class-Size Caps, Sorting, and the Regression-Discontinuity Design.” American Economic Review, 99(1): 179–215.
van der Klaauw, Wilbert. 1997. “A Regression-Discontinuity Evaluation of the Effect of Financial Aid Offers on College Enrollment.” New York University C.V. Starr Center for Applied Economics Working Paper 10.
van der Klaauw, Wilbert. 2002. “Estimating the Effect of Financial Aid Offers on College Enrollment: A Regression-Discontinuity Approach.” International Economic Review, 43(4): 1249–87.
van der Klaauw, Wilbert. 2008a. “Breaking the Link between Poverty and Low Student Achievement: An Evaluation of Title I.” Journal of Econometrics, 142(2): 731–56.
van der Klaauw, Wilbert. 2008b. “Regression-Discontinuity Analysis: A Survey of Recent Developments in Economics.” Labour, 22(2): 219–45.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica, 48(4): 817–38.