In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment. Intuitively, IV is used when an explanatory variable of interest is correlated with the error term, in which case ordinary least squares gives biased results. A valid instrument induces changes in the explanatory variable but has no independent effect on the dependent variable, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.
Instrumental variable methods allow for consistent estimation when the explanatory variables (covariates) are correlated with the error terms in a regression model. Such correlation may occur when changes in the dependent variable change the value of at least one of the covariates ("reverse" causation), when there are omitted variables that affect both the dependent and independent variables, or when the covariates are subject to measurement error. Explanatory variables which suffer from one or more of these issues in the context of a regression are sometimes referred to as endogenous. In this situation, ordinary least squares produces biased and inconsistent estimates. However, if an instrument is available, consistent estimates may still be obtained. An instrument is a variable that does not itself belong in the explanatory equation but is correlated with the endogenous explanatory variables, conditional on the value of other covariates. In linear models, there are two main requirements for using IV:
- The instrument must be correlated with the endogenous explanatory variables, conditional on the other covariates. If this correlation is strong, then the instrument is said to have a strong first stage. A weak correlation may provide misleading inferences about parameter estimates and standard errors.
- The instrument cannot be correlated with the error term in the explanatory equation, conditional on the other covariates. In other words, the instrument cannot suffer from the same problem as the original predicting variable. If this condition is met, then the instrument is said to satisfy the exclusion restriction.
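The first requirement can be checked against data, unlike the second. A minimal sketch, assuming a single instrument, homoskedastic errors and simulated data: it computes the first-stage F statistic for the instrument, which in this one-instrument case equals the squared t statistic. The function name `first_stage_f` and every coefficient below are illustrative assumptions, not part of the method's definition.

```python
import numpy as np

def first_stage_f(x, z):
    """F statistic for H0: b = 0 in the first stage x = a + b*z + v.

    Assumes one instrument and homoskedastic errors, so F equals the
    squared t statistic on b.
    """
    n = len(x)
    Z = np.column_stack([np.ones(n), z])           # constant + instrument
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)   # first-stage OLS fit
    resid = x - Z @ coef
    sigma2 = resid @ resid / (n - 2)               # residual variance
    var_b = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]  # variance of the slope
    return coef[1] ** 2 / var_b

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
x_strong = 0.8 * z + rng.normal(size=n)   # assumed: instrument moves x a lot
x_weak = 0.01 * z + rng.normal(size=n)    # assumed: instrument barely moves x

f_strong = first_stage_f(x_strong, z)
f_weak = first_stage_f(x_weak, z)
print(f"strong first stage F = {f_strong:.1f}, weak first stage F = {f_weak:.1f}")
```

A large F indicates that the instrument explains a meaningful share of the variation in the endogenous variable. A small F is a warning that estimates and standard errors built on this instrument may be unreliable.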
The concept of instrumental variables was first derived by Philip G. Wright, possibly in co-authorship with his son Sewall Wright, in the context of simultaneous equations in his 1928 book The Tariff on Animal and Vegetable Oils. In 1945, Olav Reiersøl applied the same approach in the context of errors-in-variables models in his dissertation, giving the method its name.
While the ideas behind IV extend to a broad class of models, a very common context for IV is in linear regression. Traditionally, an instrumental variable is defined as a variable Z that is correlated with the independent variable X and uncorrelated with the "error term" U in the linear equation
$$Y = X\beta + U$$
Note that $X$ is a matrix, usually with a column of ones and perhaps with additional columns for other covariates. Consider how an instrument solves this problem. Recall that OLS solves for $\widehat{\beta}$ such that $\operatorname{cov}(X, \widehat{U}) = 0$ (when we minimize the sum of squared errors, $\min_{\beta}(Y - X\beta)^{\mathsf T}(Y - X\beta)$, the first-order condition is exactly $X^{\mathsf T}(Y - X\widehat{\beta}) = X^{\mathsf T}\widehat{U} = 0$). If the true model is believed to have $\operatorname{cov}(X, U) \neq 0$ due to any of the reasons listed above (for example, if there is an omitted variable which affects both $X$ and $Y$ separately), then this OLS procedure will not yield the causal impact of $X$ on $Y$. OLS will simply pick the parameter that makes the resulting errors appear uncorrelated with $X$.
Consider for simplicity the single-variable case. Suppose we are considering a regression with one variable and a constant (perhaps no other covariates are necessary, or perhaps we have partialed out any other relevant covariates):
$$y = \alpha + \beta x + u$$
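In this case, the coefficient on the regressor of interest is given by $\widehat{\beta} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)}$. Substituting $y = \alpha + \beta x + u$ gives
$$\widehat{\beta} = \frac{\operatorname{cov}(x, \alpha + \beta x + u)}{\operatorname{var}(x)} = \frac{\operatorname{cov}(x, \alpha + \beta x)}{\operatorname{var}(x)} + \frac{\operatorname{cov}(x, u)}{\operatorname{var}(x)} = \beta^{*} + \frac{\operatorname{cov}(x, u)}{\operatorname{var}(x)}$$
where $\beta^{*}$ is what the estimated coefficient would be if $x$ were not correlated with $u$. It can be shown that $\beta^{*}$ would be an unbiased estimator of $\beta$. If $\operatorname{cov}(x, u) \neq 0$ in the underlying model that we believe, then OLS gives a coefficient which does not reflect the underlying causal effect of interest. IV helps to fix this problem by identifying the parameter $\beta$ not based on whether $x$ is uncorrelated with $u$, but based on whether another variable (or set of variables) $z$ is uncorrelated with $u$. If theory suggests that $z$ is related to $x$ (the first stage) but uncorrelated with $u$ (the exclusion restriction), then IV may identify the causal parameter of interest where OLS fails. Because there are multiple specific ways of using and deriving IV estimators even in just the linear case (IV, 2SLS, GMM), we save further discussion for the Estimation section below.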
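The single-variable argument can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a general recipe: the data-generating coefficients, the sample size and the helper names `simulate` and `cov` are all hypothetical. It compares the OLS slope cov(x, y)/var(x), the bias term cov(x, u)/var(x) (computable here only because the simulation exposes u), and the simple IV ratio cov(z, y)/cov(z, x).

```python
import numpy as np

def simulate(n=200_000, beta=2.0, seed=0):
    """Draw data from y = 1 + beta*x + u with x correlated with u (assumed DGP)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                       # instrument, independent of u
    u = rng.normal(size=n)                       # structural error
    x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous: x depends on u
    y = 1.0 + beta * x + u
    return x, y, z, u

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

x, y, z, u = simulate()
beta_ols = cov(x, y) / np.var(x, ddof=1)   # OLS slope
bias = cov(x, u) / np.var(x, ddof=1)       # cov(x, u) / var(x)
beta_iv = cov(z, y) / cov(z, x)            # simple IV (ratio) estimator

print(f"OLS slope: {beta_ols:.3f}")
print(f"OLS slope minus bias term: {beta_ols - bias:.3f}")
print(f"IV estimate: {beta_iv:.3f}")
```

In this setup the OLS slope exceeds the true value of 2 by exactly the sample bias term, while the IV ratio lands close to 2 because z is uncorrelated with u by construction.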
IV techniques have also been developed for a much broader class of non-linear models. General definitions of instrumental variables, using counterfactual and graphical formalism, were given by Pearl (2000; p. 248). The graphical definition requires that Z satisfy the following conditions:
$$(Z \perp\!\!\!\perp Y)_{G_{\overline{X}}} \qquad (Z \not\!\perp\!\!\!\perp X)_{G}$$
where $\perp\!\!\!\perp$ stands for d-separation and $G_{\overline{X}}$ stands for the graph in which all arrows entering X are cut off.
The counterfactual definition requires that Z satisfies
$$(Z \perp\!\!\!\perp Y_x) \qquad (Z \not\!\perp\!\!\!\perp X)$$
where $Y_x$ stands for the value that Y would attain had X been x and $\perp\!\!\!\perp$ stands for independence.
If there are additional covariates W then the above definitions are modified so that Z qualifies as an instrument if the given criteria hold conditional on W.
The essence of Pearl's definition is:
1. The equations of interest are "structural," not "regression."
2. The error term U stands for all exogenous factors that affect Y when X is held constant.
3. The instrument Z should be independent of U.
4. The instrument Z should not affect Y when X is held constant (exclusion restriction).
5. The instrument Z should not be independent of X.
These conditions do not rely on specific functional form of the equations and are applicable therefore to nonlinear equations, where U can be non-additive (see Non-parametric analysis). They are also applicable to a system of multiple equations, in which X (and other factors) affect Y through several intermediate variables. Note that an instrumental variable need not be a cause of X; a proxy of such cause may also be used, if it satisfies conditions 1-5. Note also that the exclusion restriction (condition 4) is redundant; it follows from conditions 2 and 3.
Informally, in attempting to estimate the causal effect of some variable X on another Y, an instrument is a third variable Z which affects Y only through its effect on X. For example, suppose a researcher wishes to estimate the causal effect of smoking on general health. Correlation between health and smoking does not imply that smoking causes poor health because other variables, such as depression, may affect both health and smoking, or because health may affect smoking. It is at best difficult and expensive to conduct controlled experiments on smoking status in the general population. The researcher may attempt to estimate the causal effect of smoking on health from observational data by using the tax rate for tobacco products as an instrument for smoking. The tax rate for tobacco products is a reasonable choice for an instrument because the researcher assumes that it can only be correlated with health through its effect on smoking. If the researcher then finds tobacco taxes and state of health to be correlated, this may be viewed as evidence that smoking causes changes in health.
Angrist and Krueger (2001) present a survey of the history and uses of instrumental variable techniques.
Selecting suitable instruments
Since U is unobserved, the requirement that Z be independent of U cannot be inferred from data and must instead be determined from the model structure, i.e., the data-generating process. Causal graphs are a representation of this structure, and the graphical definition given above can be used to quickly determine whether a variable Z qualifies as an instrumental variable given a set of covariates W. To see how, consider the following example.
Suppose that we wish to estimate the effect of a university tutoring program on GPA. The relationship between attending the tutoring program and GPA may be confounded by a number of factors. Students who attend the tutoring program may care more about their grades or may be struggling with their work. This confounding is depicted in Figures 1-3 through the bidirected arc between Tutoring Program and GPA. If students are assigned to dormitories at random, the proximity of the student's dorm to the tutoring program is a natural candidate for being an instrumental variable.
However, what if the tutoring program is located in the college library? In that case, Proximity may also cause students to spend more time at the library, which in turn improves their GPA (see Figure 1). Using the causal graph depicted in Figure 2, we see that Proximity does not qualify as an instrumental variable because it is connected to GPA through the path Proximity → Library Hours → GPA in $G_{\overline{X}}$ (the graph with all arrows entering Tutoring Program removed). However, if we control for Library Hours by adding it as a covariate, then Proximity becomes an instrumental variable, since Proximity is separated from GPA given Library Hours in $G_{\overline{X}}$.
Now, suppose that we notice that a student's "natural ability" affects his or her number of hours in the library as well as his or her GPA, as in Figure 3. Using the causal graph, we see that Library Hours is a collider and conditioning on it opens the path Proximity → Library Hours ↔ GPA. As a result, Proximity cannot be used as an instrumental variable.
Finally, suppose that Library Hours does not actually affect GPA because students who do not study in the library simply study elsewhere, as in Figure 4. In this case, controlling for Library Hours still opens a spurious path from Proximity to GPA. However, if we do not control for Library Hours and remove it as a covariate, then Proximity can again be used as an instrumental variable.
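The second scenario above (Proximity also raises Library Hours, which raises GPA, with no further confounding of Library Hours) can be mimicked with simulated data. The sketch below assumes linear relationships with made-up coefficients, an unobserved `motivation` variable standing in for the bidirected arc, and a hypothetical helper `iv`. It is meant only to show the direction of the effect, not to model real students.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
effect = 0.5  # assumed true effect of tutoring on GPA

proximity = rng.normal(size=n)
motivation = rng.normal(size=n)  # unobserved confounder of tutoring and GPA
tutoring = 0.7 * proximity + 0.6 * motivation + rng.normal(size=n)
library = 0.5 * proximity + rng.normal(size=n)  # Proximity -> Library Hours
gpa = effect * tutoring + 0.4 * library + 0.8 * motivation + rng.normal(size=n)

def iv(y, X, Z):
    """Just-identified IV estimate: solve (Z'X) b = Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

ones = np.ones(n)

# Proximity as instrument, Library Hours ignored: the path through the library stays open.
b_no_control = iv(gpa,
                  np.column_stack([ones, tutoring]),
                  np.column_stack([ones, proximity]))[1]

# Library Hours added as a covariate (it serves as its own instrument).
b_control = iv(gpa,
               np.column_stack([ones, tutoring, library]),
               np.column_stack([ones, proximity, library]))[1]

print(f"without Library Hours: {b_no_control:.3f}")
print(f"controlling for Library Hours: {b_control:.3f}")
```

Under these assumptions the estimate that ignores Library Hours overstates the tutoring effect, while the estimate that controls for it is close to the assumed value of 0.5. If an unobserved common cause of Library Hours and GPA were added to the simulation, controlling for Library Hours would no longer be safe, as the collider discussion above explains.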
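We now revisit and expand upon the mechanics of IV in greater detail. Suppose the data are generated by a process of the form
$$y_i = X_i\beta + e_i$$
where $i$ indexes observations, $y_i$ is the dependent variable, $X_i$ is a row vector containing the independent variables and a constant, and $e_i$ is an unobserved error term collecting all causes of $y_i$ other than $X_i$.
The parameter vector $\beta$ is the causal effect on $y_i$ of a one unit change in each element of $X_i$, holding all other causes of $y_i$ constant. The econometric goal is to estimate $\beta$. For simplicity's sake assume the draws of e are uncorrelated and that they are drawn from distributions with the same variance (that is, that the errors are serially uncorrelated and homoskedastic).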
Suppose also that a regression model of nominally the same form is proposed. Given a random sample of T observations from this process, the ordinary least squares estimator is
$$\widehat{\beta}_{\mathrm{OLS}} = (X^{\mathsf T}X)^{-1}X^{\mathsf T}y = (X^{\mathsf T}X)^{-1}X^{\mathsf T}(X\beta + e) = \beta + (X^{\mathsf T}X)^{-1}X^{\mathsf T}e$$
where y and e denote column vectors of length T and X denotes the matrix whose T rows are the $X_i$. Note the similarity of this equation to the equation involving $\operatorname{cov}(x, y)$ in the introduction (this is the matrix version of that equation). When X and e are uncorrelated, under certain regularity conditions the second term has an expected value conditional on X of zero and converges to zero in the limit, so the estimator is unbiased and consistent. When X and the other unmeasured, causal variables collapsed into the e term are correlated, however, the OLS estimator is generally biased and inconsistent for β. In this case, it is valid to use the estimates to predict values of y given values of X, but the estimate does not recover the causal effect of X on y.
To recover the underlying parameter $\beta$, we introduce a set of variables Z that is highly correlated with each endogenous component of X but (in our underlying model) is not correlated with e. For simplicity, one might consider X to be a T × 2 matrix composed of a column of constants and one endogenous variable, and Z to be a T × 2 matrix consisting of a column of constants and one instrumental variable. However, this technique generalizes to X being a matrix of a constant and, say, 5 endogenous variables, with Z being a matrix composed of a constant and 5 instruments. In the discussion that follows, we will assume that X is a T × K matrix and leave this value K unspecified. An estimator in which X and Z are both T × K matrices is referred to as just-identified.
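Suppose that the relationship between each endogenous component $x_i$ and the instruments is given by
$$x_i = Z_i\gamma + v_i$$
The most common IV specification uses the following estimator:
$$\widehat{\beta}_{\mathrm{IV}} = (Z^{\mathsf T}X)^{-1}Z^{\mathsf T}y$$
Note that this specification approaches the true parameter as the sample gets large, so long as the instruments are uncorrelated with the error, $\operatorname{E}[Z_i^{\mathsf T}e_i] = 0$, in the true model:
$$\widehat{\beta}_{\mathrm{IV}} = (Z^{\mathsf T}X)^{-1}Z^{\mathsf T}y = (Z^{\mathsf T}X)^{-1}Z^{\mathsf T}(X\beta + e) = \beta + (Z^{\mathsf T}X)^{-1}Z^{\mathsf T}e \xrightarrow{\;p\;} \beta$$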
As long as $\operatorname{E}[Z_i^{\mathsf T}e_i] = 0$ in the underlying process which generates the data, the appropriate use of the IV estimator will identify this parameter. This works because IV solves for the unique parameter whose residuals $\widehat{e} = y - X\widehat{\beta}$ satisfy $Z^{\mathsf T}\widehat{e} = 0$, and therefore homes in on the true underlying parameter as the sample size grows.
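The matrix form of the just-identified estimator, (ZᵀX)⁻¹Zᵀy, can be sketched numerically as follows. The simulated process, its coefficients and the function names `iv_estimate`, `ols_estimate` and `draw` are assumptions made for illustration. The code solves the linear system rather than inverting ZᵀX explicitly, which is a numerical convenience and not part of the estimator's definition.

```python
import numpy as np

def iv_estimate(X, Z, y):
    """Just-identified IV: the b that solves (Z'X) b = Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def ols_estimate(X, y):
    """OLS: the b that solves (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def draw(T, rng, beta=(1.0, 2.0)):
    """Assumed process: one endogenous regressor, one instrument, plus constants."""
    z = rng.normal(size=T)
    e = rng.normal(size=T)
    x = 0.8 * z + 0.5 * e + rng.normal(size=T)   # x is correlated with e
    X = np.column_stack([np.ones(T), x])         # T x 2
    Z = np.column_stack([np.ones(T), z])         # T x 2
    y = X @ np.asarray(beta) + e
    return X, Z, y

rng = np.random.default_rng(42)
results = {}
for T in (100, 10_000, 1_000_000):
    X, Z, y = draw(T, rng)
    b_ols, b_iv = ols_estimate(X, y), iv_estimate(X, Z, y)
    # Sample moment Z'(y - X b_iv) / T, which the IV estimator sets to zero.
    moment = Z.T @ (y - X @ b_iv) / T
    results[T] = (b_ols[1], b_iv[1], np.abs(moment).max())
    print(f"T={T:>9}: OLS slope {b_ols[1]:.3f}, IV slope {b_iv[1]:.3f}")
```

With the assumed slope of 2, the IV estimate tightens around 2 as T grows, whereas the OLS slope settles on a different value because x and e remain correlated at every sample size.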
Now an extension: suppose that there are more instruments than there are covariates in the equation of interest, so that Z is a T × M matrix with M > K. This is often called the over-identified case. In this case, the generalized method of moments (GMM) can be used. The GMM IV estimator is
$$\widehat{\beta}_{\mathrm{GMM}} = (X^{\mathsf T}P_Z X)^{-1}X^{\mathsf T}P_Z y$$
where $P_Z$ refers to the projection matrix $P_Z = Z(Z^{\mathsf T}Z)^{-1}Z^{\mathsf T}$.
Note that this expression collapses to the first when the number of instruments is equal to the number of covariates in the equation of interest. The over-identified IV is therefore a generalization of the just-identified IV.
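A numerical check of this collapse, offered as a sketch with simulated data: the estimator (XᵀP_Z X)⁻¹XᵀP_Z y is computed with P_Z = Z(ZᵀZ)⁻¹Zᵀ applied through a least-squares fit, so the T × T projection matrix is never formed. The function names, the three hypothetical instruments and all coefficients are assumptions.

```python
import numpy as np

def iv_just_identified(X, Z, y):
    """(Z'X)^{-1} Z'y, valid when Z and X have the same number of columns."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def gmm_iv(X, Z, y):
    """(X' P_Z X)^{-1} X' P_Z y, using P_Z X = Z (Z'Z)^{-1} Z'X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # P_Z X (fitted values)
    # P_Z is symmetric, so X' P_Z X = X_hat' X and X' P_Z y = X_hat' y.
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(7)
T = 50_000
beta = np.array([1.0, 2.0])                  # assumed true parameters
instruments = rng.normal(size=(T, 3))        # three hypothetical instruments
e = rng.normal(size=T)
x = instruments @ np.array([0.6, 0.4, 0.3]) + 0.5 * e + rng.normal(size=T)
X = np.column_stack([np.ones(T), x])         # K = 2
y = X @ beta + e

Z_just = np.column_stack([np.ones(T), instruments[:, 0]])   # M = K = 2
Z_over = np.column_stack([np.ones(T), instruments])         # M = 4 > K

b_iv = iv_just_identified(X, Z_just, y)
b_gmm_just = gmm_iv(X, Z_just, y)
b_gmm_over = gmm_iv(X, Z_over, y)

print("just-identified IV: ", b_iv)
print("GMM, same Z:        ", b_gmm_just)
print("GMM, over-identified:", b_gmm_over)
```

With the same square Z, the two functions agree up to floating-point error. With all three instruments, only `gmm_iv` applies, since ZᵀX is then no longer square.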
Proof that βGMM collapses to βIV in the just-identified case
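Developing the expression:
$$\widehat{\beta}_{\mathrm{GMM}} = \left(X^{\mathsf T}Z(Z^{\mathsf T}Z)^{-1}Z^{\mathsf T}X\right)^{-1}X^{\mathsf T}Z(Z^{\mathsf T}Z)^{-1}Z^{\mathsf T}y$$
In the just-identified case, there are as many instruments as covariates, so $X^{\mathsf T}Z$, $Z^{\mathsf T}Z$ and $Z^{\mathsf T}X$ are all square matrices of the same dimension. Using the fact that $(ABC)^{-1} = C^{-1}B^{-1}A^{-1}$ for invertible square matrices of equal size, the inverse can be expanded:
$$\widehat{\beta}_{\mathrm{GMM}} = (Z^{\mathsf T}X)^{-1}(Z^{\mathsf T}Z)(X^{\mathsf T}Z)^{-1}X^{\mathsf T}Z(Z^{\mathsf T}Z)^{-1}Z^{\mathsf T}y = (Z^{\mathsf T}X)^{-1}Z^{\mathsf T}y = \widehat{\beta}_{\mathrm{IV}}$$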
Summary
In this paper, we consider parameter estimation in a linear simultaneous equations model. It is well known that two-stage least squares (2SLS) estimators may perform poorly when the instruments are weak. In this case 2SLS tends to suffer from substantial small-sample biases. It is also known that LIML and Nagar-type estimators are less biased than 2SLS but suffer from large small-sample variability. We construct a bias-corrected version of 2SLS based on the Jackknife principle. Using higher-order expansions we show that the MSE of our Jackknife 2SLS estimator is approximately the same as the MSE of the Nagar-type estimator. We also compare the Jackknife 2SLS with an estimator suggested by Fuller (Econometrica 45, 933–54) that significantly decreases the small-sample variability of LIML. Monte Carlo simulations show that even in relatively large samples the MSE of LIML and Nagar can be substantially larger than for Jackknife 2SLS. The Jackknife 2SLS estimator and Fuller's estimator give the best overall performance. Based on our Monte Carlo experiments we conduct informal statistical tests of the accuracy of approximate bias and MSE formulas. We find that higher-order expansions traditionally used to rank LIML, 2SLS and other IV estimators are unreliable when identification of the model is weak. Overall, our results show that only estimators with well-defined finite sample moments should be used when identification of the model is weak.