Many Weak Instruments in Time Series Econometrics
By Anna Mikusheva1
Abstract
This paper studies linear instrumental variables (IV) estimation in time series settings where many instruments are available. Motivation comes from GMM estimation of rational expectation models, including the New Keynesian Phillips curve, Euler equations, and Taylor rules. The paper surveys and summarizes ideas from the cross-sectional literature on many weak instruments, establishes new results for a split-sample approach, and discusses extensions and adaptations of the cross-sectional results to time series settings. The main challenge of estimation with many weak instruments comes from endogeneity of the estimated instrument, which can be solved using sample splitting, cross-fitting, jackknifing and deleted diagonal approaches. This paper shows that the split-sample approach is agnostic to the method used to estimate the optimal instrument, allowing for a variety of machine learning estimators to be employed, and produces easy-to-implement, asymptotically reliable statistical inferences under both weak and strong identification.
Keywords: Weak Identification, Many Instruments, Time Series
JEL Codes: C14, C22, C26, C55
This draft: February 2021.
1 Introduction
Many structural macroeconometric relations, including the New Keynesian Phillips Curve,
Euler equations, and Taylor rules, are known to be weakly identified when estimated by
GMM using aggregated macro-data, see e.g. Mavroeidis (2004). One important and
probably underused feature of these models is that they are formulated as conditional
moment restrictions, leading to potentially many unconditional moment equations which
may be used for estimation. Specifically, all lags of any available macro variable can serve as a valid instrument.
1Department of Economics, M.I.T., 50 Memorial Drive, E52-526, Cambridge, MA, 02142. Email:
amikushe@mit.edu. National Science Foundation support under grant number 1757199 is gratefully
acknowledged. I benefitted from discussions with Stanislav Anatolyev, Isaiah Andrews, Josh Angrist,
Whitney Newey, Liyang Sun, David Hughes and Sylvia Klosin.

The potential of exploiting this wealth of available information to produce more accurate inferences about structural parameters makes usage of many weak instruments very promising. There have been many recent advances in understanding the statistical issues and developing reliable methods to exploit many weak instruments in cross-sectional settings (e.g. Hausman et al (2012) and Belloni et al (2012), among many others), while the adaptation of these methods to time series is lagging behind.

The main goals of this paper are: to survey and systematize recent advances in cross-sectional studies of many weak instruments; to establish some missing results; and to investigate how these tools can be adapted to empirical macroeconometric applications. The paper advocates for the use of split-sample IV estimation as the easiest and most versatile approach for extracting information from an abundant set of instruments and delivering clean statistical inferences on the structural coefficient. This approach, as we argue, is very adaptable to the additional challenges posed by time series data. We establish new results about the consistency and asymptotic distributions of the split-sample estimator, and discuss weak identification robust inferences. The paper also surveys machine learning (ML) approaches popular in time series settings that can be freely combined with the sample splitting idea to select/estimate the optimal instrument.

We frame the central issue of using many instruments as the problem of endogeneity of the estimated instrument. The optimal instrument in a model with homoskedastic martingale-difference errors coincides with the best predictor of the endogenous regressor given the available set of instruments. A variety of ML techniques can be used to select the best predictive model and to construct the optimal instrument. The challenge, however, is that fitting the endogenous regressor with a very flexible model also fits the endogenous part of the regressor in a flexible way. Similarly, selecting an instrument out of a large set of available instruments based on its predictive power for the endogenous regressor favors the instruments showing larger in-sample correlation with the endogenous first-stage error term. As we argue, flexible estimation/selection of instruments leads to the constructed optimal instrument being endogenous, despite each original instrument being exogenous. When the instruments contain a strong signal about the regressor, the problem of endogenous selection is reflected in large finite-sample biases, while in the case where the information is limited, we may end up with an inconsistent estimator and

2 Empirical examples

In structural macroeconometrics it is common to have a large number of potential instruments. For example, when estimating rational expectations models the exclusion restriction is often formulated as a conditional expectation, where the conditioning is on all information available at the time the expectation is taken. This makes all lags of any macro variables valid instruments. Despite the seeming abundance of potential instruments, structural estimation using aggregate data often suffers from weak identification, at least when a relatively small number of carefully chosen instruments is used.

Example 1. New Keynesian Phillips Curve. The NKPC is a rational expectation model capturing a trade-off between the rate of inflation and the level of economic activity. A theoretical justification of the NKPC comes from the Calvo model. There exists a diverse range of empirical specifications, but the most common is the following:

$$\pi_t = \lambda x_t + \gamma_f E_t \pi_{t+1} + \gamma_b \pi_{t-1} + u_t. \qquad (1)$$

Here π_t is inflation in period t, x_t is a proxy for marginal costs (often the labor share or output gap), u_t is an unpredictable structural error, and E_t is a rational expectation formed at time t. Gali and Gertler (1999) proposed GMM-IV estimation of the NKPC by forming the moment condition

$$E\big[(\pi_t - \lambda x_t - \gamma_f \pi_{t+1} - \gamma_b \pi_{t-1})\, Z_{t-1}\big] = 0,$$

where one can use any variable observed at time t−1 in the instrument set Z_{t−1}. Kleibergen and Mavroeidis (2009) show that 'weak instrument problems arise if marginal costs have limited dynamics or if their coefficient is close to zero, that is, when the NKPC is flat, since in those cases the exogenous variation in inflation forecasts is limited.' A survey paper by Mavroeidis et al (2014) reports that 'estimation of the NKPC using macro data is subject to a severe weak instruments problem. Consequently, seemingly innocuous specification changes lead to big differences in point estimates.'
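To make the instrument construction concrete, the following is a minimal numpy sketch, not code from the paper: the function names are hypothetical, the input is assumed to be a dict of equal-length quarterly series, and the number of lags is an arbitrary choice. It stacks all lags of the supplied macro series into a large instrument set Z_{t−1} and forms the sample analogue of the moment condition above.

```python
import numpy as np

def build_lagged_instruments(series, n_lags):
    """Stack lags 1..n_lags of every supplied macro series into an instrument
    matrix whose row t contains only information dated t-1 or earlier."""
    T = len(next(iter(series.values())))
    cols = []
    for x in series.values():
        x = np.asarray(x, dtype=float)
        for lag in range(1, n_lags + 1):
            col = np.full(T, np.nan)
            col[lag:] = x[:T - lag]          # value dated t - lag sits in row t
            cols.append(col)
    Z = np.column_stack(cols)
    keep = ~np.isnan(Z).any(axis=1)          # rows lost to lagging
    return Z, keep

def nkpc_sample_moments(pi, x, Z, keep, lam, gamma_f, gamma_b):
    """Sample analogue of E[(pi_t - lam*x_t - gamma_f*pi_{t+1}
    - gamma_b*pi_{t-1}) * Z_{t-1}], one entry per instrument."""
    pi = np.asarray(pi, dtype=float)
    x = np.asarray(x, dtype=float)
    t = np.arange(1, len(pi) - 1)            # need both pi_{t-1} and pi_{t+1}
    resid = pi[t] - lam * x[t] - gamma_f * pi[t + 1] - gamma_b * pi[t - 1]
    ok = keep[t]
    return (Z[t][ok] * resid[ok][:, None]).mean(axis=0)
```

With even a handful of macro series and a modest number of lags, the instrument count quickly reaches the dozens, which is exactly the many-instrument setting the paper studies.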

Example 2. Euler equation. Euler equations for consumption or output are an important part of many macroeconomic models. There are multiple specifications used

in empirical work – below is a formulation suggested by Fuhrer and Rudebusch (2004):

$$c_t = \alpha + \phi E_t c_{t+1} - \varphi r_t + \sum_{j=1}^{J} \alpha_j c_{t-j} + u_t,$$

where c_t and r_t are the logs of consumption (output) and the real interest rate. GMM estimation of the Euler equation was proposed by Hansen and Singleton (1982), who suggested using lags of available variables as instruments. Yogo (2004) raised the issue of weak identification of the elasticity of intertemporal substitution. Ascari et al (2020) provides a comprehensive survey of different specifications and estimation approaches to the Euler equation.

Example 3. Taylor rule. A Taylor rule is a policy reaction function that describes how a central bank conducts monetary policy. One common specification is

$$r_t = \bar r + \beta(E_t \pi_{t+1} - \bar\pi) + \gamma E_t x_{t+1} + \epsilon_t,$$

where r_t is the Federal Funds rate, π_t the inflation rate, x_t the output gap, and r̄ and π̄ are equilibrium rates. Clarida et al (1998) suggested a GMM approach to estimating the Taylor rule that allows the researcher to use any lagged variables as instruments. Mavroeidis (2004) draws attention to weak identification of the Taylor rule.

Example 4. Factor pricing. Factor pricing models assume that the expected excess return on a stock or a portfolio of assets is equal to the price of risk (or risk premia) λ, for some risk factor F_t, multiplied by the portfolio's quantity of risk β_i:

$$E r_{it} = \lambda \beta_i, \qquad \beta_i = \big(\operatorname{Var}(F_t)\big)^{-1}\operatorname{cov}(F_t, r_{it}).$$

One commonly used estimation procedure is the Fama-MacBeth approach (Fama and MacBeth (1973), Shanken (1992)) that first estimates β_i by running time series regressions of excess returns r_it on the realization of risk factor F_t for each asset separately. Then, one runs a cross-sectional regression of the average return for each asset on its estimated β_i to obtain an estimate of the risk premia λ. This procedure can be interpreted as a classical TSLS estimator with a large number of instruments. The number of instruments here equals the number of assets used for estimation multiplied by the number of factors,

3 Cross-Section: statement of the problem

3.1 Many Instruments: constructing optimal instrument

In this section we concentrate our attention on cross-sectional data and assume that we observe an i.i.d. sample (Y_i, X_i, Z_i) for i = 1, ..., n. We consider a linear IV model with a one-dimensional endogenous regressor X and K-dimensional instrument Z

$$Y_i = \beta X_i + e_i, \qquad E[e_i \mid Z_i] = 0.$$

Chamberlain (1987) derived the optimal instrument f_i that minimizes the variance of the IV estimate: f_i = E[X_i|Z_i] / E[e_i²|Z_i]. Many papers in this literature aim to find an estimation and inference procedure that achieves semi-parametric efficiency under homoskedasticity, while at the same time delivering valid results under heteroskedasticity (heteroskedasticity-robust). In accordance with this goal, we look for an optimal instrument of the form

$$f_i = E[X_i \mid Z_i]. \qquad (2)$$

In practice the optimal instrument is not known and has to be estimated – Newey (1990) suggested estimating f_i non-parametrically. In this paper we consider only two-step estimators, which cover the vast majority of available estimators. In the first step one constructs a model of the best predictor for X_i based on the potential predictors/features Z_i using some regularized non-parametric estimation and selection methods. Denote this estimated optimal instrument f̂_i = Ê[X_i|Z_i]. In the second step, the estimated optimal instrument is employed in the just-identified linear IV:

$$\hat\beta = \frac{\sum_{i=1}^{n} \hat f_i Y_i}{\sum_{i=1}^{n} \hat f_i X_i}. \qquad (3)$$
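This two-step structure is easy to express in code. The sketch below is illustrative: the callable `first_stage_fit` and its `.predict` contract are assumptions of the sketch, not notation from the paper; any learner for E[X|Z] can be plugged in.

```python
import numpy as np

def two_step_iv(Y, X, Z, first_stage_fit):
    """Generic two-step IV: (1) fit a model for E[X|Z] with any learner;
    (2) use the fitted values f_hat as the single instrument in the
    just-identified formula beta_hat = sum(f_hat * Y) / sum(f_hat * X)."""
    Y, X = np.asarray(Y, dtype=float), np.asarray(X, dtype=float)
    f_hat = first_stage_fit(Z, X).predict(Z)   # estimated optimal instrument
    return float(f_hat @ Y) / float(f_hat @ X)
```

Passing OLS on all K instruments as the first stage recovers TSLS, while passing a LASSO or a random forest gives the flexible first stages listed next; Section 3.3 explains why this naive full-sample plug-in becomes problematic when instruments are many and weak.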

Let us mention several prominent approaches for first-step estimation. Donald and Newey (2001) proposed an instrument selection procedure based on a Mallows criterion. Belloni et al (2010) and Belloni et al (2012) suggest using LASSO estimation on the first step to construct the optimal instrument. Okui (2011) proposes a shrinkage estimator, assuming that there is a known set of strong instruments that delivers a consistent estimator of β. Carrasco (2012) suggests several regularization procedures based on the spectral decomposition of the conditional expectation operator. Among her proposals are the principal components approach and Tikhonov's regularization of the conditional expectation operator. There have also been recent suggestions to use other machine learning techniques for the optimal instrument construction, such as the random forest (Ash et al (2018)).

All procedures mentioned above deliver semi-parametric efficient estimators under some set of assumptions. Typically this involves some assumption placed on the form of the optimal instrument, which allows for its consistent estimation. For example, the LASSO procedure of Belloni et al (2010) delivers the desired results if the first-stage regression is approximately sparse, that is, a relatively small number of the instruments successfully approximates the optimal instrument. Donald and Newey (2001) assumes a known ordering among instruments (or groups of instruments) by strength/informativeness. Another type of assumption often needed is a regularity condition placed on the conditional expectation operator. For example, Belloni et al (2012) restrict the eigenvalues of the empirical Gram matrix, while Carrasco (2012) assumes that the conditional expectation operator is a Hilbert-Schmidt operator. All papers mentioned above assume strong identification of the IV model.
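As an illustration of two of the first-stage choices just listed, the sketch below (a rough approximation; the scikit-learn calls and the tuning choices cv=5 and n_components=5 are assumptions, not recommendations from the cited papers) builds a cross-validated LASSO first stage in the spirit of Belloni et al and a principal-components first stage in the spirit of Carrasco's spectral regularization. Either object can be handed to the two-step routine sketched earlier.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import make_pipeline

def lasso_first_stage(Z, X):
    """Sparse first stage: select and fit E[X|Z] by cross-validated LASSO."""
    return LassoCV(cv=5).fit(Z, X)

def pca_first_stage(Z, X, n_components=5):
    """Regularized first stage: regress X on the leading principal
    components of the instruments."""
    return make_pipeline(PCA(n_components=n_components),
                         LinearRegression()).fit(Z, X)
```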

3.2 Weak Instruments

In this section we provide a very brief summary of known facts about weak identification in the just-identified case. Specifically, we consider the identification strength of the optimal instrument as if it is known, and take the 'optimal' instrument to be the one defined in (2). We acknowledge the limitation of this definition and recognize that the weak IV literature has an unresolved debate about the choice of a powerful test, and direction of power, for over-identified linear IV models. However, we intend to stay away from this debate and solve a somewhat simpler problem, maintaining definition (2) as a goalpost. Let us write the (infeasible) first-stage regression as:

$$X_i = E[X_i \mid Z_i] + v_i = f_i + v_i, \qquad (4)$$

where v_i is the prediction error with E[v_i|Z_i] = 0. Weak identification arises when the uncertainty coming from the prediction error v_i is empirically important; that is, cases

A wide literature is devoted to identification-robust testing and confidence set construction. The best known and most often used tests are the Anderson-Rubin statistic, Kleibergen's (2002) KLM and the conditional likelihood ratio of Moreira (2003). They are justified in settings with a small number of instruments and are equivalent in the just-identified case. The recommendation for the just-identified setting is to always use identification-robust tests rather than employing a weak identification pre-test. This recommendation follows from the statement that the robust tests mentioned above are asymptotically efficient in the just-identified homoskedastic case if identification is strong.
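For reference, a minimal heteroskedasticity-robust Anderson-Rubin test for the just-identified case with a single instrument can be sketched as follows (an illustrative simplification: no intercept or exogenous controls, matching the model Y_i = βX_i + e_i above; the function name is hypothetical).

```python
import numpy as np
from scipy import stats

def anderson_rubin_pvalue(Y, X, f, beta0):
    """Heteroskedasticity-robust Anderson-Rubin test of H0: beta = beta0 in the
    just-identified model Y_i = beta*X_i + e_i with a single instrument f_i."""
    u = Y - beta0 * X                      # structural residual imposed under H0
    n = len(u)
    g = f @ u / n                          # sample moment (1/n) * sum f_i * u_i
    var_g = (f**2 @ u**2) / n**2           # robust variance of the sample moment
    ar_stat = g**2 / var_g                 # distributed chi2(1) under H0
    return float(1 - stats.chi2.cdf(ar_stat, df=1))
```

A weak-identification-robust confidence set is then obtained by collecting all values of β0 that are not rejected, for example by scanning a grid.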

3.3 Main problem: endogeneity of the estimated instrument

Problematic simulations. A recent thought-provoking paper by Angrist and Frandsen (2020) assesses the utility of machine learning techniques in modern applied labor economics applications. Discussed in great detail is the use of machine learning techniques for instrument selection. The authors create simulation exercises based on two applications: identification of the return to education using quarter of birth as instruments (Angrist and Krueger (1991)); and the effect of a movie's opening-weekend viewership on subsequent sales, with instruments generated by weather indicators (Gilchrist and Sands (2016)). The authors diligently design the simulation settings to match the empirical examples along many directions, including the heterogeneity of the first-stage effects and heteroskedasticity. The striking conclusion of Angrist and Frandsen (2020) is that the use of machine learning techniques for construction of the optimal instrument in these two applications does not deliver the results many hope for. The authors explored the performance of IV regression using both LASSO and random forest estimators for the first stage and contrasted it with OLS, TSLS and several jackknife and split-sample estimators we will discuss below. In almost all cases the IV estimators using LASSO and random forest estimates delivered large biases, comparable to those of TSLS and OLS, without much improvement in terms of variance. The performance of both machine learning methods depends significantly on the choice of regularization parameter (the cross-validation or plug-in penalties for the LASSO, or the leaf size for the random forest), with none of the standard choices being totally satisfactory in these applications. These results rhyme well with the simulation evidence in Hansen and Kozbur (2014), where the authors report less than stellar performance of IV estimators using a LASSO first stage when the signal on the first stage is weak.

Essence of the problem. Estimating the optimal instrument in a very flexible way, or selecting from among many instruments, may lead to over-fitting the endogenous part of the regressor X. This makes the estimated optimal instrument endogenous, E[f̂_i e_i] ≠ 0, even though each individual instrument is exogenous, and leads to bias in the IV estimator. We explain this phenomenon below in the context of the TSLS estimator, following Bekker (1994). Assume the optimal instrument f_i = π′Z_i is a linear combination of the available K instruments and assume that Z′Z is a matrix of rank K. The TSLS estimator uses the following estimated optimal instrument:

$$\hat f_i = X'Z(Z'Z)^{-1}Z_i = f_i + v'Z(Z'Z)^{-1}Z_i,$$

where the estimation error is correlated with the structural error, since the prediction error v_i is endogenous. Under conditional homoskedasticity we find the following formula for the endogeneity of the estimated instrument:

$$E\Big[\frac{1}{n}\sum_{i=1}^{n}(\hat f_i - f_i)e_i\Big] = \frac{E[v_i e_i]}{n}\operatorname{tr}\big(Z(Z'Z)^{-1}Z'\big) = \frac{K}{n}\,\sigma_{ev}.$$

As we see, the endogeneity of the estimated optimal instrument is increasing in the number of available instruments as the endogenous part of regressor X is being fitted more flexibly. A larger number of instruments, or a more flexible first stage, may result in a large finite-sample bias of the two-step estimator. Under asymptotics in which the number of instruments K is growing, this may even lead to inconsistency. Similar observations can be made about other IV estimators that have relatively simple form (e.g. see Hansen and Kozbur (2014) for a ridge first stage). Unfortunately, the exact form of the bias is unavailable for first-stage estimators that involve simultaneous variable selection and estimation, and hence do not have a simple analytical form. Nevertheless, the logic behind the endogeneity of the estimated instrument is somewhat similar. Among many instruments that have similar explanatory power, the ones that

the estimated instrument for each observation i ∈ I_2, and then runs just-identified IV on the second subsample with instrument f̂_i. We denote by β̂_SS the split-sample estimator of β, defined as:

$$\hat\beta_{SS} = \frac{\sum_{i\in I_2} \hat f(A_1, Z_i)\, Y_i}{\sum_{i\in I_2} \hat f(A_1, Z_i)\, X_i}.$$

Whenever the method used for the construction of the optimal instrument is specified, we add this to the name of the estimator. For example, SS-LASSO is the estimator that estimates f̂_i using a LASSO regression of X on the instruments in the I_1 sample only, and then runs just-identified IV on the subsample I_2. We usually assume that the number of observations in the second subsample I_2 increases to infinity with the sample size, but the two subsamples are not required to be of the same size.
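A minimal sketch of the split-sample estimator in the i.i.d. setting of this section (illustrative: the even random split and the `first_stage_fit` interface are assumptions of the sketch, and any of the first-stage learners sketched earlier can be passed in):

```python
import numpy as np

def split_sample_iv(Y, X, Z, first_stage_fit, seed=0):
    """Split-sample IV: fit the first stage on subsample I1 only, form the
    estimated instrument on I2 only, and run just-identified IV on I2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    I1, I2 = idx[: len(Y) // 2], idx[len(Y) // 2:]
    f_hat = first_stage_fit(Z[I1], X[I1]).predict(Z[I2])
    return float(f_hat @ Y[I2]) / float(f_hat @ X[I2])
```

SS-LASSO, in the naming convention above, corresponds to passing the LASSO first stage as `first_stage_fit`.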

Cross-fit. To recover the efficiency lost from using only part of the sample for structural estimation, the researcher may use the two subsamples symmetrically: fitting the first stage on each subsample separately and producing the estimated optimal instrument in each subsample by using the fitted model estimated on the opposite one. Specifically, for i ∈ I_1 the estimated instrument f̂_i = f̂(A_2, Z_i) is a function of A_2 and Z_i only, while for i ∈ I_2 the corresponding f̂_i = f̂(A_1, Z_i) is a function of A_1 and Z_i. The cross-fit split-sample estimator of the structural parameter β is defined as:

$$\hat\beta_{CFSS} = \frac{\sum_{i\in I_1} \hat f(A_2, Z_i)\, Y_i + \sum_{i\in I_2} \hat f(A_1, Z_i)\, Y_i}{\sum_{i\in I_1} \hat f(A_2, Z_i)\, X_i + \sum_{i\in I_2} \hat f(A_1, Z_i)\, X_i}.$$

The formulation above makes sense if the two subsamples are of equal size, but we may also consider other alternatives such as weighted averages of the two individual split-sample estimates.
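A sketch of the cross-fit version under the same assumed interface, pooling the two symmetric splits in the numerator and denominator as in the formula above:

```python
import numpy as np

def cross_fit_iv(Y, X, Z, first_stage_fit, seed=0):
    """Cross-fit split-sample IV: each half's instrument is predicted from a
    first stage fitted on the opposite half; numerator and denominator pool
    both halves."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    I1, I2 = idx[: len(Y) // 2], idx[len(Y) // 2:]
    num = den = 0.0
    for fit_on, use_on in ((I1, I2), (I2, I1)):
        f_hat = first_stage_fit(Z[fit_on], X[fit_on]).predict(Z[use_on])
        num += f_hat @ Y[use_on]
        den += f_hat @ X[use_on]
    return num / den
```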

Jackknife. An extreme form of the sample splitting idea is the jackknife or leave-one-out (Angrist et al. (1999)) estimator, where in order to estimate the optimal instrument for observation i one uses the sample containing all observations except i. In the case where the first stage is estimated using OLS, this means using in (3) the instrument f̂_i = π̂′_{(−i)} Z_i, with π̂_{(−i)} = (Z′_{(−i)} Z_{(−i)})^{−1} Z′_{(−i)} X_{(−i)}, where the index (−i) indicates the matrix including all observations but i.

This idea may be applied to any other ML approach used on the first step. For jackknife estimators let us, for any index i, denote by A_{−i} the full set (X_j, Y_j, Z_j) of random variables in the sample excluding observation (X_i, Y_i, Z_i). Then the jackknife IV estimator is

$$\hat\beta_{JIVE} = \frac{\sum_i \hat f(A_{-i}, Z_i)\, Y_i}{\sum_i \hat f(A_{-i}, Z_i)\, X_i}.$$

It is worth pointing out that for a complicated ML algorithm this estimator is computationally demanding, as it requires re-running the ML first-stage estimation on each data set A_{−i} separately. For clarity, in this paper we name these estimators JIVE- combined with the name of the first-stage estimator; e.g. JIVE-LASSO runs a separate LASSO estimation for each observation i on the first stage. We refer to the estimator introduced in Angrist et al. (1999), in which the first stage is estimated using least squares, as JIVE-OLS.
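A direct leave-one-out sketch of JIVE-OLS as defined above (illustrative; the function name is hypothetical, and the loop deliberately re-fits the first stage n times, which is exactly the computational burden just noted and which the deleted-diagonal form discussed next avoids for the OLS case):

```python
import numpy as np

def jive_ols(Y, X, Z):
    """JIVE-OLS: for each observation i the instrument is the OLS first-stage
    fit computed on all observations except i (direct leave-one-out loop)."""
    n = len(Y)
    f_hat = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        pi_minus_i, *_ = np.linalg.lstsq(Z[mask], X[mask], rcond=None)
        f_hat[i] = Z[i] @ pi_minus_i
    return float(f_hat @ Y) / float(f_hat @ X)
```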

Deleted diagonal estimators. Direct implementation of the jackknife form described above can be numerically demanding. However, Angrist et al. (1999) showed that the jackknife IV estimator with the OLS first step (JIVE-OLS) can be calculated as

$$\hat\beta_{JIVE} = \frac{X'\tilde P Y}{X'\tilde P X},$$

where the n × n matrix of weights P̃ can be calculated from the projection matrix P = Z(Z′Z)^{−1}Z′ by eliminating diagonal elements and re-scaling rows: P̃_ij = P_ij / (1 − P_ii) if i ≠ j, and P̃_ii = 0. Notice that the JIVE-OLS estimate is the solution to the optimization problem that minimizes the quadratic form (Y − βX)′ P̃ (Y − βX), using the deleted diagonal weights P̃. Han and Phillips (2006) show that, when the number of moment conditions is large, the minimizer of the theoretical GMM objective function is not the true parameter. For TSLS, the value of the theoretical objective function at the true parameter value is equal to E[e′Pe] = Σ_i P_ii E e_i² ≠ 0. As argued in Han and Phillips (2006), this leads to the bias of TSLS we discussed above. The JIVE-OLS estimator solves this problem, since P̃_ii = 0. Based on this idea, the JIVE-OLS formulation has inspired another class of estimators that is also called jackknife, though it may or may not be associated with a direct jackknifing procedure. To distinguish these alternative estimators, in this paper we use the term deleted diagonal. Assume we have an estimator that is defined as the optimizer of some objective function that is either a quadratic form or a ratio of two quadratic forms, say

next subsection asks a similar question about asymptotic gaussianity of the estimator and its standard errors. Then we ask what empirical criteria a practitioner may check to pre-test whether the signal is strong enough for gaussian inferences. The final subsection discusses identification-robust testing as a default approach under very low signal.

4.1 When does a consistent estimator exist?

The signal strength of the optimal instrument that is required to achieve a consistent estimator, as measured by the concentration parameter (5), depends crucially on how much is known and how much we are willing to assume about the form of the optimal instrument. Let us start with the simplest case in which the optimal instrument is fully known and available, that is, f_i is a part of our data set and its identity is known. In this case, the optimal IV estimator β̂_o is consistent as long as μ² → ∞ as the sample size increases. Under very mild assumptions, μ² governs the rate of convergence of the optimal estimator (Stock et al. (2002)).

Conversely, assume that nothing is known about the form of the optimal instrument and that we search among all linear combinations of the available K instruments (for this result we assume that K < n). That is, assume the optimal instrument is linear, f_i = π′Z_i, but that no information about the direction of π is available. Then, a necessary and sufficient condition for consistency of the IV estimator is μ²/√K → ∞. The strength of identification should not just be large, but large in comparison to the complexity of the first-stage estimation, measured by the square root of the number of instruments. This necessary condition comes from a result in Mikusheva and Sun (2020) which states that, in the best possible circumstances (such as a linear, gaussian, homoskedastic model with known reduced-form covariance of the error term), if μ²/√K is bounded asymptotically and the direction of π is completely unknown, then one cannot consistently distinguish any two values of β. There are a number of estimators that are consistent under heteroskedasticity and some relatively minor technical assumptions when μ²/√K → ∞. These include JIVE-OLS, DD-LIML and DD-Fuller (Hausman et al (2012)), with earlier results for the homoskedastic case obtained by Chao and Swanson (2005). Notice that all these estimators are agnostic about the direction of the optimal instrument.

Consistency of split-sample IV with ML first stage. The negative result of Mikusheva and Sun (2020) critically relies on the direction of the optimal instrument being completely unrestricted and unknown. If a researcher has some knowledge about the optimal instrument and may adjust her first-stage estimation accordingly, then the requirement on the strength of identification is less stringent and depends on the rate of consistency of the first-step estimator. The following statement characterises the consistency of the estimator under such conditions.

Theorem 1 Assume that the data is i.i.d., E[e_i|Z_i] = E[v_i|Z_i] = 0 and E[e_i²|Z_i] < C, 0 < c < E[v_i²|Z_i] < C almost surely. Assume that the prediction error for the estimation approach used on the first step is such that E[(f̂_i − f_i)²] = O(r_n/n) and E[f̂_i f_i] ≥ c E[f_i²]. For the sample-split estimator assume that the number of observations in I_2 is at least [αn] for α > 0. For the cross-fit assume that the number of observations is equal in both subsamples. If μ²/√r_n → ∞, then β̂_CFSS and β̂_SS are consistent for β.

Theorem 1 shows that if one has information about the optimal instrument and can use it to improve the optimal instrument estimation rate, characterized by r_n, then the requirements on the strength of the optimal instrument may be weakened. If nothing is known about the instruments and the search is done among all linear combinations of K available instruments (with the assumption that K < n), then the appropriate rate for the optimal instrument estimation is r_n = K, returning us to the earlier condition μ²/√K → ∞. If one is willing to impose assumptions on the first stage and use them, then better rates for optimal instrument estimation can be achieved. One such potential restriction is sparsity or approximate sparsity (Belloni et al, 2010), which allows the optimal instrument to be well approximated by a linear combination of a small number of the available instruments. In this case we may allow the number of available instruments, K, to be (much) larger than the sample size n. Let us assume that

$$f_i = Z_i'\pi_0 + R_i, \qquad \|\pi_0\|_0 \le s,$$

where the approximation error is small in the following sense

$$\frac{1}{n}\sum_i R_i^2 \le C\sqrt{\frac{s}{n}},$$

and the number of important terms is small, s = o(n/log(K)). The leading proposal to estimate the predictive sparse regression is via LASSO (Tibshirani, 1996). Under appropriate moment assumptions and assumptions on sub-matrices of Z′Z, Belloni et al (2010, Theorem

the split-sample estimator described in Theorem 1, the latter term has asymptotic order O_p(√r_n). The condition for the leading term in the numerator to dominate is μ²/r_n → ∞. There is a gap between the rate required for consistency and the stricter rate required for standard gaussian inference to be valid. Assume that the strength of identification is such that the estimator is consistent (μ²/√r_n → ∞), but μ²/r_n is asymptotically bounded. Then the asymptotic distribution of the estimator depends more finely on the first-stage procedure and calls for asymptotic theorems for the term Σ_{i=1}^n (f̂_i − f_i)e_i. The biggest challenge in determining the asymptotic distribution of the last term is the complicated dependence between summands. Specifically, the first-stage estimation error f̂_i − f_i will exhibit dependence over i if the first-stage estimation relies on common observations. For complicated ML procedures on the first stage, this dependence may be very intricate.

Some DD and JIVE estimators. This issue has been successfully solved for deleted diagonal style estimators and several JIVE estimators, including JIVE-OLS, JIVE-Ridge, DD-TSLS, DD-LIML and DD-Fuller (Chao et al (2012), Hansen et al (2008), Hausman et al (2012), Hansen and Kozbur (2014)). We discuss these results for the example of DD-TSLS. The DD-TSLS estimator equals the ratio of two quadratic forms X′P̃Y / X′P̃X, where P̃ equals the projection matrix P with a deleted diagonal. It implicitly uses the estimated instrument

$$\hat f_i = i'\tilde P X = i'\tilde P f + \sum_{j\ne i}\tilde P_{ij} v_j,$$

with i denoting a selection vector with the i-th component equal to 1 and all other elements equal to zero. For simplicity, let us ignore for a moment the distinction between i′P̃f and f_i – this will be reasonably small for a well chosen P̃. The prediction mistake introduced in Theorem 1 is

$$\frac{1}{n}\sum_{i=1}^{n} E(\hat f_i - f_i)^2 \asymp \frac{1}{n}\sum_{i=1}^{n} E\Big[\sum_{j\ne i}\tilde P_{ij} v_j\Big]^2 \asymp \frac{K}{n}.$$

Thus, in this example we have the rate r_n = K, and by reasoning similar to Theorem 1, DD-TSLS (as well as the other DD estimators mentioned above) will be consistent when μ²/√K → ∞. However, standard gaussian inferences require Σ_{i=1}^n f_i e_i to dominate the numerator in equation (8), and hence μ²/K → ∞. In the gap between these two rates, the rate needed for consistency and the rate needed for standard inferences, the asymptotic behavior of Σ_{i=1}^n (f̂_i − f_i)e_i becomes the dominating one. The first-stage estimation error of the optimal instrument, f̂_i − f_i = Σ_{j≠i} P̃_ij v_j, is a weighted average of all but its own endogenous errors. This means that the estimation errors will be heavily correlated over i – notice also that (f̂_i − f_i) is correlated with e_j (when j ≠ i), as v_j is part of the first-stage estimation error and v_j is correlated with e_j. As a result, getting a central limit theorem for Σ_{i=1}^n (f̂_i − f_i)e_i is highly non-trivial in general and calls for more structure to be put on f̂_i − f_i. Such structure exists in the above mentioned DD and JIVE estimators, and the leading term in Σ_{i=1}^n (f̂_i − f_i)e_i is given by Σ_{i=1}^n Σ_{j≠i} P̃_ij v_j e_i. Chao et al (2012) and Hansen et al (2008) establish a central limit theorem for quadratic forms of this type, which provides conditions for gaussianity of the leading term. Hausman et al. (2012) provides methods for estimating standard errors that work for several JIVE and DD-type estimators.

To summarize, once the identification is strong enough for a number of JIVE and DD estimators (including JIVE-OLS, JIVE-Ridge, DD-TSLS, DD-LIML) to become consistent (μ²/√K → ∞), these estimators are also asymptotically gaussian (under mild additional assumptions). However, the standard errors needed for asymptotically valid inferences differ and require a quadratic-form CLT to be used. It is worth pointing out that the standard errors proposed by Hausman et al. (2012) contain variance estimates for both terms appearing in the numerator of (8) and work well once the corresponding IV estimator is consistent.
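The implicit-instrument form of DD-TSLS discussed above translates directly into code. A minimal sketch (illustrative only: it forms the full n × n projection matrix, and the quadratic-form standard errors of Hausman et al (2012) are not included):

```python
import numpy as np

def dd_tsls(Y, X, Z):
    """Deleted-diagonal TSLS: zero out the diagonal of the projection matrix
    P = Z(Z'Z)^{-1}Z' so that observation i's own first-stage error never
    enters its implicit instrument."""
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    np.fill_diagonal(P, 0.0)               # P_tilde: P with the diagonal deleted
    return float(X @ P @ Y) / float(X @ P @ X)
```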

Split-sample estimators. The theorem below establishes conditions for asymptotic gaussianity of the split-sample estimator β̂_SS. It shows that once the split-sample estimator is consistent, inference can be performed in a standard way, treating the estimated instrument f̂_i as the only available instrument in a just-identified setting.

Theorem 2 Assume that the data is i.i.d., E[ε_i|Z_i] = 0, and E[|ε_i|⁴|Z_i] < C for ε_i = (e_i, v_i)′, and E[f_i⁴] < C. Assume that the size of subsample I_2 is growing to infinity as n → ∞. Let the following assumptions hold:

(i) $\frac{E[|\hat f_i|^4 \mid A_1]}{(E[\hat f_i^2 \mid A_1])^2} < C$ almost surely;

(ii) $\frac{1}{a_n}\sum_{i\in I_2} \hat f_i f_i \to_p 1$ for some A_1-measurable sequence of random variables a_n;