1. Introduction
This paper develops model selection and averaging methods for moment restriction models. We first propose a focused information criterion (FIC) based on the generalized empirical likelihood (GEL) estimator [1,2], which nests the empirical likelihood (EL) [3,4] and exponential tilting (ET) [5,6] estimators as special cases. Motivated by Claeskens and Hjort [7], we address the issue of selecting an optimal model for estimating a specific parameter of interest, rather than identifying a correct model or selecting a model with good global fit. Then, as an extension of the FIC, this study presents a GEL-based frequentist model averaging (FMA) estimator that is designed to minimize the mean squared error (MSE) of the estimator.
Traditional model selection methods, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), select a single model regardless of the specific goal of inference [8,9]. The AIC selects a model that is close to the true data generating process (DGP) in terms of Kullback-Leibler discrepancy, while the BIC selects the model with the highest posterior probability. However, a model with good global fit is not necessarily a good model for estimating a specific parameter. For instance, Hansen [10] considers the problem of selecting the order of an autoregressive model. His simulation study demonstrates that the AIC-selected model does not necessarily produce a good estimate of the impulse response. This result reveals that the best model generally differs across the intended uses of the model.
In their seminal work, Claeskens and Hjort [7] established an FIC that is designed to select the optimal model depending on its intended use. Their goal is to select the model that attains the minimum MSE of the maximum likelihood estimator of the parameter of interest, which they call the focus parameter. The FIC is constructed from an asymptotic estimate of the MSE.
Since then, FICs have been derived for several other models. Claeskens, Croux and Kerckhoven [11] proposed an FIC for logistic regression. Hjort and Claeskens [12] proposed an FIC for the Cox proportional hazards regression model. Zhang and Liang [13] developed an FIC for the generalized additive partial linear model. The models studied in those papers are likelihood-based. However, econometric models are often specified via moment restrictions rather than parametric density functions. This paper shows that the idea of Claeskens and Hjort [7] is applicable to moment restriction models. Our FIC is constructed from an asymptotic estimate of the MSE of the GEL estimator.
Model selection for moment restriction models is still underdeveloped. Andrews and Lu [14] proposed selection criteria based on the J-statistic of the generalized method of moments (GMM) estimator [15]. Hong, Preston and Shum [16] extended the results of Andrews and Lu to GEL estimation. Sueishi [17] developed information criteria similar to the AIC. The goal of Andrews and Lu [14] and Hong, Preston and Shum [16] is to identify the correct model, whereas Sueishi [17] selects the best approximating model in terms of Cressie-Read discrepancy. Although these criteria are useful in many applications, they do not address the issue of selecting the model that best serves its intended purpose.
Model averaging is an alternative to model selection. Inference after model selection is typically conducted as if the selected model were the true DGP, which ignores the uncertainty introduced by model selection. Rather than conditioning on a single selected model, the averaging technique uses all candidate models to incorporate model selection uncertainty. Although Bayesian methods are predominant in the literature [18], there is also a growing FMA literature for likelihood-based models [19,20,21]. See also Yang [22], Leung and Barron [23] and Goldenshluger [24] for related issues.
In the FMA literature, it is often of particular interest to obtain an averaging estimator that is optimal in terms of a certain loss [25,26,27,28]. This study investigates a GEL-based averaging method that minimizes the asymptotic MSE in a framework similar to that of Hjort and Claeskens [21]. A simulation study indicates that our averaging estimator outperforms existing post-model-selection estimators.
Although this study investigates GEL-based methods, the results readily apply to the two-step GMM estimator as well, because they rely only on first-order asymptotic theory. However, the two-step GMM estimator often suffers from a large bias that is not captured by first-order asymptotics, even if the model is correctly specified. Because the FIC addresses a trade-off between misspecification bias and estimation variance, the GEL estimator is more suitable for our framework.
We now review related work. DiTraglia [29] proposes an instrument selection criterion for GMM that is based on the concept of the FIC. Our approach resembles DiTraglia’s, but his interest is instrument selection, whereas ours is model selection: DiTraglia intentionally uses a large but invalid set of instruments to improve efficiency, whereas we intentionally use a small but misspecified model to improve efficiency. Liu [30] proposes an averaging estimator for the linear regression model using a local asymptotic framework. Whereas Liu considers exogenous regressors, we allow endogenous regressors. Martins and Gabriel [31] consider GMM-based model averaging estimators under a framework different from ours.
The remainder of the paper is organized as follows. Section 2 describes our local misspecification framework. Section 3 derives the FIC. Section 4 discusses the FMA estimator. Section 5 provides a simple example to which our methods are applicable. Section 6 presents the results of a Monte Carlo study. Section 7 concludes.
2. Local Misspecification Framework
We first introduce our setup. The basic construction follows Claeskens and Hjort [7]. There is a smallest and a largest model in our set of candidate models. The smallest, which we call the reduced model, has a p-dimensional unknown parameter vector. The largest, or the full model, has an additional q-dimensional unknown parameter vector. The full model is assumed to be correctly specified and nests the reduced model; i.e., the reduced model corresponds to the special case of the full model in which the additional parameter vector is fixed at some known null value, typically a vector of zeros. An example is given in Section 5.
There are up to 2^q submodels, all of which share the common p-dimensional parameter vector. A submodel treats some elements of the additional parameter vector as unknown parameters and is indexed by a subset, S, of {1, …, q}; the model S contains the common parameters together with the additional parameters indexed by S. Thus, the reduced and full models correspond to the empty set and the full index set, respectively. We use “red” and “full” to denote the reduced and full models.
The focus parameter, which is the parameter of interest, is a function of the model parameters; it could be merely a single element of the parameter vector. Prior knowledge or economic theory suggests which quantity should be estimated, but we are unsure which elements of the additional parameter vector should be treated as unknown parameters. Estimating a larger model usually implies a smaller modeling bias and a larger estimation variance. However, if the reduced model is globally misspecified, in the sense that the violation of the moment restriction does not disappear even in the limit, then the misspecification bias asymptotically dominates the variance of the GEL estimator. Thus, we cannot make a reasonable comparison of bias and variance in the asymptotic framework.
A local misspecification framework is introduced to take the bias-variance trade-off into account. Let the observations be i.i.d. random vectors drawn from an unknown density that depends on the sample size, n. The functional form of the density is not specified. The full model is defined via the following moment restriction:
where the moment function is a known vector-valued function up to the parameters. For each n, the true parameter values are indexed by n. The null value of the additional parameters is known, but the true values of the common parameters and of the local departure are unknown. We assume that the dimension of the moment function exceeds the number of unknown parameters; i.e., the model is over-identified.
The moment function of the reduced model fixes the additional parameters at their null values. The reduced model is misspecified in the sense that there is no parameter value that satisfies its moment restriction exactly for any fixed n. However, if the moment function is differentiable with respect to the additional parameters, then (1) implies, by a mean value expansion, that the violation of the reduced model's moment restriction is of order n^{-1/2}. Thus, even though the moment restriction is invalid at the null values, the violation disappears in the limit. A similar relationship also holds for the other submodels. As the next section reveals, under this framework, the squared bias and the variance of the GEL estimator are both of order n^{-1}. Hence, the trade-off between bias and variance can be considered. If the local departure is sufficiently small, it might be better to fix the additional parameters at their null values rather than to estimate them.
In general, the dimension of the moment function can differ among submodels. For instance, consider a linear instrumental variable model. The model (structural form) can be estimated as long as the number of instruments exceeds or equals the number of unknown parameters. Thus, it is possible to use only a subset of instruments to estimate a submodel. For ease of exposition, however, we consider only the case where the dimension of the moment function is fixed for all submodels.
3. Focused Information Criterion
To construct an FIC, we first derive the asymptotic distribution of the GEL estimator under the local misspecification framework. Newey [32] and Hall [33] obtained similar results in the case of GMM estimation to analyze the local power properties of specification tests.
A model S contains the common parameters together with the additional parameters indexed by S; the additional parameters indexed by the complement of S are fixed at their null values.
Let ρ be a concave function on its domain, an open interval containing zero, normalized so that ρ'(0) = ρ''(0) = -1. The GEL estimator for a model S solves a saddle-point problem: it minimizes over the parameter space the maximum, over the feasible values of an auxiliary vector, of the sample average of ρ evaluated at the inner product of the auxiliary vector and the moment function. The EL and ET estimators are special cases with ρ(v) = log(1 - v) and ρ(v) = -exp(v), respectively. Although the estimator of the common parameters has p elements for any S, we adopt the subscript S to emphasize that the value of the estimator depends on S.
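To make the saddle-point definition concrete, the following sketch (not from the paper; the moment conditions, data, and optimizer choices are illustrative assumptions) computes an ET-type GEL estimate of a scalar mean under two over-identifying moment conditions, nesting the inner maximization over the auxiliary vector inside an outer minimization over the parameter:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)

def moments(theta):
    # Two over-identifying moment conditions for a scalar mean theta:
    # E[y - theta] = 0 and E[(y - theta)^2 - 1] = 0 (variance known).
    return np.column_stack([y - theta, (y - theta) ** 2 - 1.0])

def inner_max(theta):
    # Inner GEL problem with the ET carrier rho(v) = -exp(v): maximize
    # the sample mean of rho(lam' g_i) over the auxiliary vector lam,
    # which is equivalent to minimizing the sample mean of exp(g @ lam).
    g = moments(theta)
    res = minimize(lambda lam: np.mean(np.exp(g @ lam)),
                   x0=np.zeros(g.shape[1]), method="BFGS")
    return -res.fun  # value of the inner maximum

# Outer problem: minimize the inner maximum over theta (saddle point).
theta_hat = minimize_scalar(inner_max, bounds=(0.0, 4.0), method="bounded").x
```

With a correctly specified model, the inner maximum is close to ρ(0) at the true parameter value, and the outer minimization recovers an estimate close to the true mean of 2.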
Let
,
, and
. Furthermore, let
. We define:
where
E denotes the expectation with respect to
. It is assumed that
satisfies:
For the full model, we denote:
Then, we can write
and
, where
is the projection matrix of size,
, that maps
to the subvector,
:
.
Let and . Furthermore, let . To obtain the asymptotic distribution of the GEL estimator, we impose the following conditions:
Assumption 3.1
- 1.
, , and are compact.
- 2.
is continuous in and for almost every y.
- 3.
under the sequence of .
- 4.
as for all , , and .
- 5.
is nonsingular for all and .
- 6.
is the unique solution to and .
- 7.
is twice continuously differentiable in a neighborhood of zero.
- 8.
and are of full rank.
- 9.
for some .
- 10.
is continuously differentiable in and in a neighborhood, , of .
- 11.
and under the sequence of .
- 12.
and as .
- 13.
as .
These conditions are rather high-level and strong. Some of them can be replaced with weaker, more primitive conditions [34].
We obtain the following lemma.
Lemma 3.1 Suppose Assumption 3.1 holds. Then, under the sequence of ,
we have:
The proof is given in the Appendix.
If the model, S, is correctly specified, then the limiting distribution of the GEL estimator is . Therefore, as usual, local misspecification affects only the mean of the limiting distribution.
Next, we derive the asymptotic distribution of the GEL estimator of the focus parameter. Additional notation is introduced. Let
and
;
i.e.,
Q and
are the lower right block matrices of
and
, respectively. Let
. We assume that
is differentiable with respect to
and
. Let:
where the partial derivatives are evaluated at
. The true focus parameter is denoted as
. Moreover, the GEL estimator of
for the model,
S, is denoted as
. Lemma 3.1 and the delta method imply the following theorem:
Theorem 3.1 Suppose Assumption 3.1 holds. Then, under the sequence of ,
we have:and:where is independent of D.
The proof is almost the same as that of Lemma 3.3 in Hjort and Claeskens [21], so it is omitted.
Because
and
, as the special cases of the theorem, we have:
Therefore, in terms of the asymptotic MSE, the reduced model is better than the full model if
, which is the case when the deviation of the reduced model from the true DGP is small.
More generally, Theorem 3.1 implies that the MSE of the limiting distribution of
is:
The idea behind the FIC is to estimate (2) for each model and select the model that attains the minimum estimated MSE.
All components in (2) except the bias term can be estimated easily by using their sample analogs. However, a consistent estimator of the local departure is unavailable, because its natural estimator converges in distribution to a normal random variable. This difficulty is inevitable, as long as we utilize the local misspecification framework. Following Claeskens and Hjort [7], we therefore estimate the squared-bias term by the natural plug-in corrected for its asymptotic mean. The resulting sample counterpart of (2) is an asymptotically unbiased estimator for (2). Because the last two terms do not depend on the model, we can ignore them for the purpose of model selection. Our FIC for a model S then consists of a squared-bias estimate and a variance term: the bigger the model, the smaller the first term and the larger the second term in (3). Since the weight vector w depends on the focus parameter, the FIC can be used to select an appropriate submodel, depending on the parameter of interest.
Although we consider only a scalar focus parameter, our FIC is also applicable to a vector-valued focus parameter by viewing each element of the vector as a separate scalar-valued focus parameter. Different models might be used to estimate different elements of the vector.
We conclude this section with a remark on the estimation of the squared-bias term. Because it is estimated by a mean-corrected plug-in, the estimate can be negative definite in finite samples, which means that the squared-bias term can be negative. To avoid such cases, as suggested by Claeskens and Hjort [35], we can also use a bias-corrected FIC that truncates the squared-bias estimate at zero on the event of negligible bias. See Section 6.4 of Claeskens and Hjort [35] for details.
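As a numerical illustration of the selection rule and of the truncation just described, consider the following sketch; the per-model squared-bias and variance estimates are made-up numbers, not outputs of the paper's formulas:

```python
# Hypothetical per-model summaries: the plug-in estimate of the squared
# bias (which can be negative in finite samples) and the variance term.
models = {
    "red":  {"sqbias_hat": 0.09,  "var_hat": 0.10},
    "S1":   {"sqbias_hat": -0.01, "var_hat": 0.16},
    "full": {"sqbias_hat": 0.00,  "var_hat": 0.25},
}

def fic(m):
    # Bias-corrected FIC: truncate the squared-bias estimate at zero
    # before adding the variance term.
    return max(m["sqbias_hat"], 0.0) + m["var_hat"]

# Select the model with the smallest estimated MSE of the focus estimator.
selected = min(models, key=lambda s: fic(models[s]))
```

Here the intermediate model "S1" wins: its negative squared-bias estimate is truncated to zero, leaving only its variance term, which is smaller than the full model's.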
4. Model Averaging
This section extends the result of Section 3 to the averaging problem. In the FMA literature, it is often of particular interest to obtain an optimal averaging estimator in terms of a certain loss. We consider the possibility of obtaining the best averaging weights, namely those that minimize the MSE in the local misspecification framework. A similar analysis is presented in Liu [30] for linear regression.
Consider the set of all candidate models. We consider an averaging estimator for the focus parameter that is a weighted combination of the submodel estimators, with weights adding up to unity. Note that a post-selection estimator of the focus parameter can also be written in this form: the post-selection estimator using the FIC puts weight one on the FIC-selected model, via the indicator function, and weight zero on all other models. Thus, the post-selection estimator is a special case of the averaging estimator.
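The point that post-selection estimation is averaging with degenerate indicator weights can be sketched as follows (the per-model estimates and criterion values are hypothetical numbers for illustration):

```python
# Hypothetical focus-parameter estimates and FIC values per model.
estimates = {"red": 1.10, "S1": 1.05, "full": 0.98}
fic_values = {"red": 0.19, "S1": 0.16, "full": 0.25}

# Indicator weights: one for the FIC-selected model, zero otherwise.
selected = min(fic_values, key=fic_values.get)
weights = {s: float(s == selected) for s in estimates}

# The post-selection estimate is the corresponding degenerate average.
psi_post = sum(weights[s] * estimates[s] for s in estimates)
```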
If the weights are not random, then the limiting distribution of the averaging estimator follows directly from Theorem 3.1, and its asymptotic mean and variance follow from the joint limiting distribution of the submodel estimators. Thus, there is a set of weights that minimizes the asymptotic MSE of the averaging estimator.
Suppose there are M candidate models. Let the vector of averaging weights lie in the unit simplex in R^M. Ignoring the terms that do not depend on the model, the optimal weight vector that minimizes the asymptotic MSE minimizes a quadratic form in the weights over the simplex, defined by an M x M matrix A whose (i, j) element is determined by the limiting biases and covariances of the submodel estimators. If we replace A with an appropriate estimate, we obtain a feasible weight estimator; for instance, the bias components can be estimated by the same mean-corrected plug-in used for the FIC. Although there is no closed-form solution for (4), the problem can be solved numerically by a standard quadratic programming algorithm.
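The quadratic program for the feasible weights can be sketched as follows; the estimated matrix below is a made-up symmetric matrix for three candidate models, not one computed from data:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical estimate of the M x M matrix A for M = 3 candidate models.
A_hat = np.array([[0.30, 0.10, 0.05],
                  [0.10, 0.20, 0.08],
                  [0.05, 0.08, 0.25]])
M = A_hat.shape[0]

# Minimize w' A_hat w over the unit simplex {w : w_i >= 0, sum_i w_i = 1}.
res = minimize(
    lambda w: w @ A_hat @ w,
    x0=np.full(M, 1.0 / M),                      # start from equal weights
    jac=lambda w: 2.0 * A_hat @ w,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * M,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
w_hat = res.x
```

Any quadratic programming routine works here; SLSQP is used only because it handles the simplex constraint directly.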
Unfortunately, the estimated weight vector cannot be consistent for the optimal weights, because there is no consistent estimator for A: the estimate of A converges in distribution to a random matrix, so the estimated weights are random, even in the limit.
Let the i-th elements of the estimated and optimal weight vectors be denoted correspondingly, and let the feasible averaging estimator be the one that uses the estimated weights. Because the estimated weights and the submodel estimators are both determined through the same limiting random quantity, they converge jointly. Therefore, the limiting distribution of the feasible averaging estimator is given by (5). Because the weights are random, the limiting distribution is no longer normal. Thus, (5) is not readily applicable for inference. However, as suggested by Hjort and Claeskens [21], (5) implies that a studentized version of the estimator, based on a consistent estimator of the relevant variance, can be used to construct a confidence interval for the focus parameter.
5. Example
This section gives a simple example to which our methods are applicable. One of the most popular models described by moment restrictions is the linear instrumental variable model. The full model we consider is a linear equation with two groups of explanatory variables, some of which are potentially correlated with the error term. The vector of instruments may contain elements of both groups. Economic theory suggests that the first group should be included in the model, but we are unsure which components of the second group should be included. Thus, the reduced model corresponds to the case in which the second group is excluded.
In this model, the quantities needed for the FIC can be estimated from the residual of the full model; each component is estimated by its sample analog. It is also possible to replace the empirical probability with the GEL-induced probability.
If the focus parameter is the k-th element of the coefficient vector, the relevant partial derivative is simply the k-th unit vector, which has one in the k-th element and zeros elsewhere. On the other hand, if the focus parameter is the conditional mean of the outcome at a fixed covariate value, the derivative depends on that value. To obtain a good estimate over a range of covariate values, rather than at a single covariate value, we can utilize the idea of Claeskens and Hjort [36], who address minimizing an averaged risk over the range of covariates, rather than the pointwise risk.
6. Monte Carlo Study
We now investigate the performance of the post-selection and averaging estimators in a simple Monte Carlo study. Our EL-based methods are compared with the EL-based selection methods of Hong, Preston and Shum [16]. The following post-selection and averaging estimators are considered: (i) AIC-like model selection, (ii) BIC-like model selection, (iii) FIC model selection and (iv) an averaging estimator whose weights are given by (4). The AIC- and BIC-like criteria are proposed by Hong, Preston and Shum [16] and are given by: We use (6) to estimate J.
We consider the linear instrumental variable model. The DGP is a linear structural equation in which the coefficients on the exogenous regressors are local to zero, proportional to n^{-1/2} for some vector of constants. The exogenous variables are normally distributed with mean zero and variance one and are mutually correlated. The vector of instruments is fixed across candidate models. The error term is independent of the exogenous variables and instruments and is generated from a standard normal distribution. Thus, the moment restriction for the full model is the orthogonality between the instruments and the structural error.
Table 1.
Estimation results; DGP, data generating process; AIC, Akaike information criterion; BIC, Bayesian information criterion; FIC, focused information criterion.
| | DGP |
| | (1) | (2) | (3) | (4) |
Full | Bias | -0.104 | -0.109 | -0.089 | -0.076 |
| Std | 0.544 | 0.533 | 0.509 | 0.489 |
| RMSE | 0.554 | 0.544 | 0.516 | 0.495 |
Reduced | Bias | -0.279 | -0.057 | -0.148 | -0.048 |
| Std | 0.780 | 0.473 | 0.955 | 0.448 |
| RMSE | 0.828 | 0.477 | 0.965 | 0.450 |
AIC | Bias | -0.113 | -0.099 | -0.101 | -0.079 |
| Std | 0.559 | 0.557 | 0.497 | 0.509 |
| RMSE | 0.570 | 0.566 | 0.507 | 0.515 |
BIC | Bias | -0.136 | -0.088 | -0.104 | -0.073 |
| Std | 0.689 | 0.552 | 0.499 | 0.502 |
| RMSE | 0.702 | 0.559 | 0.510 | 0.507 |
FIC | Bias | -0.139 | -0.095 | -0.112 | -0.076 |
| Std | 0.530 | 0.509 | 0.464 | 0.452 |
| RMSE | 0.548 | 0.517 | 0.477 | 0.458 |
Averaging | Bias | -0.139 | -0.092 | -0.107 | -0.074 |
| Std | 0.511 | 0.476 | 0.455 | 0.444 |
| RMSE | 0.529 | 0.484 | 0.468 | 0.450 |
The focus parameter is the coefficient of the endogenous regressor. In many applications, this is the only parameter of interest in the linear model; exogenous regressors are included simply to avoid omitted variable bias. Thus, if the bias is small, it may be better to exclude some regressors to reduce the variance. In this simulation, we include the constant term and the endogenous regressor in all candidate models, but some of the exogenous regressors may be excluded; that is, some of their coefficients are set to zero. Each subset of the remaining exogenous regressors defines a submodel.
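Since the display of the DGP is not reproduced here, the following sketch generates data from a stylized version of this design; all dimensions, coefficients, and the local parameter vector c are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200                                  # illustrative sample size

theta = 1.0                              # coefficient of the endogenous regressor
c = np.array([1.0, 0.5])                 # illustrative local parameter vector
gamma_n = c / np.sqrt(n)                 # coefficients local to zero at rate n^(-1/2)

z = rng.normal(size=(n, 3))              # instruments
w = rng.normal(size=(n, 2))              # exogenous regressors (may be excluded)
eps = rng.normal(size=n)                 # structural error, independent of z and w
x = z.sum(axis=1) + 0.5 * eps + rng.normal(size=n)   # endogenous regressor

y = theta * x + w @ gamma_n + eps

# Sample moments of the full model at the true parameters: the instruments
# and exogenous regressors are orthogonal to the structural error.
g_bar = (np.column_stack([z, w]) * (y - theta * x - w @ gamma_n)[:, None]).mean(axis=0)
```

At the true parameter values, the residual equals the structural error, so the sample moments are close to zero, as the moment restriction requires.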
To evaluate the performance of the post-selection and averaging estimators, we calculate the bias, standard deviation and root MSE (RMSE) of each estimator over 1,000 repetitions. For reference, we also report the results of the full and reduced models. The sample size is held fixed across designs. We consider four DGPs: (1), (2), (3) and (4). The DGPs (1) and (3) are favorable for the full model, while (2) and (4) are favorable for the reduced model. The results are summarized in Table 1.
Table 1 indicates that there are cases where we should avoid using the full model, even though it is the correct model. The performance of the full model is poorer than that of the FIC-selected model for all DGPs. As the theory suggests, the efficiency gain of the FIC over the full model is large when the local departure is small. The averaging estimator outperforms all post-selection estimators, including the FIC-based one. Consistent with findings in the literature, averaging is a useful method for reducing the risk of the estimator.