1. Introduction
Composite likelihood inference is an important approach for dealing with real situations involving large data sets or very complex models, in which classical likelihood methods are computationally intractable or even impossible to apply. Composite likelihood methods have been successfully used in many applications concerning, for example, genetics [1], generalized linear mixed models [2], spatial statistics [3,4,5], frailty models [6], multivariate survival analysis [7,8], etc.
Let us introduce the problem, adopting here the notation of [9]. Let $\{F_\theta,\ \theta \in \Theta \subseteq \mathbb{R}^p,\ p \geq 1\}$ be a parametric identifiable family of distributions for an observation $\boldsymbol{y}$, a realization of a random $m$-vector $\boldsymbol{Y}$. In this setting, the composite likelihood function based on $K$ different marginal or conditional distributions has the form

$$\mathcal{CL}(\theta, \boldsymbol{y}) = \prod_{k=1}^{K} f_{A_k}(y_j,\ j \in A_k;\ \theta)^{w_k} \tag{1}$$

and the corresponding composite log-density is

$$c\ell(\theta, \boldsymbol{y}) = \sum_{k=1}^{K} w_k \log f_{A_k}(y_j,\ j \in A_k;\ \theta),$$

with $\theta \in \Theta$, where $\{A_k\}_{k=1}^{K}$ is a family of sets of indices associated either with marginal or conditional distributions involving some $y_j$, $j \in \{1, \ldots, m\}$, and $w_k$, $k = 1, \ldots, K$, are non-negative and known weights. If the weights are all equal, they can be ignored; in that case, all the statistical procedures give equivalent results. The composite maximum likelihood estimator (CMLE), $\widehat{\theta}_c$, is obtained by maximizing expression (1) with respect to $\theta$.
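To make the construction concrete, the following minimal Python sketch evaluates a composite log-density of the form (1) with unit weights and two bivariate marginal blocks. The block choice $A_1 = \{1,2\}$, $A_2 = \{3,4\}$, the equicorrelated covariance and all numeric values are illustrative assumptions, not settings taken from this paper.

```python
# A minimal sketch of a marginal composite log-density (unit weights),
# using two bivariate normal marginal blocks; all values are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def composite_loglik(y, mu, Sigma, blocks=((0, 1), (2, 3))):
    """Sum of log marginal densities over the index sets A_k (weights w_k = 1)."""
    cl = 0.0
    for A in blocks:
        idx = np.asarray(A)
        # marginals of a multivariate normal: subvector mean, submatrix covariance
        marginal = multivariate_normal(mean=mu[idx], cov=Sigma[np.ix_(idx, idx)])
        cl += marginal.logpdf(y[idx])
    return cl

mu = np.zeros(4)
Sigma = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))   # equicorrelated, rho = 0.5 (illustrative)
y = np.random.default_rng(0).multivariate_normal(mu, Sigma)
print(composite_loglik(y, mu, Sigma))
```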
The CMLE is consistent and asymptotically normal and, based on it, we can establish hypothesis testing procedures in a way similar to the classical likelihood ratio test, Wald test or Rao's score test. A development of the asymptotic theory of the CMLE, including its application to obtain composite likelihood ratio statistics, Wald-type tests and Rao score tests in the context of composite likelihood, can be seen in [10]. However, it is shown in [11,12,13] that the CMLE and the derived testing procedures present an important lack of robustness. In this sense, [11,12,13] derived new distance-based estimators and tests with good robustness behaviour and without an important loss of efficiency. In this paper, we consider the composite minimum density power divergence estimator (CMDPDE), introduced in [12], in order to present a model selection criterion in a composite likelihood framework.
Model selection criteria, which summarize the data evidence in favor of a model, are a very well-studied subject in the statistical literature, especially in the context of the full likelihood. The construction of such criteria requires a measure of similarity between two models, which are typically described in terms of their distributions. This can be achieved if an unbiased estimator of the expected overall discrepancy is found, which measures the statistical distance between the true, but unknown, model and the entertained model. The model with the smallest value of the criterion is then the most preferable one. The use of divergence measures, in particular the Kullback–Leibler divergence [14], to measure this discrepancy is the main idea behind some of the best-known criteria: the Akaike Information Criterion (AIC) [15,16], the criterion proposed by Takeuchi (TIC) [17] and other modifications of AIC [18]. The DIC criterion, based on the density power divergence (DPD), was presented in [19] and, recently, [20] presented a local BHHJ power divergence information criterion following [21]. In the context of composite likelihood there are some criteria based on the Kullback–Leibler divergence; see for instance [22,23,24] and references therein. To the best of our knowledge, only the Kullback–Leibler divergence has been used to develop model selection criteria in a composite likelihood framework. To fill this gap, our interest is now focused on the DPD.
In this paper, we present a new information criterion for model selection in the framework of composite likelihood, based on the DPD measure. This divergence measure, introduced and studied in the case of the full likelihood by [25], has been considered previously in [12,13] in the context of composite likelihood. In those papers, a new estimator, the CMDPDE, was introduced and its robustness in relation to the CMLE, as well as the robustness of some families of test statistics, was studied, but the problem of model selection was not considered. That problem is addressed in this paper. The criterion introduced here will be called the composite likelihood DIC criterion (CLDIC). The motivation for considering a criterion based on the DPD instead of the Kullback–Leibler divergence is the robustness of the procedures based on the DPD in statistical inference, not only in the context of the full likelihood [25,26], but also in the context of composite likelihood [12,13]. In Section 2, the CMDPDE is presented and some properties of this estimator are discussed. The new model selection criterion, CLDIC, based on the CMDPDE, is introduced in Section 3 and some of its asymptotic properties are studied. A simulation study is carried out in Section 4 and some numerical examples are presented in Section 5. Finally, some concluding remarks are presented in Section 6.
2. Composite Minimum Density Power Divergence Estimator
Given two probability density functions $g$ and $f$, associated with two $m$-dimensional random variables, respectively, the DPD [25] measures the statistical distance between $g$ and $f$ by

$$d_\alpha(g, f) = \int_{\mathbb{R}^m} \left\{ f^{1+\alpha}(\boldsymbol{y}) - \left(1 + \frac{1}{\alpha}\right) f^{\alpha}(\boldsymbol{y})\, g(\boldsymbol{y}) + \frac{1}{\alpha}\, g^{1+\alpha}(\boldsymbol{y}) \right\} d\boldsymbol{y} \tag{2}$$

for $\alpha > 0$, while for $\alpha = 0$ it is defined by

$$d_0(g, f) = \lim_{\alpha \downarrow 0} d_\alpha(g, f) = d_{KL}(g, f),$$

where $d_{KL}(g, f) = \int_{\mathbb{R}^m} g(\boldsymbol{y}) \log \left( g(\boldsymbol{y}) / f(\boldsymbol{y}) \right) d\boldsymbol{y}$ is the Kullback–Leibler divergence (see, for example, [26]). For $\alpha = 1$, expression (2) leads to the $L_2$ distance

$$d_1(g, f) = \int_{\mathbb{R}^m} \left( f(\boldsymbol{y}) - g(\boldsymbol{y}) \right)^2 d\boldsymbol{y}.$$

It is also interesting to note that (2) is a special case of the so-called Bregman divergence

$$B_T(g, f) = \int_{\mathbb{R}^m} \left\{ T(g(\boldsymbol{y})) - T(f(\boldsymbol{y})) - \left( g(\boldsymbol{y}) - f(\boldsymbol{y}) \right) T'(f(\boldsymbol{y})) \right\} d\boldsymbol{y}. \tag{3}$$

If we consider $T(l) = l^{1+\alpha}/\alpha$ in (3), we get $B_T(g, f) = d_\alpha(g, f)$. The parameter $\alpha$ controls the trade-off between robustness and asymptotic efficiency of the parameter estimates, which are the minimizers of this family of divergences. For more details about this family of divergence measures we refer to [27].
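As a numerical illustration of definition (2), the following sketch (univariate for simplicity; the two normal densities and the integration range are illustrative choices, not taken from the paper) computes $d_\alpha(g, f)$ by quadrature and shows that it approaches the Kullback–Leibler divergence as $\alpha \downarrow 0$.

```python
# Numerical illustration of the DPD (2) between two univariate densities.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def dpd(g, f, alpha):
    """d_alpha(g, f) for alpha > 0, by numerical quadrature."""
    integrand = lambda y: (f(y) ** (1 + alpha)
                           - (1 + 1 / alpha) * f(y) ** alpha * g(y)
                           + (1 / alpha) * g(y) ** (1 + alpha))
    # (-20, 20) is effectively the whole real line for these densities
    return quad(integrand, -20, 20)[0]

g = norm(loc=0.0, scale=1.0).pdf          # "true" density (illustrative)
f = norm(loc=0.5, scale=1.2).pdf          # candidate density (illustrative)
kl = quad(lambda y: g(y) * np.log(g(y) / f(y)), -20, 20)[0]

for alpha in (1.0, 0.5, 0.1, 0.01):
    print(alpha, dpd(g, f, alpha))        # tends to the KL value as alpha -> 0
print("KL:", kl)
```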
Let now $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ be independent and identically distributed replications of $\boldsymbol{Y}$, which is characterized by the true but unknown density $g$. Taking into account that the true model $g$ is unknown, suppose that $\{F_\theta,\ \theta \in \Theta \subseteq \mathbb{R}^p\}$ is a parametric identifiable family of candidate distributions to describe the observations $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$. Then, the DPD between the true model $g$ and the composite likelihood function $\mathcal{CL}(\theta, \cdot)$ associated with the parametric model is defined as

$$d_\alpha(g, \mathcal{CL}(\theta, \cdot)) = \int_{\mathbb{R}^m} \left\{ \mathcal{CL}(\theta, \boldsymbol{y})^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) \mathcal{CL}(\theta, \boldsymbol{y})^{\alpha}\, g(\boldsymbol{y}) + \frac{1}{\alpha}\, g^{1+\alpha}(\boldsymbol{y}) \right\} d\boldsymbol{y} \tag{4}$$

for $\alpha > 0$, while for $\alpha = 0$ we have $d_0(g, \mathcal{CL}(\theta, \cdot)) = d_{KL}(g, \mathcal{CL}(\theta, \cdot))$, which is defined by

$$d_{KL}(g, \mathcal{CL}(\theta, \cdot)) = \int_{\mathbb{R}^m} g(\boldsymbol{y}) \log \frac{g(\boldsymbol{y})}{\mathcal{CL}(\theta, \boldsymbol{y})}\, d\boldsymbol{y}.$$

In Section 3, we are going to introduce and study the CLDIC criterion based on (4).
Let $\mathcal{M}$ be a family of candidate models to govern the observations $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$. We shall assume that the true model is included in $\mathcal{M}$. For a specific candidate in $\mathcal{M}$, the parametric model is described by the composite likelihood function

$$\mathcal{CL}(\theta, \boldsymbol{y}), \quad \theta \in \Theta \subseteq \mathbb{R}^p. \tag{6}$$

In this setting, it is quite clear that the most suitable candidate model to describe the observations is the model that minimizes the DPD in (4). However, the unknown parameter $\theta$ is included in it, so it is not possible to use this measure directly for the choice of the most suitable model. A way to overcome this problem is to plug into (4), in place of the unknown parameter $\theta$, an estimator which obeys some nice properties, like consistency and asymptotic normality. Based on this point, the CMDPDE, introduced in [12], can be used. This estimator is described in the sequel for the sake of completeness.
If we denote the kernel of (4) as

$$W_\theta = \int_{\mathbb{R}^m} \mathcal{CL}(\theta, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y} - \left(1 + \frac{1}{\alpha}\right) \int_{\mathbb{R}^m} \mathcal{CL}(\theta, \boldsymbol{y})^{\alpha}\, g(\boldsymbol{y})\, d\boldsymbol{y}, \tag{7}$$

we can write

$$d_\alpha(g, \mathcal{CL}(\theta, \cdot)) = W_\theta + \frac{1}{\alpha} \int_{\mathbb{R}^m} g^{1+\alpha}(\boldsymbol{y})\, d\boldsymbol{y},$$

and the term $\frac{1}{\alpha} \int_{\mathbb{R}^m} g^{1+\alpha}(\boldsymbol{y})\, d\boldsymbol{y}$ does not depend on $\theta$ and can be ignored in the minimization in (9). A natural estimator of $W_\theta$, given in (7), can be obtained by observing that the last integral in (7) can be expressed in the form $E_G\!\left[\mathcal{CL}(\theta, \boldsymbol{Y})^\alpha\right]$, for $G$ the distribution function corresponding to $g$. Hence, if the empirical distribution function of $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ is exploited, this last integral is approximated by $\frac{1}{n} \sum_{i=1}^{n} \mathcal{CL}(\theta, \boldsymbol{Y}_i)^\alpha$, i.e.,

$$W_n(\theta) = \int_{\mathbb{R}^m} \mathcal{CL}(\theta, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y} - \left(1 + \frac{1}{\alpha}\right) \frac{1}{n} \sum_{i=1}^{n} \mathcal{CL}(\theta, \boldsymbol{Y}_i)^\alpha. \tag{8}$$

Definition 1. The CMDPDE of $\theta$, $\widehat{\theta}_\alpha$, is defined, for $\alpha > 0$, by

$$\widehat{\theta}_\alpha = \arg\min_{\theta \in \Theta} W_n(\theta). \tag{9}$$

We shall denote the score of the composite likelihood by

$$u(\theta, \boldsymbol{y}) = \frac{\partial c\ell(\theta, \boldsymbol{y})}{\partial \theta}. \tag{10}$$
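For intuition, here is a sketch of how the minimization in Definition 1 could be carried out numerically in a toy model. The "composite" likelihood is taken to be a single $N(\theta, 1)$ density block, and the data, contamination and search bounds are all illustrative assumptions, not the paper's settings; the point is only that the objective (8) is directly minimizable.

```python
# Sketch: CMDPDE in a toy model, minimizing the empirical objective W_n in (8).
# The "composite" likelihood here is a single N(theta, 1) block (illustrative).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def W_n(theta, y, alpha):
    f = lambda t: norm(loc=theta, scale=1.0).pdf(t)
    int_f = quad(lambda t: f(t) ** (1 + alpha), -np.inf, np.inf)[0]
    return int_f - (1 + 1 / alpha) * np.mean(f(y) ** alpha)

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)
y[:10] = 10.0                                  # a few gross outliers
for alpha in (0.1, 0.5, 1.0):
    fit = minimize_scalar(W_n, bounds=(-5, 15), args=(y, alpha), method="bounded")
    print(alpha, fit.x)                        # CMDPDE; stays near 0 despite the outliers
```

The example also hints at the robustness discussed above: the estimates remain close to the uncontaminated center even with 5% gross outliers in the sample.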
Let $\theta_0$ be the true value of the parameter $\theta$. In [12], it was shown that the asymptotic distribution of $\widehat{\theta}_\alpha$ is given by

$$\sqrt{n}\left(\widehat{\theta}_\alpha - \theta_0\right) \xrightarrow[n \to \infty]{\mathcal{L}} N\!\left(\boldsymbol{0},\ H_\alpha^{-1}(\theta_0)\, J_\alpha(\theta_0)\, H_\alpha^{-1}(\theta_0)\right),$$

being

$$H_\alpha(\theta) = \int_{\mathbb{R}^m} u(\theta, \boldsymbol{y})\, u^T(\theta, \boldsymbol{y})\, \mathcal{CL}(\theta, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y} \tag{11}$$

and

$$J_\alpha(\theta) = \int_{\mathbb{R}^m} u(\theta, \boldsymbol{y})\, u^T(\theta, \boldsymbol{y})\, \mathcal{CL}(\theta, \boldsymbol{y})^{2\alpha+1}\, d\boldsymbol{y} - \xi_\alpha(\theta)\, \xi_\alpha^T(\theta), \qquad \xi_\alpha(\theta) = \int_{\mathbb{R}^m} u(\theta, \boldsymbol{y})\, \mathcal{CL}(\theta, \boldsymbol{y})^{1+\alpha}\, d\boldsymbol{y}. \tag{12}$$
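The matrices (11) and (12) involve integrals against powers of the composite likelihood. When $\mathcal{CL}(\theta, \cdot)$ integrates to one and can be sampled from, as in the examples of Section 4, they admit a simple Monte Carlo approximation. The sketch below uses the same toy $N(\theta, 1)$ model as above, whose score is $u(\theta, y) = y - \theta$; everything here is an illustrative assumption, not a prescription from the paper.

```python
# Sketch: Monte Carlo approximation of H_alpha (11) and J_alpha (12) when the
# composite likelihood is a proper density (toy N(theta, 1) model, score u = y - theta).
import numpy as np
from scipy.stats import norm

def H_J_alpha(theta, alpha, B=200_000, seed=2):
    rng = np.random.default_rng(seed)
    y = rng.normal(theta, 1.0, size=B)          # draws from CL(theta, .)
    f = norm(loc=theta, scale=1.0).pdf(y)
    u = y - theta                               # score of the toy model
    H = np.mean(u**2 * f**alpha)                # int u u^T CL^{1+alpha} = E_f[u u^T f^alpha]
    xi = np.mean(u * f**alpha)                  # int u CL^{1+alpha} = E_f[u f^alpha]
    J = np.mean(u**2 * f**(2 * alpha)) - xi**2
    return H, J

print(H_J_alpha(theta=0.0, alpha=0.5))
```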
Remark 1. For $\alpha = 0$ we get the CMLE of $\theta$, $\widehat{\theta}_c$. At the same time, it is well known that

$$\sqrt{n}\left(\widehat{\theta}_c - \theta_0\right) \xrightarrow[n \to \infty]{\mathcal{L}} N\!\left(\boldsymbol{0},\ G^{-1}(\theta_0)\right),$$

where $G(\theta)$ denotes the Godambe information matrix defined by $G(\theta) = H(\theta)\, J^{-1}(\theta)\, H(\theta)$, with $H(\theta)$ being the sensitivity or Hessian matrix and $J(\theta)$ being the variability matrix, defined, respectively, by

$$H(\theta) = E_\theta\!\left[ -\frac{\partial u(\theta, \boldsymbol{Y})}{\partial \theta^T} \right] \qquad \text{and} \qquad J(\theta) = \operatorname{Var}_\theta\!\left[ u(\theta, \boldsymbol{Y}) \right].$$

3. A New Model Selection Criterion
In order to describe the CLDIC criterion, we consider the model given in (6). Following standard methodology (cf. [28], p. 240), the most suitable candidate model to describe the data $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ is the model that minimizes the expected estimated DPD $E\!\left[ d_\alpha(g, \mathcal{CL}(\widehat{\theta}_\alpha, \cdot)) \right]$ or, equivalently, since the two quantities differ only in a term which does not depend on $\theta$, the expected value

$$E\!\left[ W_{\widehat{\theta}_\alpha} \right], \tag{14}$$

subject to the assumption that the unknown model $g$ belongs to $\{F_\theta,\ \theta \in \Theta\}$, i.e., the true model is included in $\mathcal{M}$, and taking into account that $\widehat{\theta}_\alpha$, defined in (9), is a consistent and asymptotically normally distributed estimator of $\theta$. However, this expected value still depends on the unknown parameter $\theta$, so an asymptotically unbiased estimator of (14) could be the basis of a selection criterion, for $\alpha > 0$. In order to proceed with the derivation of such an estimator, observe that the empirical version of $W_\theta$ in (7) is $W_n(\theta)$, given in (8). It plays a central role in the development of the model selection criterion on the basis of the next theorem, which expresses the expected value $E[W_{\widehat{\theta}_\alpha}]$ by means of the respective expected value of $W_n(\widehat{\theta}_\alpha)$, in an asymptotically equivalent way.
Theorem 1. If the true distribution $g$ belongs to the parametric family $\{F_\theta,\ \theta \in \Theta\}$ and $\theta_0$ denotes the true value of the parameter $\theta$, then we have

$$E\!\left[ W_{\widehat{\theta}_\alpha} \right] = E\!\left[ W_n(\widehat{\theta}_\alpha) \right] + \frac{1+\alpha}{n} \operatorname{tr}\!\left( H_\alpha^{-1}(\theta_0)\, J_\alpha(\theta_0) \right) + o\!\left( n^{-1} \right),$$

with $H_\alpha(\theta_0)$ and $J_\alpha(\theta_0)$ given in (11) and (12), respectively.

Based on the above theorem, whose proof is presented in full detail in Appendix A, an asymptotically unbiased estimator of $E[W_{\widehat{\theta}_\alpha}]$ is given by

$$W_n(\widehat{\theta}_\alpha) + \frac{1+\alpha}{n} \operatorname{tr}\!\left( H_\alpha^{-1}(\widehat{\theta}_\alpha)\, J_\alpha(\widehat{\theta}_\alpha) \right).$$

This ascertainment is the basis and a strong motivation for the next definition, which introduces the model selection criterion.
Definition 2. Let $\mathcal{M}$ be a family of candidate models for the observations $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$. The selected model is the one that verifies

$$\min_{\mathcal{M}} \left\{ W_n(\widehat{\theta}_\alpha) + \frac{1+\alpha}{n} \operatorname{tr}\!\left( H_\alpha^{-1}(\widehat{\theta}_\alpha)\, J_\alpha(\widehat{\theta}_\alpha) \right) \right\},$$

where $W_n(\theta)$ was given in (8) and $H_\alpha$ and $J_\alpha$ were defined in (11) and (12), respectively.
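In practice, Definition 2 amounts to computing, for each candidate model, the empirical objective (8) at the CMDPDE plus the trace penalty, and keeping the smallest value. The following schematic sketch shows this; the callables `W_n`, `cmdpde` and `H_J_alpha` stand for model-specific implementations such as those sketched earlier, and are hypothetical placeholders rather than a published API.

```python
# Schematic CLDIC model selection: the smallest penalized objective wins.
# `models` maps a model name to (W_n, cmdpde, H_J_alpha) callables for that
# candidate model; these are hypothetical placeholders.
import numpy as np

def cldic(model, y, alpha):
    W_n, cmdpde, H_J_alpha = model
    theta_hat = cmdpde(y, alpha)                 # minimizer of W_n, as in (9)
    H, J = H_J_alpha(theta_hat, alpha)
    H2, J2 = np.atleast_2d(H), np.atleast_2d(J)  # allow scalar or matrix H, J
    penalty = (1 + alpha) / len(y) * np.trace(np.linalg.inv(H2) @ J2)
    return W_n(theta_hat, y, alpha) + penalty

def select_model(models, y, alpha):
    scores = {name: cldic(m, y, alpha) for name, m in models.items()}
    return min(scores, key=scores.get), scores
```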
The next remark summarizes the model selection criterion in the case $\alpha = 0$; it therefore extends, in a sense, the pioneering and classic AIC.

Remark 2. For $\alpha = 0$ we have

$$d_{KL}(g, \mathcal{CL}(\theta, \cdot)) = -\int_{\mathbb{R}^m} g(\boldsymbol{y}) \log \mathcal{CL}(\theta, \boldsymbol{y})\, d\boldsymbol{y} + k,$$

with $k = \int_{\mathbb{R}^m} g(\boldsymbol{y}) \log g(\boldsymbol{y})\, d\boldsymbol{y}$ a constant which does not depend on $\theta$. Therefore, the most appropriate model to select is the model which minimizes the expected value

$$E\!\left[ -\int_{\mathbb{R}^m} g(\boldsymbol{y}) \log \mathcal{CL}(\widehat{\theta}_c, \boldsymbol{y})\, d\boldsymbol{y} \right], \tag{15}$$

where $\widehat{\theta}_c$ is the CMLE of $\theta$ defined in (9). The expected value (15) still depends on the unknown parameter $\theta$. A natural estimator of (15) can be obtained by replacing the distribution function $G$, of $g$, by the empirical distribution function based on $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$,

$$-\frac{1}{n} \sum_{i=1}^{n} c\ell(\widehat{\theta}_c, \boldsymbol{Y}_i).$$

Based on it, we select the model that verifies

$$\min_{\mathcal{M}} \left\{ -\frac{1}{n} \sum_{i=1}^{n} c\ell(\widehat{\theta}_c, \boldsymbol{Y}_i) + \frac{1}{n} \operatorname{tr}\!\left( H^{-1}(\widehat{\theta}_c)\, J(\widehat{\theta}_c) \right) \right\},$$

where $H(\theta)$ and $J(\theta)$ are defined in Remark 1. In a manner quite similar to that of the previous theorem, it can be established that this quantity is an asymptotically unbiased estimator of (15). This would be the model selection criterion in a composite likelihood framework based on the Kullback–Leibler divergence. We can observe that it coincides with the criterion given in [22] as a generalization of the classical criterion of Akaike, which will be referred to from now on as the Composite Akaike Information Criterion (CAIC).

4. Numerical Simulations
4.1. Scenario 1: Two-Component Mixed Model
We start with a simulation example which is motivated by, and follows ideas from, the paper [29] and Example 4.1 in [20], and which compares the behaviour of the proposed criterion with the CAIC criterion, i.e., the case $\alpha = 0$ (see Remark 2).
Consider a random vector $\boldsymbol{Y}$ from an unknown density $g$ and let $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ be independent and identically distributed replications of $\boldsymbol{Y}$, described by the true but unknown distribution $g$. As in Section 2, since the true model $g$ is unknown, we consider a parametric identifiable family of candidate distributions to describe the observations, and we denote by $\mathcal{CL}(\theta, \boldsymbol{y})$ the composite likelihood function associated with the parametric model.
We consider the problem of choosing, on the basis of $n$ independent and identically distributed replications $\boldsymbol{Y}_1, \ldots, \boldsymbol{Y}_n$ of $\boldsymbol{Y}$, between a 4-variate normal distribution, $N_4(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$, with mean vector $\boldsymbol{\mu}_1$ and variance-covariance matrix $\boldsymbol{\Sigma}$, and a 4-variate $t$-distribution with $\nu$ degrees of freedom, $t_4(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}, \nu)$, with a different location parameter $\boldsymbol{\mu}_2$, the same variance-covariance matrix $\boldsymbol{\Sigma}$, and density

$$f_{t_4}(\boldsymbol{y}) = \frac{\Gamma\!\left(\frac{\nu+4}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) (\nu \pi)^2\, |\boldsymbol{\Sigma}|^{1/2}} \left( 1 + \frac{1}{\nu} (\boldsymbol{y} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_2) \right)^{-\frac{\nu+4}{2}}, \qquad \boldsymbol{y} \in \mathbb{R}^4.$$
Consider the composite likelihood function

$$\mathcal{CL}_N(\theta, \boldsymbol{y}) = f_{A_1}(y_1, y_2; \theta)\, f_{A_2}(y_3, y_4; \theta),$$

with $A_1 = \{1, 2\}$ and $A_2 = \{3, 4\}$, where $f_{A_1}$ and $f_{A_2}$ are the densities of the corresponding marginals of $N_4(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$, i.e., bivariate normal distributions with mean vectors $(\mu_{11}, \mu_{12})^T$ and $(\mu_{13}, \mu_{14})^T$, respectively, and common variance-covariance matrix

$$\boldsymbol{\Sigma}_{(2)} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

In a similar manner, consider the composite likelihood

$$\mathcal{CL}_t(\theta, \boldsymbol{y}) = f_{A_1}^{*}(y_1, y_2; \theta)\, f_{A_2}^{*}(y_3, y_4; \theta),$$

with $A_1 = \{1, 2\}$ and $A_2 = \{3, 4\}$, where $f_{A_1}^{*}$ and $f_{A_2}^{*}$ are the densities of the corresponding marginals of $t_4(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}, \nu)$, i.e., bivariate $t$-distributions with mean vectors $(\mu_{21}, \mu_{22})^T$ and $(\mu_{23}, \mu_{24})^T$, respectively, and common variance-covariance matrix $\boldsymbol{\Sigma}_{(2)}$. Under this formulation, the simulation study follows in the next two scenarios.
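Under the stated block structure, each candidate composite likelihood is a product of two bivariate marginal densities, which is straightforward to evaluate with scipy. In the sketch below the mean vectors, $\rho$ and the degrees of freedom are placeholder values, not the specific settings of this section.

```python
# Sketch: the two candidate pairwise composite densities of Scenario 1,
# with placeholder parameter values (not the paper's settings).
import numpy as np
from scipy.stats import multivariate_normal, multivariate_t

rho, nu = 0.5, 5                                   # placeholders
S2 = np.array([[1.0, rho], [rho, 1.0]])            # common 2x2 marginal block

def cl_normal(y, mu):
    return (multivariate_normal(mu[:2], S2).pdf(y[:2])
            * multivariate_normal(mu[2:], S2).pdf(y[2:]))

def cl_student(y, mu):
    # bivariate marginals of a 4-variate t keep the same degrees of freedom
    return (multivariate_t(loc=mu[:2], shape=S2, df=nu).pdf(y[:2])
            * multivariate_t(loc=mu[2:], shape=S2, df=nu).pdf(y[2:]))

y = np.zeros(4)
print(cl_normal(y, np.zeros(4)), cl_student(y, np.zeros(4)))
```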
4.1.1. Scenario 1a
Following Example 4.1 in [20], the steps of the simulation study are the following: for each value of a mixing proportion $\lambda \in [0, 1]$ and each sample size $n$, we generate 1000 samples from the mixture $\lambda\, N_4(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + (1 - \lambda)\, t_4(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}, \nu)$; for each sample, the criterion is computed under both candidate models and the model with the smallest value is selected (a sketch of this loop is given at the end of this subsection).

Results are summarized in Table 1, which reports the number of times that the 4-variate normal model was selected. The extreme values of $\lambda$ represent samples generated entirely from the 4-variate $t$-distribution and the 4-variate normal distribution, respectively. This means that, for $\lambda = 1$, perfect discrimination is achieved when 1000 of the 1000 simulated samples are correctly assigned, while for $\lambda = 0$, the nearer the count is to 0, the better the discrimination of the criterion. $\lambda = 1/2$ means that each sample was generated from the normal and the $t$-distribution in the same proportion.
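For completeness, here is a compressed sketch of the counting loop behind Table 1. It reuses `select_model` from the sketch in Section 3, and the sampler callables and all constants are hypothetical placeholders, not the exact settings used in the study.

```python
# Sketch of the Scenario 1a loop: generate mixture samples and count how often
# the normal model is selected. All settings are placeholders.
import numpy as np

def scenario_1a(lam, n, alpha, models, sample_normal, sample_t, reps=1000, seed=3):
    rng = np.random.default_rng(seed)
    normal_selected = 0
    for _ in range(reps):
        mask = rng.random(n) < lam                    # lam: mixing proportion
        # each row comes from the normal or the t component, shape (n, 4)
        y = np.where(mask[:, None], sample_normal(rng, n), sample_t(rng, n))
        chosen, _ = select_model(models, y, alpha)    # from the CLDIC sketch
        normal_selected += (chosen == "normal")
    return normal_selected
```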
4.1.2. Scenario 1b
The same scenario is evaluated under closer means $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ and for moderate to large sample sizes $n$. Here, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are taken nearer to each other than in Scenario 1a. Results are shown in Table 2. In this case, the models under consideration are more similar, so it is understandable that the CLDIC criterion does not discriminate as well as before.
4.2. Scenario 2: Three-Component Mixed Model
Now, we consider a mixed model composed of two 4-variate normal distributions and a 4-variate $t$-distribution with $\nu$ degrees of freedom. The three distributions have a common variance-covariance matrix, as in the previous scenario, with unknown parameter $\rho$, and different but known means $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\boldsymbol{\mu}_3$. The model is defined by

$$g(\boldsymbol{y}) = \lambda_1\, f_{N_4(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})}(\boldsymbol{y}) + \lambda_2\, f_{N_4(\boldsymbol{\mu}_2, \boldsymbol{\Sigma})}(\boldsymbol{y}) + (1 - \lambda_1 - \lambda_2)\, f_{t_4(\boldsymbol{\mu}_3, \boldsymbol{\Sigma}, \nu)}(\boldsymbol{y}),$$

with $\boldsymbol{\Sigma}$ being again a common variance-covariance matrix with unknown parameter $\rho$, of the form

$$\boldsymbol{\Sigma} = \begin{pmatrix} 1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix}.$$

Following the same steps as in the first scenario, we generate 1000 samples of the three-component mixture for different sample sizes $n$ and different values of $\lambda_1$ and $\lambda_2$. Then, we consider the problem of choosing among the two 4-variate normal distributions and the 4-variate $t$-distribution through the CLDIC criterion, for different values of the tuning parameter $\alpha$. See Table 3 for the results. Here, the normal models are denoted by N1 and N2, respectively, while the 4-variate $t$-distribution is denoted by MT. The first three cases evaluate the selected model when the samples are generated from each one of these multivariate distributions. In the last two cases, a mixed model is considered as the true distribution.
4.3. Discussion of Results
In Scenario 1a, two well-differentiated multivariate models are considered. In this case, the CLDIC criterion works in a very efficient way, with almost perfect discrimination for extreme values of $\lambda$. The good behaviour is also observed for less extreme values of $\lambda$. We cannot observe a significant difference across the choices of $\alpha$.

In Scenario 1b we consider closer models, which affects the discrimination power of the CLDIC. However, in this case, we do observe great differences when considering different values of $\alpha$. While the discrimination power of the CLDIC is limited for $\alpha = 0$ (CAIC) and small values of $\alpha$, for larger values of $\alpha$ the behaviour is excellent. This also holds for large but not extreme values of $\alpha$; a medium value of $\alpha$, however, turns into a worse discrimination for low values of $n$.

Scenario 2 deals with three different models, two multivariate normal and one multivariate $t$ (N1, N2 and MT, respectively). The second normal distribution is closer to MT in terms of means. While the CLDIC criterion discriminates well between N1 and N2 and between N1 and MT, it has difficulties in distinguishing the N2 and MT distributions, especially for small sample sizes and small values of $\alpha$.

It seems, therefore, that when we have well-discriminated models, the CLDIC criterion works very well, independently of the sample size and the tuning parameter considered. Dealing with closer models leads, as expected, to worse results, especially for $\alpha = 0$ (CAIC).
Note that the behaviour of Wald-type and Rao tests based on CMDPDEs was studied in [12,13] through extensive simulation studies.