1. Introduction
Model assessment, that is, assessing the adequacy of a model and/or the ability to perform model selection, is one of the fundamental components of statistical analyses. For example, in the model adequacy problem one usually begins with a fixed model, and interest centers on measuring the cost of model misspecification. A natural way to create a framework within which we can assess model misspecification is to use statistical distances as loss functions. These constructs measure the distance between the unknown distribution that generated the data and an estimate from the data model. By identifying statistical distances as loss functions, we can begin to understand the role distances play in model fitting and selection, as they become measures of the overall cost of model misspecification. This strategy allows us to investigate the construction of a loss function as the maximum error in a list of model fit questions. Therefore, our fundamental question is the following. How can one design a loss function that is scientifically and statistically meaningful? We would like to be able to attach a specific scientific meaning to the numerical values of the loss, so that a value of the distance equal to 4, for example, has an explicit interpretation in terms of our statistical goals. When we select between models, we would like to measure the quality of the approximation via the model’s ability to provide answers to important scientific questions. This presupposes that the meaning of “best fitting model” should depend on the “statistical questions” being asked of the model.
Lindsay [1] discusses a distance-based framework for assessing model adequacy. A fundamental tenet of the framework for model adequacy put forward by Lindsay [1] is that it is possible and reasonable to carry out a model-based scientific inquiry without believing that the model is true, and without assuming that the truth is included in the model. All this, of course, assumes that we have a way to measure the quality of the approximation to the “truth” that is offered by the model. This point of view never assumes the correctness of the model. Of course, it is rather presumptuous to label any distribution as the truth, as any basic modeling assumption generated by the sampling scheme that provided the data is never exactly true. An example of a basic modeling assumption might be “X_1, …, X_n are independent, identically distributed from an unknown distribution τ”. This, as any other statistical assumption, is subject to question even in the most idealized of data collection frameworks. However, we believe that well designed experiments can generate data that are similar to data from idealized models; therefore, we operate as if the basic assumption is true. This means that we assume that there is a true distribution τ that generates the data, which is “knowable” if we can collect an infinite amount of data. Furthermore, we note that the basic modeling assumption will be the global framework for assessment of all more restrictive assumptions about the data generation mechanism. In a sense, it is the “nonparametric” extension of the more restrictive models that might be considered.
We let 𝒫 be the class of all distributions consistent with the basic assumptions. Hence τ ∈ 𝒫, and sets ℳ ⊂ 𝒫 are called models. We assume that τ ∉ ℳ; hence, there is a permanent model misspecification error. Statistical distances will then provide a measure for the model misspecification error.
One natural way to measure model adequacy is to define a loss function that describes the loss incurred when the model element M is used instead of the true distribution τ. Such a loss function should, in principle, indicate, in an inferential sense, how far apart the two distributions τ, M are. In the next section, we offer a formal definition of the concept of a statistical distance.
If the statistical questions of interest can be expressed as a list of functionals {T_h(M) : h ∈ H} of the model M that we wish to be uniformly close to the same functionals T_h(τ) of the true distribution, then we can turn the set of model fit questions into a distance via
ρ(τ, M) = sup_{h ∈ H} |T_h(τ) − T_h(M)|,
where the supremum is taken over the class of functionals of interest. Using the supremum of the individual errors is one way of assessing overall error, and it has the nice feature that its value gives a bound on all individual errors. The statistical questions of interest may be global, such as: is the normal model correct in every aspect? Or we may be interested in answers to a few key characteristics, such as the mean.
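As a small illustration of this construction, the sketch below (in R, the language used for the computations in Section 6) evaluates a short list of functionals — the mean, the standard deviation and two tail probabilities — on a hypothetical “true” distribution and on a candidate model, and reports the largest absolute discrepancy. The two distributions and the list of functionals are purely illustrative choices, not part of the original analysis.

```r
# Hypothetical discrete "truth" tau and candidate model m on the support 0:10
support <- 0:10
tau <- dbinom(support, size = 10, prob = 0.35)   # assumed true distribution
m   <- dpois(support, lambda = 3.5)              # candidate model, renormalized on 0:10
m   <- m / sum(m)

# A small list of functionals T_h: mean, standard deviation, two tail probabilities
functionals <- list(
  mean  = function(p) sum(support * p),
  sd    = function(p) sqrt(sum(support^2 * p) - sum(support * p)^2),
  P_le2 = function(p) sum(p[support <= 2]),
  P_ge7 = function(p) sum(p[support >= 7])
)

# rho(tau, m) = sup_h |T_h(tau) - T_h(m)| over the chosen list
errors <- sapply(functionals, function(Th) abs(Th(tau) - Th(m)))
errors
max(errors)   # the value of the distance bounds every individual error
```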
Lindsay et al. [2] introduced a class of statistical distances, called quadratic distances, and studied their use in the context of goodness-of-fit testing. Furthermore, Markatou et al. [3] discuss extensively the chi-squared distance, a special case of quadratic distance, and its role in robustness. In this paper, we study non-quadratic distances and their role in model assessment. The paper is organized as follows. Section 2 presents the definition of a statistical distance and its associated properties. Section 3, Section 4 and Section 5 discuss in detail three popular distances, the total variation, the mixture index of fit and the Kullback-Leibler distance, with the aim of understanding their role in model assessment problems. The likelihood distance is also briefly discussed in Section 5. Section 6 illustrates computation and applications of the total variation, mixture index of fit and Kullback-Leibler distances. Finally, Section 7 presents discussion and conclusions pertaining to the use of the total variation and mixture index of fit distances.
2. Statistical Distances and Their Properties
If we adopt the usual convention that loss functions are nonnegative in their arguments, zero if the correct model is used, and larger in value if the two distributions are not very similar, then the loss ρ(τ, M) can also be viewed as a distance between τ, M. In fact, we will always assume that for any two distributions F, G
ρ(F, G) ≥ 0, with equality if and only if F and G are the same for all statistical purposes.
If this holds, we will say that ρ is a statistical distance. Unlike the requirements for a metric, we do not require symmetry. In fact, there is no reason that the loss should be symmetric, as the roles of τ, M are different. We also do not require ρ to be nonzero when the arguments differ. This zero property will allow us to specify that two distributions are equivalent as far as our statistical purposes are concerned by giving them zero distance.
Furthermore, it is important to note that if τ is in 𝒫 and the model element is in ℳ = {m_θ : θ ∈ Θ}, say m_θ, then the distance ρ induces a loss function on the parameter space via
L(θ) = ρ(τ, m_θ).
Therefore, if τ is in the model, the losses defined by ρ are parametric losses.
We begin within the discrete distribution framework. Let 𝒯 = {0, 1, …, T}, where T is possibly infinite, be a discrete sample space. On this sample space we define a true probability density τ(t), as well as a family of densities {m_θ(t) : θ ∈ Θ}, where Θ is the parameter space. Assume we have n independent and identically distributed random variables X_1, …, X_n producing the realizations x_1, …, x_n from τ(t). We record the data as d(t) = n(t)/n, where n(t) is the number of observations in the sample with value equal to t. We note here that we use the word “density” in a generic fashion that incorporates both probability mass functions and probability density functions. A rather formal definition of the concept of statistical distance is as follows.
Definition 1. (Markatou et al. [3]) Let τ, m be two probability density functions. Then ρ(τ, m) is a statistical distance between the corresponding probability distributions if ρ(τ, m) ≥ 0, with equality if and only if τ and m are the same for all statistical purposes. We would require ρ(τ, m) to indicate the worst mistake that we can make if we use m instead of τ. The precise meaning of this statement is obvious in the case of the total variation distance that we discuss in detail in Section 3 of the paper.
We would also like our statistical distances to be convex in their arguments.
Definition 2. Let τ, m be a pair of probability density functions, with m being represented as m = αm_1 + (1 − α)m_2, 0 ≤ α ≤ 1. We say that the statistical distance ρ(τ, m) is convex in the right argument if
ρ(τ, αm_1 + (1 − α)m_2) ≤ αρ(τ, m_1) + (1 − α)ρ(τ, m_2),
where m_1, m_2 are two probability density functions.

Definition 3. Let τ, m be a pair of probability density functions, and assume τ = ατ_1 + (1 − α)τ_2, 0 ≤ α ≤ 1. Then, we say that ρ(τ, m) is convex in the left argument if
ρ(ατ_1 + (1 − α)τ_2, m) ≤ αρ(τ_1, m) + (1 − α)ρ(τ_2, m),
where τ_1, τ_2 are two densities.

Lindsay et al. [2] define and study quadratic distances as measures of goodness of fit, a form of model assessment. In the next sections, we study non-quadratic distances and their role in the problem of model assessment. We begin with the total variation distance.
3. Total Variation
In this section, we study the properties of the total variation distance. We offer a loss function interpretation of this distance and discuss sensitivity issues associated with its use. We will begin with the case of discrete probability measures and then move to the case of continuous probability measures. The results presented here are novel and are useful in selecting the distances to be used in any given problem.
The total variation distance is defined as follows.
Definition 4. Let τ, m be two probability distributions. We define the total variation distance between the probability mass functions τ(t), m(t) to be
V(τ, m) = (1/2) Σ_t |τ(t) − m(t)|.
This measure is also known as the L_1-distance (without the factor 1/2) or index of dissimilarity.
Corollary 1. The total variation distance takes values in the interval [0, 1].

Proof. By definition Σ_t |τ(t) − m(t)| ≥ 0, with equality if and only if τ(t) = m(t) for all t. Moreover, |τ(t) − m(t)| ≤ τ(t) + m(t). But τ, m are probability mass functions (or densities), therefore
Σ_t [τ(t) + m(t)] = 2,
and hence
Σ_t |τ(t) − m(t)| ≤ 2,
or, equivalently,
(1/2) Σ_t |τ(t) − m(t)| ≤ 1.
Therefore 0 ≤ V(τ, m) ≤ 1. ☐
Proposition 1. The total variation distance is a metric.

Proof. By definition, the total variation distance is non-negative. Moreover, it is symmetric because |τ(t) − m(t)| = |m(t) − τ(t)|, and it satisfies the triangle inequality since, for any probability mass function g,
Σ_t |τ(t) − m(t)| ≤ Σ_t |τ(t) − g(t)| + Σ_t |g(t) − m(t)|.
Thus, it is a metric. ☐
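A minimal R sketch of Definition 4, using two illustrative probability mass functions on a common four-point support; it also checks numerically that the value lies in [0, 1] and that the distance is symmetric, in line with Corollary 1 and Proposition 1.

```r
# Total variation distance between two discrete pmfs on a common support
total_variation <- function(tau, m) 0.5 * sum(abs(tau - m))

tau <- c(0.2, 0.5, 0.2, 0.1)        # illustrative "truth"
m   <- c(0.25, 0.25, 0.25, 0.25)    # illustrative model

V <- total_variation(tau, m)
V                                   # lies in [0, 1]
total_variation(m, tau) == V        # symmetry
```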
The following proposition states that the total variation distance is convex in both the left and right arguments.

Proposition 2. Let τ, m be a pair of densities with τ represented as τ = ατ_1 + (1 − α)τ_2, 0 ≤ α ≤ 1. Then
V(ατ_1 + (1 − α)τ_2, m) ≤ αV(τ_1, m) + (1 − α)V(τ_2, m).
Moreover, if m is represented as m = αm_1 + (1 − α)m_2, 0 ≤ α ≤ 1, then
V(τ, αm_1 + (1 − α)m_2) ≤ αV(τ, m_1) + (1 − α)V(τ, m_2).

Proof. It is a straightforward application of the definition of the total variation distance. ☐
The total variation measure has major implications for prediction probabilities. A statistically useful interpretation of the total variation distance is that it can be thought of as the worst error we can commit in probability when we use the model m instead of the truth τ. The maximum value of this error equals 1 and it occurs when τ, m are mutually singular.

Denote by T(A) the probability of a set A under the measure τ and by M(A) the probability of A under the measure m.
Proposition 3. Let τ, m be two probability mass functions. Then
V(τ, m) = sup_A |T(A) − M(A)|,
where A is a subset of the Borel σ-field ℬ.

Proof. Define the sets B_1 = {t : τ(t) > m(t)}, B_2 = {t : τ(t) < m(t)} and B_3 = {t : τ(t) = m(t)}. Notice that
Σ_t |τ(t) − m(t)| = Σ_{t ∈ B_1} [τ(t) − m(t)] + Σ_{t ∈ B_2} [m(t) − τ(t)] + Σ_{t ∈ B_3} |τ(t) − m(t)|.
Because on the set B_3 the two probability mass functions are equal, Σ_{t ∈ B_3} |τ(t) − m(t)| = 0, and hence
Σ_t |τ(t) − m(t)| = [T(B_1) − M(B_1)] + [M(B_2) − T(B_2)].
Note that, because of the nature of the sets B_1 and B_2, both terms in the last expression are positive. Moreover, since Σ_t [τ(t) − m(t)] = 0, the two terms are equal, and the supremum of |T(A) − M(A)| over all sets A is attained at A = B_1 (or, equivalently, at A = B_2). Therefore
V(τ, m) = (1/2) Σ_t |τ(t) − m(t)| = T(B_1) − M(B_1) = sup_A |T(A) − M(A)|. ☐
Remark 1. The model misspecification measure has a “minimax” expression
inf_m sup_A |T(A) − M(A)|.
This indicates the sense in which the measure assesses the overall risk of using m instead of τ, and then chooses the m that minimizes the aforementioned risk.
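Proposition 3 can be verified by brute force on a small support: enumerate every subset A, compute |T(A) − M(A)|, and compare the maximum with (1/2)Σ_t |τ(t) − m(t)|. The pmfs below are illustrative.

```r
tau <- c(0.1, 0.4, 0.3, 0.2)
m   <- c(0.3, 0.2, 0.2, 0.3)

# Enumerate all 2^4 subsets A of a four-point sample space as logical vectors
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(tau))))
sup_A <- max(apply(subsets, 1, function(A) abs(sum(tau[A]) - sum(m[A]))))

c(sup_A = sup_A, half_L1 = 0.5 * sum(abs(tau - m)))   # the two values agree
```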
We now offer a testing interpretation of the total variation distance. We establish that the total variation distance can be obtained as a solution to a suitably defined optimization problem. It is obtained as that test function which maximizes the difference between the power and level of a suitably defined test problem.
Definition 5. A randomized test function for testing a statistical hypothesis H_0 : τ = f versus the alternative H_1 : τ = g is a (measurable) function ϕ defined on the sample space and taking values in the interval [0, 1] with the following interpretation. If x is the observed value of X and ϕ(x) = y, then a coin whose probability of falling heads is y is tossed and H_0 is rejected when a head appears. In the case where y is either 0 or 1 for all x, the test is called non-randomized.
Proposition 4. Let H_0 : τ = f versus H_1 : τ = g, where f, g are probability mass functions, and let ϕ be a test function. Then
V(f, g) = sup_ϕ { E_g[ϕ(X)] − E_f[ϕ(X)] },
that is, the total variation distance is the maximum, over all test functions, of the difference between the power and the level of the test.

An advantage of the total variation distance is that it is not sensitive to small changes in the density. That is, if τ(t) is replaced by τ(t) + ε(t), where Σ_t ε(t) = 0 and Σ_t |ε(t)| is small, then
|V(τ + ε, m) − V(τ, m)| ≤ (1/2) Σ_t |ε(t)|.
Therefore, when the changes in the density are small, the change in the distance is also small. When describing a population, it is natural to describe it via the proportion of individuals in various subgroups. Having V(τ, m) small would ensure uniform accuracy for all such descriptions. On the other hand, populations are also described in terms of a variety of other variables, such as means. Having the total variation measure small does not imply that means are close on the scale of standard deviation.
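The insensitivity to small perturbations can be seen numerically: perturb τ by a small ε(t) that sums to zero and compare the change in the distance with (1/2)Σ_t |ε(t)|. The pmfs and the perturbation below are illustrative, and the block is self-contained.

```r
total_variation <- function(p, q) 0.5 * sum(abs(p - q))

tau <- c(0.2, 0.5, 0.2, 0.1)
m   <- c(0.25, 0.25, 0.25, 0.25)
eps <- c(0.01, -0.01, 0.005, -0.005)   # small perturbation, sums to zero

abs(total_variation(tau + eps, m) - total_variation(tau, m))  # small change in V
0.5 * sum(abs(eps))                                           # the bound (1/2) sum |eps|
```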
Remark 2. The total variation distance is not differentiable in its arguments. Using V(d, m_θ) as an inference function, where d denotes the data estimate of τ (i.e., d(t) = n(t)/n), yields estimators of θ that do not have smooth, asymptotically normal behavior when the model is true [4]. This feature is related to the pathologies of the variation distance described by Donoho and Liu [5]. However, if parameter estimation is of interest, one can use alternative divergences that are free of these pathologies. We now study the total variation distance in continuous probability models.
Definition 6. The total variation distance between two probability density functions τ, m is defined as
V(τ, m) = (1/2) ∫ |τ(x) − m(x)| dx.
The total variation distance has the same interpretation as in the discrete probability model case. That is,
V(τ, m) = sup_A |T(A) − M(A)|,
where A ranges over the measurable sets.
One of the important issues in the construction of distances in continuous spaces is the issue of invariance, because the behavior of distance measures under transformations of the data is of interest. Suppose we take a monotone transformation of the observed variable X and use the corresponding model distribution; how does this transformation affect the distance between X and the model?
Invariance seems to be desirable from an inferential point of view, but difficult to achieve without forcing one of the distributions to be continuous and appealing to the probability integral transform for a common scale. In multivariate continuous spaces, the problem of transformation invariance is even more difficult, as there is no longer a natural probability integral transformation to bring data and model on a common scale.
Proposition 5. Let V(τ, m) be the total variation distance between the densities τ, m of a random variable X. If Y = u(X) is a one-to-one transformation of the random variable X, then
V(τ_Y, m_Y) = V(τ, m),
where τ_Y, m_Y denote the densities of Y under τ and m, respectively.

Proof. Write
V(τ_Y, m_Y) = (1/2) ∫ |τ_Y(y) − m_Y(y)| dy = (1/2) ∫ |τ(u^{−1}(y)) − m(u^{−1}(y))| |(u^{−1})′(y)| dy,
where u^{−1} is the inverse transformation. Next, we do a change of variable in the integral. Set x = u^{−1}(y), from where we obtain y = u(x) and dy = u′(x) dx; the prime denotes derivative with respect to the corresponding argument. Then
V(τ_Y, m_Y) = (1/2) ∫ |τ(x) − m(x)| |(u^{−1})′(u(x))| |u′(x)| dx.
Now, since u is a one-to-one transformation, it is monotone and |(u^{−1})′(u(x))| |u′(x)| = 1. Thus
V(τ_Y, m_Y) = (1/2) ∫ |τ(x) − m(x)| dx = V(τ, m). ☐
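A numerical illustration of Proposition 5, under the assumption that the two densities are those of N(0, 1) and N(1, 1) and that the one-to-one transformation is exp(·), so that the transformed variables are lognormal. Both integrals should return (approximately) the same value.

```r
# Total variation between N(0,1) and N(1,1)
f_orig <- function(x) 0.5 * abs(dnorm(x, 0, 1) - dnorm(x, 1, 1))
V_orig <- integrate(f_orig, -Inf, Inf)$value

# Total variation between the densities of exp(X): lognormal(0,1) vs lognormal(1,1)
f_trans <- function(y) 0.5 * abs(dlnorm(y, meanlog = 0, sdlog = 1) -
                                 dlnorm(y, meanlog = 1, sdlog = 1))
V_trans <- integrate(f_trans, 0, Inf)$value

c(V_orig, V_trans)   # both approximately equal 2*pnorm(0.5) - 1
```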
A fundamental problem with the total variation distance is that it cannot be used to compute the distance between a discrete distribution and a continuous distribution because the total variation distance between a continuous measure and a discrete measure is always the maximum possible, that is 1. This inability of the total variation distance to discriminate between discrete and continuous measures can be interpreted as asking “too many questions” at once, without any prioritization. This limits its use despite its invariance characteristics.
We now discuss the relationship between the total variation distance and Fisher information. Denote by τ_n(x; θ) = ∏_{i=1}^n m_θ(x_i) the joint density of n independent and identically distributed random variables. Then we have the following proposition.

Proposition 6. The total variation distance is locally equivalent to the Fisher information number, that is, with δ_n = δ/√n,
lim_{n→∞} V(τ_n(·; θ), τ_n(·; θ + δ_n)) = (|δ|/2) E|Z| = (|δ|/2) √(2 I(θ)/π), Z ∼ N(0, I(θ)),
where m_θ, m_{θ+δ_n} are two discrete probability models and I(θ) is the Fisher information.

Proof. Now, expand τ_n(x; θ + δ_n) using a Taylor series in the neighborhood of θ to obtain
τ_n(x; θ + δ_n) ≅ τ_n(x; θ) + δ_n τ_n′(x; θ),
where the prime denotes derivative with respect to the parameter θ. Further, write τ_n′(x; θ) = τ_n(x; θ) S_n(x; θ) to obtain
V(τ_n(·; θ), τ_n(·; θ + δ_n)) ≅ (|δ_n|/2) Σ_x τ_n(x; θ) |S_n(x; θ)|,
where
S_n(x; θ) = ∂ log τ_n(x; θ)/∂θ = Σ_{i=1}^n ∂ log m_θ(x_i)/∂θ
is the score function. Therefore, assuming that n^{−1/2} S_n(X; θ) converges to a normal random variable in absolute mean, then
V(τ_n(·; θ), τ_n(·; θ + δ_n)) ≅ (|δ|/2) E|n^{−1/2} S_n(X; θ)| → (|δ|/2) E|Z|,
because δ_n = δ/√n, Z ∼ N(0, I(θ)) and n^{−1/2} S_n(X; θ) converges in distribution to Z when n → ∞. ☐
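As a small numerical illustration of this local behavior, consider a single observation (n = 1) from the N(θ, 1) family, for which I(θ) = 1 and E|∂ log f/∂θ| = √(2/π). The exact total variation between N(0, 1) and N(δ, 1) is 2Φ(δ/2) − 1, which can be compared with the first-order quantity (|δ|/2)√(2/π). The family and the values of δ are illustrative assumptions.

```r
delta <- c(0.5, 0.1, 0.01)

# Exact total variation between N(0,1) and N(delta,1): densities cross at delta/2
V_exact  <- 2 * pnorm(delta / 2) - 1

# First-order (local) approximation: (|delta|/2) * E|score|, with E|Z| = sqrt(2/pi)
V_approx <- (abs(delta) / 2) * sqrt(2 / pi)

cbind(delta, V_exact, V_approx)   # the two columns agree as delta -> 0
```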
The total variation is a non-quadratic distance. It is, however, related to a quadratic distance, the Hellinger distance, defined as
H²(τ, m) = Σ_t (√τ(t) − √m(t))²,
by the following inequality.

Proposition 7. Let τ, m be two probability mass functions. Then
V(τ, m) ≤ H(τ, m) √(1 − H²(τ, m)/4).

Proof. Straightforward using the definitions of the distances involved and the Cauchy-Schwarz inequality; Hölder's inequality provides the result as well. ☐
Note that Σ_t (√τ(t) − √m(t))² = 2[1 − Σ_t √(τ(t) m(t))]; the square root of this quantity, that is H(τ, m), is known as Matusita’s distance [6,7]. Further, define the affinity between two probability densities by
A(τ, m) = Σ_t √(τ(t) m(t)).
Then, it is easy to prove that
1 − A(τ, m) ≤ V(τ, m) ≤ √(1 − A²(τ, m)).
The above inequality indicates the relationship between total variation and Matusita’s distance.
4. Mixture Index of Fit
Rudas, Clogg, and Lindsay [8] proposed a new index of fit approach to evaluate the goodness of fit of contingency table models based on the mixture model framework. The approach focuses attention on the discrepancy between the model and the data, and allows comparisons across studies. Suppose ℳ = {m_θ : θ ∈ Θ} is the baseline model. The family of models which is proposed for evaluating goodness of fit is a two-point mixture model given by
ℳ_π = {τ : τ = (1 − π) m_θ + π e, m_θ ∈ ℳ, e an arbitrary distribution}.
Here π denotes the mixing proportion, which is interpreted as the proportion of the population outside the model ℳ. In the robustness literature the mixing proportion corresponds to the contamination proportion, as explained below. In the contingency table framework m_θ, e describe the tables of probabilities for each latent class. The family of models ℳ_π defines a class of nested models as π varies from zero to one. Thus, if the model ℳ does not fit the data well, then by increasing π, the model ℳ_π will be an adequate fit for π sufficiently large.

We can motivate the index of fit by thinking of the population as being composed of two classes with proportions 1 − π and π, respectively. The first class is perfectly described by ℳ, whereas the second class contains the “outliers”. The index of fit can then be interpreted as the fraction of the population intrinsically outside ℳ, that is, the proportion of outliers in the sample.

We note here that these ideas can be extended beyond the contingency table framework. In our setting, the probability distribution describing the true data generating mechanism may be written as τ = (1 − π) m_θ + π e, where 0 ≤ π ≤ 1 and e is arbitrary. This representation of τ is not unique, in that we can construct another representation τ = (1 − π′) m_θ′ + π′ e′. However, there always exists the smallest unique π such that there exists a representation of τ that puts the maximum proportion in the population class described by the model. Next, we define formally the mixture index of fit.
Definition 7. (Rudas, Clogg, and Lindsay [8]) The mixture index of fit π* is defined by
π*(τ, ℳ) = inf{π : τ = (1 − π) m_θ + π e, m_θ ∈ ℳ, e an arbitrary distribution}.
Notice that π* is a distance. This is because, if we set ρ(τ, m) = π*(τ, m) for a fixed model element m, we have ρ(τ, m) ≥ 0 and ρ(τ, m) = 0 if τ = m.

Definition 8. Define the statistical distance π*(τ, m) as follows:
π*(τ, m) = inf{π : τ = (1 − π) m + π e, e an arbitrary distribution}.

Remark 3. Note that, to be able to present Proposition 8 below, we have turned arbitrary discrete distributions into vectors. As an example, if the sample space is {0, 1, 2} and τ assigns probabilities τ(0), τ(1), τ(2) to these points, we write this discrete distribution as the vector (τ(0), τ(1), τ(2)). If, furthermore, we consider the vectors e_1 = (1, 0, 0), e_2 = (0, 1, 0) and e_3 = (0, 0, 1) as degenerate distributions assigning mass 1 at positions 0, 1, 2, then τ = τ(0) e_1 + τ(1) e_2 + τ(2) e_3. This representation of distributions is used in the proof of Proposition 8.
Proposition 8. The set of vectors τ = (τ(0), τ(1), …, τ(T)) satisfying the relationship Σ_t τ(t) = 1, τ(t) ≥ 0, is a simplex with extremal points e_t, where e_t is the vector with 1 at the t-th position and 0 everywhere else.

Proof. Given τ with Σ_t τ(t) = 1 and τ(t) ≥ 0, there exists a representation of τ as a convex combination of the vectors e_t. Write any arbitrary discrete distribution τ as follows:
τ = Σ_t τ(t) e_t,
where Σ_t τ(t) = 1 and e_t takes the value 1 at the t-th position and the value 0 everywhere else. Then τ is a convex combination of the extremal points e_t and hence belongs to a simplex. ☐
Proposition 9. Let m be a fixed model element with m(t) > 0 for all t. Then
π*(τ, m) = 1 − min_t [τ(t)/m(t)].

Proof. Set π* = 1 − min_t [τ(t)/m(t)]. Then
(1 − π*) m(t) ≤ τ(t) for all t,
with equality at some t. Let now the error term be
e(t) = [τ(t) − (1 − π*) m(t)] / π*.
Then e(t) ≥ 0, Σ_t e(t) = 1, and π* cannot be made smaller without making e negative at a point t. This concludes the proof. ☐
Corollary 2. We have
π*(τ, m) = 1
if there exists t such that τ(t) = 0 and m(t) > 0.

Proof. By Proposition 9, π*(τ, m) = 1 − min_t [τ(t)/m(t)] ≤ 1, and the ratio τ(t)/m(t) equals 0 at such a point t, so that π*(τ, m) = 1. ☐
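Using the closed form of Proposition 9 (π*(τ, m) = 1 − min_t τ(t)/m(t) for a fixed model element m), the index can be computed directly. The helper and pmfs below are illustrative, not the package functions mentioned in Section 6.

```r
# Mixture index of fit against a fixed model element m (cells with m(t) = 0 ignored)
pistar <- function(tau, m) 1 - min(tau[m > 0] / m[m > 0])

tau <- c(0.05, 0.40, 0.35, 0.20)
m   <- c(0.25, 0.25, 0.25, 0.25)
pistar(tau, m)        # proportion attributed to the "outlier" component

# Check: the implied error component e is a proper pmf
p <- pistar(tau, m)
e <- (tau - (1 - p) * m) / p
all(e >= -1e-12)      # nonnegative
sum(e)                # sums to one
```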
One of the advantages of the mixture index of fit is that it has an intuitive interpretation that does not depend upon the specific nature of the model being assessed. Liu and Lindsay [9] extended the results of Rudas et al. [8] to the Kullback-Leibler distance. Computational aspects of the mixture index of fit are discussed in Xi and Lindsay [4], as well as in Dayton [10] and Ispány and Verdes [11].
Finally, a new interpretation of the mixture index of fit was presented by Ispány and Verdes [11]. Let 𝒫 be the set of probability measures and τ ∈ 𝒫. If d is a distance measure on 𝒫 and ℳ ⊂ 𝒫, then π*(τ, ℳ) is the least non-negative solution, in π, of the equation
d(τ, {(1 − π) m + π e : m ∈ ℳ, e ∈ 𝒫}) = 0.
Next, we offer some interpretations associated with the mixture index of fit. The statistical interpretations made with this measure are attractive, as any statement based on the model applies to at least a proportion 1 − π* of the population involved. However, while the “outlier” model seems interpretable and attractive, the distance itself is not very robust.

In other words, small changes in the probability mass function do not necessarily mean small changes in distance. This is because, if m(t_0) = ε for some cell t_0, then a change of ε in τ(t_0) from ε to 0 causes π*(τ, m) to go to 1. Moreover, assume that our framework is that of continuous probability measures, and that our model is a normal density. If τ is a lighter tailed distribution than our normal model m, then
inf_x [τ(x)/m(x)] = 0, and therefore π*(τ, m) = 1.
That is, light tailed densities are interpreted as outliers. Therefore, the mixture index of fit measures error from the model in a “one-sided” way. This is in contrast to total variation, which measures the size of “holes” as well as the “outliers” by allowing the distributional errors to be neutral.
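The one-sided behavior can be illustrated on a grid: take m = N(0, 1) and a lighter-tailed τ = N(0, 0.5²). The density ratio τ/m tends to zero in the tails, so the grid-based mixture index approaches 1, even though the total variation distance between the two densities is only moderate. The densities and the grid are illustrative, and the grid evaluation only approximates the infimum over the real line.

```r
x   <- seq(-6, 6, by = 0.01)
tau <- dnorm(x, 0, 0.5)    # lighter-tailed than the model
m   <- dnorm(x, 0, 1)      # normal model

1 - min(tau / m)                    # grid approximation of pi*: essentially 1
0.5 * sum(abs(tau - m)) * 0.01      # grid approximation of V(tau, m): moderate
```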
In what follows, we show that if we can find a mixture representation for the true distribution, then this implies a small total variation distance between the true probability mass function τ and the assumed model m. Specifically, we have the following.

Proposition 10. Let π* be the mixture index of fit. If π*(τ, m) = π*, then
V(τ, m) ≤ π*.

Proof. Write τ = (1 − π*) m + π* e, with e a probability mass function. This is because there always exists the smallest unique π* such that τ can be represented as a mixture model. Thus, the above relationship can be written as τ − m = π*(e − m), and hence
V(τ, m) = (1/2) Σ_t |τ(t) − m(t)| = π* · (1/2) Σ_t |e(t) − m(t)| = π* V(e, m) ≤ π*. ☐
There is a mixture representation that connects total variation with the mixture index of fit. This is presented below.

Proposition 11. The total variation distance admits the mixture representation
V(τ, m) = min{π : τ = (1 − π) k + π e_1 and m = (1 − π) k + π e_2, for some probability mass functions k, e_1, e_2}.

Proof. Fix τ; for any given m, let (k, e_1, e_2, π) be a solution to the equations
τ = (1 − π) k + π e_1, m = (1 − π) k + π e_2.    (1)
Let B = {t : τ(t) > m(t)} and B^c = {t : τ(t) ≤ m(t)}, and note that, since Σ_t [τ(t) − m(t)] = 0, then
Σ_{t ∈ B} [τ(t) − m(t)] = Σ_{t ∈ B^c} [m(t) − τ(t)] = V(τ, m).
Rewrite now Equation (1) as follows:
τ(t) − m(t) = π [e_1(t) − e_2(t)],
where e_1 and e_2 are probability mass functions. Thus, ignoring the constraints, every pair (e_1, e_2) satisfying the equation above also satisfies
e_1(t) = e_2(t) + [τ(t) − m(t)]/π
for some number π > 0. Moreover, such a pair must have π e_1(t) ≥ τ(t) − m(t) and π e_2(t) ≥ m(t) − τ(t), in order for the constraints e_1(t) ≥ 0, e_2(t) ≥ 0 to be satisfied. Hence, varying e_2 over the probability mass functions compatible with these constraints gives a class of solutions. To determine the smallest possible π, sum the first inequality over B and the second over B^c,
π Σ_{t ∈ B} e_1(t) ≥ Σ_{t ∈ B} [τ(t) − m(t)], π Σ_{t ∈ B^c} e_2(t) ≥ Σ_{t ∈ B^c} [m(t) − τ(t)],
and adding these we obtain
π [Σ_{t ∈ B} e_1(t) + Σ_{t ∈ B^c} e_2(t)] ≥ 2 V(τ, m),
and the maximum value of the bracketed term, equal to 2, is obtained when e_1 puts all of its mass on B and e_2 puts all of its mass on B^c. Therefore π ≥ V(τ, m), and the choice
e_1(t) = [τ(t) − m(t)]⁺ / V(τ, m), e_2(t) = [m(t) − τ(t)]⁺ / V(τ, m), k(t) = min{τ(t), m(t)} / [1 − V(τ, m)]
attains the bound, and so
V(τ, m) = min{π : (1) holds}. ☐
Therefore, for small π* the mixture index of fit and the total variation distance are nearly equal.
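A small numeric comparison of the two quantities under the contamination representation τ = (1 − π)m + πe (model, outlier component and contamination levels all illustrative): by Proposition 10 the total variation cannot exceed π, and when the outlier component sits mostly where the model puts little mass the two quantities are nearly equal.

```r
m <- c(0.33, 0.33, 0.33, 0.01)      # model element with one low-probability cell
e <- c(0.00, 0.00, 0.00, 1.00)      # "outlier" component concentrated on that cell
for (p in c(0.02, 0.05, 0.10)) {
  tau <- (1 - p) * m + p * e
  V   <- 0.5 * sum(abs(tau - m))
  ps  <- 1 - min(tau / m)           # mixture index of fit (Proposition 9)
  cat(sprintf("pi = %.2f   V = %.4f   pi* = %.4f\n", p, V, ps))
}
```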
5. Kullback-Leibler Distance
The Kullback-Leibler distance [12] is extensively used in statistics and in particular in model selection. The celebrated AIC model selection criterion [13] is based on this distance. In this section, we present the Kullback-Leibler distance and some of its properties, with particular emphasis on interpretations.
Definition 9. The Kullback-Leibler distance between two densities τ, m is defined as
KL(τ, m) = Σ_t m(t) log[m(t)/τ(t)],
or, in the continuous case,
KL(τ, m) = ∫ m(x) log[m(x)/τ(x)] dx.

Proposition 12. The Kullback-Leibler distance is nonnegative, that is,
KL(τ, m) ≥ 0,
with equality if and only if τ = m.

Proof. Set G(u) = u log u − u + 1; then G is a convex, non-negative function that equals 0 at u = 1. Therefore KL(τ, m) = Σ_t τ(t) G(m(t)/τ(t)) ≥ 0. ☐
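A short sketch of the computation on two illustrative pmfs. The helper evaluates the generic directed divergence Σ_t p(t) log[p(t)/q(t)]; plugging the arguments in either order gives the two orientations discussed in this section, and their sum gives the symmetric (J) divergence mentioned below.

```r
# Directed divergence sum(p * log(p / q)); assumes q > 0 wherever p > 0
kl_div <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

tau <- c(0.2, 0.5, 0.2, 0.1)
m   <- c(0.25, 0.25, 0.25, 0.25)

kl_div(tau, m)                      # one orientation
kl_div(m, tau)                      # reversed arguments: a different value
kl_div(tau, m) + kl_div(m, tau)     # symmetric Kullback-Leibler (J) divergence
```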
Definition 10. We define the likelihood distance between two densities τ, m as
λ(τ, m) = Σ_t τ(t) log[τ(t)/m(t)].
The intuition behind the above expression of the likelihood distance comes from the fact that the log-likelihood, in the case of discrete random variables taking the values 1, 2, …, m, where m is the number of groups, can be written, after appropriate algebraic manipulations, in the above form.

Alternatively, we can write the likelihood distance as
λ(τ, m) = Σ_t m(t) {[δ(t) + 1] log[δ(t) + 1] − δ(t)}, with δ(t) = [τ(t) − m(t)]/m(t),
and use this relationship to obtain insight into connections of the likelihood distance with the chi-squared measures studied by Markatou et al. [3].
Specifically, if we write Pearson’s chi-squared statistic as
X²(τ, m) = Σ_t [τ(t) − m(t)]²/m(t) = Σ_t m(t) δ²(t),
then from the functional relationship
[δ + 1] log[δ + 1] − δ ≤ δ², δ ≥ −1,
we obtain that λ(τ, m) ≤ X²(τ, m). However, it is also clear from the right tails of the two functions that there is no way to bound λ(τ, m) below by a multiple of X²(τ, m). Hence, these measures are not equivalent in the same way that the Hellinger distance and the symmetric chi-squared are (see Lemma 4, Markatou et al. [3]). In particular, knowing that λ(τ, m) is small is no guarantee that all Pearson z-statistics are uniformly small.
On the other hand, one can show by the same mechanism that λ(τ, m) ≥ c S²(τ, m), where c is a positive constant and S²(τ, m) is the symmetric chi-squared distance given as
S²(τ, m) = Σ_t [τ(t) − m(t)]² / {[τ(t) + m(t)]/2}.
It is therefore true that a small likelihood distance implies small z-statistics with blended variance estimators. However, the reverse is not true, because the right tail in r = τ(t)/m(t) for S² is of magnitude r, as opposed to r log r for the likelihood distance.
These comparisons provide some feeling for the statistical interpretation of the likelihood distance. Its meaning as a measure of model misspecification is unclear. Furthermore, our impression is that the likelihood distance, like Pearson’s chi-squared, is too sensitive to outliers and gross errors in the data. Despite the theoretical and computational advantages of the Kullback-Leibler distance, a point of inconvenience in the context of model selection is its lack of symmetry. One can show that reversing the roles of the arguments in the Kullback-Leibler divergence can yield substantially different results. The sum of the Kullback-Leibler distance and the likelihood distance produces the symmetric Kullback-Leibler distance, or J divergence. This measure is symmetric in its arguments, and when used as a model selection measure it is expected to be more sensitive than each of the individual components.
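The differing sensitivity to outliers can be illustrated by moving a small amount of probability into a cell where the model puts very little mass: Pearson’s chi-squared and the likelihood-type divergence react strongly, while the total variation distance changes only by the amount of mass moved. All pmfs below are illustrative.

```r
pearson  <- function(tau, m) sum((tau - m)^2 / m)
lik_dist <- function(tau, m) sum(tau[tau > 0] * log(tau[tau > 0] / m[tau > 0]))
tot_var  <- function(tau, m) 0.5 * sum(abs(tau - m))

m   <- c(0.50, 0.30, 0.199, 0.001)   # model with one low-probability cell
tau <- c(0.48, 0.29, 0.190, 0.040)   # truth places a bit more mass in that cell

c(Pearson = pearson(tau, m), Likelihood = lik_dist(tau, m), TV = tot_var(tau, m))
```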
6. Computation and Applications of Total Variation, Mixture Index of Fit and Kullback-Leibler Distances
The distances discussed in this paper are used in a number of important applications. Euán et al. [14] use the total variation to detect changes in wave spectra, while Alvarez-Esteban et al. [15] cluster time series data on the basis of the total variation distance. The mixture index of fit has found a number of applications in the area of social sciences. Rudas et al. [8] provided examples of the application of π* to two-way contingency tables. Applications involving differential item functioning and latent class analysis were presented in Rudas and Zwick [16] and Dayton [17], respectively. Formann [18] applied it in regression models involving continuous variables. Finally, Revuelta [19] applied the π* goodness-of-fit statistic to finite mixture item response models that were developed mainly in connection with Rasch models [20,21].
The Kullback-Leibler (KL) distance [12] is fundamental in information theory and its applications. In statistics, the celebrated Akaike Information Criterion (AIC) [13,22], widely used in model selection, is based on the Kullback-Leibler distance. There are numerous additional applications of the KL distance in fields such as fluid mechanics, neuroscience and machine learning. In economics, Smith, Naik, and Tsai [23] use the KL distance to simultaneously select the number of states and variables associated with Markov-switching regression models that are used in marketing and other business applications. The KL distance is also used in diagnostic testing for ruling in or ruling out disease [24,25], as well as in a variety of other fields [26].
Table 1 presents the software, written in R, that can be used to compute the aforementioned distances. Additionally, Zhang and Dayton [27] present a SAS program to compute the two-point mixture index of fit for two-class latent class analysis models with dichotomous variables. There are a number of different algorithms that can be used to compute the mixture index of fit for contingency tables. Rudas et al. [8] propose to use a standard EM algorithm, while Xi and Lindsay [4] use sequential quadratic programming and discuss technical details and numerical issues related to applying nonlinear programming techniques to estimate π*. Dayton [10] discusses explicitly the practical advantages associated with the use of nonlinear programming as well as its limitations, while Pan and Dayton [28] study a variety of additional issues associated with computing π*. Additional algorithms associated with the computation of π* can be found in Verdes [29] and Ispány and Verdes [11].
We now describe a simulation study that aims to illustrate the performance of the total variation, Kullback-Leibler, and mixture index of fit as model selection measures. Data are generated from contaminated normal models in which a proportion ε of the observations comes from a second normal component that differs from N(0, 1) either in its mean (asymmetric contamination) or in its variance (symmetric contamination), where ε is the percentage of contamination. Specifically, we generate 500 Monte Carlo samples of sample sizes 200, 1000, and 5000 as follows. If the sample has size n and the percentage of contamination is ε, then ε·n observations are generated from the contaminating component, with a shifted mean in the asymmetric model or an inflated variance in the symmetric model, and the remaining (1 − ε)·n observations are generated from the N(0, 1) model; several values of the contaminating mean and variance were used. The total variation distance was computed between the simulated data and the N(0, 1) model. The Kullback-Leibler distance was calculated between the data generated from the aforementioned contamination models and a random sample of the same size n from N(0, 1). When computing the mixture index of fit, we specified the component distribution as a normal distribution with initial mean 0 and variance 1. All simulations were carried out on a laptop computer with an Intel Core i7 processor and a 64-bit Windows 7 operating system. The R packages used are presented in Table 1.
Table 2 and Table 3 present means and standard deviations of the total variation and Kullback-Leibler distances as a function of the contamination model and the sample size. To compute the total variation distance we use the R function “TotalVarDist” of the R package “distrEx”. It smooths the empirical distribution of the provided data using a normal kernel and computes the distance between the smoothed empirical distribution and the provided continuous distribution (in our case this distribution is N(0, 1)). We note here that the package “distrEx” provides an alternative option to compute the total variation which relies on discretizing the continuous distribution and then computing the distance between the discretized continuous distribution and the data. We think that smoothing the data to obtain an empirical estimator of the density and then calculating its distance from the continuous density is a more natural way to handle the difference in scale between the discrete data and the continuous model. Lindsay [1] and Markatou et al. [3] discuss this phenomenon and call it discretization robustness. The Kullback-Leibler distance was computed using the function “KLD.matrix” of the R package “bioDist”.
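A hedged sketch of this kind of computation using only base R, rather than the packages in Table 1: data are drawn from an asymmetric contaminated normal model, the empirical density is kernel-smoothed, and the total variation and a directed Kullback-Leibler-type divergence from the N(0, 1) reference are approximated on a grid. The contamination proportion, the mean shift of 3, the grid and the orientation of the divergence are illustrative assumptions, not the settings of the reported study.

```r
set.seed(1)
n   <- 1000
eps <- 0.10                                     # contamination proportion (illustrative)
k   <- rbinom(1, n, eps)                        # number of contaminated observations
x   <- c(rnorm(n - k, 0, 1), rnorm(k, 3, 1))    # asymmetric contamination (mean shift of 3 assumed)

grid  <- seq(-8, 8, length.out = 2001)
dx    <- grid[2] - grid[1]
f_hat <- density(x, from = -8, to = 8, n = 2001)$y   # kernel-smoothed data density on the grid
f_mod <- dnorm(grid, 0, 1)                           # reference model N(0, 1)

TV <- 0.5 * sum(abs(f_hat - f_mod)) * dx                       # grid approximation of V
KL <- sum(f_hat * log(pmax(f_hat, 1e-300) / f_mod)) * dx       # directed KL-type divergence
c(TV = TV, KL = KL)
```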
We observe from the results of Table 2 and Table 3 that the total variation distance for small percentages of contamination is small and generally smaller than the Kullback-Leibler distance for both asymmetric and symmetric contamination models, with a considerably smaller standard deviation. The above behavior of the total variation distance in comparison to the Kullback-Leibler distance manifests itself across all sample sizes used.

Table 4 presents the mixture index of fit computed using the R function “pistar.uv” from the R package “pistar” (https://rdrr.io/github/jmedzihorsky/pistar/man/; accessed on 5 June 2018). Since the fundamental assumption in the definition of the mixture index of fit is that the population on which the index is applied is heterogeneous and expressed via the two-point mixture model, we only used the asymmetric contamination model for various values of the contamination distribution.
We observe that the mixture index of fit generally estimates the mixing proportion ε well. We also observe (see Table 4) that when the mean of the second normal population is close to zero, the bias associated with estimating the mixing (or contamination) proportion can be substantial. This is expected, because the contaminating population is then very close to N(0, 1), creating an essentially unimodal sample. As the means of the two normal components become more separated, the mixture index of fit provides better estimates of the mixing quantity and of the percentage of observations that need to be removed so that N(0, 1) provides a good fit to the remaining data points.
7. Discussion and Conclusions
Divergence measures are widely used in scientific work, and popular examples of these measures include the Kullback-Leibler divergence, the Bregman divergence [30], the power divergence family of Cressie and Read [31], the density power divergence family [32] and many others. Two relatively recent books that discuss various families of divergences are Pardo [33] and Basu et al. [34].
In this paper we discuss specific divergences that do not belong to the family of quadratic divergences, and we examine their role in assessing model adequacy. The total variation distance might be preferable as it seems closest to a robust measure, in that if the two probability measures differ only on a set of small probability, such as a few outliers, then the distance must be small. This was clearly exemplified in Table 2 and Table 3 of Section 6. Outliers influence chi-squared measures more. For example, Pearson’s chi-squared distance can be made dramatically larger by increasing the amount of data in a cell with small model probability m(t). In fact, if there is data in a cell with model probability zero, the distance is infinite. Note that if data occur in a cell with probability, under the model, equal to zero, then it is possible that the model is not true. Still, even in this case, we might wish to use the model on the premise that it provides a good approximation.
There is a pressing need for the further development of well-tested software for computing the mixture index of fit. This measure is intuitive and has found many applications in the social sciences. Reiczigel et al. [35] discuss bias-corrected point estimates of π*, as well as a bootstrap test and new confidence limits, in the context of contingency tables. Well-developed and tested software will further popularize the dissemination and use of this method.
The mixture index of fit ideas were extended to testing general model adequacy problems by Liu and Lindsay [9]. Recent work by Ghosh and Basu [36] presents a systematic procedure for generating new divergences. Ghosh and Basu [36], building upon the work of Liu and Lindsay [9], generate new divergences through suitable model adequacy tests using existing divergences. Additionally, Dimova et al. [37] use the quadratic divergences introduced in Lindsay et al. [2] to construct a model selection criterion from which AIC and BIC can be obtained as special cases.
In this paper, we discuss non-quadratic distances that are used in many scientific fields where the problem of assessing the fitted models is of importance. In particular, our interest centered around the properties and potential interpretations of these distances, as we think this offers insight into their performance as measures of model misspecification. One important aspect for the dissemination and use of these distances is the existence of well-tested software that facilitates computation. This is an area where further development is required.