1. Introduction
In a variety of applications, the use of divergence-based inferential methods is gaining momentum, as these methods provide robust alternatives to traditional maximum likelihood-based procedures. Since the work of [1,2], divergence-based methods have been developed for various classes of statistical models. A comprehensive treatment of these ideas is available, for instance, in [3,4]. The objective of this paper is to study the large deviation tail behavior of minimum divergence estimators and, more specifically, the minimum Hellinger distance estimators (MHDE).
To describe the general problem, suppose $\Theta \subseteq \mathbb{R}^d$, and let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ denote a family of densities indexed by $\theta$. Let $\{X_n : n \ge 1\}$ denote a class of i.i.d. random variables, postulated to have a continuous density with respect to Lebesgue measure and belonging to the family $\mathcal{F}$, and let X be a generic element of this class. We denote by $g$ the true density of X.
Before providing an informal description of our results, we begin by recalling that the square of the Hellinger distance (SHD) between two densities $f$ and $h$ on $S$ is given by
$$\mathrm{SHD}(f, h) = \int_S \big(\sqrt{f(x)} - \sqrt{h(x)}\big)^2\,dx = 2 - 2\int_S \sqrt{f(x)\,h(x)}\,dx.$$
The quantity $\int_S \sqrt{f(x)\,h(x)}\,dx$ is referred to as the affinity between $f$ and $h$ and is denoted by $\rho(f, h)$. Hence, the SHD between the postulated density and the true density is given by
$$\mathrm{SHD}(f_\theta, g) = 2 - 2\,\rho(f_\theta, g), \qquad \theta \in \Theta.$$
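As a quick illustration of these definitions, the following sketch (not part of the paper; the function names and the choice of Gaussian densities are ours) computes the affinity and the SHD by numerical quadrature and checks the result against the closed form available for two normal densities with common variance.

```python
# Illustrative sketch: affinity rho(f, h) and squared Hellinger distance,
# computed by quadrature for two Gaussian densities.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def affinity(f, h, lo=-20.0, hi=20.0):
    """Affinity: integral of sqrt(f(x) * h(x)) dx."""
    val, _ = quad(lambda x: np.sqrt(f(x) * h(x)), lo, hi)
    return val

def shd(f, h, lo=-20.0, hi=20.0):
    """Squared Hellinger distance: 2 - 2 * affinity(f, h)."""
    return 2.0 - 2.0 * affinity(f, h, lo, hi)

f = norm(loc=0.0, scale=1.0).pdf
h = norm(loc=0.5, scale=1.0).pdf

# For N(mu1, 1) and N(mu2, 1) the affinity equals exp(-(mu1 - mu2)**2 / 8),
# which provides a closed-form check of the quadrature.
print(shd(f, h), 2.0 - 2.0 * np.exp(-0.25 / 8.0))
```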
When $\Theta$ is compact, it is known that there exists a unique $\theta_g \in \Theta$ minimizing the $\mathrm{SHD}(f_\theta, g)$. Furthermore, when $g = f_{\theta_0} \in \mathcal{F}$ and $\mathcal{F}$ satisfies an identifiability condition, it is well known that $\theta_g$ coincides with $\theta_0$; cf. [1]. Turning to the sample version, we replace $g$ by a kernel density estimator $g_n$ in the definition of SHD, obtaining the objective function
$$\mathrm{SHD}(f_\theta, g_n) = 2 - 2\int_S \sqrt{f_\theta(x)\,g_n(x)}\,dx,$$
and
$$g_n(x) = \frac{1}{n h_n}\sum_{i=1}^n K\Big(\frac{x - X_i}{h_n}\Big), \tag{1}$$
where the kernel $K$ is a probability density function, and $h_n \to 0$ and $n h_n \to \infty$ as $n \to \infty$.
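To fix ideas, here is a minimal toy sketch of the resulting estimator (again not from the paper): we assume a normal location family $f_\theta = N(\theta, 1)$, a Gaussian kernel, and the illustrative bandwidth choice $h_n = n^{-1/5}$, and we obtain the MHDE by numerically maximizing the affinity in $\theta$.

```python
# Toy MHDE sketch under assumed choices (normal location family, Gaussian
# kernel, h_n = n**(-1/5)); it only illustrates maximizing the affinity.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x_sample = rng.normal(loc=1.0, scale=1.0, size=200)   # true theta = 1
h_n = len(x_sample) ** (-1 / 5)

def g_n(x):
    """Gaussian-kernel density estimate at a scalar point x."""
    u = (x - x_sample) / h_n
    return np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h_n

def f_theta(x, theta):
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

def neg_affinity(theta):
    val, _ = quad(lambda x: np.sqrt(f_theta(x, theta) * g_n(x)), -10, 10)
    return -val

theta_hat = minimize_scalar(neg_affinity, bounds=(-5, 5), method="bounded").x
print("MHDE estimate:", theta_hat)
```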
It is known that when the parameter space $\Theta$ is compact, there exists a unique $\theta_n$ minimizing $\mathrm{SHD}(f_\theta, g_n)$, and that $\theta_n$ converges almost surely to $\theta_g$ as $n \to \infty$; cf. [1]. Furthermore, under some natural assumptions,
$$\sqrt{n}\,\big(\theta_n - \theta_g\big) \xrightarrow{d} G, \tag{2}$$
where, under the probability measure associated with $g$, G is a Gaussian random vector with mean vector $\mathbf{0}$ and covariance matrix $\Sigma_g$. If $g = f_{\theta_0} \in \mathcal{F}$, then the variance of G coincides with the inverse of the Fisher information matrix $I(\theta_0)$, yielding statistical efficiency. When the true distribution $g$ does not belong to $\mathcal{F}$, we will call this the “model misspecified case,” while when $g \in \mathcal{F}$, we will say that the “postulated model” holds.
In this paper, we focus on the large deviation behavior of $\theta_n$; namely, the asymptotic probability that the estimate $\theta_n$ will achieve values within a set $C$ away from the central tendency described in (2). We establish results of the form
$$P\big(\theta_n \in C\big) \approx \exp\Big(-n \inf_{\theta \in C} I(\theta)\Big), \qquad n \to \infty, \tag{3}$$
for some “rate function” $I$ and given Borel subset $C \subseteq \Theta$. Similar large deviation estimates for maximum likelihood estimators (MLE) have been investigated in [5,6,7], and for general M-estimators in [8,9]. These results allow for a precise description of the probabilities of Type I and Type II error in both the Neyman–Pearson and likelihood ratio test frameworks. Furthermore, large deviation bounds allow one to identify the best exponential rate of decrease of Type II error amongst all tests that satisfy a bound on the Type I error, as in Stein’s lemma (cf. [10]). Additional evidence of the importance of large deviation results for statistical inference has been described in [11] and in the book [12].
One of our initial goals was to derive sharp probability bounds for Type I and Type II error in the context of robust hypothesis testing using Hellinger deviance tests. This article is a first step towards this endeavor. A key issue that distinguishes our work from earlier works is that, in our case, the objective function is a nonlinear function of the smoothed empirical measure, and the analysis of this case requires more involved methods compared with those currently existing in the statistical literature on large deviations. Consistent with large deviation analysis more generally, we identify the rate function $I$ as the convex conjugate of a certain limiting cumulant generating function, although in our problem, we uncover a subtle asymmetry between the upper and lower bounds when the limiting generating function is nondifferentiable. In the classical large deviation literature, similar asymmetries have been studied in other one-dimensional contexts (e.g., [13]), although the statistical problem here is still quite different, as the dependence on the parameter $\theta$ arises explicitly, inhibiting the use of convexity methods typically exploited in the large deviation literature and hence requiring novel techniques.
1.1. Large Deviations
In this subsection, we provide the relevant definitions and properties from large deviation theory required in the sequel. In the following, $\mathbb{R}_+$ will denote the set of non-negative real numbers.
Definition 1. A collection of probability distributions $\{\mu_n : n \ge 1\}$ on a topological space $\mathcal{X}$ is said to satisfy the weak large deviation principle if, for every compact set $F \subseteq \mathcal{X}$,
$$\limsup_{n \to \infty} \frac{1}{n}\log \mu_n(F) \le -\inf_{x \in F} I(x),$$
and, for every open set $G \subseteq \mathcal{X}$,
$$\liminf_{n \to \infty} \frac{1}{n}\log \mu_n(G) \ge -\inf_{x \in G} I(x),$$
for some lower semicontinuous function $I : \mathcal{X} \to [0, \infty]$. The function I is called the rate function.
If the level sets of I are compact, we call I a good rate function, and we say that $\{\mu_n\}$ satisfies the large deviation principle (LDP).
We begin with a brief review of large deviation results for i.i.d. random variables and empirical measures. Let $\{X_i : i \ge 1\}$ be an i.i.d. sequence of real-valued random variables, and let $\mu_n$ denote the distribution of the sample mean $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. If the moment generating function of $X_1$ is finite in a neighborhood of the origin, then Cramér’s theorem states that $\{\mu_n\}$ satisfies the LDP with good rate function $\Lambda^*$, where
$$\Lambda^*(x) = \sup_{\lambda \in \mathbb{R}}\big[\lambda x - \Lambda(\lambda)\big]$$
is the convex conjugate (or Legendre–Fenchel transform) of $\Lambda$, and where $\Lambda(\lambda) = \log E\big[e^{\lambda X_1}\big]$ is the cumulant generating function of $X_1$ (cf. [10], Section 2.2).
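For concreteness, the sketch below (our own illustration, using a Bernoulli(p) variable) computes the Legendre–Fenchel transform $\Lambda^*(x)$ by numerical optimization and compares it with the well-known closed form, the Kullback–Leibler divergence between Bernoulli(x) and Bernoulli(p).

```python
# Illustrative sketch: Cramer rate function for Bernoulli(p) sample means,
# obtained by maximizing lambda*x - Lambda(lambda) numerically and compared
# with the closed form KL(Bernoulli(x) || Bernoulli(p)).
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.3

def cgf(lam):
    """Cumulant generating function of a Bernoulli(p) variable."""
    return np.log(1 - p + p * np.exp(lam))

def rate(x):
    """Legendre-Fenchel transform: sup over lambda of lambda*x - cgf(lambda)."""
    res = minimize_scalar(lambda lam: -(lam * x - cgf(lam)),
                          bounds=(-50, 50), method="bounded")
    return -res.fun

def kl_bernoulli(x, p):
    return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

for x in (0.4, 0.5, 0.7):
    print(x, rate(x), kl_bernoulli(x, p))
```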
Next, consider the empirical measures $\{L_n : n \ge 1\}$, defined by
$$L_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{X_i \in A\}}, \qquad A \in \mathcal{B}(\mathbb{R}),$$
where $\mathcal{B}(\mathbb{R})$ denotes the collection of Borel subsets of $\mathbb{R}$. It is well known (cf. [14]) that $L_n$ converges weakly to P, namely to the distribution of $X_1$. Then Sanov’s theorem asserts that $\{L_n\}$ satisfies a large deviation principle with rate function $I$ given by
$$I(Q) = \mathcal{K}(Q, P),$$
where $\mathcal{K}(Q, P)$ is the Kullback–Leibler information between the probability measures $Q$ and P. When $Q$ and P each possesses a density with respect to Lebesgue measure (say p and g, respectively), the above expression becomes
$$\mathcal{K}(Q, P) = \int p(x)\log\frac{p(x)}{g(x)}\,dx.$$
In Sanov’s theorem, the rate function $I$ is defined on the space of probability measures, which is a metric space with the open sets induced by weak convergence. Extensions of Sanov’s theorem to strong topologies have been investigated in the literature; cf., e.g., [15].
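As a small numerical check of the displayed formula (illustrative only; the choice of Gaussian densities is ours), the Kullback–Leibler information between two densities can be computed by quadrature and compared with its closed form for two normal densities with common variance.

```python
# Illustrative sketch: KL information between two Gaussian densities,
# computed by quadrature and checked against the closed form.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = norm(loc=0.5, scale=1.0).pdf
g = norm(loc=0.0, scale=1.0).pdf

kl, _ = quad(lambda x: p(x) * np.log(p(x) / g(x)), -20, 20)
print(kl, 0.5 * 0.5**2)   # closed form for equal variances: (mu_p - mu_g)^2 / 2
```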
We now turn to a general result, which will play a central role in this paper, namely Varadhan’s integral lemma (cf. [10], Theorem 4.3.1). This result will allow us to infer the scaled limit of a sequence of generating functions from the existence of the large deviation principle.
Lemma 1 (Varadhan). Let $\{Z_n : n \ge 1\}$ be a sequence of random variables taking values in a regular topological space $\mathcal{X}$, and assume that the probability law of $Z_n$ satisfies the LDP with good rate function I. Then for any bounded continuous function $\Phi : \mathcal{X} \to \mathbb{R}$,
$$\lim_{n \to \infty}\frac{1}{n}\log E\big[e^{n\Phi(Z_n)}\big] = \sup_{x \in \mathcal{X}}\big[\Phi(x) - I(x)\big]. \tag{7}$$
1.2. Minimum Hellinger Distance Estimator and Large Deviations
We first observe that the MHDE is obtained by maximizing the affinity
$$\rho(f_\theta, g_n) = \int_S \sqrt{f_\theta(x)\,g_n(x)}\,dx,$$
which involves solving the equation
$$\nabla_\theta \int_S \sqrt{f_\theta(x)\,g_n(x)}\,dx = 0.$$
The idea behind the large deviation analysis is to observe that the large deviation behavior of the maximizer can be extracted from that of the objective function $\rho(f_\theta, g_n)$ near $\theta_g$. By the Gärtner–Ellis theorem (cf. [10], Section 2.3), this amounts to investigating the asymptotic behavior as $n \to \infty$ of
$$\frac{1}{n}\log E\Big[\exp\big(n t\,\rho(f_\theta, g_n)\big)\Big], \tag{9}$$
where, as before, $g_n$ denotes the kernel density estimator (1) with bandwidth $h_n \to 0$ as $n \to \infty$. In the case of maximum likelihood estimation (MLE) or minimum contrast estimation (MCE), the objective function can be expressed as
$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, \theta) = \int \psi(x, \theta)\,d\mu_n(x), \tag{10}$$
where $\mu_n$ is the empirical measure associated with $X_1, \ldots, X_n$ and $\psi$ is the relevant criterion function. Thus, while the objective functions associated with the MLE and MCE are linear functions of the empirical measure, the affinity is a nonlinear function of the empirical measure. This creates certain complications in identifying the rate function $I$ alluded to in (3). Of course, in the case of likelihood and minimum contrast estimator analysis, an explicit formula for $I$ ensues as the Legendre–Fenchel transform of the cumulant generating function of $\psi(X_1, \theta)$, viz. $\Lambda_\theta(\lambda) = \log E\big[e^{\lambda\,\psi(X_1,\theta)}\big]$.
One approach to evaluating the limiting generating function is to apply Varadhan’s lemma as given above in (7). In the context of our problem, that requires an investigation into the large deviation principle for the density estimators $\{g_n\}$ viewed as elements of $L^1(S)$, viz. the space of integrable functions on S. Equivalently, we require a version of Sanov’s theorem in $L^1$-space, which leads to certain topological considerations. The main issue here is that, when $L^1(S)$ is equipped with a norm topology, the sequence of kernel density estimates $\{g_n\}$ possesses large deviation bounds, but the associated rate function may not have compact level sets, as is required for a typical application of Varadhan’s lemma. Nonetheless, one obtains a full LDP when $L^1(S)$ is equipped with the weak topology.
The asymptotic properties of the MHDE, such as consistency and asymptotic normality, are established using the norm convergence of $g_n$ to $g$. For this reason, we focus on a subclass of densities (see Proposition 1 below) possessing certain equicontinuity properties where norm convergence prevails. These issues are handled in Section 2, where the precise statements of our main results can also be found. Section 3 is devoted to the proofs of the main results. Section 4 contains some concluding remarks.
2. Notation, Assumptions, and Main Results
Let denote the postulated density of , defined on a measure space . Let denote the support of X and . Let the true density of be given by . Throughout the paper, we assume that the following regularity conditions hold.
Hypothesis 1. $\Theta$ is a compact and convex subset of $\mathbb{R}^d$.
Hypothesis 2. The family $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ is identifiable; namely, if $\theta_1 \neq \theta_2$, then $f_{\theta_1} \neq f_{\theta_2}$ on a set of positive Lebesgue measure.
Hypothesis 3. For every , is three times continuously differentiable with respect to all components of . Denote by the gradient of and its components by . Let denote the matrix of second partial derivatives of with respect to and the element of .
Hypothesis 4. Let the matrix of second partial derivatives of and be denoted by and , respectively. Assume that and are continuous in and that is positive definite for every . For and , let denote the smallest eigenvalue of the matrix Assume that , where c is independent of .
These hypotheses on the family $\mathcal{F}$ are generally standard and are used to establish the asymptotic properties of the MHDE. Sufficient conditions on $\mathcal{F}$ for the validity of these hypotheses are described in [3,16], and [17]. A remark on Hypothesis 4 is warranted here. When
, this assumption is related to the positive definiteness of the Fisher information matrix. If one assumes
, then this hypothesis reduces to the condition that
, which is standard. Finally, we remark that we have not attempted to provide the weakest regularity conditions, and we do believe some of these conditions can possibly be relaxed.
Recall that the MHDE of
can be obtained by solving the equation
where
is the
score function, which is obtained using
.
We begin by providing some heuristics for the case
. Let
denote the derivative of
when
. Let
denote the argzero of the function
obtained from (
11) above. Let
and
. Since
, we obtain using Markov’s inequality that for any
,
where
. Similarly, for
, it can be seen that
Thus, an evaluation of (
9) will allow us to obtain the logarithmic upper bound for
and
. Next, using the inequalities
under additional hypotheses, one can derive large deviation lower bounds for
. Deriving these bounds for MLE and MCE is rather standard, since the objective functions and their derivatives are
linear functionals of the empirical distribution, as stated in (
10), but this is not the case for the Hellinger distance.
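The Markov-inequality step underlying (12) and (13) is the familiar Chernoff-bound argument. As a sanity check in the simplest possible setting (our own toy example using the sample mean of i.i.d. N(0,1) variables, not the Hellinger objective), the following sketch compares a Monte Carlo estimate of an upper-tail probability with the optimized exponential bound $\exp(-n\varepsilon^2/2)$.

```python
# Illustrative sketch: Monte Carlo tail probability for the sample mean of
# i.i.d. N(0,1) variables versus the optimized Chernoff bound exp(-n*eps^2/2).
import numpy as np

rng = np.random.default_rng(2)
eps, reps = 0.4, 200000

for n in (10, 20, 40):
    means = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
    mc = np.mean(means >= eps)            # Monte Carlo estimate of P(mean >= eps)
    chernoff = np.exp(-n * eps**2 / 2)    # Chernoff upper bound
    print(n, mc, chernoff)
```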
Observe that the probabilities in (12) and (13) represent rare-event probabilities since, under the hypotheses described previously, $\theta_n$ converges to $\theta_g$ almost surely as $n \to \infty$. The distributional results concerning $\theta_n$ rely on the continuity and differentiability properties of the affinity, which depend nonlinearly on $g_n$, and the norm convergence of $g_n$ to g.
Let $\mathcal{G}$ denote the collection of all probability densities with support S. By Scheffé’s theorem, the pointwise convergence of a sequence of densities to g implies its convergence in $L^1(S)$ as $n \to \infty$. Additionally, when $g_n$ is the kernel density estimator, then Glick’s theorem guarantees that $\int_S |g_n(x) - g(x)|\,dx \to 0$ almost surely as $n \to \infty$ when $h_n \to 0$ and $n h_n \to \infty$; cf. [18]. Since the MHDE are functionals of density estimators, it is natural to expect that the large deviations of density estimators will play a significant role in our analysis. For this reason, one is forced to consider the topological issues that arise in the large deviation analysis of density estimators. Interestingly, it turns out that the weak topology on $L^1(S)$ plays a prominent role. This, in turn, leads to the question of whether certain continuity properties, which were part of the traditional theory of MHD analysis, continue to hold if $\mathcal{G}$ were viewed as a subset of $L^1(S)$ equipped with the weak topology. Expectedly, while the answer in general is no (cf. [19]), Proposition 1 provides sufficient conditions on the family $\mathcal{F}$ under which one additionally obtains norm convergence.
Before proceeding, we now introduce some further regularity conditions, as follows.
Hypothesis 5. and is an -continuous function of .
Hypothesis 6. The family consists of bounded equicontinuous densities.
Hypothesis 7. The family consists of bounded and equicontinuous densities.
Hypothesis 8. and is an -continuous function of .
Here, we note that Hypotheses 6 and 7 are related. Furthermore, if one is willing to assume that $g \in \mathcal{F}$, then one does not need Hypothesis 7. On the other hand, if one believes that parametric distributions are approximations to $g$, then one needs to work with Hypothesis 7. For this reason, we have maintained both of these hypotheses in our main results. Hypotheses 5 and 8 are related to the finiteness of the Fisher information and are standard in the statistical literature.
Before we state the first proposition, we recall the definition of the weak topology on $L^1(S)$ (cf. [19]). A sequence $\{p_n\} \subset L^1(S)$ is said to converge weakly in $L^1(S)$ to $p$ if
$$\int_S p_n(x)\,\phi(x)\,dx \to \int_S p(x)\,\phi(x)\,dx \qquad \text{as } n \to \infty$$
for every $\phi \in L^\infty(S)$, where $L^\infty(S)$ is the class of essentially bounded functions. We assume throughout the paper that the topology on $\Theta$ is the standard topology generated by the Euclidean metric.
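The distinction between the weak and the norm topology on $L^1$ can be seen in a standard example (not from the paper): the densities $p_n(x) = 1 + \sin(2\pi n x)$ on $[0,1]$ converge weakly to the uniform density but not in $L^1$ norm. The sketch below verifies this numerically on a grid.

```python
# Illustrative sketch: p_n(x) = 1 + sin(2*pi*n*x) on [0, 1] is a density that
# converges weakly in L^1 (integrals against bounded phi vanish) but whose
# L^1 distance to the uniform density stays near 2/pi.
import numpy as np

m = 200000
x = (np.arange(m) + 0.5) / m      # midpoint grid on [0, 1]
dx = 1.0 / m
phi = np.exp(-x)                  # a bounded test function

for n in (1, 10, 100, 1000):
    p_n = 1.0 + np.sin(2 * np.pi * n * x)
    weak_gap = np.sum((p_n - 1.0) * phi) * dx   # -> 0 as n grows
    l1_gap = np.sum(np.abs(p_n - 1.0)) * dx     # stays near 2/pi ~ 0.64
    print(n, round(weak_gap, 6), round(l1_gap, 6))
```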
Proposition 1. Let denote the class of densities, equipped with the weak topology. Further assume that Hypotheses 1–7 hold. Let be equipped with the product topology. Then the mapping defined byis jointly continuous in . Furthermore, if , thenFinally, under Hypothesis 7, the family is a weakly sequentially closed subset of . Our next result is concerned with the limit behavior of the generating function of . In the following we use the notation to mean the probability measures associated with and are absolutely continuous.
Theorem 1. Assume that Hypotheses 1–7 hold, and setThen exists and is a convex function given bywhere Remark 1. Since is defined via a limiting operation, it is hard to extract its qualitative properties. However, we can obtain a simple lower bound by observing that if and only if , and an upper bound using that the Kullback–Leibler information is nonnegative. This results in the following bounds:Furthermore, if all densities in are bounded by one, then impliesUsing a variational argument, it can be shown that the supremum on the right-hand side is attained at given bycf. [20]. Furthermore, the maximum that results from this choice of isyielding yet another lower bound for , although the comparison of these two lower bounds is not immediate. Returning to our main discussion, recall from [
21] that the convex conjugate of the function
is defined by
Let
denote the domain of
; namely,
and let
denote the range of the gradient map
; that is,
We begin with the discussion of the case
. In this case, the generating function
reduces to
By the convexity of
, this function is differentiable almost everywhere (cf. [
21]), and in the proof, we would like to exploit the differentiability of this function at the point
where it attains its minimum value. If
is not differentiable at this point, it is helpful to consider the directional derivatives of
. Specifically, let
and
denote the right and left derivatives of
, respectively. When
, then it is well known that
, but this observation will not be sufficient to obtain a proper lower bound. For that to hold, we need a stronger condition, namely that
, which will only be true if
is differentiable at its point of minimum,
. Otherwise, the expected lower bound turns out to be
, where
; cf. [
13].
We now turn to our large deviation theorem in the one-dimensional case, where we study the rare-event probabilities for sets C that are away from the true value $\theta_g$. Specifically, we establish an analogue of the LDP, but where a subtle difference arises in the lower bound in the absence of differentiability of the limiting generating function.
We recall that $\theta_n$ is defined using the kernel density estimator $g_n$ defined in (1), whose behavior is dictated by the bandwidth sequence $\{h_n\}$.
Theorem 2. Assume , Hypotheses 1–8 are satisfied, and is the unique zero of . Further assume that and as . Then for any closed set F not containing ,Moreover, for any open set G not including ,whereand the infimum is taken to be infinity if the set is empty. Remark 2. If where , then in both the upper and lower bounds, it is sufficient to evaluate the infimum at the boundary point . That is,Similarly, if where , thenFurthermore, if is achieved at a unique point and is differentiable at , then the right-hand side of (28) reduces to , i.e., the upper and lower bounds coincide and the limits exist. Since the rate function appearing in the upper and lower bounds coincide in this case, we obtain a proper LDP if the resulting rate function has the required regularity properties, in particular, is lower semicontinuous and has compact level sets. The proof of the above theorem relies on (
14) and (15) combined with Theorem 1, together with a change of measure argument characteristic of large deviation analysis. The comparison inequalities in (14) and (15) are critical to obtaining the characterizations in the above theorem, but these are essentially one-dimensional results and their analogues in higher dimensions ($d > 1$) are not immediate. Consequently, when the limiting generating function is not differentiable, new complications arise, which lead to a slightly different, and less explicit, representation of the lower bound.
Next we establish a large deviation theorem for , generalizing the previous theorem to higher dimensions. In the following, let denote the distance between a point and a set .
Theorem 3. Assume Hypotheses 1–8 are satisfied, and assume that and as Then for any closed set F not containing,Moreover, for any open set G not including ,where and for some universal constant and the infimum is taken to be infinity if the set is empty. Remark 3. As we noted for the one-dimensional case in Remark 2, under a differentiability assumption on , the function can be identified as , but in full generality, it is not immediately known that is even nontrivial. Moreover, without differentiability, the infimum in the definition of is more restrictive than what we encountered in the one-dimensional problem. However, if one assumes additional geometry on G, such as a translated cone structure, then one obtains improved estimates in the sense that one can take unbounded regions in the definition of , just as we saw in Theorem 2.2. For further remarks in this direction, see the discussion given after the proof of the theorem.
3. Proofs
We turn first to Proposition 1.
Proof of Proposition 1. Since
is equipped with product topology, it is sufficient to show that if
and
, then
converges to
, where
Let
, and observe that
where the penultimate equation follows by applying the Cauchy–Schwarz inequality. Then by the Cauchy–Schwarz inequality and Hypothesis 5,
. Since Hellinger distance is dominated by the
-distance, in order to complete the proof, it is sufficient to show that
. Now since
, it follows that as
,
Evidently,
and
are nondecreasing and right continuous. Furthermore, if
and
, then
and
, where
,
,
,
. Thus
converges to
G, which is a proper distribution function. Then by Lemma 1 of Boos [
22],
converges to
uniformly on compact sets. This, in turn, implies the
convergence of
to
(by Scheffe’s lemma), which establishes the convergence of
to 0, thus completing the proof of the joint continuity of
.
Next, the uniform convergence (
17) follows by Hypothesis 5, since
Finally, to prove that
is weakly sequentially closed, note that convergence in weak topology implies pointwise convergence, yielding
. Noting that
it follows that
integrates to one, using
convergence, thus completing the proof of the proposition. □
We now turn to the proof of Theorem 1. The proof relies on the large deviation theorem for the kernel density estimator $g_n$ in the weak topology of $L^1(S)$. The next proposition is concerned with the LDP for $\{g_n\}$ in $\mathcal{G}$, equipped with the inherited weak topology from $L^1(S)$. This issue has received considerable attention recently (cf. [23,24]), where it is established that the full LDP may not hold for $\{g_n\}$ in the norm topology, but does hold under the weak topology.
Proposition 2. Assume Hypotheses 1–8 and that and as . Then satisfies the LDP in the weak topology of with good rate function I given by Proof of Theorem 1. As before, let
be equipped with the weak topology. Set
, and define
as follows:
By Hypothesis 5,
. To show that
is continuous, let
as
. Then
where we have used the Cauchy–Schwarz inequality and the fact that the $L^1$ distance dominates the Hellinger distance in (
38). Now by Hypothesis 7, as in the proof of Proposition 1, we have that
as
, establishing the continuity of
. Next, to show that
is bounded, note that
by the Cauchy–Schwarz inequality. Then by Proposition 2, it follows by Varadhan’s integral lemma (see [
10], Theorem 4.3.1) that
This completes the proof of the theorem. □
The proofs of our main results will involve probability bounds on the modulus of continuity of
and
, respectively. Recall that the modulus of continuity
of a function
is given by
Observe that when
is replaced by
or
, the modulus of continuity becomes a random quantity. Our next proposition summarizes the continuity properties of
and
via their modulus of continuity as real-valued functionals from
equipped with the weak topology.
Proposition 3. Assume that Hypotheses 1–8 hold and that and as . Then, with respect to and , the modulus of continuity satisfies the following relations, each with probability one:Similarly, the sequence and satisfy the analogous relations with probability one; namely, Proof. First observe that
converges uniformly to
. To see this, note that if
, then by Proposition 1, it converges in
. Hence
where the last inequality follows using that the Hellinger distance is dominated by the
-distance. We now prove (i). For this we invoke the properties of the modulus of continuity. Observe that
which yields
Next observe that
where the last convergence follows from the uniform convergence of
to 0 as shown in (
42). The proof of (iv) is similar, and specifically is obtained by using that
where the above convergence follows from (
17).
We now turn to the proof of (ii). Using the Cauchy–Schwarz inequality and the definition of Hellinger distance,
where
is continuous since
is continuous in
. Also, since
is compact,
is uniformly continuous. Since the modulus of continuity converges to 0 if and only if
is uniformly continuous, (ii) follows. Turning to (v), notice that, as before,
Now, since
is
continuous, by Hypothesis 5, the proof follows as in (ii) due to the compactness of
. The proofs of (iii) and (vi) are similar to (ii) and (v), respectively, and are therefore omitted. □
Proposition 4. For any and , there exists a positive number such that Proof. By Markov’s inequality and (
46), it follows that for any
,
Since
as
, there exists an
such that for all
,
. Since
is arbitrary, the proposition follows by taking
, for some
. The proof of the second inequality is similar, using (
47). □
Proof of Theorem 2. We begin with the proof of the upper bound. Since we assume that the equation
has a unique solution, it follows from the inequality in (
12) that for any
and
,
where the last equality follows by applying Theorem 1 with
. Since the inequality holds for every
,
Now, noticing that
, we then obtain
Similarly, for
, using (
13), one can show by an analogous calculation that
Now let
and
. Then
and so by (
52) and (
53), it follows that
where the last step follows since
F closed implies
.
Next we turn now to the proof of the lower bound. Let
G be an open set, and let
. Then there exists an
(to be chosen) such that
. Note that
Thus,
We now investigate
. Let
denote the distribution of
, and define
as follows:
Let
, for some
, where
and
. Then
Taking the logarithm, dividing by
n, and then taking the limit as
, we obtain
Now since
, we can apply Theorem IV.1 of [
25] to obtain that the last term on the right-hand side of the previous equation converges to zero. Upon letting
, it follows that
Since the above inequality holds for all
, we conclude that
where
.
By Proposition 4, choosing
, one can find
such that
Since
by the choice of
M, it follows from (
61) that
Taking the supremum on left- and right-hand side over all
yields the required lower bound. □
Turning to the higher dimensional case, we first need the following result, which provides a uniform bound on the Hessian of the objective function .
Lemma 2. Under Hypotheses 1–8, there exists a finite constant such that with probability one, Proof. This is standard. Specifically, note that the
element of the matrix
is given by
Next, writing down the expression for
in terms of the derivatives of the score function
, using the Cauchy–Schwarz inequality along with Hypotheses 3, 4, 6, and 8, and the definition of the matrix norm, the lemma follows. □
In the proof of the lower bound, we will take a somewhat different approach, involving the analysis of
k constraints, and our strategy will be to reduce this to a problem involving a single constraint. Specifically, in (
67) below, we establish that, instead of studying k constraints on a quantity
(which we are about to define), we can cast the problem in terms of a
d-dimensional vector
(defined in (
70) below) belonging to a ball centered at
and of appropriate radius.
To be more precise, let
be open, and consider the probability that we obtain an estimated value
. Let
, and for any
, set
and
. If
is chosen as the estimate, then we must have
for all
j, so, in particular,
(by which we mean that
for all
j in this last probability).
To evaluate the latter probability, observe that by a second-order Taylor expansion,
Using the positive definiteness and uniform boundedness of the matrix
, by Hypothesis 4, we have that for any unit vector
,
where
c is a positive constant independent of
. Thus, for each
j,
Integrating with respect to
and using the definition of
, we then obtain that
where
Let
, where for
:
(We have suppressed
in the notation for
.) Then the inequality
corresponds to an event
described by the occurrence of the inequality
where the right-hand side is always negative for small
(since
) and behaves like a constant multiple of
as this distance tends to infinity. Thus, we can choose a positive constant
such that
and set
. Finally, let
denote the event that
Then for all
j,
, where we recall that
was defined via (
72). Now, since the definition of the event
does not depend on any specific vector
, one can replace the vector
by any unit vector
in
. Hence
and we now derive a large deviation lower bound for the probability on the right-hand side.
Proposition 5. Assume that Hypotheses 1–8 hold, and suppose that G is an open subset of . Assume that and as . Then for any and ,where and the infimum is taken to be infinity if the set is empty. Proof. We begin by studying the limiting generating function of
. By Varadhan’s integral lemma, it follows that
where
Define the
-shifted distribution by
where
denotes the distribution of
. Note by the convexity of
that it is almost everywhere differentiable. Fix
and choose
such that
. Let
be such that
. Then
implying
Now, notice that the limiting cumulant generating function of
under the measure
is given by
Since
is a proper convex function, it is continuous since
is finite in the
, and moreover, by the choice of
, it is differentiable at
. Hence Condition II.1 of [
25] is satisfied. Now, using Theorem IV.1 of [
25], it follows that
Substituting the above into (
80), we obtain
Taking the supremum in
, the proposition follows. □
Proof of Theorem 3: Upper Bound. Let
F be a closed subset of
. Note
compact implies that
F is compact. Let
denote an open cover of
F, and let
denote the finite subcover. Using that
, we then obtain that for any
,
Adding and subtracting
to
and then applying Hölder’s inequality yields
, where
First we study
. For
and
, the Cauchy–Schwarz inequality gives
where
is the Hessian matrix consisting of the second partial derivatives of
. Hence we obtain for any
that
Now by Lemma 2,
Also, for each
, Theorem 1 provides that
Thus
Since the last inequality holds for all
,
Moreover, for each
j,
Hence
The upper bound follows by letting
. □
Proof of Theorem 3: Lower Bound. Let
G be an open subset of
, and let
. Then
is compact, and there exists a collection
such that
forms a finite subcover of
, where
Since
it follows that
where
We now investigate the behavior of
and
. Starting with
, note that
Now by (
74), it follows that
where
is as in (
71) and
. Applying Proposition 3.4, we obtain
where
, and we now observe that
r may be chosen to be
, where
is given as in (
73). Hence we may replace
with
on the right-hand side of the previous equation. Next, using Proposition 4 yields that
Finally, the required lower bound is obtained by maximizing the right-hand side over all
. □
In the proof of the lower bound, it is clear that the choice of
plays a central role, and the rate function
will be minimized when
k is small. As a simple example, suppose that our goal is to obtain a lower bound for
, where
which is a union of two halfspaces. This can be expressed as
, where
and
, which is an example of a
translated cone. Now if
, then we can find two elements which generate the entire set
, in the sense that all other normalized differences lie between these two unit vectors. These two representative points are the unit vectors
and
, and all other normalized differences
lie between these vectors for all
Now going back to (
73), we see that this equation again holds. Furthermore, (
74) holds with
now replaced by an intersection of
two halfspaces rather than of all halfspaces, yielding an unbounded region in the definition of
. This potentially improves the quality of the lower bound compared with what is presented in the statement of Theorem 3. This idea can be potentially generalized to other sets, such as other unions of halfspaces, and so from a practical perspective, could apply somewhat generally.