1. Introduction
Let $\mathbb{A}$ be a finite or countably infinite alphabet, let $P = \{p_k : k \in \mathbb{A}\}$ be a probability distribution on $\mathbb{A}$, and define $K = \sum_{k \in \mathbb{A}} 1_{[p_k > 0]}$, where $1_{[\cdot]}$ is the indicator function, to be the effective cardinality of $\mathbb{A}$ under $P$. An important quantity associated with $P$ is entropy, which is defined by [1] as
$$H = -\sum_{k \in \mathbb{A}} p_k \ln p_k.$$
Here and throughout, we adopt the convention that $0 \ln 0 = 0$.
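To fix the definition computationally, the following Python sketch evaluates the entropy of a given probability vector under the convention $0 \ln 0 = 0$. The function name and interface are ours and are meant only as an illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum_k p_k ln p_k (in nats), with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # drop zero-probability letters: 0 ln 0 = 0
    return float(-np.sum(p * np.log(p)))

# Example: a fair coin has entropy ln 2 ~ 0.6931 nats.
print(entropy([0.5, 0.5]))
```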
Many properties of entropy and related quantities are discussed in [2]. The problem of statistical estimation of entropy has a long history (see the survey paper [3] or the recent book [4]). It is well known that no unbiased estimator of entropy exists and, for this reason, much energy has been focused on deriving estimators with relatively little bias (see [5] and the references therein for a discussion of some, but far from all, of these). Perhaps the most commonly used estimator is the plug-in. Its theoretical properties have been studied going back, at least, to [6], where conditions for consistency and asymptotic normality, in the case of finite alphabets, were derived. It would be almost fifty years before corresponding conditions for the countably infinite case would appear in the literature. Specifically, consistency, both in terms of almost sure and $L_2$ convergence, was verified in [7]. Later, sufficient conditions for asymptotic normality were derived in two steps in [3,8].
Despite its simple form and nice theoretical properties, the plug-in suffers from large finite-sample bias, which has led to the development of modifications that aim to reduce this bias. Two of the most popular are the Miller–Madow estimator of [6] and the jackknife estimator of [9]. The theoretical properties of these estimators have not been studied as extensively in the literature. In this paper, we give sufficient conditions for the asymptotic normality of these two estimators. This is important for deriving confidence intervals and hypothesis tests, and it immediately implies consistency (see, e.g., [4]).
We begin by introducing some notation. We say that a distribution is uniform if and only if its effective cardinality $K$ is finite and, for each $k$, either $p_k = 1/K$ or $p_k = 0$. We write $\xrightarrow{\mathcal{L}}$ to denote convergence in law and $\xrightarrow{P}$ to denote convergence in probability. If $a$ and $b$ are real numbers, we write $a \vee b$ to denote the maximum of $a$ and $b$. When it is not specified, all limits are assumed to be taken as $n \to \infty$.
Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (i.i.d.) random variables on $\mathbb{A}$ under $P$. Let $N_k$, $k \in \mathbb{A}$, be the observed letter counts in the sample, i.e., $N_k = \sum_{i=1}^{n} 1_{[X_i = k]}$, and let $\hat{p}_k$, where $\hat{p}_k = N_k / n$, be the corresponding relative frequencies. Perhaps the most intuitive estimator of $H$ is the plug-in, which is given by
$$\hat{H} = -\sum_{k \in \mathbb{A}} \hat{p}_k \ln \hat{p}_k. \quad (1)$$
When the effective cardinality, $K$, is finite, [10] showed that the bias of $\hat{H}$ is
$$\mathrm{E}\big(\hat{H}\big) - H = -\frac{K-1}{2n} + \frac{1}{12 n^2}\left(1 - \sum_{k : p_k > 0} \frac{1}{p_k}\right) + O\big(n^{-3}\big). \quad (2)$$
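For illustration, the plug-in estimator can be computed directly from the observed letter counts. The Python sketch below does this; the function name is ours, and the snippet is a minimal illustration rather than an implementation taken from the paper.

```python
from collections import Counter
import numpy as np

def plug_in_entropy(sample):
    """Plug-in estimator (1): entropy of the empirical relative frequencies, in nats."""
    n = len(sample)
    counts = np.array(list(Counter(sample).values()), dtype=float)
    p_hat = counts / n                # relative frequencies of the observed letters
    return float(-np.sum(p_hat * np.log(p_hat)))
```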
One of the simplest and earliest approaches aiming to reduce the bias of $\hat{H}$ is to estimate the first-order term. Specifically, let $\hat{K} = \sum_{k \in \mathbb{A}} 1_{[N_k > 0]}$ be the number of letters observed in the sample and consider an estimator of the form
$$\hat{H}_{MM} = \hat{H} + \frac{\hat{K} - 1}{2n}. \quad (3)$$
This estimator is often attributed to [6] and is known as the Miller–Madow estimator. Note that, for finite $K$,
$$\mathrm{E}\big(\hat{H}_{MM}\big) - H = -\frac{K - \mathrm{E}(\hat{K})}{2n} + O\big(n^{-2}\big). \quad (4)$$
Since $K - \mathrm{E}(\hat{K}) = \sum_{k : p_k > 0} (1 - p_k)^n$, where each term is bounded by $(1 - p_*)^n$ with $p_* = \min\{p_k : p_k > 0\}$, decays exponentially fast, it follows that, for finite $K$, the bias of $\hat{H}_{MM}$ is $O\big(n^{-2}\big)$.
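The Miller–Madow correction only requires, in addition, the number of distinct letters observed in the sample. The sketch below builds on the hypothetical plug_in_entropy function from the previous snippet and is likewise only illustrative.

```python
def miller_madow_entropy(sample):
    """Miller-Madow estimator (3): plug-in plus (K_hat - 1) / (2n),
    where K_hat is the number of distinct letters observed in the sample."""
    n = len(sample)
    k_hat = len(set(sample))
    return plug_in_entropy(sample) + (k_hat - 1) / (2 * n)
```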
Among the many estimators in the literature aimed at reducing bias in entropy estimation, the Miller–Madow estimator is one of the most commonly used. Its popularity is due to its simplicity, its intuitive appeal, and, more importantly, its good performance across a wide range of different distributions including those on countably infinite alphabets. See, for instance, the simulation study in [
5].
The jackknife entropy estimator is another commonly used estimator designed to reduce the bias of the plug-in. It is calculated in three steps:
- 1. for each $i = 1, 2, \dots, n$, construct $\hat{H}_{(-i)}$, which is a plug-in estimator based on a sub-sample of size $n - 1$ obtained by leaving the $i$th observation out;
- 2. obtain $\bar{H} = \frac{1}{n} \sum_{i=1}^{n} \hat{H}_{(-i)}$; and then
- 3. compute the jackknife estimator
$$\hat{H}_{JK} = n \hat{H} - (n - 1) \bar{H}. \quad (5)$$
Equivalently, (5) can be written as
$$\hat{H}_{JK} = \hat{H} + (n - 1)\big(\hat{H} - \bar{H}\big). \quad (6)$$
The jackknife estimator of entropy was first described by [9]. From (2), it may be verified that, when $K < \infty$, the bias of $\hat{H}_{JK}$ is
$$\mathrm{E}\big(\hat{H}_{JK}\big) - H = O\big(n^{-2}\big). \quad (7)$$
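One way to compute the jackknife estimate without looping over all $n$ leave-one-out sub-samples is to group them by the letter that is removed, since removing any one of the $N_k$ copies of letter $k$ produces the same sub-sample counts. The Python sketch below uses this grouping together with the hypothetical plug_in_entropy function from above; it is an illustration of (5), not the authors' code.

```python
from collections import Counter
import numpy as np

def jackknife_entropy(sample):
    """Jackknife estimator (5): n * H_hat - (n - 1) * average of the leave-one-out plug-ins."""
    n = len(sample)
    counts = Counter(sample)
    h_full = plug_in_entropy(sample)

    loo_sum = 0.0
    for letter, n_k in counts.items():
        c = counts.copy()
        c[letter] -= 1                # leave one observation of this letter out
        p = np.array([v for v in c.values() if v > 0], dtype=float) / (n - 1)
        h_loo = float(-np.sum(p * np.log(p)))
        loo_sum += n_k * h_loo        # n_k observations give this same leave-one-out value
    h_bar = loo_sum / n

    return n * h_full - (n - 1) * h_bar
```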
Both the Miller–Madow and the jackknife estimators are adjusted versions of the plug-in. When the effective cardinality is finite, i.e., $K < \infty$, the asymptotic normality of both can be easily verified. A question of theoretical interest is whether this normality still holds when the effective cardinality is countably infinite. In this paper, we give sufficient conditions for $\hat{H}_{MM}$ and $\hat{H}_{JK}$ to be asymptotically normal on countably infinite alphabets and provide several illustrative examples. The rest of the paper is organized as follows. Our main results for both the Miller–Madow and the jackknife estimators are given in Section 2. A small simulation study is given in Section 3. This is followed by a brief discussion in Section 4. Proofs are postponed to Section 5.
2. Main Results
We begin by recalling a sufficient condition due to [
8] for the asymptotic normality of the plug-in estimator.
Condition 1. The distribution, $P$, satisfies
$$\sum_{k \in \mathbb{A}} p_k \big(\ln p_k\big)^2 < \infty \quad (8)$$
and there exists an integer-valued function such that, as $n \to \infty$,
- 1. …,
- 2. …, and
- 3. … .
Note that, by Jensen's inequality (see, e.g., [2]), (8) implies that
$$\sum_{k \in \mathbb{A}} p_k \big(\ln p_k\big)^2 \ge H^2,$$
where equality holds, i.e., $\sigma^2 = 0$ with $\sigma^2$ as in (9) below, if and only if $P$ is a uniform distribution. Thus, when (8) holds and $P$ is not uniform, we have $\sigma^2 \in (0, \infty)$. The following result is given in [8].
Lemma 1. Let $P$ be a distribution, which is not uniform, and set
$$\sigma^2 = \sum_{k \in \mathbb{A}} p_k \big(\ln p_k\big)^2 - H^2 \quad \text{and} \quad \hat{\sigma}^2 = \sum_{k \in \mathbb{A}} \hat{p}_k \big(\ln \hat{p}_k\big)^2 - \hat{H}^2. \quad (9)$$
If $P$ satisfies Condition 1, then $\sigma^2 \in (0, \infty)$,
$$\sqrt{n}\big(\hat{H} - H\big) \xrightarrow{\mathcal{L}} N\big(0, \sigma^2\big), \quad \text{and} \quad \frac{\sqrt{n}\big(\hat{H} - H\big)}{\hat{\sigma}} \xrightarrow{\mathcal{L}} N(0, 1).$$
The following is useful for checking when Condition 1 holds.
Lemma 2. Let $P = \{p_k\}$ and $Q = \{q_k\}$ be two distributions and assume that $Q$ satisfies Condition 1. If there exists a $C > 0$ such that, for large enough $k$,
$$p_k \le C\, q_k,$$
then $P$ satisfies Condition 1 as well.
In [8], it is shown that Condition 1 holds for a specific distribution whose tails have an explicit form (up to a normalizing constant). It follows from Lemma 2 that any distribution with tails lighter than this satisfies Condition 1 as well.
We are interested in finding conditions under which the result of Lemma 1 can be extended to bias-adjusted modifications of $\hat{H}$. Let $\tilde{H}$ be any bias-adjusted estimator of the form
$$\tilde{H} = \hat{H} + B_n, \quad (10)$$
where $B_n$ is an estimate of the bias. Combining Lemma 1 with Slutsky's theorem immediately gives the following.
Theorem 1. Let $P$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Condition 1 holds and $\sqrt{n}\, B_n \xrightarrow{P} 0$, then $\sigma^2 \in (0, \infty)$,
$$\sqrt{n}\big(\tilde{H} - H\big) \xrightarrow{\mathcal{L}} N\big(0, \sigma^2\big), \quad \text{and} \quad \frac{\sqrt{n}\big(\tilde{H} - H\big)}{\hat{\sigma}} \xrightarrow{\mathcal{L}} N(0, 1).$$
For the Miller–Madow estimator and the jackknife estimator, respectively, the bias correction term, $B_n$, in (10) takes the form
$$B_n = \frac{\hat{K} - 1}{2n} \quad \text{and} \quad B_n = (n - 1)\big(\hat{H} - \bar{H}\big).$$
Below, we give sufficient conditions for when $\sqrt{n}\, B_n \xrightarrow{P} 0$ in each of these two cases.
2.1. Results for the Miller–Madow Estimator
Condition 2. The distribution, $P$, satisfies that, for sufficiently large $k$,
$$p_k \le \dots, \quad (11)$$
where $\{a_k\}$ and $\{b_k\}$ are two sequences such that
- 1. … as $k \to \infty$, and, furthermore,
  - (a) the function $a_k$ is eventually nondecreasing, and
  - (b) there exists an … such that … (12);
- 2. … (13).
Since this condition only requires that $p_k$, for sufficiently large $k$, is upper bounded in the appropriate way, we immediately get the following.
Lemma 3. Let $P = \{p_k\}$ and $Q = \{q_k\}$ be two distributions and assume that $Q$ satisfies Condition 2. If there exists a $C > 0$ such that, for large enough $k$,
$$p_k \le C\, q_k,$$
then $P$ satisfies Condition 2 as well.
We now give our main results for the Miller–Madow estimator.
Theorem 2. Let $P$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Condition 2 holds, then $\sigma^2 \in (0, \infty)$,
$$\sqrt{n}\big(\hat{H}_{MM} - H\big) \xrightarrow{\mathcal{L}} N\big(0, \sigma^2\big), \quad \text{and} \quad \frac{\sqrt{n}\big(\hat{H}_{MM} - H\big)}{\hat{\sigma}} \xrightarrow{\mathcal{L}} N(0, 1).$$
In the proof of the theorem, we will show that Condition 2 implies that Condition 1 holds. Condition 2 requires $p_k$ to decay slightly faster than … by the two factors $a_k$ and $b_k$, where $a_k$ and $b_k$ satisfy (12) and (13), respectively. While (13) is clear in its implication on $b_k$, (12) is much less so on $a_k$. To have a better understanding of (12), we give an important situation where (12) holds. Consider the case … . In this case, (12) holds for any … .
We now give a more general situation, which shows just how slow $a_k$ can be. First, we recall the iterated logarithm function. Define $\ln_r$, recursively for sufficiently large $k$, by $\ln_1(k) = \ln(k)$ and $\ln_r(k) = \ln\big(\ln_{r-1}(k)\big)$ for $r \ge 2$. By induction, it can be shown that $\ln_r(k) \to \infty$ as $k \to \infty$ for each $r \ge 1$.
Lemma 4. The function $\ln_r$ satisfies (12) with … for any … .
We now give three examples.
Example 1. Let $P$ be such that, for sufficiently large $k$, $p_k \le \dots$, where the constants appearing in the bound are fixed. In this case, Condition 2 holds with suitable choices of $a_k$ and $b_k$ in (11). We can consider a more general form, which allows for even heavier tails.
Example 2. Let $r$ be an integer with … and let $P$ be such that, for sufficiently large $k$, $p_k \le \dots$, where the constants appearing in the bound are fixed. In this case, Condition 2 holds with suitable choices of $a_k$ and $b_k$ in (11). The fact that the corresponding $b_k$ satisfies (13) follows by the integral test for convergence. It follows from Lemma 3 that any distribution with tails lighter than those in this example must satisfy Condition 2. On the other hand, the tails cannot get too much heavier.
Example 3. Let $P$ be such that $p_k = \dots$, where the constant involved is a normalizing constant. In this case, Condition 2 does not hold. However, Condition 1 does hold.
2.2. Results for the Jackknife Estimator
For any distribution $P$, let $b_n = \mathrm{E}\big(\hat{H}\big) - H$ be the bias of the plug-in based on a sample of size $n$.
Condition 3. The distribution, $P$, satisfies … .
Theorem 3. Let $P$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Conditions 1 and 3 hold, then $\sigma^2 \in (0, \infty)$,
$$\sqrt{n}\big(\hat{H}_{JK} - H\big) \xrightarrow{\mathcal{L}} N\big(0, \sigma^2\big), \quad \text{and} \quad \frac{\sqrt{n}\big(\hat{H}_{JK} - H\big)}{\hat{\sigma}} \xrightarrow{\mathcal{L}} N(0, 1).$$
It is not clear to us whether Conditions 1 and 3 are equivalent or, if not, which is more stringent. For that reason, in the statement of Theorem 3, both conditions are imposed. The proof of the theorem uses the following lemma, which gives some insight into $b_n$ and Condition 3.
Lemma 5. For any probability distribution $P$, we have … and … .
We now give a condition, which implies Condition 3 and tends to be easier to check.
Proposition 1. If the distribution $P$ is such that there exists an … with …, then Condition 3 is satisfied.
We now give an example where this holds.
Example 4. Let $P$ be given by …, where … is fixed and … is a normalizing constant. In this case, the assumption of Proposition 1 holds and thus Condition 3 is satisfied.
To see that the assumption of Proposition 1 holds in this case, fix … . Note that …, and thus … .
3. Simulations
The main application of the asymptotic normality results given in this paper is the construction of asymptotic confidence intervals and hypothesis tests. For instance, if $P$ satisfies the assumptions of Theorem 2, then an asymptotic $100(1 - \alpha)\%$ confidence interval for $H$ is given by
$$\hat{H}_{MM} \pm z_{\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}},$$
where $z_{\alpha/2}$ is a number such that $\mathrm{P}\big(Z > z_{\alpha/2}\big) = \alpha/2$ and $Z$ is a standard normal random variable. Similarly, if the assumptions of Theorem 3 are satisfied, then we can replace $\hat{H}_{MM}$ with $\hat{H}_{JK}$, and if the assumptions of Lemma 1 are satisfied, then we can replace $\hat{H}_{MM}$ with $\hat{H}$. In this section, we give a small-scale simulation study to evaluate the finite sample performance of these confidence intervals.
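Such an interval is straightforward to compute. The Python sketch below does so for the Miller–Madow case, taking $\hat{\sigma}^2 = \sum_k \hat{p}_k (\ln \hat{p}_k)^2 - \hat{H}^2$ as the plug-in variance estimate (cf. (9)); the function name and the use of SciPy for the normal quantile are our choices, and the snippet is illustrative rather than the authors' implementation.

```python
from collections import Counter
import numpy as np
from scipy import stats

def mm_confidence_interval(sample, alpha=0.05):
    """Asymptotic 100(1 - alpha)% confidence interval for H centered at the
    Miller-Madow estimate, using the plug-in estimate of sigma."""
    n = len(sample)
    counts = np.array(list(Counter(sample).values()), dtype=float)
    p_hat = counts / n
    h_hat = float(-np.sum(p_hat * np.log(p_hat)))
    sigma_hat = np.sqrt(max(float(np.sum(p_hat * np.log(p_hat) ** 2)) - h_hat ** 2, 0.0))
    h_mm = h_hat + (len(counts) - 1) / (2 * n)      # Miller-Madow correction
    z = stats.norm.ppf(1 - alpha / 2)               # standard normal quantile
    half_width = z * sigma_hat / np.sqrt(n)
    return h_mm - half_width, h_mm + half_width
```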
For concreteness, we focus on the geometric distribution, which corresponds to
$$p_k = p(1 - p)^{k - 1}, \quad k = 1, 2, \dots,$$
where $p \in (0, 1)$ is a parameter. The true entropy of this distribution is given by $H = -\ln p - \frac{(1 - p)\ln(1 - p)}{p}$. In this case, Conditions 1, 2, and 3 all hold. For our simulations, we took … . The simulations were performed as follows. We began by simulating a random sample of size $n$ and used it to evaluate a $100(1 - \alpha)\%$ confidence interval for the given estimator. We then checked to see if the true value of $H$ was in the interval or not. This was repeated 5000 times and the proportion of times when the true value was in the interval was calculated. This proportion should be close to $1 - \alpha$ when the confidence interval works well. We repeated this for sample sizes ranging from 20 to 1000 in increments of 10. The results are given in Figure 1. We can see that the Miller–Madow and jackknife estimators consistently outperform the plug-in. It may be interesting to note that, although the proofs of Theorems 1–3 are based on showing that the bias correction term approaches zero, this does not mean that the bias correction term is not useful. On the contrary, bias correction improves the finite sample performance of the asymptotic confidence intervals.
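A coverage experiment of this type is easy to reproduce. The sketch below reuses the hypothetical mm_confidence_interval function from the previous snippet and estimates coverage under a geometric source; the parameter value, sample size, and number of replications are illustrative and are not claimed to match those behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def coverage(p, n, reps=2000, alpha=0.05):
    """Fraction of replications whose interval contains the true geometric entropy."""
    h_true = -np.log(p) - (1 - p) * np.log(1 - p) / p   # entropy of Geometric(p)
    hits = 0
    for _ in range(reps):
        sample = rng.geometric(p, size=n)               # letters 1, 2, 3, ...
        lo, hi = mm_confidence_interval(sample, alpha)
        hits += int(lo <= h_true <= hi)
    return hits / reps

# Coverage should approach 1 - alpha as n grows.
print(coverage(p=0.3, n=200))
```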
4. Discussion
In this paper, we gave sufficient conditions for the asymptotic normality of the Miller–Madow and the jackknife estimators of entropy. While our focus is on the case of countably infinite alphabets, our results are formulated and proved in the case where the effective cardinality $K$ may be finite or countably infinite. As such, they hold in the case of finite alphabets as well. In fact, for finite alphabets, Conditions 1–3 always hold and we have asymptotic normality so long as the underlying distribution is not uniform. The difficulty with the uniform distribution is that it is the unique distribution for which $\sigma^2$, as given by (9), is zero (see the discussion just below Condition 1). When the distribution is uniform, the asymptotic distribution is chi-squared with $K - 1$ degrees of freedom (see [6]).
In general, we do not know if our conditions are necessary. However, they cover most distributions of interest. The only distributions that they preclude are ones with extremely heavy tails. While, in complete generality, Conditions 1–3 may look complicated, they are easily checked in many situations. For instance, Condition 2 always holds when, for large enough $k$, … for some …, i.e., when … . If the alphabet $\mathbb{A}$ is the set of natural numbers, then this is equivalent to the distribution having a finite variance. Similarly, Conditions 1 and 3 both hold when, for large enough $k$, … for some …, i.e., when … . If the alphabet $\mathbb{A}$ is the set of natural numbers, then this is equivalent to the distribution having a finite mean.