Article

An Extended Result on the Optimal Estimation Under the Minimum Error Entropy Criterion

Badong Chen, Guangmin Wang, Nanning Zheng and Jose C. Principe

1 Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an 710049, China
2 Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
* Author to whom correspondence should be addressed.
Entropy 2014, 16(4), 2223-2233; https://doi.org/10.3390/e16042223
Submission received: 24 January 2014 / Revised: 2 April 2014 / Accepted: 4 April 2014 / Published: 17 April 2014

Abstract

The minimum error entropy (MEE) criterion has been successfully used in fields such as parameter estimation, system identification and supervised machine learning. In general, there is no explicit expression for the optimal MEE estimate unless some constraints on the conditional distribution are imposed. A recent paper has proved that if the conditional density is conditionally symmetric and unimodal (CSUM), then the optimal MEE estimate (with Shannon entropy) equals the conditional median. In this study, we extend this result to the generalized MEE estimation where the optimality criterion is the Renyi entropy or, equivalently, the α-order information potential (IP).

MSC Codes: 62B10

1. Introduction

Consider two random variables: X ∈ ℝ^n, an unknown parameter to be estimated, and Y ∈ ℝ^m, the observation or measurement. An estimate of X based on Y is, in general, a measurable function of Y, denoted by X̂ = g(Y) ∈ G, where G stands for the collection of all Borel measurable functions with respect to the σ-field generated by Y. The optimal estimate g*(Y) can be determined by minimizing a certain risk, which is usually a functional of the error distribution. If X has conditional probability density function (PDF) p(x|y), then:

$$g^* = \arg\min_{g \in \mathcal{G}} R\big(p_g(x)\big) \tag{1}$$

where p_g(x) is the PDF of the estimation error E = X − g(Y), and R(.) is the risk function R : ℰ → ℝ, where ℰ denotes the collection of all possible PDFs of the error. Let F(y) be the distribution function of Y; the PDF p_g(x) is then:

$$p_g(x) = \int_{\mathbb{R}^m} p\big(x + g(y)\,\big|\,y\big)\, dF(y) \tag{2}$$

As one can see from Equation (2), the problem of choosing an optimal g is actually the problem of shifting the components of a mixture of the conditional PDF so as to minimize the risk R.
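To make Equation (2) concrete, the short sketch below builds the error PDF of an estimator as a mixture of shifted conditional densities and evaluates a risk on it. It is only an illustration under assumed ingredients: the discrete observation values, their probabilities, the Gaussian conditional densities and the two candidate estimators are all made up here, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)            # grid for the error variable
dx = x[1] - x[0]

y_vals = [0.0, 1.0, 2.0]                  # assumed discrete observation values
p_y = [0.5, 0.3, 0.2]                     # P(Y = y), playing the role of dF(y)
cond_mean = {0.0: -1.0, 1.0: 0.5, 2.0: 3.0}   # assumed mean/median of p(x|y)
cond_std = 1.0

def error_pdf(g):
    """p_g(x) = sum_y P(y) * p(x + g(y) | y), i.e., Equation (2) for discrete Y."""
    return sum(w * norm.pdf(x + g(y), loc=cond_mean[y], scale=cond_std)
               for y, w in zip(y_vals, p_y))

def mse_risk(pg):
    """R_MSE(p_g) = integral of x^2 p_g(x) dx, evaluated on the grid."""
    return (x**2 * pg).sum() * dx

g_best = lambda y: cond_mean[y]           # aligns every mixture component at zero
g_zero = lambda y: 0.0                    # ignores the observation altogether

print("MSE risk, g = conditional mean:", mse_risk(error_pdf(g_best)))
print("MSE risk, g = 0               :", mse_risk(error_pdf(g_zero)))
```

Shifting each component so that it is centered at zero (here, by the conditional mean) yields the smaller risk, which is exactly the "shifting the components of a mixture" picture described above.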

The risk function plays a central role in estimation-related problems, since it determines the performance surface and hence governs both the optimal solution and the performance of the search algorithms. Traditional Bayes risk functions are, in general, defined as the expected value of a certain loss function (usually a nonlinear mapping) of the error:

$$R_{\mathrm{Bayes}}\big(p_g(x)\big) = \int_{\mathbb{R}^n} l(x)\, p_g(x)\, dx \tag{3}$$

where l(.) is the loss function. The most common Bayes risk function used for estimation is the mean square error (MSE), also called the squared or quadratic error risk, defined by R_MSE(p_g(x)) = ∫_{ℝ^n} ||x||₂² p_g(x) dx (in this paper, ||.||_p denotes the p-norm). Using the MSE as the risk, the optimal estimate of X is simply the conditional mean π(y) ≜ mean[p(.|y)]. The popularity of the MSE is due to its simplicity and its optimality for linear Gaussian cases [1–3]. However, MSE is not always a superior risk function, especially for non-linear and non-Gaussian situations, since it only takes into account the second-order statistics. Therefore, many alternative Bayes risk functions have been used in practical applications. The mean absolute deviation (MAD), R_MAD(p_g(x)) = ∫_{ℝ^n} ||x||₁ p_g(x) dx, with which the optimal estimate is the conditional median μ(y) ≜ median[p(.|y)] (here the median of a random vector is defined as the element-wise median vector), is a robust risk function and has been successfully used in adaptive filtering in impulsive noise environments [4]. The mean 0–1 loss, R_{0–1}(p_g(x)) = ∫_{ℝ^n} l_{0–1}(x) p_g(x) dx, where l_{0–1}(.) denotes the 0–1 loss function (the 0–1 loss function has been frequently used in statistics and decision theory: if the error is a discrete variable, l_{0–1}(x) = 𝕀(x ≠ 0), where 𝕀(.) is the indicator function, whereas if the error is a continuous variable, l_{0–1}(x) is defined as l_{0–1}(x) = 1 − δ(x), where δ(.) is the Dirac delta function), yields the optimal estimate ζ(y) ≜ mode[p(.|y)], i.e., the conditional mode (the mode of a continuous probability distribution is the value at which its PDF attains its maximum), which is also the maximum a posteriori (MAP) estimate if p(.|y) is regarded as the posterior density (the MAP estimate is a limit of Bayes estimators under the 0–1 loss function, but is generally not itself a Bayes estimator). Other important Bayes risk functions include the mean p-power error [5], Huber's M-estimation cost [6], and the risk-sensitive cost [7]. For the general Bayes risk in Equation (3), there is no explicit expression for the optimal estimate unless some conditions on l(x) and/or the conditional density p(x|y) are imposed. As shown in [8], if l(x) is even and convex and the conditional density p(x|y) is symmetric in x, the optimal estimate will be the conditional mean (or, equivalently, the conditional median).
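The correspondence between these Bayes risks and the conditional mean, median and mode can be checked numerically for a single fixed y. The skewed two-component conditional density below is an assumed example; the sketch scans candidate estimates and shows that the MSE risk is minimized near the conditional mean and the MAD risk near the conditional median, while the conditional mode is reported as the MAP (mean 0–1 loss) estimate.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 15, 20001)
dx = x[1] - x[0]
# assumed skewed conditional density p(x|y) for one fixed y
p = 0.7 * norm.pdf(x, 0.0, 1.0) + 0.3 * norm.pdf(x, 4.0, 2.0)
p /= p.sum() * dx

cond_mean = (x * p).sum() * dx
cond_median = x[np.searchsorted(np.cumsum(p) * dx, 0.5)]
cond_mode = x[np.argmax(p)]

cands = np.linspace(-2, 6, 801)
mse = [((x - c) ** 2 * p).sum() * dx for c in cands]        # quadratic loss
mad = [(np.abs(x - c) * p).sum() * dx for c in cands]       # absolute loss

print(f"conditional mean   {cond_mean:.3f} vs MSE minimizer {cands[np.argmin(mse)]:.3f}")
print(f"conditional median {cond_median:.3f} vs MAD minimizer {cands[np.argmin(mad)]:.3f}")
print(f"conditional mode   {cond_mode:.3f} (MAP / mean 0-1 loss estimate)")
```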

Besides the traditional Bayes risk functions, the error entropy (EE) can also be used as a risk function in estimation problems. Using Shannon’s definition of entropy [9], the EE risk function is:

$$R_S\big(p_g(x)\big) = -\int_{\mathbb{R}^n} p_g(x) \log p_g(x)\, dx \tag{4}$$

As the entropy measures the average dispersion or uncertainty of a random variable, its minimization makes the error concentrated. Different from conventional Bayes risks, the "loss function" of the EE risk (4) is −log p_g(x), which is directly related to the error's PDF. Therefore, when using the EE risk, we are nonlinearly transforming the error by its own PDF. In 1970, Weidemann and Stear published a paper entitled "Entropy Analysis of Estimating Systems" [10], in which they studied the parameter estimation problem using the error entropy as a criterion functional. They proved that minimizing the error entropy is equivalent to minimizing the mutual information between the error and the observation, and also that the reduction in error entropy is upper-bounded by the amount of information obtained by observation. Later, Tomita et al. [11] and Kalata and Priemer [12] studied the estimation and filtering problems from the viewpoint of information theory and derived the famed Kalman filter as a special case of minimum-error-entropy (MEE) linear estimators. Like most Bayes risks, the EE risk (4) has no explicit expression for the optimal estimate unless some constraints on the conditional density p(x|y) are imposed. In a recent paper [13], Chen and Geman proved that, if p(x|y) is conditionally symmetric and unimodal (CSUM), the MEE estimate (the optimal estimate under the EE risk) will be the conditional median (or, equivalently, the conditional mean or mode). Table 1 gives a summary of the optimal estimates for several risk functions. Since the entropy of a PDF remains unchanged under translation, the MEE estimator is in general restricted to be an unbiased one (i.e., with zero-mean error).
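The Weidemann–Stear equivalence can be sanity-checked with a small discrete example. The sketch below uses an arbitrary assumed joint pmf (not taken from [10]) and integer-valued estimators, so that the error stays on the integer grid; it verifies that H(e) − I(e; Y) = H(X|Y) does not depend on g, which is exactly why minimizing the error entropy over g is the same as minimizing the mutual information between the error and the observation.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
nx, ny = 7, 3
joint = rng.random((nx, ny))
joint /= joint.sum()                       # assumed joint pmf P(X = i, Y = j)
p_y = joint.sum(axis=0)

def H(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

H_X_given_Y = sum(p_y[j] * H(joint[:, j] / p_y[j]) for j in range(ny))

def entropy_and_mi(g):
    """Return H(e) and I(e; Y) for the error e = X - g(Y), with g integer-valued."""
    pe_joint = {}                          # P(e = k, Y = j)
    for i, j in product(range(nx), range(ny)):
        k = i - g[j]
        pe_joint[(k, j)] = pe_joint.get((k, j), 0.0) + joint[i, j]
    errs = sorted({k for k, _ in pe_joint})
    pe = np.array([sum(pe_joint.get((k, j), 0.0) for j in range(ny)) for k in errs])
    H_e = H(pe)
    H_e_given_Y = sum(p_y[j] * H(np.array([pe_joint.get((k, j), 0.0) for k in errs]) / p_y[j])
                      for j in range(ny))
    return H_e, H_e - H_e_given_Y

for g in ([0, 0, 0], [1, 3, 5], [3, 2, 6]):    # arbitrary integer estimators g(y)
    H_e, I_eY = entropy_and_mi(g)
    print(f"g={g}: H(e)={H_e:.4f}  I(e;Y)={I_eY:.4f}  "
          f"H(e)-I(e;Y)={H_e - I_eY:.4f}  H(X|Y)={H_X_given_Y:.4f}")
```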

In statistical information theory, there are many extensions to Shannon’s original definition of entropy. Renyi’s entropy is one of the parametrically extended entropies. Given a random variable X with PDF p(x), α -order Renyi entropy is defined by [14]:

$$H_\alpha(X) = \frac{1}{1-\alpha} \log\left(\int_{\mathbb{R}^n} \big(p(x)\big)^{\alpha}\, dx\right) \tag{5}$$

where α > 0, and α ≠ 1. The entropy definition (5) becomes the usual Shannon entropy as α → 1. Renyi entropy can be used to define a generalized EE risk:

$$R_\alpha\big(p_g(x)\big) = \frac{1}{1-\alpha} \log\left(\int_{\mathbb{R}^n} \big(p_g(x)\big)^{\alpha}\, dx\right) \tag{6}$$

In recent years, the EE risk (6) has been successfully used as an adaptation cost in information theoretic learning (ITL) [15–22]. It has been shown that the nonparametric kernel (Parzen window) estimator of Renyi entropy (especially when α = 2) is more computationally efficient than that of Shannon entropy [15]. The argument of the logarithm in Renyi entropy, denoted by V_α (V_α = ∫_{ℝ^n} (p(x))^α dx), is called the α-order information potential (IP); this quantity is called the information potential since each term in its kernel estimator can be interpreted as a potential between two particles (see [15] for the physical interpretation of the kernel estimator of the information potential). As the logarithm is a monotonic function, the minimization of Renyi entropy is equivalent to the minimization (when α < 1) or maximization (when α > 1) of the information potential. In practical applications, the information potential has been frequently used as an alternative to Renyi entropy [15].
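For α = 2, the information potential V_2 = ∫ p_e(x)² dx admits the well-known pairwise kernel (Parzen) estimator discussed in [15]: plugging a Gaussian Parzen estimate of the error PDF into the integral and carrying out the Gaussian convolution analytically gives a double sum over sample pairs with bandwidth σ√2. The sketch below is an illustration only; the error samples and the bandwidth are assumed, and in practice the bandwidth would be chosen by the user or by a rule of thumb.

```python
import numpy as np

def gauss(u, sigma):
    return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def quadratic_ip(errors, sigma=0.5):
    """Kernel (Parzen) estimate of V_2 = int p_e(x)^2 dx:
    (1/N^2) * sum_i sum_j G_{sigma*sqrt(2)}(e_i - e_j)."""
    e = np.asarray(errors)
    diff = e[:, None] - e[None, :]
    return gauss(diff, sigma * np.sqrt(2)).mean()

rng = np.random.default_rng(1)
e_tight = 0.3 * rng.standard_normal(500)   # concentrated errors (assumed samples)
e_wide = 2.0 * rng.standard_normal(500)    # dispersed errors (assumed samples)

for name, e in [("concentrated", e_tight), ("dispersed", e_wide)]:
    V2 = quadratic_ip(e)
    print(f"{name:>12}: V_2 = {V2:.4f}, Renyi H_2 = {-np.log(V2):.4f}")
```

Concentrated errors give a larger V_2 and a smaller H_2, consistent with the statement above that, for α > 1, minimizing the Renyi entropy amounts to maximizing the information potential.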

A natural and important question now arises: what is the optimal estimate under the generalized EE risk (6)? We do not know the answer to this question in the general case. In this work, however, we extend the results of Chen and Geman [13] to a more general case and show that, if the conditional density p(x|y) is CSUM, the generalized MEE estimate will also be the conditional median (or, equivalently, the conditional mean or mode).

2. Main Theorem and the Proof

In this section, our focus is on the α -order information potential (IP), but the conclusions drawn can be immediately transferred to Renyi entropy. The main theorem of the paper is as follows.

Theorem 1

Assume for every value y ∈ ℝ^m that the conditional PDF p(x|y) is conditionally symmetric (rotation invariant for the multivariate case) and unimodal (CSUM) in x ∈ ℝ^n, and let μ(y) = median[p(.|y)]. If the α-order information potential satisfies V_α(X − μ(Y)) < ∞ (α > 0, α ≠ 1), then:

$$\begin{cases} V_\alpha\big(X - \mu(Y)\big) \le V_\alpha\big(X - g(Y)\big), & \text{if } 0 < \alpha < 1 \\ V_\alpha\big(X - \mu(Y)\big) \ge V_\alpha\big(X - g(Y)\big), & \text{if } \alpha > 1 \end{cases} \tag{7}$$

for all g : ℝ^m → ℝ^n for which V_α(X − g(Y)) < ∞.

Remark

As p(x|y) is CSUM, the conditional median μ (y) in Theorem 1 is the same as the conditional mean π (y) and conditional mode ζ (y). According to the relationship between information potential and Renyi entropy, the inequalities in Equation (7) are equivalent to:

$$H_\alpha\big(X - \mu(Y)\big) \le H_\alpha\big(X - g(Y)\big) \tag{8}$$
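The inequalities (7) and (8) can be checked numerically for a simple CSUM family. In the assumed example below (Gaussian conditional densities over a discrete Y, with arbitrary misaligned shifts as the competitor), aligning the components at the conditional median gives the smallest V_α for α < 1, the largest V_α for α > 1, and the smallest H_α in both cases.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-30, 30, 12001)
dx = x[1] - x[0]
y_vals = [0, 1, 2, 3]
p_y = [0.4, 0.3, 0.2, 0.1]                     # assumed P(Y = y)
mu = {0: -2.0, 1: 0.0, 2: 1.5, 3: 4.0}         # conditional medians (assumed)
sd = {0: 0.8, 1: 1.0, 2: 1.5, 3: 2.0}          # CSUM: each p(x|y) is Gaussian

def error_pdf(g):
    """p_g(x) = sum_y P(y) p(x + g[y] | y), Equation (2)."""
    return sum(w * norm.pdf(x + g[y], loc=mu[y], scale=sd[y])
               for y, w in zip(y_vals, p_y))

def V(pg, alpha):
    return (pg ** alpha).sum() * dx

g_med = {y: mu[y] for y in y_vals}                                     # median-aligned
g_off = {y: mu[y] + d for y, d in zip(y_vals, [1.0, -2.0, 0.5, 3.0])}  # arbitrary shifts

for alpha in (0.5, 2.0):
    V_med, V_off = V(error_pdf(g_med), alpha), V(error_pdf(g_off), alpha)
    H_med, H_off = np.log(V_med) / (1 - alpha), np.log(V_off) / (1 - alpha)
    print(f"alpha={alpha}: V(median-aligned)={V_med:.4f}, V(other)={V_off:.4f}, "
          f"H(median-aligned)={H_med:.4f}, H(other)={H_off:.4f}")
```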

Proof of the Theorem

In this work, we give a proof for the univariate case (n = 1). A similar proof can be easily extended to the multivariate case (n > 1). In the proof we assume, without loss of generality, that ∀y, p(x|y) has its median at x = 0, since otherwise we could replace p(x|y) by p(x + μ(y)|y) and work instead with conditional densities centered at x = 0. The road map of the proof is similar to that in [13]. There are, however, significant differences between our work and [13]: (1) we extend the entropy minimization problem to the generalized error entropy; (2) in our proof, the Hölder inequality is applied, and there is no discretization procedure, which simplifies the proof significantly.

First, we prove the following proposition:

Proposition 1

Assume that f (x|y) (not necessarily a conditional density function) satisfies

(1)

non-negative, continuous and integrable in x for each y ∈ ℝm;

(2)

symmetric (rotation invariant for n > 1) around x = 0 and unimodal for each y ∈ ℝm;

(3)

uniformly bounded in (x, y);

(4)

V_α(f_0) < ∞, where V_α(f_0) = ∫ (f_0(x))^α dx, and f_0(x) = ∫_{ℝ^m} f(x|y) dF(y).

Then for all g :ℝm → ℝ for which Vα (fg) < ∞, we have

$$\begin{cases} V_\alpha(f_0) \le V_\alpha(f_g), & \text{if } 0 < \alpha < 1 \\ V_\alpha(f_0) \ge V_\alpha(f_g), & \text{if } \alpha > 1 \end{cases} \tag{9}$$

where V_α(f_g) = ∫ (f_g(x))^α dx and f_g(x) = ∫_{ℝ^m} f(x + g(y)|y) dF(y).

Remark

It is easy to observe that ∫ f0 dx = ∫fg dx ≤ sup(x,y) f (x|y) < ∞ (not necessarily ∫ f0 dx = 1).

Proof of the Proposition

The proof is based on the following three lemmas.

Lemma 1 [13]

Let non-negative function h : ℝ → [0, ∞) be bounded, continuous, and integrable, and define function Oh (z) by:

$$O_h(z) = \lambda\{x : h(x) \ge z\} \tag{10}$$

where λ is Lebesgue measure. Then the following results hold:

(a)

Define mh (x) = sup {z : Oh (z) ≥ x}, x ∈ (0, ∞), and mh (0) = supx h(x). Then mh (x) is continuous and non-increasing on [0, ∞), and mh (x) → 0 as x → ∞.

(b)

For any function G : [0, ∞) → ℝ with ∫ |G(h(x))| dx < ∞:

$$\int_{-\infty}^{\infty} G\big(h(x)\big)\, dx = \int_0^{\infty} G\big(m_h(x)\big)\, dx \tag{11}$$

(c)

For any x0 ∈ [0, ∞):

$$\int_0^{x_0} m_h(x)\, dx = \sup_{A :\, \lambda(A) = x_0} \int_A h(x)\, dx \tag{12}$$

Proof of Lemma 1

See the proof of Lemma 1 in [13].
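A grid-based sketch of Lemma 1 may help fix ideas. The bimodal function h below is an assumed example: O_h is evaluated from its definition, m_h is taken as the generalized inverse sup{z : O_h(z) ≥ x} (the decreasing rearrangement of h), and part (b) is checked with G(u) = u^0.7; the two integrals agree up to discretization error.

```python
import numpy as np

x = np.linspace(-15, 15, 6001)
dx = x[1] - x[0]
h = 0.6 * np.exp(-(x + 3) ** 2) + 0.9 * np.exp(-0.5 * (x - 2) ** 2)   # assumed bimodal h

# O_h(z) = Lebesgue measure of {x : h(x) >= z}, on a grid of levels z
zs = np.linspace(0.0, h.max(), 4000)
O = np.array([np.count_nonzero(h >= z) for z in zs]) * dx

# m_h(t) = sup{z : O_h(z) >= t} -- the decreasing rearrangement of h
t_grid = np.arange(0.0, x[-1] - x[0], dx)
m_vals = np.array([zs[O >= t].max() if (O >= t).any() else 0.0 for t in t_grid])

G = lambda u: u ** 0.7                     # any G with integrable G(h)
print("int G(h(x)) dx         =", G(h).sum() * dx)
print("int_0^inf G(m_h(x)) dx =", G(m_vals).sum() * dx)   # Lemma 1(b): approximately equal
```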

Remark

The transformation h ↦ m_h in Lemma 1 is also called the "rearrangement" of h [23]. By Lemma 1, we have V_α(m_{f_g}) = V_α(f_g) < ∞ and V_α(m_{f_0}) = V_α(f_0) < ∞ (let G(x) = x^α). Therefore, to prove Proposition 1, it suffices to prove:

$$\begin{cases} V_\alpha(m_{f_0}) \le V_\alpha(m_{f_g}), & \text{if } 0 < \alpha < 1 \\ V_\alpha(m_{f_0}) \ge V_\alpha(m_{f_g}), & \text{if } \alpha > 1 \end{cases} \tag{13}$$

Lemma 2

Denote m_g = m_{f_g} and m_0 = m_{f_0}. Then:

(a)

$$\int_0^{\infty} m_g(x)\, dx = \int_0^{\infty} m_0(x)\, dx < \infty \tag{14}$$

(b)

$$\int_0^{x_0} m_g(x)\, dx \le \int_0^{x_0} m_0(x)\, dx, \quad \forall x_0 \in [0, \infty) \tag{15}$$

Proof of Lemma 2

See the proof of Lemma 3 in [13].

Lemma 3

For any α > 0, let n be the non-negative integer such that n < α ≤ n + 1. Then ∀x_0 ∈ [0, ∞):

(a)

$$\int_0^{x_0} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx \le \int_0^{x_0} m_0(x)\, dx \tag{16}$$

(b)

$$\int_{x_0}^{\infty} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx \le \int_{x_0}^{\infty} m_g(x)\, dx \tag{17}$$

Proof of Lemma 3

According to the Hölder inequality [23], we have, for any Ω ⊂ [0, ∞):

$$\int_{\Omega} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx \le \left(\int_{\Omega} m_g(x)\, dx\right)^{\alpha-n} \left(\int_{\Omega} m_0(x)\, dx\right)^{n+1-\alpha} \tag{18}$$

By Lemma 2, $\int_0^{x_0} m_g(x)\,dx \le \int_0^{x_0} m_0(x)\,dx$, and it follows that:

$$\begin{aligned} \int_0^{x_0} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx &\le \left(\int_0^{x_0} m_g(x)\, dx\right)^{\alpha-n} \left(\int_0^{x_0} m_0(x)\, dx\right)^{n+1-\alpha} \\ &\le \left(\int_0^{x_0} m_0(x)\, dx\right)^{\alpha-n} \left(\int_0^{x_0} m_0(x)\, dx\right)^{n+1-\alpha} = \int_0^{x_0} m_0(x)\, dx \end{aligned} \tag{19}$$

Further, since $\int_0^{\infty} m_g(x)\,dx = \int_0^{\infty} m_0(x)\,dx$, we have $\int_{x_0}^{\infty} m_g(x)\,dx \ge \int_{x_0}^{\infty} m_0(x)\,dx$, and hence:

$$\begin{aligned} \int_{x_0}^{\infty} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx &\le \left(\int_{x_0}^{\infty} m_g(x)\, dx\right)^{\alpha-n} \left(\int_{x_0}^{\infty} m_0(x)\, dx\right)^{n+1-\alpha} \\ &\le \left(\int_{x_0}^{\infty} m_g(x)\, dx\right)^{\alpha-n} \left(\int_{x_0}^{\infty} m_g(x)\, dx\right)^{n+1-\alpha} = \int_{x_0}^{\infty} m_g(x)\, dx \end{aligned} \tag{20}$$

Q.E.D. (Lemma 3)
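Lemma 3 can also be checked numerically on an explicit pair of functions satisfying the hypotheses of Lemma 2. The exponential pair below is an assumed example, not taken from the paper: m_0(x) = 2e^{-2x} and m_g(x) = e^{-x} are non-increasing on [0, ∞), have the same total integral, and satisfy inequality (15) for every x_0.

```python
import numpy as np

x = np.linspace(0, 60, 300001)
dx = x[1] - x[0]
m0 = 2 * np.exp(-2 * x)                   # assumed m_0, total integral 1
mg = np.exp(-x)                           # assumed m_g, total integral 1

def integral(f, lo, hi):
    mask = (x >= lo) & (x <= hi)
    return f[mask].sum() * dx

for alpha in (0.4, 2.6):
    n = int(np.ceil(alpha)) - 1           # the integer n with n < alpha <= n + 1
    mix = mg ** (alpha - n) * m0 ** (n + 1 - alpha)
    for x0 in (0.3, 1.0, 3.0):
        a = integral(mix, 0, x0) <= integral(m0, 0, x0)            # inequality (16)
        b = integral(mix, x0, x[-1]) <= integral(mg, x0, x[-1])    # inequality (17)
        print(f"alpha={alpha}, x0={x0}: (16) holds: {a}, (17) holds: {b}")
```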

Let S_g = sup{x : m_g(x) > 0}, which may be finite or infinite. Equation (17) can then be rewritten as:

$$\int_{x_0}^{S_g} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx \le \int_{x_0}^{S_g} m_g(x)\, dx \tag{21}$$

Now we are in a position to prove Equation (13):

(1)

0 < α < 1: In this case, we have:

$$\begin{aligned} \int_0^{\infty} \big(m_g(x)\big)^{\alpha}\, dx &= \int_0^{S_g} m_g(x)\,\big(m_g(x)\big)^{\alpha-1}\, dx = \int_0^{S_g} m_g(x)\left(\int_0^{(m_g(x))^{\alpha-1}} dy\right) dx \\ &= \int_0^{S_g}\left\{\int_0^{\infty} m_g(x)\, I\!\left(y \le (m_g(x))^{\alpha-1}\right) dy\right\} dx \\ &= \int_0^{\infty}\left(\int_{\inf\{x:\,(m_g(x))^{\alpha-1}\ge y\}}^{S_g} m_g(x)\, dx\right) dy \\ &\overset{(A)}{\ge} \int_0^{\infty}\left\{\int_{\inf\{x:\,(m_g(x))^{\alpha-1}\ge y\}}^{S_g} \big(m_g(x)\big)^{1-\alpha}\big(m_0(x)\big)^{\alpha}\, dx\right\} dy \\ &= \int_0^{\infty}\left\{\int_0^{S_g} \big(m_g(x)\big)^{1-\alpha}\big(m_0(x)\big)^{\alpha}\, I\!\left((m_g(x))^{\alpha-1} \ge y\right) dx\right\} dy \\ &= \int_0^{S_g} \big(m_g(x)\big)^{1-\alpha}\big(m_0(x)\big)^{\alpha}\left(\int_0^{(m_g(x))^{\alpha-1}} dy\right) dx = \int_0^{S_g} \big(m_0(x)\big)^{\alpha}\, dx \\ &\overset{(B)}{=} \int_0^{S_g} \big(m_0(x)\big)^{\alpha}\, dx + \int_{S_g}^{\infty} \big(m_0(x)\big)^{\alpha}\, dx = \int_0^{\infty} \big(m_0(x)\big)^{\alpha}\, dx \end{aligned} \tag{22}$$

where (A) follows from Equation (21), and (B) is due to $\int_{S_g}^{\infty} (m_0(x))^{\alpha}\,dx = 0$, since

$$0 = \int_{S_g}^{\infty} m_g(x)\, dx \ge \int_{S_g}^{\infty} m_0(x)\, dx \ge 0 \;\Longrightarrow\; \int_{S_g}^{\infty} m_0(x)\, dx = 0 \tag{23}$$

(2)

α > 1: First we have

$$\begin{aligned} \int_0^{\infty} \big(m_0(x)\big)^{\alpha}\, dx &= \int_0^{\infty} m_0(x)\,\big(m_0(x)\big)^{\alpha-1}\, dx = \int_0^{\infty} m_0(x)\left(\int_0^{(m_0(x))^{\alpha-1}} dy\right) dx \\ &= \int_0^{\infty}\left\{\int_0^{\infty} m_0(x)\, I\!\left(y \le (m_0(x))^{\alpha-1}\right) dy\right\} dx \\ &= \int_0^{\infty}\left(\int_0^{\sup\{x:\,(m_0(x))^{\alpha-1}\ge y\}} m_0(x)\, dx\right) dy \\ &\overset{(C)}{\ge} \int_0^{\infty}\left\{\int_0^{\sup\{x:\,(m_0(x))^{\alpha-1}\ge y\}} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, dx\right\} dy \\ &= \int_0^{\infty}\left\{\int_0^{\infty} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\, I\!\left((m_0(x))^{\alpha-1} \ge y\right) dx\right\} dy \\ &= \int_0^{\infty} \big(m_g(x)\big)^{\alpha-n}\big(m_0(x)\big)^{n+1-\alpha}\left(\int_0^{(m_0(x))^{\alpha-1}} dy\right) dx = \int_0^{\infty} \big(m_0(x)\big)^{n}\big(m_g(x)\big)^{\alpha-n}\, dx \end{aligned} \tag{24}$$

where (C) follows from Equation (16). Further, one can derive:

$$\begin{aligned} \int_0^{\infty} \big(m_0(x)\big)^{n}\big(m_g(x)\big)^{\alpha-n}\, dx &= \int_0^{\infty} m_0(x)\left(\int_0^{(m_g(x))^{\alpha-n}(m_0(x))^{n-1}} dy\right) dx \\ &= \int_0^{\infty}\left\{\int_0^{\infty} m_0(x)\, I\!\left(y \le (m_g(x))^{\alpha-n}(m_0(x))^{n-1}\right) dy\right\} dx \\ &= \int_0^{\infty}\left\{\int_0^{\sup\{x:\,(m_g(x))^{\alpha-n}(m_0(x))^{n-1}\ge y\}} m_0(x)\, dx\right\} dy \\ &\overset{(D)}{\ge} \int_0^{\infty}\left\{\int_0^{\sup\{x:\,(m_g(x))^{\alpha-n}(m_0(x))^{n-1}\ge y\}} m_g(x)\, dx\right\} dy \\ &= \int_0^{\infty}\left\{\int_0^{\infty} m_g(x)\, I\!\left((m_g(x))^{\alpha-n}(m_0(x))^{n-1} \ge y\right) dx\right\} dy \\ &= \int_0^{\infty} \big(m_0(x)\big)^{n-1}\big(m_g(x)\big)^{\alpha-n+1}\, dx \\ &\ge \int_0^{\infty} \big(m_0(x)\big)^{n-2}\big(m_g(x)\big)^{\alpha-n+2}\, dx \ge \cdots \ge \int_0^{\infty} \big(m_g(x)\big)^{\alpha}\, dx \end{aligned} \tag{25}$$

where (D) is because $\int_0^{x_0} m_g(x)\,dx \le \int_0^{x_0} m_0(x)\,dx$, ∀x_0 ∈ [0, ∞). Combining Equations (24) and (25), we get $\int_0^{\infty} (m_0(x))^{\alpha}\,dx \ge \int_0^{\infty} (m_g(x))^{\alpha}\,dx$ (i.e., V_α(m_0) ≥ V_α(m_g)).

This completes the proof of Proposition 1. Let us now come back to the proof of Theorem 1. The remaining task is to remove the conditions of continuity and uniform boundedness imposed in Proposition 1. This can be accomplished by approximating p(x|y) by a sequence of functions {f_n(x|y)}, n = 1, 2, ..., which satisfy the conditions of Proposition 1. Similar to [13], we define:

$$f_n(x|y) = \begin{cases} n \displaystyle\int_x^{x+(1/n)} \min\big(n,\, p(z|y)\big)\, dz, & x \in [0, \infty) \\ f_n(-x|y), & x \in (-\infty, 0) \end{cases} \tag{26}$$

It is easy to verify that, for each n, f_n(x|y) satisfies all the conditions of Proposition 1. Here we only give the proof for condition (4). Letting $f_n^0(x) = \int_{\mathbb{R}^m} f_n(x|y)\,dF(y)$, we have:

$$\begin{aligned} V_\alpha(f_n^0) &= \int \big(f_n^0(x)\big)^{\alpha} dx = \int \left(\int_{\mathbb{R}^m} f_n(x|y)\, dF(y)\right)^{\alpha} dx \\ &= 2\int_0^{+\infty} \left(\int_{\mathbb{R}^m} \left(n\int_x^{x+(1/n)} \min\big(n,\, p(z|y)\big)\, dz\right) dF(y)\right)^{\alpha} dx \\ &\le 2\int_0^{+\infty} \left(\int_{\mathbb{R}^m} \left(n\int_x^{x+(1/n)} \sup_{z\in[x,\,x+(1/n)]} p(z|y)\, dz\right) dF(y)\right)^{\alpha} dx \\ &\overset{(E)}{=} 2\int_0^{+\infty} \left(\int_{\mathbb{R}^m} \left(n\int_x^{x+(1/n)} p(x|y)\, dz\right) dF(y)\right)^{\alpha} dx \\ &= 2\int_0^{+\infty} \left(\int_{\mathbb{R}^m} p(x|y)\, dF(y)\right)^{\alpha} dx = V_\alpha(p_0) < \infty \end{aligned} \tag{27}$$

where (E) comes from the fact that, ∀y, p(x|y) is non-increasing in x over [0, ∞), since it is CSUM.

According to Proposition 1, we have, for every n :

$$\begin{cases} V_\alpha(f_n^0) \le V_\alpha(f_n^g), & \text{if } 0 < \alpha < 1 \\ V_\alpha(f_n^0) \ge V_\alpha(f_n^g), & \text{if } \alpha > 1 \end{cases} \tag{28}$$

where $f_n^g(x) \triangleq \int_{\mathbb{R}^m} f_n(x + g(y)|y)\,dF(y)$. In order to complete the proof of Theorem 1, we only need to show that $V_\alpha(f_n^0) \to V_\alpha(p_0)$ and $V_\alpha(f_n^g) \to V_\alpha(p_g)$. This can be proved by the dominated convergence theorem. Here we only show $V_\alpha(f_n^0) \to V_\alpha(p_0)$; the proof for $V_\alpha(f_n^g) \to V_\alpha(p_g)$ is identical. First, it is clear that $f_n^0(x) \le p_0(x)$, ∀x, and hence $(f_n^0(x))^{\alpha} \le (p_0(x))^{\alpha}$, ∀x. Also, we can derive:

$$\int \big| f_n^0(x) - p_0(x) \big|\, dx \to 0 \tag{29}$$

Since $V_\alpha(p_0) = \int (p_0(x))^{\alpha}\,dx < \infty$, the dominated convergence theorem gives $V_\alpha(f_n^0) \to V_\alpha(p_0)$.

Q.E.D. (Theorem 1)
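The approximation step in Equations (26)–(28) can also be illustrated numerically. The sketch below (an assumed two-component observation model with zero-median Gaussian conditional densities, so the CSUM and centering assumptions hold) builds f_n on [0, ∞) by clipping p at level n and averaging over a window of width 1/n, and shows V_α(f_n^0) approaching V_α(p_0) from below as n grows, in line with Equation (27) and the dominated convergence argument.

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(0, 12, 120001)             # grid on [0, inf); symmetry handles x < 0
dz = z[1] - z[0]
p_y = [0.6, 0.4]                           # assumed P(Y = y)
scales = [0.3, 1.5]                        # p(x|y): zero-median Gaussians (CSUM)

def f_n_half(p_vals, n):
    """f_n(x|y) = n * int_x^{x+1/n} min(n, p(z|y)) dz on x >= 0 (Equation (26))."""
    clipped = np.minimum(n, p_vals)
    cum = np.cumsum(clipped) * dz          # ~ int_0^{z_i} min(n, p) dz
    upper = np.interp(np.minimum(z + 1.0 / n, z[-1]), z, cum)
    return n * (upper - cum)

def V_alpha_sym(f_half, alpha):
    return 2.0 * (f_half ** alpha).sum() * dz    # integrand is symmetric about 0

alpha = 0.7
p_half = [norm.pdf(z, scale=s) for s in scales]
p0_half = sum(w * ph for w, ph in zip(p_y, p_half))
V_p = V_alpha_sym(p0_half, alpha)
for n in (1, 2, 5, 20, 100):
    fn0_half = sum(w * f_n_half(ph, n) for w, ph in zip(p_y, p_half))
    print(f"n={n:3d}: V_alpha(f_n^0) = {V_alpha_sym(fn0_half, alpha):.5f}   "
          f"(V_alpha(p_0) = {V_p:.5f})")
```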

Remark

The CSUM condition in Theorem 1 is somewhat strong, but it can easily be relaxed to requiring only that the conditional PDF p(x|y) is generalized uniformly dominated (GUD) in x ∈ ℝ^n (see [24] for the definition of GUD).

3. Conclusions

The problem of determining a minimum-error-entropy (MEE) estimator is actually the problem of shifting the components of a mixture of the conditional PDF so as to minimize the entropy of the mixture. It has been proved in a recent paper that, if the conditional distribution is conditionally symmetric and unimodal (CSUM), the Shannon entropy of the mixture distribution will be minimized by aligning the conditional median. In the present work, this result has been extended to a more general case. We show that if the conditional distribution is CSUM, the Renyi entropy of the mixture distribution will also be minimized by aligning the conditional median.

Acknowledgments

This work was supported by National Natural Science Foundation of China (No. 61372152, No. 90920301) and 973 Program (No.2012CB316400, No. 2012CB316402).

Conflicts of Interest

The authors declare no conflict of interest.

Author Contributions

The contributions of each author are as follows: Badong Chen proved the main theorem and finished the draft; Guangmin Wang polished the language and typeset the manuscript; Nanning Zheng was in charge of technical checking; Jose C. Principe proofread the paper.

References

1. Haykin, S. Adaptive Filter Theory; Prentice Hall: New York, NY, USA, 1996.
2. Kailath, T.; Sayed, A.H.; Hassibi, B. Linear Estimation; Prentice Hall: Englewood Cliffs, NJ, USA, 2000.
3. Papoulis, A.; Pillai, S.U. Probability, Random Variables, and Stochastic Processes; McGraw-Hill Education: New York, NY, USA, 2002.
4. Shao, M.; Nikias, C.L. Signal processing with fractional lower order moments: Stable processes and their applications. Proc. IEEE 1993, 81, 986–1009.
5. Pei, S.C.; Tseng, C.C. Least mean p-power error criterion for adaptive FIR filter. IEEE J. Sel. Areas Commun. 1994, 12, 1540–1547.
6. Rousseeuw, P.J.; Leroy, A.M. Robust Regression and Outlier Detection; John Wiley & Sons: New York, NY, USA, 1987.
7. Boel, R.K.; James, M.R.; Petersen, I.R. Robustness and risk-sensitive filtering. IEEE Trans. Automat. Control 2002, 47, 451–461.
8. Hall, E.B.; Wise, G.L. On optimal estimation with respect to a large family of cost functions. IEEE Trans. Inform. Theory 1991, 37, 691–693.
9. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 1991.
10. Weidemann, H.L.; Stear, E.B. Entropy analysis of estimating systems. IEEE Trans. Inform. Theory 1970, 16, 264–270.
11. Tomita, Y.; Ohmatsu, S.; Soeda, T. An application of the information theory to estimation problems. Inf. Control 1976, 32, 101–111.
12. Kalata, P.; Priemer, R. Linear prediction, filtering and smoothing: An information theoretic approach. Inf. Sci. 1979, 17, 1–14.
13. Chen, T.-L.; Geman, S. On the minimum entropy of a mixture of unimodal and symmetric distributions. IEEE Trans. Inf. Theory 2008, 54, 3166–3174.
14. Renyi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
15. Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
16. Chen, B.; Zhu, Y.; Hu, J.; Principe, J.C. System Parameter Identification: Information Criteria and Algorithms; Elsevier: London, UK, 2013.
17. Erdogmus, D.; Principe, J.C. An error-entropy minimization algorithm for supervised training of nonlinear adaptive systems. IEEE Trans. Signal Process. 2002, 50, 1780–1786.
18. Erdogmus, D.; Principe, J.C. Generalized information potential criterion for adaptive system training. IEEE Trans. Neural Netw. 2002, 13, 1035–1044.
19. Erdogmus, D.; Principe, J.C. From linear adaptive filtering to nonlinear information processing—The design and analysis of information processing systems. IEEE Signal Process. Mag. 2006, 23, 14–33.
20. Santamaria, I.; Erdogmus, D.; Principe, J.C. Entropy minimization for supervised digital communications channel equalization. IEEE Trans. Signal Process. 2002, 50, 1184–1192.
21. Chen, B.; Hu, J.; Pu, L.; Sun, Z. Stochastic gradient algorithm under (h, φ)-entropy criterion. Circuits Syst. Signal Process. 2007, 26, 941–960.
22. Chen, B.; Zhu, Y.; Hu, J. Mean-square convergence analysis of ADALINE training with minimum error entropy criterion. IEEE Trans. Neural Netw. 2010, 21, 1168–1179.
23. Hardy, G.H.; Littlewood, J.E.; Polya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1934.
24. Chen, B.; Principe, J.C. Some further results on the minimum error entropy estimation. Entropy 2012, 14, 966–977.
Table 1. Optimal estimates for several risk functions.

Risk function R(p_g(x)) | Definition | Optimal estimate
Mean square error (MSE) | ∫_{ℝ^n} ||x||₂² p_g(x) dx | g*(y) = π(y) ≜ mean[p(.|y)]
Mean absolute deviation (MAD) | ∫_{ℝ^n} ||x||₁ p_g(x) dx | g*(y) = μ(y) ≜ median[p(.|y)]
Mean 0–1 loss | ∫_{ℝ^n} l_{0–1}(x) p_g(x) dx | g*(y) = ζ(y) ≜ mode[p(.|y)]
General Bayes risk | ∫_{ℝ^n} l(x) p_g(x) dx | If l(x) is even and convex, and p(x|y) is symmetric in x, then g*(y) = π(y) = μ(y)
Error entropy (EE) | −∫_{ℝ^n} p_g(x) log p_g(x) dx | If p(x|y) is CSUM, then g*(y) = π(y) = μ(y) = ζ(y)
