1. Introduction
The principles and methods of Shannon's information theory have been widely applied in statistical estimation, filtering, and learning problems [1–18]. The learning process is essentially a procedure of information processing whose goal is to decrease data redundancy in the presence of uncertainty and to encode the data into a model; it is therefore intrinsically related to information theory. An information theoretic description of learning processes was given in [10], where learning is defined as a process in which the system's subjective entropy, or equivalently its missing information, decreases in time. The mathematical concept of information has also been brought to biologically plausible information processing [11]. In addition, a unifying framework for information theoretic learning (ITL) has been presented in [18].
From a statistical viewpoint, learning can be thought of as approximating the a posteriori distribution of the targets given a set of examples (training data). Figure 1 shows a general scheme of supervised machine learning, where the desired system output Y (the teacher) is assumed to be related to the input signal X through a conditional probability density function (PDF) p(y|x), and the learning machine (model) is represented by a parametric mapping g(., W), where W denotes a parameter vector that needs to be estimated.
Figure 1. A general scheme of supervised machine learning.
The learning goal is then to adapt the parameter W such that the discrepancy between the model output \hat{Y} = g(X, W) and the desired output Y is minimized. How to measure this discrepancy (or model mismatch) is a key aspect of learning. One can use a statistical descriptor of the distribution of the error E = Y - \hat{Y} as the measure of discrepancy. The most popular descriptors are the second-order moments (variance, correlation, etc.), which, combined with the Gaussian assumption, in general lead to mathematically convenient and analytically tractable optimal solutions. A typical example is the mean square error (MSE) criterion in least-squares regression. However, second-order statistics may perform poorly as optimality criteria, especially in nonlinear and non-Gaussian (e.g., heavy-tailed or finite-range distribution) situations. Recently, the error entropy, as an information theoretic alternative to MSE, has been successfully applied in supervised adaptive system training [12,13,14,15,16,17]. The minimum error entropy (MEE) criterion usually outperforms the MSE criterion in many realistic scenarios, since it captures higher-order statistics and the information content of signals rather than simply their energy. Under the MEE criterion, the optimal weight in Figure 1 is:

W^* = \arg\min_{W} H(E), \qquad H(E) = -\int p_E(e) \log p_E(e)\, de = -\mathbf{E}[\log p_E(E)]   (1)
where H(E) denotes the Shannon entropy of the error E, p_E(.) denotes the error PDF, and \mathbf{E}[.] denotes the expectation operator. Throughout this paper, "log" denotes the natural logarithm. The formulation (1) can be generalized to other entropy definitions, such as the α-order Renyi entropy [18]. Since entropy is shift invariant, the error PDF p_E(.) is in practice generally restricted to have zero mean.
The learning machine in Figure 1 can also be a nonparametric and universal mapping g(.). Familiar examples include the support vector machine (SVM) [19,20] and kernel adaptive filtering [21]. In this case, the hypothesis space for learning is in general a high (possibly infinite) dimensional reproducing kernel Hilbert space (RKHS) \mathcal{H}, and the optimal mapping under the MEE criterion is:

g^* = \arg\min_{g \in \mathcal{H}} H(E) = \arg\min_{g \in \mathcal{H}} H\bigl(Y - g(X)\bigr)   (2)
To implement MEE learning, we need to evaluate the error entropy. In practice, however, the error distribution is usually unknown, and an analytical evaluation of the error entropy is not possible. Thus we have to estimate the error entropy from samples. One simple way is to estimate the error PDF from the available samples and plug the estimated PDF directly into the entropy definition to obtain an entropy estimator. In the literature there are many techniques for estimating the PDF of a random variable from sample data. In ITL, the most widely used approach is Parzen windowing (kernel density estimation) [22]. By Parzen windowing, the estimated error PDF is:

\hat{p}_E(e) = \frac{1}{N} \sum_{i=1}^{N} K_\lambda(e - e_i)   (3)

where K_\lambda(.) is the kernel function with smoothing factor (or kernel size) λ > 0, given by K_\lambda(e) = \frac{1}{\lambda} K\!\left(\frac{e}{\lambda}\right), K(.) is a continuous density, and {e_1, ..., e_N} are error samples, which are assumed to be independent and identically distributed (i.i.d.). As the sample number N → ∞, the estimated PDF converges uniformly (with probability 1) to the true PDF convolved with the kernel function, that is:

\hat{p}_E(e) \to p_E(e) * K_\lambda(e)   (4)

where * denotes the convolution operator. Then, by the Parzen windowing approach (with a fixed kernel function K_\lambda), the estimated error entropy converges almost surely (a.s.) to the entropy of the convolved density (see [22,23]). Thus, the actual entropy to be minimized is:

H(E + \lambda Z) = -\int \bigl(p_E * K_\lambda\bigr)(\xi)\, \log \bigl(p_E * K_\lambda\bigr)(\xi)\, d\xi   (5)

where Z denotes a random variable that is independent of X, Y, and E, and has PDF K(.). Note that the PDF of the sum of two independent random variables equals the convolution of their individual PDFs, so the PDF of E + λZ is indeed p_E * K_\lambda. Here, we call the entropy H(E + λZ) the smoothed error entropy, and Z the smoothing variable. Under the smoothed MEE (SMEE) criterion, the optimal mapping in (2) becomes:

g^* = \arg\min_{g \in \mathcal{H}} H(E + \lambda Z)   (6)
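To make the connection between the plug-in estimator (3) and the smoothed error entropy (5) concrete, here is a minimal Python sketch (added for illustration and not part of the original development; the Gaussian kernel, the Gaussian error samples, and all numerical settings are arbitrary assumptions):

```python
import numpy as np

def parzen_plugin_entropy(errors, lam, n_grid=2000):
    """Plug-in Shannon entropy of a Gaussian-kernel Parzen density estimate.

    Builds the kernel density estimate of the error PDF with kernel size lam,
    then approximates -int p*log(p) by a Riemann sum on a uniform grid.
    """
    grid = np.linspace(errors.min() - 5 * lam, errors.max() + 5 * lam, n_grid)
    de = grid[1] - grid[0]
    diff = grid[:, None] - errors[None, :]
    p_hat = np.mean(np.exp(-0.5 * (diff / lam) ** 2) / (np.sqrt(2 * np.pi) * lam), axis=1)
    p_hat = np.clip(p_hat, 1e-300, None)            # avoid log(0)
    return -np.sum(p_hat * np.log(p_hat)) * de

rng = np.random.default_rng(0)
sigma_e, lam, N = 1.0, 0.5, 2000
errors = rng.normal(0.0, sigma_e, size=N)           # hypothetical Gaussian error samples

# For a Gaussian error and a Gaussian kernel, E + lam*Z is N(0, sigma_e^2 + lam^2),
# so the smoothed error entropy H(E + lam*Z) has a closed form to compare against.
smoothed = 0.5 * np.log(2 * np.pi * np.e * (sigma_e ** 2 + lam ** 2))
print(f"plug-in estimate : {parzen_plugin_entropy(errors, lam):.4f}")
print(f"H(E + lam*Z)     : {smoothed:.4f}")
```

As the sample number grows, the plug-in estimate approaches the smoothed error entropy rather than the true error entropy, which is exactly the point made above.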
Although SMEE is the criterion that is actually minimized in ITL (as the sample number N → ∞), its theoretical properties have up to now been little studied. In this work, we study the SMEE criterion theoretically. The rest of the paper is organized as follows: in Section 2, we present some basic properties of the SMEE criterion. In Section 3, we investigate how the smoothing factor λ affects the optimal solution. Finally, in Section 4, we give the conclusion.
2. Some Basic Properties of SMEE Criterion
In this section, some basic properties of the SMEE criterion are presented. We assume from now on, without explicit mention, that the learning machine is a nonparametric and universal mapping g(.). In addition, the input vector X belongs to an m-dimensional Euclidean space, X ∈ \mathbb{R}^m, and for simplicity the output Y is assumed to be a scalar signal, Y ∈ \mathbb{R}.
Property 1: Minimizing the smoothed error entropy minimizes an upper bound of the true error entropy H(E).
Proof: According to the entropy power inequality (EPI) [1], we have:

\exp\bigl(2H(E + \lambda Z)\bigr) \ge \exp\bigl(2H(E)\bigr) + \exp\bigl(2H(\lambda Z)\bigr)

Thus, minimizing the smoothed error entropy H(E + λZ) minimizes an upper bound of H(E).
Remark 1: Although this property does not give a precise result concerning SMEE vs. MEE, it suggests that minimizing the smoothed error entropy will constrain the true error entropy to small values.
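To make the bound explicit, one can rearrange the EPI as follows (a reconstruction added for illustration, not a formula quoted from the original):

```latex
% Rearranging the entropy power inequality stated in the proof of Property 1:
\[
  H(E) \;\le\; \tfrac{1}{2}\log\!\Bigl(\exp\bigl(2H(E+\lambda Z)\bigr) - \exp\bigl(2H(\lambda Z)\bigr)\Bigr).
\]
% The right-hand side is increasing in H(E + \lambda Z), while
% H(\lambda Z) = H(Z) + \log\lambda does not depend on the learning machine,
% so minimizing the smoothed error entropy tightens this upper bound on H(E).
```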
Property 2: The smoothed error entropy is upper bounded by \tfrac{1}{2}\log\bigl(2\pi e(\sigma_E^2 + \lambda^2 \sigma_Z^2)\bigr), where \sigma_E^2 and \sigma_Z^2 denote the variances of E and Z, respectively, and this upper bound is achieved if and only if both E and Z are Gaussian distributed.
Proof: The first part of this property is a direct consequence of the maximum entropy property of the Gaussian distribution, since the variance of E + λZ is \sigma_E^2 + \lambda^2 \sigma_Z^2. The second part follows from Cramer's decomposition theorem [24], which states that if X_1 and X_2 are independent and their sum X_1 + X_2 is Gaussian, then both X_1 and X_2 must be Gaussian.
Remark 2: According to Property 2, if both E and Z are Gaussian distributed, then minimizing the smoothed error entropy will minimize the error variance.
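As an illustrative special case (added here; Property 2 itself does not assume Gaussianity or a unit-variance Z):

```latex
% If E ~ N(0, sigma_E^2) and Z ~ N(0, 1) are independent, then E + lambda*Z ~ N(0, sigma_E^2 + lambda^2):
\[
  H(E + \lambda Z) \;=\; \tfrac{1}{2}\log\!\bigl(2\pi e\,(\sigma_E^2 + \lambda^2)\bigr),
\]
% which is strictly increasing in sigma_E^2 for any fixed lambda, so minimizing the
% smoothed error entropy over the learner is the same as minimizing the error variance.
```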
Property 3: The smoothed error entropy has the following Taylor approximation around λ = 0:

H(E + \lambda Z) = H(E) + \frac{\lambda^2 \sigma_Z^2}{2} J(E) + o(\lambda^2)

where o(λ^2) denotes the higher-order infinitesimal term of the Taylor expansion, and J(E) is the Fisher information of E, defined as:

J(E) = \int p_E(e)\left(\frac{\partial}{\partial e}\log p_E(e)\right)^{2} de
Proof: This property can be easily proved using De Bruijn's identity [25]. For any two independent random variables X_1 and X_2, De Bruijn's identity can be expressed as:

\frac{d}{dt} H\bigl(X_1 + \sqrt{t}\,X_2\bigr)\Big|_{t=0} = \frac{\sigma_{X_2}^2}{2} J(X_1)   (11)

where \sigma_{X_2}^2 denotes the variance of X_2 (the classical De Bruijn identity assumes that X_2 is Gaussian; here we use a generalized De Bruijn identity for arbitrary, not necessarily Gaussian, X_2 [25]). Setting X_1 = E, X_2 = Z, and t = λ^2, we have:

\frac{d}{d(\lambda^2)} H(E + \lambda Z)\Big|_{\lambda^2 = 0} = \frac{\sigma_Z^2}{2} J(E)   (12)

and hence we obtain the first-order Taylor expansion (in λ^2):

H(E + \lambda Z) = H(E) + \frac{\lambda^2 \sigma_Z^2}{2} J(E) + o(\lambda^2)   (13)
Remark 3: By Property 3, for small λ the smoothed error entropy approximately equals the true error entropy plus a scaled version of the Fisher information of the error. This result is interesting because minimizing the smoothed error entropy then minimizes a weighted sum of the true error entropy and the Fisher information of the error. Intuitively, minimizing the error entropy tends to produce a spikier error distribution, while minimizing the Fisher information makes the error distribution smoother (smaller Fisher information implies a smaller variance of the PDF gradient). Therefore, the SMEE criterion provides a balance between the spikiness and the smoothness of the error distribution. In [26], the Fisher information of the error has been used as a criterion in supervised adaptive filtering.
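A quick sanity check of Property 3 (added for illustration, with both E and Z taken Gaussian, which the property does not require):

```latex
% Take E ~ N(0, sigma^2) and Z ~ N(0, 1), so that H(E) = (1/2)log(2*pi*e*sigma^2) and J(E) = 1/sigma^2:
\[
  H(E + \lambda Z)
   = \tfrac{1}{2}\log\!\bigl(2\pi e(\sigma^2 + \lambda^2)\bigr)
   = H(E) + \tfrac{1}{2}\log\!\Bigl(1 + \tfrac{\lambda^2}{\sigma^2}\Bigr)
   = H(E) + \frac{\lambda^2}{2\sigma^2} + o(\lambda^2),
\]
% which matches Property 3, since (lambda^2 * sigma_Z^2 / 2) * J(E) = lambda^2 / (2*sigma^2) when sigma_Z^2 = 1.
```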
Property 4: Minimizing the smoothed error entropy H(E + λZ) is equivalent to minimizing the mutual information between E + λZ and the input X, i.e., minimizing I(E + λZ; X).
Proof: Since the mutual information I(E + λZ; X) = H(E + λZ) − H(E + λZ | X), where H(. | .) denotes conditional entropy, we derive:

\arg\min_{g \in \mathcal{H}} H(E + \lambda Z) = \arg\min_{g \in \mathcal{H}} \bigl[ I(E + \lambda Z; X) + H(E + \lambda Z \mid X) \bigr] = \arg\min_{g \in \mathcal{H}} \bigl[ I(E + \lambda Z; X) + H(Y + \lambda Z \mid X) \bigr] \stackrel{(a)}{=} \arg\min_{g \in \mathcal{H}} I(E + \lambda Z; X)

where the second equality uses H(E + λZ | X) = H(Y − g(X) + λZ | X) = H(Y + λZ | X) (given X, subtracting g(X) is a deterministic shift), and (a) comes from the fact that the conditional entropy H(Y + λZ | X) does not depend on the mapping g(.).
Property 5: The smoothed error entropy is lower bounded by the conditional entropy H(Y + λZ | X), and this lower bound is achieved if and only if the mutual information I(E + λZ; X) = 0.
Proof: As H(E + λZ | X) = H(Y + λZ | X), we have:

H(E + \lambda Z) = I(E + \lambda Z; X) + H(E + \lambda Z \mid X) \stackrel{(b)}{\ge} H(Y + \lambda Z \mid X)

where (b) is because of the non-negativity of the mutual information I(E + λZ; X), with equality if and only if I(E + λZ; X) = 0.
Remark 4: The lower bound in Property 5 depends only on the conditional PDF of Y given X and the kernel function K_\lambda(.), and is not related to the learning machine. Combining Property 5 and Property 2, the smoothed error entropy satisfies the following inequalities:

H(Y + \lambda Z \mid X) \;\le\; H(E + \lambda Z) \;\le\; \tfrac{1}{2}\log\bigl(2\pi e(\sigma_E^2 + \lambda^2 \sigma_Z^2)\bigr)
Property 6: Let \tilde{p}(y \mid x) = p(y \mid x) * K_\lambda(y) be the smoothed conditional PDF of Y given X = x, where the convolution is with respect to y. If for every x ∈ \mathbb{R}^m, \tilde{p}(y \mid x) is symmetric (not necessarily about zero) and unimodal in y, then the optimal mapping in (6) equals (almost everywhere):

g^*(x) = \mathbf{E}[Y \mid X = x] + c   (17)

where c is any constant.
Proof: The smoothed error PDF p_{E+\lambda Z}(.) can be expressed as:

p_{E+\lambda Z}(e) = \int_{\mathbb{R}^m} \tilde{p}\bigl(e + g(x) \mid x\bigr)\, dF(x)   (18)

where F(x) denotes the distribution function of X. From (18), we see that the SMEE criterion can be formulated as the problem of shifting the components of a mixture of the smoothed conditional PDFs so as to minimize the entropy of the mixture. Property 6 then holds, since it has already been proved in [27] that, if all components (conditional PDFs) of the mixture are symmetric and unimodal, the conditional mean (or median) minimizes the mixture entropy. The constant c is added since the entropy is shift-invariant (in practice we usually set c = 0).
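The shifting interpretation in the proof can be checked numerically. The Python sketch below is an added illustration: the two-point input distribution, the Gaussian conditional PDFs, and the Gaussian kernel are assumptions made only for this example. The mixture entropy should be minimized at zero offset, i.e., at conditional-mean alignment.

```python
import numpy as np

# Toy setup: X takes two values with equal probability, Y | X = x_i ~ N(mu_i, s2_i)
# (symmetric, unimodal), and the kernel is Gaussian with size lam, so each smoothed
# conditional PDF is N(mu_i, s2_i + lam^2). We set g(x_1) = mu_1 and g(x_2) = mu_2 + d,
# so the smoothed error PDF is an equal-weight mixture centred at 0 and at -d.
mu = np.array([-1.0, 2.0])
s2 = np.array([0.30, 0.80])
lam = 0.5
var = s2 + lam ** 2

e = np.linspace(-12.0, 12.0, 6001)          # integration grid for the error variable
de = e[1] - e[0]

def smoothed_error_entropy(d):
    """Entropy of the smoothed error mixture when the second component is offset by d."""
    centres = [0.0, -d]                      # error centre of component i is E[Y|x_i] - g(x_i)
    p = sum(0.5 * np.exp(-0.5 * (e - c) ** 2 / v) / np.sqrt(2 * np.pi * v)
            for c, v in zip(centres, var))
    p = np.clip(p, 1e-300, None)             # guard against log(0)
    return -np.sum(p * np.log(p)) * de

offsets = np.linspace(-2.0, 2.0, 41)
entropies = [smoothed_error_entropy(d) for d in offsets]
print("entropy-minimizing offset:", offsets[int(np.argmin(entropies))])   # expect 0.0
```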
3. How Smoothing Factor Affects the Optimal Solution
The smoothing factor λ is an important parameter of the SMEE criterion: it controls the smoothness of the performance surface. In the following, we investigate how the smoothing factor affects the optimal solution (optimal mapping) g*(.).
When λ = 0, the smoothed error entropy becomes the true error entropy, and the SMEE criterion reduces to the original MEE criterion. When λ > 0, the two criteria are different and in general have different solutions. In some situations, however, the SMEE criterion yields the same solution as the MEE criterion for every λ.
Proposition 1: If the desired output Y is related to the input signal X through the nonlinear regression model Y = f(X) + N, where f is an unknown nonlinear mapping and N is an additive noise that is independent of X and Z, then for any λ, the optimal solution under the SMEE criterion is:

g^*(x) = f(x) + c   (19)

where c is any constant.
Proof: For any mapping g ∈ \mathcal{H} and any λ, we have:

H(E + \lambda Z) = H\bigl(f(X) - g(X) + N + \lambda Z\bigr) = H(U + V) \stackrel{(c)}{\ge} H(V) = H(N + \lambda Z)   (20)

where U = f(X) − g(X), V = N + λZ, and (c) comes from the fact that U and V are independent. The equality in (20) holds if and only if U is δ-distributed, that is, U is a constant. This can be proved as follows.
If U is a constant, the equality in (20) clearly holds. Conversely, if the equality in (20) holds, then U must be a constant. Indeed, in this case U and U + V are independent, and hence \mathbf{E}[U(U+V)] = \mathbf{E}[U]\,\mathbf{E}[U+V]. Using the independence of U and V, it follows that:

\mathrm{Var}(U) = \mathbf{E}[U(U+V)] - \mathbf{E}[U]\mathbf{E}[V] - \bigl(\mathbf{E}[U]\bigr)^2 = \mathbf{E}[U]\,\mathbf{E}[U+V] - \mathbf{E}[U]\mathbf{E}[V] - \bigl(\mathbf{E}[U]\bigr)^2 = 0   (21)

which implies that the variance of U is zero (i.e., U is a constant). Therefore we have g^*(x) = f(x) + c.
Remark 5: Proposition 1 implies that for the nonlinear regression problem, the optimal solution under the SMEE criterion does not change with the smoothing factor provided that the additive noise is independent of the input signal.
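As a concrete instance (added for illustration; the specific regression function and noise law are arbitrary assumptions):

```latex
% Take f(x) = sin(x) and N Laplacian, independent of X and Z. For g(x) = sin(x) + c the error is E = N - c, so
\[
  H(E + \lambda Z) \;=\; H(N - c + \lambda Z) \;=\; H(N + \lambda Z)
\]
% by shift invariance of the entropy; this is exactly the lower bound in (20) and it holds for every lambda.
% Any other choice of g adds a non-degenerate independent term U = f(X) - g(X) and strictly increases
% the smoothed error entropy, so g*(x) = sin(x) + c for every lambda, as Proposition 1 states.
```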
If the unknown system (the conditional PDF) is not restricted to a nonlinear regression model, the optimal solution of SMEE can, under certain conditions, still remain unchanged as the smoothing factor λ varies. Specifically, the following proposition holds.
Proposition 2: If the conditional PDF p(y | x) and the kernel function K_\lambda(.) are both symmetric (not necessarily about zero) and unimodal, then for any λ, the SMEE criterion produces the same solution:

g^*(x) = \mathbf{E}[Y \mid X = x] + c   (22)

Proof: By Property 6, it suffices to prove that the smoothed conditional PDF \tilde{p}(y \mid x) = p(y \mid x) * K_\lambda(y) is symmetric and unimodal. This is a well-known property of convolutions of symmetric unimodal densities, and a simple proof can be found in [28].
Remark 6: Note that the optimal solution under the minimum error variance criterion is also given by (22). In particular, setting c = 0, the solution (22) becomes the conditional mean, which corresponds to the optimal solution under the MSE criterion.
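For example (an added illustration; the Laplacian conditional model is an assumption, not taken from the original):

```latex
% Suppose Y | X = x is Laplacian, p(y|x) = (1/(2b)) exp(-|y - m(x)|/b), and the kernel is Gaussian.
% Both densities are symmetric and unimodal, so Proposition 2 gives, for every lambda >= 0,
\[
  g^*(x) \;=\; \mathbf{E}[Y \mid X = x] + c \;=\; m(x) + c,
\]
% i.e., the SMEE solution coincides with the minimum error variance solution (and, with c = 0,
% with the MSE solution), no matter how much smoothing the Parzen kernel introduces.
```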
Proposition 3: Assume that the conditional PDF p(y | x) is symmetric (not necessarily unimodal) with uniformly bounded support, and that the kernel function K_\lambda(.) is a zero-mean Gaussian PDF with variance λ^2. Then, if the smoothing factor λ is larger than a certain value, the optimal solution under SMEE is still given by (22).
Proof: By Property 6, it is sufficient to prove that, if the smoothing factor λ is larger than a certain value, the smoothed conditional PDF \tilde{p}(y \mid x) is symmetric and unimodal. Suppose, without loss of generality, that the conditional PDF p(y | x) is symmetric about zero with bounded support [−B, B], B > 0. Since the kernel function K_\lambda(.) is a zero-mean Gaussian PDF with variance λ^2, the smoothed PDF \tilde{p}(y \mid x) can be expressed as:

\tilde{p}(y \mid x) = \int_{-B}^{B} p(\tau \mid x)\, \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(y-\tau)^2}{2\lambda^2}\right) d\tau   (23)

where the integration range follows from the fact that p(τ | x) vanishes outside [−B, B]. Clearly, \tilde{p}(y \mid x) is symmetric in y. Further, we derive:

\frac{\partial}{\partial y}\tilde{p}(y \mid x) = -\frac{1}{\lambda^2}\int_{-B}^{B} p(\tau \mid x)\,(y-\tau)\, \frac{1}{\sqrt{2\pi}\,\lambda}\exp\!\left(-\frac{(y-\tau)^2}{2\lambda^2}\right) d\tau   (24)

and claim that, whenever λ ≥ 2B,

\frac{\partial}{\partial y}\tilde{p}(y \mid x) \le 0 \quad \text{for all } y \ge 0   (25)

We give below a simple proof of (25). By symmetry, it suffices to consider only the case y ≥ 0. In this case, we consider two subcases:
- (1) y ≥ B: Here (y − τ) ≥ 0 for every τ ∈ [−B, B], so the integrand in (24) is non-negative, and hence:

\frac{\partial}{\partial y}\tilde{p}(y \mid x) \le 0   (26)

- (2) 0 ≤ y < B: Here, using the symmetry of p(τ | x) about zero, (24) can be rewritten as:

\frac{\partial}{\partial y}\tilde{p}(y \mid x) = -\frac{1}{\sqrt{2\pi}\,\lambda^3}\int_{0}^{B} p(\tau \mid x)\left[(y-\tau)\exp\!\left(-\frac{(y-\tau)^2}{2\lambda^2}\right) + (y+\tau)\exp\!\left(-\frac{(y+\tau)^2}{2\lambda^2}\right)\right] d\tau   (27)

Since the function u \mapsto u\exp\bigl(-u^2/(2\lambda^2)\bigr) is non-decreasing on [0, λ] and, for λ ≥ 2B, we have 0 ≤ |y − τ| ≤ y + τ ≤ 2B ≤ λ, we get (y+\tau)\exp\bigl(-(y+\tau)^2/(2\lambda^2)\bigr) \ge |y-\tau|\exp\bigl(-(y-\tau)^2/(2\lambda^2)\bigr), so the bracketed term in (27) is non-negative, and it follows easily that:

\frac{\partial}{\partial y}\tilde{p}(y \mid x) \le 0   (28)

Combining (24), (26) and (28), we get \frac{\partial}{\partial y}\tilde{p}(y \mid x) \le 0 for all y ≥ 0 whenever λ ≥ 2B. Therefore (25) holds, which implies that if λ ≥ 2B the smoothed PDF \tilde{p}(y \mid x) is symmetric and unimodal, and this completes the proof of the proposition.
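A quick numerical check of this argument (an added illustration; the particular bimodal conditional density below is an arbitrary choice satisfying the assumptions of Proposition 3). The proposition guarantees unimodality of the smoothed PDF for λ ≥ 2B; smaller λ may or may not suffice.

```python
import numpy as np

B = 1.0                                      # conditional PDF supported on [-B, B]
y = np.linspace(-8.0, 8.0, 4001)             # evaluation grid
dy = y[1] - y[0]

# A symmetric but bimodal density on [-B, B] (proportional to tau^2 on its support):
p = np.where(np.abs(y) <= B, y ** 2, 0.0)
p /= p.sum() * dy

def smooth_with_gaussian(p, lam):
    """Discrete approximation of the convolution of p with a zero-mean Gaussian of std lam."""
    kernel = np.exp(-0.5 * (y / lam) ** 2) / (np.sqrt(2 * np.pi) * lam)
    return np.convolve(p, kernel, mode="same") * dy

def is_unimodal(q, tol=1e-12):
    """True if q rises to a single peak and then falls (ignoring near-zero fluctuations)."""
    d = np.diff(q)
    signs = np.sign(np.where(np.abs(d) < tol, 0.0, d))
    signs = signs[signs != 0]
    return not np.any(np.diff(signs) > 0)    # no minus-to-plus transition

for lam in [0.2, 0.5, 1.0, 2.0 * B, 3.0]:
    q = smooth_with_gaussian(p, lam)
    print(f"lambda = {lam:3.1f}: unimodal = {is_unimodal(q)}")
```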
Remark 7: Proposition 3 suggests that, under certain conditions, when λ is larger than a certain value the SMEE criterion yields the same solution as the minimum error variance criterion. In the next proposition, a similar but more interesting result is presented for the general case, in which no assumptions are made on the conditional PDF or on the kernel function.
Proposition 4: When the smoothing factor λ → ∞, minimizing the smoothed error entropy becomes equivalent to minimizing the error variance plus an infinitesimal term.
Proof: The smoothed error entropy can be rewritten as:

H(E + \lambda Z) = H\bigl(\lambda (Z + \sqrt{t}\,E)\bigr) = \log\lambda + H\bigl(Z + \sqrt{t}\,E\bigr)   (29)

where t = 1/λ^2. Since the term log λ does not depend on the learning machine, minimizing H(E + λZ) is equivalent to minimizing H(Z + \sqrt{t}\,E), that is:

g^* = \arg\min_{g \in \mathcal{H}} H(E + \lambda Z) = \arg\min_{g \in \mathcal{H}} H\bigl(Z + \sqrt{t}\,E\bigr)   (30)

By De Bruijn's identity (11):

\frac{d}{dt} H\bigl(Z + \sqrt{t}\,E\bigr)\Big|_{t=0} = \frac{\sigma_E^2}{2} J(Z)   (31)

When λ is very large (and hence t is very small):

H\bigl(Z + \sqrt{t}\,E\bigr) = H(Z) + \frac{t\,\sigma_E^2}{2} J(Z) + o(t)   (32)

Combining (30) and (32), and noting that H(Z) and J(Z) do not depend on the learning machine, yields:

g^* = \arg\min_{g \in \mathcal{H}} \left\{ \frac{\sigma_E^2}{2\lambda^2} J(Z) + o\!\left(\frac{1}{\lambda^2}\right) \right\} = \arg\min_{g \in \mathcal{H}} \bigl\{ \sigma_E^2 + o(1) \bigr\}   (33)

which completes the proof.
Remark 8: The above result is very interesting: when the smoothing factor λ is very large, minimizing the smoothed error entropy is approximately equivalent to minimizing the error variance (or the mean square error, if the error PDF is restricted to zero mean). This result holds for any conditional PDF and any kernel function. A similar result holds for the nonparametric entropy estimator based on Parzen windowing: it was proved in [14] that, in the limit as the kernel size (the smoothing factor) tends to infinity, the entropy estimator approaches a nonlinearly scaled version of the sample variance.
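For intuition, the Gaussian special case can be worked out in closed form (an added illustration; Proposition 4 itself makes no distributional assumptions):

```latex
% With E ~ N(0, sigma_E^2) and Z ~ N(0, 1):
\[
  H(E + \lambda Z)
    = \tfrac{1}{2}\log\!\bigl(2\pi e(\sigma_E^2 + \lambda^2)\bigr)
    = \log\lambda + \tfrac{1}{2}\log(2\pi e) + \frac{\sigma_E^2}{2\lambda^2} + o\!\left(\frac{1}{\lambda^2}\right),
\]
% so for large lambda the only learner-dependent part is, up to o(1/lambda^2), proportional to the
% error variance sigma_E^2, in agreement with Proposition 4.
```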
4. Conclusions
Traditional machine learning methods mainly exploit second-order statistics (covariance, mean square error, correlation, etc.). Optimality criteria based on second-order statistics are computationally simple and are optimal under linear and Gaussian assumptions. Although second-order statistics are still prevalent in the machine learning community and provide successful engineering solutions to most practical problems, it has become evident that this approach can be improved, especially when the data possess non-Gaussian distributions (multiple modes, fat tails, finite range, etc.). In such situations, a more appropriate criterion should capture higher-order statistics and the information content of signals rather than simply their energy. Recent studies suggest that supervised machine learning can benefit greatly from the use of the minimum error entropy (MEE) criterion. To implement MEE learning, however, one has to estimate the error entropy from samples. In the limit, as the sample size tends to infinity, the error entropy estimated by Parzen windowing converges to the smoothed error entropy, i.e., the entropy of the error plus an independent random variable whose PDF equals the kernel function used in Parzen windowing; the smoothed error entropy is therefore the actual entropy that is minimized in MEE learning.
In this paper, we study theoretically the properties of the smoothed MEE (SMEE) criterion in supervised machine learning and, in particular, we investigate how the smoothing factor affects the optimal solution. Some interesting results are obtained. It is shown that when the smoothing factor is small, the smoothed error entropy equals approximately the true error entropy plus a scaled version of the Fisher information of error. In some special situations, the SMEE solution remains unchanged with increasing smoothing factor. In general cases, however, when the smoothing factor is very large, minimizing the smoothed error entropy will be approximately equivalent to minimizing the error variance (or the mean square error if the error distribution is restricted to zero-mean), regardless of the conditional PDF and the kernel function.
This work does not address the learning issues that arise when the number of samples is limited. In that case, the problem becomes much more complex, since there is an extra bias in the entropy estimation. We leave this problem open for future research. The results obtained in this paper are nevertheless useful, since they provide the theoretical solutions that the empirical solution (obtained with finite data) should approximate.