Article

Rate-Distortion Bounds for Kernel-Based Distortion Measures †

Kazuho Watanabe
Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan
This paper is an extended version of my papers published in the Eighth Workshop on Information Theoretic Methods in Science and Engineering, Copenhagen, Denmark, 24–26 June 2015 and the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017.
Entropy 2017, 19(7), 336; https://doi.org/10.3390/e19070336
Submission received: 9 May 2017 / Revised: 16 June 2017 / Accepted: 2 July 2017 / Published: 5 July 2017
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Abstract

Kernel methods have been used for turning linear learning algorithms into nonlinear ones. These nonlinear algorithms measure distances between data points by the distance in the kernel-induced feature space. In lossy data compression, the optimal tradeoff between the number of quantized points and the incurred distortion is characterized by the rate-distortion function. However, the rate-distortion functions associated with distortion measures involving kernel feature mapping have yet to be analyzed. We consider two reconstruction schemes, reconstruction in input space and reconstruction in feature space, and provide bounds to the rate-distortion functions for these schemes. Comparison of the derived bounds to the quantizer performance obtained by the kernel K-means method suggests that the rate-distortion bounds for input space and feature space reconstructions are informative at low and high distortion levels, respectively.

1. Introduction

Kernel methods have been widely used for nonlinear learning problems in combination with linear learning algorithms such as the support vector machine and principal component analysis [1]. By the so-called kernel trick, kernel-based methods can use linear learning methods in the kernel-induced feature space without explicitly computing the high-dimensional feature mapping. Kernel-based methods measure the dissimilarity between data points by the distance in the feature space, which, in input space, corresponds to a distance measure involving the feature mapping [2]. If a kernel-based learning method is used as a lossy source coding scheme, its optimal rate-distortion tradeoff is characterized by the rate-distortion function associated with the distortion measure defined by the kernel feature map [3]. Successful applications of kernel methods in learning problems and the flexibility to create various distance measures suggest that kernel-based distortion measures can be suitable for certain lossy compression problems. However, the rate-distortion function of such a distortion measure has yet to be evaluated analytically. Although there are several kernel-based approaches to vector quantization [4,5], their rate-distortion tradeoffs are still unknown.
In this paper, we derive bounds for the rate-distortion functions of kernel-based distortion measures. We consider two schemes for reconstructing inputs in lossy coding methods. One is to obtain a reconstruction in the original input space. Since kernel methods usually yield the result of learning as a linear combination of vectors in feature space, we need an additional step, such as preimaging [6], to obtain the reconstruction in input space. The other is to consider the linear combination of feature vectors as the reconstruction and measure the distortion in the feature space directly. We formulate the two reconstruction schemes (Section 3.1 and Section 3.2), and prove that the rate-distortion function of input space reconstruction provides an upper bound of that of feature space reconstruction (Section 3.3). We derive lower and upper bounds to the rate-distortion function of input space reconstruction, which are computable by one-dimensional numerical integrations alone in the case of translation-invariant and isotropic kernel functions (Section 4.1 and Section 4.2). We also provide an upper bound to the rate-distortion function of feature space reconstruction for general positive definite kernel functions (Section 4.4). In the usual applications of kernel-based quantization algorithms, one fixes the rate by determining the number of quantized points, and minimizes the average distortion for training data. The distortion-rate function, which is the inverse function of the rate-distortion function, gives the minimum achievable expected distortion (or the distortion for test data) at the fixed rate. The derived bounds approximately characterize such optimal tradeoffs between the rate and expected distortion.
Furthermore, we design a vector quantizer using the kernel K-means method and compare its performance with the derived rate-distortion bounds (Section 5). We also compute the preimages of the quantized points in feature space to investigate the performance of the quantizer in input space. The experiments using synthetic and image data suggest that the rate-distortion bounds of reconstruction in input space are accurate at low distortion levels, while the upper bound for reconstruction in feature space is informative at high distortion levels.

2. Rate-Distortion Function

Let X and Y be random variables of input and reconstruction taking values in X and Y, respectively. For a non-negative distortion measure d(x, y) between x and y, the rate-distortion function R(D) of the source X ∼ p(x) is defined by
$$R(D) = \inf_{q(y|x):\, E[d(X,Y)] \le D} I(q),$$
where I(q) = I(X; Y) is the mutual information and E denotes the expectation with respect to q(y|x)p(x). R(D) gives the minimum achievable rate R under the given distortion measure d [3,7]. The distortion-rate function is the inverse function of the rate-distortion function and is denoted by D(R).
If the conditional distributions q_s(y|x) achieve the minimum of the following Lagrange functional parameterized by s ≥ 0,
$$L(q) = I(q) + s\bigl( E[d(X,Y)] - D \bigr),$$
then the rate-distortion function is parametrically given by
$$R(D_s) = I(q_s), \qquad D_s = \iint q_s(y|x)\, p(x)\, d(x,y)\, dx\, dy.$$
The parameter s corresponds to the (negated) slope of the tangent to R(D) at (D_s, R(D_s)) and hence is referred to as the slope parameter [3]. Alternatively, if there exists a marginal reconstruction density q_s(y) that minimizes the functional
$$F(q) = -\frac{1}{s}\, E\left[ \log \int e^{-s\, d(X,y)}\, q(y)\, dy \right],$$
then the optimal conditional reconstruction distributions are given by
$$q_s(y|x) = \frac{e^{-s\, d(x,y)}\, q_s(y)}{\int e^{-s\, d(x,y)}\, q_s(y)\, dy}$$
(see, for example, [3,8]).
From the properties of the rate-distortion function R(D), we know that R(D) > 0 for 0 < D < D_max, where
$$D_{\max} = \inf_{y} \int p(x)\, d(x,y)\, dx,$$
and R(D) = 0 for D ≥ D_max [3] (p. 90). Hence, D_max = lim_{R→0} D(R).
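For discretized source and reconstruction alphabets, the parametric solution above can be traced numerically by the Blahut–Arimoto algorithm. The following sketch is our illustration (not part of the original paper): for a fixed slope parameter s, it alternates between the optimal conditional distribution q_s(y|x) and the induced marginal q(y), and returns one point (D_s, R(D_s)) in nats; the toy source, grid, and iteration count are placeholder choices.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=300):
    """Return (D_s, R(D_s)) in nats for source p_x, distortion matrix dist[x, y],
    and slope parameter s, by alternating minimization."""
    n_y = dist.shape[1]
    q_y = np.full(n_y, 1.0 / n_y)                 # marginal reconstruction distribution
    for _ in range(n_iter):
        q_y_x = q_y * np.exp(-s * dist)           # optimal conditional for current marginal
        q_y_x /= q_y_x.sum(axis=1, keepdims=True)
        q_y = np.maximum(p_x @ q_y_x, 1e-30)      # induced marginal q(y); floor avoids exact zeros
    D = np.sum(p_x[:, None] * q_y_x * dist)
    R = np.sum(p_x[:, None] * q_y_x * np.log(q_y_x / q_y[None, :]))
    return D, R

# Toy usage: a uniform scalar source on a grid with squared distortion.
x = np.linspace(-1.0, 1.0, 81)
y = np.linspace(-1.0, 1.0, 81)
p_x = np.full(x.size, 1.0 / x.size)
dist = (x[:, None] - y[None, :]) ** 2
for s in [1.0, 5.0, 20.0]:
    print(blahut_arimoto(p_x, dist, s))
```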

3. Kernel-Based Distortion Measures

In kernel-based learning methods, data points in the input space X are mapped into some high-dimensional feature space H by a feature mapping ϕ. The similarity between two points x and y in X is then measured by the inner product ⟨ϕ(x), ϕ(y)⟩ in H.
The inner product is directly evaluated by a nonlinear function in input space,
$$K(x, y) = \langle \phi(x), \phi(y) \rangle,$$
which is called the kernel function. Mercer's theorem ensures that there exists some ϕ such that Equation (4) holds if K is a positive definite kernel [1]. This enables us to avoid explicitly computing the feature map ϕ in the potentially high-dimensional space H, which is called the kernel trick. Many learning methods that can be expressed using only the inner products between data points have been kernelized [1].
We identify the feature space H with the reproducing kernel Hilbert space (RKHS) associated with the kernel function K via the canonical feature map ϕ(x) = K(·, x) [9] (Lemma 4.19). We assume that the input space X is a subset of ℝ^m and that the kernel function K is continuous [9] (Lemma 4.29). We focus on the squared norm in feature space as the distortion measure and consider two reconstruction schemes in the following respective subsections.

3.1. Reconstruction in Input Space

If we restrict ourselves to reconstruction in the input space, that is, the reconstruction y ∈ X ⊆ ℝ^m is computed for each input x ∈ X, the distortion measure is naturally defined by
$$d_{\mathrm{inp}}(x, y) = \| \phi(x) - \phi(y) \|^2 = K(x, x) + K(y, y) - 2 K(x, y).$$
Note that the reconstruction ϕ(y) of ϕ(x) is restricted to the subset {ϕ(y); y ∈ X} of the feature space. To obtain a reconstruction in input space, we need a technique such as preimaging [6].
This is a difference distortion measure if and only if the kernel function is translation invariant, that is, K(x + a, y + a) = K(x, y) for any a ∈ X. In this case, the distortion measure is expressed as
$$d_{\mathrm{inp}}(x, y) = \rho(x - y),$$
where ρ(z) = 2(C − K(z, 0)) and C = K(0, 0). The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by R_inp(D) (D_inp(R), resp.), and the maximum distortion D_max in Equation (3) is denoted by D_max,inp, that is,
$$D_{\max,\mathrm{inp}} = E[K(X, X)] + \inf_{y} \bigl\{ K(y, y) - 2\, E[K(X, y)] \bigr\},$$
which, in the translation invariant case, becomes D_max,inp = 2(C − sup_y E[K(X, y)]).
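As a concrete illustration (ours, not from the paper), the snippet below evaluates d_inp for the Gaussian kernel and estimates D_max,inp = 2(C − sup_y E[K(X, y)]) by replacing the expectation with a sample mean and the supremum with a search over a finite set of candidate points; the kernel parameter and the source sample are placeholder assumptions.

```python
import numpy as np

def gauss_kernel(x, y, gam=5.0):
    """Gaussian kernel K(x, y) = exp(-gam * ||x - y||^2)."""
    return np.exp(-gam * np.sum((x - y) ** 2, axis=-1))

def d_inp(x, y, gam=5.0):
    """Kernel-induced distortion ||phi(x) - phi(y)||^2 for input space reconstruction."""
    return gauss_kernel(x, x, gam) + gauss_kernel(y, y, gam) - 2 * gauss_kernel(x, y, gam)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(4000, 2))          # hypothetical source sample
candidates = rng.uniform(-1, 1, size=(200, 2))  # candidate reconstruction points y
mean_K = np.array([gauss_kernel(X, y).mean() for y in candidates])
D_max_inp = 2 * (1.0 - mean_K.max())            # C = K(0, 0) = 1 for the Gaussian kernel
print(d_inp(X[0], X[1]), D_max_inp)
```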

3.2. Reconstruction in Feature Space

Suppose we have a sample of length n in the input space, S = {x_1, …, x_n}, so that {ϕ(x_1), …, ϕ(x_n)} spans a linear subspace of the feature space. If we compute the reconstruction by the linear combination ∑_{i=1}^{n} α_i ϕ(x_i) for α_i ∈ ℝ, i = 1, …, n, and consider it as the reconstruction in feature space, the distortion can be measured by
$$d_{\mathrm{fea}}(x, \alpha) = d_{\mathrm{fea}}^{[S]}(x, \alpha) = \Bigl\| \phi(x) - \sum_{i=1}^{n} \alpha_i\, \phi(x_i) \Bigr\|^2 = K(x, x) - 2\, \alpha^{T} k(x) + \alpha^{T} \mathbf{K}\, \alpha,$$
where α = (α_1, …, α_n)^T ∈ ℝ^n,
$$k(x) = \bigl( K(x_1, x), \ldots, K(x_n, x) \bigr)^{T},$$
and K = (K(x_i, x_j))_{ij} is the Gram matrix. Note that the reconstruction is identified with the coefficients α, whose domain is not identical to the input space X. Although the distortion measure d_fea depends on the sample S, we omit the dependence in the notation since we consider a fixed design of S for a sufficiently large n. The sample does not have to be distributed according to the source distribution, but it is required to cover the support of the source.
The rate-distortion function (distortion-rate function, resp.) for this distortion measure is denoted by R_fea(D) (D_fea(R), resp.), and the maximum distortion D_max in Equation (3) is given by
$$D_{\max,\mathrm{fea}} = E[K(X, X)] - E[k(X)]^{T} \mathbf{K}^{-1} E[k(X)],$$
which is obtained by direct minimization of the quadratic function of α, ∫ d_fea(x, α) p(x) dx.
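For illustration only (with a hypothetical Gaussian kernel, sample design S, and source data), the following sketch evaluates d_fea(x, α) and a plug-in estimate of D_max,fea, replacing the expectations by sample means; the small jitter added before solving the linear system is a numerical convenience, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
gam = 5.0
kern = lambda A, B: np.exp(-gam * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

S = rng.uniform(-1, 1, size=(200, 2))      # fixed design sample x_1, ..., x_n
X = rng.uniform(-1, 1, size=(4000, 2))     # data approximating the source distribution

K_gram = kern(S, S)                        # Gram matrix K
k_X = kern(X, S)                           # row i is k(x_i)^T = (K(x_1, x_i), ..., K(x_n, x_i))

def d_fea(i, alpha):
    """K(x,x) - 2 alpha^T k(x) + alpha^T K alpha for the i-th data point (K(x,x) = 1 here)."""
    return 1.0 - 2 * k_X[i] @ alpha + alpha @ K_gram @ alpha

# D_max,fea = E[K(X,X)] - E[k(X)]^T K^{-1} E[k(X)], with expectations replaced by sample means.
mean_k = k_X.mean(axis=0)
jitter = 1e-8 * np.eye(len(S))             # stabilizes the solve for an ill-conditioned Gram matrix
D_max_fea = 1.0 - mean_k @ np.linalg.solve(K_gram + jitter, mean_k)
print(d_fea(0, np.full(len(S), 1.0 / len(S))), D_max_fea)
```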

3.3. R_inp(D) and R_fea(D)

The following theorem claims that R_inp(D) provides an upper bound of R_fea(D) when n is sufficiently large.
Theorem 1.
If the input space X is bounded and there exists a conditional density achieving the infimum in the definition of R_inp(D), then, for any ε > 0, any D ≥ ε, and sufficiently large n, the following inequality holds:
$$R_{\mathrm{fea}}(D + \varepsilon) \le R_{\mathrm{inp}}(D).$$
The proof is given in Appendix A. This theorem shows that the feature space reconstruction gives better rates, since a single feature vector ϕ(y) can be approximated by a linear combination ∑_{i=1}^{n} α_i ϕ(x_i) when n is sufficiently large.

4. Rate-Distortion Bounds

Since the rate-distortion problem (Section 2) is rarely solved in closed form [8], we derive bounds to R_inp(D) and R_fea(D).

4.1. Lower Bound to R_inp(D)

Although the Shannon lower bound to R(D) is defined for difference distortion measures in general [3] (p. 92), it diverges to −∞ for the distortion measure in Equation (6) since ∫ e^{−sρ(z)} dz diverges to ∞. Hence, we consider an improved lower bound, which was introduced in [3] (p. 140). Let Q_B be the probability that ‖X‖ ≤ B. Then, R(D) is lower-bounded as
$$R(D) \ge Q_B\, h(p_B) - \max_{g \in G_{B,D}} h(g),$$
where h denotes the differential entropy,
$$p_B(x) = \frac{1}{Q_B}\, p(x)\, u(B - \|x\|),$$
and u is the step function. G_{B,D} is the set of all probability densities g(·) for which g(x) = 0 for ‖x‖ > B and ∫ ρ(z) g(z) dz ≤ D/Q_B.
In the case of the distortion measure in Equation (6), the maximum in Equation (10) is explicitly attained by
$$g_s(z) = \frac{1}{C_{B,s}} \exp\bigl( 2 s K(z, 0) \bigr)\, u(B - \|z\|),$$
where C_{B,s} = ∫_{‖z‖≤B} e^{2sK(z,0)} dz and s is related to D by ∫ ρ(z) g_s(z) dz = D/Q_B. Since its differential entropy is
$$h(g_s) = -s\, \frac{\partial \log C_{B,s}}{\partial s} + \log C_{B,s},$$
we arrive at the following theorem.
Theorem 2.
The rate-distortion function R_inp(D) is parametrically lower-bounded as
$$R_{\mathrm{inp}}(D_s) \ge R_{\mathrm{inp},L}(D_s) = Q_B\, h(p_B) + s\, \frac{\partial \log C_{B,s}}{\partial s} - \log C_{B,s}, \qquad D_s = Q_B \left( 2C - \frac{\partial \log C_{B,s}}{\partial s} \right).$$
If we further assume that the kernel function is radial, that is, K(x, y) = K(x − y, 0) = k(‖x − y‖) for some function k, the integrations above reduce to one-dimensional ones,
$$C_{B,s} = A(m) \int_0^B r^{m-1}\, e^{2 s k(r)}\, dr,$$
and
$$\frac{\partial \log C_{B,s}}{\partial s} = \frac{2}{C_{B,s}} \int_{\|z\| \le B} K(z, 0)\, e^{2 s K(z,0)}\, dz = \frac{2 A(m)}{C_{B,s}} \int_0^B r^{m-1}\, k(r)\, e^{2 s k(r)}\, dr,$$
where A(m) = m π^{m/2} / Γ(m/2 + 1) is the surface area of the m-dimensional unit sphere and Γ is the gamma function.
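As a rough illustration of how Theorem 2 can be evaluated, the following sketch (ours, not part of the paper) computes the one-dimensional integrals with scipy and returns the parametric pair (D_s, R_inp,L(D_s)) in nats; the Gaussian kernel parameter, the radius B, and the value of h(p_B) are placeholder assumptions.

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

m, gam, B = 2, 5.0, 1.0                        # dimension, kernel parameter, support radius
A_m = m * np.pi ** (m / 2) / gamma(m / 2 + 1)  # surface area of the unit sphere in R^m
k = lambda r: np.exp(-gam * r ** 2)            # Gaussian kernel profile, C = k(0) = 1

def C_Bs(s):
    return A_m * quad(lambda r: r ** (m - 1) * np.exp(2 * s * k(r)), 0, B)[0]

def dlogC_ds(s):
    num = 2 * A_m * quad(lambda r: r ** (m - 1) * k(r) * np.exp(2 * s * k(r)), 0, B)[0]
    return num / C_Bs(s)

def lower_bound_point(s, h_pB, Q_B=1.0, C=1.0):
    """Return (D_s, R_inp_L(D_s)) for slope parameter s (Theorem 2)."""
    h_gs = -s * dlogC_ds(s) + np.log(C_Bs(s))  # differential entropy of g_s
    D_s = Q_B * (2 * C - dlogC_ds(s))
    return D_s, Q_B * h_pB - h_gs

# Example: trace the bound over a range of slope parameters.
h_pB = 0.0   # hypothetical value; h(p_B) = 0 for a uniform density on a unit-volume support
for s in [0.5, 1.0, 2.0, 5.0, 10.0]:
    D_s, R_L = lower_bound_point(s, h_pB)
    print(f"s={s:5.1f}  D_s={D_s:.4f}  R_inp,L={max(R_L, 0.0):.4f} nats")
```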

4.2. Upper Bound to R_inp(D)

If d_inp in Equation (5) is a difference distortion measure, that is, K is translation invariant, then by choosing q(y|x) = g_s(y − x) for the density g_s in Equation (12), the following upper bound is obtained:
$$R_{\mathrm{inp}}(D_s) \le R_{\mathrm{inp},U}(D_s) = h(g_s * p) - h(g_s), \qquad D_s = 2C - \frac{\partial \log C_{B,s}}{\partial s},$$
where h(g_s) is given by Equation (13) and (g_s ∗ p)(y) = ∫ g_s(y − x) p(x) dx is the convolution of g_s and p. This type of upper bound was used to prove the asymptotic tightness of the Shannon lower bound (as D → 0) for a class of general sources and distortion measures [3,10,11,12]. However, this upper bound requires the evaluation of the differential entropy of the convolution.
The following theorem is derived from the facts that the spherical Gaussian distribution maximizes the entropy under the constraint that E[‖X‖²] is no greater than a constant, and that E[‖Y‖²] = E[‖X‖²] + E[‖Z‖²] holds for Y = X + Z ∼ g_s ∗ p, where Z ∼ g_s is independent of X.
Theorem 3.
If the kernel function is translation invariant and radial, K(x, y) = k(‖x − y‖), then R_inp(D) is parametrically upper-bounded as
$$R_{\mathrm{inp}}(D_s) \le R_{\mathrm{inp},G}(D_s) = \frac{m}{2} \log\bigl( 2 \pi e\, (v_p + v_s) \bigr) - h(g_s),$$
where
$$v_p = \frac{1}{m} \int \|x - \mu\|^2\, p(x)\, dx, \qquad \mu = \int x\, p(x)\, dx, \qquad v_s = \frac{1}{m} \int \|x\|^2\, g_s(x)\, dx = \frac{A(m)}{m\, C_{B,s}} \int_0^B r^{m+1}\, e^{2 s k(r)}\, dr,$$
and D_s is given by Equation (17) (and Equation (15)).
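Reusing the helper functions C_Bs and dlogC_ds (and the constants m, B, A_m, k, and numpy as np) from the sketch after Theorem 2, and under the same placeholder assumptions plus a hypothetical per-dimension source variance v_p, the upper bound of Theorem 3 needs only one additional one-dimensional integral for v_s.

```python
from scipy.integrate import quad

def upper_bound_point(s, v_p):
    """Return (D_s, R_inp_G(D_s)) in nats for slope parameter s (Theorem 3)."""
    v_s = A_m / (m * C_Bs(s)) * quad(
        lambda r: r ** (m + 1) * np.exp(2 * s * k(r)), 0, B)[0]
    h_gs = -s * dlogC_ds(s) + np.log(C_Bs(s))    # differential entropy of g_s
    D_s = 2 * 1.0 - dlogC_ds(s)                  # 2C - d(log C_Bs)/ds, with C = k(0) = 1
    R_G = m / 2 * np.log(2 * np.pi * np.e * (v_p + v_s)) - h_gs
    return D_s, R_G

v_p = 0.1   # hypothetical per-dimension source variance
for s in [0.5, 1.0, 2.0, 5.0, 10.0]:
    D_s, R_G = upper_bound_point(s, v_p)
    print(f"s={s:5.1f}  D_s={D_s:.4f}  R_inp,G={R_G:.4f} nats")
```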

4.3. Rate-Distortion Dimension

In this section, we evaluate the rate-distortion dimension [13] of the kernel-based distortion measure in Equation (5) to investigate its properties. We again focus on a radial kernel, K(x, y) = k(‖x − y‖), and assume that
$$\lim_{r \to 0} \frac{k(0) - k(r)}{r^{\alpha}} = \beta$$
holds for some α > 0 and β > 0. For example, the Gaussian kernel, k(r) = exp(−γr²) (γ > 0), satisfies Equation (19) with α = 2 and β = γ.
To examine the limit D → 0 of R_inp(D), we consider the asymptotic regime s → ∞. Since k(r) = k(0) − βr^α + o(r^α), it follows that
$$C_{B,s} = A(m) \int_0^B e^{2 s k(r)}\, r^{m-1}\, dr = A(m)\, e^{2 s k(0)}\, \frac{1}{\alpha} \left( \frac{1}{2 s \beta} \right)^{m/\alpha} \Gamma\!\left( \frac{m}{\alpha} \right) \bigl( 1 + o(1) \bigr),$$
$$\int_0^B 2 k(r)\, e^{2 s k(r)}\, r^{m-1}\, dr = \frac{2 k(0)\, C_{B,s}}{A(m)} - e^{2 s k(0)}\, \frac{1}{s \alpha} \left( \frac{1}{2 s \beta} \right)^{m/\alpha} \Gamma\!\left( 1 + \frac{m}{\alpha} \right) \bigl( 1 + o(1) \bigr),$$
and
$$\frac{\partial \log C_{B,s}}{\partial s} = \frac{\int_0^B 2 k(r)\, e^{2 s k(r)}\, r^{m-1}\, dr}{\int_0^B e^{2 s k(r)}\, r^{m-1}\, dr} = 2 k(0) - \frac{m}{\alpha s} + o\!\left( \frac{1}{s} \right).$$
Thus, from Equations (14) and (17), we have
$$\log D_s = -\log s + O(1)$$
for both the lower and upper bounds, and, from Equation (13),
$$h(g_s) = -\frac{m}{\alpha} \log s + O(1) = \frac{m}{\alpha} \log D_s + O(1).$$
Since d_inp in Equation (5) is a squared norm for a valid RKHS kernel K, the rate-distortion dimension of the source distribution p is defined by [13]
$$\dim_R(p) = \lim_{D \to 0} \frac{R_{\mathrm{inp}}(D)}{-\frac{1}{2} \log D}.$$
From Theorems 2 and 3 and Equation (20), we conclude the following.
Theorem 4.
If the source has a finite differential entropy, positive and finite v_p as defined in Equation (18), and a bounded support, that is, there exists a finite B > 0 such that Q_B = 1 in Equation (11), and if the radial kernel K(x, y) = k(‖x − y‖) satisfies Equation (19) for some α > 0 and β > 0, then the rate-distortion dimension in Equation (21) of R_inp(D) is given by
$$\dim_R(p) = \frac{2m}{\alpha}.$$
This theorem shows that the rate-distortion dimension depends only on the dimensionality of the input space and is independent of the dimensionality of the feature space. In the case of the linear kernel, K(x, y) = ⟨x, y⟩, with ϕ(x) = x, the distortion measure in Equation (5) reduces to the usual squared distortion measure, ‖x − y‖². It can be shown that under norm-based distortion measures, including the squared distortion measure, the rate-distortion dimension of a source with an m-dimensional density is m [11,12]. From the preceding theorem, this is also the case for a general radial kernel if the kernel function has order α = 2, as does the Gaussian kernel. Expression (22) for the rate-distortion dimension will be examined through a numerical experiment in Section 5.1.

4.4. Upper Bound to R_fea(D)

We construct an upper bound to the rate-distortion function R_fea(D). We choose the conditional distribution of the reconstruction as
$$q(\alpha | x) = \mathcal{N}\!\left( \alpha;\; m_K(x),\; \tilde{\mathbf{K}}^{-1} / (2s) \right),$$
where K̃ = K + cI,
$$m_K(x) = \tilde{\mathbf{K}}^{-1} k(x),$$
and N(·; m, Σ) denotes the n-dimensional normal density with mean m and covariance matrix Σ. Here, we have introduced the regularization constant c ≥ 0 and the n × n identity matrix I. The conditional distribution in Equation (23) is implied by Equation (2) and the approximation q_s(α) = N(α; 0, I/(2sc)). This reconstruction distribution yields the following upper bound:
$$R_{\mathrm{fea}}(D_s) \le R_{\mathrm{fea},U}(D_s) = h(M_p) - h\bigl( \mathcal{N}(\alpha; m_K(x), \tilde{\mathbf{K}}^{-1}/(2s)) \bigr), \qquad D_s = \frac{n - c\, \mathrm{tr}\{\tilde{\mathbf{K}}^{-1}\}}{2s} + D_{\min}(c),$$
where M_p(α) = ∫ N(α; m_K(x), K̃^{-1}/(2s)) p(x) dx,
$$h\bigl( \mathcal{N}(\alpha; m_K(x), \tilde{\mathbf{K}}^{-1}/(2s)) \bigr) = \frac{n}{2} \log\!\left( \frac{\pi e}{s}\, |\tilde{\mathbf{K}}|^{-1/n} \right),$$
which is independent of the input x, and
$$D_{\min}(c) = E[K(X,X)] - \mathrm{tr}\bigl\{ \tilde{\mathbf{K}}^{-1} E[k(X) k(X)^T] \bigr\} - c\, \mathrm{tr}\bigl\{ \tilde{\mathbf{K}}^{-1} E[k(X) k(X)^T]\, \tilde{\mathbf{K}}^{-1} \bigr\}.$$
If c = 0, D_min is the mean of the variance of the prediction by the associated Gaussian process [14].
Further upper-bounding the differential entropy h(M_p) by the Gaussian entropy, we have the following theorem.
Theorem 5.
The rate-distortion function R_fea(D) is upper-bounded as
$$R_{\mathrm{fea}}(D) \le R_{\mathrm{fea},G}(D) = \frac{1}{2} \log \left| \mathbf{I} + \frac{n - c\, \mathrm{tr}\{\tilde{\mathbf{K}}^{-1}\}}{D - D_{\min}(c)}\, \tilde{\mathbf{K}}^{-1} \mathbf{C} \right|,$$
where
$$\mathbf{C} = E[k(X) k(X)^T] - E[k(X)]\, E[k(X)]^T.$$
The proof is given in Appendix B. In the simplest case, where ϕ(x) = x ∈ ℝ^1, n = 1, and the source is Gaussian, p(x) = N(x; 0, σ²), the upper bound in Equation (27) reduces to
$$R_{\mathrm{fea},G}(D) = \frac{1}{2} \log\left( 1 + \frac{\sigma^2}{D} \right),$$
which is an asymptotically (as D → 0) tight upper bound of the well-known rate-distortion function of the Gaussian source under the squared distortion measure, R(D) = (1/2) log(σ²/D) [3,7].
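The following sketch (ours; the Gaussian kernel parameter, regularization constant, and data are placeholder choices) evaluates the bound of Theorem 5 with the expectations replaced by sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
gam, c = 5.0, 1e-3
kern = lambda A, B: np.exp(-gam * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

S = rng.uniform(-1, 1, size=(200, 2))       # sample defining the reconstruction coefficients
X = rng.uniform(-1, 1, size=(4000, 2))      # data approximating the source

K_tilde = kern(S, S) + c * np.eye(len(S))   # regularized Gram matrix K + cI
k_X = kern(X, S)                            # row i is k(x_i)^T
Kinv = np.linalg.inv(K_tilde)

Ekk = k_X.T @ k_X / len(X)                  # E[k(X) k(X)^T]
Ek = k_X.mean(axis=0)                       # E[k(X)]
C_mat = Ekk - np.outer(Ek, Ek)
D_min = 1.0 - np.trace(Kinv @ Ekk) - c * np.trace(Kinv @ Ekk @ Kinv)  # E[K(X,X)] = 1

def R_fea_G(D):
    """Upper bound of Theorem 5 (in nats) for a distortion level D > D_min."""
    coef = (len(S) - c * np.trace(Kinv)) / (D - D_min)
    return 0.5 * np.linalg.slogdet(np.eye(len(S)) + coef * Kinv @ C_mat)[1]

for D in D_min + np.array([0.05, 0.1, 0.2, 0.4]):
    print(f"D={D:.3f}  R_fea,G={R_fea_G(D):.4f} nats")
```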

5. Experimental Evaluation

We numerically evaluate the rate-distortion bounds obtained in the previous section. Designing a quantizer by the kernel K-means algorithm, we compare its performance with the bounds.
We focus on the case of the Gaussian kernel,
$$K(x, y) = e^{-\gamma \|x - y\|^2},$$
with kernel parameter γ > 0.

5.1. Synthetic Data

As a source, we first assumed the uniform distribution on the union of the two regions, C_1 = {x ∈ ℝ^m : A(m)‖x‖^m ≤ m/2} and C_2 = {x ∈ ℝ^m : m² ≤ A(m)‖x‖^m ≤ m(m + 1/2)}, where C_1 and C_2 have equal volumes and C_1 ∪ C_2 has volume 1. Accordingly, B = (m(m + 1/2)/A(m))^{1/m} and Q_B = 1 in Equation (10) and the succeeding equations in Section 4.1 and Section 4.2.
We used the trapezoidal rule to compute the one-dimensional integrations in the lower bound R_inp,L and the upper bound R_inp,G. We generated an i.i.d. sample of size n = 200 from the source to compute k(x) and the Gram matrix K for R_fea,G in Equation (27). Generating another 4000 data points, we approximated the required expectations. We optimized the regularization coefficient c to minimize the upper bound R_fea,G for each D.
Using the same data set of size 4000 as a training data set, we ran the kernel K-means algorithm 10 times with random initializations to obtain the minimum distortion for each rate. Varying the number K of quantized points from 2^1 to 2^10, for each K, we counted the effective number K_eff of quantized points that have at least one assigned data point and computed the rate as log₂ K_eff, since the quantizer is first order, that is, the block length is one. The kernel parameter γ was chosen so that a clear separation of C_1 and C_2 is obtained when K = 2.
After the training, we computed the distortion and rate for the test data set by assigning each of the 20,000 test data points generated from the same source to the nearest quantized point in the feature space.
For each quantized point, we obtained its preimage. That is, if the kth quantized point is expressed as ∑_{i=1}^{n} α_{ki} ϕ(x_i), its preimage is
$$y_k = \underset{y}{\operatorname{argmin}} \Bigl\| \phi(y) - \sum_{i=1}^{n} \alpha_{ki}\, \phi(x_i) \Bigr\|^2 = \underset{y}{\operatorname{argmax}} \sum_{i=1}^{n} \alpha_{ki}\, K(y, x_i).$$
We used the mean shift procedure for the maximization, although this procedure only guarantees the convergence to a local maximum [15,16].
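A compact sketch of this experimental pipeline (ours, with placeholder data, kernel parameter, and iteration counts) is given below: kernel K-means clustering in feature space, the effective rate log₂ K_eff, and a fixed-point preimage iteration of the mean-shift type for the Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
gam, n_clusters = 5.0, 8
X = rng.uniform(-1, 1, size=(2000, 2))                      # hypothetical training data
kern = lambda A, B: np.exp(-gam * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
K = kern(X, X)

def kernel_kmeans(K, k, n_iter=50):
    """Assign points to k clusters by minimizing the feature-space distortion."""
    labels = rng.integers(k, size=K.shape[0])
    for _ in range(n_iter):
        d = np.full((K.shape[0], k), np.inf)
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size == 0:
                continue
            # ||phi(x) - centroid_j||^2 up to the constant K(x, x)
            d[:, j] = -2 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        labels = d.argmin(axis=1)
    return labels

labels = kernel_kmeans(K, n_clusters)
k_eff = len(np.unique(labels))                              # quantized points actually used
rate_bits = np.log2(k_eff)                                  # rate of the first-order quantizer

def preimage(alpha, X, y0, n_iter=100):
    """Fixed-point iteration maximizing sum_i alpha_i K(y, x_i) for the Gaussian kernel."""
    y = y0.copy()
    for _ in range(n_iter):
        w = alpha * np.exp(-gam * ((X - y) ** 2).sum(axis=1))
        y = w @ X / w.sum()
    return y

# Preimage of the centroid of the most populated cluster (uniform weights over its members).
c0 = np.bincount(labels).argmax()
idx0 = np.flatnonzero(labels == c0)
alpha = np.zeros(len(X)); alpha[idx0] = 1.0 / idx0.size
print(rate_bits, preimage(alpha, X, X[idx0].mean(axis=0)))
```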
The obtained bounds and the quantizer performances are displayed in Figure 1a,b for m = 2 and m = 10, respectively, in the form of distortion-rate functions. The values of D_max in Equations (7) and (9) are also indicated in the figures.
In both dimensions, the upper bound D_fea,G is smaller than D_inp,G at low rates, while the bound is above the quantizer performance. However, the value of D_max,fea suggests that the bound is informative at low rates. As the rate becomes higher, the lower and upper bounds for the input space reconstruction, D_inp,L and D_inp,G, approach each other. In fact, they sandwich the quantizer performance tightly in the two-dimensional case, which suggests that the rate-distortion function for the feature space reconstruction, R_fea(D), is close to the rate-distortion function of the input space reconstruction, R_inp(D), at high rates.
We see that the quantizer performances for d_fea and those for d_inp approach each other as the rate R grows. The upper bound D_inp,G reasonably approximates the quantizer performance by the preimages, and it indicates that, in the two-dimensional case (Figure 1a), the results for R = 2 and 3 bits can be improved by at least about 1 bit.
At low distortion levels, each source output should be reconstructed within a small neighborhood in the feature space, where we can find another point y in the input space whose feature map ϕ(y) is sufficiently close to the reconstruction. This suggests that the rate-distortion function of feature space reconstruction is well approximated by the rate-distortion function of input space reconstruction. In other words, combining multiple input points to form a reconstruction in feature space does not help reduce distortion; a single input point mapped into feature space suffices. Hence, the rate-distortion bounds of input space reconstruction may be informative at low distortion levels.
In the 10-dimensional case (Figure 1b), the distortion on the test data set is close to D_inp,G(R) or above it at high rates. This may be due to overfitting of the kernel K-means to the training data set of size 4000. That is, as the rate grows, the distortion on the training data set decreases and the discrepancy between the distortions on the training and test sets increases.
To examine the asymptotic behavior of R_inp(D) discussed in Section 4.3, we computed R_inp,L(D) and R_inp,G(D) for small D, that is, for large s. In addition to the Gaussian kernel in Equation (29), which has α = 2 in Equation (19), we applied the Laplacian kernel,
$$K(x, y) = e^{-\gamma \|x - y\|},$$
which corresponds to α = 1. The kernel parameter of the Laplacian kernel was set to the square root of the value used for the Gaussian kernel.
The rate-distortion bounds R_inp,L(D) and R_inp,G(D) divided by (−log D)/2 for small distortion levels are shown in Figure 2a,b for m = 2 and m = 10, respectively. We can see that, in each case, the ratio tends to 2m/α, that is, the rate-distortion dimension evaluated in Equation (22), as D → 0. For distortion levels smaller than those presented in Figure 2, the ratios start oscillating due to numerical integration errors.

5.2. Image Data

We carried out a similar evaluation of the rate-distortion bounds and quantizer performances for a grayscale image data set extracted from the COIL-20 data set [18]. We used the first of the 20 categories of images, which consists of 72 images of size 32 × 32. Dividing each 32 × 32 image into small patches of size 2 × 2 (m = 4), we obtained 256 data points from each image, and 18,432 data points in total. Removing duplicate data points, we finally obtained 13,368 data points. We used the first 2048 data points as the training data and the remaining 11,320 as the test data. The training data set was also used for approximating the expectations of kernel functions required to compute R_fea(D), and the first n = 256 data points were used as the sample data in the definition of d_fea. We evaluated only the upper bounds, R_fea,G and R_inp,G, since the lower bound R_inp,L requires estimating the source entropy from empirical data, which depends heavily on the estimation method and is left for a more detailed future study.
Each dimension was normalized so that it had mean 0 and variance 1. Hence, v_p in R_inp,G was approximated by the empirical variance, 1. The boundary B in R_inp,G was approximated by the maximum norm of the training data points.
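For concreteness, a small sketch (ours) of this preprocessing is shown below; it assumes the 72 images of the first COIL-20 category are already loaded into an array `images` of shape (72, 32, 32) (loading code omitted), and the ordering after duplicate removal is not meant to reproduce the paper's exact split.

```python
import numpy as np

def extract_patches(images, size=2):
    """Cut non-overlapping size x size patches, drop duplicates, and normalize each dimension."""
    n, h, w = images.shape
    patches = (images.reshape(n, h // size, size, w // size, size)
                     .transpose(0, 1, 3, 2, 4)
                     .reshape(-1, size * size))              # 256 patches per image, m = 4
    patches = np.unique(patches, axis=0)                     # remove duplicate data points
    return (patches - patches.mean(axis=0)) / patches.std(axis=0)

# data = extract_patches(images)
# train, test = data[:2048], data[2048:]
```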
The upper bounds and quantizer performances are presented in Figure 3. Although the upper bounds are loose and lie above the respective quantizer performances, the upper bound D_inp,G(R) is roughly predictive of the quantizer performance in the input space, and so is min{D_inp,G(R), D_fea,G(R)} for the reconstruction in the feature space.

6. Conclusions

In this paper, we have shown upper and lower bounds for the rate-distortion functions associated with kernel feature mapping. As suggested in Section 5, the upper bound for the reconstruction in feature space is informative at high distortion levels while the bounds for the reconstruction in input space are informative at low distortion levels. We have also evaluated the rate-distortion dimension of sources with bounded support under kernel-based distortion measures, which shows the asymptotic behavior of the rate-distortion function. Our future directions include deriving tighter bounds and exact evaluation of the rate-distortion function in some special cases. In particular, it is an important undertaking to derive a lower bound to the rate-distortion function of the reconstruction in feature space.

Acknowledgments

The author would like to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported in part by the Japan Society for the Promotion of Science (JSPS) grants 25120014, 15K16050, and 16H02825.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Theorem 1

Proof. 
Let q(y|x) be the conditional density for x ∈ X that achieves the infimum in the definition of R_inp(D). Then, for Y ∼ ∫ q(y|x) p(x) dx, it holds that I(X; Y) = R_inp(D) and
$$E\bigl[ \| \phi(X) - \phi(Y) \|^2 \bigr] \le D.$$
Since the input space X is bounded and separable, and the kernel function K is continuous, for any ε > 0 and y ∈ X, there exist coefficients {α_i(y)} such that
$$\Bigl\| \phi(y) - \sum_{i=1}^{n} \alpha_i(y)\, \phi(x_i) \Bigr\| \le \frac{\varepsilon}{3 \sqrt{D}}$$
holds when n is sufficiently large.
Let α(y) = (α_1(y), …, α_n(y))^T and
$$q(\alpha | x) = \int \delta\bigl( \alpha - \alpha(y) \bigr)\, q(y|x)\, dy,$$
where δ is Dirac's delta function. Then, for A ∼ ∫ q(α|x) p(x) dx, it follows from the triangle inequality that
$$E\bigl[ d_{\mathrm{fea}}(X, A) \bigr] = E\Bigl[ \bigl\| \phi(X) - \textstyle\sum_{i=1}^{n} \alpha_i(Y)\, \phi(x_i) \bigr\|^2 \Bigr] \le E\bigl[ \| \phi(X) - \phi(Y) \|^2 \bigr] + 2\, E\Bigl[ \| \phi(X) - \phi(Y) \|\, \bigl\| \phi(Y) - \textstyle\sum_{i=1}^{n} \alpha_i(Y)\, \phi(x_i) \bigr\| \Bigr] + E\Bigl[ \bigl\| \phi(Y) - \textstyle\sum_{i=1}^{n} \alpha_i(Y)\, \phi(x_i) \bigr\|^2 \Bigr],$$
and hence
$$E\bigl[ d_{\mathrm{fea}}(X, A) \bigr] \le D + \frac{2\varepsilon}{3} + \frac{\varepsilon^2}{9D} \le D + \varepsilon.$$
To obtain Inequality (A3), we used Equations (A1) and (A2) together with Jensen's inequality,
$$\bigl( E[\, \| \phi(X) - \phi(Y) \| \,] \bigr)^2 \le E\bigl[ \| \phi(X) - \phi(Y) \|^2 \bigr] \le D.$$
Thus, from Equation (A4) and the data-processing inequality [7], we have
$$R_{\mathrm{fea}}(D + \varepsilon) \le I(X; A) \le I(X; Y) = R_{\mathrm{inp}}(D),$$
which completes the proof.   ☐

Appendix B. Proof of Theorem 5

Proof. 
The mean and covariance matrix of the random vector A ∼ M_p(α) are
$$E[A] = \tilde{\mathbf{K}}^{-1} \int k(x)\, p(x)\, dx,$$
$$\mathrm{Cov}[A] = E[A A^T] - E[A]\, E[A]^T = \frac{1}{2s} \tilde{\mathbf{K}}^{-1} + \tilde{\mathbf{K}}^{-1} \left( \int k(x)\, k(x)^T p(x)\, dx \right) \tilde{\mathbf{K}}^{-1} - \tilde{\mathbf{K}}^{-1} \left( \int k(x)\, p(x)\, dx \right) \left( \int k(x)\, p(x)\, dx \right)^{T} \tilde{\mathbf{K}}^{-1} = \frac{1}{2s} \tilde{\mathbf{K}}^{-1} + \tilde{\mathbf{K}}^{-1} \mathbf{C}\, \tilde{\mathbf{K}}^{-1},$$
where C is defined by Equation (28).
Thus, the maximum entropy property of the Gaussian distribution implies that the differential entropy h(M_p) is upper-bounded as
$$h(M_p) \le \frac{n}{2} \log \left( 2 \pi e\, \Bigl| \frac{1}{2s} \tilde{\mathbf{K}}^{-1} + \tilde{\mathbf{K}}^{-1} \mathbf{C}\, \tilde{\mathbf{K}}^{-1} \Bigr|^{1/n} \right).$$
Combining this inequality with Equations (24) and (26), we have
$$R_{\mathrm{fea}}(D_s) \le \frac{1}{2} \log \bigl| \mathbf{I} + 2 s\, \tilde{\mathbf{K}}^{-1} \mathbf{C} \bigr|.$$
Solving Equation (25) for 2s and substituting it into the above expression, we obtain the upper bound in Equation (27).   ☐

References

1. Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2001.
2. Aizerman, M.A.; Braverman, E.A.; Rozonoer, L. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 1964, 25, 821–837.
3. Berger, T. Rate Distortion Theory: A Mathematical Basis for Data Compression; Prentice-Hall: Englewood Cliffs, NJ, USA, 1971.
4. Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. 2002, 13, 780–784.
5. Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S. A survey of kernel and spectral methods for clustering. Pattern Recognit. 2008, 41, 176–190.
6. Schölkopf, B.; Mika, S.; Burges, C.J.C.; Knirsch, P.; Müller, K.R.; Rätsch, G.; Smola, A.J. Input space versus feature space in kernel-based methods. IEEE Trans. Neural Netw. 1999, 10, 1000–1017.
7. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Interscience: Hoboken, NJ, USA, 1991.
8. Gray, R.M. Entropy and Information Theory, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2011.
9. Steinwart, I.; Christmann, A. Support Vector Machines; Springer: Berlin/Heidelberg, Germany, 2008.
10. Linkov, Y.N. Evaluation of ϵ-entropy of random variables for small ϵ. Probl. Inf. Transm. 1965, 1, 18–26.
11. Linder, T.; Zamir, R. On the asymptotic tightness of the Shannon lower bound. IEEE Trans. Inf. Theory 1994, 40, 2026–2031.
12. Koch, T. The Shannon lower bound is asymptotically tight. IEEE Trans. Inf. Theory 2016, 62, 6155–6161.
13. Kawabata, T.; Dembo, A. The rate-distortion dimension of sets and measures. IEEE Trans. Inf. Theory 1994, 40, 1564–1572.
14. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, USA, 2005.
15. Fukunaga, K.; Hostetler, L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory 1975, 21, 32–40.
16. Comaniciu, D.; Meer, P. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 603–619.
17. Watanabe, K. Rate-distortion analysis for kernel-based distortion measures. In Proceedings of the Eighth Workshop on Information Theoretic Methods in Science and Engineering, Copenhagen, Denmark, 24–26 June 2015.
18. Nene, S.A.; Nayar, S.K.; Murase, H. Columbia Object Image Library (COIL-20). Available online: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php (accessed on 4 July 2017).
Figure 1. Rate-distortion bounds and quantizer performances for (a) m = 2 and (b) m = 10 [17].
Figure 2. The ratios between the rate-distortion bounds and (−log D)/2 for (a) m = 2 and (b) m = 10. The bounds are for the Laplacian kernel (α = 1) and the Gaussian kernel (α = 2).
Figure 3. Upper bounds of the rate-distortion functions and quantizer performance for image data.
Figure 3. Upper bounds of the rate-distortion functions and quantizer performance for image data.
Entropy 19 00336 g003
