Greedy Algorithms for Optimal Distribution Approximation

Geiger, Bernhard C.; Böcherer, Georg

doi:10.3390/e18070262


by Bernhard C. Geiger * and Georg Böcherer

Institute for Communications Engineering, Technical University of Munich, Munich 80290, Germany

* Author to whom correspondence should be addressed.

Entropy 2016, 18(7), 262; https://doi.org/10.3390/e18070262

Submission received: 14 June 2016 / Revised: 1 July 2016 / Accepted: 11 July 2016 / Published: 18 July 2016

(This article belongs to the Section Information Theory, Probability and Statistics)


Abstract: The approximation of a discrete probability distribution $t$ by an M-type distribution $p$ is considered. The approximation error is measured by the informational divergence $D(t ∥ p)$, which is an appropriate measure, e.g., in the context of data compression. Properties of the optimal approximation are derived and bounds on the approximation error are presented, which are asymptotically tight. A greedy algorithm is proposed that solves this M-type approximation problem optimally. Finally, it is shown that different instantiations of this algorithm minimize the informational divergence $D(p ∥ t)$ or the variational distance $∥p - t∥_1$.

Keywords: distribution approximation; finite precision; informational divergence; greedy algorithm


1. Introduction

In this work, we consider finite precision representations of probabilistic models. Suppose the original model, or target distribution, has n non-zero mass points and is given by $t := (t_1, \dots, t_n)$. We wish to approximate it by a distribution $p := (p_1, \dots, p_n)$ of which each entry is a rational number with a fixed denominator. In other words, for every i, $p_i = c_i / M$ for some non-negative integer $c_i \leq M$. The distribution $p$ is called an M-type distribution, and the positive integer $M \geq n$ is the precision of the approximation. The problem is non-trivial, since computing the numerator $c_i$ by rounding $M t_i$ to the nearest integer in general fails to yield a distribution.

M-type approximations have many practical applications. For example, in political apportionment, M seats in a parliament need to be distributed to n parties according to the result of some vote $t$. This problem led, e.g., to the development of multiplier methods [1]. In communications engineering, example applications are finite precision implementations of probabilistic data compression [2], distribution matching [3], and finite-precision implementations of Bayesian networks [4,5]. In all of these applications, the M-type approximation $p$ should be close to the target distribution $t$ in the sense of an appropriate error measure. Common choices for this approximation error are the variational distance and the informational divergences:

$$
\begin{aligned}
∥p - t∥_1 &:= \sum_{i=1}^{n} |p_i - t_i| && \text{(1a)} \\
D(p ∥ t) &:= \sum_{i : p_i > 0} p_i \log \frac{p_i}{t_i} && \text{(1b)} \\
D(t ∥ p) &:= \sum_{i : t_i > 0} t_i \log \frac{t_i}{p_i} && \text{(1c)}
\end{aligned}
$$

where log denotes the natural logarithm.
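As an illustration of the three error measures in Equation (1), the following is a minimal Python sketch. The function names `variational_distance` and `divergence` and the numerical values of `t` and `p` are assumptions made for this example only; they do not appear in the original work.

```python
import math
from typing import Sequence


def variational_distance(p: Sequence[float], t: Sequence[float]) -> float:
    """Equation (1a): sum of absolute differences between p and t."""
    return sum(abs(pi - ti) for pi, ti in zip(p, t))


def divergence(a: Sequence[float], b: Sequence[float]) -> float:
    """Informational divergence D(a || b) in nats.

    The sum runs over the support of a, as in Equations (1b) and (1c).
    Returns infinity if a puts mass on a point where b has none.
    """
    total = 0.0
    for ai, bi in zip(a, b):
        if ai > 0:
            if bi == 0:
                return math.inf
            total += ai * math.log(ai / bi)
    return total


# Hypothetical target distribution and a 4-type candidate (illustrative values).
t = (0.6, 0.3, 0.1)
p = (3 / 4, 1 / 4, 0 / 4)

print(variational_distance(p, t))  # Equation (1a)
print(divergence(p, t))            # Equation (1b)
print(divergence(t, p))            # Equation (1c): infinite, since p_3 = 0 but t_3 > 0
```

The last line shows why Equation (1c) forces the approximation to keep every mass point of the target: a single zero entry in $p$ on the support of $t$ makes the divergence infinite.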

Variational distance and informational divergence Equation (1b) have been considered by Reznik [6] and Böcherer [7], respectively, who presented algorithms for optimal M-type approximation and developed bounds on the approximation error. In a recent manuscript [8], we extended the existing works on Equation (1a,b) to target distributions with infinite support ($n = \infty$) and refined the bounds from [6,7].

In this work, we focus on the approximation error Equation (1c). It is an appropriate cost function for data compression [9] (Theorem 5.4.3) and seems appropriate for the approximation of parameters in Bayesian networks (see Section 4). Nevertheless, to the best of the authors’ knowledge, the characterization of M-type approximations minimizing $D(t ∥ p)$ has not received much attention in the literature so far.

Our contributions are as follows. In Section 2, we present an efficient greedy algorithm to find M-type distributions minimizing Equation (1c). We then discuss in Section 3 the properties of the optimal M-type approximation and bound the approximation error Equation (1c). Our bound incorporates a reverse Pinsker inequality recently suggested in [10] (Theorem 7). The algorithm we present is an instance of a greedy algorithm similar to steepest ascent hill climbing [11] (Chapter 2.6). As a byproduct, we unify this work with [6,7,8] by showing that also the algorithms optimal w.r.t. variational distance Equation (1a) and informational divergence Equation (1b) are instances of the same general greedy algorithm, see Section 2.

2. Greedy Optimization

In this section, we define a class of problems that can be optimally solved by a greedy algorithm. Consider the following example:

Example 1.

Suppose there are n queues with jobs, and you have to select M jobs minimizing the total time spent. A greedy algorithm suggests to select successively the job with the shortest duration, among the jobs that are at the front of their queues. If the jobs in each queue are ordered by increasing duration, then this greedy algorithm is optimal.

We now make this precise: Let M be a positive integer, e.g., the number of jobs that have to be completed, and let $δ_i : \mathbb{N} \to \mathbb{R}$, $i = 1, \dots, n$, be a set of functions, e.g., $δ_i(k)$ is the duration of the k-th job in the i-th queue. Let furthermore $c_0 := (c_{1,0}, \dots, c_{n,0}) \in \mathbb{N}_0^n$ be a pre-allocation, representing a constraint that has to be fulfilled (e.g., in the i-th queue at least $c_{i,0}$ jobs have to be completed) or a chosen initialization. Then, the goal is to minimize

$$U(c) := \sum_{i=1}^{n} \sum_{k_i = c_{i,0}+1}^{c_i} δ_i(k_i) \tag{2}$$

over all final allocations $c := (c_1, \dots, c_n)$ satisfying $∥c∥_1 = M$ and, for every i, $c_i \geq c_{i,0}$. A greedy method to obtain such a final allocation is presented in Algorithm 1. We show in Appendix A.1 that this algorithm is optimal if the functions $δ_i$ satisfy certain conditions:

Algorithm 1: Greedy Algorithm

Initialize $k_i = c_{i,0}$, $i = 1, \dots, n$.
repeat $M - ∥c_0∥_1$ times
    Compute $δ_i(k_i + 1)$, $i = 1, \dots, n$.
    Compute $j = \min \operatorname{argmin}_i δ_i(k_i + 1)$. // (choose one minimal element)
    Update $k_j \leftarrow k_j + 1$.
end repeat
Return $c = (k_1, \dots, k_n)$.
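Algorithm 1 can be sketched in a few lines of Python. The version below is one possible rendering, not the authors' implementation: the function name `greedy_allocate` and the toy cost functions are hypothetical, and a binary heap replaces the explicit recomputation of all $δ_i(k_i + 1)$ in every iteration, which does not change the element selected. Ties are broken toward the smallest index, matching the $\min \operatorname{argmin}$ rule.

```python
import heapq
from typing import Callable, List, Sequence


def greedy_allocate(
    deltas: Sequence[Callable[[int], float]], M: int, c0: Sequence[int]
) -> List[int]:
    """Greedy allocation in the spirit of Algorithm 1.

    deltas[i](k) is the cost of the k-th unit assigned to index i (k >= 1)
    and is assumed to be non-decreasing in k.
    """
    k = list(c0)
    steps = M - sum(k)
    if steps < 0:
        raise ValueError("pre-allocation exceeds M")
    # Heap entries are (cost of the next unit, index); tuple ordering
    # breaks ties in favour of the smallest index.
    heap = [(deltas[i](k[i] + 1), i) for i in range(len(k))]
    heapq.heapify(heap)
    for _ in range(steps):
        _, j = heapq.heappop(heap)
        k[j] += 1
        heapq.heappush(heap, (deltas[j](k[j] + 1), j))
    return k


# Toy instance in the spirit of Example 1: three queues whose k-th job
# takes k, 2k, and 5 time units, respectively (illustrative values).
deltas = [lambda k: k, lambda k: 2 * k, lambda k: 5]
allocation = greedy_allocate(deltas, M=5, c0=[0, 0, 0])
total = sum(d(k) for d, c_i in zip(deltas, allocation) for k in range(1, c_i + 1))
print(allocation, total)
```

With the heap, each of the $M - ∥c_0∥_1$ iterations costs $O(\log n)$ instead of $O(n)$, which may matter when n is large.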

Proposition 1.

If the functions $δ_i(k)$ are non-decreasing in k, Algorithm 1 achieves a global minimum of $U(c)$ for a given pre-allocation $c_0$ and a given M.

Remark 1.

The minimizer of $U(c)$ may not be unique.

Remark 2.

If a function $f_i : \mathbb{R} \to \mathbb{R}$ is convex, the difference $δ_i(k) = f_i(k) - f_i(k - 1)$ is non-decreasing in k. Hence, Algorithm 1 also minimizes

$$U(c) = \sum_{i=1}^{n} f_i(c_i). \tag{3}$$
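The link between Equations (2) and (3) is a telescoping sum. The short derivation below spells out this step; it uses only the definitions given above.

```latex
% Telescoping the increments delta_i(k) = f_i(k) - f_i(k-1):
\sum_{k = c_{i,0}+1}^{c_i} \delta_i(k)
  = \sum_{k = c_{i,0}+1}^{c_i} \bigl( f_i(k) - f_i(k-1) \bigr)
  = f_i(c_i) - f_i(c_{i,0}).
% Summing over i, Equation (2) becomes
U(c) = \sum_{i=1}^{n} f_i(c_i) - \sum_{i=1}^{n} f_i(c_{i,0}),
% where the second sum depends only on the fixed pre-allocation c_0.
% Convexity of f_i gives f_i(k+1) - f_i(k) \ge f_i(k) - f_i(k-1),
% i.e., \delta_i(k+1) \ge \delta_i(k), so Proposition 1 applies.
```

Minimizing Equation (2) and minimizing Equation (3) therefore differ only by a constant and have the same minimizers.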

Remark 3.

Note that the functions $δ_i(k)$ need not be non-negative, i.e., in the view of Example 1, jobs may have negative duration. The functions $δ_i(k)$ are non-negative, though, if $f_i : \mathbb{R} \to \mathbb{R}$ in Remark 2 is convex and non-decreasing.

Remark 2 connects Algorithm 1 to steepest ascent hill climbing [11] (Chapter 2.6) with fixed step size and a constrained number of M steps.

We now show that instances of Algorithm 1 can find M-type approximations $p$ minimizing each of the cost functions in Equation (1). Noting that $p_i = c_i / M$ for some non-negative integer $c_i$, we can rewrite the cost functions as follows:

$$
\begin{aligned}
∥p - t∥_1 &= \frac{1}{M} \sum_{i=1}^{n} |c_i - M t_i| && \text{(4a)} \\
D(p ∥ t) &= \frac{1}{M} \left( \sum_{i : c_i > 0} c_i \log \frac{c_i}{t_i} \right) - \log M && \text{(4b)} \\
D(t ∥ p) &= \log M - H(t) - \sum_{i : t_i > 0} t_i \log c_i && \text{(4c)}
\end{aligned}
$$

where $H(\cdot)$ denotes entropy in nats.

Ignoring constant terms, these cost functions are all instances of Remark 2 for convex functions $f_i : \mathbb{R} \to \mathbb{R}$ (see Table 1). Hence, the three different M-type approximation problems set up by Equation (1) can all be solved by instances of Algorithm 1, for a trivial pre-allocation $c_0 = 0$ and after taking M steps. The final allocation $c$ simply defines the M-type approximation by $p_i = c_i / M$.

For variational distance optimal approximation, we showed in [8] (Lemma 3) that every optimal M-type approximation satisfies $p_i \geq ⌊M t_i⌋ / M$; hence, one may speed up the algorithm by pre-allocating $c_{i,0} = ⌊M t_i⌋$. We furthermore show in Lemma 1 below that the support of the optimal M-type approximation in terms of Equation (1c) equals the support of $t$ (if $M \geq n$). Assuming that $t$ is positive, one can pre-allocate the algorithm with $c_{i,0} = 1$. We summarize these instantiations of Algorithm 1 in Table 1.

This list of instances of Algorithm 1 minimizing information-theoretic or probabilistic cost functions can be extended. For example, the $χ^2$-divergences $χ^2(t ∥ p)$ and $χ^2(p ∥ t)$ can also be minimized, since the functions inside the respective sums are convex. However, Rényi divergences of orders $α \neq 1$ cannot be minimized by applying Algorithm 1.

3. $M$ -Type Approximation Minimizing $D (t ∥ p)$

As shown in the previous section, Algorithm 1 presents a minimizer of the problem $\min_p D(t ∥ p)$ if instantiated according to Table 1. Let us call this minimizer $t^a$. Recall that $t$ is positive and that $M \geq n$. The support of $t^a$ must contain the support of $t$, since otherwise $D(t ∥ t^a) = \infty$. Note further that the costs $δ_i(k)$ are negative if $t_i > 0$ and zero if $t_i = 0$; hence, if $t_i = 0$, the index i cannot be chosen by Algorithm 1, thus also $t_i^a = 0$. This proves:

Lemma 1.

If $M \geq n$, the supports of $t$ and $t^a$ coincide, i.e., $t_i = 0 \Leftrightarrow t_i^a = 0$.

The assumption that $t$ is positive and that $M \geq n$ hence comes without loss of generality. In contrast, neither variational distance nor informational divergence Equation (1b) requires $M \geq n$: As we show in [8], the M-type approximation problem remains interesting even if $M < n$.

Based on Lemma 1, the following example explains why the optimal M-type approximation does not necessarily result in a “small” approximation error:

Example 2.

Let $t = (1 - ε, \frac{ε}{n-1}, \dots, \frac{ε}{n-1})$ and $M = n$, hence by Lemma 1, $t^a = \frac{1}{n}(1, 1, \dots, 1)$. It follows that $D(t ∥ t^a) = \log n - H(t)$, which can be made arbitrarily close to $\log n$ by choosing a small positive ε.
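The divergence in Example 2 follows from inserting the uniform distribution into Equation (1c). The short calculation below makes the step explicit, using only quantities defined in the example.

```latex
% Insert t^a_i = 1/n into Equation (1c):
D(t \,\|\, t^{a})
  = \sum_{i=1}^{n} t_i \log \frac{t_i}{1/n}
  = \log n \sum_{i=1}^{n} t_i + \sum_{i=1}^{n} t_i \log t_i
  = \log n - H(t).
% For the distribution of Example 2,
H(t) = -(1-\varepsilon)\log(1-\varepsilon)
       - \varepsilon \log \frac{\varepsilon}{n-1}
  \;\longrightarrow\; 0 \quad \text{as } \varepsilon \to 0,
% so the divergence approaches \log n.
```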

In Table 1 we made use of [8] (Lemma 3), which says that every $p$ minimizing the variational distance $∥p - t∥_1$ satisfies $p_i \geq ⌊M t_i⌋ / M$, to speed up the corresponding instance of Algorithm 1 by proper pre-allocation. Initialization by rounding is not possible when minimizing $D(t ∥ p)$, as shown in the following two examples:

Example 3.

Let $t = (17/20, 3/40, 3/40)$ and $M = 20$. The optimal M-type approximation is $p = (8/10, 1/10, 1/10)$, hence $p_1 < ⌊M t_1⌋ / M$. Initialization via rounding off fails.

Example 4.

Let $t = (0.719, 0.145, 0.088, 0.048)$ and $M = 50$. The optimal M-type approximation is $p = (0.74, 0.14, 0.08, 0.04)$, hence $p_1 > ⌈M t_1⌉ / M$. Initialization via rounding up fails.
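Examples 3 and 4 can be checked numerically with the $D(t ∥ p)$ instance of Algorithm 1 from Table 1, i.e., with $δ_i(k) = t_i \log((k-1)/k)$ and pre-allocation $c_{i,0} = 1$. The sketch below assumes a strictly positive target; the function name `approx_min_d_t_p` is hypothetical, and a plain linear scan is used instead of a heap for brevity.

```python
import math
from typing import List, Sequence


def approx_min_d_t_p(t: Sequence[float], M: int) -> List[int]:
    """Numerators c of an M-type approximation p = c / M minimizing D(t || p).

    Assumes every t_i > 0 and M >= len(t).
    """
    n = len(t)
    if M < n or any(ti <= 0 for ti in t):
        raise ValueError("requires a positive target and M >= n")
    c = [1] * n  # pre-allocation c_{i,0} = 1 keeps the support of t
    for _ in range(M - n):
        # delta_i(c_i + 1) = t_i * log(c_i / (c_i + 1)) is the (negative)
        # change in cost; min() returns the smallest index on ties.
        j = min(range(n), key=lambda i: t[i] * math.log(c[i] / (c[i] + 1)))
        c[j] += 1
    return c


example3 = approx_min_d_t_p((17 / 20, 3 / 40, 3 / 40), 20)
example4 = approx_min_d_t_p((0.719, 0.145, 0.088, 0.048), 50)
print(example3)  # numerators for Example 3, to be divided by M = 20
print(example4)  # numerators for Example 4, to be divided by M = 50
```

In Example 3 the first numerator ends below $⌊M t_1⌋ = 17$, and in Example 4 it ends above $⌈M t_1⌉ = 36$, so neither rounding direction is a safe pre-allocation for this cost function.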

To show that informational divergence vanishes for $M \to \infty$, assume that $M > 1 / t_i$ for all i. Since the variational distance optimal approximation $t^{vd}$ satisfies $t_i^{vd} \geq ⌊M t_i⌋ / M$ for every i, $t^{vd}$ has the same support as $t$, which ensures that $D(t ∥ t^{vd}) < \infty$. By similar arguments as in the proof of [8] (Proposition 4), we obtain

$$D(t ∥ t^a) \leq D(t ∥ t^{vd}) \leq \log \left( 1 + \frac{n}{2M} \right) \xrightarrow{M \to \infty} 0. \tag{5}$$

Note that this bound is universal, i.e., it prescribes the same convergence rate for every target distribution with n mass points.

We now develop an upper bound on $D(t ∥ t^a)$ that holds for every M. To this end, we first approximate $t$ by a distribution $t^*$ in $P_M := \{p : \forall i : p_i \geq 1/M, ∥p∥_1 = 1\}$ that minimizes $D(t ∥ t^*)$. If $t^*$ is unique, then it is called the reverse I-projection [12] (Section I.A) of $t$ onto $P_M$. Since $t^* \in P_M$, its variational distance optimal approximation $t^{vd}$ has the same support as $t$, which allows us to bound $D(t ∥ t^a)$ by $D(t ∥ t^{vd})$.

Lemma 2.

Let $t^* \in P_M$ minimize $D(t ∥ t^*)$. Then,

$$t_i^* := \frac{t_i}{ν(M)} + \left( \frac{1}{M} - \frac{t_i}{ν(M)} \right)^+ \tag{6}$$

where $ν(M)$ is such that $∥t^*∥_1 = 1$, and where $(x)^+ := \max\{0, x\}$.

Proof.

See Appendix A.2. ☐

Let $K := \{i : t_i < ν(M) / M\}$, $k := |K|$, and $T_K := \sum_{i \in K} t_i$. The parameter $ν(M)$ must scale the mass $(1 - T_K)$ such that it equals $(M - k) / M$, i.e., we have

$$ν(M) = \frac{1 - T_K}{1 - \frac{k}{M}}. \tag{7}$$

If, for all i, $t_i > 1 / M$, then $t \in P_M$, hence $t^* = t$ is feasible and $ν(M) = 1$. One can show that $ν(M)$ decreases with M.
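As a worked instance of Equations (6) and (7), consider the target distribution used in Figure 1 together with the precision $M = 10$. The choice of M is an assumption made here for illustration.

```latex
% Target: t = (0.48, 0.48, 0.02, 0.02), precision M = 10, so 1/M = 0.1.
% Try K = \{3, 4\}: k = 2 and T_K = 0.04. Equation (7) gives
\nu(10) = \frac{1 - 0.04}{1 - 2/10} = \frac{0.96}{0.8} = 1.2 .
% Consistency check of K = \{ i : t_i < \nu/M \} with \nu/M = 0.12:
%   t_3 = t_4 = 0.02 < 0.12 \quad \text{and} \quad t_1 = t_2 = 0.48 \ge 0.12 .
% Equation (6) then yields
t^{*} = \left( \tfrac{0.48}{1.2},\ \tfrac{0.48}{1.2},\ \tfrac{1}{10},\ \tfrac{1}{10} \right)
      = (0.4,\ 0.4,\ 0.1,\ 0.1),
% which sums to one and satisfies t^*_i \ge 1/M for every i.
```

The small masses are lifted to $1/M$, and the remaining masses are scaled down by $ν(M)$ to compensate.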

Proposition 2 (Approximation Bounds).

$$D(t ∥ t^a) \leq \log ν(M) + \frac{\log(2)}{2} \left( 1 - ν(M) \left( 1 - \frac{n}{M} \right) \right) \tag{8}$$

Proof.

See Appendix A.3. ☐

The first term on the right-hand side of Equation (8) accounts for the error caused by first approximating $t$ by $t^*$ (in the sense of Lemma 2). The second term accounts for the additional error caused by the M-type approximation of $t^*$ and incorporates the reverse Pinsker inequality [10] (Theorem 7). If $t_i > 1 / M$ for every i, hence $t \in P_M$, then $ν(M) = 1$ and the bound simplifies to

$$D(t ∥ t^a) \leq \log(2) \frac{n}{2M}. \tag{9}$$

For M sufficiently large, Equation (8) thus yields better results than Equation (5), which approximates to $n / (2M)$. Moreover, for M sufficiently large, our bound Equation (8) is uniform, i.e., it prescribes the same convergence rate for every target distribution with n mass points. We illustrate the bounds for an example in Figure 1.

4. Applications and Outlook

Arithmetic coding uses a probabilistic model to compress a source sequence. Applying Algorithm 1 with cost Equation (1c) to the empirical distribution of the source sequence provides an M-type distribution as a probabilistic model. The parameter M can be chosen small for reduced complexity. Another application of Algorithm 1 can be found in [3], which considers the problem of generating length-M sequences according to a desired distribution. Since a length-M sequence has an M-type empirical distribution, Reference [3] applies Algorithm 1 with cost Equation () to pre-calculate the M-type approximation of the desired distribution.

Algorithm 1 can also be used to calculate the M-type approximation of Markov models, i.e., approximating the transition matrix $T$ of an n-state, irreducible Markov chain with invariant distribution vector $μ$ by a transition matrix $P$ containing only M-type probabilities. Generalizing Equation (1c), the approximation error can be measured by the informational divergence rate [13]

$$\bar{D}(T ∥ P) := \sum_{i,j=1}^{n} μ_i T_{ij} \log \frac{T_{ij}}{P_{ij}} = \sum_{i=1}^{n} μ_i D(t_i ∥ p_i) \tag{10}$$

where $t_i$ and $p_i$ denote the i-th rows of $T$ and $P$, respectively. The optimal M-type approximation is found by applying the corresponding instance of Algorithm 1 to each row separately, and Lemma 1 ensures that the transition graph of $P$ equals that of $T$, i.e., the approximating Markov chain is irreducible. Future work shall extend this analysis to hidden Markov models and should investigate the performance of these algorithms in practical scenarios, e.g., speech processing with finite-precision arithmetic.
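The row-wise procedure can be sketched as follows. The function names `m_type_row` and `m_type_transition_matrix` and the 3-state matrix are hypothetical and serve only as an illustration; zero entries of a row receive the pre-allocation 0 and are never selected, in line with Lemma 1, while positive entries start at 1.

```python
import math
from typing import List, Sequence


def m_type_row(row: Sequence[float], M: int) -> List[float]:
    """M-type approximation of one row, minimizing D(row || p)."""
    support = [i for i, x in enumerate(row) if x > 0]
    if M < len(support):
        raise ValueError("M must be at least the support size of the row")
    # Pre-allocate one unit per positive entry; zeros stay at zero.
    c = [1 if x > 0 else 0 for x in row]
    for _ in range(M - len(support)):
        # Only indices in the support compete; the cost of the next unit
        # is delta_i(c_i + 1) = row_i * log(c_i / (c_i + 1)).
        j = min(support, key=lambda i: row[i] * math.log(c[i] / (c[i] + 1)))
        c[j] += 1
    return [ci / M for ci in c]


def m_type_transition_matrix(T: Sequence[Sequence[float]], M: int) -> List[List[float]]:
    """Apply the row-wise approximation to every row of T."""
    return [m_type_row(row, M) for row in T]


# Hypothetical irreducible 3-state chain (illustrative values).
T = [
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
]
P = m_type_transition_matrix(T, M=4)
for row in P:
    print(row)
```

Because zero and non-zero entries are preserved, $P$ has the same transition graph as $T$ in this sketch.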

Another possible application is the approximation of Bayesian network parameters. The authors of [4] approximated the true parameters using a stationary multiplier method from [14]. Since rounding probabilities to zero led to bad classification performance, they replaced zeros in the approximating distribution afterwards by small values. This in turn led to the problem that probabilities that are in fact zero, were approximated by a non-zero probability. We believe that these problems can be removed by instantiating Algorithm 1 for cost Equation (1c). This automatically prevents approximating non-zero probabilities with zeros and vice-versa, see Lemma 1.

Finally, for approximating Bayesian network parameters, recent work suggests rounding log-probabilities, i.e., to approximate $\log t_i$ by $\log p_i = -c_i / M$ for a non-negative integer $c_i$ [5]. Finding an optimal approximation that corresponds to a true distribution is equivalent to solving

$$
\begin{aligned}
\min \quad & d(t, p) \\
\text{s.t.} \quad & ∥e^{-c}∥_{1/M} = 1
\end{aligned}
$$

where $d(\cdot, \cdot)$ denotes any of the considered cost functions Equation (1). If $M = 1$ and $d(t, p) = D(t ∥ p)$ using the binary logarithm, the constraint translates to the requirement that $t$ is approximated by a complete binary tree. Then, the optimal approximation is the Huffman code for $t$.
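The connection to Huffman coding can be made explicit in a few lines. The derivation below assumes, as in the text, $M = 1$ and the binary logarithm, so that $p_i = 2^{-c_i}$.

```latex
% With M = 1 and base-2 logarithms, \log_2 p_i = -c_i, i.e., p_i = 2^{-c_i}.
% The normalization constraint becomes the Kraft inequality with equality:
\sum_{i=1}^{n} 2^{-c_i} = 1 .
% The cost function (1c), measured in bits, reads
D(t \,\|\, p)
  = \sum_{i=1}^{n} t_i \log_2 \frac{t_i}{2^{-c_i}}
  = \sum_{i=1}^{n} t_i c_i - H_2(t),
% where H_2(t) is the entropy of t in bits. Since H_2(t) does not depend
% on c, minimizing D(t || p) amounts to minimizing the expected codeword
% length \sum_i t_i c_i over integer lengths c_i that satisfy Kraft's
% condition with equality.
```

This is the optimization problem solved by Huffman's algorithm, with $c_i$ playing the role of the codeword lengths.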

Acknowledgments

The work of Bernhard C. Geiger was partially funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund. The work of Georg Böcherer was partly supported by the German Ministry of Education and Research in the framework of an Alexander von Humboldt Professorship. This work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

Author Contributions

Bernhard C. Geiger and Georg Böcherer conceived this study, derived the results, and wrote the manuscript. Specifically, Bernhard C. Geiger proved Proposition 1 and Lemmas 3, 5 and 6, and Georg Böcherer proved Lemmas 2 and 4. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix. Proofs

Appendix A.1. Proof of Proposition 1

Since a pre-allocation only fixes a lower bound for $U(c)$, w.l.o.g. we assume that $c_0 = 0$ and thus $c \in \mathbb{N}_0^n$ with $∥c∥_1 = M$. Consider the set $D := \{δ_i(k_i) : k_i \in \mathbb{N}, i = 1, \dots, n\}$ and assume that the (not necessarily unique) set $D_M$ consists of M smallest values in $D$, i.e., $|D_M| = M$ and

$$\forall d \in D_M, d' \in D \setminus D_M : d \leq d'. \tag{A1}$$

Clearly, $U(c)$ cannot be smaller than the sum over all elements in $D_M$. Since the $δ_i$ are non-decreasing, there exists at least one final allocation $c$ that takes successively the first $c_i$ values from each queue i, i.e., $D_M = \{δ_1(1), \dots, δ_1(c_1), \dots, δ_n(1), \dots, δ_n(c_n)\}$ satisfies Equation (A1). This shows that the lower bound induced by Equation (A1) can actually be achieved.

We prove the optimality of Algorithm 1 by contradiction: Assume that Algorithm 1 finishes with a final allocation $\tilde{c}$ such that $U(\tilde{c})$ is strictly larger than the (unique) sum over all elements in (non-unique) $D_M$. Hence, $\tilde{c}$ must exchange at least one of the elements in $D_M$ for an element that is strictly larger. Thus, by the properties of the functions $δ_i$ and Algorithm 1, there must be indices ℓ and m such that $\tilde{c}_ℓ > c_ℓ$, $\tilde{c}_m < c_m$, and $δ_ℓ(\tilde{c}_ℓ) \geq δ_ℓ(c_ℓ + 1) > δ_m(c_m) \geq δ_m(\tilde{c}_m)$. At each iteration of the algorithm, the current allocation at index m satisfies $k_m \leq \tilde{c}_m < c_m$. Since $δ_m(c_m) < δ_ℓ(c_ℓ + 1)$, $δ_ℓ(c_ℓ + 1)$ can never be a minimal element, and hence is not chosen by Algorithm 1. This contradicts the assumption that Algorithm 1 finishes with a $\tilde{c}$ such that $U(\tilde{c})$ is strictly larger than the sum of $D$’s M smallest values. ☐

Appendix A.2. Proof of Lemma 2

The problem of finding a $t^* \in P_M$ minimizing $D(t ∥ t^*)$ is equivalent to finding an optimal point of the problem:

$$
\begin{aligned}
\underset{p \in \mathbb{R}_{>0}^{n}}{\text{minimize}} \quad & -\sum_{i=1}^{n} t_i \log p_i && \text{(A2a)} \\
\text{subject to} \quad & \frac{1}{M} - p_i \leq 0, \quad i = 1, 2, \dots, n && \text{(A2b)} \\
& -1 + \sum_{i=1}^{n} p_i = 0. && \text{(A2c)}
\end{aligned}
$$

The Lagrangian of the problem is

$$L(p, λ, ν) = -\sum_{i=1}^{n} t_i \log p_i + \sum_{i=1}^{n} λ_i \left( \frac{1}{M} - p_i \right) + ν \left( -1 + \sum_{i=1}^{n} p_i \right). \tag{A3}$$

By the Karush–Kuhn–Tucker (KKT) conditions [15] (Chapter 5.5.3), a feasible point $t^*$ is optimal if, for every $i = 1, \dots, n$,

$$
\begin{aligned}
λ_i &\geq 0 && \text{(A4a)} \\
λ_i \left( \frac{1}{M} - t_i^* \right) &= 0 && \text{(A4b)} \\
\frac{\partial}{\partial p_i} L(p, λ, ν) \Big|_{p = t^*} = -\frac{t_i}{t_i^*} - λ_i + ν &= 0. && \text{(A4c)}
\end{aligned}
$$

By Equation (A2b), we have $t_i^* \geq 1 / M$. If $t_i^* > 1 / M$, then $λ_i = 0$ by Equation (A4b) and $t_i^* = t_i / ν$ by Equation (A4c). Thus

$$t_i^* = \frac{t_i}{ν} + \left( \frac{1}{M} - \frac{t_i}{ν} \right)^+ \tag{A5}$$

where ν is such that $\sum_{i=1}^{n} t_i^* = 1$. ☐

Appendix A.3. Proof of Proposition 2

Reverse I-projections admit a Pythagorean inequality [12] (Theorem 1). In other words, if $p$ is a distribution, $p^*$ its reverse I-projection onto a set $S$, and $q$ any distribution in $S$, then

$$D(p ∥ q) \geq D(p ∥ p^*) + D(p^* ∥ q). \tag{A6}$$

For the present scenario, we can show an even stronger result:

Lemma 3.

Let $t$ be the target distribution, let $t^*$ be as in Lemma 2, and let $t^{vd}$ be the variational distance optimal M-type approximation of $t^*$. Then,

$$D(t ∥ t^{vd}) = D(t ∥ t^*) + ν D(t^* ∥ t^{vd}). \tag{A7}$$

Proof.

$$
\begin{aligned}
D(t ∥ t^{vd}) &= \sum_{i=1}^{n} t_i \log \frac{t_i}{t_i^{vd}} && \text{(A8)} \\
&= \sum_{i=1}^{n} t_i \log \frac{t_i t_i^*}{t_i^{vd} t_i^*} && \text{(A9)} \\
&= \sum_{i=1}^{n} t_i \log \frac{t_i}{t_i^*} + \sum_{i=1}^{n} t_i \log \frac{t_i^*}{t_i^{vd}} && \text{(A10)} \\
&\overset{(a)}{=} D(t ∥ t^*) + ν \sum_{i \notin K} \frac{t_i}{ν} \log \frac{t_i^*}{t_i^{vd}} && \text{(A11)} \\
&\overset{(b)}{=} D(t ∥ t^*) + ν D(t^* ∥ t^{vd}) && \text{(A12)}
\end{aligned}
$$

Here, (a) follows because for $i \in K$, $t_i^* = 1 / M$ and thus, the M-type approximation minimizing the variational distance satisfies $t_i^{vd} = 1 / M$; furthermore, (b) is because for $i \notin K$, $t_i^* = t_i / ν$. ☐

We now bound the summands in Lemma 3.

Lemma 4.

In the setting of Lemma 3,

$$D(t^* ∥ t^{vd}) \leq \log(2) ∥t^* - t^{vd}∥_1. \tag{A13}$$

Proof.

We first employ a reverse Pinsker inequality from [10] (Theorem 7), stating that

$$D(t^* ∥ t^{vd}) \leq \frac{1}{2} \frac{r \log r}{r - 1} ∥t^* - t^{vd}∥_1 \tag{A14}$$

where $r := \sup_{i : t_i^* > 0} \frac{t_i^*}{t_i^{vd}}$. Furthermore, since for variational distance optimal approximations we always have $|t_i^* - t_i^{vd}| < 1 / M$ [8] (Lemma 3), we can bound

$$r < \frac{t_i^{vd} + \frac{1}{M}}{t_i^{vd}} \leq 2 \tag{A15}$$

since $t_i^{vd} \geq ⌊M t_i^*⌋ / M \geq 1 / M$. Since the factor $\frac{r \log r}{r - 1}$ increases in r, the bound Equation (A13) follows by substituting r in Equation (A14) by 2. ☐

Lemma 5.

In the setting of Lemma 3,

$$D(t ∥ t^*) \leq \log ν. \tag{A16}$$

Proof.

$$
\begin{aligned}
D(t ∥ t^*) &= \sum_{i=1}^{n} t_i \log \frac{t_i}{t_i^*} && \text{(A17)} \\
&= \sum_{i \notin K} t_i \log \frac{ν t_i}{t_i} + \sum_{i \in K} t_i \log (M t_i) && \text{(A18)} \\
&\overset{(a)}{\leq} (1 - T_K) \log ν + \sum_{i \in K} t_i \log ν && \text{(A19)} \\
&= \log ν && \text{(A20)}
\end{aligned}
$$

where (a) is because for $i \in K$, $M t_i \leq ν$. ☐

To bound $∥t^* - t^{vd}∥_1$, we present

Lemma 6.

Let $p^*$ be a sub-probability distribution with $m \leq M$ masses and total weight $1 - T$, and let ${p^{vd}}^{*}$ be its variational distance optimal M-type approximation using $J \leq M$ masses. Then,

$$∥p^* - {p^{vd}}^{*}∥_1 \leq \frac{m}{2M} + \frac{(M - M T - J)^2}{2 m M}. \tag{A21}$$

Note that for $J = M$ we recover [8] (Lemma 4).

Proof.

Assume first that either $\forall i : p_i^* \geq {p_i^{vd}}^{*}$ or $\forall i : p_i^* \leq {p_i^{vd}}^{*}$. Note that this is possible since $p^*$ and ${p^{vd}}^{*}$ are sub-probability distributions, summing to $1 - T$ and $J / M$, respectively. Then, $∥p^* - {p^{vd}}^{*}∥_1 = |1 - T - J / M|$, which satisfies this bound. This can be seen by rearranging Equation (A21) such that J only appears on the left-hand side; the maximizing J (not necessarily integer) then satisfies Equation (A21) with equality.

It thus remains to treat the case where after rounding off all indices, $1 \leq L \leq M - 1$ masses remain and we have

$$\sum_{i=1}^{m} \left( p_i^* - \frac{⌊M p_i^*⌋}{M} \right) =: \sum_{i=1}^{m} e_i = 1 - T - \frac{J - L}{M} =: g(L). \tag{A22}$$

The variational distance is minimized by distributing the L masses to the set $\mathcal{L}$ of L indices with the largest errors $e_i$, hence

$$
\begin{aligned}
∥p^* - {p^{vd}}^{*}∥_1 &= \sum_{i \in \mathcal{L}} \left( \frac{1}{M} - e_i \right) + \sum_{i \notin \mathcal{L}} e_i && \text{(A23)} \\
&\overset{(a)}{\leq} \frac{L}{M} - \frac{L}{m} g(L) + \frac{m - L}{m} g(L) && \text{(A24)}
\end{aligned}
$$

where (a) follows because for $i \in \mathcal{L}$, $j \notin \mathcal{L}$, $e_i \geq e_j$. This is maximized for $L = \frac{m - (M - M T - J)}{2}$ (not necessarily integer), which after inserting yields the upper bound. ☐

Proof of Bound in Proposition 2.

We start by bounding the informational divergence $D(t ∥ t^a)$ by the informational divergence between $t$ and the variational distance optimal approximation $t^{vd}$ of its reverse I-projection $t^*$ onto $P_M$:

$$
\begin{aligned}
D(t ∥ t^a) &\leq D(t ∥ t^{vd}) && \text{(A25)} \\
&\overset{(a)}{=} D(t ∥ t^*) + ν D(t^* ∥ t^{vd}) && \text{(A26)} \\
&\overset{(b)}{\leq} \log ν + ν \log(2) ∥t^* - t^{vd}∥_1 && \text{(A27)} \\
&\overset{(c)}{\leq} \log ν + ν \log(2) \frac{n - k}{2M} && \text{(A28)} \\
&\overset{(d)}{\leq} \log ν + ν \log(2) \frac{n - M + \frac{M}{ν}}{2M} && \text{(A29)} \\
&= \log ν + \frac{\log(2)}{2} \left( 1 - ν \left( 1 - \frac{n}{M} \right) \right) && \text{(A30)}
\end{aligned}
$$

where

(a) is due to Lemma 3,
(b) is due to Lemmas 4 and 5,
(c) is due to Lemma 6 with $m = n - k$, $1 - T = 1 - k / M$, and $J = M - k$, and
(d) follows by bounding k from below via Equation (7):

$$k = \frac{M}{ν} (ν - 1 + T_K) \geq \frac{M}{ν} (ν - 1) = M - \frac{M}{ν}. \tag{A31}$$

☐

References

1. Dorfleitner, G.; Klein, T. Rounding with multiplier methods: An efficient algorithm and applications in statistics. Stat. Pap. 1999, 40, 143–157.
2. Rissanen, J.; Langdon, G.G. Arithmetic coding. IBM J. Res. Dev. 1979, 23, 149–162.
3. Schulte, P.; Böcherer, G. Constant Composition Distribution Matching. IEEE Trans. Inf. Theory 2016, 62, 430–434.
4. Drużdżel, M.J.; Oniśko, A. Are Bayesian Networks Sensitive to Precision of Their Parameters? In Proceedings of the International IIS’08 Conference, Intelligent Information Systems XVI, Zakopane, Poland, 16–18 June 2008; pp. 35–44.
5. Tschiatschek, S.; Pernkopf, F. On Bayesian Network Classifiers with Reduced Precision Parameters. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 774–785.
6. Reznik, Y. An Algorithm for Quantization of Discrete Probability Distributions. In Proceedings of the 2011 Data Compression Conference (DCC), Snowbird, UT, USA, 29–31 March 2011; pp. 333–342.
7. Böcherer, G. Optimal Non-Uniform Mapping for Probabilistic Shaping. In Proceedings of the 9th International ITG Conference on Systems, Communications and Coding (SCC), Munich, Germany, 21–24 January 2013; pp. 1–6.
8. Böcherer, G.; Geiger, B.C. Optimal Quantization for Distribution Synthesis. 2016; arXiv:1307.6843v4.
9. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley Interscience: Hoboken, NJ, USA, 2006.
10. Verdú, S. Total variation distance and the distribution of relative information. In Proceedings of the Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 9–14 February 2014; pp. 499–501.
11. Michalewicz, Z.; Fogel, D.B. How to Solve It: Modern Heuristics, 2nd ed.; Springer: Berlin, Germany, 2004.
12. Csiszár, I.; Matúš, F. Information Projections Revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490.
13. Rached, Z.; Alajaji, F.; Campbell, L.L. The Kullback–Leibler divergence rate between Markov sources. IEEE Trans. Inf. Theory 2004, 50, 917–921.
14. Heinrich, L.; Pukelsheim, F.; Schwingenschlögl, U. On stationary multiplier methods for the rounding of probabilities and the limiting law of the Sainte-Laguë divergence. Stat. Decis. 2005, 23, 117–129.
15. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.

Figure 1. Evaluating the bounds Equations (5) and (8) for $t = (0.48, 0.48, 0.02, 0.02)$. Note that Equation (5) is a valid bound only for $M \geq 50$, i.e., where the curve is dashed.


Table 1. Instances of Algorithm 1 Optimizing Equation (1).
Cost	$f_i(x)$	$δ_i(k)$	$c_{i,0}$	References
$∥p - t∥_1$	$|x - M t_i|$	$|k - M t_i| - |k - 1 - M t_i|$	$⌊M t_i⌋$	[6,8]
$D(p ∥ t)$	$x \log(x / t_i)$	$k \log \frac{k}{k - 1} + \log(k - 1) - \log t_i$	0	[7,8]
$D(t ∥ p)$	$-t_i \log x$	$t_i \log((k - 1) / k)$	$⌈t_i⌉$	This work

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
