1. Introduction
The Kullback–Leibler divergence (KL), also called entropic divergence, is a widely used measure for comparing two discrete probability distributions [1]. Such a divergence is derived from the notion of entropy, and it aims at evaluating the amount of information that is gained by switching from one distribution to another. The applications of the divergence span several scientific areas, for example, for testing random variables [2,3,4], for selecting the right sample size [5], for optimizing sampling in bioinformatics [6] or for analyzing magnetic resonance images [7]. It has also been applied as a cost function in predictive machine-learning approaches [8] based on the well-established random-forest model, or in artificial neural networks [9], for example, for clustering data points [10] or for generative models [11]. However, the entropic divergence has two important properties that limit its applicability. It cannot be used as a metric because it is not symmetric, i.e., $KL(P\,\|\,Q) \neq KL(Q\,\|\,P)$, where P and Q are two probability distributions. Moreover, its value is 0 if equal distributions are compared, but it is shown not to have an upper bound. One of the reasons is that it results in an infinite divergence if the probability of a specific event is equal to 0 in Q but is greater than 0 in P. Even when such infinite divergences are excluded, an upper bound for the entropic divergence has not been established.
The search for bounded divergences is an important topic in information theory, and some attempts have been made in the past few years. For example, the main goal of the so-called Jensen–Shannon divergence (JS) [12] is to provide a notion of symmetric divergence, and it is also upper-bounded by the value 1 if the base of the used logarithm is 2. Its square root is a metric, but its values are not uniformly distributed within the range $[0, 1]$, as is empirically shown in this study. Kullback–Leibler and Jensen–Shannon measures are in the class of f-divergences [13], which aim at representing the divergence as an average of the odds ratio weighted by a function f. Each divergence has a specific meaning and behavior, and the relation among different types of f-divergence is a well-studied topic [14]. The Hellinger distance [15] is one of the most used measures among the f-divergences, together with KL and JS. It avoids infinite divergences by definition, and it is bounded between 0 and 1. More generally, the KL divergence is shown to be related to several other types of divergences [16].
The present work introduces a new class of discrete probability distributions called quantized distributions. The name comes from the fact that the probabilities reported by such a class of distributions are formed by quanta of probability. The final aim of the present study is to show that, given a quantized distribution P, there exists another quantized distribution U that maximizes the entropic divergence from P. Thus, for each distribution P, an upper bound to the divergence from P can be obtained by constructing U. The assumptions are that infinite divergences must be avoidable and that the two distributions must be formed by distributing a given discrete quantity, namely, the same quantum must form them. This last property highlights an important, previously unaccounted aspect of the KL divergence. Because such a bound can only be assessed under this condition, the KL divergence should only be applied between probability distributions formed by the same quantum. These theoretical results allow the introduction of a notion of entropic divergence that is normalized in the range $[0, 1]$, independently of the base of the used logarithm. Such a measure is compared with the more commonly used notions of divergence and distance between distributions by showing that it behaves in a precise way. Furthermore, it is empirically shown that its values are better distributed in the range $[0, 1]$ than those of the compared measures. In conclusion, the novel aspects of this study are (i) the introduction of the concept of quantized distributions, (ii) the establishment of an upper bound for the KL divergence between quantized distributions, and (iii) the proposal of a normalized KL divergence.
2. Preliminaries
A finite (thus discrete) multiplicity distribution is defined as a function f, which distributes a given discrete quantity M to a finite set C of n distinct cells. Thus, $\sum_{c \in C} f(c) = M$. This class of distributions is often represented via Ferrers diagrams [17], in which the distributed quantity is a finite set of M dots that are assigned to cells. A multiplicity distribution is commonly transformed into a discrete probability (frequency) distribution by converting it to a distribution such that the sum of its outcomes equals 1. Thus, a finite discrete probability distribution P is obtained by dividing the assigned quantities by the total quantity, namely $P(c) = \frac{f(c)}{M}$ for every $c \in C$. In this context, the term quantum signifies that the distribution is defined on a finite, discrete domain, and the assigned values are composed of quanta, which are discrete, unitary pieces of information.
Definition 1 (Quantized distribution). A quantized (probability) distribution (QD) is a finite discrete probability distribution that assigns a probability value to each of the n values of a variable. The probability values are positive, non-zero multiples of the fraction $\frac{1}{M}$, called the quantum of the distribution, for a given positive integer M. The value n is also called the cardinality of the distribution.
It has to be noticed that since quantized distributions are probability distributions, the sum of the assigned probabilities must equal 1. Furthermore, this defines a special type of probability distribution. In fact, in general, it is not required that a probability distribution is sourced by a discrete quantity M distributed over a finite set of cells. Such a type of distribution is of great importance in the field of Computer Science, where probabilities are estimated by looking at frequencies calculated from discrete quantities, for example, for representing biological information [18,19,20]. However, it can be easily shown that the class of quantized distributions covers all the possible discrete finite probability distributions. For distributions whose assigned probabilities are rational numbers, rescaling is always possible: it is achieved by setting the quantum value to 1 divided by the common denominator of the assigned probabilities. The remaining discrete finite probability distributions can be approximated by using an arbitrarily small epsilon for their discretization.
Remark 1. For each multiplicity distribution D, there exists a quantized distribution P, and vice versa.
In fact, given a multiplicity distribution, it can always be converted to a quantized distribution by dividing the assigned values by their sum. Vice versa, a quantized distribution can be represented as a function that assigns values that are an integer multiple (a multiplicity) of the quantum.
Because of the strict relation between frequency/probability and multiplicity distributions, from now on and without loss of generality, the assigned probability values, $P(c)$, will be interchanged with their multiplicity/integer-quantity counterparts, $M \cdot P(c)$, depending on the purpose of the context in which they are recalled. Similarly, Ferrers diagrams and their dot-based representation are used to depict this type of distribution.
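As a minimal illustration of this interchange (the helper names below are ours and not part of the original formulation), a multiplicity distribution can be converted into a quantized distribution and back as follows, assuming the multiplicities are positive integers:

```python
from fractions import Fraction
from math import lcm

def to_quantized(multiplicities):
    # Multiplicity distribution (positive integers summing to M)
    # -> quantized distribution whose values are multiples of the quantum 1/M.
    M = sum(multiplicities)
    return [Fraction(m, M) for m in multiplicities]

def to_multiplicities(quantized):
    # Quantized distribution -> multiplicity distribution, using the smallest
    # quantum 1/M that expresses every probability value exactly.
    M = lcm(*(p.denominator for p in quantized))
    return [int(p * M) for p in quantized], M

probs = to_quantized([3, 2, 1])        # [1/2, 1/3, 1/6], quantum 1/6
counts, M = to_multiplicities(probs)   # ([3, 2, 1], 6)
```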
Remark 2. Two distributions P and Q, defined on the same domain C, are considered equal, thus not distinct, if $P(c) = Q(c)$ for every $c \in C$.
Proposition 1. The total number of distinct, thus not equal, quantized distributions that can be formed by arranging a quantity M in n distinct cells is $\binom{M-1}{n-1}$.
Proof. Given a set S of x elements, the number of y-combinations, namely the number of subsets of y distinct elements of S, is given by $\binom{x}{y}$. The number of y-combinations with repetitions, namely the number of multisets of y non-necessarily distinct elements of S, is given by $\binom{x+y-1}{y}$ [21].
Quantized distributions require that a minimum value of $\frac{1}{M}$ be assigned to each cell. Switching from quantized to multiplicity distributions implies that a quantity of n elements, out of M, does not participate in the arrangement process, since a fixed minimum value of 1 is assigned to each cell. Thus, the quantity that is arranged equals $M - n$. Each dot must be assigned to a given cell, and no dot can remain unassigned. Thus, the arrangement process can be seen as an assignment of one specific cell to each of the $M - n$ dots, by allowing a cell to be assigned to multiple dots. Compared to classical combinatorial problems, we are not assigning dots to cells but cells to dots. Thus, this means counting the number of $(M-n)$-combinations with repetitions of a set of n elements, which is given by $\binom{(M-n)+n-1}{M-n} = \binom{M-1}{M-n} = \binom{M-1}{n-1}$. □
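For concreteness, the count of Proposition 1 can be cross-checked by brute-force enumeration for small values of M and n (a minimal Python sketch of ours, not part of the original text):

```python
from math import comb
from itertools import product

def count_qd(M, n):
    # Number of distinct quantized distributions of a quantity M over n cells,
    # each cell receiving at least one quantum: C(M-1, n-1).
    return comb(M - 1, n - 1)

def count_qd_bruteforce(M, n):
    # Enumerate all assignments of at least 1 unit per cell that sum to M.
    return sum(1 for cells in product(range(1, M + 1), repeat=n)
               if sum(cells) == M)

assert count_qd(6, 3) == count_qd_bruteforce(6, 3) == 10
```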
The present study aims to show that for each of these distributions, there exists another distribution that maximizes the value of the entropic divergence. The proof, given in the next section, requires that the elements of the domain be ordered according to the values assigned to them.
Definition 2 (Ordered quantized distribution). Given a quantized distribution P, an ordered quantized distribution (OQD) is obtained by assigning an integer index i, with $1 \leq i \leq n$, to each domain element $c_i \in C$ such that $P(c_i) \geq P(c_{i+1})$ for $1 \leq i < n$. $P(c_i)$ is also referred to as $P_i$.
It must be noted that Definition 2 is based on a monotonically decreasing order but, without loss of generality, a monotonically increasing order can also be used. Furthermore, in what follows, the greatest value of the distribution is considered to be placed in the leftmost position. Consequently, the lowest value is considered to be placed in the rightmost position of the distribution.
Remark 3. Two ordered distributions P and Q, defined on the same domain C, are equal, thus not distinct, if $P_i = Q_i$ for every $1 \leq i \leq n$.
Multiple unordered distributions may produce the same ordered distribution, leading them to belong to the same class of equivalence, which is defined by such a shared ordered output. Formally, $\mathcal{Q}_{M,n}$ is defined as the complete set of QDs that can be formed by arranging a quantity of M into n cells, and $\mathcal{O}_{M,n}$ is defined as the complete set of OQDs that can be formed in the same way. Then, the function that transforms an unordered QD into an ordered QD, $ord : \mathcal{Q}_{M,n} \to \mathcal{O}_{M,n}$, can be shown to be a surjective function. Thus, each class of equivalence is represented by a given distribution $O \in \mathcal{O}_{M,n}$, and it is formed by the subset of $\mathcal{Q}_{M,n}$, referred to as $ord^{-1}(O)$, that contains all $P \in \mathcal{Q}_{M,n}$ such that $ord(P) = O$.
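A sketch of the ord function (ours): ordering simply sorts the assigned values in non-increasing order, so that several unordered distributions collapse onto the same ordered representative.

```python
def ord_qd(P_counts):
    # Map an unordered quantized distribution (multiplicity representation)
    # to its ordered representative: values sorted in non-increasing order.
    return tuple(sorted(P_counts, reverse=True))

# Two distinct unordered distributions in the same equivalence class:
print(ord_qd([1, 3, 2]) == ord_qd([2, 1, 3]))  # True -> same class
```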
We are interested in counting the number of classes, which also equals the number of distinct ordered distributions.
Proposition 2. The total number of distinct ordered distributions that can be formed by arranging a quantity of M in n cells equals the number of partitions of the integer M into a sum of n non-zero integer addends.
Proof. Similarly to unordered distributions (see Proposition 1), ordered distributions assign a minimum quantity of 1 to each cell. The number of partitions of an integer x that represent it as a sum of exactly y addends, the value of which cannot be 0, can be obtained by the recursive formula $p_y(x) = p_{y-1}(x-1) + p_y(x-y)$, with $p_y(x) = 0$ if $y > x$ or if $y = 0$ and $x > 0$, and $p_0(0) = 1$ [22]. Thus, we can use the formula to evaluate the number of ordered distributions by setting $x = M$, because it is the total arranged quantity, and $y = n$, because we want to represent such an integer as a sum of exactly n non-zero addends (namely, non-empty cells). □
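The recursion used in the proof can be implemented directly; the following memoized sketch (ours, with the base cases stated above) counts the ordered quantized distributions as the number of partitions of M into exactly n positive addends:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(x, y):
    # Number of partitions of x into exactly y positive addends:
    # p_y(x) = p_{y-1}(x-1) + p_y(x-y), with p_y(x) = 0 if y > x,
    # p_0(x) = 0 for x > 0, and p_0(0) = 1.
    if y > x:
        return 0
    if y == 0:
        return 1 if x == 0 else 0
    return partitions(x - 1, y - 1) + partitions(x - y, y)

# Ordered quantized distributions for M = 7, n = 3:
# 5+1+1, 4+2+1, 3+3+1, 3+2+2 -> 4 partitions.
print(partitions(7, 3))  # 4
```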
The search for a maximum value of the KL divergence between two quantized distributions P and Q (presented in the next section) relies on the fact that both distributions must be formed by the same quantum value. However, there are plenty of practical situations where this assumption does not hold, and the two distributions need to be transformed into two comparable distributions before calculating the divergence.
Proposition 3. Given two quantized distributions, P and Q, formed by two different quanta, $\frac{1}{M_P}$ and $\frac{1}{M_Q}$, respectively, they can always be transformed into two distributions formed by the same quantum.
Proof. Since $M_P$ and $M_Q$ are two positive integer numbers, the least common multiple $L = \mathrm{lcm}(M_P, M_Q)$ can be used for re-scaling the two distributions such that the same quantum value will form them. The new distributions are formed by the same quantum, that is $\frac{1}{L}$. The values of the original distributions are always in the form $\frac{x}{M_P}$ (respectively, $\frac{x}{M_Q}$), with x being a positive integer. Thus, the values of the new distributions can be re-scaled as $\frac{x \cdot (L/M_P)}{L}$ (respectively, $\frac{x \cdot (L/M_Q)}{L}$). It is trivial to show that the new distributions maintain their status of quantized distributions. □
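The re-scaling described in the proof can be sketched as follows, working on the multiplicity representations of P and Q (the function name is hypothetical):

```python
from math import lcm

def common_quantum(P_counts, Q_counts):
    # Re-scale two multiplicity representations so that both quantized
    # distributions are formed by the same quantum 1/L, with L = lcm(M_P, M_Q).
    M_P, M_Q = sum(P_counts), sum(Q_counts)
    L = lcm(M_P, M_Q)
    P_rescaled = [x * (L // M_P) for x in P_counts]
    Q_rescaled = [x * (L // M_Q) for x in Q_counts]
    return P_rescaled, Q_rescaled, L

# P formed with quantum 1/4, Q with quantum 1/6 -> common quantum 1/12.
P, Q, L = common_quantum([2, 1, 1], [3, 2, 1])
# P == [6, 3, 3], Q == [6, 4, 2], both summing to L == 12.
```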
3. Upper Bound of the Entropic Divergence
Given two probability distributions, the entropic divergence, also called the Kullback–Leibler (KL) divergence after the authors who introduced it [1], quantifies the information gained by switching from one distribution to another. For two probability distributions, P and Q, that are defined on the same domain C, the divergence of P from Q is defined as:
$$ KL(P\,\|\,Q) = \sum_{c \in C} P(c) \log \frac{P(c)}{Q(c)}. $$
The divergence is not symmetric, thus $KL(P\,\|\,Q) \neq KL(Q\,\|\,P)$, and its possible value ranges between 0 and $+\infty$. In fact, the divergence is 0 if the two distributions are equal in their outcomes, namely $P(c) = Q(c)$ for every $c \in C$, and Gibbs’ inequality [23] demonstrates that it can never be negative. However, it has no upper bound. In fact, the entropic divergence is the difference between the cross-entropy $H(P,Q) = -\sum_{c \in C} P(c) \log Q(c)$ and the entropy $H(P) = -\sum_{c \in C} P(c) \log P(c)$, which implies $KL(P\,\|\,Q) = H(P,Q) - H(P)$, and thus the divergence grows without limit as the cross-entropy does, for instance, when some $Q(c)$ approaches 0 while $P(c) > 0$.
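As a quick illustration (a sketch of ours, using natural logarithms, although the discussion holds for any base), the divergence can be computed directly from its definition when both distributions are strictly positive on the same domain:

```python
from math import log

def kl_divergence(P, Q):
    # KL(P || Q) = sum_c P(c) * log(P(c) / Q(c)).
    # Both distributions must be strictly positive on the same domain,
    # so that no infinite contribution can arise.
    assert len(P) == len(Q)
    assert all(p > 0 for p in P) and all(q > 0 for q in Q)
    return sum(p * log(p / q) for p, q in zip(P, Q))

P = [3/6, 2/6, 1/6]
Q = [1/6, 1/6, 4/6]
print(kl_divergence(P, Q))   # > 0
print(kl_divergence(Q, P))   # different value: the divergence is not symmetric
```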
Given two positive numbers M and n, the previous section establishes that the sets $\mathcal{Q}_{M,n}$ and $\mathcal{O}_{M,n}$ are finite. Consequently, for any P within either of these sets, there exists a distribution U in $\mathcal{Q}_{M,n}$ or $\mathcal{O}_{M,n}$, respectively, that maximizes $KL(P\,\|\,U)$. Here, we are interested in finding such a distribution U. It must be noted that P and U are quantized distributions formed by the same quantum. This assumption is crucial for obtaining an upper bound on the divergence from a given distribution P in practical situations.
The general concept of distribution, and thus of probability distribution, is independent of a given ordering of the elements in C. In this perspective, ordered quantized distributions are used without loss of generality. The KL formula for ordered distributions can be written as:
$$ KL(P\,\|\,Q) = \sum_{i=1}^{n} P_i \log \frac{P_i}{Q_i}. $$
It must be pointed out that the ordering does not affect the value of the KL divergence. This means that the distribution that maximizes the KL value from a given distribution P also maximizes the KL for all the unordered distributions within the same class of equivalence of P defined in the previous section. Thus, the goal is to define the shape of the distribution U that maximizes the entropic divergence $KL(P\,\|\,U)$.
It is required that the compared distributions, P and U, be defined on the set C and that, for each element, the two distributions are non-zero valued, namely $P_i > 0$ and $U_i > 0$ for $1 \leq i \leq n$, in order to avoid infinite divergences. This constraint, together with the discretization of the quantity that is distributed to the cells, implies that at each cell at least a quantity equal to 1 is assigned, that is, $M \cdot U_i \geq 1$ for every i. Thus, the quantity that must be arranged to construct the distribution U is $M - n$.
The entropic divergence is a sum of terms of the form $P_i \log \frac{P_i}{U_i}$. If $P_i < U_i$, then a negative contribution is given to the sum because of the logarithmic function, while positive contributions are given for $P_i > U_i$. Thus, the aim is to reduce the number of positions with negative contributions. Each term is mediated by the factor $P_i$. Thus, it is preferable to assign positive contributions to the greatest $P_i$ values. On the contrary, negative contributions should be assigned to the smallest $P_i$ values. This means that if P is monotonically decreasing in order (from left to right), then positive contributions should be on the left side of the distributions, and negative terms should be on the right side. Furthermore, the greater $P_i$ is with respect to $U_i$, the higher the value of the divergence. This translates to trying to increase the difference between the greatest $P_i$ values and their corresponding $U_i$ counterparts as much as possible. Of course, reducing the quantity assigned to the initial positions of U results in increasing the quantity assigned to the right-most positions of it.
All of these considerations lead to the intuition that the distribution that maximizes the entropic divergence is the one that minimizes the quantity assigned to positions from 1 to $n-1$ and that assigns the remaining amount to the last position. Since the minimum quantity is equal to 1, such a distribution assigns the remaining quantity $M - n + 1$ to the last position n. In what follows, it is shown that if P is monotonically decreasing ordered, then such a distributional shape maximizes the entropic divergence independently of how the quantity is distributed in P. This fact also implies that such maximization is independent of the ordering of P. It is only necessary that the quantity $M - n + 1$ is assigned to the position i, rather than n, where $P_i$ is minimal. However, the ordering helps prove the initial statement.
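The maximizing distribution U described above can be constructed directly from P, as in the following sketch (ours), which works on the multiplicity representation and places the free quantity in the cell where P is minimal:

```python
def maximizing_distribution(P_counts):
    # Given the multiplicity representation of P (positive integers summing
    # to M over n cells), build U: one quantum per cell, plus the remaining
    # M - n quanta in the cell where P is minimal.
    M, n = sum(P_counts), len(P_counts)
    U = [1] * n
    U[P_counts.index(min(P_counts))] = M - n + 1
    return U

# Example: P = (4, 3, 2, 1) with M = 10 and n = 4 -> U = (1, 1, 1, 7).
print(maximizing_distribution([4, 3, 2, 1]))
```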
From here on, the maximizing distribution is always referred to as U, and any other competitor distribution is referred to as Q. The proof that the divergence of P from U is greater than the divergence of P from any other distribution Q is split into two parts. Firstly, a special case is addressed, then the proof of the general case is given.
The special case is presented in Figure 1. A total amount of M elements is arranged into $n = 5$ cells to compose the distributions. As introduced above, the P distribution has a monotonically decreasing order, and the U distribution assigns a quantity of $M - n + 1$ to the last cell. The special case is represented by the Q distribution, which assigns a quantity of 2 to the $(n-1)$-th position and a quantity of $M - n$ to the last position. For all the distributions, a minimal quantity of 1 is assigned to every cell. The goal is to show that:
$$ KL(P\,\|\,U) \geq KL(P\,\|\,Q). $$
From cell 1 to cell 3, the two divergences have an equal contribution; thus, they differ only in the last two terms. Thus, the inequality can be written as:
$$ P_4 \log \frac{P_4}{1/M} + P_5 \log \frac{P_5}{(M-4)/M} \geq P_4 \log \frac{P_4}{2/M} + P_5 \log \frac{P_5}{(M-5)/M}, $$
that is
$$ P_4 \log 2 + P_5 \log \frac{M-5}{M-4} \geq 0, $$
that is
$$ P_4 \log 2 \geq P_5 \log \frac{M-4}{M-5}, $$
which is true because $P_4 \geq P_5$ and $\frac{M-4}{M-5} \leq 2$ for every $M \geq 6$.
A general proof of this special case, independently of the values of M and n, is given in Appendix A.
Moving forward, the final goal is to show that U maximizes the divergence with respect to any possible distribution Q obtained by arranging the quantity $M - n$ in all the cells.
Proposition 4. Let P be an OQD obtained by distributing a quantity M to n cells. Let U be a QD that assigns the minimum quantity of 1 to each cell and all the free quantity to the n-th cell. Let Q be a QD that assigns the free quantity in a way different from U, in addition to the minimum quantity of 1 for each cell. Then, $KL(P\,\|\,U) \geq KL(P\,\|\,Q)$, independently of how the quantity is arranged in Q.
Proof. The initial cells of U have a value equal to $\frac{1}{M}$, and the last cell has a value equal to $\frac{M-n+1}{M}$. Instead, for what concerns Q, a quantity equal to $\frac{1+e_i}{M}$, with $e_i \geq 0$, for $1 \leq i \leq n$, is assigned to each cell, such that $\sum_{i=1}^{n} e_i = M - n$.
The following inequality must be verified:
$$ KL(P\,\|\,U) \geq KL(P\,\|\,Q), $$
that is
$$ \sum_{i=1}^{n} P_i \log \frac{P_i}{U_i} \geq \sum_{i=1}^{n} P_i \log \frac{P_i}{Q_i}. $$
The left side of the inequality is composed of a series of terms $P_i \log \frac{P_i}{U_i}$, each of which equals $P_i \log (M P_i)$ for $i < n$, and the entire inequality can be written as
$$ \sum_{i=1}^{n-1} P_i \log (M P_i) + P_n \log \frac{M P_n}{M-n+1} \geq \sum_{i=1}^{n-1} P_i \log \frac{M P_i}{1+e_i} + P_n \log \frac{M P_n}{1+e_n}, $$
that is
$$ \sum_{i=1}^{n-1} P_i \log (1+e_i) + P_n \log \frac{1+e_n}{M-n+1} \geq 0, $$
that is
$$ \sum_{i=1}^{n-1} P_i \log (1+e_i) \geq P_n \log \frac{M-n+1}{1+e_n}. $$
Since P is ordered, for each position i it happens that $P_i \geq P_n$, namely $\frac{P_i}{P_n} \geq 1$. The inequality is thus satisfied if
$$ \sum_{i=1}^{n-1} P_n \log (1+e_i) \geq P_n \log \frac{M-n+1}{1+e_n}. $$
The arguments of the logarithms are always greater than or equal to 1; thus, the values of the logarithms are never negative. Moreover, the factors that multiply the logarithms are always positive because they are probabilities. The inequality can therefore be written as
$$ \sum_{i=1}^{n-1} \log (1+e_i) \geq \log \frac{M-n+1}{1+e_n}, $$
with $M - n + 1 = 1 + \sum_{i=1}^{n} e_i$ and $\frac{M-n+1}{1+e_n} \leq 1 + \sum_{i=1}^{n-1} e_i$. Considering that the sum of the logarithms of the terms $1+e_i$ is greater than or equal to the logarithm of 1 plus the sum of the terms $e_i$ [24], it is now trivial to show that the inequality is always satisfied. □
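Proposition 4 can also be checked numerically by exhaustive enumeration for small M and n; the following sketch (ours) compares the divergence of P from U with the divergence of P from every competitor Q built on the same quantum:

```python
from math import log
from itertools import product

def kl(P, Q):
    return sum(p * log(p / q) for p, q in zip(P, Q))

def check_upper_bound(P_counts):
    # P_counts: multiplicity representation of P (positive integers).
    M, n = sum(P_counts), len(P_counts)
    P = [c / M for c in P_counts]
    U_counts = [1] * n
    U_counts[P_counts.index(min(P_counts))] = M - n + 1
    U = [c / M for c in U_counts]
    # Enumerate every quantized competitor Q with at least 1 quantum per cell.
    for cells in product(range(1, M + 1), repeat=n):
        if sum(cells) == M:
            Q = [c / M for c in cells]
            assert kl(P, U) >= kl(P, Q) - 1e-12
    return kl(P, U)

print(check_upper_bound([4, 3, 2, 1]))  # the maximal divergence from P
```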
The previous proof is given for an ordered distribution P. However, the final inequality is independent of the ordering. In fact, it compares the quantity $M - n + 1$ (that is, the one that makes U the distribution of interest) with the quantities $1 + e_i$, independently of their position and specific value. P is ordered, and U assigns, by construction, the additional quantity to the cell where P has the lowest assigned value.
Retrieving an upper bound for the entropic divergence is shown here to be possible under two main conditions: (i) no zero values are assigned by the two distributions; (ii) the compared distributions are quantized distributions over the same quantum value $\frac{1}{M}$. The first condition is often ensured in practical applications, where pseudo-counts are used to avoid infinite divergences. The second condition emerges from this study. It states that the entropic divergence acquires a more powerful meaning when applied to comparable distributions, where the term comparable refers to sharing the same quantum value. This aspect should be taken into account in future developments of divergences.