Article

Alternative Dirichlet Priors for Estimating Entropy via a Power Sum Functional

by Tanita Botha 1,*,†, Johannes Ferreira 1,2,† and Andriette Bekker 1,2,†

1 Department of Statistics, Faculty of Natural and Agricultural Sciences, University of Pretoria, Pretoria 0028, South Africa
2 Centre of Excellence in Mathematical and Statistical Science, University of Witwatersrand, Johannesburg 2050, South Africa
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2021, 9(13), 1493; https://doi.org/10.3390/math9131493
Submission received: 29 May 2021 / Revised: 20 June 2021 / Accepted: 23 June 2021 / Published: 25 June 2021

Abstract:
Entropy is a functional of probability and is a measurement of the information contained in a system; however, the practical problem of estimating entropy in applied settings remains challenging and relevant. The Dirichlet prior is a popular choice in the Bayesian framework for the estimation of entropy when considering a multinomial likelihood. In this work, previously unconsidered Dirichlet type priors are introduced and studied. These priors include a class of Dirichlet generators as well as a noncentral Dirichlet construction, and in both cases the usual Dirichlet arises as a special case. These considerations allow for flexible behaviour and can account for negative and positive correlation. Resultant estimators for a particular functional, the power sum, under these priors and assuming squared error loss, are derived and represented in terms of the product moments of the posterior. This representation facilitates closed-form estimators for the Tsallis entropy and thus expedites computations of this generalised Shannon form. Selected cases of these proposed priors are considered to investigate the impact they have on the estimation of Tsallis entropy under different parameter scenarios.

1. Introduction

Shannon entropy and related information measures are functionals of probability that measure the information contained in a system, and they arise in information theory, machine learning and text modelling, amongst others. Ref. [1] discussed applications ranging from quantifying the information carried by neural signals to estimating dependency structure and inferring causal relations, as well as measuring uncertainty and dispersion in statistics, with applications in fields such as molecular biology. Other interests range from studies measuring the complexity of dynamics in physics to studies measuring diversity in ecology and genetics, the fields of coding theory and cryptography [2], and financial analysis and data compression [3]. Numerous inferential tasks rely on data-driven procedures to estimate these quantities. In these settings, researchers are often confronted with data arising from an unknown discrete distribution and seek to estimate its entropy. This, coupled with the current data-driven and computing-rich era, motivates sustained research interest in entropy for practitioners.
Entropy estimation remains an openly discussed challenge. Ref. [4] investigated how the maximum likelihood estimator (MLE) performs; this is also referred to as the plug-in principle in functional estimation, where a point estimate of the parameter is used to build an estimate of a functional of the parameter. The classical asymptotic theory of MLEs does not adequately address the high-dimensional settings of this current data-driven era [4], and high-dimensional statistics arguably demand theoretical tools tailored to these settings. Ref. [5] investigated 18 different estimators, assessing their suitability experimentally based on bias and mean squared error. This work takes a Bayesian approach to entropy estimation, building upon work by [1,4,6,7].
Multivariate count data constrained to sum to a fixed constant are commonly modelled using the multinomial distribution. It is widely used for categorical data, whose features could be, for example, words in the case of textual documents or visual words in the case of images. The Dirichlet distribution, closely related to the probabilistic behaviour of the multinomial distribution, is a conjugate prior for the multinomial distribution when a Bayes perspective is of interest. Ref. [8] highlights how the use of prior distributions in a Bayesian framework makes it possible to work with very limited data sets, and Ref. [9] underscores the superior performance of a hierarchical approach to constructing the statistical model. Some meaningful studies include [1,4,6,10]. Ref. [11] also showed how using different Dirichlet distributions in the bivariate case gives one the opportunity to include prior information and expert opinion to obtain more realistic results in certain situations. Ref. [4] also experimented with the estimation of entropy, which triggered further exploration of alternative priors. Experimentation on diverse data sets might necessitate parameter-rich priors; therefore, this study proposes alternative Dirichlet priors to address this potential challenge.
The paper illustrates how a Bayesian approach is applied in a multinomial-Dirichlet family setup, which yields a posterior distribution from which explicit expressions for the Tsallis entropy can be derived, by focussing on the product moments of the power sum functional and assuming squared error loss. The first of two main contributions of this paper is the addition of flexible priors from a Dirichlet family, utilised within an information-theoretic setting, which also allow for positive correlation in addition to the usual negative correlation characteristic. The second shows that elegant constructions of the complete product moments of the posteriors give one the comparative advantage of obtaining explicit estimators for entropy under these Dirichlet priors. Ref. [8] echoes how computation via moments accelerates the estimation of entropy.
The paper is outlined as follows. Section 2 presents the essential components used throughout. In Section 3, alternative Dirichlet priors are introduced and studied as candidates for the Bayesian analysis of entropy. In Section 4, analytical expressions for the entropies under consideration are derived and studied. Section 5 contains conclusions and final thoughts.

2. Essential Components

The countably discrete model under consideration in this paper is the well-motivated multinomial distribution. A discrete random variable $\mathbf{X} = (X_1, \ldots, X_K)$ follows the multinomial distribution of order $K$ (i.e., with $K$ distinct classes of interest) with parameters $\mathbf{p} = (p_1, p_2, \ldots, p_K)$ and $n > 0$ if its probability mass function (pmf) is given by

$$f(\mathbf{x} \mid \mathbf{p}) = \frac{n!}{\prod_{i=1}^{K} x_i! \left(n - \sum_{i=1}^{K} x_i\right)!} \prod_{i=1}^{K} p_i^{x_i} \left(1 - \sum_{i=1}^{K} p_i\right)^{n - \sum_{i=1}^{K} x_i}. \tag{1}$$
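For concreteness, the pmf (1) can be checked numerically with scipy's multinomial distribution, which works with the full length-$(K+1)$ vectors of counts and probabilities (including the remainder class); the values below are purely illustrative.

```python
# Evaluating the multinomial pmf (1) for K = 2 classes plus the remainder.
from scipy.stats import multinomial

n = 10
p = [0.2, 0.3]   # p_1, p_2; the remainder class has probability 0.5
x = [1, 2]       # x_1, x_2; the remainder count is n - x_1 - x_2 = 7

# scipy expects the full probability and count vectors of length K + 1.
pmf = multinomial.pmf(x + [n - sum(x)], n=n, p=p + [1 - sum(p)])
print(pmf)  # P(X_1 = 1, X_2 = 2) under the order-2 multinomial
```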
The Dirichlet distribution (of type 1, see [12]) of order $K \geq 2$ with parameters $\Pi = (\pi_1, \pi_2, \ldots, \pi_{K+1})$, $\pi_i > 0$ for $i = 1, \ldots, K+1$, has a probability density function (pdf) with respect to the Lebesgue measure on the Euclidean space $\mathbb{R}^K$ given by

$$h(\mathbf{p}; \Pi) = \frac{\Gamma\left(\sum_{i=1}^{K+1} \pi_i\right)}{\prod_{i=1}^{K+1} \Gamma(\pi_i)} \prod_{i=1}^{K+1} p_i^{\pi_i - 1} \tag{2}$$

on the $K$-dimensional simplex defined by

$$p_1, p_2, \ldots, p_K > 0, \qquad p_1 + p_2 + \cdots + p_K < 1, \qquad p_{K+1} = 1 - p_1 - \cdots - p_K,$$

where $\Gamma(\cdot)$ denotes the usual gamma function (this $K$-dimensional simplex and its constraints are denoted by $\mathcal{A}$).
To derive a Bayesian engine, we need the likelihood function $f(\mathbf{x} \mid \mathbf{p})$ in addition to a suitable prior distribution $h(\mathbf{p})$. The fundamental relationship between the likelihood function and the prior distribution that forms the posterior distribution $f(\mathbf{p} \mid \mathbf{x})$ is given by

$$f(\mathbf{p} \mid \mathbf{x}) = \frac{f(\mathbf{x} \mid \mathbf{p})\, h(\mathbf{p})}{\int f(\mathbf{x} \mid \mathbf{p})\, h(\mathbf{p}) \, d\mathbf{p}}.$$
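A minimal numerical sketch of this relation, in the simplest $K = 1$ case (Beta prior, binomial likelihood), normalises $f(\mathbf{x} \mid \mathbf{p}) h(\mathbf{p})$ on a grid and recovers the conjugate posterior; the counts and hyperparameters are illustrative.

```python
# Grid-based illustration of the posterior relation above for K = 1.
import numpy as np
from scipy.stats import binom, beta

n, x, a, b = 10, 3, 2.0, 2.0
p = np.linspace(1e-4, 1 - 1e-4, 2001)
unnorm = binom.pmf(x, n, p) * beta.pdf(p, a, b)      # f(x|p) h(p)
post = unnorm / (unnorm.sum() * (p[1] - p[0]))       # divide by the integral

# Conjugacy: the grid posterior matches Beta(a + x, b + n - x) pointwise.
print(np.max(np.abs(post - beta.pdf(p, a + x, b + n - x))))  # ~0
```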
The most popular form of entropy is that of Shannon:

$$H(P) = -\sum_{i=1}^{K+1} p_i \ln p_i. \tag{3}$$

Various generalised cases of this entropy exist, which rely on the power sum

$$F_\alpha(P) = \sum_{i=1}^{K+1} p_i^\alpha, \tag{4}$$

where $\alpha > 0$. The power sum functional occurs in various operational problems [4]. Under the assumption of squared error loss within Bayes estimation, the estimates of both these quantities are given by their expected values:

$$E(H(P)) = E\left(-\sum_{i=1}^{K+1} p_i \ln p_i\right)$$

and

$$\hat{F}_\alpha(P) = E(F_\alpha(P)) = E\left(\sum_{i=1}^{K+1} p_i^\alpha\right) = \sum_{i=1}^{K+1} E(p_i^\alpha).$$
Thus, it is of value to consider the expected value of $p_i^\alpha$ for all values of $i$.
There are cases that cannot be fully explained by Shannon entropy, such as non-extensive systems like alignment processing (namely registration), which exhibits complex behaviours associated with the phenomena of radar-imaging systems [13]; for these, other generalised forms were designed. The Tsallis entropy considered in this paper, a popular generalised entropy, tends to Shannon entropy as $\alpha$ tends to 1 [14] and is given by

$$T = \frac{\sum_{i=1}^{K+1} p_i^\alpha - 1}{1 - \alpha}; \qquad \alpha > 0, \; \alpha \neq 1. \tag{5}$$

The estimate of this generalisation can be written in terms of the estimate of the power sum:

$$E(T) = E\left(\frac{\sum_{j=1}^{K+1} p_j^\alpha - 1}{1 - \alpha}\right) = \frac{\hat{F}_\alpha(\mathbf{p}) - 1}{1 - \alpha}.$$
Since the power sum is easier to estimate than the Shannon entropy, the power sum is used in our case. We consider the estimate as the expectation under the posterior distribution, thus under squared-error loss.
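As a small sketch, the mapping from an estimated power sum (4) to the Tsallis estimate (5) can be coded directly; the probability vector below is illustrative and uses the exact power sum of a known $\mathbf{p}$ as a stand-in for $\hat{F}_\alpha$.

```python
import numpy as np

def tsallis_from_power_sum(F_alpha_hat: float, alpha: float) -> float:
    """Tsallis entropy estimate (5) from an estimated power sum (4)."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("Tsallis entropy requires alpha > 0 and alpha != 1")
    return (F_alpha_hat - 1.0) / (1.0 - alpha)

p = np.array([0.2, 0.3, 0.5])
alpha = 2.0
F_alpha = np.sum(p ** alpha)                     # exact power sum, 0.38
print(tsallis_from_power_sum(F_alpha, alpha))    # (0.38 - 1)/(1 - 2) = 0.62
```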

3. Alternative Dirichlet Priors

In this section, two previously unconsidered Dirichlet priors, namely the Dirichlet generator prior and the noncentral Dirichlet prior, are proposed. Positive correlation can be observed for special cases of the Dirichlet generator prior, which is a benefit of this generator form. These new contributions add to the field of generative models for count data and have not previously been considered for entropy.

3.1. Dirichlet Generator Prior

In this section, Dirichlet generator distributions are proposed as alternative candidates. From this form, numerous flexible candidates can be “generated”.
Definition 1.
Suppose $\mathbf{p}$ is Dirichlet-generator distributed. Then, its pdf is given by

$$h(p_1, \ldots, p_K; \Pi) = C\, p_1^{\pi_1 - 1} p_2^{\pi_2 - 1} \cdots p_K^{\pi_K - 1} \left(1 - \sum_{i=1}^{K} p_i\right)^{\pi_{K+1} - 1} g\left(\theta \sum_{i=1}^{K} p_i\right) \tag{6}$$

with $C$ a normalising constant such that

$$C^{-1} = \int_{\mathcal{A}} p_1^{\pi_1 - 1} p_2^{\pi_2 - 1} \cdots p_K^{\pi_K - 1} \left(1 - \sum_{i=1}^{K} p_i\right)^{\pi_{K+1} - 1} g\left(\theta \sum_{i=1}^{K} p_i\right) d\mathbf{p}. \tag{7}$$

The vector $\mathbf{p} \in \mathcal{A}$ is thus a Dirichlet generator variate with parameters $\Pi = (\pi_1, \ldots, \pi_{K+1})$, $\theta \in \mathbb{R}$, and whatever additional parameters $g(\cdot)$ imposes, provided that the pdf $h(\cdot)$ is non-negative. The following conditions also apply:
(1) $g(\cdot)$ is a Borel-measurable function;
(2) $g(\cdot)$ admits a Taylor series expansion;
(3) $g(0) = 1$.
The usual Dirichlet distribution with pdf (2) is thus a special case of (6) when θ = 0 .
For illustration of the implementation of the Dirichlet generator prior, we focus on

$$g\left(\theta \sum_{i=1}^{K} p_i\right) = {}_rF_q\left(a_1, \ldots, a_r; b_1, \ldots, b_q; \theta \sum_{i=1}^{K} p_i\right) = \sum_{n=0}^{\infty} \frac{(a_1)_n \cdots (a_r)_n}{(b_1)_n \cdots (b_q)_n} \frac{\left(\theta \sum_{i=1}^{K} p_i\right)^n}{n!},$$

where ${}_rF_q(\cdot)$ denotes the generalised hypergeometric function (see [15]) and $(a)_k = \frac{\Gamma(a+k)}{\Gamma(a)}$ is the Pochhammer symbol.
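The generator is straightforward to evaluate numerically; below is a sketch using mpmath's generalised hypergeometric routine, with the parameter choices $a = 4$ and $b = 5$ taken from the figures discussed later and an illustrative argument $\theta \sum_i p_i$.

```python
# Evaluating the three hypergeometric generator candidates with mpmath.
import mpmath as mp

theta, p_sum = 0.1, 0.4
z = theta * p_sum

print(mp.hyper([], [], z))    # 0F0(;; z) = exp(z)
print(mp.hyper([], [5], z))   # 0F1(; b; z) with b = 5
print(mp.hyper([4], [5], z))  # 1F1(a; b; z) with a = 4, b = 5
print(mp.exp(z))              # agrees with the 0F0 line above
```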
The prior distribution (6) then takes the following form, with pdf

$$h(p_1, \ldots, p_K; \Pi) = C^{*-1} \frac{\Gamma\left(\sum_{j=1}^{K+1} \pi_j\right)}{\prod_{j=1}^{K+1} \Gamma(\pi_j)}\, p_1^{\pi_1 - 1} \cdots p_K^{\pi_K - 1} \left(1 - \sum_{j=1}^{K} p_j\right)^{\pi_{K+1} - 1} \times {}_rF_q\left(a_1, \ldots, a_r; b_1, \ldots, b_q; \theta \sum_{j=1}^{K} p_j\right), \tag{8}$$

where $C^{*}$ is equal to

$${}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{j=1}^{K} \pi_j; b_1, \ldots, b_q, \sum_{j=1}^{K+1} \pi_j; \theta\right). \tag{9}$$
In this paper, three hypergeometric functions are considered (${}_0F_0$, ${}_0F_1$ and ${}_1F_1$), since these are commonly considered candidates, representing the exponential function, the confluent hypergeometric limit function and the confluent hypergeometric function, respectively. For illustrative investigation, bivariate observations from the corresponding distributions were simulated using Algorithm 1, and the associated pdfs are overlaid and presented in Figure 1, Figure 2 and Figure 3. The data were simulated from (8) using the following steps of the acceptance/rejection method (a sketch implementation in Python follows the algorithm):
Algorithm 1 Acceptance/rejection method
1. Define a grid $y_i \in (0, 1)$ of size $n$ for $i = 1, 2$;
2. Calculate the Dirichlet generator pdf (8), $h(y_1, y_2)$, for $y_1 + y_2 < 1$;
3. Obtain $m = \max(h(y_1, y_2))$;
4. Simulate $p_i \sim \mathrm{Unif}(0, 1)$ of size $n$ for $i = 1, 2$;
5. Calculate the Dirichlet generator pdf (8), $h(p_1, p_2)$, for $p_1 + p_2 < 1$;
6. Simulate $z \sim \mathrm{Unif}(0, 1)$ of size $n$;
7. If $h(p_1, p_2)/m > z$, then keep $(p_1, p_2)$; else return to Step 4;
8. Repeat Steps 4–7 $k$ times.
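Below is a minimal Python sketch of Algorithm 1 for $K = 2$, using the ${}_0F_0$ generator $g(t) = e^{\theta t}$ so that the target pdf (8) is cheap to evaluate up to its normalising constant (which cancels in the acceptance ratio); the grid resolution, parameter values and sample size are assumptions chosen for illustration.

```python
import numpy as np

def log_generator_pdf(p1, p2, pi, theta):
    """Unnormalised log-pdf of (8) for K = 2 with g(t) = exp(theta * t)."""
    p3 = 1.0 - p1 - p2
    return ((pi[0] - 1) * np.log(p1) + (pi[1] - 1) * np.log(p2)
            + (pi[2] - 1) * np.log(p3) + theta * (p1 + p2))

def sample_generator(n_samples, pi=(2.0, 2.0, 2.0), theta=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-3: bound the target pdf from above over a grid on the simplex.
    y = np.linspace(0.001, 0.999, 400)
    y1, y2 = np.meshgrid(y, y)
    keep = (y1 + y2) < 1.0
    log_m = log_generator_pdf(y1[keep], y2[keep], pi, theta).max()
    out = []
    while len(out) < n_samples:
        # Steps 4-7: uniform proposal on the unit square, accept/reject test.
        p1, p2, z = rng.uniform(size=3)
        if p1 + p2 < 1.0 and np.exp(log_generator_pdf(p1, p2, pi, theta) - log_m) > z:
            out.append((p1, p2))
    return np.array(out)

samples = sample_generator(1000)
print(samples.mean(axis=0))
```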
Figure 1 and Figure 2 illustrate the three chosen hypergeometric functions for two choices of $\theta$ and for three different sets of $\Pi$ when $K = 2$, with $a_1 = 4$ and $b_1 = 5$ for ${}_0F_1$ and ${}_1F_1$, respectively. This firstly illustrates the difference between the hypergeometric candidates as well as the effect a change in $\pi_1$ has on these three functions (a symmetric observation would hold for $\pi_2$). The difference between Figure 1 and Figure 2 shows the effect that $\theta$ has on these combinations, with Figure 1 having a very small (almost negligible) $\theta$ and Figure 2 an increased value of $\theta$. An increase in $\pi_1$ results in a denser concentration of the pdf over the corresponding values of $p_1$ and $p_2$; this is observed for all three hypergeometric candidates in Figure 1 and Figure 2, and likewise for an increase in $\theta$. For Figure 3, a single set of $\Pi$ was selected with $\theta = 0.1$ to showcase the effect that the parameters $a$ and $b$ of the hypergeometric function have on the ${}_0F_1$ and ${}_1F_1$ functions. As $a$ increases, more mass is observed closer to the restriction $p_1 + p_2 < 1$, while an increase in $b$ results in a lower pdf volume.
Next, the posterior distribution is derived, assuming the Dirichlet generator prior (8) together with a multinomial likelihood (1).
Theorem 1.
Suppose the likelihood function is given by (1) and the prior distribution for $\mathbf{p}$ is given by (8). Then, the pdf of the posterior distribution is given by

$$f(\mathbf{p} \mid \mathbf{x}) \propto p_1^{\pi_1 + x_1 - 1} p_2^{\pi_2 + x_2 - 1} \cdots p_K^{\pi_K + x_K - 1} \left(1 - \sum_{i=1}^{K} p_i\right)^{\pi_{K+1} + x_{K+1} - 1} g\left(\theta \sum_{i=1}^{K} p_i\right), \tag{10}$$

which is identifiable as a Dirichlet generator distribution with parameters $(\pi_1 + x_1, \ldots, \pi_K + x_K, \pi_{K+1} + x_{K+1})$.
The complete product moment of the Dirichlet generator posterior (10) is of interest for the power sum (4); thus, we are interested in $E\left(p_1^{k_1} p_2^{k_2} \cdots p_K^{k_K} p_{K+1}^{k_{K+1}}\right)$.
Theorem 2.
Suppose that $\mathbf{p} \mid \mathbf{x}$ follows a Dirichlet generator posterior distribution with pdf given in (10). Then, the complete product moment is given by

$$E\left(p_1^{k_1} p_2^{k_2} \cdots p_{K+1}^{k_{K+1}}\right) = \frac{(\pi_1 + x_1)_{k_1} (\pi_2 + x_2)_{k_2} \cdots (\pi_{K+1} + x_{K+1})_{k_{K+1}}}{\left(\sum_{i=1}^{K+1} (\pi_i + x_i)\right)_{k_1 + \cdots + k_{K+1}}} \times \frac{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{i=1}^{K} (\pi_i + x_i + k_i); b_1, \ldots, b_q, \sum_{i=1}^{K+1} (\pi_i + x_i + k_i); \theta\right)}{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{i=1}^{K} (\pi_i + x_i); b_1, \ldots, b_q, \sum_{i=1}^{K+1} (\pi_i + x_i); \theta\right)}. \tag{11}$$

Special cases of the above expression include setting $k_{K+1} = 0$ to obtain an expression for the usual product moment of the Dirichlet generator distribution under investigation in this paper.
Proof. 
See Appendix A for the proof.    □
The product moment can then be used to investigate the correlation for the examples illustrated in this section. Figure 4 displays the correlation for a range of $\theta$ values using (8) and the special cases. It is important to note the positive correlation obtained by the introduction of $g(\cdot) = {}_rF_q(\cdot)$, which is a major benefit of using these alternative Dirichlet priors. A numerical sketch of this moment-based correlation computation follows.
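The sketch below evaluates (11) for $K = 2$ in the ${}_0F_0$ case, where the ratio of ${}_{r+1}F_{q+1}$ functions reduces to a ratio of ${}_1F_1$ functions, and computes the correlation of $(p_1, p_2)$ from the first and second product moments; taking $x_i = 0$ reduces the posterior to the prior, and the parameter set mirrors Set 1 of Figure 4, with $\theta$ illustrative.

```python
import numpy as np
import mpmath as mp

def product_moment(k, pi, theta):
    """E(p1^k1 p2^k2 p3^k3) from (11) for K = 2, g = 0F0, with x_i = 0."""
    s = sum(pi)
    # Leading Pochhammer ratio: prod (pi_i)_{k_i} / (sum pi)_{sum k}.
    front = np.prod([float(mp.rf(a, b)) for a, b in zip(pi, k)]) / float(mp.rf(s, sum(k)))
    # The r+1Fq+1 ratio collapses to a 1F1 ratio in the 0F0 case.
    num = mp.hyper([pi[0] + pi[1] + k[0] + k[1]], [s + sum(k)], theta)
    den = mp.hyper([pi[0] + pi[1]], [s], theta)
    return front * float(num / den)

pi, theta = (0.5, 0.5, 5.0), 2.0
m1 = product_moment((1, 0, 0), pi, theta)
m2 = product_moment((0, 1, 0), pi, theta)
cov = product_moment((1, 1, 0), pi, theta) - m1 * m2
v1 = product_moment((2, 0, 0), pi, theta) - m1 ** 2
v2 = product_moment((0, 2, 0), pi, theta) - m2 ** 2
print(cov / np.sqrt(v1 * v2))   # correlation of (p1, p2), cf. Figure 4
```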

3.2. Noncentral Dirichlet Prior

In this section, a noncentral Dirichlet distribution is constructed via the use of Poisson weights. Ref. [16] explored the use of a compounding method as a distributional building tool to obtain bivariate noncentral distributions and showed how this form of the distribution isolates the noncentrality parameters by retaining them in a Poisson probability form, hence introducing mathematical convenience. Ref. [17] extended this work by introducing new bivariate gamma distributions emanating from a scale mixture of normal class.
Theorem 3.
Suppose $\mathbf{p}$ is Dirichlet distributed with pdf given by (2). Then, a noncentral Dirichlet distribution can be constructed in the following manner:

$$\begin{aligned} h(\mathbf{p}; \Pi, \Lambda) &= \sum_{j_1=0}^{\infty} \cdots \sum_{j_{K+1}=0}^{\infty} \frac{e^{-\frac{\lambda_1}{2}} \left(\frac{\lambda_1}{2}\right)^{j_1}}{j_1!} \cdots \frac{e^{-\frac{\lambda_{K+1}}{2}} \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}}}{j_{K+1}!} \, h(\mathbf{p}; \Pi \mid j_1, \ldots, j_{K+1}) \\ &= \sum_{j_1=0}^{\infty} \cdots \sum_{j_{K+1}=0}^{\infty} \frac{e^{-\frac{\lambda_1}{2}} \left(\frac{\lambda_1}{2}\right)^{j_1}}{j_1!} \cdots \frac{e^{-\frac{\lambda_{K+1}}{2}} \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}}}{j_{K+1}!} \\ &\quad \times \frac{\Gamma(\pi_1 + j_1 + \cdots + \pi_{K+1} + j_{K+1})}{\Gamma(\pi_1 + j_1) \cdots \Gamma(\pi_{K+1} + j_{K+1})} \, p_1^{\pi_1 + j_1 - 1} \cdots p_K^{\pi_K + j_K - 1} \left(1 - \sum_{i=1}^{K} p_i\right)^{\pi_{K+1} + j_{K+1} - 1}, \end{aligned} \tag{12}$$

where $h(\mathbf{p}; \Pi \mid j_1, \ldots, j_{K+1})$ denotes the conditional (central) Dirichlet distribution (see (2)) with parameters $\Pi^* = (\pi_1 + j_1, \ldots, \pi_{K+1} + j_{K+1})$, and $\Lambda$ denotes the vector of noncentrality parameters $(\lambda_1, \ldots, \lambda_K, \lambda_{K+1})$ with $\lambda_i > 0$ for all $i$. After simplification, (12) reflects

$$h(\mathbf{p}; \Pi, \Lambda) = h(\mathbf{p}; \Pi)\, e^{-\sum_{i=1}^{K+1} \frac{\lambda_i}{2}} \times \sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2} p_1\right)^{j_1} \cdots \left(\frac{\lambda_K}{2} p_K\right)^{j_K} \left(\frac{\lambda_{K+1}}{2} \left(1 - \sum_{i=1}^{K} p_i\right)\right)^{j_{K+1}}, \tag{13}$$

where $h(\mathbf{p}; \Pi)$ denotes the (unconditional) Dirichlet distribution (see (2)) with parameter $\Pi$ and where $\sum_{\phi} = \sum_{j_1=0}^{\infty} \cdots \sum_{j_{K+1}=0}^{\infty}$.
Remark 1.
The pdf in equation (13) reflects a parametrization of the noncentral Dirichlet distribution of [12] and can be represented via the confluent hypergeometric function of several variables:

$$h(\mathbf{p}; \Pi, \Lambda) = h(\mathbf{p}; \Pi)\, e^{-\sum_{i=1}^{K+1} \frac{\lambda_i}{2}} \times \Psi_2^{(K+1)}\left(\sum_{i=1}^{K+1} \pi_i; \pi_1, \ldots, \pi_{K+1}; \frac{\lambda_1}{2} p_1, \ldots, \frac{\lambda_K}{2} p_K, \frac{\lambda_{K+1}}{2}\left(1 - \sum_{i=1}^{K} p_i\right)\right),$$

where

$$\Psi_2^{(K+1)}\left(a; b_1, \ldots, b_{K+1}; z_1, \ldots, z_{K+1}\right) = \sum_{\phi} \frac{(a)_{j_1 + \cdots + j_{K+1}}}{(b_1)_{j_1} \cdots (b_{K+1})_{j_{K+1}}} \frac{z_1^{j_1} \cdots z_{K+1}^{j_{K+1}}}{j_1! \cdots j_{K+1}!}.$$

In particular, when $\lambda_1 = \lambda_2 = \cdots = \lambda_K = \lambda_{K+1} = 0$, see that

$$h(\mathbf{p}; \Pi)\, e^{0}\, \Psi_2^{(K+1)}\left(\sum_{i=1}^{K+1} \pi_i; \pi_1, \ldots, \pi_{K+1}; 0, \ldots, 0, 0\right) = h(\mathbf{p}; \Pi),$$

which illustrates that the model in (12) reduces to the usual (central) Dirichlet model in (2) when the noncentrality parameters are equal to 0. The model in (12) is thus the multivariate analogue of the doubly noncentral beta distribution (see [18]). In the case when $\Psi_2^{(K)}\left(\sum_{i=1}^{K} \pi_i; \pi_1, \ldots, \pi_K; \frac{\lambda_1}{2} p_1, \ldots, \frac{\lambda_K}{2} p_K\right)$ is considered in (12), this would represent the multivariate analogue of the singly noncentral beta distribution of [18].
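The function $\Psi_2^{(K+1)}$ can be evaluated numerically by truncating the multiple series; below is a sketch for $K = 2$ (three indices), where the truncation level $J$ and the arguments are assumptions chosen for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_poch(a, k):
    """log of the Pochhammer symbol (a)_k = Gamma(a + k) / Gamma(a)."""
    return gammaln(a + k) - gammaln(a)

def psi2(a, bs, xs, J=30):
    """Truncated Psi_2^(3)(a; b1, b2, b3; x1, x2, x3), K = 2."""
    total = 0.0
    for j1 in range(J):
        for j2 in range(J):
            for j3 in range(J):
                log_c = (log_poch(a, j1 + j2 + j3)
                         - log_poch(bs[0], j1) - log_poch(bs[1], j2) - log_poch(bs[2], j3)
                         - gammaln(j1 + 1) - gammaln(j2 + 1) - gammaln(j3 + 1))
                total += np.exp(log_c) * xs[0]**j1 * xs[1]**j2 * xs[2]**j3
    return total

# With all noncentrality arguments 0, Psi_2 reduces to 1, as in the remark.
print(psi2(6.0, (2.0, 2.0, 2.0), (0.0, 0.0, 0.0)))   # -> 1.0
print(psi2(6.0, (2.0, 2.0, 2.0), (0.05, 0.4, 0.05)))
```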
Bivariate observations from the corresponding distributions were simulated using Algorithm 1, and the associated pdfs are overlaid and presented in Figure 5 for different values of $\lambda_1$ and three combinations of $\Pi$. These results showcase the effect that $\lambda_1$ has on these functions; Figure 5 clearly demonstrates the movement of the centroid of the contour plot.
Next, the posterior distribution is derived, assuming the noncentral Dirichlet prior (12) together with a multinomial likelihood (1).
Theorem 4.
Suppose the likelihood function is given by (1) and the prior distribution for $\mathbf{p}$ is given by (12). Then, the posterior distribution has pdf

$$f(\mathbf{p} \mid \mathbf{x}) \propto \sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2} p_1\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2} p_{K+1}\right)^{j_{K+1}} \times \frac{\Gamma\left(\sum_{j=1}^{K+1} (\pi_j + x_j)\right)}{\prod_{j=1}^{K+1} \Gamma(\pi_j + x_j)}\, p_1^{\pi_1 + x_1 - 1} \cdots p_{K+1}^{\pi_{K+1} + x_{K+1} - 1}, \tag{14}$$

with $p_{K+1} = 1 - \sum_{i=1}^{K} p_i$, which can be identified as a noncentral Dirichlet distribution with parameters $(\pi_1 + x_1, \ldots, \pi_K + x_K, \pi_{K+1} + x_{K+1})$ and $\Lambda$.
Remark 2.
See that (14) can be represented using the confluent hypergeometric function from Remark 1 as

$$f(\mathbf{p} \mid \mathbf{x}; \Lambda) = \frac{\Psi_2^{(K+1)}\left(\sum_{i=1}^{K+1} \pi_i; \pi_1, \ldots, \pi_{K+1}; \frac{\lambda_1}{2} p_1, \ldots, \frac{\lambda_K}{2} p_K, \frac{\lambda_{K+1}}{2}\left(1 - \sum_{i=1}^{K} p_i\right)\right) \prod_{j=1}^{K+1} p_j^{\pi_j + x_j - 1}}{\sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{j=1}^{K+1} \Gamma(\pi_j + x_j + j_j)}{\Gamma\left(\sum_{j=1}^{K+1} (\pi_j + x_j + j_j)\right)}},$$

with $p_{K+1} = 1 - \sum_{i=1}^{K} p_i$.
The complete product moment of the noncentral Dirichlet posterior is of interest for the power sum; thus, we are interested in $E\left(p_1^{k_1} p_2^{k_2} \cdots p_K^{k_K} p_{K+1}^{k_{K+1}}\right)$.
Theorem 5.
Suppose that $\mathbf{p} \mid \mathbf{x}$ follows a noncentral Dirichlet distribution with pdf given in (14). Then, the complete product moment is given by

$$E\left(p_1^{k_1} p_2^{k_2} \cdots p_{K+1}^{k_{K+1}}\right) = \frac{\sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i + k_i)}{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i + k_i)\right)}}{\sum_{\phi^*} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i)}{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i)\right)}}, \tag{15}$$

where $\sum_{\phi^*}$ denotes the same multiple sum as $\sum_{\phi}$ over a second set of indices.
Proof. 
See Appendix B for the proof.    □

4. Entropy Estimates

In this section, the Bayesian estimators (16) and (17), based on the posterior distributions (10) and (14), are derived for the power sum (4).

4.1. Dirichlet Generator Prior

Assuming the Dirichlet generator prior, the posterior distribution is given by (10). Using the complete product moments derived in (11), the Bayesian estimator for the power sum (4) can be derived by setting $k_{j^*} = \alpha$ for a single index $j^* \in \{1, \ldots, K+1\}$ and $k_i = 0$ otherwise, and summing over $j^*$.
Theorem 6.
Using (11), the Bayesian estimator for the power sum (4) under the Dirichlet generator posterior (10) is given by

$$\hat{F}_\alpha(\mathbf{p}) = \sum_{j^*=1}^{K+1} \frac{\Gamma(\pi_{j^*} + x_{j^*} + \alpha)}{\Gamma(\pi_{j^*} + x_{j^*})} \frac{\Gamma\left(\sum_{j=1}^{K+1} (\pi_j + x_j)\right)}{\Gamma\left(\alpha + \sum_{j=1}^{K+1} (\pi_j + x_j)\right)} \times \frac{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \alpha + \sum_{j=1}^{K} (\pi_j + x_j); b_1, \ldots, b_q, \alpha + \sum_{j=1}^{K+1} (\pi_j + x_j); \theta\right)}{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{j=1}^{K} (\pi_j + x_j); b_1, \ldots, b_q, \sum_{j=1}^{K+1} (\pi_j + x_j); \theta\right)}. \tag{16}$$
Using the estimated power sum, we can calculate and investigate the behaviour of the Tsallis entropy for various parameter scenarios, as illustrated in Figure 6 for the bivariate case $K = 2$. A sketch implementation follows.
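The sketch below evaluates (16) for the ${}_0F_0$ generator, in which the ratio of ${}_{r+1}F_{q+1}$ functions becomes a ratio of ${}_1F_1$ functions; the counts $x$ match the values used in Section 4.3, and the remaining parameter values are illustrative.

```python
import numpy as np
import mpmath as mp
from scipy.special import gammaln

def power_sum_generator(x, pi, theta, alpha):
    """F_alpha_hat from (16) with g = 0F0; works for any K."""
    a = np.asarray(pi, float) + np.asarray(x, float)   # posterior parameters
    s = a.sum()
    # In the 0F0 case the hypergeometric ratio is a ratio of 1F1's.
    f_ratio = float(mp.hyper([a[:-1].sum() + alpha], [s + alpha], theta)
                    / mp.hyper([a[:-1].sum()], [s], theta))
    # Gamma-function factors for each j* term, summed over j*.
    terms = np.exp(gammaln(a + alpha) - gammaln(a) + gammaln(s) - gammaln(s + alpha))
    return terms.sum() * f_ratio

x, pi, theta, alpha = (1, 2, 10), (2.0, 2.0, 2.0), 0.5, 2.0
F_hat = power_sum_generator(x, pi, theta, alpha)
print(F_hat, (F_hat - 1) / (1 - alpha))   # power sum (4) and Tsallis (5)
```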

4.2. Noncentral Dirichlet Prior

Assuming the noncentral Dirichlet prior, the posterior distribution is given by (14). Using the complete product moments derived in (15), the Bayesian estimator for the power sum (4) can be derived by setting $k_{j^*} = \alpha$ for a single index $j^* \in \{1, \ldots, K+1\}$ and $k_i = 0$ otherwise, and summing over $j^*$.
Theorem 7.
By using (15), the Bayesian estimator for the power sum (4) under the noncentral Dirichlet posterior (14) is given by

$$\hat{F}_\alpha(\mathbf{p}) = \sum_{j^*=1}^{K+1} \frac{\sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\Gamma(\pi_{j^*} + x_{j^*} + j_{j^*} + \alpha) \prod_{j \neq j^*}^{K+1} \Gamma(\pi_j + x_j + j_j)}{\Gamma\left(\alpha + \sum_{j=1}^{K+1} (\pi_j + x_j + j_j)\right)}}{\sum_{\phi^*} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{j=1}^{K+1} \Gamma(\pi_j + x_j + j_j)}{\Gamma\left(\sum_{j=1}^{K+1} (\pi_j + x_j + j_j)\right)}}. \tag{17}$$
Using the estimated power sum, we can calculate the entropies for different parameters of interest, as illustrated in Figure 7 for the bivariate case. A sketch implementation for $K = 2$ follows.
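The sketch below evaluates (17) for $K = 2$ by truncating the Poisson-weighted triple series at $J$ terms per index; the truncation level and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_poch(a, k):
    """log of the Pochhammer symbol (a)_k."""
    return gammaln(a + k) - gammaln(a)

def power_sum_noncentral(x, pi, lam, alpha, J=20):
    """F_alpha_hat from (17) for K = 2, truncating each index at J."""
    x, pi, lam = (np.asarray(v, float) for v in (x, pi, lam))
    a = pi + x
    num = den = 0.0
    for idx in np.ndindex(J, J, J):
        j = np.asarray(idx, float)
        # Series weight: (sum pi)_{sum j} / (prod (pi_i)_{j_i} j_i!) * prod (lam_i/2)^{j_i}.
        log_w = (log_poch(pi.sum(), j.sum())
                 - sum(log_poch(b, jj) for b, jj in zip(pi, j))
                 - gammaln(j + 1).sum()
                 + (j * np.log(lam / 2.0)).sum())
        aj = a + j
        den += np.exp(log_w + gammaln(aj).sum() - gammaln(aj.sum()))
        for star in range(3):            # sum over j* = 1, 2, 3
            k = np.zeros(3)
            k[star] = alpha
            num += np.exp(log_w + gammaln(aj + k).sum() - gammaln(aj.sum() + alpha))
    return num / den

x, pi, lam, alpha = (1, 2, 10), (2.0, 2.0, 2.0), (0.1, 0.8, 0.1), 2.0
F_hat = power_sum_noncentral(x, pi, lam, alpha)
print(F_hat, (F_hat - 1) / (1 - alpha))   # power sum (4) and Tsallis (5)
```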

4.3. Numerical Experiments of Entropy

The following steps (Algorithm 2) illustrate the empirical behaviour of the Tsallis entropy under the alternative priors under consideration; a self-contained sketch follows the algorithm listing.
Algorithm 2 Numerical experiments of entropy
1. Simulate $p_1$ and $p_2$ from the posterior distributions given by (10) and (14) using the acceptance/rejection method as described earlier, for $n = 50$;
2. Calculate $p_3 = 1 - p_1 - p_2$;
3. Calculate the $p_i^\alpha$ values for all the samples;
4. Determine $\sum_{i=1}^{3} p_i^\alpha$ for each sample;
5. Calculate the median of the quantities in Step 4 (note that $\sum_{i=1}^{3} p_i^\alpha$ might not be symmetrically distributed, thus the median is used);
6. Use the power sum estimators (16) and (17) to calculate the Tsallis entropy (5);
7. Repeat Steps 1 to 5 for different parameters and plot the results against the analytical entropy in order to illustrate the accuracy of the derived estimates.
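Below is a compact, self-contained sketch of Algorithm 2 for the ${}_0F_0$ generator posterior: it simulates $(p_1, p_2)$ by acceptance/rejection, takes the median of $\sum_i p_i^\alpha$, and compares it with the analytic estimator of Theorem 6; sample sizes, grid resolution and parameter values are assumptions, and since the analytic value is a posterior mean while the algorithm uses a median, a modest gap between the two can remain.

```python
import numpy as np
import mpmath as mp
from scipy.special import gammaln

pi = np.array([2.0, 2.0, 2.0])
x = np.array([1.0, 2.0, 10.0])
theta, alpha = 0.5, 2.0
a = pi + x   # parameters of the posterior (10) in the 0F0 case

def log_post(p1, p2):
    p3 = 1.0 - p1 - p2
    return ((a[0] - 1) * np.log(p1) + (a[1] - 1) * np.log(p2)
            + (a[2] - 1) * np.log(p3) + theta * (p1 + p2))

# Steps 1-2: acceptance/rejection sampling of (p1, p2), then p3.
rng = np.random.default_rng(1)
g = np.linspace(1e-3, 1 - 1e-3, 300)
g1, g2 = np.meshgrid(g, g)
keep = (g1 + g2) < 1.0
log_m = log_post(g1[keep], g2[keep]).max()
samples = []
while len(samples) < 2000:
    p1, p2, z = rng.uniform(size=3)
    if p1 + p2 < 1.0 and np.exp(log_post(p1, p2) - log_m) > z:
        samples.append((p1, p2, 1.0 - p1 - p2))
p = np.array(samples)

# Steps 3-5: empirical median of the power sum.
empirical = np.median((p ** alpha).sum(axis=1))

# Step 6: analytic power sum from Theorem 6 (0F0 case) for comparison.
s = a.sum()
f_ratio = float(mp.hyper([a[:2].sum() + alpha], [s + alpha], theta)
                / mp.hyper([a[:2].sum()], [s], theta))
analytic = np.exp(gammaln(a + alpha) - gammaln(a)
                  + gammaln(s) - gammaln(s + alpha)).sum() * f_ratio
print(empirical, analytic, (analytic - 1) / (1 - alpha))
```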
Figure 8 and Figure 9 provide validation of the accuracy of the obtained theoretical expressions for the Dirichlet generator ${}_0F_0$ and noncentral Dirichlet cases with $x_1 = 1$, $x_2 = 2$ and $x_3 = 10$. From these two figures, it can be seen that the Dirichlet generator prior resulted in empirical results that closely match the theoretical results, while the noncentral Dirichlet shows slight deviations. It is observed that as $\pi_1$ increases, the Tsallis entropy increases (indicating more uncertainty), while as $\pi_1$ decreases, the Tsallis entropy also decreases (indicating less uncertainty). When considering the location of the density, the changing of $\pi_1$ leads to densities which tend to the margin of $p_2$ or towards a specific point along the $p_1 + p_2 = 1$ line. This shows that the uncertainty increases as the concentration of the density moves toward a point along the $p_1 + p_2 = 1$ line, and decreases as the concentration moves towards small values of $p_1$.

5. Conclusions

This study focussed on the power sum functional and its estimation as a key tool to model a generalised entropy form, namely Tsallis entropy, via a Bayesian approach. In particular, previously unconsidered Dirichlet priors have been proposed and studied, offering the practitioner more pliable options given experimental data. Specific choices within the proposed Dirichlet family allow for positive correlation in addition to the usual negative correlation characteristic. A numerical example illustrated that the theoretical results accurately describe the empirical entropy. Future work could include further investigations into generalised functionals and their modelling in this information-theoretic environment.

Author Contributions

Conceptualization, J.F. and A.B.; methodology, J.F. and A.B.; software, T.B.; validation, T.B., J.F. and A.B.; formal analysis, T.B.; investigation, T.B. and J.F.; writing—original draft preparation, J.F.; writing—review and editing, T.B., J.F. and A.B.; visualization, T.B.; project administration, A.B.; funding acquisition, J.F. and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work is based on the research supported in part by the National Research Foundation of South Africa (SARChI Research Chair- UID: 71199; and Grant ref. SRUG190308422768 nr. 120839) as well as the Research Development Programme at the University of Pretoria 296/2021. Opinions expressed and conclusions arrived at are those of the author and are not necessarily to be attributed to the NRF.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments which led to the improvement of this paper. The support of the Department of Statistics at the University of Pretoria is acknowledged.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MLE: Maximum likelihood estimator
pmf: Probability mass function
pdf: Probability density function

Appendix A. Proof of Complete Product Moments of Dirichlet Generator (11)

Proof. 
The definition of the complete product moment of a $(K+1)$-variate variable $\mathbf{Y}$ with pdf $f(\mathbf{y})$ is given by

$$E\left(\prod_{i=1}^{K+1} Y_i^{k_i}\right) = \int \cdots \int \prod_{i=1}^{K+1} y_i^{k_i}\, f(\mathbf{y}) \, dy_1 \cdots dy_{K+1},$$

and since we know that the posterior distribution is a Dirichlet generator distribution with parameters $(\pi_1 + x_1, \ldots, \pi_{K+1} + x_{K+1})$, we can show that

$$E\left(p_1^{k_1} \cdots p_{K+1}^{k_{K+1}}\right) = \int_{\mathcal{A}} \prod_{i=1}^{K+1} p_i^{k_i}\, f(\mathbf{p}) \, dp_1 \cdots dp_{K+1} = C \int_{\mathcal{A}} p_1^{\pi_1 + x_1 + k_1 - 1} \cdots p_{K+1}^{\pi_{K+1} + x_{K+1} + k_{K+1} - 1}\, {}_rF_q\left(a_1, \ldots, a_r; b_1, \ldots, b_q; \theta \sum_{i=1}^{K} p_i\right) dp_1 \cdots dp_{K+1},$$

where $C$ corresponds to the normalising constant of the posterior, as in (7) and (9), with parameters $(\pi_1 + x_1, \ldots, \pi_{K+1} + x_{K+1})$. Multiplying and dividing by the normalising constant with parameters $(\pi_1 + x_1 + k_1, \ldots, \pi_{K+1} + x_{K+1} + k_{K+1})$, the integral becomes that of the Dirichlet generator pdf (6) with these shifted parameters and therefore equals 1, so that the complete product moment simplifies to

$$E\left(p_1^{k_1} \cdots p_{K+1}^{k_{K+1}}\right) = \frac{\prod_{i=1}^{K+1} (\pi_i + x_i)_{k_i}}{\left(\sum_{i=1}^{K+1} (\pi_i + x_i)\right)_{k_1 + \cdots + k_{K+1}}} \times \frac{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{i=1}^{K} (\pi_i + x_i + k_i); b_1, \ldots, b_q, \sum_{i=1}^{K+1} (\pi_i + x_i + k_i); \theta\right)}{{}_{r+1}F_{q+1}\left(a_1, \ldots, a_r, \sum_{i=1}^{K} (\pi_i + x_i); b_1, \ldots, b_q, \sum_{i=1}^{K+1} (\pi_i + x_i); \theta\right)}. \qquad \square$$

Appendix B. Proof of Complete Product Moments of Noncentral Dirichlet (15)

Proof. 
The definition of the complete product moment of a $(K+1)$-variate variable $\mathbf{Y}$ with pdf $f(\mathbf{y})$ is given by

$$E\left(\prod_{i=1}^{K+1} Y_i^{k_i}\right) = \int \cdots \int \prod_{i=1}^{K+1} y_i^{k_i}\, f(\mathbf{y}) \, dy_1 \cdots dy_{K+1},$$

and since we know that the posterior distribution is a noncentral Dirichlet distribution (see (14)) with parameters $(\pi_1 + x_1, \ldots, \pi_{K+1} + x_{K+1})$ and $\Lambda$, we can show that

$$E\left(p_1^{k_1} \cdots p_{K+1}^{k_{K+1}}\right) = \int_{\mathcal{A}} \prod_{i=1}^{K+1} p_i^{k_i}\, f(\mathbf{p}) \, dp_1 \cdots dp_{K+1} = \frac{\sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i + k_i)}{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i + k_i)\right)} \int_{\mathcal{A}} \frac{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i + k_i)\right)}{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i + k_i)}\, p_1^{\pi_1 + x_1 + j_1 + k_1 - 1} \cdots p_{K+1}^{\pi_{K+1} + x_{K+1} + j_{K+1} + k_{K+1} - 1}\, dp_1 \cdots dp_{K+1}}{\sum_{\phi^*} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{j=1}^{K+1} \Gamma(\pi_j + x_j + j_j)}{\Gamma\left(\sum_{j=1}^{K+1} (\pi_j + x_j + j_j)\right)}}.$$

Since the inner integral is that of the (central) Dirichlet pdf with parameters $(\pi_1 + x_1 + j_1 + k_1, \ldots, \pi_{K+1} + x_{K+1} + j_{K+1} + k_{K+1})$, it equals 1, and the complete product moment simplifies to

$$E\left(p_1^{k_1} \cdots p_{K+1}^{k_{K+1}}\right) = \frac{\sum_{\phi} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i + k_i)}{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i + k_i)\right)}}{\sum_{\phi^*} \frac{(\pi_1 + \cdots + \pi_{K+1})_{j_1 + \cdots + j_{K+1}}}{(\pi_1)_{j_1} \cdots (\pi_{K+1})_{j_{K+1}}\, j_1! \cdots j_{K+1}!} \left(\frac{\lambda_1}{2}\right)^{j_1} \cdots \left(\frac{\lambda_{K+1}}{2}\right)^{j_{K+1}} \frac{\prod_{i=1}^{K+1} \Gamma(\pi_i + x_i + j_i)}{\Gamma\left(\sum_{i=1}^{K+1} (\pi_i + x_i + j_i)\right)}}. \qquad \square$$

References

1. Archer, E.; Park, I.M.; Pillow, J. Bayesian entropy estimation for countable discrete distributions. J. Mach. Learn. Res. 2014, 15, 2833–2868.
2. Ilić, V.; Korbel, J.; Gupta, S.; Scarfone, A.M. An overview of generalized entropic forms. arXiv 2021, arXiv:2102.10071.
3. Rashad, M.; Iqbal, Z.; Hanif, M. Characterizations and entropy measures of the Libby-Novick generalized beta distribution. Adv. Appl. Stat. 2020, 63, 235–259.
4. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Maximum likelihood estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2017, 63, 6774–6798.
5. Contreras Rodríguez, L.; Madarro-Capó, E.J.; Legón-Pérez, C.M.; Rojas, O.; Sosa-Gómez, G. Selecting an Effective Entropy Estimator for Short Sequences of Bits and Bytes with Maximum Entropy. Entropy 2021, 23, 561.
6. Wolpert, D.H.; Wolf, D. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top. 1995, 52, 6841–6854.
7. Han, Y.; Jiao, J.; Weissman, T. Does Dirichlet prior smoothing solve the Shannon entropy estimation problem? In Proceedings of the IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 1367–1371.
8. Little, D.J.; Toomey, J.P.; Kane, D.M. Efficient Bayesian estimation of permutation entropy with Dirichlet priors. arXiv 2021, arXiv:2104.08991.
9. Zamzami, N.; Bouguila, N. Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models. Appl. Intell. 2019, 49, 3783–3800.
10. Holste, D.; Grosse, I.; Herzel, H. Bayes' estimators of generalized entropies. J. Phys. A Math. Gen. 1998, 31, 2551.
11. Bodvin, L.J.S.; Bekker, A.; Roux, J.J. Shannon entropy as a measure of certainty in a Bayesian calibration framework with bivariate beta priors: Theory and methods. S. Afr. Stat. J. 2011, 45, 171–204.
12. Sánchez, L.E.; Nagar, D.; Gupta, A. Properties of noncentral Dirichlet distributions. Comput. Math. Appl. 2006, 52, 1671–1682.
13. Kang, M.S.; Kim, K.T. Automatic SAR Image Registration via Tsallis Entropy and Iterative Search Process. IEEE Sens. J. 2020, 20, 7711–7720.
14. Mathai, A.M.; Haubold, H.J. On generalized entropy measures and pathways. Phys. A Stat. Mech. Appl. 2007, 385, 493–500.
15. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products; Academic Press: Cambridge, MA, USA, 2014.
16. Ferreira, J.T.; Bekker, A.; Arashi, M. Bivariate noncentral distributions: An approach via the compounding method. S. Afr. Stat. J. 2016, 50, 103–122.
17. Bekker, A.; Ferreira, J.T. Bivariate gamma type distributions for modeling wireless performance metrics. Stat. Optim. Inf. Comput. 2018, 6, 335–353.
18. Ongaro, A.; Orsi, C. Some results on non-central beta distributions. Statistica 2015, 75, 85–100.
Figure 1. Dirichlet generator priors (8) for θ = 0.1 with three different sets of Π (described above Figure 1), n = 50 .
Figure 2. Dirichlet generator priors (8) for θ = 0.9 with three different sets of Π , n = 50 .
Figure 3. Dirichlet generator priors (8) for θ = 0.1 with a single set Π , n = 50 .
Figure 4. Correlation for different p F q candidates. Set 1—blue ( π 1 = 0.5 ; π 2 = 0.5 ; π 3 = 5 ); Set 2—orange ( π 1 = 2 ; π 2 = 2 ; π 3 = 0.1 ); Set 3—purple ( π 1 = 5 ; π 2 = 5 ; π 3 = 10 ) with a = 100 and b = 2 .
Figure 5. Noncentral Dirichlet Priors (12) for different λ 1 s with λ 2 = 0.8 and λ 3 = 0.1 , n = 50 .
Figure 6. Dirichlet generator entropy (16)—Varying θ : Set A—blue ( π 1 = 2 ; π 2 = 2 ; π 3 = 2 ) Set B—orange ( π 1 = 1 ; π 2 = 2 ; π 3 = 2 ) Set C—purple ( π 1 = 10 ; π 2 = 2 ; π 3 = 2 ).
Figure 7. Noncentral Dirichlet Entropy—Varying λ 1 : Set A—blue ( π 1 = 2 ; π 2 = 2 ; π 3 = 2 ) Set B—orange ( π 1 = 1 ; π 2 = 2 ; π 3 = 2 ) Set C—purple ( π 1 = 10 ; π 2 = 2 ; π 3 = 2 ).
Figure 8. Dirichlet generator 0 F 0 —empirical vs calculated Tsallis entropy: for θ = 0.5 and Set A—blue ( π 1 = 2 ; π 2 = 2 ; π 3 = 2 ) Set B—orange ( π 1 = 1 ; π 2 = 2 ; π 3 = 2 ) Set C—purple ( π 1 = 10 ; π 2 = 2 ; π 3 = 2 ).
Figure 9. Noncentral Dirichlet—empirical vs calculated Tsallis entropy: for λ 1 = 0.1 ; λ 2 = 0.8 and λ 3 = 0.1 with Set A—blue ( π 1 = 2 ; π 2 = 2 ; π 3 = 2 ) Set B—orange ( π 1 = 1 ; π 2 = 2 ; π 3 = 2 ) Set C—purple ( π 1 = 10 ; π 2 = 2 ; π 3 = 2 ).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
