Article

The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning

Martin de Tours School of Management and Economics, Assumption University, MSME Building, 4th Floor, 88 Moo 8 Bang Na-Trad Km. 26 Bangsaothong, 10540 Samuthprakarn, Thailand
Entropy 2014, 16(12), 6338-6381; https://doi.org/10.3390/e16126338
Submission received: 19 October 2014 / Revised: 24 November 2014 / Accepted: 25 November 2014 / Published: 1 December 2014
(This article belongs to the Special Issue Maximum Entropy Applied to Inductive Logic and Reasoning)

Abstract

The aim of this paper is to develop a comprehensive study of the geometry involved in combining Bregman divergences with pooling operators over closed convex sets in a discrete probabilistic space. A particular connection we develop leads to an iterative procedure, which is similar to the alternating projection procedure of Csiszár and Tusnády. Although such iterative procedures are well studied over much more general spaces than the one we consider, only a few authors have investigated combining projections with pooling operators. We aspire to achieve here a comprehensive study of such a combination. Moreover, since pooling operators combine the opinions of several rational experts, they allow us to discuss possible applications in multi-expert reasoning.


1. Background

1.1. Introduction

Information geometry has been studied as a powerful tool for tackling various problems. It has been applied in neuroscience [1], expert systems [2], logistic regression [3], clustering [4] and probabilistic merging [5]. In this paper, we aim to present a comprehensive study of information geometry over a discrete probabilistic space in order to provide some specialized tools for researchers working in the area of multi-expert reasoning.
In the context of this paper, the domain of information geometry is the Euclidean space ℝJ, for some fixed natural number J ≥ 2, where we measure a divergence from one point to another. A divergence is a notion of distance that is, in general, asymmetric, and we will represent it here by an arrow. A divergence can represent a cost function under various constraints, so many engineering problems correspond to the minimization of a divergence.
For example, in the areas of neuroscience and expert systems, given evidence v and a training set of known instances W, we may search for an instance w ∈ W which is “closest” to the evidence v, so as to represent it in the given training set W. An illustration is depicted in Figure 1.
A similar pattern of minimization appears also in the areas of clustering and regression. The aim of the former is to categorize several points into a given number of nodes in such a way that the sum of divergences from each point to its associated node is minimal. The aim of regression is to predict an unknown distribution of events based on the previously obtained statistical data by defining a function whose values minimize a sum of divergences to the data.
While several domains for divergences are considered in the literature, in the current presentation of information geometry we will confine ourselves to the domain of positive discrete probability functions ⅅJ, where ⅅJ is the set of all w ∈ ℝJ restricted by $\sum_{j=1}^J w_j = 1$ and w1 > 0, …, wJ > 0. In our presentation, J ≥ 2 will always be fixed, but otherwise arbitrary.
Although in information geometry it does not make sense to talk about beliefs, applications in multi-expert reasoning are often developed from that perspective. It is then argued that rational beliefs should obey the laws of probability; the Dutch book argument of Ramsey and de Finetti [6] is perhaps the most compelling argument to this effect. It is therefore of particular interest to develop information geometry over a probabilistic space if we wish to eventually apply it to multi-expert reasoning.
In addition to our restriction to discrete probability functions, we will confine ourselves to a special type of divergence, called a Bregman divergence [7], which has recently attracted attention in machine learning and plays a major role in optimization; cf. [3]. A Bregman divergence over a discrete probabilistic space is defined by a given strictly convex function f: (0, 1)J → ℝ, which is differentiable over ⅅJ. For any v, w ∈ ⅅJ, the Bregman divergence generated by the function f is given by:
$D_f(w\|v) = f(w) - f(v) - (w - v)\cdot\nabla f(v),$
where $\nabla f(v)$ is the gradient of f at v and · denotes the inner (dot) product of two vectors, i.e.,
$(w - v)\cdot\nabla f(v) = \sum_{j=1}^J (w_j - v_j)\frac{\partial f(v)}{\partial v_j}.$
We say that Df(w‖v) is the Bregman divergence from v ∈ ⅅJ to w ∈ ⅅJ. Figure 2 depicts a geometrical interpretation of a Bregman divergence.
By the first convexity condition applied to the (convex and differentiable) function f (see, e.g., [8]), Df(w‖v) ≥ 0, with equality holding only if w = v. This is the condition that makes Df(·‖·) a divergence as defined in information geometry. Note that, since a differentiable convex function is necessarily continuously differentiable (see [9]), Df(w‖v) is a continuous function. However, note that this is not sufficient to establish the differentiability of Df.
It is worth mentioning that the restriction w1 > 0,…, wJ > 0 for a probability function w that we have adopted here is important for the definition of a Bregman divergence. Some Bregman divergences do not have their generating function f differentiable over the whole space of probability functions. However, it is possible to define the notion of a Bregman divergence even if this condition is left out, but at the cost of some restrictions on f. We kindly refer the interested reader to [10] for further details. Nonetheless, the setting developed in [10] uses a rather complicated notation, which could prove to be impenetrable at first glance if it were adopted in the current paper.
In this paper, we study mainly Bregman divergences Df(·‖·), which are convex, i.e., for all λ ∈ [0, 1] and all w(1), w(2), v(1), v(2) ∈ ⅅJ:
$\lambda D_f(w^{(1)}\|v^{(1)}) + (1-\lambda)D_f(w^{(2)}\|v^{(2)}) \geq D_f\big(\lambda w^{(1)} + (1-\lambda)w^{(2)} \,\big\|\, \lambda v^{(1)} + (1-\lambda)v^{(2)}\big).$
Note that if D(·‖·) is a convex function, then D(·‖·) is a convex function also in each argument separately.
The following are examples of a convex Bregman divergence.
Example 1 (Squared Euclidean Distance). For any J ≥ 2, let $f(x) = \sum_{j=1}^J (x_j)^2$. Then, the divergence:
$D_f(w\|v) = \sum_{j=1}^J (w_j - v_j)^2$
will be denoted by E2, and exceptionally, this divergence is symmetric.
Example 2 (Kullback–Leibler Divergence). For any J ≥ 2, let $f(x) = \sum_{j=1}^J x_j\log x_j$, where log denotes the natural logarithm. (Note that in the information theory literature, this logarithm is often taken with base two. However, this does not affect the results of this paper in any way.) The well-known divergence:
$D_f(w\|v) = \sum_{j=1}^J w_j\log\frac{w_j}{v_j}$
will be denoted by KL.
The convexity of the KL-divergence is easy to observe and is well known; see, e.g., [10].
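To make the definition concrete, the following is a minimal numerical sketch (not part of the original paper; it assumes only numpy) that evaluates the Bregman divergences of Examples 1 and 2 directly from their generators, with the gradients supplied by hand.

```python
# A minimal sketch (assumes numpy): D_f(w || v) evaluated directly from a
# generator f and its gradient, instantiated for Examples 1 and 2.
import numpy as np

def bregman(f, grad_f, w, v):
    """D_f(w || v) = f(w) - f(v) - (w - v) . grad f(v)."""
    w, v = np.asarray(w, float), np.asarray(v, float)
    return f(w) - f(v) - np.dot(w - v, grad_f(v))

# Example 1: f(x) = sum_j x_j^2 generates the squared Euclidean distance E^2.
f_sq = lambda x: float(np.sum(x ** 2))
grad_sq = lambda x: 2 * x

# Example 2: f(x) = sum_j x_j log x_j generates the KL-divergence.
f_ent = lambda x: float(np.sum(x * np.log(x)))
grad_ent = lambda x: np.log(x) + 1

w = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
print(bregman(f_sq, grad_sq, w, v))   # equals sum_j (w_j - v_j)^2
print(bregman(f_ent, grad_ent, w, v)) # equals sum_j w_j log(w_j / v_j)
```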

1.2. Projections

For given v ∈ ⅅJ, a Bregman divergence Df(w‖v) is a strictly convex function in the first argument. This can be easily seen by considering $D_f(w\|v) = f(w) - f(v) - \sum_{j=1}^J (w_j - v_j)\frac{\partial f(v)}{\partial v_j}$, where v is constant. f(v) is therefore constant as well, and the claim follows, since the strict convexity of f is not affected by subtracting the term $\sum_{j=1}^J (w_j - v_j)\frac{\partial f(v)}{\partial v_j}$, which is linear in w.
Owing to the observation above, if v ∈ ⅅJ is given and W ⊆ ⅅJ is a closed convex nonempty set, we can define the Df-projection of v into W. It is that unique point w ∈ W that minimizes Df(w‖v) subject only to w ∈ W. This property is crucial for the applicability of Bregman divergences. Note, however, that Df(·‖·) is not necessarily convex in its second argument; for a counterexample, consider the case $f(x) = \sum_{j=1}^4 (x_j)^3$.
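Since a Df-projection is the solution of a smooth convex program, it can be approximated with any constrained optimizer. The sketch below (an illustration under the assumption that scipy is available, not a method prescribed by the paper) computes a KL-projection into a set W cut out by linear constraints.

```python
# A hedged sketch: approximating the D_f-projection of v into a closed convex
# W numerically, here for D_f = KL and W given by scipy-style constraints.
import numpy as np
from scipy.optimize import minimize

def kl(w, v):
    return float(np.sum(w * np.log(w / v)))

def kl_projection(v, extra_constraints, J):
    """KL-projection of v into W = {w in the simplex : constraints hold}."""
    cons = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}] + extra_constraints
    res = minimize(kl, np.full(J, 1.0 / J), args=(v,),
                   bounds=[(1e-9, 1.0)] * J,  # keep w strictly positive
                   constraints=cons, method='SLSQP')
    return res.x

v = np.array([0.2, 0.3, 0.5])
W = [{'type': 'ineq', 'fun': lambda w: w[0] - 0.5}]  # W = {w : w_1 >= 1/2}
print(kl_projection(v, W, J=3))
```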
Perhaps the most useful property that a Df-projection has is the extended Pythagorean property:
Theorem 1 (Extended Pythagorean Property). Let Df be a Bregman divergence. Let w be the Df-projection of v ∈ ⅅJ into a closed convex nonempty set W ⊆ ⅅJ. Let aW. Then:
$D_f(a\|w) + D_f(w\|v) \leq D_f(a\|v).$
This property, in the case of the Kullback–Leibler divergence, was proven first by Csiszár in [11]. The proof of the generalized theorem above is given in [1,12], where the interested reader can find a comprehensive study of Bregman divergences within the context of differential geometry. We illustrate the theorem in Figure 3.
Notice that the squared Euclidean distance has a special role among all other Bregman divergences. It is symmetric, and it interprets the extended Pythagorean property “classically” as the relation of the sizes of the squares constructed on the sides of a triangle.
It is well-known that the Kullback–Leibler divergence is closely connected to the Shannon entropy defined for any w ∈ ⅅJ by:
$H(w) = -\sum_{j=1}^J w_j\log w_j,$
where log denotes the natural logarithm. The importance of the Shannon entropy is that it can be described as a measure of the level of disorder, which, in the context of information theory, can be interpreted as a measure of informational content. The higher the entropy of w is, the less information is carried by w. In some contexts, one can then argue that, given several seemingly equally probable choices of a probability function, one should choose the one that carries the least additional information [13]. Given a closed convex nonempty set W, the most entropic point in W will be denoted by ME(W).
Now, trying to find the most entropic point in a closed convex nonempty set W ⊆ ⅅJ is, in fact, equivalent to finding a special KL-projection (the KL-projection of the uniform probability function $(\frac{1}{J},\ldots,\frac{1}{J})\in\mathbb{D}^J$) since:
$\arg\min_{w\in W}\sum_{j=1}^J w_j\log\frac{w_j}{1/J} = \arg\max_{w\in W}\left(-\sum_{j=1}^J w_j\log w_j\right) = \mathrm{ME}(W),$
where $\arg\min_{x\in X} f(x)$ denotes that unique argument x ∈ X where f has its global minimum, whenever such a unique point exists. The expression arg max is defined accordingly.
Given the extensive justification of the Shannon entropy in various frameworks (see, e.g., [14,15]), it is perhaps not surprising that a common method of projecting in probabilistic expert systems is by means of the KL-projection; see [2,16]. In connection to the Shannon entropy, the KL-divergence is often referred to as the cross-entropy, and the projecting is called updating.
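The equivalence above is easy to confirm numerically; the sketch below (an illustration with scipy, using the toy set W = {w : w1 ≥ 1/2}, which is our assumption and not from the paper) computes the most entropic point both by maximizing H and by KL-projecting the uniform function.

```python
# A small numerical check that ME(W) coincides with the KL-projection of the
# uniform probability function into W (a sketch; the set W is illustrative).
import numpy as np
from scipy.optimize import minimize

J = 3
uniform = np.full(J, 1.0 / J)
cons = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},
        {'type': 'ineq', 'fun': lambda w: w[0] - 0.5}]  # W = {w : w_1 >= 1/2}
bounds = [(1e-9, 1.0)] * J

neg_entropy = lambda w: float(np.sum(w * np.log(w)))          # -H(w)
kl_to_unif = lambda w: float(np.sum(w * np.log(w / uniform))) # KL(w || uniform)

me = minimize(neg_entropy, uniform, bounds=bounds,
              constraints=cons, method='SLSQP').x
proj = minimize(kl_to_unif, uniform, bounds=bounds,
                constraints=cons, method='SLSQP').x
print(me, proj)  # both approximately (0.5, 0.25, 0.25)
```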
The above may perhaps be also an appealing reason to use projections in general to “represent” a given closed convex set of probability functions by a single point, in particular in expert reasoning. Moreover, recent use of projections by a Bregman divergence has become popular in other contexts; see, e.g., [4]. Remarkably, projections by a Bregman divergence also provide a unifying framework for a variety of techniques used in expert systems, such as logistic regression; see [3]. It is therefore of particular interest to investigate the geometry of Bregman divergences.

1.3. Pooling

In this subsection, we introduce probabilistic pooling, which is a method of aggregating several probability functions. Formally, a pooling operator Pool is defined for each n ≥ 1 as a mapping:
$\mathrm{Pool}: \underbrace{\mathbb{D}^J\times\cdots\times\mathbb{D}^J}_{n} \to \mathbb{D}^J.$
Recall that J is a fixed natural number greater than or equal to two, which is otherwise arbitrary.
One possibility for choosing a pooling operator is to define one by means of a Bregman divergence. In particular, given a Bregman divergence Df, $w^{(1)},\ldots,w^{(n)}\in\mathbb{D}^J$ and a ∈ ⅅn, we can ask which point v ∈ ⅅJ has the least sum of Bregman divergences Df from w(1),…, w(n), weighted by a1,…, an, respectively. It turns out that the resulting probability function is unique, and in each coordinate, it is simply the weighted arithmetic mean of the corresponding coordinates of w(1),…, w(n) ∈ ⅅJ. In other words:
$\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) = \left(\sum_{i=1}^n a_i w_1^{(i)},\ldots,\sum_{i=1}^n a_i w_J^{(i)}\right). \qquad (1)$
For a given family $\mathcal{A} = \{a^n: a^n\in\mathbb{D}^n, n = 1,2,\ldots\}$ of weighting vectors, we define the pooling operator $\mathrm{LinOp}_{\mathcal{A}}$ by Equation (1) for every $a\in\mathcal{A}$. Instead of the right-hand side of Equation (1), we will simply write $\mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})$ if $a\in\mathcal{A}$. A special choice for $\mathcal{A}$ is the family $\mathcal{N} = \{a^n = (\frac{1}{n},\ldots,\frac{1}{n}): n = 1,2,\ldots\}$, and the pooling operator $\mathrm{LinOp}_{\mathcal{N}}$ is well known in the literature as the LinOp-pooling operator.
The fact that Equation (1) actually holds can be observed by employing the following theorem, which is folklore in information theory.
Theorem 2 (Parallelogram Theorem). Let Df be a Bregman divergence, w(1),…, w(n), v ∈ ⅅJ and a ∈ ⅅn. Then:
$\sum_{i=1}^n a_i D_f(w^{(i)}\|v) = \sum_{i=1}^n a_i D_f\big(w^{(i)}\,\big\|\,\mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})\big) + D_f\big(\mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})\,\big\|\,v\big).$
Proof. Let w = LinOpa(w(1),…, w(n)). The equality is easy to observe by:
$\sum_{i=1}^n a_i\left[f(w^{(i)}) - f(v) - \sum_{j=1}^J (w_j^{(i)} - v_j)\frac{\partial f(v)}{\partial v_j}\right] = \sum_{i=1}^n a_i\left[f(w^{(i)}) - f(w) - (w^{(i)} - w)\cdot\nabla f(w)\right] + \left[f(w) - f(v) - \sum_{j=1}^J (w_j - v_j)\frac{\partial f(v)}{\partial v_j}\right],$
since $\sum_{i=1}^n a_i (w^{(i)} - w)\cdot\nabla f(w) = 0$. □
Since Df(w‖v) = 0 only if w = v, and it is positive otherwise, the unique minimum of the left-hand side of Equation (1) is attained at the point v = LinOpa(w(1),…, w(n)).
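The identity of Theorem 2 is also easy to check numerically; here is a quick sketch (random positive points, KL as the Bregman divergence, assuming numpy) that verifies it.

```python
# A quick numerical sanity check of the parallelogram theorem for KL
# (a sketch; any Bregman divergence would do).
import numpy as np

rng = np.random.default_rng(0)
kl = lambda w, v: float(np.sum(w * np.log(w / v)))

n, J = 3, 4
ws = rng.dirichlet(np.ones(J), size=n)  # w^(1), ..., w^(n), strictly positive
a = rng.dirichlet(np.ones(n))           # weighting vector a
v = rng.dirichlet(np.ones(J))
pooled = a @ ws                         # LinOp_a(w^(1), ..., w^(n))

lhs = sum(a[i] * kl(ws[i], v) for i in range(n))
rhs = sum(a[i] * kl(ws[i], pooled) for i in range(n)) + kl(pooled, v)
print(np.isclose(lhs, rhs))  # True
```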
The situation above can be naturally interpreted in terms of random variables. Assume that X is a random variable taking values in {w(1),…, w(n)} ⊆ ⅅJ with the probability distribution a ∈ ⅅn, and we are given the problem of finding a random variable Y, such that the expected value:
$E(D_f(X\|Y))$
is minimal. The unique answer to this question is then $Y = E(X) = \sum_{i=1}^n a_i w^{(i)}$. This underlines the reason why the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator is so popular in the decision theory literature, where several experts, each with his own probability function w(i) representing his beliefs, seek to find a single probability function to represent their joint beliefs. The $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator simply yields the expected value, as if the experts’ beliefs were statistically obtained.
It is certainly interesting that the result above holds for any Bregman divergence, but as is shown in [17], Theorem 4, it is even more remarkable that Bregman divergences are the only divergences with such a property. We note, however, that in order to establish this claim, a slightly more general setting was considered, and we have restricted the formulation of the original theorem to the only domain considered here, (0, 1)J:
Theorem 3 (Banerjee, Guo, Wang). Let F: (0, 1)J × (0, 1)J → ℝ be a divergence. Assume that $F(x\|y)$ and $\frac{\partial^2 F(x\|y)}{\partial x_i\partial x_j}$, $1\le i,j\le J$, are all continuous. Let (Ω, ℱ, P) be an arbitrary probability space, and let 𝒢 be a sub-σ-algebra of ℱ. For all random variables X taking values in (0, 1)J, if:
$\arg\min_{Y} E(F(X\|Y)) = E(X\mid\mathcal{G}),$
where the minimum is taken over all 𝒢-measurable random variables Y, then $F(x\|y) = D_f(x\|y)$ for some strictly convex and differentiable function f: (0, 1)J → ℝ.
While, in the statistical sense, the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator, where $\mathcal{A}$ is a family of weighting vectors, seems to be well placed, in the fields of multi-expert reasoning and probabilistic merging, the so-called $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator often appeals more. For every n ≥ 1 and every $a\in\mathcal{A}$, it is defined by:
$\mathrm{LogOp}_a(w^{(1)},\ldots,w^{(n)}) = \left(\frac{\prod_{i=1}^n (w_1^{(i)})^{a_i}}{\sum_{j=1}^J\prod_{i=1}^n (w_j^{(i)})^{a_i}},\ldots,\frac{\prod_{i=1}^n (w_J^{(i)})^{a_i}}{\sum_{j=1}^J\prod_{i=1}^n (w_j^{(i)})^{a_i}}\right).$
If w(1),…, w(n) are considered to be the beliefs of n experts, respectively, then the $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator appears to favor agreement over the expected value. For instance, consider the following example from utility theory. Say that Eleanor and George are looking for a film to watch and they have three options, A, B and C. Eleanor hates Movie A and under no circumstances would agree to watch it, while George absolutely loves it. Now, consider that the situation with respect to Film C is swapped: George hates it, while Eleanor would prefer to see it. They both consider Movie B uninteresting, but are willing to see it. The following probability functions could represent the preferences of Eleanor and George towards Movies A, B and C: (0, 0.1, 0.9) and (0.9, 0.1, 0), respectively. Moreover, we value the opinions of both of them equally, i.e., $\mathcal{A} = \mathcal{N}$. Now, while the $\mathrm{LinOp}_{\mathcal{N}}$-pooling operator gives the inconclusive (0.45, 0.1, 0.45), by the $\mathrm{LogOp}_{\mathcal{N}}$-pooling operator (in the literature, this operator is simply known as the LogOp-pooling operator) we obtain (0, 1, 0). If we take this advice, then Eleanor and George should see the only film that is acceptable to both of them.
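The contrast is immediate in code. The sketch below (illustrative only; the zero coordinates are replaced by a small epsilon, since the paper's setting assumes strictly positive probability functions) reproduces both poolings of Eleanor's and George's preferences.

```python
# A sketch contrasting LinOp and LogOp pooling on the Eleanor/George example.
# Zeros are replaced by a tiny epsilon to stay within the positive simplex.
import numpy as np

def linop(ws, a):
    return a @ ws  # weighted arithmetic means, coordinate-wise

def logop(ws, a):
    g = np.prod(ws ** a[:, None], axis=0)  # weighted geometric means
    return g / g.sum()                     # normalized

eps = 1e-12
eleanor = np.array([eps, 0.1, 0.9 - eps])
george = np.array([0.9 - eps, 0.1, eps])
ws = np.stack([eleanor, george])
a = np.array([0.5, 0.5])

print(linop(ws, a))  # ~ (0.45, 0.1, 0.45): inconclusive
print(logop(ws, a))  # ~ (0, 1, 0): the only mutually acceptable film
```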
The example above illustrates why taking products rather than the arithmetic mean is popular when considering utilities. However, recently, the $\mathrm{LogOp}_{\mathcal{N}}$-pooling operator has attracted attention also in multi-expert probabilistic reasoning; a prominent example here is the social entropy process by Wilmers [18]. An intriguing idea that originates in the social entropy process is to swap the direction of the Kullback–Leibler projections and to establish the corresponding conjugated KL-projection of w ∈ ⅅJ into V ⊆ ⅅJ as $\arg\min_{v\in V}\mathrm{KL}(w\|v)$ (it is easy to check that KL(·‖·) is strictly convex in its second argument), together with the conjugated parallelogram theorem [10]:
Theorem 4. Let w(1),…, w(n), v ∈ ⅅJ and a ∈ ⅅn. Then:
$\sum_{i=1}^n a_i\mathrm{KL}(v\|w^{(i)}) = \sum_{i=1}^n a_i\mathrm{KL}\big(\mathrm{LogOp}_a(w^{(1)},\ldots,w^{(n)})\,\big\|\,w^{(i)}\big) + \mathrm{KL}\big(v\,\big\|\,\mathrm{LogOp}_a(w^{(1)},\ldots,w^{(n)})\big).$
Proof. Let w = LogOpa(w(1),…, w(n)). First note that:
$\sum_{i=1}^n a_i\sum_{j=1}^J v_j\log\frac{v_j}{w_j^{(i)}} = \sum_{j=1}^J v_j\log\frac{v_j}{\prod_{i=1}^n (w_j^{(i)})^{a_i}}.$
Now:
$\sum_{j=1}^J v_j\log\frac{v_j}{\prod_{i=1}^n (w_j^{(i)})^{a_i}} = \sum_{j=1}^J v_j\log\frac{v_j}{w_j} - \left(\sum_{j=1}^J v_j\right)\log\left(\sum_{j=1}^J\prod_{i=1}^n (w_j^{(i)})^{a_i}\right) = \sum_{j=1}^J v_j\log\frac{v_j}{w_j} + \sum_{i=1}^n a_i\sum_{j=1}^J w_j\log\frac{w_j}{w_j^{(i)}},$
where we have used the fact that $\sum_{j=1}^J v_j = 1$ and that, by the definition of w, $\prod_{i=1}^n (w_j^{(i)})^{a_i} = w_j\sum_{j=1}^J\prod_{i=1}^n (w_j^{(i)})^{a_i}$. □
As a consequence, for given w(1),…, w(n) ∈ ⅅJ, we get:
$\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i\mathrm{KL}(v\|w^{(i)}) = \mathrm{LogOp}_a(w^{(1)},\ldots,w^{(n)}).$
Therefore, the $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator can be naturally interpreted in information geometry. The question now arises as to whether this can be done with some other Bregman divergences. We will investigate this later.
The reader perhaps wonders what the main practical differences are in using different pooling operators. The $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator, for example, satisfies the marginalization property; that is, the values of the coordinates of the resulting probability function depend only on the corresponding coordinates of the probability functions that are pooled. The $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator does not have this property. On the other hand, the $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator, unlike the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator, is externally Bayesian; that is, the order in which we combine pooling and Bayesian updating is irrelevant. See [19] for more details.
We, however, do not seek any conclusive answer as to which pooling operator to use in any particular context. In this paper, we only aim to provide geometric tools that can be used in multi-expert reasoning. For elaborate work on pooling operators, we refer to the literature, e.g., [19] for a survey, [20] for a classical problem of the relationship between pooling and probabilistic independence, or [18] for a modern account of the $\mathrm{LinOp}_{\mathcal{N}}$ and $\mathrm{LogOp}_{\mathcal{N}}$-pooling operators in probabilistic knowledge merging.

2. Projections and Pooling Combined

2.1. Averaging Projective Procedures

While the geometry of projections and the theory of pooling operators have been extensively studied in the literature (see the previous section), much less attention has been devoted to their combination. A detailed study of this problem and a comprehensive analysis of the geometry involved is the main aim of this paper.
The central geometrical notion connecting projections and pooling in this paper is an averaging projective procedure F, which consists of a family of mappings $F_{[W_1,\ldots,W_n]}: \mathbb{D}^J\to\mathbb{D}^J$, where the sets W1,…, Wn ⊆ ⅅJ are closed, convex and nonempty. A particular F is given by a family of strictly convex functions $d^v$, v ∈ ⅅJ, and a pooling operator Pool, and is defined by the following two-stage process.
  • For an argument v ∈ ⅅJ, put $w^{(i)} = \arg\min_{w\in W_i} d^v(w)$, 1 ≤ i ≤ n.
  • Set $F_{[W_1,\ldots,W_n]}(v) = \mathrm{Pool}(w^{(1)},\ldots,w^{(n)})$.
For instance, the function $d^v(\cdot)$ can be Df(·‖v) for some Bregman divergence Df, and in such a particular case, $F_{[W_1,\ldots,W_n]}(v)$ first Df-projects the argument v into each of W1,…, Wn, and then it “averages” the resulting probability functions by a pooling operator Pool. Hence the name: an averaging projective procedure. An illustration of F is depicted in Figure 4.
Note that W1,…, Wn play dual roles in the definition above, which may perhaps appear clumsy. When they are fixed, $F_{[W_1,\ldots,W_n]}$ is a mapping ⅅJ → ⅅJ. However, the option to consider them also as variables will be the key to our following investigation and to the applicability of an averaging projective procedure in multi-expert reasoning, where W1,…, Wn will represent the respective knowledge of n experts. A straightforward interpretation is that the first stage simplifies sets to single probability functions, which are then merged into a final social belief function of the college of experts.
With regard to previous research, the cases of $d^v(\cdot)$ being KL(·‖v) and KL(v‖·), with Pool taken to be the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator and the $\mathrm{LogOp}_{\mathcal{A}}$-pooling operator, respectively, were introduced and investigated by Matúš in [21]. The idea of combining the projections by means of the squared Euclidean distance E2 with the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator was first introduced by Predd et al. in [22].
Example 3. In the definition of an averaging projective procedure, take $d^v$ to be KL(·‖v) and Pool to be the $\mathrm{LinOp}_{\mathcal{N}}$-pooling operator. Now, F is the mapping ⅅJ → ⅅJ, for every n ≥ 1 and all closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, given by $F_{[W_1,\ldots,W_n]}(v)$ above.
In particular, take J = 3, n = 2, $W_1 = \{(x, \frac{1}{2} - x, \frac{1}{2}): \frac{1}{10}\le x\le\frac{2}{5}\}$, $W_2 = \{(x, \frac{1}{4}, \frac{3}{4} - x): \frac{1}{10}\le x\le\frac{13}{20}\}$ and $v = (\frac{1}{3}, \frac{1}{6}, \frac{1}{2})$. Then, the KL-projection of v into W1 is actually v itself, since v ∈ W1, and the KL-projection of v into W2 is $(\frac{3}{10}, \frac{1}{4}, \frac{9}{20})$. Therefore:
$F_{[W_1,W_2]}(v) = \mathrm{LinOp}_{(\frac12,\frac12)}\left(\left(\tfrac{1}{3},\tfrac{1}{6},\tfrac{1}{2}\right),\left(\tfrac{3}{10},\tfrac{1}{4},\tfrac{9}{20}\right)\right) = \left(\tfrac{\frac{1}{3}+\frac{3}{10}}{2},\tfrac{\frac{1}{6}+\frac{1}{4}}{2},\tfrac{\frac{1}{2}+\frac{9}{20}}{2}\right) = \left(\tfrac{19}{60},\tfrac{5}{24},\tfrac{19}{40}\right).$
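As a check, a short sketch (assuming scipy; a one-dimensional search over the parametrization of W2) recovers the numbers of Example 3.

```python
# A numerical check of Example 3: the KL-projection of v into W2 and the
# subsequent LinOp averaging (a sketch, not part of the original paper).
import numpy as np
from scipy.optimize import minimize_scalar

v = np.array([1/3, 1/6, 1/2])
kl = lambda w, v: float(np.sum(w * np.log(w / v)))

# Parametrize W2 = {(x, 1/4, 3/4 - x)} and minimize KL((x, 1/4, 3/4-x) || v).
obj = lambda x: kl(np.array([x, 0.25, 0.75 - x]), v)
x_opt = minimize_scalar(obj, bounds=(0.1, 0.65), method='bounded').x
proj = np.array([x_opt, 0.25, 0.75 - x_opt])

print(proj)            # ~ (3/10, 1/4, 9/20); the projection into W1 is v itself
print((v + proj) / 2)  # F_[W1,W2](v) = (19/60, 5/24, 19/40)
```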

2.2. Obdurate Operators

In this section, we approach averaging projective procedures using the framework of probabilistic knowledge merging as defined in [5]. A probabilistic merging operator:
$\Delta: \underbrace{\mathcal{P}(\mathbb{D}^J)\times\cdots\times\mathcal{P}(\mathbb{D}^J)}_{n} \to \mathcal{P}(\mathbb{D}^J),$
is a mapping that maps a finite collection of closed convex nonempty subsets of ⅅJ, say W1,…, Wn, to a single closed convex nonempty subset of ⅅJ. In the area of multi-expert reasoning, we can perhaps interpret ∆(W1,…, Wn) as a representation of W1,…, Wn, which themselves individually represent knowledge bases of n experts.
A merging operator O is obdurate if, for every n ≥ 1 and any W1,…, Wn ⊆ ⅅJ, we have that $O(W_1,\ldots,W_n) = \{F_{[W_1,\ldots,W_n]}(v)\}$, where v is some fixed argument and F is an averaging projective procedure. Note that this operator always produces a singleton. Obdurate operators thus first represent sets as single probability functions, and then they pool them by a pooling operator.
Although this may sound like a fairly restrictive setting, many existing natural probabilistic merging operators are of this form. The prominent example is the merging operator of Kern-Isberner and Rödder (KIRP) [23]. In this particular case, v is the uniform probability function, dv(·) is KL(·‖v) and Pool is given by:
$\mathrm{Pool}(w^{(1)},\ldots,w^{(n)}) = \left(\sum_{k=1}^n\frac{H(w^{(k)})}{\sum_{i=1}^n H(w^{(i)})}\, w_1^{(k)},\ldots,\sum_{k=1}^n\frac{H(w^{(k)})}{\sum_{i=1}^n H(w^{(i)})}\, w_J^{(k)}\right).$
Recall that H(w(k)) is the Shannon entropy of w(k); here, w(k) is, in fact, the most entropic point in Wk.
In [23], Kern-Isberner and Rödder argue that W1,…, Wn ⊆ ⅅJ can be considered as marginal constraints in a subset U ⊆ ⅅJ+n, such that every probability function v ∈ U marginalizes to a ⅅJ-probability function belonging to one and only one set Wi. The point which KIRP produces is then, in fact, the ⅅJ-marginal of the most entropic point in U; following the justification of the Shannon entropy, they conclude that such a point is a natural representation of W1,…, Wn by a single probability function. KIRP thus maps the uniform probability function to the ⅅJ-marginal of the most entropic point in U. To date, KIRP has received much attention in the area of probabilistic knowledge merging.
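Read this way, the pooling stage of KIRP is a short computation: an entropy-weighted arithmetic mean of the most entropic points. The sketch below is a hypothetical illustration of that stage alone (the ME points used are our assumptions; computing them is a separate KL-projection as discussed above).

```python
# A sketch of the KIRP pooling stage: the entropy-weighted LinOp of the most
# entropic points w^(1), ..., w^(n). The ME points below are hypothetical.
import numpy as np

def entropy(w):
    return float(-np.sum(w * np.log(w)))

def kirp_pool(me_points):
    ws = np.stack(me_points)
    h = np.array([entropy(w) for w in ws])  # H(w^(k)) for each expert
    return (h / h.sum()) @ ws               # entropy-weighted arithmetic mean

w1 = np.array([0.5, 0.25, 0.25])  # hypothetical ME(W_1)
w2 = np.array([0.2, 0.4, 0.4])    # hypothetical ME(W_2)
print(kirp_pool([w1, w2]))
```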
However, any obdurate merging operator seems to be challenged by its violation of the following principle.
(CP) Consistency Principle. Let ∆ be a probabilistic merging operator. Then, we say that ∆ satisfies the consistency principle if, for every n ≥ 1 and all W1,…, Wn ⊆ ⅅJ:
$\bigcap_{i=1}^n W_i \neq \emptyset \quad\text{implies}\quad \Delta(W_1,\ldots,W_n) \subseteq \bigcap_{i=1}^n W_i.$
(CP) can be interpreted as saying that if the knowledge bases of a set of experts are collectively consistent, then the merged knowledge base should not consist of anything else than what the experts agree on.
This principle often falls under the following philosophical criticism. One might imagine a situation where several experts consider a large set of probability functions as admissible, while one believes in a single probability function. Although this one is consistent with the beliefs of the rest of the group, one might argue that it is not justified to merge the knowledge of the whole group into that single probability function.
More rigorously, Williamson [24] introduces a particular interpretation of the epistemological status of an expert’s knowledge base, which he calls “granting”. He rejects (CP), as several experts may grant the same piece of knowledge for inconsistent reasons.
On the other hand, Adamčík and Wilmers in [5] assume that the way in which the knowledge was obtained is considered irrelevant, and each expert has incorporated all of his relevant knowledge into what he is declaring, contrary to Williamson’s granting. This is sometimes referred to as the principle of total evidence [25] or the Watts assumption [26]. They argue that, although overall knowledge of any human expert can never be fully formalized, as a formalization is always an abstraction from reality, the principle of total evidence needs to be imposed in order to avoid confusion in any discussion related to methods of representing the collective knowledge of experts. Otherwise, there would be an inexhaustible supply of invalid arguments produced by a philosophical opponent challenging one’s reasoning using implicit background information, which is not included in the formal representation of a knowledge base.
However, in this paper, we do not wish to probe further into this philosophical argument, and instead, we present the following rather surprising theorem, which appeared for the first time in [10].
Theorem 5. There is no obdurate merging operator O that satisfies the consistency principle (CP).
Proof. Suppose that J ≥ 3. Let d be the function to be minimized in the definition of O, where, for simplicity, we suppress the constant superscript. Let v ∈ ⅅJ be the unique minimizer of d over some sufficiently large closed convex subset W of ⅅJ. Let w, u ∈ W be such that d(v) < d(w) < d(u) and w = λv + (1 − λ)u for some 0 < λ < 1 (in particular, w is a linear combination of v and u).
Let sW be such that d(v) < d(s) < d(w) and s is not a linear combination of v and u. Then, there is s′, such that s′ = λs + (1 − λ)w for some 0 < λ ≤ 1, and d is strictly increasing along the line from s′ to w. This is because d is strictly convex and d(s) < d(w). Note that if J = 2, then s would be always a linear combination of v and u. Moreover, for sufficiently large W ⊆ ⅅ3, we can always choose w, u, s and s′ in W as above.
Now, we show that d is also strictly increasing along the line from s′ to u. Assume this is not the case. Then, by the same argument as before, there is s″, such that d(s″) < d(s′). Due to the construction, the line from v to s″ intersects the line from s′ to w; let us denote the point of intersection as r. Since d is strictly increasing along the line from s′ to w, we have that d(r) > d(s′) > d(s″) > d(v). This, however, contradicts the convexity of d. The situation is depicted in Figure 5.
Now, assume that W1 = {λv + (1 − λ)w : λ ∈ [0, 1]}, W2 = {λs′ + (1 − λ)w : λ ∈ [0, 1]}, V1 = {λv + (1 − λ)u : λ ∈ [0, 1]} and V2 = {λs′ + (1 − λ)u : λ ∈ [0, 1]}. Since v minimizes d, and the function d is strictly increasing along the lines from s′ to w and from s′ to u, we have that:
$O(W_1, W_2) = \{\mathrm{Pool}(v, s')\} = O(V_1, V_2), \qquad (2)$
where Pool is the pooling operator used in the second stage of O. Suppose that O satisfies (CP). Then, O(W1, W2) = {w} and O(V1, V2) = {u}, which contradicts Equation (2). □
In some philosophical contexts, the theorem above can be used as an argument against the consistency principle, while from another perspective, it casts a shadow on the notion of an obdurate merging operator. This unfortunately includes the natural merging operator OSEP, the obdurate social entropy process, defined as follows. For every n ≥ 1 and all W1,…, Wn ⊆ ⅅJ:
$\mathrm{OSEP}(W_1,\ldots,W_n) = \{\mathrm{LogOp}_{\mathcal{N}}(\mathrm{ME}(W_1),\ldots,\mathrm{ME}(W_n))\}.$
Recall that ME(Wi) denotes the most entropic point in Wi, or equivalently, the KL-projection of the uniform probability function into Wi, and $\mathcal{N}$ is the family of weighting vectors $(\frac{1}{n},\ldots,\frac{1}{n})$, one for every n ≥ 1. It is easy to observe that OSEP is really an obdurate merging operator.
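Operationally, OSEP is a two-step computation once the ME points are available; the following sketch (with scipy and two illustrative constraint sets, which are our assumptions rather than examples from the paper) computes ME(Wi) and then pools by LogOp with equal weights.

```python
# A sketch of OSEP: ME(W_i) computed by constrained entropy maximization,
# pooled by the equal-weight LogOp. Constraint sets are illustrative only.
import numpy as np
from scipy.optimize import minimize

def me(extra_constraints, J=3):
    """Most entropic point subject to the given scipy-style constraints."""
    cons = [{'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0}] + extra_constraints
    return minimize(lambda w: float(np.sum(w * np.log(w))), np.full(J, 1.0 / J),
                    bounds=[(1e-9, 1.0)] * J, constraints=cons,
                    method='SLSQP').x

def logop(ws):
    g = np.prod(ws, axis=0) ** (1.0 / len(ws))  # equal-weight geometric mean
    return g / g.sum()

W1 = [{'type': 'ineq', 'fun': lambda w: w[0] - 0.5}]  # w_1 >= 1/2
W2 = [{'type': 'ineq', 'fun': lambda w: 0.3 - w[2]}]  # w_3 <= 3/10
print(logop(np.stack([me(W1), me(W2)])))  # OSEP(W_1, W_2)
```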
In [10], it is proven that OSEP is (thus far, the only known) probabilistic merging operator satisfying a particular version of the independence principle, a principle that is an attempt to resurrect the notion of the independence preservation of pooling operators [20] in the context of probabilistic merging operators.
One may say that the reason behind an obdurate merging operator not satisfying (CP) is its “forgetting” nature. In the first stage, it transforms sets W1,…, Wn into w(1),…, w(n) individually without taking into account other sets, thus “forgetting” any existing connections, such as the consistency. However, instead of changing the definition of an averaging projective procedure so as to make it not “forgetting”, we will take a different viewpoint on the procedure itself in the following subsection.

2.3. Fixed Points

Our second approach to an averaging projective procedure F consists of considering the set of the fixed points of F. That is, for given n ≥ 1 and given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, we are interested in whether there are any points v ∈ ⅅJ, such that:
$F_{[W_1,\ldots,W_n]}(v) = v.$
Following the convincing justification for combining Bregman projections with the $\mathrm{LinOp}_{\mathcal{A}}$-pooling operator (see Section 1.3), for every convex Bregman divergence Df and a family of weighting vectors $\mathcal{A}$, we consider here the averaging projective procedure $F^{D_f,\mathcal{A}}$ defined for every n ≥ 1 and all closed convex nonempty sets W1,…, Wn ⊆ ⅅJ by the following.
  • For an argument v ∈ ⅅJ, take w(i) to be the Df-projection of v into Wi for all 1 ≤ i ≤ n.
  • Set $F^{D_f,\mathcal{A}}_{[W_1,\ldots,W_n]}(v) = \mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})$, where $a\in\mathcal{A}$.
The restriction to convex Bregman divergences is needed for some later theorems and is adopted ad hoc. Therefore, unfortunately, we cannot provide any elaborate justification for it.
Given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, we will denote the set of all fixed points of $F^{D_f,\mathcal{A}}$ defined above by $\Theta_a^{D_f}(W_1,\ldots,W_n)$, where $a\in\mathcal{A}$.
On the other hand, the conjugated parallelogram theorem (Theorem 4), suggesting the combination of the conjugated KL-projection with the LogOp-pooling operator, leads us to the consideration of those convex Bregman divergences which are strictly convex also in the second argument. The squared Euclidean distance and the Kullback–Leibler divergence are instances of such divergences. A fairly general example is a Bregman divergence Df such that $f(v) = \sum_{j=1}^J g(v_j)$, where g is a strictly convex function (0, 1) → ℝ which is three times differentiable and satisfies $g''(v_j) - (w_j - v_j)g'''(v_j) > 0$ for all 1 ≤ j ≤ J and all w, v ∈ ⅅJ (this is easy to check by the Hessian matrix). Apart from the two divergences mentioned above, this condition is satisfied in particular if g(v) = v^r, 1 < r ≤ 2. Note that the Bregman divergence generated by such a function g is also convex in both arguments.
Assuming strict convexity in the second argument of Df, we can define the conjugated Df-projection of v ∈ ⅅJ into a closed convex nonempty set W ⊆ ⅅJ as that unique w ∈ W that minimizes Df(v‖w) subject only to w ∈ W. Moreover, since a sum of strictly convex functions is a strictly convex function, for any w(1),…, w(n) ∈ ⅅJ, there exists a unique minimizer of:
$\sum_{i=1}^n a_i D_f(v\|w^{(i)}),$
which we denote by $\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$. Thus, for a family of weighting vectors $\mathcal{A}$, we can define the $\mathrm{Pool}_{\mathcal{A}}^{D_f}$-pooling operator. Note that $\mathrm{Pool}_{\mathcal{A}}^{\mathrm{KL}} = \mathrm{LogOp}_{\mathcal{A}}$ and $\mathrm{Pool}_{\mathcal{A}}^{E^2} = \mathrm{LinOp}_{\mathcal{A}}$, and that we do not need strict convexity in the second argument in these cases.
Theorem 6 (Conjugated Parallelogram Theorem). Let Df be a Bregman divergence, w(1),…, w(n), v ∈ ⅅJ and a ∈ ⅅn. Then:
$\sum_{i=1}^n a_i D_f(v\|w^{(i)}) = \sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})\,\big\|\,w^{(i)}\big) + D_f\big(v\,\big\|\,\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})\big).$
Proof. Let $w = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$. We need to prove that:
$\sum_{i=1}^n a_i\left[f(v) - f(w^{(i)}) - \sum_{j=1}^J (v_j - w_j^{(i)})\frac{\partial f(w^{(i)})}{\partial w_j^{(i)}}\right] = \sum_{i=1}^n a_i\left[f(w) - f(w^{(i)}) - \sum_{j=1}^J (w_j - w_j^{(i)})\frac{\partial f(w^{(i)})}{\partial w_j^{(i)}}\right] + \left[f(v) - f(w) - \sum_{j=1}^J (v_j - w_j)\frac{\partial f(w)}{\partial w_j}\right],$
or equivalently:
$\sum_{j=1}^J (v_j - w_j)\left(\sum_{i=1}^n a_i\frac{\partial f(w^{(i)})}{\partial w_j^{(i)}} - \frac{\partial f(w)}{\partial w_j}\right) = 0. \qquad (3)$
Since $w = \arg\min_{w\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w\|w^{(i)})$, differentiation using the Lagrange multiplier method (since a differentiable convex function f is necessarily continuously differentiable (see [9]), the partial derivatives used above are all continuous and the Lagrange multiplier method is permissible) applied to the condition $\sum_{j=1}^J w_j = 1$ produces $\sum_{i=1}^n a_i\frac{\partial f(w^{(i)})}{\partial w_j^{(i)}} - \frac{\partial f(w)}{\partial w_j} = \lambda$, 1 ≤ j ≤ J, where λ is a constant independent of j. Therefore, the left-hand side of Equation (3) is equal to $\sum_{j=1}^J (v_j - w_j)\lambda = 0$, and the theorem follows. □
The idea of defining a spectrum of pooling operators where the pooling operators LinOp and LogOp are special cases was developed previously in a similar manner, but in a slightly different framework of alpha-divergences; cf. [27].
Here, following [1,12], we will point out a geometrical relationship between the pooling operators LinOp and $\mathrm{Pool}^{D_f}$, which will be helpful in illustrating some results of this paper.
Recall that the generator of a Bregman divergence Df is a strictly convex function f: (0, 1)J → ℝ, which is differentiable over ⅅJ. Let w ∈ ⅅJ. We define $w^* = \nabla f(w)$. Since f is a strictly convex function, the mapping $w\mapsto\nabla f(w)$ is injective; thus, the coordinates of $w^*$ form a coordinate system. There are two kinds of affine structures in ⅅJ: Df(w‖v) is convex in w with respect to the first structure and is convex in v with respect to the second, dual structure.
Therefore, the proof above, in fact, gives $[v]^* = \mathrm{LinOp}_a([w^{(1)}]^*,\ldots,[w^{(n)}]^*) + c$ for $v = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$, where $c = (\lambda,\ldots,\lambda)$ (J times) is a normalizing vector induced by $\sum_{j=1}^J v_j = 1$.
The only other type of averaging projective procedure, $\hat{F}^{D_f,\mathcal{A}}$, that we consider here will be generated by a convex differentiable Bregman divergence Df, which is strictly convex in its second argument, and a family of weighting vectors $\mathcal{A}$, and is defined for every n ≥ 1 and all closed convex nonempty sets W1,…, Wn ⊆ ⅅJ by the following.
  • For an argument v ∈ ⅅJ, take w(i) to be the conjugated Df-projection of v into Wi for all 1 ≤ i ≤ n.
  • Set $\hat{F}^{D_f,\mathcal{A}}_{[W_1,\ldots,W_n]}(v) = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$, where $a\in\mathcal{A}$.
Given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, we will denote the set of all fixed points of $\hat{F}^{D_f,\mathcal{A}}$ defined above by $\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n)$, where $a\in\mathcal{A}$.
Note that we always require an additional assumption of Df being differentiable for this type of averaging projective procedure. This assumption is essential to the proofs of some results concerning this procedure. We note that both divergences KL and E2 are differentiable.
Given a family of weighting vectors $\mathcal{A}$, our aim is to investigate $\Theta_{\mathcal{A}}^{D_f} = \{\Theta_a^{D_f}: a\in\mathcal{A}\}$ and $\hat{\Theta}_{\mathcal{A}}^{D_f} = \{\hat{\Theta}_a^{D_f}: a\in\mathcal{A}\}$ as operators acting on $\mathcal{P}(\mathbb{D}^J)\times\cdots\times\mathcal{P}(\mathbb{D}^J)$. In particular, we ask the following questions. Given any closed convex nonempty sets W1,…, Wn ⊆ ⅅJ and $a\in\mathcal{A}$:
  • Are $\Theta_a^{D_f}(W_1,\ldots,W_n)$ and $\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n)$ always nonempty?
  • Are these sets always closed and convex?
If both answers are positive, then we can consider $\Theta_{\mathcal{A}}^{D_f}$ and $\hat{\Theta}_{\mathcal{A}}^{D_f}$ as probabilistic merging operators. In such a case, the following question makes sense.
  • As probabilistic merging operators, do they satisfy the consistency principle (CP)?
The fact that the answer to all three questions is “yes” is perhaps surprising, given that the much simpler obdurate merging operators do not satisfy (CP). We prove the above results in the following sequence of theorems, which conclude Section 2.
The following well-known lemma is a simple, but useful observation.
Lemma 1. Let Df be a Bregman divergence and a, v, w ∈ ⅅJ. Then:
$D_f(a\|v) - D_f(a\|w) - D_f(w\|v) = (a - w)\cdot(\nabla f(w) - \nabla f(v)).$
Theorem 7. Let Df be a convex Bregman divergence, W1,…, Wn ⊆ ⅅJ be closed convex nonempty sets and a ∈ ⅅn. Let v, w ∈ ⅅJ, u(1) ∈ W1,…, u(n) ∈ Wn and w(1) ∈ W1,…, w(n) ∈ Wn be such that v = LinOpa(u(1),…, u(n)), w = LinOpa(w(1),…, w(n)) and u(i) is the Df-projection of v into Wi, 1 ≤ i ≤ n. Then:
$\sum_{i=1}^n a_i D_f(u^{(i)}\|v) \leq \sum_{i=1}^n a_i D_f(w^{(i)}\|w).$
Proof. First of all, by the extended Pythagorean property, we have that:
$D_f(w^{(i)}\|v) - D_f(u^{(i)}\|v) - D_f(w^{(i)}\|u^{(i)}) \geq 0.$
By the parallelogram theorem:
$\sum_{i=1}^n a_i D_f(w^{(i)}\|v) = \sum_{i=1}^n a_i D_f(w^{(i)}\|w) + D_f(w\|v).$
Hence:
$\sum_{i=1}^n a_i D_f(w^{(i)}\|w) - \sum_{i=1}^n a_i D_f(u^{(i)}\|v) + D_f(w\|v) - \sum_{i=1}^n a_i D_f(w^{(i)}\|u^{(i)}) \geq 0. \qquad (4)$
Since we assume that Df(·‖·) is a convex function in both arguments, by the Jensen inequality:
$D_f(w\|v) - \sum_{i=1}^n a_i D_f(w^{(i)}\|u^{(i)}) \leq 0. \qquad (5)$
The Inequalities (4) and (5) give:
$\sum_{i=1}^n a_i D_f(u^{(i)}\|v) \leq \sum_{i=1}^n a_i D_f(w^{(i)}\|w),$
as required. □
Figure 6 depicts the situation in the proof above for n = 2. Arrows indicate corresponding divergences.
An interesting question related to conjugated Bregman projections arises as to whether a property similar to the Pythagorean property holds. It turns out that the corresponding property is the so-called four-point property, due to Csiszár and Tusnády. The following theorem, in the case of the KL-divergence, is a specific instance of a result in [28], Lemma 3, but the formulation using the term “conjugated KL-projection” first appeared in [21]. An illustration is depicted in Figure 7.
Theorem 8 (Four-Point Property). Let Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Let V be a convex closed nonempty subset ofJ, and let v, u, w, s ∈ ⅅJ be such that v is the conjugated Df-projection of w into V and uV is arbitrary. Then:
$D_f(s\|v) \leq D_f(s\|u) + D_f(s\|w).$
Proof. By Lemma 1, we have that:
$D_f(s\|w) = D_f(s\|v) - D_f(w\|v) - (s - w)\cdot(\nabla f(w) - \nabla f(v)).$
We can rewrite the above as:
$D_f(s\|w) - D_f(s\|v) + D_f(s\|u) = D_f(s\|u) - D_f(w\|v) - (s - w)\cdot(\nabla f(w) - \nabla f(v)). \qquad (6)$
Since Df(·‖·) is a convex differentiable function, by applying the first convexity condition twice, we have that:
$D_f(s\|u) \geq D_f(w\|v) + \sum_{j=1}^J (s_j - w_j)\frac{\partial}{\partial x_j}\big[D_f(x\|v)\big]\Big|_{x=w} + \sum_{j=1}^J (u_j - v_j)\frac{\partial}{\partial x_j}\big[D_f(w\|x)\big]\Big|_{x=v}, \qquad (7)$
where $\sum_{j=1}^J (s_j - w_j)\frac{\partial}{\partial x_j}\big[D_f(x\|v)\big]\big|_{x=w} = (s - w)\cdot(\nabla f(w) - \nabla f(v))$.
Expressions (6) and (7) give that:
$D_f(s\|u) + D_f(s\|w) - D_f(s\|v) \geq \sum_{j=1}^J (u_j - v_j)\frac{\partial}{\partial x_j}\big[D_f(w\|x)\big]\Big|_{x=v}.$
However, since v is the conjugated Df-projection of w into V, the directional derivative of Df(w‖·) at v in the direction from v to u must be greater than or equal to zero:
$\sum_{j=1}^J (u_j - v_j)\frac{\partial}{\partial x_j}\big[D_f(w\|x)\big]\Big|_{x=v} \geq 0,$
and the theorem follows. □
The following result appeared for the first time in [10], but without considering the weighting.
Theorem 9 (Characterization Theorem for $\Theta_a^{D_f}$). Let Df be a convex Bregman divergence, a ∈ ⅅn and W1,…, Wn ⊆ ⅅJ be closed convex nonempty sets. Then:
$\Theta_a^{D_f}(W_1,\ldots,W_n) = \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\},$
where the right-hand side denotes the set of all possible minimizers, that is, the set of all probability functions v ∈ ⅅJ which globally minimize $\sum_{i=1}^n a_i D_f(w^{(i)}\|v)$, subject only to w(1) ∈ W1,…, w(n) ∈ Wn.
Proof. It is easy to see that, given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, we have that those w(1)W1,…, w(n)Wn, which together with v ∈ ⅅJ, globally minimize:
$\sum_{i=1}^n a_i D_f(w^{(i)}\|v),$
are also the Df-projections of v into W1,…, Wn, respectively. This, together with Equation (1) (the equation preceding Theorem 2), gives:
$\Theta_a^{D_f}(W_1,\ldots,W_n) \supseteq \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}.$
Now, assume that $v\in\Theta_a^{D_f}(W_1,\ldots,W_n)$ and:
$u\in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}.$
Let us denote the Df-projections of v into W1,…, Wn by w(1),…, w(n), respectively. Accordingly, let us denote the Df-projections of u into W1,…, Wn by r(1),…, r(n), respectively. Suppose that $\sum_{i=1}^n a_i D_f(w^{(i)}\|v) > \sum_{i=1}^n a_i D_f(r^{(i)}\|u)$, i.e.,
$v\notin\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}.$
This contradicts Theorem 7, and therefore:
$\Theta_a^{D_f}(W_1,\ldots,W_n) \subseteq \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}. \;\square$
Let us now digress for a while from the goals of this subsection and stress the importance of the restriction to the positive discrete probability functions, which was detailed in Section 1.1. The problem with the KL-divergence is that the function $f(x) = \sum_{j=1}^J x_j\log x_j$ is not differentiable if some xj = 0. Without the adopted restriction, the KL-divergence is therefore usually defined by:
$\mathrm{KL}(w\|v) = \begin{cases}\sum_{j: v_j\neq 0} w_j\log\frac{w_j}{v_j}, & \text{if } v_j = 0 \text{ implies } w_j = 0 \text{ for all } 1\le j\le J,\\ +\infty, & \text{otherwise}.\end{cases}$
If vj = 0 implies wj = 0 for all 1 ≤ j ≤ J, we say that v dominates w and write v ≫ w.
The first problem we would face with this definition is whether the notion of the KL-projection makes sense. For given v ∈ ⅅJ and a closed convex nonempty set W ⊆ ⅅJ, the KL-projection of v into W makes sense only if there is at least one w ∈ W such that v ≫ w.
However, even if we add this condition to all of the discussion concerning the KL-projection above (this is perfectly possible, as seen in [10]), Theorem 9 still could not hold, as the following example demonstrates.
Example 4. Let $W_1 = \{\lambda(0, 0, \frac{1}{6}, \frac{5}{6}) + (1-\lambda)(0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3}) : \lambda\in[0,1]\}$ and $W_2 = \{\lambda(0, 0, \frac{1}{3}, \frac{2}{3}) + (1-\lambda)(0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3}) : \lambda\in[0,1]\}$. Assume that $a = (\frac{1}{2}, \frac{1}{2})$. It is easy to check that $(0, 0, \frac{1}{4}, \frac{3}{4})$ and $(0, \frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ are both fixed points, but the former does not belong to the set of global minimizers v of $\mathrm{KL}(w^{(1)}\|v) + \mathrm{KL}(w^{(2)}\|v)$ subject to w(1) ∈ W1 and w(2) ∈ W2. An illustration is depicted in Figure 8.
Moreover, some variant of the above example would show that the set $\Theta_a^{\mathrm{KL}}(W_1, W_2)$ is not convex, which would wreck our aims; more details are given in [10].
On the other hand, none of those Bregman divergences whose generating functions are differentiable over the whole space of discrete probability functions (e.g., the squared Euclidean distance) would encounter the difficulties of the KL-divergence. In particular, Theorem 9 formulated over the whole space of discrete probability functions (as opposed to only the positive ones) would still hold for such Bregman divergences.
Now, we shall go back and prove a theorem similar to Theorem 9 for the $\hat{\Theta}_{\mathcal{A}}^{D_f}$-operator. In order to do that, we will need the following analogue of Theorem 7.
Theorem 10. Let Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W1,…, Wn ⊆ ⅅJ be closed convex nonempty sets and a ∈ ⅅn. Let v, w ∈ ⅅJ, u(1) ∈ W1,…, u(n) ∈ Wn and w(1) ∈ W1,…, w(n) ∈ Wn be such that $v = \mathrm{Pool}_a^{D_f}(u^{(1)},\ldots,u^{(n)})$, $w = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$ and u(i) is the conjugated Df-projection of v into Wi, 1 ≤ i ≤ n. Then:
$\sum_{i=1}^n a_i D_f(v\|u^{(i)}) \leq \sum_{i=1}^n a_i D_f(w\|w^{(i)}).$
Proof. By Theorem 6, we have that:
$\sum_{i=1}^n a_i D_f(w\|u^{(i)}) = \sum_{i=1}^n a_i D_f(v\|u^{(i)}) + D_f(w\|v),$
which, by the four-point property (Theorem 8; notice that we need the differentiability of Df to employ it), becomes:
$\sum_{i=1}^n a_i D_f(w\|w^{(i)}) + D_f(w\|v) \geq \sum_{i=1}^n a_i D_f(v\|u^{(i)}) + D_f(w\|v),$
and hence:
$\sum_{i=1}^n a_i D_f(v\|u^{(i)}) \leq \sum_{i=1}^n a_i D_f(w\|w^{(i)}),$
as required; see Figure 9. □
The theorem above is fairly similar to Theorem 7. Let us use the dual affine structure in ⅅJ defined after the proof of Theorem 6 to analyze this more closely. For W ⊆ ⅅJ, define $W^* = \{w^*: w\in W\}$, and define the dual divergence $D_f^*$ to the divergence Df by $D_f^*(v^*\|w^*) = D_f(w\|v)$. Since, by Theorem 6, we have that $[v]^* = \mathrm{LinOp}_a([w^{(1)}]^*,\ldots,[w^{(n)}]^*) + c_v$ for $v = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$, where $c_v = (\lambda,\ldots,\lambda)$ (J times) is a normalizing vector induced by $\sum_{j=1}^J v_j = 1$, the theorem above can be rewritten as follows.
Let Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W1,…, Wn ⊆ ⅅJ be closed convex nonempty sets and a ∈ ⅅn. Let v, w ∈ ⅅJ, u(1) ∈ W1,…, u(n) ∈ Wn and w(1) ∈ W1,…, w(n) ∈ Wn be such that $v^* = \mathrm{LinOp}_a([u^{(1)}]^*,\ldots,[u^{(n)}]^*) + c_v$, $w^* = \mathrm{LinOp}_a([w^{(1)}]^*,\ldots,[w^{(n)}]^*) + c_w$ and $[u^{(i)}]^*$ is the $D_f^*$-projection of $v^*$ into $W_i^*$, 1 ≤ i ≤ n. Then:
$\sum_{i=1}^n a_i D_f^*\big([u^{(i)}]^*\,\big\|\,v^*\big) \leq \sum_{i=1}^n a_i D_f^*\big([w^{(i)}]^*\,\big\|\,w^*\big).$
This illustrates that if Df is a convex differentiable Bregman divergence that is strictly convex in its second argument, then Theorems 7 and 10 are dual with respect to ∗.
Theorem 11 (Characterization Theorem for $\hat{\Theta}_a^{D_f}$). Let Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument, and let W1,…, Wn ⊆ ⅅJ be closed convex nonempty sets and a ∈ ⅅn. Then:
$\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n) = \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\},$
where the right-hand side denotes the set of all possible minimizers.
Proof. The proof is similar to the proof of Theorem 9. First, given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, we have that those w(1) ∈ W1,…, w(n) ∈ Wn, which together with v ∈ ⅅJ, globally minimize:
$\sum_{i=1}^n a_i D_f(v\|w^{(i)}),$
are also the conjugated Df-projections of v into W1,…, Wn, respectively. This, together with the definition of $\mathrm{Pool}_a^{D_f}$, gives:
$\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n) \supseteq \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}.$
Second, assume that $v\in\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n)$ and:
$u\in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}.$
Let us denote the conjugated Df-projections of v into W1,…, Wn by w(1),…, w(n), respectively. Accordingly, let us denote the conjugated Df-projections of u into W1,…, Wn by r(1),…, r(n), respectively. Suppose that:
$\sum_{i=1}^n a_i D_f(u\|r^{(i)}) < \sum_{i=1}^n a_i D_f(v\|w^{(i)}),$
i.e., $v\notin\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}$. This contradicts Theorem 10, and therefore:
$\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n) \subseteq \left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}. \;\square$
The following simple observation, originally from [10] and based on Equation (1) (alternatively, on the parallelogram theorem), will be used in the proof of the forthcoming theorem.
Lemma 2. Let Df be a convex Bregman divergence and a ∈ ⅅn. Then, the following are equivalent:
  • The probability functions v, w(1),…, w(n) ∈ ⅅJ minimize the quantity:
    $\sum_{i=1}^n a_i D_f(w^{(i)}\|v)$
    subject to w(1)W1,…, w(n)Wn.
  • The probability functions w(1),…, w(n) ∈ ⅅJ minimize the quantity:
    $\sum_{i=1}^n a_i D_f\big(w^{(i)}\,\big\|\,\mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})\big)$
    subject to w(1)W1,…, w(n)Wn and v = LinOpa(w(1),…, w(n)).
Theorem 12. Let Df be a convex Bregman divergence. Then, for all nonempty closed convex sets W1,…, Wn ⊆ ⅅJ and a ∈ ⅅn, the set $\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}$ is a nonempty closed convex region of ⅅJ.
Proof. This proof is from [10]. Let $v, s\in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}$; the set is clearly nonempty. For convexity, we need to show that $\lambda v + (1-\lambda)s$ belongs to this set for any λ ∈ [0, 1].
Assume that w(1) ∈ W1,…, w(n) ∈ Wn are such that v = LinOpa(w(1),…, w(n)) and u(1) ∈ W1,…, u(n) ∈ Wn are such that s = LinOpa(u(1),…, u(n)). It is easy to observe that the convexity of Df(·‖·) implies the convexity of:
$g(x^{(1)},\ldots,x^{(n)}) = \sum_{i=1}^n a_i D_f\big(x^{(i)}\,\big\|\,\mathrm{LinOp}_a(x^{(1)},\ldots,x^{(n)})\big)$
over the convex region specified by constraints x(i)Wi, 1 ≤ in. Moreover, the function g attains its minimum over this convex region at points (w(1),…, w(n)) and (u(1),…, u(n)). We need to show that g also attains its minimum at the point:
λ ( w ( 1 ) , , w ( n ) ) + ( 1 λ ) ( u ( 1 ) , , u ( n ) )
for any λ ∈ [0, 1]. Since g is convex by the Jensen inequality, we have that:
$\lambda g(w^{(1)},\ldots,w^{(n)}) + (1-\lambda)g(u^{(1)},\ldots,u^{(n)}) \geq g\big(\lambda(w^{(1)},\ldots,w^{(n)}) + (1-\lambda)(u^{(1)},\ldots,u^{(n)})\big).$
Since g(w(1),…, w(n)) = g(u(1),…, u(n)), the inequality above can only hold with equality, and therefore, by Lemma 2,
$\lambda v + (1-\lambda)s \in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(w^{(i)}\|v) : w^{(i)}\in W_i, 1\le i\le n\right\}$
for any λ ∈ [0, 1].
Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W1,…, Wn are all closed and convex implies that the set of n-tuples (w(1),…, w(n)), which are global minimizers of g over the region specified by w(i) ∈ Wi, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by LinOpa(w(1),…, w(n)) is closed, as well. □
The following observation immediately follows by the definition of $\mathrm{Pool}_a^{D_f}$.
Lemma 3. Let Df be a convex Bregman divergence and a ∈ ⅅn. Then, the following are equivalent:
  • The probability functions v, w(1),…, w(n) ∈ ⅅJ minimize the quantity:
    $\sum_{i=1}^n a_i D_f(v\|w^{(i)})$
    subject to w(1)W1,…, w(n)Wn.
  • The probability functions w(1),…, w(n) ∈ ⅅJ minimize the quantity:
    $\sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})\,\big\|\,w^{(i)}\big)$
    subject to w(1) ∈ W1,…, w(n) ∈ Wn and $v = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$.
Theorem 13. Let Df be a convex Bregman divergence. Then, for all nonempty closed convex sets W1,…, Wn ⊆ ⅅJ and a ∈ ⅅn, the set $\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}$ is a nonempty closed convex region of ⅅJ.
Proof. Let $v, s\in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}$; the set is clearly nonempty. For convexity, we need to show that $\lambda v + (1-\lambda)s$ belongs to this set for any λ ∈ [0, 1].
Assume that w(1) ∈ W1,…, w(n) ∈ Wn are such that $v = \mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$ and u(1) ∈ W1,…, u(n) ∈ Wn are such that $s = \mathrm{Pool}_a^{D_f}(u^{(1)},\ldots,u^{(n)})$. Now, for any λ ∈ [0, 1],
$\lambda\sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})\,\big\|\,w^{(i)}\big) + (1-\lambda)\sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(u^{(1)},\ldots,u^{(n)})\,\big\|\,u^{(i)}\big) \geq \sum_{i=1}^n a_i D_f\big(\lambda\,\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)}) + (1-\lambda)\,\mathrm{Pool}_a^{D_f}(u^{(1)},\ldots,u^{(n)})\,\big\|\,\lambda w^{(i)} + (1-\lambda)u^{(i)}\big) \geq \sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(\lambda w^{(1)} + (1-\lambda)u^{(1)},\ldots,\lambda w^{(n)} + (1-\lambda)u^{(n)})\,\big\|\,\lambda w^{(i)} + (1-\lambda)u^{(i)}\big),$
where the first inequality follows by the convexity of Df(·‖·) and the second by the definition of $\mathrm{Pool}_a^{D_f}$ as the unique minimizer. However, the inequalities above can only hold with equality, and, by Lemma 3,
$\lambda v + (1-\lambda)s \in\left\{\arg\min_{v\in\mathbb{D}^J}\sum_{i=1}^n a_i D_f(v\|w^{(i)}) : w^{(i)}\in W_i, 1\le i\le n\right\}$
for any λ ∈ [0, 1].
Moreover, since convexity implies continuity, the minimization of a convex function over a closed convex region produces a closed convex set. Therefore, the fact that W1,…, Wn are all closed and convex implies that the set of n-tuples (w(1),…, w(n)), which are global minimizers of $\sum_{i=1}^n a_i D_f\big(\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})\,\big\|\,w^{(i)}\big)$ over the region specified by w(i) ∈ Wi, 1 ≤ i ≤ n, is closed. Additionally, since closed regions are preserved by projections in the Euclidean space, the set given by $\mathrm{Pool}_a^{D_f}(w^{(1)},\ldots,w^{(n)})$ is closed, as well. □
Finally, we can establish our initial claims:
Theorem 14. Let $\mathcal{A}$ be a family of weighting vectors. The operator $\Theta_{\mathcal{A}}^{D_f}$, where Df is a convex Bregman divergence, and the operator $\hat{\Theta}_{\mathcal{A}}^{D_f}$, where Df is a convex differentiable Bregman divergence, which is strictly convex in its second argument, are well-defined probabilistic merging operators that satisfy (CP).
Proof. First, the fact that $\Theta_{\mathcal{A}}^{D_f}$ is well defined as a probabilistic merging operator follows from Theorems 9 and 12. Accordingly, $\hat{\Theta}_{\mathcal{A}}^{D_f}$ is a well-defined probabilistic merging operator by Theorems 11 and 13.
Second, let $a\in\mathcal{A}$ (in particular, a ∈ ⅅn) and W1,…, Wn ⊆ ⅅJ be closed, convex, nonempty and have a nonempty intersection. Clearly, every point in that intersection minimizes $\sum_{i=1}^n a_i D_f(w^{(i)}\|v)$ and $\sum_{i=1}^n a_i D_f(v\|w^{(i)})$ subject to w(1) ∈ W1,…, w(n) ∈ Wn, with both expressions attaining the value zero. Since Df(w‖v) = 0 only if w = v, those points in the intersection are the only points minimizing the above quantities. □
It turns out that, given closed convex nonempty sets W1,…, Wn ⊆ ⅅJ and a weighting a, the sets of fixed points $\Theta_a^{D_f}(W_1,\ldots,W_n)$ and $\hat{\Theta}_a^{D_f}(W_1,\ldots,W_n)$ possess attractive properties, which make the operators $\Theta_{\mathcal{A}}^{D_f}$ and $\hat{\Theta}_{\mathcal{A}}^{D_f}$ suitable for probabilistic merging. The following example, taken from [10], illustrates a possible philosophical justification for considering the set of all fixed points of a mapping consisting of a convex Bregman projection and a pooling operator.
Example 5. Assume that there are n experts, each with his own knowledge represented by closed convex nonempty sets W1,…, Wn ⊆ ⅅJ, respectively. Say that an independent chairman of the college has announced a probability function v to represent the agreement of the college of experts. Each expert then naturally updates his own knowledge by what seems to be the right probability function. In other words, the expert “i” projects v to Wi, obtaining the probability function w(i). Each expert subsequently accepts w(i) as his working hypothesis, but he does not discard his knowledge base Wi; he only takes into account other people’s opinions. Then, it is easy for the chairman to identify the average of the actual beliefs w(1),…, w(n) of the experts. If he found that this average v′ did not coincide with the originally announced probability function v, then he would naturally feel unhappy about such a choice, so he would be tempted to iterate the process in the hope that, eventually, he will find a fixed point.
It seems that, in a broad philosophical setting, such as in the example above, we ought to study any possible combination of Bregman projections with pooling operators. The question as to which other combination produces a well-defined probabilistic merging operator satisfying the consistency principle (CP) is open to investigation.

3. Convergence

3.1. Iterative Processes

In this section, we continue the investigation of the averaging projective procedures $F^{D_f,\mathcal{A}}$ and $\hat{F}^{D_f,\mathcal{A}}$. Recall that, given a convex Bregman divergence Df and a family of weighting vectors $\mathcal{A}$, $F^{D_f,\mathcal{A}}$ was defined in the previous section for every n ≥ 1 and all closed convex nonempty sets W1,…, Wn ⊆ ⅅJ by the following.
  • For an argument v ∈ ⅅJ, take w(i) as the Df-projection of v into Wi for all 1 ≤ i ≤ n.
  • Set $F^{D_f,\mathcal{A}}_{[W_1,\ldots,W_n]}(v) = \mathrm{LinOp}_a(w^{(1)},\ldots,w^{(n)})$, where $a\in\mathcal{A}$.
For Df, which is moreover differentiable and strictly convex in the second argument, $\hat{F}^{D_f,\mathcal{A}}$ was defined analogously by conjugated projections and the $\mathrm{Pool}_{\mathcal{A}}^{D_f}$-pooling operator.
Our current aim is to find out what will happen if we iterate the application of the averaging projective procedures $F^{D_f,\mathcal{A}}$ and $\hat{F}^{D_f,\mathcal{A}}$. In particular:
  • Will the resulting sequences converge?
We shall find the answer in this subsection.
It is intriguing that we can abstractly define a “conjugated projection” with respect to a weighted sum of a convex differentiable Bregman divergence $D_f$. Let w(1),…, w(n) ∈ ⅅJ and a ∈ ⅅn. Then, the “conjugated projection” of (w(1),…, w(n)) into ⅅJ is defined as the global minimizer of $\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v)$, which, by Equation (1), is $v = \mathrm{LinOp}_{\mathbf a}(w^{(1)},\dots,w^{(n)})$.
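As a quick sanity check of this claim, here is a sketch of the computation, assuming that f is twice differentiable with a positive definite Hessian: writing $D_f(w\,\|\,v) = f(w) - f(v) - \langle\nabla f(v), w - v\rangle$ and differentiating with respect to v gives
$$\frac{\partial}{\partial v}\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v) = \sum_{i=1}^n a_i\Bigl(-\nabla f(v) - \nabla^2 f(v)\,(w^{(i)} - v) + \nabla f(v)\Bigr) = -\nabla^2 f(v)\Bigl(\sum_{i=1}^n a_i w^{(i)} - v\Bigr),$$
which vanishes exactly at $v = \sum_{i=1}^n a_i w^{(i)} = \mathrm{LinOp}_{\mathbf a}(w^{(1)},\dots,w^{(n)})$; since this point lies in ⅅJ and the objective is convex in v, the stationary point is indeed the global minimizer.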
The claim that this behaves as a “conjugated projection” is supported by the following analogue of the four-point property illustrated in Figure 10.
Theorem 15. Let $D_f$ be a convex differentiable Bregman divergence. Let a ∈ ⅅn, w(1),…, w(n) ∈ ⅅJ and $v = \mathrm{LinOp}_{\mathbf a}(w^{(1)},\dots,w^{(n)})$. Let u(1),…, u(n) ∈ ⅅJ and u ∈ ⅅJ. Then:
$$\sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,v) \le \sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,u) + \sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,w^{(i)}).$$
Proof. The proof is similar to that of the actual four-point property (Theorem 8), only with a slightly different argument at the end: after obtaining:
$$\sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,v) \le \sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,u) + \sum_{i=1}^n a_i D_f(u^{(i)}\,\|\,w^{(i)}) - \sum_{i=1}^n a_i \sum_{j=1}^J (u_j - v_j)\,\frac{\partial}{\partial x_j}\Bigl[D_f(w^{(i)}\,\|\,x)\Bigr]\Big|_{x=v},$$
we proceed with:
$$\sum_{i=1}^n a_i \sum_{j=1}^J (u_j - v_j)\,\frac{\partial}{\partial x_j}\Bigl[D_f(w^{(i)}\,\|\,x)\Bigr]\Big|_{x=v} = -\sum_{j=1}^J (u_j - v_j)\Biggl[\sum_{k=1}^J \Bigl(\sum_{i=1}^n a_i w_k^{(i)} - v_k\Bigr)\frac{\partial^2 f(x)}{\partial x_k\,\partial x_j}\Big|_{x=v}\Biggr] = 0,$$
since $\sum_{i=1}^n a_i w_k^{(i)} = v_k$ for all 1 ≤ k ≤ J, and the theorem follows. □
Similarly, given w(1),…, w(n) ∈ ⅅJ, a ∈ ⅅn and a convex differentiable Bregman divergence $D_f$, which is strictly convex in its second argument, we can consider $\mathrm{Pool}_{\mathbf a}^{D_f}(w^{(1)},\dots,w^{(n)})$ the “projection” of (w(1),…, w(n)) into ⅅJ, since Theorem 6 resembles (a special case of) the extended Pythagorean property: for any u ∈ ⅅJ,
$$\sum_{i=1}^n a_i D_f\bigl(u\,\|\,\mathrm{Pool}_{\mathbf a}^{D_f}(w^{(1)},\dots,w^{(n)})\bigr) + \sum_{i=1}^n a_i D_f\bigl(\mathrm{Pool}_{\mathbf a}^{D_f}(w^{(1)},\dots,w^{(n)})\,\|\,w^{(i)}\bigr) = \sum_{i=1}^n a_i D_f(u\,\|\,w^{(i)}).$$
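For intuition, consider the special case $f(x) = \frac12\|x\|^2$ (a sketch under this assumption only), where $D_f(u\,\|\,w) = \frac12\|u - w\|^2$ and $\mathrm{Pool}_{\mathbf a}^{D_f}$ reduces to the weighted mean $m = \sum_{i=1}^n a_i w^{(i)}$. The identity above is then just the classical variance decomposition
$$\sum_{i=1}^n a_i\,\tfrac12\|u - w^{(i)}\|^2 = \tfrac12\|u - m\|^2 + \sum_{i=1}^n a_i\,\tfrac12\|m - w^{(i)}\|^2,$$
which holds because the cross term $\langle u - m,\ \sum_{i=1}^n a_i (m - w^{(i)})\rangle$ vanishes by the definition of m.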
The two observations above and the following lemma will be essential to the proofs of the two main theorems of this subsection.
Lemma 4. Let Df be a convex Bregman divergence. Assume that we are given a closed convex nonempty set W ⊆ ⅅJ and points v[i] ∈ ⅅJ, i = 1, 2,…, and w[i] ∈ ⅅJ, i = 1, 2,…, such that w[i] is the Df-projection of v[i] into W for all i = 1, 2,…. Assume that $\{v[i]\}_{i=1}^\infty$ converges to v ∈ ⅅJ and $\{w[i]\}_{i=1}^\infty$ converges to w ∈ ⅅJ. Then, w is the Df-projection of v into W.
Proof. For a contradiction, assume that the Df-projection of v into W, denoted by $\bar w$, is distinct from w. Then, by the extended Pythagorean property, $D_f(w[i]\,\|\,v[i]) + D_f(\bar w\,\|\,w[i]) \le D_f(\bar w\,\|\,v[i])$. Since Df(·‖·) is continuous (see Section 1.1), we have that:
$$\lim_{i\to\infty} D_f(w[i]\,\|\,v[i]) = D_f(w\,\|\,v),\quad \lim_{i\to\infty} D_f(\bar w\,\|\,w[i]) = D_f(\bar w\,\|\,w)\quad\text{and}\quad \lim_{i\to\infty} D_f(\bar w\,\|\,v[i]) = D_f(\bar w\,\|\,v).$$
Therefore, $D_f(w\,\|\,v) + D_f(\bar w\,\|\,w) \le D_f(\bar w\,\|\,v)$, which contradicts the assumption that $\bar w$ is the Df-projection of v into W. □
Finally, we are going to answer the question of whether the iteration of the averaging projective procedures $F^{D_f,A}$ and $\hat F^{D_f,A}$ converges; the result for $F^{D_f,A}$ will, however, be limited to the case when $D_f$ is differentiable. Both results below should be attributed to a number of people. First, the results are applications of the well-known alternating projections due to Csiszár and Tusnády; see [28], Theorem 3. In the particular case of the Kullback–Leibler divergence, the theorems were observed and proven by Matúš in [21]. Last, but not least, Eggermont and LaRiccia reformulated the original alternating projections in terms of Bregman divergences in [29].
Theorem 16. Let $D_f$ be a convex differentiable Bregman divergence, $A$ be a family of weighting vectors and a ∈ A be such that a ∈ ⅅn, and let W1,…, Wn ⊆ ⅅJ be closed, convex and nonempty. Then, for any v ∈ ⅅJ, the sequence:
$$\{v[i]\}_{i=0}^\infty,$$
where $v[0] = v$ and $v[i+1] = F^{D_f,A}_{[W_1,\dots,W_n]}(v[i])$, converges to some probability function in $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. (Recall that $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is the set of fixed points of $F^{D_f,A}_{[W_1,\dots,W_n]}$, i.e., all points v such that $F^{D_f,A}_{[W_1,\dots,W_n]}(v) = v$.)
Proof. This proof is inspired by [21].
Denote the Df-projections of v[i] into W1,…, Wn by π1v[i],…, πnv[i], respectively. Then, it is easy to observe that:
$$\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i]) \ge \sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i+1]) \ge \sum_{k=1}^n a_k D_f(\pi_k v[i+1]\,\|\,v[i+1]),$$
for all i = 1, 2,…. Due to the monotonicity of this sequence (it is nonincreasing and bounded below by zero), the limit $\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i])$ exists. Thanks to the compactness of W1,…, Wn, the sequence $\{(\pi_1 v[i],\dots,\pi_n v[i], v[i])\}_{i=1}^\infty$ has a convergent subsequence. Let us denote the limit of this subsequence by (π1v,…, πnv, v). Due to Lemma 4, πkv is indeed the Df-projection of v into Wk for all 1 ≤ k ≤ n. Moreover,
$$\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i]) = \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v).$$
By Theorem 15:
$$\sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v[i]) \le \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v) + \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i-1]). \tag{8}$$
This is because $v[i] = \mathrm{LinOp}_{\mathbf a}(\pi_1 v[i-1],\dots,\pi_n v[i-1])$. Moreover, by the extended Pythagorean property:
$$\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i]) + \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i]) \le \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v[i]). \tag{9}$$
An illustration of the situation is depicted in Figure 11.
Now, since:
$$\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i]) = \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v)$$
and $\sum_{k=1}^n a_k D_f(\pi_k v[i]\,\|\,v[i]) \ge \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,v)$ for all i = 1, 2,…, Equations (8) and (9) give that:
$$\sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i]) \le \sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i-1]) \tag{10}$$
for all i = 1, 2,…. We conclude that this is possible only if:
$$\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i])$$
exists.
However, we already know that a subsequence of $\{(\pi_1 v[i],\dots,\pi_n v[i], v[i])\}_{i=1}^\infty$ converges to (π1v,…, πnv, v); hence, a subsequence of the sequence $\{\sum_{k=1}^n a_k D_f(\pi_k v\,\|\,\pi_k v[i])\}_{i=1}^\infty$ decreases to zero, which, by Equation (10), forces the whole sequence to converge to zero. Due to the fact that Df(x‖y) = 0 only if x = y, and by continuity, we get:
$$\lim_{i\to\infty}\pi_k v[i] = \pi_k v.$$
It follows that $\lim_{i\to\infty} v[i]$ exists and is equal to v. Moreover, $v = \lim_{i\to\infty} v[i+1] = \lim_{i\to\infty}\mathrm{LinOp}_{\mathbf a}(\pi_1 v[i],\dots,\pi_n v[i]) = \mathrm{LinOp}_{\mathbf a}(\pi_1 v,\dots,\pi_n v)$; therefore, v is a fixed point of the mapping $F^{D_f,A}_{[W_1,\dots,W_n]}$ and, hence, $v \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. □
The following analogue of Lemma 4 will be needed in the forthcoming theorem.
Lemma 5. Let Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument. Assume that we are given a closed convex nonempty set W ⊆ ⅅJ and points v[i] ∈ ⅅJ, i = 1, 2,…, and w[i] ∈ ⅅJ, i = 1, 2,…, such that w[i] is the conjugated Df-projection of v[i] into W for all i = 1, 2,…. Assume that $\{v[i]\}_{i=1}^\infty$ converges to v ∈ ⅅJ and $\{w[i]\}_{i=1}^\infty$ converges to w ∈ ⅅJ. Then, w is the conjugated Df-projection of v into W.
Proof. For a contradiction, assume that the conjugated Df-projection of v into W, denoted by $\bar w$, is distinct from w. Then, by the four-point property, $D_f(v[i]\,\|\,w[i]) \le D_f(v[i]\,\|\,\bar w) + D_f(v[i]\,\|\,v)$. Since Df(·‖·) is continuous, we have that:
$$\lim_{i\to\infty} D_f(v[i]\,\|\,w[i]) = D_f(v\,\|\,w),\quad \lim_{i\to\infty} D_f(v[i]\,\|\,\bar w) = D_f(v\,\|\,\bar w)\quad\text{and}\quad \lim_{i\to\infty} D_f(v[i]\,\|\,v) = D_f(v\,\|\,v) = 0.$$
Therefore, $D_f(v\,\|\,w) \le D_f(v\,\|\,\bar w)$, which contradicts the assumption that $\bar w$ is the conjugated Df-projection of v into W (this projection is unique, since Df is strictly convex in its second argument). □
Theorem 17. Let $D_f$ be a convex differentiable Bregman divergence, which is strictly convex in its second argument, $A$ be a family of weighting vectors and a ∈ A be such that a ∈ ⅅn, and let W1,…, Wn ⊆ ⅅJ be closed, convex and nonempty. Then, for any v ∈ ⅅJ, the sequence:
$$\{v[i]\}_{i=0}^\infty,$$
where $v[0] = v$ and $v[i+1] = \hat F^{D_f,A}_{[W_1,\dots,W_n]}(v[i])$, converges to some probability function in $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. (Recall that $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is the set of fixed points of $\hat F^{D_f,A}_{[W_1,\dots,W_n]}$, i.e., all points v such that $\hat F^{D_f,A}_{[W_1,\dots,W_n]}(v) = v$.)
Proof. Denote the conjugated Df-projections of v[i] into W1,…, Wn by π1v[i],…, πnv[i], respectively. Then, it is easy to observe that:
$$\sum_{k=1}^n a_k D_f(v[i]\,\|\,\pi_k v[i]) \ge \sum_{k=1}^n a_k D_f(v[i+1]\,\|\,\pi_k v[i]) \ge \sum_{k=1}^n a_k D_f(v[i+1]\,\|\,\pi_k v[i+1]),$$
for all i = 1, 2,…. Due to the monotonicity of this sequence, the limit $\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(v[i]\,\|\,\pi_k v[i])$ exists. Thanks to the compactness of W1,…, Wn, the sequence $\{(\pi_1 v[i],\dots,\pi_n v[i], v[i])\}_{i=1}^\infty$ has a convergent subsequence. Let us denote the limit of this subsequence by (π1v,…, πnv, v). Due to Lemma 5, πkv is indeed the conjugated Df-projection of v into Wk for all 1 ≤ k ≤ n. Moreover:
$$\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(v[i]\,\|\,\pi_k v[i]) = \sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v).$$
By the four-point property:
$$\sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v[i]) \le \sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v) + D_f(v\,\|\,v[i]). \tag{11}$$
Moreover, by Theorem 6:
$$\sum_{k=1}^n a_k D_f(v[i+1]\,\|\,\pi_k v[i]) + D_f(v\,\|\,v[i+1]) = \sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v[i]). \tag{12}$$
That is because $v[i+1] = \mathrm{Pool}_{\mathbf a}^{D_f}(\pi_1 v[i],\dots,\pi_n v[i])$. An illustration of the situation is depicted in Figure 12.
Now, since:
$$\lim_{i\to\infty}\sum_{k=1}^n a_k D_f(v[i+1]\,\|\,\pi_k v[i]) = \sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v)$$
and $\sum_{k=1}^n a_k D_f(v[i+1]\,\|\,\pi_k v[i]) \ge \sum_{k=1}^n a_k D_f(v\,\|\,\pi_k v)$ for all i = 1, 2,…, the expressions (11) and (12) give that:
$$D_f(v\,\|\,v[i+1]) \le D_f(v\,\|\,v[i]) \tag{13}$$
for all i = 1, 2,…. We conclude that this is possible only if:
$$\lim_{i\to\infty} D_f(v\,\|\,v[i])$$
exists.
However, we already know that a subsequence of $\{v[i]\}_{i=1}^\infty$ converges to v; hence, a subsequence of the sequence $\{D_f(v\,\|\,v[i])\}_{i=1}^\infty$ decreases to zero, which, by Equation (13), forces the whole sequence to converge to zero. Due to the fact that Df(x‖y) = 0 only if x = y, and by continuity, we get:
$$\lim_{i\to\infty} v[i] = v$$
and, subsequently, $\lim_{i\to\infty}\pi_k v[i] = \pi_k v$, $1 \le k \le n$ (a subsequence of $\{\pi_k v[i]\}_{i=1}^\infty$ has πkv as a limit, and $\{D_f(v[i]\,\|\,\pi_k v[i])\}_{i=1}^\infty$ is monotonic).
Moreover, $v = \lim_{i\to\infty} v[i+1] = \lim_{i\to\infty}\mathrm{Pool}_{\mathbf a}^{D_f}(\pi_1 v[i],\dots,\pi_n v[i]) = \mathrm{Pool}_{\mathbf a}^{D_f}(\pi_1 v,\dots,\pi_n v)$, since $\mathrm{Pool}_{\mathbf a}^{D_f}$ is continuous ($\sum_{k=1}^n a_k D_f(\cdot\,\|\,\cdot)$ is continuous and strictly convex in the first argument). Therefore, v is a fixed point of the mapping $\hat F^{D_f,A}_{[W_1,\dots,W_n]}$ and, hence, $v \in \hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. □
The problem of characterizing the limits in Theorems 16 and 17 more precisely remains open. On the other hand, the theorems suggest a way to compute at least some points in $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ and $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$, although we have not investigated how fast the sequences converge. Moreover, the question of how efficiently Df-projections and conjugated Df-projections can be computed was also left unanswered. This latter problem was nevertheless addressed in the literature, at least in the case of the KL-divergence and sets W1,…, Wn generated by finite collections of marginal probability functions. In such a case, the well-known iterative proportional fitting procedure (IPFP) can be employed effectively [16].
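For the reader's convenience, here is a minimal Python sketch of IPFP in its simplest setting, a two-dimensional table with prescribed row and column marginals; the function name and the fixed iteration count are our own illustrative choices, and [16] should be consulted for the general procedure and its convergence properties.

  import numpy as np

  def ipfp(P, row_marginals, col_marginals, iterations=1000):
      # KL-project the joint table P onto the set of tables with the
      # prescribed row and column marginals by alternating rescalings
      Q = P.astype(float).copy()
      for _ in range(iterations):
          Q *= (row_marginals / Q.sum(axis=1))[:, None]  # fit the row sums
          Q *= (col_marginals / Q.sum(axis=0))[None, :]  # fit the column sums
      return Q

  # Example: rescale a uniform 2x2 table to marginals (0.7, 0.3) and (0.6, 0.4).
  Q = ipfp(np.full((2, 2), 0.25), np.array([0.7, 0.3]), np.array([0.6, 0.4]))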

3.2. Chairmen Theorems

In this section, for a convex differentiable Bregman divergence $D_f$, which is strictly convex in its second argument, and a family of weighting vectors $A$, we investigate the susceptibility of the $\Theta_A^{D_f}$- and $\hat\Theta_A^{D_f}$-merging operators to a small bias by an arbitrary probability function in ⅅJ. The study of this problem first occurred in [18]. There, Wilmers argued that an independent adjudicator, whose only knowledge consists of what is related to him by the given college of experts, can rationally bias the agreement procedure by including himself as an additional expert whose personal probability function is the uniform one (not an arbitrary one), in order to calculate a single social probability function, and then ask what would happen to this social probability function if his contribution happened to be infinitesimally small relative to that of the other experts. He showed that, in the case of the $\hat\Theta_N^{\mathrm{KL}}$-merging operator, this point of agreement is characterized by the most entropic point in the region defined by $\hat\Theta_N^{\mathrm{KL}}$. A similar theorem for the $\Theta_N^{\mathrm{KL}}$-merging operator was proven in [10]. In what follows, we adapt these results to our general situation.
The following theorem tells us that, in a particular case of W1,…, Wn ⊆ ⅅJ, the set $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is always a singleton.
Theorem 18. Let W1,…, Wn ⊆ ⅅJ be closed, convex, nonempty and such that, for at least one i, Wi is a singleton. Let Df be a convex Bregman divergence, which is strictly convex in its second argument, and a ∈ ⅅn. Then, $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is a singleton.
Proof. Without loss of generality, assume that W1 = {v}. For a contradiction, suppose that $w, r \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ and w ≠ r. Denote by w(2),…, w(n) the Df-projections of w into W2,…, Wn, respectively, and by r(2),…, r(n) the Df-projections of r into W2,…, Wn, respectively. By definition, w = LinOpa(v, w(2),…, w(n)) and r = LinOpa(v, r(2),…, r(n)).
Now, consider x = λw + (1 − λ)r for some λ ∈ (0, 1). By Theorems 9 and 12, we have that $x \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. Since Df(·‖·) is a convex function, by Jensen's inequality, we have that:
$$a_1 D_f(v\,\|\,x) + \sum_{i=2}^n a_i D_f\bigl(\lambda w^{(i)} + (1-\lambda) r^{(i)}\,\|\,x\bigr) \le \lambda\Bigl(a_1 D_f(v\,\|\,w) + \sum_{i=2}^n a_i D_f(w^{(i)}\,\|\,w)\Bigr) + (1-\lambda)\Bigl(a_1 D_f(v\,\|\,r) + \sum_{i=2}^n a_i D_f(r^{(i)}\,\|\,r)\Bigr). \tag{14}$$
However, since $w, r, x \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ and λw(i) + (1 − λ)r(i) ∈ Wi, 1 ≤ i ≤ n, the above is possible only with equality.
On the other hand, since Df is strictly convex in its second argument, the following Jensen inequality is strict:
$$D_f(v\,\|\,x) < \lambda D_f(v\,\|\,w) + (1-\lambda) D_f(v\,\|\,r).$$
Note that the border points λ = 0, 1 are excluded. Therefore, Equation (14) yields:
$$\sum_{i=2}^n a_i D_f\bigl(\lambda w^{(i)} + (1-\lambda) r^{(i)}\,\|\,x\bigr) > \lambda\sum_{i=2}^n a_i D_f(w^{(i)}\,\|\,w) + (1-\lambda)\sum_{i=2}^n a_i D_f(r^{(i)}\,\|\,r).$$
However, this contradicts Jensen's inequality. □
Theorem 19 (Chairman Theorem for $\Theta_A^{D_f}$). Let I ⊆ ⅅJ be a singleton consisting of an arbitrary probability function t ∈ ⅅJ. Let W1,…, Wn ⊆ ⅅJ be closed, convex and nonempty, a ∈ A be such that a ∈ ⅅn and Df be a convex Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define (by the previous theorem, the following set is a singleton):
$$\{v[\lambda]\} = \Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n).$$
Then, $\lim_{\lambda\to 0} v[\lambda]$ exists and equals:
$$\arg\min_{v \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)} D_f(t\,\|\,v),$$
i.e., it equals the conjugated Df-projection of the probability function t into $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
Proof. This proof is inspired by [30], where a slightly stronger result is proven for the special case of $\Theta_N^{D_f}$. We note that Theorem 9 from Section 2.3 is implicitly used in what follows.
First, denote by $M_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ the minimal value of:
$$\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v)$$
subject to w(i) ∈ Wi, 1 ≤ i ≤ n, and v ∈ ⅅJ. Furthermore, denote by Eλ the minimal value of:
$$(1-\lambda)\Bigl[\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v) - M_{\mathbf a}^{D_f}(W_1,\dots,W_n)\Bigr] + \lambda\, D_f(t\,\|\,v) \tag{15}$$
subject to w(i) ∈ Wi, 1 ≤ i ≤ n, and v ∈ ⅅJ. By the definition of $M_{\mathbf a}^{D_f}(W_1,\dots,W_n)$, we have that 0 ≤ Eλ for all 1 > λ > 0.
Note that, for a fixed λ, if v ∈ ⅅJ globally minimizes Equation (15) subject to w(i) ∈ Wi, 1 ≤ i ≤ n, then $v \in \Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n)$ (by Theorem 18, such a v is unique); conversely, if $v \in \Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n)$, then v minimizes Equation (15) subject to the above constraints.
Now, let $r = \arg\min_{v \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)} D_f(t\,\|\,v)$. Since $r \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$, it follows that for all 1 > λ > 0, we have that:
$$E_\lambda \le \lambda\, D_f(t\,\|\,r). \tag{16}$$
Since ⅅJ ⊆ ℝJ is a compact space, there exists a sequence $\{\lambda_m\}_{m=1}^\infty$, $0 < \lambda_m < 1$, $\lim_{m\to\infty}\lambda_m = 0$, such that $\{v[\lambda_m]\}_{m=1}^\infty$ converges. Let $w^{(i)}[\lambda_m]$ be the Df-projection of $v[\lambda_m]$ into Wi for all 1 ≤ i ≤ n and m = 1, 2,…. By Equation (16), the sequence:
$$\Bigl\{\sum_{i=1}^n a_i D_f\bigl(w^{(i)}[\lambda_m]\,\|\,v[\lambda_m]\bigr)\Bigr\}_{m=1}^\infty$$
converges to $M_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
Note that we already know that $\lim_{m\to\infty} v[\lambda_m]$ exists, and we denote it by v. However, we do not know whether the same is true for $\lim_{m\to\infty} w^{(i)}[\lambda_m]$, 1 ≤ i ≤ n. On the other hand, since W1,…, Wn are compact, the considered sequences have convergent subsequences. Let us denote the corresponding limits by w(1),…, w(n). Since Df(·‖·) is a continuous function in both variables, the value of $\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v)$ must be equal to $M_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. However, this means that we have found a global minimizer $(w^{(1)},\dots,w^{(n)},v)$ of $\sum_{i=1}^n a_i D_f(w^{(i)}\,\|\,v)$ subject to w(i) ∈ Wi, 1 ≤ i ≤ n, and v ∈ ⅅJ.
It follows that $v = \lim_{m\to\infty} v[\lambda_m] \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. By Equation (16):
$$0 \le (1-\lambda_m)\Bigl[\sum_{i=1}^n a_i D_f\bigl(w^{(i)}[\lambda_m]\,\|\,v[\lambda_m]\bigr) - M_{\mathbf a}^{D_f}(W_1,\dots,W_n)\Bigr] + \lambda_m D_f\bigl(t\,\|\,v[\lambda_m]\bigr) \le \lambda_m D_f(t\,\|\,r).$$
Hence, $0 \le \lambda_m\bigl[D_f(t\,\|\,r) - D_f(t\,\|\,v[\lambda_m])\bigr]$ for all m = 1, 2,…. However, by the definition of r, this is possible only if r = v.
In fact, we have proven that every sequence $\{\lambda_m\}_{m=1}^\infty$ with $\lim_{m\to\infty}\lambda_m = 0$ for which $\{v[\lambda_m]\}_{m=1}^\infty$ is convergent must have $\{v[\lambda_m]\}_{m=1}^\infty$ converging to r. Therefore, assume that there is a sequence $\{\lambda_m\}_{m=1}^\infty$ with $\lim_{m\to\infty}\lambda_m = 0$ for which $\{v[\lambda_m]\}_{m=1}^\infty$ is not convergent. Then, there is an open neighborhood of the point r outside of which there are infinitely many members of the sequence $\{v[\lambda_m]\}_{m=1}^\infty$. Since ⅅJ is compact, this sequence must have a convergent subsequence with a limit distinct from r. That, however, contradicts our previous claim. □
The theorem above is illustrated in Figure 13. Indeed, if $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is a singleton, then the limit in the theorem above is obvious. By Theorem 18, this happens in particular when at least one of W1,…, Wn is a singleton. An interesting case is, however, not hard to observe: consider W1,…, Wn with a nonempty intersection that is not a singleton. In this case, the limit above is, in fact, the conjugated Df-projection of the probability function t into that intersection. Such a conjugated projection depends on t; in particular, we can recover any point in the intersection by setting it to be the point t.
The following analogue of Theorem 18 has a fairly similar proof.
Theorem 20. Let W1,…, Wn ⊆ ⅅJ be closed, convex, nonempty and such that, for at least one i, Wi is a singleton. Let Df be a convex Bregman divergence, which is strictly convex in its second argument, and a ∈ ⅅn. Then, $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is a singleton.
Theorem 21 (Chairman Theorem for $\hat\Theta_A^{D_f}$). Let I ⊆ ⅅJ be a singleton consisting of an arbitrary probability function t ∈ ⅅJ. Let W1,…, Wn ⊆ ⅅJ be closed, convex and nonempty, a ∈ A be such that a ∈ ⅅn and Df be a convex differentiable Bregman divergence, which is strictly convex in its second argument. For 1 > λ > 0, define:
$$\{v[\lambda]\} = \hat\Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n).$$
Then, $\lim_{\lambda\to 0} v[\lambda]$ exists and equals:
$$\arg\min_{v \in \hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)} D_f(v\,\|\,t),$$
i.e., it equals the Df-projection of the probability function t into $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
The proof is analogous to the one of Theorem 19, so we omit it.

4. Applications

4.1. Relationship to Inference Processes

In this subsection, we will discuss some striking relationships between the chairmen theorems and the framework of inference processes [26]. Inference processes are methods of reasoning by which an expert may select a single probability function from a nonempty closed convex set of possible options. In our framework, it is simply a problem of choosing a single probability function in a closed convex nonempty set W ⊆ ⅅJ. This selection is, however, not arbitrary, and it is expected to satisfy some rational principles based on symmetry and consistency, as discussed in [15]. The maximum entropy (ME) inference process, which chooses the most entropic point in a given closed convex nonempty set, is uniquely justified by a list of such principles, as Paris and Vencovská showed [15].
As discussed in Section 1.2, the most entropic point in a closed convex nonempty set W ⊆ ⅅJ coincides with the KL-projection of the uniform probability function into W. This can be immediately applied to the chairman theorem for $\hat\Theta_A^{\mathrm{KL}}$, where A is a family of weighting vectors:
Let I ⊆ ⅅJ be a singleton consisting of the uniform probability function t ∈ ⅅJ. Let W1,…, Wn ⊆ ⅅJ be closed, convex and nonempty and a ∈ A be such that a ∈ ⅅn. For 1 > λ > 0, define:
$$\{v[\lambda]\} = \hat\Theta^{\mathrm{KL}}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n).$$
Then:
$$\lim_{\lambda\to 0} v[\lambda] = \mathrm{ME}\bigl(\hat\Theta_{\mathbf a}^{\mathrm{KL}}(W_1,\dots,W_n)\bigr).$$
For the family of weighting vectors:
$$N = \Bigl\{\bigl(\underbrace{\tfrac{1}{n},\dots,\tfrac{1}{n}}_{n}\bigr) : n = 1, 2, \dots\Bigr\},$$
the operator that results from applying the ME-inference process to the operator $\hat\Theta_N^{\mathrm{KL}}$ is, in fact, a probabilistic merging operator that was introduced and studied by Wilmers in [18] under the name “social entropy process”, or SEP for short. In that paper, Wilmers argues that this merging operator is, to date, the most appealing one with respect to symmetry and consistency, somewhat in the spirit of the original justification of the ME-inference process, although the problem of finding a complete justification is still open.
Whether or not SEP turns out to be the most appealing probabilistic merging operator, in the same manner as above we can define probabilistic merging operators related to several other classical inference processes.
For example, the conjugated KL-projection of the uniform probability function into a closed convex nonempty set W ⊆ ⅅJ in fact generates the so-called CM-inference process (a limit version of the centre of mass process [26]). We write simply CM(W) to denote the point of the projection, which is explicitly given by:
$$\mathrm{CM}(W) = \arg\min_{w \in W}\ -\sum_{j=1}^J \log w_j.$$
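As an illustration, both ME(W) and CM(W) can be computed numerically by any convex solver. The following Python sketch evaluates them for a hypothetical set W given by the constraint w1 ≥ 1/2 on ⅅ4; the constraint, the solver and the bound choices are arbitrary assumptions made for the sake of the example.

  import numpy as np
  from scipy.optimize import minimize

  J = 4
  constraints = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},   # stay in the simplex
                 {'type': 'ineq', 'fun': lambda w: w[0] - 0.5})      # hypothetical set W
  bounds = [(1e-9, 1.0)] * J
  w0 = np.full(J, 1.0 / J)

  # ME(W): the most entropic point, i.e. the KL-projection of the uniform function
  me = minimize(lambda w: float(np.sum(w * np.log(w))), w0,
                bounds=bounds, constraints=constraints).x

  # CM(W): the conjugated KL-projection of the uniform function
  cm = minimize(lambda w: float(-np.sum(np.log(w))), w0,
                bounds=bounds, constraints=constraints).x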
The chairman theorem for $\Theta_N^{\mathrm{KL}}$ then suggests considering the probabilistic merging operator defined for every n ≥ 1 and all closed convex nonempty sets W1,…, Wn ⊆ ⅅJ by:
$$\mathrm{coSEP}(W_1,\dots,W_n) = \bigl\{\mathrm{CM}\bigl(\Theta_{\mathbf a}^{\mathrm{KL}}(W_1,\dots,W_n)\bigr)\bigr\},$$
where a ∈ ⅅn and a ∈ N. We will call this operator the conjugated social entropy process, coSEP for short.
What is really appealing about the operators SEP and coSEP is that they are singletons; we simply say that they satisfy the singleton principle (SP). Furthermore, the consistency principle (CP) is obviously satisfied by both of them. However, there is an interesting principle that can never be satisfied by a probabilistic merging operator that satisfies (CP) and is always a singleton: the disagreement principle introduced in [5].
(DP) Disagreement Principle. Let Δ be a probabilistic merging operator. Then, we say that Δ satisfies the disagreement principle if, for every n, m ≥ 1 and all W1,…, Wn ⊆ ⅅJ and V1,…, Vm ⊆ ⅅJ:
$$\Delta(W_1,\dots,W_n) \cap \Delta(V_1,\dots,V_m) = \emptyset$$
implies:
$$\Delta(W_1,\dots,W_n,V_1,\dots,V_m) \cap \Delta(W_1,\dots,W_n) = \emptyset.$$
We cite [5] on the desirability of this principle: the principle (informally) says “… that a consistent group who disagrees with another group and then merges with them can be sure that they have influenced the opinions of the combined group.”
Theorem 22. There is no probabilistic merging operator that satisfies all of (SP), (CP) and (DP).
Proof. Let Δ be a probabilistic merging operator satisfying (SP) and (CP). Choose V ⊆ W ⊆ ⅅJ such that V is a singleton and Δ(W) ≠ V; this is possible whenever W is not a singleton. By (CP), Δ(V) = V, and by (SP), Δ(V) ∩ Δ(W) = ∅. Hence, if Δ satisfied (DP), we would have Δ(V, W) ∩ Δ(V) = ∅. However, by (CP), Δ(V, W) = V = Δ(V), a contradiction. □
Theorem 23. The probabilistic merging operators $\Theta_N^{D_f}$ and $\hat\Theta_N^{D_f}$, where Df is a convex Bregman divergence for the former and is additionally differentiable and strictly convex in its second argument for the latter, satisfy (DP).
Proof. We prove the theorem only for $\hat\Theta_N^{D_f}$; the proof for $\Theta_N^{D_f}$ is similar.
Let W1,…, Wn, V1,…, Vm ⊆ ⅅJ be closed, convex and nonempty. For a contradiction, assume that $v \in \hat\Theta^{D_f}_{(\frac1n,\dots,\frac1n)}(W_1,\dots,W_n)$, $v \in \hat\Theta^{D_f}_{(\frac{1}{n+m},\dots,\frac{1}{n+m})}(W_1,\dots,W_n,V_1,\dots,V_m)$ and, at the same time, $v \notin \hat\Theta^{D_f}_{(\frac1m,\dots,\frac1m)}(V_1,\dots,V_m)$.
Denote by v(i) the conjugated Df-projection of v into Vi, 1 ≤ i ≤ m. Then, there is u ∈ ⅅJ, namely $u = \mathrm{Pool}^{D_f}_{(\frac1m,\dots,\frac1m)}(v^{(1)},\dots,v^{(m)})$, such that $\sum_{i=1}^m \tfrac1m D_f(v\,\|\,v^{(i)}) > \sum_{i=1}^m \tfrac1m D_f(u\,\|\,v^{(i)})$. Since every Bregman divergence is strictly convex in its first argument, we have that:
$$\frac{\partial}{\partial\lambda}\Bigl[\sum_{i=1}^m D_f\bigl((1-\lambda)v + \lambda u\,\|\,v^{(i)}\bigr)\Bigr]\Big|_{\lambda=0} < 0. \tag{17}$$
Now, denote by w(i) the conjugated Df-projection of v into Wi, 1 ≤ i ≤ n. Since $v = \mathrm{Pool}^{D_f}_{(\frac{1}{n+m},\dots,\frac{1}{n+m})}(w^{(1)},\dots,w^{(n)},v^{(1)},\dots,v^{(m)})$ and $v = \mathrm{Pool}^{D_f}_{(\frac{1}{n},\dots,\frac{1}{n})}(w^{(1)},\dots,w^{(n)})$, the strict convexity of Bregman divergences in their first argument also gives:
$$\frac{\partial}{\partial\lambda}\Bigl[\sum_{i=1}^n D_f\bigl((1-\lambda)v + \lambda u\,\|\,w^{(i)}\bigr) + \sum_{i=1}^m D_f\bigl((1-\lambda)v + \lambda u\,\|\,v^{(i)}\bigr)\Bigr]\Big|_{\lambda=0} \ge 0$$
and:
$$\frac{\partial}{\partial\lambda}\Bigl[\sum_{i=1}^n D_f\bigl((1-\lambda)v + \lambda u\,\|\,w^{(i)}\bigr)\Bigr]\Big|_{\lambda=0} \ge 0.$$
However, this contradicts Equation (17). □
We can conclude that, before deciding which probabilistic merging operator to use, we need to establish which two of the three properties we want the operator to satisfy. In this paper, we have seen instances of all three options, as listed in Table 1.
Recall that KIRP is the operator due to Kern-Isberner and Rödder and OSEP is the obdurate social entropy process; see Section 2.2 for more details. A proof that KIRP and OSEP satisfy (DP) can be easily obtained as a modification of the proof of Theorem 23, so we omit it.

4.2. Computability

In this subsection, we would like to propose a method corresponding to the classical method of projection, but in the multi-expert context. The possible use would be similar: if the knowledge of a college of experts could be characterized by a closed convex nonempty set of probability functions, then we would like to find the probability function in that set that is “closest” to a given piece of information represented by another probability function. We only need to specify a way to represent the knowledge of the college by such a single set and pair it with an appropriate method of projection.
Throughout this subsection, assume that we are given closed convex nonempty sets of probability functions W1,…, Wn ⊆ ⅅJ with a weighting a ∈ A, where ai is the weight of Wi, and a probability function v ∈ ⅅJ to represent.
If the measure of “being close” is quantified by a projection by means of a convex differentiable Bregman divergence Df, which is strictly convex in its second argument, our proposed method consists of the following. First, represent W1,…, Wn by a single closed convex set $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$, and then take the Df-projection of v into $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
On the other hand, if the measure of “being close” is quantified by a conjugated projection by means of a convex differentiable Bregman divergence Df, which is strictly convex in its second argument, we first represent W1,…, Wn by a single closed convex set $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ and then take the conjugated Df-projection of v into $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
The methods have two distinguishing features:
  • If all of the sets W1,…, Wn are singletons, the methods reduce to the $\mathrm{Pool}_A^{D_f}$ and $\mathrm{LinOp}_A$ pooling operators, respectively.
  • If W1,…, Wn have a nonempty intersection V, the methods reduce to Df-projections and conjugated Df-projections into V, respectively.
In this subsection, we shall investigate how effective it is to compute the results of those two methods. Notice that SEP and coSEP, defined in Section 4.1, are specific instances of those procedures, respectively, in which case, we are interested in KL-projections and conjugated KL-projections of the uniform probability function.
There are indeed some serious computational issues. The most essential is the following. A closed convex nonempty set W ⊆ ⅅJ is often given by a set of constraints on ⅅJ. How can we effectively verify that the resulting set W is nonempty? Unfortunately, there is not even a random Turing machine running in polynomial time that, upon an input given by a set of constraints on probability functions, verifies the consistency of this set of constraints (provided that the problems solvable in randomized polynomial time cannot all be solved in polynomial time); see Theorem 10.7 of [26].
However, some computational problems closely related to projections have been extensively studied in the literature. As we noted in Section 3.1, this includes procedures for finding a KL-projection into a closed convex set of probability functions. These show that in many practical implementations the problem of intractability does not arise, e.g., in the case when the given closed convex nonempty sets are generated by marginal probability functions, where the IPFP procedure can be applied to find a KL-projection effectively; see [16]. Therefore, we will assume that some effective procedures for Df-projections and conjugated Df-projections are given.
Under such an assumption, the iterative processes from Section 3.1 and the chairmen theorems offer a way to compute (although possibly inefficiently) the results of the two methods above. We shall start with the latter.
By Theorem 16, we know that the sequence:
$$\{v[i]\}_{i=0}^\infty,$$
where v[0] = t is arbitrary in ⅅJ and $v[i+1] = F^{D_f,A}_{[W_1,\dots,W_n]}(v[i])$, converges to some probability function in $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. Notice that Df is required to be differentiable in order to establish this conclusion.
Recall that, by Theorem 18, $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is a singleton when at least one of W1,…, Wn is a singleton. Let I ⊆ ⅅJ be such that I = {v}. For every 1 > λ > 0, we define the sequence $\{v[\lambda][i]\}_{i=0}^\infty$ by $v[\lambda][0] = t$ (t can be arbitrary) and:
$$v[\lambda][i+1] = F^{D_f,(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}_{[I,W_1,\dots,W_n]}\bigl(v[\lambda][i]\bigr).$$
By Theorem 16:
$$\Bigl\{\lim_{i\to\infty} v[\lambda][i]\Bigr\} = \Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n).$$
By the chairman theorem for $\Theta_A^{D_f}$:
$$\lim_{\lambda\to 0}\lim_{i\to\infty} v[\lambda][i] = \arg\min_{w \in \Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)} D_f(v\,\|\,w), \tag{18}$$
i.e., the double limit equals the conjugated Df-projection of the probability function v into $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
Now, notice that if the limits in Equation (18) were interchangeable, then this would answer the question from Section 3.1 of characterizing the limit $\lim_{i\to\infty} v[i]$ more closely (but with no claims to any theoretical results on the complexity of the computation). Unfortunately, the following simple example, introduced in [10], shows that these limits are not interchangeable.
Example 6. Let J = 4, $W_1 = \{(x,\ \tfrac14 - x,\ y,\ \tfrac34 - y) : x \in [0.01, \tfrac14 - 0.01],\ y \in [0.01, \tfrac34 - 0.01]\}$ and $W_2 = \{(x,\ y,\ \tfrac14 - x,\ \tfrac34 - y) : x \in [0.01, \tfrac14 - 0.01],\ y \in [0.01, \tfrac34 - 0.01]\}$. Assume that the weighting is N, Df = KL and the probability function v ∈ ⅅ4 to interpret is the uniform probability function. In other words, we are looking for coSEP(W1, W2).
Then, the members of the sequence $\{v[i]\}_{i=0}^\infty$ can be computed by two minimization problems: find $x \in [0.01, \tfrac14 - 0.01]$ and $y \in [0.01, \tfrac34 - 0.01]$ that minimize:
$$x\log\frac{x}{v_1[i]} + \Bigl(\tfrac14 - x\Bigr)\log\frac{\tfrac14 - x}{v_2[i]} + y\log\frac{y}{v_3[i]} + \Bigl(\tfrac34 - y\Bigr)\log\frac{\tfrac34 - y}{v_4[i]}$$
and another couple $\bar x \in [0.01, \tfrac14 - 0.01]$ and $\bar y \in [0.01, \tfrac34 - 0.01]$ that minimize:
$$\bar x\log\frac{\bar x}{v_1[i]} + \bar y\log\frac{\bar y}{v_2[i]} + \Bigl(\tfrac14 - \bar x\Bigr)\log\frac{\tfrac14 - \bar x}{v_3[i]} + \Bigl(\tfrac34 - \bar y\Bigr)\log\frac{\tfrac34 - \bar y}{v_4[i]}.$$
Then, $v_1[i+1] = \frac{x + \bar x}{2}$, $v_2[i+1] = \frac{(\frac14 - x) + \bar y}{2}$, $v_3[i+1] = \frac{(\frac14 - \bar x) + y}{2}$ and $v_4[i+1] = \frac{(\frac34 - y) + (\frac34 - \bar y)}{2}$. After setting $v[0] = (\tfrac14, \tfrac14, \tfrac14, \tfrac14)$, it turns out that in each iteration $\bar x = x$ and $\bar y = y$.
After performing the numerical computation for the first one hundred iterations, we obtain:
$$v[100] \approx (0.0488395,\ 0.2011605,\ 0.2011605,\ 0.5488394).$$
The rate of convergence for the first coordinate of the probability functions is depicted in Figure 14 by the bottom red line.
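The iteration above is straightforward to reproduce. The following Python sketch is our own illustrative implementation, using scipy's bounded minimizer for the two KL-projections; it regenerates the sequence $\{v[i]\}_{i=0}^\infty$ whose limit is reported above.

  import numpy as np
  from scipy.optimize import minimize

  def kl(p, q):
      return float(np.sum(p * np.log(p / q)))

  def proj_W1(v):
      # KL-projection of v into W1 = {(x, 1/4 - x, y, 3/4 - y)}
      w = lambda z: np.array([z[0], 0.25 - z[0], z[1], 0.75 - z[1]])
      z = minimize(lambda s: kl(w(s), v), x0=[0.1, 0.3],
                   bounds=[(0.01, 0.24), (0.01, 0.74)]).x
      return w(z)

  def proj_W2(v):
      # KL-projection of v into W2 = {(x, y, 1/4 - x, 3/4 - y)}
      w = lambda z: np.array([z[0], z[1], 0.25 - z[0], 0.75 - z[1]])
      z = minimize(lambda s: kl(w(s), v), x0=[0.1, 0.3],
                   bounds=[(0.01, 0.24), (0.01, 0.74)]).x
      return w(z)

  v = np.full(4, 0.25)
  for _ in range(100):
      v = 0.5 * proj_W1(v) + 0.5 * proj_W2(v)  # LinOp with weighting (1/2, 1/2)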
However, since W1 and W2 are jointly consistent, we have that:
$$\Theta^{D_f}_{(\frac12,\frac12)}(W_1, W_2) = W_1 \cap W_2 = \Bigl\{\Bigl(x,\ \tfrac14 - x,\ \tfrac14 - x,\ \tfrac12 + x\Bigr) : x \in \Bigl[0.01, \tfrac{0.96}{4}\Bigr]\Bigr\}.$$
We compute that CM(W1 ∩ W2) (the conjugated KL-projection of the uniform probability function) is approximately:
$$(0.091506,\ 0.15849,\ 0.15849,\ 0.5915),$$
which is obviously not equal to the limit of the sequence $\{v[i]\}_{i=0}^\infty$.
It seems that the only viable way to use Equation (18) to estimate the result of the conjugated Df-projection into $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$ is to choose a sufficiently small λ and, for this λ, iterate the sequence $\{v[\lambda][i]\}_{i=0}^\infty$. However, the rate of convergence depends heavily on λ, and, in fact, this often materializes in a negative way for practical computation [10]:
Example 7. Consider the situation from Example 6. We compute numerically the first coordinate of the initial members of the sequence $\{v[\lambda][i]\}_{i=0}^\infty$ for several values of λ, and we compare them with the first coordinate of the sequence $\{v[i]\}_{i=0}^\infty$. The algorithm we use is as follows. Note that, due to the design of the sets, only one minimization problem needs to be solved in each iteration, as we pointed out in the previous example.
  v1 := 1/4; v2 := 1/4; v3 := 1/4; v4 := 1/4;
  for i from 1 by 1 to 200 do
    Minimize(x*log(x/v1) + (1/4 - x)*log((1/4 - x)/v2)
               + y*log(y/v3) + (3/4 - y)*log((3/4 - y)/v4),
             x = 0.01 .. 0.96/4, y = 0.01 .. 2.96/4);
    v1 := (1/4)*lambda + x*(1/2 - (1/2)*lambda) + x*(1/2 - (1/2)*lambda);
    v2 := (1/4)*lambda + (1/4 - x)*(1/2 - (1/2)*lambda) + y*(1/2 - (1/2)*lambda);
    v3 := (1/4)*lambda + (1/4 - x)*(1/2 - (1/2)*lambda) + y*(1/2 - (1/2)*lambda);
    v4 := (1/4)*lambda + (3/4 - y)*(1/2 - (1/2)*lambda) + (3/4 - y)*(1/2 - (1/2)*lambda);
  end do;
The numerical results for $\lambda = \frac{1}{21}, \frac{1}{41}, \frac{1}{61}$ are plotted in Figure 14. We can see that, as λ decreases, the limit points of the sequences converge to the first coordinate of CM(W1 ∩ W2), which is denoted by the black dotted line. The red line denotes the first coordinate of the sequence $\{v[i]\}_{i=0}^\infty$.
The numerical results for $\lambda = \frac{1}{61}, \frac{1}{121}, \frac{1}{181}$ are plotted in Figure 15. We can conclude that, although the eventual precision rises as λ decreases, the rate of convergence is affected severely. Therefore, there is a significant trade-off between the precision and the number of iterations.
Notice that, as λ decreases, the blue lines converge point-wise to the red line. This convergence is, however, obviously not uniform.
Now, consider the former method, which follows a fairly similar computational idea. By Theorem 17, we know that the sequence:
$$\{u[i]\}_{i=0}^\infty,$$
where u[0] = t is arbitrary in ⅅJ and $u[i+1] = \hat F^{D_f,A}_{[W_1,\dots,W_n]}(u[i])$, converges to some probability function in $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. This procedure can, for instance, be used immediately to compute SEP(W1,…, Wn) in the case when $\hat\Theta^{\mathrm{KL}}_{(\frac1n,\dots,\frac1n)}(W_1,\dots,W_n)$ is a singleton. By Theorem 20, this happens when at least one of W1,…, Wn is a singleton.
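For Df = KL, the two ingredients of F̂ take a concrete form: the conjugated KL-projection minimizes KL(v‖w) over w in the given set, and $\mathrm{Pool}_{\mathbf a}^{\mathrm{KL}}$ is the normalized weighted geometric mean (the LogOp pooling operator). The following Python sketch of one step of F̂ rests on these observations; the helper names and the parametrized representation of the sets are, again, our own illustrative assumptions.

  import numpy as np
  from scipy.optimize import minimize

  def kl(p, q):
      return float(np.sum(p * np.log(p / q)))

  def conj_kl_projection(v, parametrize, bounds, z0):
      # conjugated KL-projection: minimize KL(v || w) over w in the set
      res = minimize(lambda z: kl(v, parametrize(z)), x0=z0, bounds=bounds)
      return parametrize(res.x)

  def pool_kl(ws, a):
      # Pool_a^{KL}: normalized weighted geometric mean of the arguments
      g = np.exp(sum(ai * np.log(w) for ai, w in zip(a, ws)))
      return g / g.sum()

  def F_hat_step(u, sets, a):
      # one application of F-hat: conjugated projections, then KL-pooling
      return pool_kl([conj_kl_projection(u, *spec) for spec in sets], a)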
One may perhaps expect that, if u[0] is the uniform probability function, then $\{\lim_{i\to\infty} u[i]\} = \mathrm{SEP}(W_1,\dots,W_n)$. In the following example from [10], we will, however, see that this is not true in general. Note that we cannot use Example 6, since in that case, actually, $\{\lim_{i\to\infty} u[i]\} = \mathrm{SEP}(W_1, W_2)$.
Example 8. Let J = 8,
$$W_1 = \Bigl\{\Bigl(x,\ \tfrac{1}{12} - x,\ \tfrac{1}{12} - x,\ \tfrac{2}{6} + x,\ y,\ \tfrac{1}{6} - y,\ \tfrac{1}{6},\ \tfrac{1}{6}\Bigr) : x \in \Bigl[0.01, \tfrac{0.88}{12}\Bigr],\ y \in \Bigl[0.01, \tfrac{0.94}{6}\Bigr]\Bigr\}$$
and:
$$W_2 = \Bigl\{\Bigl(x,\ \tfrac{1}{12} - x,\ \tfrac{1}{12} - x,\ \tfrac{2}{6} + x,\ \tfrac{1}{12},\ \tfrac{1}{12},\ y,\ \tfrac{2}{6} - y\Bigr) : x \in \Bigl[0.01, \tfrac{0.88}{12}\Bigr],\ y \in \Bigl[0.01, \tfrac{1.94}{6}\Bigr]\Bigr\}.$$
W1 and W2 have a nonempty intersection, $W_1 \cap W_2 = \{(x,\ \tfrac{1}{12} - x,\ \tfrac{1}{12} - x,\ \tfrac{2}{6} + x,\ \tfrac{1}{12},\ \tfrac{1}{12},\ \tfrac{1}{6},\ \tfrac{1}{6}) : x \in [0.01, \tfrac{0.88}{12}]\}$, and we can compute that SEP(W1, W2) is the most entropic probability function from the set above, with x approximately equal to 0.013888.
However, the sequence $\{u[i]\}_{i=0}^\infty$ is already constant after one iteration and equals CM(W1) = CM(W2) = CM(W1 ∩ W2), in which case x ≈ 0.029231.
With the aid of the chairman theorem for $\hat\Theta_{\mathbf a}^{D_f}$, we also suggest a way to approximate the Df-projection of v into $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$, although we make no claims to any theoretical results on the complexity of the computation. Let I = {v}. For every 1 > λ > 0, we define the sequence $\{u[\lambda][i]\}_{i=0}^\infty$ by $u[\lambda][0] = t$, which is arbitrary, and:
$$u[\lambda][i+1] = \hat F^{D_f,(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}_{[I,W_1,\dots,W_n]}\bigl(u[\lambda][i]\bigr).$$
By Theorem 17:
$$\Bigl\{\lim_{i\to\infty} u[\lambda][i]\Bigr\} = \hat\Theta^{D_f}_{(\lambda,\,a_1-\lambda a_1,\dots,a_n-\lambda a_n)}(I, W_1,\dots,W_n).$$
By the chairman theorem for $\hat\Theta_A^{D_f}$:
$$\lim_{\lambda\to 0}\lim_{i\to\infty} u[\lambda][i] = \arg\min_{w \in \hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)} D_f(w\,\|\,v), \tag{19}$$
i.e., the double limit equals the Df-projection of the probability function v into $\hat\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$.
In particular, to approximate SEP(W1,…, Wn) using Equation (19), one needs to choose a sufficiently small λ and then iterate the sequence $\{u[\lambda][i]\}_{i=0}^\infty$, where $u[\lambda][0] = v$ is the uniform probability function, A = N and Df = KL. However, the question of how to determine such a λ and i in order to achieve a specific level of accuracy merits further investigation.
The special case of the problem above, when W1,…, Wn have a nonempty intersection, was extensively studied in the literature, and many scientific and engineering problems can be expressed as the problem of finding a point in such an intersection. Bregman showed in [7] the convergence of (what are now called) cyclic Bregman projections to a point in the intersection (we use the notion of a Bregman divergence only over the Euclidean space, but in [7], a more general topological vector space was considered). Many cyclic algorithms with appealing applications have been developed since then; see, e.g., [31,32].
Although the approach we propose allows for an empty intersection and always leads to a meaningful point (in particular, if the intersection is nonempty, it chooses a point inside the intersection), our study cannot be considered an extension of the classical method of cyclic projections, which was developed over (possibly infinite-dimensional) Banach spaces [33], in contrast to the limited discrete probabilistic space that we consider.
It is also worth mentioning that the method of cyclic projections, even in the case of an empty intersection, often provides more useful results than our method. An example is the noise reduction algorithm from [34].
One can perhaps conclude that the approach offered in this paper is, at best, only another contribution to the problem of finding a point in a convex set by means of geometry; it nevertheless offers some interesting insights into the combination of Bregman projections with pooling operators.

Acknowledgments

The author is indebted to George Wilmers, whose support and wisdom allowed the creation of this paper. Thanks go also to Alena Vencovská and František Matúš for sharing their ideas with the author and to an anonymous reviewer for pointing out the connections to the dual affine structure in the probabilistic simplex.
The paper is an extension of some results that the author obtained as a Ph.D. student at the University of Manchester while supported by the (European Community’s) Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 238381.
Last, but not least, the author is grateful for the support received from the Assumption University in Thailand, without which the paper could not be finished.

Conflicts of Interest

The author declares no conflict of interest other than those disclosed above in the acknowledgments.

References

  1. Amari, S. Divergence, Optimization and Geometry. In Neural Information Processing: 16th International Conference (ICONIP 2009); Leung, C., Lee, M., Chan, J.H., Eds.; Bangkok, Thailand, 2009; pp. 185–193.
  2. Hájek, P.; Havránek, T.; Jiroušek, J. Uncertain Information Processing in Expert Systems; CRC Press: Boca Raton, FL, USA, 1992.
  3. Collins, M.; Schapire, R.E. Logistic Regression, AdaBoost and Bregman Distances. Mach. Learn. 2002, 48, 253–285.
  4. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  5. Adamčík, M.; Wilmers, G.M. 2015, in press.
  6. De Finetti, B. Sul Significato Soggettivo della Probabilità. Fund. Math. 1931, 17, 298–329.
  7. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
  8. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  9. Rockafellar, R.T. Convex Analysis; Princeton Landmarks in Mathematics; Princeton University Press: Princeton, NJ, USA, 1997.
  10. Adamčík, M. Collective Reasoning under Uncertainty and Inconsistency. Ph.D. Thesis, The University of Manchester, Manchester, UK, 2014.
  11. Csiszár, I. I-Divergence Geometry of Probability Distributions and Minimization Problems. Ann. Probab. 1975, 3, 146–158.
  12. Amari, S.; Nagaoka, H. Methods of Information Geometry; AMS and Oxford University Press: New York, NY, USA, 2000.
  13. Jaynes, E.T. Where do we Stand on Maximum Entropy? In The Maximum Entropy Formalism; Levine, R.D., Tribus, M., Eds.; M.I.T. Press: Cambridge, MA, USA, 1979; pp. 15–118.
  14. Paris, J.B.; Vencovská, A. On the Applicability of Maximum Entropy to Inexact Reasoning. Int. J. Approx. Reason. 1989, 3, 1–34.
  15. Paris, J.B.; Vencovská, A. A Note on the Inevitability of Maximum Entropy. Int. J. Approx. Reason. 1990, 4, 183–224.
  16. Vomlel, J. Methods of Probabilistic Knowledge Integration. Ph.D. Thesis, Czech Technical University, Prague, Czech Republic, 1999.
  17. Banerjee, A.; Guo, X.; Wang, H. On the optimality of conditional expectation as a Bregman predictor. IEEE Trans. Inf. Theory 2005, 51, 2664–2669.
  18. Wilmers, G.M. The Social Entropy Process: Axiomatising the Aggregation of Probabilistic Beliefs. In Probability, Uncertainty and Rationality; Hosni, H., Montagna, F., Eds.; CRM Series; Pisa, Italy, 2010; pp. 87–104.
  19. Genest, C.; Zidek, J.V. Combining probability distributions: A critique and an annotated bibliography. Stat. Sci. 1986, 1, 114–135.
  20. Genest, C.; Wagner, C.G. Further Evidence Against Independence Preservation in Expert Judgement Synthesis. Aequ. Math. 1986, 32, 74–86.
  21. Matúš, F. On iterated averages of I-projections. In Statistik und Informatik; Universität Bielefeld: Bielefeld, Germany, 2007; pp. 1–12.
  22. Predd, J.B.; Osherson, D.N.; Kulkarni, S.R.; Poor, H.V. Aggregating Probabilistic Forecasts from Incoherent and Abstaining Experts. Decis. Anal. 2008, 5, 177–189.
  23. Kern-Isberner, G.; Rödder, W. Belief Revision and Information Fusion on Optimum Entropy. Int. J. Intell. Syst. 2004, 19, 837–857.
  24. Williamson, J. Deliberation, Judgement and the Nature of Evidence. Econ. Philos. 2014, in press.
  25. Carnap, R. On the application of inductive logic. Philos. Phenomenol. Res. 1947, 8, 133–148.
  26. Paris, J.B. The Uncertain Reasoner's Companion; Cambridge University Press: Cambridge, UK, 1994.
  27. Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. 2007, 19, 2780–2796.
  28. Csiszár, I.; Tusnády, G. Information Geometry and Alternating Minimization Procedures. Stat. Decis. 1984, 1, 205–237.
  29. Eggermont, P.P.B.; LaRiccia, V.N. On EM-like Algorithms for Minimum Distance Estimation; Preprint; University of Delaware: Newark, DE, USA, 1998; pp. 1–29.
  30. Wilmers, G.M. Generalising the Maximum Entropy Inference Process to the Aggregation of Probabilistic Beliefs; Preprint, Version 6; The University of Manchester: Manchester, UK, 2011; pp. 1–40.
  31. Bauschke, H.H. Projection Algorithms and Monotone Operators. Ph.D. Thesis, Simon Fraser University, Burnaby, BC, Canada, 1996.
  32. Censor, Y.; Zenios, S.A. Parallel Optimization: Theory, Algorithms, and Applications; Oxford University Press: New York, NY, USA, 1997.
  33. Bauschke, H.H.; Borwein, J.M.; Combettes, P.L. Bregman monotone optimization algorithms. SIAM J. Control Optim. 2003, 42, 596–636.
  34. Tofighi, M.; Kose, K.; Cetin, A.E. Denoising Using Projections Onto Convex Sets (POCS) Based Framework. arXiv 2013, arXiv:1309.0700.
Figure 1. An illustration of a divergence.
Figure 2. A Bregman divergence.
Figure 3. The extended Pythagorean property.
Figure 4. An illustration of an averaging projective procedure F.
Figure 5. The situation in the proof of Theorem 5.
Figure 6. The situation in the proof of Theorem 7 for n = 2.
Figure 7. The illustration of the four-point property.
Figure 8. The illustration of Example 4.
Figure 9. The situation in the proof of Theorem 10 for n = 2.
Figure 10. The illustration of Theorem 15.
Figure 11. The situation in the proof of Theorem 16.
Figure 12. The situation in the proof of Theorem 17.
Figure 13. The illustration of the chairman theorem for $\Theta_{\mathbf a}^{D_f}(W_1,\dots,W_n)$. Note that the fact that the v[λ]'s lie on the arrow does not have any meaning.
Figure 14. The numerical computation for Example 7: blue lines from the top correspond to $\lambda = \frac{1}{21}, \frac{1}{41}, \frac{1}{61}$, and the red line to the first coordinate of the sequence $\{v[i]\}_{i=0}^\infty$. This graph is taken from [10].
Figure 15. The numerical computation for Example 7: blue lines correspond to $\lambda = \frac{1}{61}, \frac{1}{121}, \frac{1}{181}$, and the black dotted line to the first coordinate of CM(W1 ∩ W2). This graph is taken from [10].
Table 1. Examples for the three saturated possibilities with respect to the consistency principle (CP), disagreement principle (DP) and singleton principle (SP). KIRP, Kern-Isberner and Rödder process; OSEP, obdurate social entropy process; SEP, social entropy process; coSEP, conjugated social entropy process.

Principles | Probabilistic Merging Operators
(DP), (CP) | $\Theta_N^{D_f}$, $\hat\Theta_N^{D_f}$
(DP), (SP) | KIRP, OSEP
(CP), (SP) | SEP, coSEP
