Article

Learning Functions and Approximate Bayesian Computation Design: ABCD

Markus Hainy 1,*, Werner G. Müller 1 and Henry P. Wynn 2

1 Department of Applied Statistics, Johannes Kepler University, 4040 Linz, Austria
2 Department of Statistics, London School of Economics, Houghton Street, London WC2A 2AE, UK
* Author to whom correspondence should be addressed.
Entropy 2014, 16(8), 4353-4374; https://doi.org/10.3390/e16084353
Submission received: 25 April 2014 / Revised: 18 July 2014 / Accepted: 28 July 2014 / Published: 4 August 2014
(This article belongs to the Special Issue Entropy in Experimental Design, Sensor Placement, Inquiry and Search)

Abstract

A general approach to Bayesian learning revisits some classical results, which identify which functionals on a prior distribution can be expected to increase, in a preposterior sense. The results are applied to information functionals of the Shannon type and to a class of functionals based on expected distance. A close connection is made between the latter and a metric embedding theory due to Schoenberg and others. For the Shannon type, there is a connection to majorization theory for distributions. A computational method is described to solve generalized optimal experimental design problems arising from the learning framework, based on a version of the well-known approximate Bayesian computation (ABC) method for carrying out the Bayesian analysis by Monte Carlo simulation. Some simple examples are given.

1. Introduction

A Bayesian approach to the optimal design of experiments uses some measure of preposterior utility, or information, to assess the efficacy of an experimental design or, more generally, the choice of sampling distribution. Various versions of this approach have been developed by Blackwell [1], and Torgersen [2] gives a clear account. Rényi [3], Lindley [4] and Goel and DeGroot [5] use information-theoretic approaches to measure the value of an experiment; see also the review paper by Ginebra [6]. Chaloner and Verdinelli [7] give a broad discussion of the Bayesian design of experiments, and Sebastiani and Wynn [8] also discuss the Bayes information-theoretic approach. There is wider interest in these issues in cognitive science and epistemology; see Chater and Oaksford [9].
When new data arrive, one can expect to improve the information about an unknown parameter θ. The key theorem, which is Theorem 2 here, gives conditions on information functionals for this to be the case; functionals satisfying these conditions will be called learning functionals. This class includes many special types of information, such as Shannon information, as special cases.
Section 2 gives the main theorems on learning functionals. We give our own simple proofs for completeness, and the material can be considered a compressed summary of what can be found in a quite scattered literature. We study two types of learning function: those we shall call the Shannon type and, in Section 3, those based on distances. For the latter, we shall make a new connection to the metric embedding theory contained in the work of Schoenberg, with a link to Bernstein functions [10,11]. This yields a wide class of new learning functions. Following two somewhat provocative counterexamples and a short discussion of surprise in Section 4, we relate learning functions of the Shannon type to the theory of majorization in Section 5. Section 6 specializes learning functions to covariance matrices.
We shall use the classical Bayes formulation with θ as an unknown parameter with a prior density π(θ) on a parameter space Θ and a sampling density f(x|θ) on an appropriate sample space. We denote by f_{X,θ}(x, θ) = f(x|θ)π(θ) the joint density of X and θ, and use f_X(x) for the marginal density of X. The nature of expectations will be clear from the notation. To make the development straightforward, we shall look at the case of distributions with densities (with respect to Lebesgue measure) or, occasionally, discrete distributions with finite support. All necessary conditions for conditional densities, integration and differentiation will be implicitly assumed.
In Section 7, approximate Bayesian computation (ABC) is applied to problems in optimal experimental design (hence, ABCD). We believe that an understanding of modern optimal experimental design and its computational aspects needs to be grounded in some understanding of learning. At the same time, there is added value in taking a wide interpretation of optimal design as a choice, with constraints, of the sampling distribution f(x|θ). Thus, one may index f(x|θ) by a control variable z and write f(x|θ, z) or f(x(z)|θ). Certain aspects of the distribution may depend on z, others not. An experimental design can be taken as the choice of a set of z, at each of which we take one or more observations, giving a multivariate distribution. In areas such as search theory and optimization, z may be a site at which one measures or observes with error. In spatial sampling, one may also use the term “site” for z. However, z could be a simple flag, which indicates one or another of somewhat unrelated experiments to estimate a common θ. In medicine, for example, one discusses different types of “intervention” for the same patient.

2. Information-Based Learning

The classical formulation proceeds as follows. Let U be a random variable with density f_U(u). Let g : R+ → R be a function and define a measure of information of the Shannon type for U with respect to g as
$$I_g(U) = E_U\big(g(f_U(U))\big).$$
When g(u) = log(u), we have Shannon information. When $g(u) = \frac{u^\gamma - 1}{\gamma}$ (γ > −1), we have a version similar to Rényi information, which is sometimes called Tsallis information [12].
If X represents the future observation, we can measure the preposterior information of the experiment (query, etc.), which generates a realization of X, by the prior expectation of the posterior information, which we define as:
$$I_g(\theta; X) = E_X E_{\theta|X}\big(g(\pi(\theta|X))\big) = E_{X,\theta}\big(g(\pi(\theta|X))\big).$$
In the second term, the inner expectation is with respect to the posterior (conditional) distribution of θ given X, namely π(θ|X), and the outside expectation is with respect to the marginal distribution of X. In the last term, the expectation is with respect to the full joint distribution of X and θ. We wish to compare Ig(θ;X) with the prior information:
$$I_g(\theta) = E_\theta\big(g(\pi(\theta))\big).$$

Theorem 1

For fixed g(u) and the standard Bayesian set-up, the preposterior quantity I_g(θ; X) and the prior value I_g(θ) satisfy
$$I_g(\theta; X) \ge I_g(\theta) = E_\theta\big(g(\pi(\theta))\big),$$
for all joint distributions f_{X,θ}(x, θ) if and only if h(u) = ug(u) is convex on R+.
We shall postpone the proof of Theorem 1 until after a more general result for functionals on densities:
$$\phi : \pi(\theta) \mapsto \mathbb{R}.$$

Theorem 2

For the standard Bayesian set-up and a functional φ(·),
$$\phi(\pi(\theta)) \le E_X\,\phi\big(\pi(\theta|X)\big)$$
for all joint distributions f_{X,θ}(x, θ) if and only if φ is convex as a functional:
$$\phi\big((1-\alpha)\pi_1 + \alpha\pi_2\big) \le (1-\alpha)\,\phi(\pi_1) + \alpha\,\phi(\pi_2),$$
for 0 ≤ α ≤ 1 and all π1, π2.

Proof

Note that taking expectations with respect to the marginal distribution of X amounts to a convex mixing, not dependent on θ. Thus, using Jensen’s inequality:
$$E_X\big(\phi(\pi(\theta|X))\big) \ge \phi\big(E_X(\pi(\theta|X))\big) = \phi(\pi(\theta)).$$
The necessity comes from a special construction. We show that, given a functional φ(·) and a triple {π1, π2, α} such that
$$\phi\big((1-\alpha)\pi_1 + \alpha\pi_2\big) > (1-\alpha)\,\phi(\pi_1) + \alpha\,\phi(\pi_2),$$
we can find a pair {f(x|θ), π(θ)} such that
$$\phi(\pi(\theta)) > E_X\,\phi\big(\pi(\theta|x)\big).$$
Thus, let X be a Bernoulli random variable with marginal distribution (prob{X = 0}, prob{X = 1}) = (1 − α, α). Then, it is straightforward to choose a joint distribution of θ and X such that
$$\pi(\theta|X = 0) = \pi_1(\theta), \qquad \pi(\theta|X = 1) = \pi_2(\theta),$$
from which we obtain the required strict inequality.

Proof

(of Theorem 1). We now show that Theorem 1 is a special case of Theorem 2.
Write πα(θ) = (1 − α)π1(θ) + απ2(θ). If h(u) = ug(u) is convex as a function of its argument u, then
$$\int h(\pi_\alpha(\theta))\,d\theta \le \int \Big((1-\alpha)\,h(\pi_1(\theta)) + \alpha\,h(\pi_2(\theta))\Big)\,d\theta = (1-\alpha)\int h(\pi_1(\theta))\,d\theta + \alpha\int h(\pi_2(\theta))\,d\theta,$$
proving one direction.
The reverse direction is to show that if I_g is convex for all π, then h is convex. For this, again, we need a special construction. We carry this out in one dimension, the extension to more than one dimension being straightforward. For ease of exposition, we also assume the necessary differentiability conditions. The second directional derivative of I_g(θ) in the (convex) space of distributions at π1 towards π2 is:
$$\left.\frac{\partial^2}{\partial\alpha^2}\int g(\pi_\alpha(\theta))\,\pi_\alpha(\theta)\,d\theta\,\right|_{\alpha=0} = \int (\pi_1 - \pi_2)^2\,\big(g''(\pi_1)\,\pi_1 + 2g'(\pi_1)\big)\,d\theta.$$
Let π1 represent the uniform distribution on [0, 1/z] (with density z), for some z > 0, and let π2 be a distribution with support contained in [0, 1/z]. Then, the above becomes:
$$\int_0^{1/z} (z - \pi_2(\theta))^2\,\big(g''(z)\,z + 2g'(z)\big)\,d\theta = \big(g''(z)\,z + 2g'(z)\big)\int_0^{1/z} (z - \pi_2(\theta))^2\,d\theta.$$
Now, assume that h(z) = zg(z) is not convex at z; then h″(z) = g″(z)z + 2g′(z) < 0, and any choice of π2 that makes the integral on the right-hand side positive shows that I_g is not convex at π1. This completes the proof.
Theorem 2 has a considerable history of discovery and rediscovery and, in its full version, should probably be attributed to DeGroot [13]; see Ginebra [6]. The early results concentrated on functionals of the Shannon type, basically yielding Theorem 1. Note that the condition that h(u) = ug(u) is convex on R+ is equivalent to g(1/u) being convex, which is referred to as g(u) being “reciprocally convex” by Goldman and Shaked [14]; see also Fallis and Liddell [15].

3. Distance-Based Information Functions

Shannon-type information functionals take no account of metrics. Intuitively, if mass is moved around, the information stays the same. Let Z1, Z2 be independent copies from π(z), and let d(z1, z2) be a distance or metric. Define d-information as:
$$\phi(\pi) = -E_{Z_1,Z_2}\big(d(Z_1, Z_2)^2\big).$$
Now, with πα(z) = (1 − α)π1(z) + απ2(z),
$$\phi(\pi_\alpha) = -\int\!\!\int d(z_1,z_2)^2\,\big((1-\alpha)\pi_1(z_1) + \alpha\pi_2(z_1)\big)\big((1-\alpha)\pi_1(z_2) + \alpha\pi_2(z_2)\big)\,dz_1\,dz_2.$$
The condition for convexity, again using the second directional derivative with respect to α, is
$$-\int\!\!\int d(z_1,z_2)^2\,\big(\pi_1(z_1) - \pi_2(z_1)\big)\big(\pi_1(z_2) - \pi_2(z_2)\big)\,dz_1\,dz_2 \ge 0. \tag{5}$$
Noting that ∫(π1(z1) − π2(z1)) dz1 = 0, (5) is a generalized version of the following condition:
$$-\sum_i\sum_j d(z_i, z_j)\,c_i c_j \ge 0, \quad \text{for all } c \text{ with } \sum_i c_i = 0. \tag{6}$$
Condition (6), considered as a condition on a distance matrix d_{ij} = d(z_i, z_j), is called almost positive and is the necessary and sufficient condition for an abstract set of points P1, ..., Pk, with interpoint distances {d_{ij}}, to be embedded in Euclidean space.

Theorem 3

If d_{ij} = d_{ji}, 1 ≤ i < j ≤ n, are $\frac{1}{2}n(n-1)$ positive quantities, then a necessary and sufficient condition that the d_{ij} are the interpoint distances between points P_i, i = 1, ..., n, in R^n is that the distance matrix D = −{d_{ij}} is an almost positive matrix.
This is a special case of metric embedding, sometimes called metric multidimensional scaling in statistics; see, for example, Torgerson [16] and Gower [17,18]. A more general result is:

Theorem 4

Let S be a separable metric space with metric d(x, y); then S can be isometrically embedded in l2 if and only if A(x, y) = −d(x, y) is an almost positive kernel.
It is a task to identify the functions B(d(x, y)²) such that, when d(x, y) is a Euclidean or Hilbert space metric, the space with the new metric can still be embedded into the Hilbert space. Schoenberg [10] gives the major result that such B(·) comprise exactly the Bernstein functions, defined as follows (see Theorem 12.14 in [11]):

Definition 1

A function B : (0,∞) → R is a Bernstein function if it is C∞, B(λ) ≥ 0 for all λ > 0, and the derivatives satisfy (−1)^{n−1} B^{(n)}(λ) ≥ 0 for all positive integers n and all λ > 0.
Note that this says that B′ is a completely monotone function.

Theorem 5

(Schoenberg) The following are equivalent:
(1) B(‖x − y‖²) (x, y ∈ H) is the square of a distance function which isometrically embeds into Hilbert space H, i.e., there exists a φ : H → H such that:
$$B\big(\|x - y\|^2\big) = \|\phi(x) - \phi(y)\|^2.$$
(2) B is a Bernstein function.
(3) e^{−B(t)} is the Laplace transform of an infinitely divisible distribution, i.e.,
$$B(t) = -\log \int_0^\infty e^{-tu}\,d\gamma(u),$$
where γ is an infinitely divisible distribution.
(4) B has the Lévy–Khintchine representation:
$$B(t) = B_{\mu,b}(t) = bt + \int_0^\infty \big(1 - e^{-tu}\big)\,d\mu(u)$$
for some b ≥ 0 and a measure μ such that $\int_0^\infty (1 \wedge t)\,d\mu(t) < \infty$, with the condition that B_{μ,b}(t) > 0 for t > 0.
We now combine the above discussion with Schoenberg’s theorem.

Theorem 6

If B(·) is a Bernstein function with B(0) = 0 and d(z1, z2) is a Euclidean distance, then φ(π) = −E_{Z1,Z2}(B(d(Z1, Z2)²)) is a learning function.
In the univariate case, the negative of the variance of the distribution is a learning function, since:
$$\operatorname{var}(Z) = \tfrac{1}{2}\,E_{Z_1,Z_2}(Z_1 - Z_2)^2.$$
When Z is multivariate, we again take independent copies Z1, Z2 of Z and use Euclidean distance, and we have that minus the trace of the covariance matrix Γ of Z is a learning function:
$$\tfrac{1}{2}\,E_{Z_1,Z_2}\big(\|Z_1 - Z_2\|^2\big) = \operatorname{trace}(\Gamma).$$
Schilling et al. [11] (Chapter 15) list 138 Bernstein functions, each of which will lead to a learning functional of the distance type. We give a small selection of Bernstein functions B(λ), which, applied with λ = d(z1, z2)², give a learning function:
$$\lambda^\alpha,\ 0 < \alpha < 1; \qquad (1+\lambda)^\alpha - 1,\ 0 < \alpha < 1; \qquad 1 - (1+\lambda)^{\alpha-1},\ 0 < \alpha < 1; \qquad \frac{\lambda}{\lambda + \alpha},\ \alpha > 0.$$
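As a numerical aside (our own illustration, not from the paper), the following sketch uses the first of these, B(λ) = λ^α with α = 1/2, and estimates the distance-type functional φ(π) = −E_{Z1,Z2}(B(d(Z1, Z2)²)) by Monte Carlo for two bivariate normal distributions; the more concentrated distribution carries more information under this functional:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(sample, B=np.sqrt):
    """Monte Carlo estimate of -E[B(d(Z1, Z2)^2)] from i.i.d. rows of `sample`."""
    z1, z2 = sample[::2], sample[1::2]                  # independent copies
    return -np.mean(B(np.sum((z1 - z2) ** 2, axis=1)))

wide = rng.normal(0.0, 2.0, (200_000, 2))               # diffuse distribution
narrow = rng.normal(0.0, 0.5, (200_000, 2))             # concentrated distribution
print(phi(wide), phi(narrow))                           # phi(narrow) > phi(wide)
```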

4. Counterexamples

We show first that it is not true that information always increases. That is, it is not true that the posterior information is always more than the prior information:
$$I_g(\theta) \le E_{\theta|X}\big(g(\pi(\theta|X))\big).$$
A simple discrete example runs as follows. I have lost my keys. With high prior probability, p, I think they are on my desk. Suppose the remaining probability 1 − p is uniform over the k other likely locations. However, suppose when I look on the desk that my keys are not there. My posterior distribution is now uniform on the other locations. Under certain conditions on p and k, Shannon information has gone down. For fixed p, the condition is k > k*, where:
$$k^* = \frac{(1-p)^{1-1/p}}{p} = e\left(\frac{1}{p} - \frac{1}{2} + O(p)\right),$$
by expanding pk* in a Taylor expansion. When p = 1/2, k* = 4; moreover, pk* → e as p → 0 and pk* → 1 as p → 1. This example is captured by the somewhat self-doubting phrase “if my keys are not on my desk, I don’t know where they are”. Note, however, that something has improved: the support size is reduced from k + 1 to k.
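A quick numerical check of this example (our own illustration), with g = log so that I_g(π) = Σ_i π_i log π_i:

```python
import numpy as np

def shannon_info(probs):
    """Shannon-type information with g = log: sum of p_i * log(p_i)."""
    p = np.asarray(probs)
    return float(np.sum(p * np.log(p)))

p, k = 0.5, 10                                  # desk mass p, k other locations
prior = np.array([p] + [(1 - p) / k] * k)
posterior = np.full(k, 1.0 / k)                 # keys were not on the desk

k_star = (1 - p) ** (1 - 1 / p) / p
print(f"k* = {k_star:.2f}")                     # 4.00 for p = 1/2
print(f"prior info     = {shannon_info(prior):.4f}")
print(f"posterior info = {shannon_info(posterior):.4f}")  # lower, since k > k*
```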
There is a simple way of obtaining a large class of examples, namely to arrange that there are x-values for which the posterior distribution is approximately uniform. Then, because the uniform distribution typically has low information, for such x, we can have a decrease in information. Thus, we construct examples in which f(x|θ)π(θ) happens to be approximately constant in θ for some x. This motivates the following example.
Let Θ × 𝒳 = [0, 1]², with the joint distribution having support on [0, 1]². Let π(θ) be the prior distribution and define a sampling distribution:
$$f(x|\theta) = a(\theta)(1 - x) + \frac{x}{\pi(\theta)}.$$
Note that we include the prior distribution in the sampling distribution as a constructive device, not as some strange new general principle. We have in mind, in giving this construction, that when x → 1, the first term should approach zero and the second term, after multiplying by π(θ), should approach unity. Solving for a(θ) by setting $\int_0^1 f(x|\theta)\,dx = 1$, we have $a(\theta) = \frac{2\pi(\theta) - 1}{\pi(\theta)}$, so that:
$$f(x|\theta) = \frac{(2\pi(\theta) - 1)(1 - x) + x}{\pi(\theta)}.$$
The joint distribution is then:
$$f(x|\theta)\,\pi(\theta) = (2\pi(\theta) - 1)(1 - x) + x. \tag{9}$$
The marginal distribution of X is f_X(x) = 1 on [0, 1], since the integral of (9) over θ is unity, so that (9) is also the posterior distribution π(θ|x). Note that, in order for (9) to be a proper density, we require that π(θ) ≥ 1/2 for 0 ≤ θ ≤ 1.
The Shannon information of the prior is:
$$I_0 = \int_0^1 \pi(\theta)\log\pi(\theta)\,d\theta,$$
and of the posterior is
$$I_1 = \int_0^1 \big((2\pi(\theta) - 1)(1 - x) + x\big)\log\big((2\pi(\theta) - 1)(1 - x) + x\big)\,d\theta.$$
When x = 1/2, the integrands of I1 and I0 are equal and I0 = I1. When x = 1, the integrand of I1 is zero, as expected. Thus, for a non-uniform prior, we have less posterior information in a neighborhood of x = 1, as we aimed to achieve.
Specializing to π(θ) = 1/2 + θ on [0, 1] gives:
$$I_0 = \tfrac{9}{8}\log 3 - \log 2 - \tfrac{1}{2}, \qquad I_1 = \frac{1}{4(1-x)}\Big((2-x)^2\log(2-x) - x^2\log(x) + 2x - 2\Big).$$
Information I1 decreases from a maximum of log(2) − 1/2 at x = 0, through the value I0 at x = 1/2, to the value zero at x = 1; see also Figure 1. Thus, I0 > I1 for 1/2 < x ≤ 1. Since the marginal distribution of X is uniform on [0, 1], we have the challenging fact that:
$$\operatorname{prob}_X\{I_1 < I_0\} = \tfrac{1}{2}.$$
Namely, with prior probability equal to one half, there is less Shannon information in the posterior than in the prior. The Rényi entropy exhibits the same phenomenon, but we omit the calculations. We might say that f(x|θ) is not a good choice of sampling distribution for learning about θ.

4.1. Surprise and Ignorance

The conflict between prior beliefs and empirical data, demonstrated by these examples, lies at the heart of debates about inference and learning, that is to say epistemology. This has given rise to formal theories of surprise, which seek to take account of the conflict. Some Bayesian theories are closely related to the learning theory discussed here and measure surprise quantities, such as the difference:
$$S(\pi, f) = I_g(\theta) - E_{\theta|X}\,g\big(\pi(\theta|X)\big).$$
Since, under the conditions of Theorem 1, S is expected to be negative, a positive value is taken to measure surprise; see Itti and Baldi [19].
Taking a subjective view of these issues, we may stray into cognitive science, where there is evidence that the human brain may react in a more focused way than normal when there is surprise. This is related to wider computational models of learning: given the finite computational capacity of the brain, we need to use our sensing resources carefully in situations of risk or utility. One such body of work emanates from the so-called “cocktail party effect”: if the subject matter is of sufficient interest, such as the mention of one’s own name across a crowded room, then one’s attention is directed towards the conversation. Discussions about how the attention is first captured are closely related to surprise; see Haykin and Chen [20].

4.2. Minimal Information Prior Distributions

It is clear that if the prior distribution has minimal information (maximum entropy), then there is no surprise, because S, as defined above, is never positive. The use of such prior distributions has been advocated for many years and is incorporated into objective Bayesian analysis by some researchers. One key idea is to use Jeffreys prior distributions, that is, those which are invariant under a suitable group (Haar measure); for a discussion, see Berger [21].
An unresolved issue is that the minimal information distribution depends on the learning function. A simple example is that, for Shannon information, the minimal information distribution with support on [0, 1] is the uniform distribution, whereas the maximum variance distribution has mass 1/2 at each of {0, 1} and variance 1/4, which is achieved in the limit of the Beta(α, β) distribution as α, β → 0. The variance of the uniform distribution, on the other hand, is 1/12 < 1/4.
Consider the standard beta-binomial Bayesian set-up, where the sampling distribution is Bin(n, θ) and the (conjugate) prior is Beta(α, β). If x is the data, the posterior distribution is Beta(α + x, β + n − x), and the posterior mean, which is the Bayes estimator with respect to quadratic loss, is θ̂ = (α + x)/(α + β + n). The minimal Shannon information is achieved for the uniform distribution, α = β = 1, in which case θ̂ = (1 + x)/(2 + n). However, if we take α, β → 0, giving, as mentioned, the minimal information with respect to the variance, we obtain in the limit the maximum likelihood estimator x/n; a numerical sketch follows. The same feature arises in the Dirichlet-multinomial case, with the Dirichlet prior distribution π(θ1, ..., θk) = ∏ θi^{αi−1}/B(α1, ..., αk). The minimal Shannon information prior is uniform, with all αi = 1, but the minimal information with respect to the trace of the covariance matrix is the distribution with mass 1/k at each corner of the simplex Σθi = 1.
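A two-line numerical illustration (ours) of the two limits:

```python
def posterior_mean(x, n, alpha, beta):
    """Bayes estimator under quadratic loss for the beta-binomial model."""
    return (alpha + x) / (alpha + beta + n)

x, n = 7, 10
print(posterior_mean(x, n, 1.0, 1.0))    # uniform prior: (1 + x)/(2 + n) ~ 0.667
print(posterior_mean(x, n, 1e-9, 1e-9))  # alpha, beta -> 0: approaches x/n = 0.7
```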

5. The Role of Majorization

We concentrate here on Shannon-type learning functions. The analysis of the last section leads to the notion that, for two distributions π1(θ) and π2(θ), the second is more peaked than the first if and only if:
$$\int_\Theta h(\pi_1(\theta))\,d\theta \le \int_\Theta h(\pi_2(\theta))\,d\theta \quad \text{for all convex } h(u) = u\,g(u) \text{ on } \mathbb{R}^+. \tag{10}$$
The statement (10) defines a partial ordering between π1 and π2.
For Bayesian learning, we may hope that the ordering holds when π1 is the prior distribution and π2 is the posterior distribution. We have seen from the counterexamples that it does not hold in general, but, loosely speaking, always holds in expectation, by Theorem 1. However, it is natural to try to understand the partial ordering, and we shall now indicate that the ordering is equivalent to a well-known majorization ordering for distributions.
Consider two discrete distributions with n-vectors of probabilities $\pi_1 = (\pi_1^{(1)}, \ldots, \pi_n^{(1)})$ and $\pi_2 = (\pi_1^{(2)}, \ldots, \pi_n^{(2)})$, where $\sum_i \pi_i^{(1)} = \sum_i \pi_i^{(2)} = 1$. First, order the probabilities:
$$\tilde{\pi}_1^{(1)} \ge \cdots \ge \tilde{\pi}_n^{(1)}, \qquad \tilde{\pi}_1^{(2)} \ge \cdots \ge \tilde{\pi}_n^{(2)}.$$
Then, π2 is said to majorize π1, written π1 ≼ π2, when:
$$\sum_{i=1}^j \tilde{\pi}_i^{(1)} \le \sum_{i=1}^j \tilde{\pi}_i^{(2)}$$
for j = 1, ..., n (with equality for j = n). The standard reference is Marshall and Olkin [22], where one can find several equivalent conditions. Two of the best known are:
  • A1. there is a doubly stochastic matrix P_{n×n} such that π1 = Pπ2;
  • A2. $\sum_i^n h(\pi_i^{(1)}) \le \sum_i^n h(\pi_i^{(2)})$ for all continuous convex functions h(x).
Condition A2 shows that, in the discrete case, the partial ordering (10) is equivalent to the majorization of the raw probabilities.
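As a computational aside (our own sketch), the partial-sum definition and condition A2 can be checked directly:

```python
import numpy as np

def majorizes(p2, p1):
    """True if p2 majorizes p1, i.e., p1 <= p2 in the majorization order."""
    c1 = np.cumsum(np.sort(p1)[::-1])
    c2 = np.cumsum(np.sort(p2)[::-1])
    return bool(np.all(c1 <= c2 + 1e-12))

uniform = np.full(4, 0.25)
peaked = np.array([0.7, 0.1, 0.1, 0.1])

h = lambda u: u * np.log(u)                 # convex h(u) = u g(u) with g = log
print(majorizes(peaked, uniform))           # True: peaked majorizes uniform
print(h(uniform).sum() <= h(peaked).sum())  # True, consistent with A2
```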
We now extend this to the continuous case. This generalization, which we shall also call ≼, to save notation, has a long history, and the area is historically referred to as the theory of the “rearrangements of functions” to respect the terminology of Hardy et al. [23]. It is particularly well-suited to probability density functions, because ∫ π1(θ) = ∫π2(θ) = 1. The natural analogue of the ordered values in the discrete case is that every density π has a unique density π̃ , called a “decreasing rearrangement”, obtained by a reordering of the probability mass to be non-increasing, by direct analogy with the discrete case above. In the theory, π and π̃ are then referred to as being equimeasurable, in the sense that the supports are transformed in a measure-preserving way.
There are short sections on the topic in Marshall and Olkin [22] and in Müller and Stoyan [24]. A key paper in the development is Ryff [25]. The next paragraph is a brief summary.

Definition 2

Let π(z) be a probability density and define m(y) = μ{z : π(z) ≥ y}. Then:
$$\tilde{\pi}(t) = \sup\{y : m(y) > t\}, \quad t > 0,$$
is called the decreasing rearrangement of π(z).
The picture is that the probability mass (in infinitely small intervals) is moved so that a given mass is to the left of any smaller mass. For example, for the triangular distribution:
$$\pi(\theta) = \begin{cases} 4\theta, & 0 \le \theta < \tfrac{1}{2}, \\ 4(1-\theta), & \tfrac{1}{2} \le \theta \le 1, \end{cases}$$
we have:
$$\tilde{\pi}(\theta) = 2(1 - \theta), \quad 0 \le \theta \le 1.$$
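Numerically, the decreasing rearrangement on a grid is simply the vector of density values sorted in decreasing order; a small sketch (ours) verifying the triangular example:

```python
import numpy as np

n = 10_000
theta = (np.arange(n) + 0.5) / n                         # grid on [0, 1]
tri = np.where(theta < 0.5, 4 * theta, 4 * (1 - theta))  # triangular density
rearranged = np.sort(tri)[::-1]                          # decreasing rearrangement

print(np.abs(rearranged - 2 * (1 - theta)).max())        # ~1/n: grid error only
```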

Definition 3

We say that π2 majorizes π1, written π1 ≼ π2, if and only if, for the decreasing rearrangements,
$$\int_0^c \tilde{\pi}_1(z)\,dz \le \int_0^c \tilde{\pi}_2(z)\,dz$$
for all c > 0.
Define a doubly stochastic kernel P(x, y) ≥ 0 on (0,∞) × (0,∞), that is:
$$\int P(x, y)\,dx = \int P(x, y)\,dy = 1.$$
There is a list of key equivalent conditions to ≼, which are the continuous counterparts of the discrete majorization conditions. The first two generalize A1 and A2 above.
  • B1. $\pi_1(\theta) = \int_\Theta P(\theta, z)\,\pi_2(z)\,dz$ for some non-negative doubly stochastic kernel P(x, y).
  • B2. $\int_\Theta h(\pi_1(z))\,dz \le \int_\Theta h(\pi_2(z))\,dz$ for all continuous convex functions h.
  • B3. $\int_\Theta (\pi_1(z) - c)_+\,dz \le \int_\Theta (\pi_2(z) - c)_+\,dz$ for all c > 0.
Condition B2 is the key, for it shows that in the univariate case, if we assume that h(u) = ug(u) is continuous and convex, (10) is equivalent to π1(θ) ≼ π2(θ). We also see that ≼ is equivalent to standard first-order stochastic dominance of the decreasing rearrangements, since $\tilde{F}(\theta) = \int_0^\theta \tilde{\pi}(z)\,dz$ is the cdf corresponding to π̃(θ). Condition B3 says that the probability mass under the density above a “slice” at height c is greater for π2 than for π1.
We can summarize this discussion by the following.

Proposition 1

A functional is a learning functional of the Shannon type (under mild conditions) if and only if it is an order-preserving functional with respect to the majorization ordering on distributions.
The role of majorization has been noticed by DeGroot and Fienberg [26] in the related area of proper scoring rules.
The classic theory of rearrangements is for univariate distributions, whereas, as stated, we are interested in θ of arbitrary dimension. In the present paper, we will simply make the claim that the interpretation of our partial ordering in terms of decreasing rearrangements can indeed be extended to the multivariate case. Heuristically, this is done as follows. For a multivariate distribution, we may create a univariate rearrangement by considering a decreasing threshold and “squashing” all of the multivariate mass for which the density is above the threshold to a univariate mass adjacent to the origin. Since we are transforming multivariate volume to area, care is needed with Jacobians. We can then use the univariate development above. It is an instructive exercise to consider the univariate decreasing rearrangement of the multivariate normal distribution, but we omit the computations here.

6. Learning Based on Covariance Functions

If we restrict our functionals to those which are functionals of covariance matrices only, then we can prove wider results than just for the trace. Dawid and Sebastiani [27] (Section 4) refer to dispersion-coherent uncertainty functions; their results are close to ours, and we differ only in assumptions.
We use the notation A ≥ 0 to mean that a symmetric matrix is non-negative definite.

Definition 4

For two n × n symmetric non-negative definite matrices A and B, the Loewner ordering A ≥ B holds when A − B ≥ 0.

Definition 5

A function φ : A ↦ R on the class of non-negative definite matrices A is said to be Loewner increasing (also called matrix monotone) if A ≥ B ⇒ φ(A) ≥ φ(B).

Theorem 7

A function φ is Loewner increasing and concave on the class of covariance matrices Γ(π) if and only if −φ is a learning function on the corresponding distributions.

Proof

Assume φ is Loewner increasing and concave. To simplify the notation, we call μ(π) and Γ(π) the mean vector and covariance matrix, respectively, of the random variable Z with distribution π. Now, consider a mixed density πα = (1 − α)π1 + απ2. Then, with obvious notation,
$$\begin{aligned}
\Gamma(\pi_\alpha) &= E_\alpha(ZZ^T) - \mu_\alpha\mu_\alpha^T \\
&= (1-\alpha)\Gamma_1 + \alpha\Gamma_2 + (1-\alpha)\mu_1\mu_1^T + \alpha\mu_2\mu_2^T - \big((1-\alpha)\mu_1 + \alpha\mu_2\big)\big((1-\alpha)\mu_1 + \alpha\mu_2\big)^T \\
&= (1-\alpha)\Gamma_1 + \alpha\Gamma_2 + \alpha(1-\alpha)(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T \\
&\ge (1-\alpha)\Gamma_1 + \alpha\Gamma_2,
\end{aligned}$$
for 0 ≤ α ≤ 1, since (μ1 − μ2)(μ1 − μ2)^T is non-negative definite. Then, since φ is Loewner increasing and concave, φ(Γ(πα)) ≥ φ((1 − α)Γ(π1) + αΓ(π2)) ≥ (1 − α)φ(Γ(π1)) + αφ(Γ(π2)), and by Theorem 2, −φ is a learning function.
We first prove the converse for matrices Γ and Γ̃ = Γ + zz^T, for some vector z. Take two distributions with equal covariance matrices Γ, but with means satisfying μ1 − μ2 = 2z. Then,
$$\Gamma(\pi_{1/2}) = \Gamma + \tfrac{1}{4}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \Gamma + zz^T = \tilde{\Gamma}.$$
Now assume that −φ is a learning function. Then, by the concavity of φ ∘ Γ as a functional of the distribution,
$$\phi(\tilde{\Gamma}) = \phi\big(\Gamma(\pi_{1/2})\big) \ge \tfrac{1}{2}\phi(\Gamma) + \tfrac{1}{2}\phi(\Gamma) = \phi(\Gamma).$$
In general, we can write any Γ̃ ≥ Γ as $\tilde{\Gamma} = \Gamma + \sum_{i=1}^m z^{(i)}z^{(i)T}$, for a sequence of vectors {z^{(i)}}, i = 1, ..., m, and the result follows by induction from the last result.
Most criteria used in classical optimum design theory (in the linear regression setting), when applied to covariance matrices, are Loewner increasing. If, in addition, we can claim concavity, then by Theorem 7, the negative of any such function is a learning function. We have seen in Section 3 that −trace(Γ) is a learning function, while −log det(Γ), corresponding to D-optimality, is another example.
For the normal distribution, we can show that, for two normal density functions π1 and π2 with covariance matrices Γ1 and Γ2, respectively, any Shannon-type learning function satisfies I_g(θ1) ≤ I_g(θ2) if and only if det(Γ1) ≥ det(Γ2). We should note that in many Bayesian set-ups, such as regression and Gaussian process prediction, we have a joint multivariate distribution between x and θ. Suppose that, with obvious notation, the joint covariance matrix is:
$$\Gamma_{\theta,X} = \begin{pmatrix} \Gamma_\theta & \gamma_{\theta,X} \\ \gamma_{\theta,X}^T & \Gamma_X \end{pmatrix}.$$
Then, the posterior distribution for θ has covariance $\Gamma_\theta - \gamma_{\theta,X}\Gamma_X^{-1}\gamma_{\theta,X}^T \le \Gamma_\theta$. Thus, for any Loewner increasing φ, it holds that −φ(π(θ)) ≤ −E_X(φ(π(θ|X))), by Theorem 7. However, as the conditional covariance matrix does not depend on X, we have learning in the strong sense: −φ(π(θ)) ≤ −φ(π(θ|X)). The classification of learning functions for the case where both θ and Γ_{θ,X} are unknown is not yet fully developed.
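A small numerical check (our own sketch, with arbitrary illustrative numbers) of this Schur complement formula and the Loewner comparison:

```python
import numpy as np

# Joint covariance blocks for (theta, X); numbers are arbitrary but valid.
G_theta = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([[0.8, 0.1], [0.2, 0.4]])        # gamma_{theta,X}
G_X = np.array([[1.5, 0.2], [0.2, 1.2]])

# Posterior covariance: the Schur complement, independent of the observed X.
G_post = G_theta - g @ np.linalg.inv(G_X) @ g.T

# Loewner check: G_theta - G_post = g G_X^{-1} g^T is non-negative definite.
print(np.linalg.eigvalsh(G_theta - G_post))   # all eigenvalues >= 0
```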

7. Approximate Bayesian Computation Designs

We now present a general method for performing optimum experimental design calculations, which, combined with the theory of learning outlined above, may provide a comprehensive approach. Recall that in our general setting, a decision about experimentation or observation is essentially a choice of the sampling distribution. In the statistical theory of the design of experiments, this choice typically means a choice of observation sites indexed by a control or independent variable z.
Indeed, we will have examples below in this category. However, the general formulation is that we want to maximize ψ over some restricted set of sampling distributions f(x|θ) ∈ ℱ. A choice of f we call a generalized design. Below, we will have one non-standard example based on selective sampling. Note that we shall always assume that the prior distribution π(θ) is fixed and independent of the choice of f. Then, recalling our general information functional φ(π), the design optimization problem is (for fixed π):
$$\max_{f \in \mathcal{F}} \psi(f) = E_{X_D}\,\phi\big(\pi(\theta|X_D)\big), \tag{11}$$
where we stress the dependence of the random variable X on the design and, thereby, on the sampling distribution f, by adding the subscript D.
If the set of sampling distributions is specified by the control variable z, that is, the choice of the sampling distribution f(x|θ, z) amounts to selecting z ∈ Z, then the maximization problem is:
$$\max_{z \in Z} \psi(f) = E_{X_D}\,\phi\big(\pi(\theta|X_D, z)\big).$$
In the examples that we consider below, the sampling distribution will be indexed by a control variable z.
An important distinction should be made between what we shall here call linear and non-linear criteria. By the more general utility problem being linear, we mean that there is a utility function U(θ, x) such that we seek to optimize, again over the choice of f,
$$E_{X_D} E_{\theta|X_D} U(X_D, \theta) = E_{X_D,\theta}\,U(X_D, \theta),$$
where the last expectation is with respect to the joint distribution of X_D and θ. In terms of integration, this only requires a single double integral. The non-linear case requires the evaluation of an “internal” integral for E_{θ|X_D}U(X_D, θ) and an external integral for E_{X_D}. It is important to note that Shannon-type functionals are special types of linear functionals, where U(θ, X_D) = g(π(θ|X_D)). The distance-based functionals are non-linear in that they require a repeated single integral.
This distinction is important when other costs or utilities are included in addition to those coming from learning. Most obvious is a cost for the experiment. This could be fixed, so that no preposterior analysis is required, or it might be random in that it depends on the actual observation. For example, one might add an additional utility U(X_D) solely dependent on the outcome of the experiment: if it really does snow, then snow plows may need to be deployed. The overall (preposterior) expected value of the experiment might be:
$$E_{X_D} E_{\theta|X_D} U(X_D, \theta) + E_{X_D} U(X_D).$$
In this way, one can study the exploration-exploitation problem, often referred to in search and optimization.
We now give a procedure to compute ψ for a particular choice of sampling distribution f. We assume that f(x|θ) and π(θ) are known. If the functional φ is non-linear, we have to obtain the posterior distribution π(θ|XD) before evaluating φ. For simplicity, we use ABC rejection sampling (see Marjoram et al. [28]) to obtain an approximate sample from π(θ|XD) that allows us to estimate the functional φ(π(θ|XD)). In many cases, it is hard to find an analytical solution for π(θ|XD), especially if f(x|θ) is intractable. These are the cases where ABC methods are most useful. Furthermore, ABC rejection sampling has the advantage that it is easily possible to re-compute φ̂(π(θ|XD)) for different values of XD, which is an important feature, because we have to integrate over the marginal distribution of XD in order to obtain ψ(f) = EXDφ(π(θ|XD)).
For a given f, we find the estimate ψ̂ by integrating φ̂(π(θ|X_D)) with respect to the marginal distribution f_X. We can achieve this using Monte Carlo integration:
$$\psi(f) \approx \hat{\psi} = \frac{1}{G}\sum_{i=1}^G \hat{\phi}\big(\pi(\theta|x_D^{(i)})\big)$$
for $x_D^{(i)} \sim f_X$. The ABC procedure to obtain the estimate φ̂(π(θ|x_D)) given x_D is as follows.
(1) Sample from π(θ): {θ1, ..., θH}.
(2) For each θi, sample from f(x|θi) to obtain a sample $x^{(i)} = (x_1^{(i)}, \ldots, x_n^{(i)})$. This gives a sample from the joint distribution f_{X,θ}.
(3) For each θi, compute a vector of summary statistics T(x^{(i)}) = (T1(x^{(i)}), ..., Tm(x^{(i)})).
(4) Split T-space into disjoint neighborhoods 𝒩.
(5) Find the neighborhood 𝒩 for which T(x_D) ∈ 𝒩 and collect the θi for which T(x^{(i)}) ∈ 𝒩, forming an approximate posterior distribution π̃(θ|T), which, if T is approximately sufficient, should be close to π(θ|x_D). If T is sufficient, we have that π̃(θ|T) → π(θ|x_D) as |𝒩| → 0.
(6) Approximate π(θ|x_D) by π̃(θ|T).
(7) Evaluate φ(π(θ|x_D)) by integration (internal integration).
Steps 1–4 need to be conducted only once at the outset for each f; only Steps 5–7 have to be repeated for each xD ~ fX.
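The following is a minimal end-to-end sketch of Steps 1–7 together with the outer Monte Carlo loop. All modeling choices here are ours for illustration: a conjugate normal model with θ ~ N(0, 1) and x_j|θ ~ N(θ, 1), the sample mean as summary statistic T, a simple ε-ball around T(x_D) in place of a fixed partition of T-space, and the negative posterior variance as the learning functional φ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model (ours, not from the paper): theta ~ N(0, 1) prior,
# x_j | theta ~ N(theta, 1) with n = 5 observations.
H, G, n, eps = 100_000, 200, 5, 0.05

theta = rng.normal(0.0, 1.0, H)                 # Step 1: sample from the prior
x = rng.normal(theta[:, None], 1.0, (H, n))     # Step 2: joint sample (theta, x)
T = x.mean(axis=1)                              # Step 3: summary statistics

def phi_hat(xD):
    """Steps 5-7: collect theta_i with T(x_i) near T(x_D), evaluate phi."""
    accepted = theta[np.abs(T - xD.mean()) < eps]
    return -accepted.var()

# Outer Monte Carlo over the marginal of X_D, reusing the joint sample:
idx = rng.choice(H, size=G, replace=False)
psi_hat = np.mean([phi_hat(x[i]) for i in idx])
print(psi_hat)  # close to -1/(1 + n) = -0.1667 for this conjugate model
```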
For a linear functional, as explained above, we do not even need to compute the posterior distribution π(θ|x_D) if we are happy to use the naive approximation to the double integral:
$$\psi(f) \approx \frac{1}{G}\sum_{i=1}^G U(x_i, \theta_i),$$
where $\{x_i, \theta_i\}_{i=1}^G$ are independent draws from the joint distribution f(x, θ) = f(x|θ)π(θ).
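For a linear criterion, the computation thus collapses to a single pass over the joint sample; continuing the sketch above with a hypothetical utility U(x, θ) = −(θ − x̄)²:

```python
# Naive Monte Carlo for a linear criterion (a single double integral),
# reusing theta and x from the previous sketch:
U = -(theta - x.mean(axis=1)) ** 2   # hypothetical utility U(x, theta)
print(U.mean())                      # estimate of psi(f)
```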
The optimum of ψ(f) over f may be found by employing any suitable optimization method. In this paper, we intend to focus on the computation of ψ̂(f). Therefore, in the illustrative examples below, we take a “crude” optimization approach, that is, we estimate ψ(f) for a fixed set of possible choices of f and compare the estimates.
The basic technique of ABCD was introduced in Hainy et al. [29], but here, we present it fully embedded in statistical learning theory. Note that related, though different, procedures utilizing MCMC chains were developed independently in Drovandi and Pettitt [30] and Hainy et al. [31].
We now present two examples that are meant to illustrate the applicability of ABCD to very general design problems using non-linear design criteria. Although these examples are rather simple and may also be solved by analytical or numerical methods, their generalizations become intractable using traditional methods.

7.1. Selective Sampling

When the background sampling distribution is f(x|θ), we may impose prior constraints on which data we accept to use. Such models, in greater generality, may occur when observation is cheap but the use of an observation is expensive, for example computationally. We call this “selective sampling”, and we present a simple example.
Suppose, in a one-dimensional problem, that we are only allowed to accept observations from two slits of equal width at z1 and z2. Here, the model is equivalent (in the limit as the slit widths become small) to replacing f(x|θ) by the discrete distribution:
$$f(x = i\,|\,\theta, z_1, z_2) = \frac{f(z_i|\theta)}{f(z_1|\theta) + f(z_2|\theta)}, \quad i = 1, 2.$$
If we have a prior distribution π(θ) and f(x|z1, z2) = ∫ f(x|θ, z1, z2)π(θ)dθ denotes the marginal distribution of x, the posterior distribution is given by:
$$\pi(\theta\,|\,x = i, z_1, z_2) = \frac{f(x = i\,|\,\theta, z_1, z_2)\,\pi(\theta)}{f(x = i\,|\,z_1, z_2)}, \quad i = 1, 2.$$
To simplify even further, we take as a criterion:
$$\phi\big(\pi(\theta|x, z_1, z_2)\big) = \max_\theta \pi(\theta|x, z_1, z_2).$$
The maximum is a limiting version of Tsallis entropy and is a learning functional.
Now consider a special case:
$$z\,|\,\theta \sim N(\theta, 1), \qquad \theta \sim U[-1, 1].$$
The preposterior criterion:
$$\psi(z_1, z_2) = \sum_{i=1}^2 \phi\big(\pi(\theta|x = i, z_1, z_2)\big)\,f(x = i\,|\,z_1, z_2)$$
can be calculated explicitly. If z2 ≥ z1 and zi ∈ [−a, a], then:
$$\max_{z_1,z_2}\psi(z_1, z_2) = \psi(-a, a) = \frac{1}{1 + \exp(-2a)} \to \begin{cases} \tfrac{1}{2}, & a \to 0, \\ 1, & a \to \infty. \end{cases}$$
Next, we show how this example can be solved using ABCD. Due to the special structure of the sampling distribution in this example, we modified our ABC sampling strategy slightly; the steps follow, with a sketch of an implementation after the list.
(1) For fixed z1 and z2, sample H values {θ^{(j)}, j = 1, ..., H} from the prior.
(2) For each θ^{(j)}, repeat:
  (a) sample z^{(k)} ~ π(z|θ^{(j)}) until #{z^{(k)} ∈ N_ε(z1) ∪ N_ε(z2)} = K_z, where N_ε(z) = [z − ε/2, z + ε/2];
  (b) drop all z^{(k)} ∉ N_ε(z1) ∪ N_ε(z2);
  (c) sample x^{(j)} from the discrete distribution with probabilities $\Pr(x^{(j)} = i) = \#\{z^{(k)} \in N_\varepsilon(z_i)\}/K_z$, i = 1, 2.
(3) For i = 1, 2, select all θ^{(j)} for which x^{(j)} = i, compute a kernel density estimate from these θ^{(j)} and obtain its maximum → φ̂(π̂(θ|x = i, z1, z2)).
(4) $\hat{\psi}(z_1, z_2) = \sum_{i=1}^2 \hat{\phi}\big(\hat{\pi}(\theta|x = i, z_1, z_2)\big)\,\#\{x^{(j)} = i\}/H.$
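A minimal implementation sketch of these steps (ours; the batch size, evaluation grid and scipy Gaussian KDE are our choices, and the plain KDE is boundary-biased at θ = ±1, which shades the estimate downwards):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

def psi_hat(z1, z2, H=2000, Kz=100, eps=0.01):
    """ABC estimate of psi(z1, z2) for z|theta ~ N(theta, 1), theta ~ U[-1, 1]."""
    thetas = rng.uniform(-1.0, 1.0, H)            # step 1: prior sample
    xs = np.empty(H, dtype=int)
    for j, th in enumerate(thetas):
        labels = []
        while len(labels) < Kz:                   # step 2(a): sample until Kz hits
            z = rng.normal(th, 1.0, 5000)
            in1 = np.abs(z - z1) <= eps / 2
            in2 = np.abs(z - z2) <= eps / 2
            labels.extend(np.where(in1, 1, 2)[in1 | in2].tolist())
        p1 = np.mean(np.array(labels[:Kz]) == 1)
        xs[j] = 1 if rng.random() < p1 else 2     # step 2(c)
    grid = np.linspace(-1.0, 1.0, 201)
    psi = 0.0
    for i in (1, 2):
        kde = gaussian_kde(thetas[xs == i])       # step 3: KDE of ABC posterior
        psi += kde(grid).max() * np.mean(xs == i)  # step 4
    return psi

print(psi_hat(-1.0, 1.0))  # compare: psi(-a, a) = 1/(1 + exp(-2a)) ~ 0.881, a = 1
```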
We performed our ABC sampling strategy for this example for a range of values of the slit neighborhood length ε (ε = 0.005, 0.01, 0.05), of H (H = 100, 1,000, 10,000) and of K_z (K_z = 50, 100, 200), in order to assess the effect of these parameters on the accuracy of the ABC estimates of the criterion ψ. The most notable effect was found for the ABC sample size H.
Figure 2 shows the estimated values of the criterion, ψ̂, for the special case where z2 = −z1 and a = 1.5. We set ε = 0.01 and K_z = 100. The ABC sample size is set to H = 100 (left), H = 1,000 (center) and H = 10,000 (right). The criterion was evaluated at the eight points z1 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5. The theoretical criterion function ψ(z1) is plotted as a solid line.

7.2. Spatial Sampling for Prediction

This example is also a simple version of an important paradigm, namely optimal sampling of a spatial stochastic process for good prediction. Here, the stochastic process, labeled X, is indexed by a space variable z, and we write X_i = X_i(z_i), i = 1, ..., n, to indicate sampling at the sites (the design) D_n = {z1, ..., zn}. We would typically take the design space Z to be a compact region.
We wish to compute the predictive distribution at a new site zn+1, namely xn+1(zn+1), given xD = x(Dn) = (x1(z1), . . . , xn(zn)). In the Gaussian case, the background parameter θ could be related to a fixed effect (drift) or the covariance function of the process, or both. In the analysis, xn+1 is regarded as an additional parameter, and we need its (marginal) conditional distribution.
The criterion of interest is the maximum variance of the (posterior) predictive distribution over the design space:
$$-\phi(x(D_n)) = \max_{z_{n+1} \in Z} \operatorname{var}\big(X_{n+1}(z_{n+1})\,|\,x(D_n)\big) = \max_{z_{n+1} \in Z} \int \big(x_{n+1} - \mu_{x_{n+1}}\big)^2\,\pi\big(x_{n+1}\,|\,x(D_n), z_{n+1}\big)\,dx_{n+1}.$$
This functional is learnable, since it is a maximum of a set of variances, each one of which is learnable.
Referring back to the general design optimization problem stated in (11), the posterior predictive distribution of x_{n+1} may be interpreted as the posterior distribution in (11). The optimality criterion ψ is found by integrating φ with respect to X1, ..., Xn.
The strategy is to select a design D_n and then perform ABC at each test point z_{n+1}. The learning functional φ(x_D) is estimated by generating the sample $I = \{x_D^{(j)}, x_{n+1}^{(j)}\}_{j=1}^H = \{x_1^{(j)}, x_2^{(j)}, \ldots, x_n^{(j)}, x_{n+1}^{(j)}\}_{j=1}^H$ at the sites {z1, z2, ..., zn, z_{n+1}} and calculating:
$$-\hat{\phi}(x_D) = \max_{z_{n+1} \in Z} \frac{1}{|J_\varepsilon(x_D)|}\sum_{j \in J_\varepsilon(x_D)} \big(x_{n+1}^{(j)} - \bar{x}_{n+1}\big)^2,$$
where $J_\varepsilon(x_D) = \{j \in \{1, \ldots, H\} : x_D^{(j)} \in N_\varepsilon(x_D)\}$; we have $x_D^{(j)} \in N_\varepsilon(x_D)$ if $|x_i^{(j)} - x_i| \le \varepsilon$ for all i = 1, ..., n, and $\bar{x}_{n+1} = (1/|J_\varepsilon(x_D)|)\sum_{j \in J_\varepsilon(x_D)} x_{n+1}^{(j)}$.
In order to estimate ψ(D_n) = E_{X_D}(φ(X_D)), we obtain a sample $O = \{x_D^{(i)}\}_{i=1}^G$ from the marginal distribution of the random field at the design D_n and perform Monte Carlo integration:
$$\hat{\psi}(D_n) = \frac{1}{G}\sum_{i=1}^G \hat{\phi}\big(x_D^{(i)}\big). \tag{12}$$
For each $x_D^{(i)} \in O$ from the marginal sample, we reuse the sample I to compute $\hat{\phi}(x_D^{(i)})$ in order to save computing time. We then vary the design using some optimization algorithm.
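A compact sketch of this procedure for the Ornstein–Uhlenbeck example of the next paragraph (our own implementation; the grid, ε and sample sizes are illustrative and much smaller than those used for the figures):

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumptions (ours): zero-mean Gaussian field on [0, 1] with OU correlation
# rho(h) = exp(-theta * h) and the point prior theta = log(100); n = 3 sites.
theta = np.log(100.0)
design = np.array([0.1, 0.5, 0.9])
grid = np.linspace(0.0, 1.0, 21)                  # candidate sites z_{n+1}
H, G, eps = 200_000, 200, 0.2

sites = np.concatenate([design, grid])
cov = np.exp(-theta * np.abs(sites[:, None] - sites[None, :]))
sample = rng.multivariate_normal(np.zeros(len(sites)), cov, H)
sample_D, sample_pred = sample[:, :3], sample[:, 3:]      # the sample I

def neg_phi_hat(xD):
    """Max over the grid of the ABC predictive variance, given x_D."""
    close = np.all(np.abs(sample_D - xD) <= eps, axis=1)  # J_eps(x_D)
    return sample_pred[close].var(axis=0).max()

xDs = sample_D[rng.choice(H, size=G, replace=False)]      # marginal sample O
neg_psi_hat = np.mean([neg_phi_hat(xD) for xD in xDs])    # Equation (12)
print(neg_psi_hat)  # -psi_hat(D_3): expected maximum predictive variance
```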
A simple example is adapted from Müller et al. [32]. The observations (x1(z1), x2(z2), x3(z3), x4(z4)) are assumed to be distributed according to a one-dimensional Gaussian random field with mean zero, a marginal variance of one and z_i ∈ [0, 1]. We want to select an optimal design D3 = (z1, z2, z3) such that:
$$-\psi(D_3) = E_{X_{1:3}(D_3)}\Big[\max_{z_4 \in [0,1]} \operatorname{var}\big(X_4(z_4)\,|\,X_{1:3}(D_3)\big)\Big]$$
is minimal.
We assume the Ornstein–Uhlenbeck process with correlation function ρ(|s − t|; θ) = e^{−θ|s−t|}. Two prior distributions for the parameter θ are considered. The first one is a point prior at θ = log(100), so that ρ(h) = ρ(h; log(100)) = 0.01^h. This is the correlation function used by Müller et al. [32] in their study of empirical kriging optimal designs. The second prior distribution is an exponential prior for θ with scale parameter λ = 10 (i.e., θ ~ Exp(10)). The scale parameter λ was chosen such that the average correlation functions of the point and exponential priors are similar. By that, we mean that the average of the mean correlation function for the exponential prior over all pairs of sites s and t, E_{s,t}[E_θ{ρ(|s − t|; θ) | θ ~ Exp(λ)}] = E_{s,t}[1/(1 + λ|s − t|)], matches the average of the fixed correlation function ρ(|s − t|; log(100)) = 0.01^{|s−t|} over all pairs of sites s and t, E_{s,t}[0.01^{|s−t|}]. The sites are assumed to be uniformly distributed over the coordinate space.
To be more specific, first, for each site s ∈ 𝒳, the average correlation to all other sites t ∈ 𝒳 is computed. Then, these average correlations are averaged over all sites s ∈ 𝒳. For the point prior, the average correlation is
$$E_{s,t}\big[\rho(|s-t|; \log(100))\big] = \frac{2}{(\log(100))^2}\Big(\log(100) - \big(1 - \tfrac{1}{100}\big)\Big) = 0.3409,$$
and for the exponential prior, the value is
$$E_{s,t}\big[E_\theta\{\rho(|s-t|; \theta)\,|\,\theta \sim \mathrm{Exp}(\lambda)\}\big] = \frac{2}{\lambda^2}\big[(1+\lambda)\log(1+\lambda) - \lambda\big].$$
If λ = 10, we have E_{s,t}[E_θ{ρ(|s − t|; θ) | θ ~ Exp(10)}] = 0.3275.
Figure 3 depicts the distributions of the correlation function ρ(h; θ) = exp(−θh) under the two prior distributions. The solid line corresponds to the fixed correlation function ρ(h; log(100)) = 0.01^h. The dotted line and the two dashed lines represent the mean correlation function and the 0.025- and 0.975-quantile functions for ρ(h; θ) under the prior θ ~ Exp(10).
We estimated the criterion on a grid with spacing 0.05 for the design points z1 and z3 (z2 is fixed at z2 = 0.5). We set G = 1,000, H = 5 · 10⁶ and ε = 0.01 for each design point. The sample $\{x_j(z) : z \in Z\}_{j=1}^H$ is simulated at all points z of the grid prior to the actual ABC algorithm. In order to accelerate the computations, it is then reused for all possible designs D3 to estimate each $\hat{\phi}(x_D^{(i)})$, i = 1, ..., G, in (12). The sample size H = 5 · 10⁶ was deemed to provide a sufficiently exhaustive sample from the four-dimensional normal vector (x1(z1), x2(z2), x3(z3), x4(z4)) for any z_i ∈ Z, so that the distortive effect of using the same sample for the computations of all $\hat{\phi}(x_D^{(i)})$ is of negligible concern for our purposes of ranking the designs.
Figure 4 (left) shows the map of estimated criterion values, −ψ̂(D3), when the prior distribution of θ is the point prior at θ = log(100). It can be seen that the minimum of the criterion is attained at about (z1, z3) = (0.9, 0.1) or (z1, z3) = (0.1, 0.9), which is comparable to the results obtained in Müller et al. [32] for empirical kriging optimal designs. Note that the diverging criterion values on the diagonal and at z1 = 0.5 and z3 = 0.5 are attributable to a specific feature of the ABC method used. At these designs, the actual dimension of the design is lower than three, so for a given ε, there are more elements in the neighborhood than for the other designs with three distinct design points. Hence, a much larger fraction of the total sample, $\{x_{n+1}^{(j)}\}_{j=1}^H$, is included in the ABC sample, $\{x_{n+1}^{(j)} : j \in J_\varepsilon(x_D)\}$. Therefore, the values of the criterion get closer to the marginal variance of one. In order to avoid this effect, the parameter ε would have to be adapted in these cases. Alternatively, one could use other variants of ABC rejection, where the fixed number of N elements of $I = \{x_D^{(j)}, x_{n+1}^{(j)}\}_{j=1}^H$ with the smallest distance to the draw $x_D^{(i)} \in O$ constitute the ABC posterior sample, making it necessary to compute and sort the distances for each $x_D^{(i)} \in O$.
Figure 4 (right) gives the estimated criterion values, −ψ̂(D3), when the prior of θ is θ ~ Exp(10). Due to the uncertainty about the prior parameter θ, the optimal design points for z1 and z3 move slightly towards the edges, which is also in accordance with the findings of Müller et al. [32].

8. Conclusions

There are some fundamental results in Bayesian learning which provide important background to fields like the optimal design of experiments. Functionals of prior distributions which are learnable, via observation, in a wide sense are the convex functionals. Shannon information is an example, but there are many others, and the paper points to some wide classes with connections to other fields. The paper combines the theory of learning with an effective method for the optimal design of experiments based on simulation: ABCD. It is suggested that the method should prove useful in non-standard situations, such as non-linear, non-Gaussian models, and for complex problems where the sampling distribution is intractable but one can still draw samples from it for given parameter values. A simple message is that the learning theory and the simulation method apply to a generalized notion of an experiment as a choice of sampling distribution, under restrictions.

Acknowledgments

The research of the first author has been partially supported by the French Science Fund (ANR) and Austrian Science Fund (FWF) bilateral grant I-833-N18. The last author is grateful for the award of the Exzellenzstipendium des Landes Oberösterreich by the governor of Upper Austria in 2012.

Author Contributions

The background sections were mainly authored by Henry P. Wynn; ABCD was jointly conceived by Werner G. Müller and Henry P. Wynn; All computations for the examples were performed by Markus Hainy. All authors have read and approved the final published manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Blackwell, D. Comparison of Experiments. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 31 July–12 August 1950; University of California Press: Berkeley, CA, USA, 1951; pp. 93–102.
  2. Torgersen, E. Comparison of Statistical Experiments; Encyclopedia of Mathematics and its Applications 36; Cambridge University Press: Cambridge, UK, 1991.
  3. Rényi, A. On Measures of Entropy and Information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
  4. Lindley, D.V. On a Measure of the Information Provided by an Experiment. Ann. Math. Stat. 1956, 27, 986–1005.
  5. Goel, P.K.; DeGroot, M.H. Comparison of Experiments and Information Measures. Ann. Stat. 1979, 7, 1066–1077.
  6. Ginebra, J. On the measure of the information in a statistical experiment. Bayesian Anal. 2007, 2, 167–211.
  7. Chaloner, K.; Verdinelli, I. Bayesian Experimental Design: A Review. Stat. Sci. 1995, 10, 273–304.
  8. Sebastiani, P.; Wynn, H.P. Maximum entropy sampling and optimal Bayesian experimental design. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2000, 62, 145–157.
  9. Chater, N.; Oaksford, M. The Probability Heuristics Model of Syllogistic Reasoning. Cogn. Psychol. 1999, 38, 191–258.
  10. Schoenberg, I.J. Metric Spaces and Positive Definite Functions. Trans. Am. Math. Soc. 1938, 44, 522–536.
  11. Schilling, R.L.; Song, R.; Vondracek, Z. Bernstein Functions: Theory and Applications; De Gruyter Studies in Mathematics 37; De Gruyter: Berlin, Germany, 2012.
  12. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  13. DeGroot, M.H. Optimal Statistical Decisions, WCL ed.; Wiley-Interscience: Hoboken, NJ, USA, 2004.
  14. Goldman, A.I.; Shaked, M. Results on inquiry and truth possession. Stat. Probab. Lett. 1991, 12, 415–420.
  15. Fallis, D.; Liddell, G. Further results on inquiry and truth possession. Stat. Probab. Lett. 2002, 60, 169–182.
  16. Torgerson, W.S. Theory and Methods of Scaling; John Wiley and Sons, Inc.: New York, NY, USA, 1958.
  17. Gower, J.C. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 1966, 53, 325–338.
  18. Gower, J.C. Euclidean distance geometry. Math. Sci. 1982, 7, 1–14.
  19. Itti, L.; Baldi, P. Bayesian surprise attracts human attention. Vis. Res. 2009, 49, 1295–1306.
  20. Haykin, S.; Chen, Z. The Cocktail Party Problem. Neural Comput. 2005, 17, 1875–1902.
  21. Berger, J. The case for objective Bayesian analysis. Bayesian Anal. 2006, 1, 385–402.
  22. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer Series in Statistics; Springer: Berlin, Germany, 2009.
  23. Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities, 2nd ed.; Cambridge Mathematical Library; Cambridge University Press: Cambridge, UK, 1988.
  24. Müller, A.; Stoyan, D. Comparison Methods for Stochastic Models and Risks, 1st ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2002.
  25. Ryff, J.V. Orbits of L¹-functions under doubly stochastic transformations. Trans. Am. Math. Soc. 1965, 117, 92–100.
  26. DeGroot, M.H.; Fienberg, S. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In Bayesian Inference and Decision Techniques; Goel, P., Zellner, A., Eds.; North-Holland: Amsterdam, The Netherlands, 1986; pp. 247–264.
  27. Dawid, A.P.; Sebastiani, P. Coherent dispersion criteria for optimal experimental design. Ann. Stat. 1999, 27, 65–81.
  28. Marjoram, P.; Molitor, J.; Plagnol, V.; Tavaré, S. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 2003, 100, 15324–15328.
  29. Hainy, M.; Müller, W.; Wynn, H. Approximate Bayesian Computation Design (ABCD), an Introduction. In mODa 10—Advances in Model-Oriented Design and Analysis; Ucinski, D., Atkinson, A.C., Patan, M., Eds.; Contributions to Statistics; Springer International Publishing: Heidelberg/Berlin, Germany, 2013; pp. 135–143.
  30. Drovandi, C.C.; Pettitt, A.N. Bayesian Experimental Design for Models with Intractable Likelihoods. Biometrics 2013, 69, 937–948.
  31. Hainy, M.; Müller, W.G.; Wagner, H. Likelihood-free Simulation-based Optimal Design; Technical Report; Johannes Kepler University: Linz, Austria, 2013.
  32. Müller, W.G.; Pronzato, L.; Waldl, H. Beyond space-filling: An illustrative case. Procedia Environ. Sci. 2011, 7, 14–19.
Figure 1. Shannon information of the prior, I0, and of the posterior, I1, depending on x.
Figure 2. Estimated values of the criterion ψ̂(z1) (points) and theoretical criterion function ψ(z1) (solid line) for ε = 0.01, K_z = 100, and H = 100 (a), H = 1,000 (b), H = 10,000 (c).
Figure 3. Prior distributions of the correlation function ρ(h; θ): correlation function ρ(h) = 0.01^h under the point prior θ = log(100) (solid line); mean correlation function (dotted line) and 0.025- and 0.975-quantile functions (dashed lines) for ρ(h; θ) under the prior θ ~ Exp(10).
Figure 4. Spatial prediction criterion map for the point prior at θ = log(100) (left) and for the exponential prior θ ~ Exp(10) (right).
