1. Introduction
In a statistical context, the expression "the probability of an event A" (usually denoted $P(A)$) is really a misuse of language, since that probability, $P_\theta(A)$, depends on the unknown parameter $\theta$. Before performing the experiment, the expression can be assigned a natural meaning from a Bayesian perspective as the prior predictive probability of $A$, since this is the prior mean of the probabilities $P_\theta(A)$. However, in accordance with Bayesian philosophy, once the experiment has been carried out and the value $\omega$ has been observed, a more appropriate estimate of $P(A)$ is the posterior predictive probability of $A$ given $\omega$. The author has recently proved ([1]) not only that this is the Bayes estimator of $P_\theta(A)$, but also that the posterior predictive distribution (resp. the posterior predictive density) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L^1$) loss function in the Bayesian experiment corresponding to an $n$-sized sample of the unknown distribution. It should be noted that the loss functions considered derive in a natural way from the commonly used squared error loss function when estimating a real function of the parameter.
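To make the first of these statements concrete, recall the standard fact that, under the squared error loss, the Bayes estimator of a real function of the parameter is its posterior mean. The following display (our own illustration, written in the notation introduced in Section 2) simply specializes that fact to the function $\theta\mapsto P_\theta(A)$: the estimator
$\delta^*(\omega):=\int_\Theta P_\theta(A)\,dQ_{n,\omega}(\theta)$
minimizes the Bayes risk
$\int_\Theta\int_{\Omega^n}\bigl(\delta(\omega)-P_\theta(A)\bigr)^2\,dP^n_\theta(\omega)\,dQ(\theta)$
over all estimators $\delta$, and $\delta^*(\omega)$ is precisely the posterior predictive probability of $A$ given the sample.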
The posterior predictive distribution is the cornerstone of predictive inference, which seeks to make inferences about a new unknown observation from a preceding random sample (see [2,3]). With that idea in mind, it has also been used in other areas such as model selection, testing for discordancy, goodness of fit, perturbation analysis, and classification (see additional fields of application in [1,2,3,4,5]). Furthermore, in [1] it has been presented as a solution to the Bayesian density estimation problem, with several examples given to illustrate the results and, in particular, to calculate a posterior predictive density. Reference [3] provides many other examples of the determination of the posterior predictive distribution. In practice, however, explicit evaluation of the posterior predictive distribution may be cumbersome, and its simulation may become preferable. The aforementioned work [3] also constitutes a good reference for such simulation methods, and hence for the computation of the Bayes estimators of the density and the sampling distribution.
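As an elementary illustration of such a simulation (ours, not taken from [3]; the model, the prior and all numerical values are chosen only for the example), the posterior predictive density can be approximated by averaging the sampling density over draws from the posterior, here in a conjugate normal model where the exact answer is available for comparison:

```python
import numpy as np

# Monte Carlo approximation of the posterior predictive density
#   p*_{n,omega}(x) = E[ p_theta(x) | omega_1, ..., omega_n ]
# for the conjugate model  x | theta ~ N(theta, 1),  theta ~ N(0, 1).
rng = np.random.default_rng(0)

theta_true = 1.3
n = 50
sample = rng.normal(theta_true, 1.0, size=n)

# Exact posterior: N(m_n, v_n) with v_n = 1 / (1 + n), m_n = v_n * sum(sample).
v_n = 1.0 / (1.0 + n)
m_n = v_n * sample.sum()

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def post_pred_density(x, n_draws=100_000):
    # Average of p_theta(x) over posterior draws of theta.
    thetas = rng.normal(m_n, np.sqrt(v_n), size=n_draws)
    return normal_pdf(x, thetas, 1.0).mean()

x0 = 0.7
print(post_pred_density(x0))            # Monte Carlo approximation
print(normal_pdf(x0, m_n, 1.0 + v_n))   # exact predictive density N(m_n, 1 + v_n)
```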
We refer the reader to the references cited in [1] for other statistical uses of the posterior predictive distribution and for some useful ways to calculate it.
In this communication, we shall explore the asymptotic behaviour of the posterior predictive density as the Bayes estimator of the density, showing its strong consistency and that the Bayes risk goes to 0 as n goes to ∞.
2. The Framework
Let $(\Omega,\mathcal A,\{P_\theta\colon\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian experiment (where $Q$ denotes the prior distribution on the parameter space $(\Theta,\mathcal T)$), and consider the infinite product Bayesian experiment
$\bigl(\Omega^{\mathbb N},\mathcal A^{\mathbb N},\{P_\theta^{\mathbb N}\colon\theta\in(\Theta,\mathcal T,Q)\}\bigr)$
corresponding to an infinite sample of the unknown distribution $P_\theta$. Let us write $P^n_\theta$ for the $n$-fold product of $P_\theta$ on $(\Omega^n,\mathcal A^n)$ for integer $n$.
We suppose that $\theta\in(\Theta,\mathcal T)\mapsto P_\theta(A)$ is measurable for every $A\in\mathcal A$, i.e., that $(P_\theta)_{\theta}$ is a Markov kernel. Let $\Pi$ be the joint distribution of the parameter and the observations, i.e.,
$\Pi(A\times T):=\int_T P^{\mathbb N}_\theta(A)\,dQ(\theta),\qquad A\in\mathcal A^{\mathbb N},\ T\in\mathcal T.$
As $Q$ is the marginal of $\Pi$ on the parameter space (i.e., the probability distribution of the projection onto $\Theta$ with respect to $\Pi$), $P^{\mathbb N}_\theta$ is a version of the conditional distribution (regular conditional probability) of the observations given the parameter $\theta$. Analogously, $P^n_\theta$ is a version of the conditional distribution of the first $n$ observations given $\theta$.
Let $\beta^*(A):=\int_\Theta P_\theta(A)\,dQ(\theta)$, $A\in\mathcal A$, the prior predictive distribution in $(\Omega,\mathcal A)$ (so that $\beta^*(A)$ is the prior mean of the probabilities $P_\theta(A)$). Similarly, write $\beta^*_n(A_n):=\int_\Theta P^n_\theta(A_n)\,dQ(\theta)$, $A_n\in\mathcal A^n$, for the prior predictive distribution in $(\Omega^n,\mathcal A^n)$. So, the posterior distribution $Q_{n,\omega}$ given the sample $(\omega_1,\ldots,\omega_n)$ satisfies
$\int_{A_n}Q_{n,\omega}(T)\,d\beta^*_n(\omega)=\int_T P^n_\theta(A_n)\,dQ(\theta),\qquad A_n\in\mathcal A^n,\ T\in\mathcal T.$
Denote by $Q_\omega$ the posterior distribution given the whole sequence of observations $\omega\in\Omega^{\mathbb N}$.
Write $P^*_{n,\omega}$ for the posterior predictive distribution given $(\omega_1,\ldots,\omega_n)$, defined for $A\in\mathcal A$ as
$P^*_{n,\omega}(A):=\int_\Theta P_\theta(A)\,dQ_{n,\omega}(\theta).$
So $P^*_{n,\omega}(A)$ is nothing but the posterior mean given $(\omega_1,\ldots,\omega_n)$ of the probabilities $P_\theta(A)$.
In the dominated case, we can assume without loss of generality that the dominating measure $\mu$ is a probability measure (because of (1) below). We write $p_\theta:=dP_\theta/d\mu$. The likelihood function $(\omega,\theta)\in\Omega\times\Theta\mapsto p_\theta(\omega)$ is assumed to be $(\mathcal A\otimes\mathcal T)$-measurable.
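In this dominated setting, the posterior distributions admit the familiar Bayes-theorem expression (a standard formula, recalled here only for convenience; it is not a statement taken from [1]):
$Q_{n,\omega}(T)=\frac{\int_T\prod_{i=1}^n p_\theta(\omega_i)\,dQ(\theta)}{\int_\Theta\prod_{i=1}^n p_\theta(\omega_i)\,dQ(\theta)},\qquad T\in\mathcal T,$
for $\beta^*_n$-almost every $(\omega_1,\ldots,\omega_n)$.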
We have that, for all $n$ and every event $A\in\mathcal A$,
$P^*_{n,\omega}(A)=\int_\Theta P_\theta(A)\,dQ_{n,\omega}(\theta)=\int_\Theta\int_A p_\theta\,d\mu\,dQ_{n,\omega}(\theta)=\int_A\Bigl(\int_\Theta p_\theta\,dQ_{n,\omega}(\theta)\Bigr)\,d\mu,$
which proves that
$p^*_{n,\omega}:=\int_\Theta p_\theta\,dQ_{n,\omega}(\theta)$
is a $\mu$-density of $P^*_{n,\omega}$ that we recognize as the posterior predictive density on $(\Omega,\mathcal A)$ given $(\omega_1,\ldots,\omega_n)$.
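For instance, in the familiar Bernoulli model (an illustration of ours, not taken from [1]): if $\Omega=\{0,1\}$, $P_\theta$ is the Bernoulli distribution of parameter $\theta\in(0,1)$, $\mu$ is the counting measure and $Q$ is the Beta$(a,b)$ prior, then $Q_{n,\omega}$ is the Beta$(a+s_n,\,b+n-s_n)$ distribution, where $s_n:=\sum_{i\le n}\omega_i$, and
$p^*_{n,\omega}(1)=\int_0^1\theta\,dQ_{n,\omega}(\theta)=\frac{a+s_n}{a+b+n},\qquad p^*_{n,\omega}(0)=\frac{b+n-s_n}{a+b+n}.$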
In the same way,
$p^{*,n}_{n,\omega}(x_1,\ldots,x_n):=\int_\Theta\prod_{i=1}^n p_\theta(x_i)\,dQ_{n,\omega}(\theta)$
is a $\mu^n$-density of the posterior predictive distribution $P^{*,n}_{n,\omega}:=\int_\Theta P^n_\theta(\cdot)\,dQ_{n,\omega}(\theta)$ on $(\Omega^n,\mathcal A^n)$; it is the posterior predictive density on $(\Omega^n,\mathcal A^n)$ given $(\omega_1,\ldots,\omega_n)$.
In what follows, we will assume the following additional regularity conditions:
- (i)
$(\Omega,\mathcal A)$ is a standard Borel space;
- (ii)
$\Theta$ is a Borel subset of a Polish space and $\mathcal T$ is its Borel $\sigma$-field;
- (iii)
the family $\{P_\theta\colon\theta\in\Theta\}$ is identifiable.
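These conditions are mild; for instance (an example of ours), the normal location model
$\Omega=\mathbb R,\quad\mathcal A=\mathcal B(\mathbb R),\quad P_\theta=N(\theta,1),\quad\Theta=\mathbb R,$
with any prior $Q$ on $\mathcal B(\mathbb R)$, satisfies (i)-(iii): the real line is a standard Borel space, it is in particular a Borel subset of a Polish space, and distinct means yield distinct normal distributions.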
According to [1], the posterior predictive distribution $P^*_{n,\omega}$ (resp. the posterior predictive density $p^*_{n,\omega}$) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L^1$) loss function in the product experiment $(\Omega^n,\mathcal A^n,\{P^n_\theta\colon\theta\in(\Theta,\mathcal T,Q)\})$. Analogously, the posterior predictive distribution $P^{*,n}_{n,\omega}$ (resp. the posterior predictive density $p^{*,n}_{n,\omega}$) is the Bayes estimator of the sampling distribution $P^n_\theta$ (resp. its density $\prod_{i=1}^n p_\theta(x_i)$) for the squared total variation (resp. the squared $L^1$) loss function in the same product experiment.
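Written out in full (our paraphrase of the optimality property, in the present notation), the first of these statements means that, among all estimators $\omega\mapsto\hat p_\omega$ of the density, $p^*_{n,\cdot}$ minimizes the Bayes risk
$\int_\Theta\int_{\Omega^n}\Bigl(\int_\Omega\bigl|\hat p_\omega-p_\theta\bigr|\,d\mu\Bigr)^2\,dP^n_\theta(\omega)\,dQ(\theta),$
and similarly, with the squared total variation loss, for the estimation of $P_\theta$.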
As a particular case of a well-known result about the total variation distance between two probability measures and the $L^1$-distance between their densities, we have that, for probability measures $P_1,P_2$ on $(\Omega,\mathcal A)$ with $\mu$-densities $p_1,p_2$,
$\sup_{A\in\mathcal A}\bigl|P_1(A)-P_2(A)\bigr|=\frac12\int_\Omega|p_1-p_2|\,d\mu.\qquad(1)$
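For completeness, the identity can be checked directly (a standard argument, included here as a reminder): for any $A\in\mathcal A$,
$P_1(A)-P_2(A)=\int_A(p_1-p_2)\,d\mu\le\int_{\{p_1>p_2\}}(p_1-p_2)\,d\mu=\frac12\int_\Omega|p_1-p_2|\,d\mu,$
the last equality because $\int_\Omega(p_1-p_2)\,d\mu=0$; the bound is attained at $A=\{p_1>p_2\}$, and the same bound for $P_2(A)-P_1(A)$ follows by symmetry.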
3. The Main Result
We ask whether the Bayes risk of the Bayes estimator $P^*_{n,\cdot}$ of the sampling distribution $P_\theta$ goes to zero when $n\to\infty$, i.e., whether
$\lim_n\int_\Theta\int_{\Omega^{\mathbb N}}\sup_{A\in\mathcal A}\bigl(P^*_{n,\omega}(A)-P_\theta(A)\bigr)^2\,dP^{\mathbb N}_\theta(\omega)\,dQ(\theta)=0.$
In terms of densities, the question is whether the Bayes risk of the Bayes estimator $p^*_{n,\cdot}$ of the density $p_\theta$ goes to zero when $n\to\infty$, i.e., whether
$\lim_n\int_\Theta\int_{\Omega^{\mathbb N}}\Bigl(\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu\Bigr)^2\,dP^{\mathbb N}_\theta(\omega)\,dQ(\theta)=0.$
Let us consider the auxiliary Bayesian experiment
$\bigl(\Omega^{\mathbb N}\times\Omega,\ \mathcal A^{\mathbb N}\otimes\mathcal A,\ \{P^{\mathbb N}_\theta\otimes\mu\colon\theta\in(\Theta,\mathcal T,Q)\}\bigr),$
in which, besides the infinite sample $\omega$, an additional coordinate $x$ with distribution $\mu$, independent of the sample and of the parameter, is observed. For $\theta\in\Theta$ and $n\in\mathbb N$, we will continue to write $P^n_\theta$ and $Q_{n,\omega}$, and now we write $\Pi'$ for the joint distribution of the parameter and the observations of this auxiliary experiment, i.e., $\Pi'(C\times T):=\int_T(P^{\mathbb N}_\theta\otimes\mu)(C)\,dQ(\theta)$ for $C\in\mathcal A^{\mathbb N}\otimes\mathcal A$ and $T\in\mathcal T$.
The new prior predictive distribution is $\beta^*_{\mathbb N}\otimes\mu$, where $\beta^*_{\mathbb N}$ denotes the prior predictive distribution in $(\Omega^{\mathbb N},\mathcal A^{\mathbb N})$, since
$\int_\Theta(P^{\mathbb N}_\theta\otimes\mu)(A\times B)\,dQ(\theta)=\Bigl(\int_\Theta P^{\mathbb N}_\theta(A)\,dQ(\theta)\Bigr)\mu(B),\qquad A\in\mathcal A^{\mathbb N},\ B\in\mathcal A.$
To compute the new posterior distributions, notice that the additional coordinate carries no information about the parameter: the likelihood of $(\omega_1,\ldots,\omega_n,x)$ is $\prod_{i=1}^n p_\theta(\omega_i)$ times a factor free of $\theta$. It follows that if $Q'_{n,(\omega,x)}$ denotes the posterior distribution given $(\omega_1,\ldots,\omega_n,x)$ in the auxiliary experiment, then $Q'_{n,(\omega,x)}=Q_{n,\omega}$.
Writing $\mathcal G_n$ for the sub-$\sigma$-field of $\mathcal A^{\mathbb N}\otimes\mathcal A\otimes\mathcal T$ generated by the first $n$ coordinates of $\omega$ and by $x$, we have that $(\mathcal G_n)_n$ is an increasing sequence of sub-$\sigma$-fields of $\mathcal G_\infty:=\mathcal A^{\mathbb N}\otimes\mathcal A\otimes\{\emptyset,\Theta\}$ such that $\mathcal G_\infty=\sigma\bigl(\bigcup_n\mathcal G_n\bigr)$. According to the martingale convergence theorem of Lévy, if $Y$ is $(\mathcal A^{\mathbb N}\otimes\mathcal A\otimes\mathcal T)$-measurable and $\Pi'$-integrable, then $E_{\Pi'}(Y\mid\mathcal G_n)$ converges $\Pi'$-a.e. and in $L^1(\Pi')$ to $E_{\Pi'}(Y\mid\mathcal G_\infty)$.
Let us consider the $\Pi'$-integrable function
$Y(\omega,x,\theta):=p_\theta(x),$
for which $\int|Y|\,d\Pi'=\int_\Theta\int_\Omega p_\theta(x)\,d\mu(x)\,dQ(\theta)=1$ (the joint measurability of the likelihood is used here). We claim that
$E_{\Pi'}(Y\mid\mathcal G_n)(\omega,x)=p^*_{n,\omega}(x),\qquad\Pi'\text{-a.e.}\qquad(2)$
Indeed, given $A_n\in\mathcal A^n$ and $B\in\mathcal A$, writing $\tilde A_n:=A_n\times\Omega^{\{n+1,n+2,\ldots\}}$, we have that
$\int_{\tilde A_n\times B\times\Theta}Y\,d\Pi'=\int_\Theta P^n_\theta(A_n)\,P_\theta(B)\,dQ(\theta)=\int_{A_n}P^*_{n,\omega}(B)\,d\beta^*_n(\omega)=\int_{\tilde A_n\times B\times\Theta}p^*_{n,\omega}(x)\,d\Pi'(\omega,x,\theta),$
where the second equality follows from the defining property of the posterior distribution $Q_{n,\omega}$; since the sets $\tilde A_n\times B\times\Theta$ form a $\pi$-system generating $\mathcal G_n$ and $p^*_{n,\omega}(x)$ is $\mathcal G_n$-measurable, this proves (2).
Analogously, it can be shown that
$E_{\Pi'}(Y\mid\mathcal G_\infty)(\omega,x)=\int_\Theta p_\theta(x)\,dQ_\omega(\theta),\qquad\Pi'\text{-a.e.}\qquad(3)$
Hence, it follows from the aforementioned theorem of Lévy, together with (2) and (3), that
$\lim_n p^*_{n,\omega}(x)=\int_\Theta p_{\theta'}(x)\,dQ_\omega(\theta')\qquad\Pi'\text{-a.e.}\qquad(4)$
and
$\lim_n\int_{\Omega^{\mathbb N}\times\Omega\times\Theta}\Bigl|p^*_{n,\omega}(x)-\int_\Theta p_{\theta'}(x)\,dQ_\omega(\theta')\Bigr|\,d\Pi'(\omega,x,\theta)=0,\qquad(5)$
i.e., the posterior predictive densities $p^*_{n,\omega}$ converge, both $\Pi'$-a.e. and in $L^1(\Pi')$, to the posterior predictive density given the whole sequence of observations.
On the other hand, as a consequence of a known theorem of Doob (see Theorem 6.9 and Proposition 6.10 of [4], pp. 129-130), the regularity conditions (i)-(iii) imply that, for $Q$-almost every $\theta$, the posterior distributions are consistent at $\theta$; in particular, the parameter is $\Pi$-almost surely equal to a measurable function of the infinite sample, so that the posterior distribution given the whole sequence of observations is degenerate at the true value of the parameter: for $Q$-almost every $\theta$,
$Q_\omega=\delta_\theta\qquad\text{for }P^{\mathbb N}_\theta\text{-almost every }\omega.$
Hence, for $\Pi'$-almost every $(\omega,x,\theta)$,
$\int_\Theta p_{\theta'}(x)\,dQ_\omega(\theta')=p_\theta(x).\qquad(6)$
From (4), (5) and (6), it follows that $\lim_n p^*_{n,\omega}(x)=p_\theta(x)$ for $\Pi'$-almost every $(\omega,x,\theta)$ and that
$\lim_n\int_{\Omega^{\mathbb N}\times\Omega\times\Theta}\bigl|p^*_{n,\omega}(x)-p_\theta(x)\bigr|\,d\Pi'(\omega,x,\theta)=0.$
Since, by the definition of $\Pi'$ and Fubini's theorem,
$\int_{\Omega^{\mathbb N}\times\Omega\times\Theta}\bigl|p^*_{n,\omega}(x)-p_\theta(x)\bigr|\,d\Pi'(\omega,x,\theta)=\int_\Theta\int_{\Omega^{\mathbb N}}\Bigl(\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu\Bigr)\,dP^{\mathbb N}_\theta(\omega)\,dQ(\theta),$
we obtain
$\lim_n\int_\Theta\int_{\Omega^{\mathbb N}}\Bigl(\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu\Bigr)\,dP^{\mathbb N}_\theta(\omega)\,dQ(\theta)=0,$
i.e., the risk of the Bayes estimator of the density for the $L^1$ loss function goes to 0 when $n\to\infty$. Moreover, since $p^*_{n,\omega}$ and $p_\theta$ are probability densities with respect to $\mu$, the $\Pi'$-almost everywhere convergence obtained above yields, by Scheffé's theorem, that
$\lim_n\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu=0\qquad\text{for }P^{\mathbb N}_\theta\text{-almost every }\omega,$
for $Q$-almost every $\theta$; this is the strong consistency asserted in part (c) of the theorem below.
It follows from this and (1) that
$\lim_n\int_\Theta\int_{\Omega^{\mathbb N}}\sup_{A\in\mathcal A}\bigl|P^*_{n,\omega}(A)-P_\theta(A)\bigr|\,dP^{\mathbb N}_\theta(\omega)\,dQ(\theta)=0,$
i.e., the risk of the Bayes estimator of the sampling distribution $P_\theta$ for the total variation loss function goes to 0 when $n\to\infty$.
We ask whether these results remain true for the squared versions of the loss functions. The answer is affirmative because of the following general result: let $(X_n)_n$ be a sequence of real random variables on a probability space $(\Omega_0,\mathcal F_0,P_0)$ such that $\lim_n E(|X_n|)=0$. If there exists $C>0$ such that $|X_n|\le C$ $P_0$-a.s. for all $n$, then $\lim_n E(X_n^2)=0$, because
$E(X_n^2)\le C\,E(|X_n|).$
In our case, the underlying probability space is $(\Omega^{\mathbb N}\times\Theta,\mathcal A^{\mathbb N}\otimes\mathcal T,\Pi)$,
$X_n(\omega,\theta)=\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu\quad\Bigl(\text{resp. }X_n(\omega,\theta)=\sup_{A\in\mathcal A}\bigl|P^*_{n,\omega}(A)-P_\theta(A)\bigr|\Bigr),$
and $C=2$ (resp. $C=1$).
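The boundedness hypothesis is immediate in both cases (a one-line check of ours): since $p^*_{n,\omega}$ and $p_\theta$ are probability densities and $P^*_{n,\omega}$ and $P_\theta$ are probability measures,
$\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu\le\int_\Omega p^*_{n,\omega}\,d\mu+\int_\Omega p_\theta\,d\mu=2,\qquad\sup_{A\in\mathcal A}\bigl|P^*_{n,\omega}(A)-P_\theta(A)\bigr|\le 1.$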
So, we have proved the following result.
Theorem 1. Let $(\Omega,\mathcal A,\{P_\theta\colon\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian experiment dominated by a σ-finite measure μ. Let us assume that $(\Omega,\mathcal A)$ is a standard Borel space, and that Θ is a Borel subset of a Polish space and $\mathcal T$ is its Borel σ-field. Assume also that the likelihood function $(\omega,\theta)\mapsto p_\theta(\omega):=\frac{dP_\theta}{d\mu}(\omega)$ is $(\mathcal A\otimes\mathcal T)$-measurable and the family $\{P_\theta\colon\theta\in\Theta\}$ is identifiable. Then:
- (a)
The posterior predictive density $p^*_{n,\omega}$ is the Bayes estimator of the density $p_\theta$ in the product experiment corresponding to an $n$-sized sample for the squared $L^1$ loss function. Moreover, the Bayes risk converges to 0 for both the $L^1$ loss function and the squared $L^1$ loss function.
- (b)
The posterior predictive distribution $P^*_{n,\omega}$ is the Bayes estimator of the sampling distribution $P_\theta$ in the product experiment corresponding to an $n$-sized sample for the squared total variation loss function. Moreover, the Bayes risk converges to 0 for both the total variation loss function and the squared total variation loss function.
- (c)
The posterior predictive density $p^*_{n,\omega}$ is a strongly consistent estimator of the density $p_\theta$, i.e.,
$\lim_n\int_\Omega\bigl|p^*_{n,\omega}-p_\theta\bigr|\,d\mu=0\qquad\text{for }P^{\mathbb N}_\theta\text{-almost every }\omega,$
for Q-almost every $\theta$.
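As a concrete numerical illustration of part (c) (a sketch of ours; the Bernoulli model, the Beta prior and all numerical values are chosen only for the example), in the Bernoulli model with the counting measure as dominating measure the $L^1$ distance between the posterior predictive density and the true density equals $2\,|q_n-\theta|$ with $q_n=(a+s_n)/(a+b+n)$, and it is seen to vanish as $n$ grows:

```python
import numpy as np

# Strong consistency of the posterior predictive density in the
# Bernoulli(theta) model with a Beta(a, b) prior.  Here Omega = {0, 1},
# the dominating measure is the counting measure, and the posterior
# predictive density is q_n at x = 1 and 1 - q_n at x = 0, where
#   q_n = (a + s_n) / (a + b + n),   s_n = number of ones among the first n data.
# The L1 distance to the true density p_theta is then 2 * |q_n - theta|.
rng = np.random.default_rng(1)
a, b = 2.0, 3.0
theta_true = 0.35
data = rng.random(100_000) < theta_true   # a long i.i.d. Bernoulli(theta_true) sample

for n in [10, 100, 1_000, 10_000, 100_000]:
    s_n = data[:n].sum()
    q_n = (a + s_n) / (a + b + n)
    l1 = 2 * abs(q_n - theta_true)
    print(f"n = {n:6d}   L1 distance = {l1:.5f}")
```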