Pushing for the Extreme: Estimation of Poisson Distribution from Low Count Unreplicated Data—How Close Can We Get?

Tiňo, Peter

doi:10.3390/e15041202


by

Peter Tiňo

School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK

Entropy 2013, 15(4), 1202-1220; https://doi.org/10.3390/e15041202

Submission received: 15 January 2013 / Revised: 21 March 2013 / Accepted: 25 March 2013 / Published: 8 April 2013

(This article belongs to the Special Issue Distance in Information and Statistical Physics Volume 2)


Abstract

Studies of learning algorithms typically concentrate on situations where a potentially ever-growing training sample is available. Yet, there can be situations (e.g., detection of differentially expressed genes on unreplicated data or estimation of time delay in non-stationary gravitationally lensed photon streams) where only extremely small samples can be used to perform an inference. On unreplicated data, the inference has to be performed on the smallest sample possible, a sample of size 1. We study whether anything useful can be learnt in such extreme situations by concentrating on a Bayesian approach that can account for possible prior information on expected counts. We perform a detailed information theoretic study of such Bayesian estimation and quantify the effect of Bayesian averaging on its first two moments. Finally, to analyze potential benefits of the Bayesian approach, we also consider Maximum Likelihood (ML) estimation as a baseline approach. We show both theoretically and empirically that Bayesian model averaging can be potentially beneficial.

Keywords:

Poisson distribution; unreplicated data; Bayesian learning; expected Kullback–Leibler divergence

1. Introduction

Studies in (computational) learning theory mostly concentrate on situations where a potentially ever-increasing number of training examples is available. While such results can lead to deep insights into the workings of learning algorithms, e.g., linking together characteristics of the data-generating distributions, learning machines and sample sizes, there can be situations where, by the very nature of the problem, only extremely small samples are available. In such situations it is of utmost importance to analyze theoretically exactly what can be learnt, and under what circumstances. One example of such a scenario in count data is the detection of differentially expressed genes, where even subtle changes in gene expression levels can be indicators of biologically crucial processes [1]. When replicas are costly to obtain, one can attempt to use the limited data at one’s disposal to make the relevant inferences, as for example in the Audic and Claverie approach [2,3,4,5,6]. Another situation where available count data can be extremely sparse is the estimation of time delay in non-stationary gravitationally lensed photon streams. When the scale of variability of the source is of the order of, say, more than tens of days and observation gaps are not too long, one can resolve the time delay between lensed images of the same source by working directly with daily measurements of fluxes in the radio, optical or X-ray range [7,8,9,10]. However, when the variability scale is of the order of hours, one must turn to photon streams in the lensed images. One possibility for time delay detection in such cases is to compare counts in relatively short, time-shifted moving time windows in the lensed photon streams.

In this paper we theoretically study what happens in the extreme situation of unreplicated data, when the inference has to be performed on the smallest sample possible: a sample of size 1. We consider a model-based Bayesian approach that averages over possible Poisson models with weighting determined by the posterior over the models, given the single observation. In fact, such a Bayesian approach has been considered in the bioinformatics literature under the assumption of a flat improper prior over the Poisson rate parameter [2,3,4,5,6]. One can, of course, be excused for being highly sceptical about the relevance of such inferences, yet the methodology has apparently been used in a number of successful studies. In an attempt to build theoretical foundations behind such inference schemes, we proved a rather surprising result [11]: the expected Kullback–Leibler divergence from the true unknown Poisson distribution to its model learnt from a single realization never exceeds 1/2 bit.

Even though the field of bioinformatics is moving fast and better procedures for detection of differentially expressed genes have been introduced (e.g., not relying on the Poisson assumption, specifically taking into account potential dependencies among the genes, etc.), the primary focus of this study is different. Irrespective of the application domain, we theoretically investigate how reliably a model for count data can be built from a single count observation, under the assumption of a Poisson source. There are two issues that need careful consideration:

Equal a-priori weighting (flat prior) over possible (unknown) Poisson sources is unrealistic. Typical values of observed counts are usually bounded by the nature of the problem (e.g., gene magnification setting used in the experiments or time window on the photon streams). One may have a good initial (a-priori) guess as to what ranges of typical observed counts might be reasonably expected. In particular, we are interested in the low count regimes. In such cases, it is desirable to incorporate such prior knowledge into the inference mechanism. In this study, we do this in the Bayesian framework through prior distribution over the expected counts.
To understand potential benefits of the proposed learning/inference method (in our case Bayesian approach), it is important to compare it with a simple straightforward baseline (here maximum likelihood estimation). We contrast the expected Kullback–Leibler divergences from the true unknown Poisson distribution to its Bayesian and maximum likelihood estimates, inferred from a single realization.

The paper has the following organization. In Section 2 we introduce the maximum likelihood and Bayesian (with flat prior over mean rates) approaches to inferring predictive distribution over counts based on a single count observation. We also briefly review past work on information theoretic properties of the two approaches. Section 3 contains derivation of a more general Bayesian approach with gamma prior on the mean count parameter. In Section 4 we calculate the first two central moments of our generalized model. This enables us to better understand the influence of the prior on the inferred model and highlight the differences with the previous approach using the flat (improper) prior. In Section 5 we perform an information theoretic study of learning capabilities of the generalized model. Empirical investigations are presented in Section 6 and the main findings are discussed and summarized in Section 7.

2. Single Count Data—Bayesian and Maximum Likelihood Approaches

In this section we will briefly review the original Audic–Claverie [2] and maximum likelihood approaches outside the bioinformatics context.

2.1. Bayesian Averaging in the Audic–Claverie Approach

Let x be an observed count in an experiment. When repeating the experiment, possibly under different conditions, we observe a (possibly different) count y. The quantity of interest is the probability of observing y given that we already observed x, not knowing the identity of the generating Poisson source

P (X = x | λ) = e^{- λ} \frac{λ^{x}}{x!}

(1)

where

λ \geq 0

is the (unknown) parameter representing the mean count value.

Under the null hypothesis (not differentially expressed genes), both counts x and y come from the same underlying Poisson distribution

P (\cdot | λ)

. The key instrument in the Audic–Claverie approach is a distribution

P_{A C} (y | x)

over counts y informed by the observed count x, under the null hypothesis.

P_{A C} (y | x)

is obtained by Bayesian averaging (infinite mixture) of all possible Poisson distributions

P (y | λ^{'})

with mixing proportions equal to the posteriors

p (λ^{'} | x)

under the flat prior over λ. Formally, the probability of count y, given the observed count x from the same (unknown) Poisson distribution is:

\begin{aligned} P (y | x) &= \int_{0}^{\infty} p (y, λ | x) d λ \\ &= \int_{0}^{\infty} P (y | λ, x) p (λ | x) d λ \\ &= \int_{0}^{\infty} P (y | λ) \frac{P (x | λ) p (λ)}{\int_{0}^{\infty} P (x | λ^{'}) p (λ^{'}) d λ^{'}} d λ \end{aligned}

(2)

Imposing the flat (improper) prior

p (λ)

over the Poisson parameter λ results in

P_{A C} (y | x) = \frac{1}{y!} \frac{\int_{0}^{\infty} e^{- 2 λ} λ^{x + y} d λ}{\int_{0}^{\infty} e^{- λ} λ^{x} d λ}

Since the Gamma distribution parameterized by

a, b > 0

takes the form

\mathrm{Gamma} (λ | a, b) = \frac{1}{Γ (a)} b^{a} λ^{a - 1} e^{- b λ}

where

Γ (a) = \int_{0}^{\infty} u^{a - 1} e^{- u} d u

is the Gamma function, we have

P_{A C} (y | x) = \frac{1}{y! 2^{x + y + 1}} \frac{Γ (x + y + 1)}{Γ (x + 1)}

(3)

which, since x and y are integers (i.e.,

Γ (x) = (x - 1)!

), can be rewritten as

P_{A C} (y | x) = \frac{1}{2^{x + y + 1}} \frac{(x + y)!}{x! y!}

(4)

= \frac{1}{2^{x + y + 1}} \binom{x + y}{x}

(5)

P_{A C} (\cdot | x)

can then be used, e.g., for principled inferences, construction of confidence intervals or statistical testing.
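As a quick numerical illustration of Equation (5), the following Python sketch evaluates the Audic–Claverie predictive distribution with exact integer binomials and checks that it is normalized. The function name `p_ac`, the observed count x = 3 and the truncation point of the sum are illustrative assumptions, not part of the original method.

```python
from math import comb

def p_ac(y: int, x: int) -> float:
    """Audic-Claverie predictive probability of count y given observed count x (Equation (5))."""
    return comb(x + y, x) / 2 ** (x + y + 1)

# Predictive distribution over y for an observed count x = 3.
# The sum is truncated at y < 200 (an assumption); the neglected tail mass is
# far below floating-point precision for such a small x.
x = 3
probs = [p_ac(y, x) for y in range(200)]
print("total probability mass:", sum(probs))
print("first few probabilities:", [round(p, 5) for p in probs[:6]])
```

Exact integer arithmetic is used only because it is convenient for small counts; for large x and y a log-gamma formulation would be the more robust choice.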

2.2. Information Theory of $P_{A C} (y | x)$

Consider a “true” underlying Poisson distribution

P (y | λ)

(1) over possible counts

y \geq 0

. We first use

P (\cdot | λ)

to generate a count x and then employ

P_{A C} (y | x)

(5) as a model distribution over y, given the already observed count x. We ask: If we repeated the process above, how different, in terms of Kullback–Leibler divergence, are on average the two distributions over y? One would naturally hope that

P_{A C} (y | x)

is sufficiently representative of the true unknown distribution

P (y | λ)

.

In [11] we proved that, given an underlying Poisson distribution

P (x | λ)

, if we repeatedly generated a “representative” count x from

P (x | λ)

, the average divergence

E (λ)

of

P_{A C} (y | x)

from the truth

P (y | λ)

would never exceed 1/2 bit.

Theorem 1 [11] Consider an underlying Poisson distribution

P (\cdot | λ)

parameterized by some

λ > 0

. Then

E (λ) = E_{P (x | λ)} [D_{K L} [P (y | λ) ∥ P_{A C} (y | x)]] = \frac{1}{2} log 2 + O (\frac{1}{λ})

where

D_{K L} [P (y | λ) ∥ P_{A C} (y | x)]

is the Kullback–Leibler divergence from

P (y | λ)

to

P_{A C} (y | x)

,

D_{K L} [P (y | λ) ∥ P_{A C} (y | x)] = \sum_{y = 0}^{\infty} P (y | λ) log \frac{P (y | λ)}{P_{A C} (y | x)}

The expected divergence (in bits) can be well-approximated (up to order

O (λ^{- 3})

) by [11]:

E (λ) \approx \frac{1}{2} - \frac{1}{12 λ} (1 - \frac{1}{2}) - \frac{1}{24 λ^{2}} (1 - \frac{1}{2^{2}})

(6)
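The expected divergence in Theorem 1 can also be evaluated by brute force, directly from its definition as a double sum over x and y. The Python sketch below does this with truncated sums; the helper names and the truncation rule for the sums are assumptions made for illustration. Because Theorem 1 carries an O(1/λ) term that is not negligible for small λ, the sketch is only checked against the 1/2 bit level at moderate and large λ.

```python
import math

LN2 = math.log(2.0)

def log2_poisson(k: int, lam: float) -> float:
    """log2 of the Poisson probability P(k | lam), Equation (1)."""
    return (-lam + k * math.log(lam) - math.lgamma(k + 1)) / LN2

def log2_p_ac(y: int, x: int) -> float:
    """log2 of P_AC(y | x), Equation (5), via log-gamma to avoid overflow."""
    log_binom = math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
    return log_binom / LN2 - (x + y + 1)

def expected_divergence_ac(lam: float) -> float:
    """Expected KL divergence (bits) from P(.|lam) to P_AC(.|x), with x ~ P(.|lam)."""
    # Assumed truncation: mean plus a generous multiple of the standard deviation.
    n_max = int(lam + 12.0 * math.sqrt(lam) + 30.0)
    log_w = [log2_poisson(k, lam) for k in range(n_max)]
    w = [2.0 ** lw for lw in log_w]
    total = 0.0
    for x in range(n_max):
        # KL divergence for this particular observed count x.
        kl = sum(w[y] * (log_w[y] - log2_p_ac(y, x)) for y in range(n_max))
        total += w[x] * kl
    return total

for lam in (1.0, 5.0, 20.0):
    print(lam, expected_divergence_ac(lam))
```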

2.3. $P_{A C} (y | x)$ vs. Maximum Likelihood

In this section we will briefly recall information theoretic analysis of the maximum likelihood estimate

P_{M L} (y | x)

in place of

P_{A C} (y | x)

[12]. First note that Poisson distribution

P (y | λ)

is only defined for positive λ. In the case of observing zero count

x = 0

, we cannot directly use the “maximum likelihood estimate”

P (y | 0)

. One option for dealing with zero observed counts is to allow for some form of model regularization, e.g., infer a Poisson model

P (y | ϵ)

, for some small

ϵ > 0

. In other words, if a count

x \geq 1

is observed, follow the standard maximum likelihood procedure and infer

P_{M L} (y | x) = P (y | x)

as the Poisson model; if a zero count is observed,

x = 0

, infer

P_{M L} (y | 0) = P (y | ϵ)

for some fixed

ϵ \in (0, 1]

. This is the route taken in [12] and adopted in this paper. Only a minimum amount of necessary regularization due to zero observed counts is employed in the otherwise straightforward ML approach.

Theorem 2 [12] Consider an underlying Poisson distribution

P (\cdot | λ)

parameterized by some

λ > 0

and a regularization constant

ϵ \in (0, 1]

. The expected divergence in bits

Υ (λ, ϵ)

between the true Poisson source and its (regularized) maximum likelihood estimate based on a single observation,

Υ (λ, ϵ) = E_{P (x | λ)} [D_{K L} [P (y | λ) ∥ P_{M L} (y | x)]]

is equal to

Υ (λ, ϵ) = λ ({log}_{2} λ - \sum_{x = 1}^{\infty} P (x | λ) {log}_{2} x) + e^{- λ} (ϵ - λ {log}_{2} ϵ)

(7)

Note that the expected divergence

Υ (λ, ϵ)

can get prohibitively large when regularizing with small

ϵ > 0

. As an illustration, in Figure 1 we show expected divergence

Υ (λ, ϵ = 1)

of the ML estimation (zero count regularized with

ϵ = 1

) for a range of mean parameter values λ of the underlying Poisson source (solid line). Also shown is the expected divergence

E (λ)

of

P_{A C} (y | x)

(dashed line). Except for very small Poisson source rates λ,

P_{A C} (y | x)

is clearly benefitting from the stabilizing effect of Bayesian averaging, given the extremely small sample size.
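For comparison, the expected divergence of the regularized ML estimate can be evaluated numerically from its definition, using the standard closed form for the Kullback–Leibler divergence between two Poisson distributions. The sketch below is a minimal illustration under assumed names (`kl_poisson_bits`, `expected_divergence_ml`) and an assumed truncation of the sum over x; it evaluates the definition of Υ(λ, ϵ) directly rather than the closed form (7).

```python
import math

def kl_poisson_bits(lam: float, mu: float) -> float:
    """KL divergence (bits) from Poisson(lam) to Poisson(mu): (mu - lam + lam*ln(lam/mu)) / ln 2."""
    return ((mu - lam) + lam * math.log(lam / mu)) / math.log(2.0)

def expected_divergence_ml(lam: float, eps: float = 1.0) -> float:
    """Expected KL divergence (bits) of the regularized ML estimate from a single count."""
    # Assumed truncation of the sum over observed counts x.
    n_max = int(lam + 12.0 * math.sqrt(lam) + 30.0)
    total = 0.0
    for x in range(n_max):
        p_x = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
        # ML estimate of the mean: x itself, or eps when a zero count is observed.
        mu = float(x) if x >= 1 else eps
        total += p_x * kl_poisson_bits(lam, mu)
    return total

for lam in (2.0, 5.0, 20.0):
    print(lam, expected_divergence_ml(lam, eps=1.0))
```

At moderate and large λ the printed values sit noticeably above 1/2 bit, in line with the comparison described for Figure 1.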

Figure 1. Expected divergence (in bits)

Υ (λ, ϵ = 1)

of the ML estimation (zero count regularized with

ϵ = 1

) (solid line). Also shown is the expected divergence

E (λ)

of

P_{A C} (y | x)

(dashed line).


3. Generalized $P_{A C} (y | x)$ with Gamma Prior

In this section we will generalize

P_{A C} (y | x)

through the use of (conjugate) gamma prior

P (λ | α, β) = \frac{β^{α}}{Γ (α)} λ^{α - 1} e^{- β λ}

on the Poisson mean parameter λ. The positive parameters

α, β

determine the overall shape of the prior. Given a single observation x, the posterior

P (λ | x, α, β) = \frac{P (x | λ) P (λ | α, β)}{\int_{0}^{\infty} P (x | λ) P (λ | α, β) d λ}

is the gamma distribution with parameters

α + x

and

β + 1

,

P (λ | x, α, β) = \frac{{(β + 1)}^{α + x}}{Γ (α + x)} λ^{α + x - 1} e^{- (β + 1) λ}

The mean of

P (λ | x, α, β)

is equal to

(α + x) / (β + 1)

. A loose intuitive interpretation of the prior parameters

α, β

(assuming they are integers) is that prior to seeing the current data (in our case only one observation (count) x), we have seen β “observations”,

x_{1}^{'}, x_{2}^{'}, . . ., x_{β}^{'}

, with the total cumulative count

α = x_{1}^{'} + x_{2}^{'} + . . . + x_{β}^{'}

. Hence the mean parameter estimate would shift from x (ML estimation corresponding to

α, β \to 0

) to

(x_{1}^{'} + x_{2}^{'} + . . . + x_{β}^{'} + x) / (β + 1)

.

As in the case of

P_{A C} (y | x)

, having observed a count x, we build a predictive distribution over future counts y by integrating out the mean parameter λ with respect to the posterior

P (λ | x, α, β)

,

\begin{aligned} P_{G} (y | x, α, β) &= \int_{0}^{\infty} P (y | λ) P (λ | x, α, β) d λ \\ &= \frac{{(β + 1)}^{α + x}}{Γ (α + x)} \frac{1}{y!} \int_{0}^{\infty} λ^{α + x + y - 1} e^{- (β + 2) λ} d λ \end{aligned}

(8)

From normalization of the gamma distribution we get

\int_{0}^{\infty} λ^{a - 1} e^{- b λ} d λ = \frac{Γ (a)}{b^{a}}

and so

\int_{0}^{\infty} λ^{α + x + y - 1} e^{- (β + 2) λ} d λ = \frac{Γ (x + y + α)}{{(β + 2)}^{x + y + α}}

leading to

P_{G} (y | x, α, β) = \frac{1}{y!} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}}

(9)

It can be easily verified that the original

P_{A C} (y | x)

is obtained as a special case of

P_{G} (y | x, α, β)

when

α = 1

and

β \to 0

. If the Jeffreys prior were used instead of the flat prior in

P_{A C} (y | x)

, we would obtain

P_{G} (y | x, α, β)

with

α = 1 / 2

and

β \to 0

etc.

If α is an integer, we have

P_{G} (y | x, α, β) = {(\frac{1 + β}{2 + β})}^{x^{'} + 1} {(\frac{1}{2 + β})}^{y} (\binom{x^{'} + y}{y})

(10)

where

x^{'} = x + α - 1

is the observed count including prior observations. This expression generalizes

P_{A C} (y | x)

(5),

P_{A C} (y | x) = {(\frac{1}{2})}^{x + 1} {(\frac{1}{2})}^{y} (\binom{x + y}{y})
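Equation (9) translates directly into code. The following Python sketch evaluates the generalized predictive distribution in log space, so that non-integer α and larger counts are handled without overflow, and checks that the original Audic–Claverie distribution is recovered for α = 1, β = 0. The function name `p_g` and the test values are illustrative assumptions.

```python
import math

def p_g(y: int, x: int, alpha: float, beta: float) -> float:
    """Generalized predictive probability P_G(y | x, alpha, beta), Equation (9)."""
    log_p = (
        math.lgamma(x + y + alpha)
        - math.lgamma(x + alpha)
        - math.lgamma(y + 1)                      # log y!
        + (x + alpha) * math.log(beta + 1.0)
        - (x + y + alpha) * math.log(beta + 2.0)
    )
    return math.exp(log_p)

# Special case alpha = 1, beta = 0: should agree with binom(x+y, x) / 2^(x+y+1).
print(p_g(2, 3, alpha=1.0, beta=0.0), math.comb(5, 3) / 2 ** 6)

# A non-integer alpha (Jeffreys-like choice alpha = 1/2) still gives a normalized distribution.
print(sum(p_g(y, 2, alpha=0.5, beta=0.3) for y in range(400)))
```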

While

P_{G} (y | x, α, β)

(9) can be used with any appropriate setting of

α, β

(e.g., given prior knowledge of the range of counts one may reasonably expect), in this contribution we concentrate on using the gamma prior to mitigate the unrealistic equal weighting of all

λ > 0

in the flat prior behind

P_{A C} (y | x)

. Indeed, the observed counts are typically bounded by the nature of the problem and one can represent this through setting

α = 1

and varying

β > 0

in the gamma prior

P (λ | α, β)

underlying

P_{G} (y | x, α, β)

. Some examples of such priors are shown in Figure 2. Decreasing β leads to weaker emphasis on low λ, eventually recovering the flat (improper) prior for

β = 0

.

Figure 2. Gamma prior

P (λ | α = 1, β)

. Shown are the priors for three possible values of parameter β,

β \in \{1, 0.1, 0.05\}

.


In Section 2.3 maximum likelihood estimation was regularized at zero count by imposing a non-zero “count” ϵ instead of the observed zero one. The generalized form of

P_{A C} (y | x)

,

P_{R} (y | x, β) = P_{G} (y | x, α = 1, β)

can also be viewed as an alternative “soft” form of regularization of the maximum likelihood approach at zero counts.

Parameter β in the Gamma prior

P (λ | α = 1, β) = β e^{- β λ}

can be set in a data driven manner, e.g., using the following strategy: Given the observed count x, we require that the area up to

x + 1

covered by the prior is equal to θ, for some threshold

θ \in (0, 1)

(e.g.,

θ = 1 / 4

). In other words,

F (x + 1 | β) = θ

, where

F (λ | β) = 1 - e^{- β λ}

is the cumulative distribution function of

P (λ | α = 1, β)

. This leads to

β (x) = - \frac{ln (1 - θ)}{x + 1}

(11)

For zero observed count

x = 0

,

β (0) = - ln (1 - θ)

and the prior gets more concentrated on smaller values of λ as likely candidates for the mean count of the underlying Poisson source. With increasing count values

x > 0

the parameter

β (x)

decreases to 0 and the prior gradually approaches the flat prior of

P_{A C} (y | x)

.
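The data-driven choice of β in Equation (11) is a one-line computation. The sketch below, with the assumed function name `beta_from_count` and the example threshold θ = 1/4 mentioned above, shows how β(x) shrinks as the observed count grows and confirms that the prior mass below x + 1 equals θ.

```python
import math

def beta_from_count(x: int, theta: float = 0.25) -> float:
    """Rate beta(x) of the exponential prior such that its mass on [0, x + 1] equals theta (Equation (11))."""
    return -math.log(1.0 - theta) / (x + 1)

for x in (0, 1, 5, 20):
    b = beta_from_count(x)
    # Prior CDF F(x + 1 | beta) = 1 - exp(-beta * (x + 1)) should reproduce theta.
    mass = 1.0 - math.exp(-b * (x + 1))
    print(x, round(b, 5), round(mass, 5))
```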

Finally, we contrast

P_{G} (y | x, α, β)

with the negative binomial distribution

P_{N B} (y | r, q) = \frac{1}{y!} \frac{Γ (r + y)}{Γ (r)} q^{r} {(1 - q)}^{y}

(12)

with parameters

r > 0

and

q \in [0, 1]

. One interpretation of the negative binomial distribution

P_{N B} (y | r, q)

is that it corresponds to a Gamma–Poisson mixture that one obtains by imposing a Gamma prior

P (λ | r, (1 - q) / q)

on the mean count parameter λ of the Poisson distribution

P (y | λ)

and integrating out λ. In our context it is natural to identify r and

(1 - q) / q

with hyperparameters α and β used in

P_{G} (y | x, α, β)

. It follows that

q = {(β + 1)}^{- 1}

. Hence, we rewrite Equation (12) as

P_{N B} (y | α, {(β + 1)}^{- 1}) = \frac{1}{y!} \frac{Γ (α + y)}{Γ (α)} \frac{β^{y}}{{(β + 1)}^{α + y}}

(13)

Direct comparison of (13) with (9) leads to an intuitive insight: The β prior measurements of total count α introduced by the gamma prior

P (λ | α, β)

are in the case of

P_{G} (y | x, α, β)

extended with a single observation x, resulting in

β + 1

observations of total count

α + x

. This can be represented by

P_{N B} (y | α + x, {(β + 2)}^{- 1}) = \frac{1}{y!} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{y}}{{(β + 2)}^{x + y + α}}

(14)

It follows that

\frac{P_{G} (y | x, α, β)}{P_{N B} (y | α + x, {(β + 2)}^{- 1})} = {(β + 1)}^{x + α - y} .

Bayesian averaging in

P_{G} (y | x, α, β)

with respect to the posterior over λ, given a count x, differs from the corresponding negative binomial distribution

P_{N B} (y | α + x, {(β + 2)}^{- 1})

by the factor

{(β + 1)}^{x + α - y}

that depends on the difference between the prior+observed count

α + x

and y.

4. First and Second Moments of the Generalized $P_{A C} (y | x)$

In [11] we showed that

P_{A C} (y | x)

and the underlying Poisson distribution are quite similar in their nature: for any (integer) mean rate

λ \geq 1

, the Poisson distribution

P (\cdot | λ)

has two neighboring modes located at λ and

λ - 1

, with

P (λ | λ) = P (λ - 1 | λ)

. Analogously, given a count

x \geq 1

,

P_{A C} (\cdot | x)

has two neighboring modes, one located at x, the other at

x - 1

, with

P_{A C} (x | x) = P_{A C} (x - 1 | x)

. As in Poisson distribution, the values of

P_{A C} (y | x)

decrease as one moves away from the modes in both directions. In this section we derive the first two moments of the generalized

P_{A C} (y | x)

,

P_{G} (y | x, α, β)

. As a special case, we will show that as a result of Bayesian averaging, the variance of

P_{A C} (y | x)

is double that of the underlying (unobserved) Poisson distribution.

Theorem 3 Consider a non-negative integer x and the associated generalized model

P_{G} (y | x, α, β)

. Then,

E_{P_{G} (y | x, α, β)} [y] = \frac{x + α}{β + 1}, \quad \mathrm{Var}_{P_{G} (y | x, α, β)} [y] = \frac{β + 2}{β + 1} E_{P_{G} (y | x, α, β)} [y]

Proof: Let us evaluate

\begin{aligned} E_{P_{G} (y | x, α, β)} [y] &= \sum_{y = 0}^{\infty} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}} \frac{1}{y!} y \\ &= \sum_{y = 1}^{\infty} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}} \frac{1}{(y - 1)!} \\ &= \sum_{y^{'} = 0}^{\infty} \frac{Γ (x + y^{'} + 1 + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y^{'} + 1 + α}} \frac{1}{y^{'}!} \\ &= \sum_{y^{'} = 0}^{\infty} \frac{Γ (x + y^{'} + α) \cdot (x + y^{'} + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{(β + 2) \cdot {(β + 2)}^{x + y^{'} + α}} \frac{1}{y^{'}!} \end{aligned}

(15)

In the third equality we have used substitution

y^{'} = y - 1

and the last equality follows from

Γ (z + 1) = z \cdot Γ (z)

. By (15),

E_{P_{G} (y | x, α, β)} [y] = \sum_{y = 0}^{\infty} P_{G} (y | x, α, β) \frac{x + α + y}{β + 2}

(16)

= \frac{x + α}{β + 2} + \frac{1}{β + 2} E_{P_{G} (y | x, α, β)} [y]

(17)

Solving (17) we obtain

E_{P_{G} (y | x, α, β)} [y] = \frac{x + α}{β + 1}

(18)

For the variance of

P_{G} (y | x, α, β)

we have

\mathrm{Var}_{P_{G} (y | x, α, β)} [y] = E_{P_{G} (y | x, α, β)} [y^{2}] - {(E_{P_{G} (y | x, α, β)} [y])}^{2}

(19)

Now,

\begin{aligned} E_{P_{G} (y | x, α, β)} [y^{2}] &= \sum_{y = 0}^{\infty} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}} \frac{1}{y!} y^{2} \\ &= \sum_{y = 1}^{\infty} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}} \frac{1}{y!} y^{2} \\ &= \sum_{y = 1}^{\infty} \frac{Γ (x + y + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y + α}} \frac{1}{(y - 1)!} y \\ &= \sum_{y^{'} = 0}^{\infty} \frac{Γ (x + y^{'} + 1 + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{{(β + 2)}^{x + y^{'} + 1 + α}} \frac{1}{y^{'}!} (y^{'} + 1) \\ &= \sum_{y^{'} = 0}^{\infty} \frac{(x + y^{'} + α) Γ (x + y^{'} + α)}{Γ (x + α)} \frac{{(β + 1)}^{x + α}}{(β + 2) {(β + 2)}^{x + y^{'} + α}} \frac{1}{y^{'}!} (y^{'} + 1) \\ &= \frac{1}{β + 2} \sum_{y^{'} = 0}^{\infty} P_{G} (y^{'} | x, α, β) [(x + y^{'} + α) (y^{'} + 1)] \\ &= \frac{1}{β + 2} \sum_{y^{'} = 0}^{\infty} P_{G} (y^{'} | x, α, β) [x + y^{'} + α] \\ &\quad + \frac{1}{β + 2} \sum_{y^{'} = 0}^{\infty} P_{G} (y^{'} | x, α, β) [y^{'} (x + α) + {(y^{'})}^{2}] \end{aligned}

(20)

Using (16), (18) and (20), we obtain

\begin{aligned} E_{P_{G} (y | x, α, β)} [y^{2}] &= E_{P_{G} (y | x, α, β)} [y] + \frac{x + α}{β + 2} E_{P_{G} (y | x, α, β)} [y] + \frac{1}{β + 2} E_{P_{G} (y | x, α, β)} [y^{2}] \\ &= \frac{x + α}{β + 1} (1 + \frac{x + α}{β + 2}) + \frac{1}{β + 2} E_{P_{G} (y | x, α, β)} [y^{2}] \end{aligned}

(21)

which can be solved as

E_{P_{G} (y | x, α, β)} [y^{2}] = \frac{(x + α) (x + α + β + 2)}{{(β + 1)}^{2}}

(22)

Plugging (22) into (19) we obtain

\mathrm{Var}_{P_{G} (y | x, α, β)} [y] = \frac{(x + α) (β + 2)}{{(β + 1)}^{2}} = \frac{β + 2}{β + 1} E_{P_{G} (y | x, α, β)} [y]

☐

Given an observation x, the maximum likelihood estimate of the underlying Poisson distribution is the Poisson distribution with mean x,

P (y | x) = e^{- x} \frac{x^{y}}{y!}

After observing x, the mean of the maximum likelihood and

P_{A C} (\cdot | x)

estimates is x and

x + 1

, respectively. Hence, Bayesian averaging in

P_{A C} (\cdot | x)

induced by the flat improper prior over the mean rate λ results in increased expected value

x + 1

of the next count from the same underlying source, given the current count x. However, a much more marked consequence of using the flat prior can be seen in the variance of

P_{A C} (\cdot | x)

: while variance of the maximum likelihood is x, it is

2 (x + 1)

in

P_{A C} (\cdot | x)

.
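The moment formulas of Theorem 3 can be checked numerically by summing over the predictive distribution (9). The sketch below is a minimal illustration; the function names and the number of terms kept in the sums are assumptions. For α = 1, β = 0 and x = 4 it should give a mean close to x + 1 = 5 and a variance close to 2(x + 1) = 10.

```python
import math

def p_g(y: int, x: int, alpha: float, beta: float) -> float:
    """Generalized predictive probability P_G(y | x, alpha, beta), Equation (9)."""
    log_p = (
        math.lgamma(x + y + alpha) - math.lgamma(x + alpha) - math.lgamma(y + 1)
        + (x + alpha) * math.log(beta + 1.0)
        - (x + y + alpha) * math.log(beta + 2.0)
    )
    return math.exp(log_p)

def predictive_moments(x: int, alpha: float, beta: float, n_terms: int = 2000):
    """Mean and variance of P_G(. | x, alpha, beta) by truncated summation (n_terms is an assumption)."""
    probs = [p_g(y, x, alpha, beta) for y in range(n_terms)]
    mean = sum(y * p for y, p in enumerate(probs))
    second = sum(y * y * p for y, p in enumerate(probs))
    return mean, second - mean ** 2

x, alpha, beta = 4, 1.0, 0.0
mean, var = predictive_moments(x, alpha, beta)
print(mean, (x + alpha) / (beta + 1.0))                  # numerical vs. Theorem 3 mean
print(var, (beta + 2.0) / (beta + 1.0) * mean)           # numerical vs. Theorem 3 variance
```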

Theorem 3 illustrates the role of more concentrated prior over λ on the generalized model. The mean expected count, after seeing x, is equal to the mean of the posterior

P (λ | x, α, β)

over λ, namely

(α + x) / (β + 1)

. As explained earlier, observed single count x with prior β counts of cumulative value α results in

β + 1

counts of cumulative value

α + x

. Hence the mean count per observation is

(α + x) / (β + 1)

. As with Poisson distribution, the variance of the generalized model is closely related to its mean and approaches the mean with increasing number of prior counts β.

As for the soft regularization

P_{R} (y | x, β) = P_{G} (y | x, α = 1, β)

, its mean is, as expected, biased towards values smaller than the observed count x, provided

β > 1 / x

. Increased values of β result in smaller variance of

P_{R} (y | x, β)

. But how do such prior parameter modifications manifest themselves in terms of accuracy of estimation of the underlying source? This question is investigated in the next section.

5. Expected Divergence of the Generalized $P_{A C} (y | x)$ from the True Underlying Poisson Distribution

Consider an underlying Poisson source

P (x | λ)

generating counts x. In this section we would like to quantify the average divergence

E_{G} (λ; β) = E_{P (x | λ)} [D_{K L} [P (y | λ) ∥ P_{R} (y | x, β)]]

(23)

of the corresponding generalized

P_{A C} (y | x)

,

P_{R} (y | x, β) = P_{G} (y | x, α = 1, β)

(“softly” regularized ML), from the truth

P (y | λ)

, if we repeatedly generated a “representative” count x from

P (x | λ)

. The same question was considered in the context of maximum likelihood estimation in Section 2.3. In particular, we are interested in specifying under what circumstances is the generalized form of

P_{A C} (y | x)

,

P_{R} (y | x, β) = P_{G} (y | x, α = 1, β)

, preferable to the original

P_{A C} (y | x) = P_{G} (y | x, α = 1, β \to 0)

and how it fares with the maximum likelihood estimation

P_{M L} (y | x)

of Section 2.3.

Theorem 4 Consider an underlying Poisson distribution

P (\cdot | λ)

parameterized by some

λ > 0

. Then for

β \geq 0

,

E_{G} (λ; β) = {log}_{2} (\frac{β + 2}{β + 1}) - \frac{1}{2} + λ [2 {log}_{2} (\frac{β + 2}{2}) - {log}_{2} (β + 1)] + O (λ^{- 1})

(24)

A higher order approximation (up to order

λ^{- 3}

) reads:

\begin{aligned} E_{G} (λ; β) &= {log}_{2} (\frac{β + 2}{β + 1}) - \frac{1}{2} + λ [2 {log}_{2} (\frac{β + 2}{2}) - {log}_{2} (β + 1)] \\ &\quad - \frac{1}{12 λ} (1 - \frac{1}{2}) - \frac{1}{24 λ^{2}} (1 - \frac{1}{2^{2}}) - \frac{19}{360 λ^{3}} (1 - \frac{1}{2^{3}}) + O (λ^{- 4}) \end{aligned}

(25)

Proof: Let us first express the divergence

D_{β} (λ, x) = D_{K L} [P (y | λ) ∥ P_{R} (y | x, β)]

. We have

D_{β} (λ, x) = - H [P (y | λ)] - E_{P (y | λ)} [log P_{R} (y | x, β)]

where

H [P (y | λ)] = - E_{P (y | λ)} [log P (y | λ)]

is the entropy of the source

P (y | λ)

and

\begin{aligned} E_{P (y | λ)} [log P_{R} (y | x, β)] &= - log x! - E_{P (y | λ)} [y] log (β + 2) - (x + 1) log (\frac{β + 2}{β + 1}) \\ &\quad - E_{P (y | λ)} [log y!] + E_{P (y | λ)} [log (x + y)!] \end{aligned}

Denoting (for integer

d \geq 0

)

E_{P (y | λ)} [log (y + d)!]

by

F (λ, d)

, we write

\begin{aligned} D_{β} (λ, x) &= - H [P (y | λ)] + log x! + λ log (β + 2) + (x + 1) log (\frac{β + 2}{β + 1}) \\ &\quad + F (λ, 0) - F (λ, x) \end{aligned}

We are now ready to calculate the expectation

E_{G} (λ; β) = E_{P (x | λ)} [D_{β} (λ, x)]

.

\begin{aligned} E_{G} (λ; β) &= - H [P (y | λ)] + F (λ, 0) + λ log (β + 2) + (λ + 1) log (\frac{β + 2}{β + 1}) \\ &\quad + F (λ, 0) - E_{P (x | λ)} [F (λ, x)] \end{aligned}

We have proved in [11] that

E_{P (x | λ)} [F (λ, x)] = F (2 λ, 0)

, and so

\begin{aligned} E_{G} (λ; β) &= - H [P (y | λ)] + log (\frac{β + 2}{β + 1}) + λ log (\frac{{(β + 2)}^{2}}{β + 1}) \\ &\quad + 2 F (λ, 0) - F (2 λ, 0) \end{aligned}

Since

\begin{aligned} - H [P (y | λ)] &= E_{P (y | λ)} [log P (y | λ)] \\ &= - λ log e + E_{P (y | λ)} [y] log λ - E_{P (y | λ)} [log y!] \\ &= - λ log e + λ log λ - F (λ, 0) \end{aligned}

(26)

we have

\begin{aligned} E_{G} (λ; β) &= log (\frac{β + 2}{β + 1}) + λ [log λ + log (\frac{{(β + 2)}^{2}}{β + 1}) - log e] \\ &\quad + F (λ, 0) - F (2 λ, 0) \end{aligned}

(27)

Using entropy approximation (see [11]), one obtains

F (λ, 0) = λ (log λ - log e) + \frac{1}{2} log (2 π e λ) + O (λ^{- 1})

leading to (in log base 2)

F (λ, 0) - F (2 λ, 0) = - \frac{1}{2} + λ ({log}_{2} e - {log}_{2} λ - 2) + O (λ^{- 1})

Finally,

\begin{aligned} E_{G} (λ; β) &= {log}_{2} (\frac{β + 2}{β + 1}) - \frac{1}{2} + λ [{log}_{2} (\frac{{(β + 2)}^{2}}{β + 1}) - 2] + O (λ^{- 1}) \end{aligned}

which is equivalent to (24).

The higher order expression (25) is simply obtained by using higher order approximation to

F (λ, 0) - F (2 λ, 0)

.

☐
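The exact expression (27) only requires F(λ, 0), the expected log-factorial under the Poisson source, which is easy to evaluate by truncated summation. The sketch below compares (27) with the leading-order part of (24), i.e., (24) without its O(1/λ) remainder. All function names and the truncation rule are assumptions for illustration, and agreement is only expected up to the neglected O(1/λ) term.

```python
import math

LN2 = math.log(2.0)

def f_log_factorial(lam: float) -> float:
    """F(lam, 0) = E[log2 y!] for y ~ Poisson(lam), by truncated summation."""
    n_max = int(lam + 12.0 * math.sqrt(lam) + 30.0)  # assumed truncation point
    total = 0.0
    for y in range(n_max):
        p_y = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
        total += p_y * math.lgamma(y + 1) / LN2
    return total

def expected_divergence_exact(lam: float, beta: float) -> float:
    """E_G(lam; beta) in bits, following Equation (27)."""
    const = math.log2((beta + 2.0) / (beta + 1.0))
    slope = math.log2(lam) + math.log2((beta + 2.0) ** 2 / (beta + 1.0)) - math.log2(math.e)
    return const + lam * slope + f_log_factorial(lam) - f_log_factorial(2.0 * lam)

def expected_divergence_leading(lam: float, beta: float) -> float:
    """Leading-order part of Equation (24), dropping the O(1/lam) remainder."""
    const = math.log2((beta + 2.0) / (beta + 1.0)) - 0.5
    slope = 2.0 * math.log2((beta + 2.0) / 2.0) - math.log2(beta + 1.0)
    return const + lam * slope

for beta in (0.0, 0.1, 0.2):
    print(beta, expected_divergence_exact(20.0, beta), expected_divergence_leading(20.0, beta))
```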

Note that for

β \to 0

we recover our original result [11] that the expected divergence

E (λ)

of the original

P_{A C} (y | x)

from the “truth”

P (y | λ)

is (up to terms of order

λ^{- 1}

) never greater than 1/2 bit. The soft regularization in

P_{R} (y | x, β)

(using prior

P (λ | α = 1, β)

with

β > 0

) can result in larger expected divergence from the underlying source than is the case for

P_{A C} (y | x)

(using improper flat prior over λ). Moreover, (unlike in

P_{A C} (y | x)

) such a regularization causes linear divergence of

E_{G} (λ; β)

for large λ. The next theorem specifies for which underlying Poisson sources the soft regularization approach of

P_{R} (y | x, β)

is preferable to the original

P_{A C} (y | x)

.

Theorem 5 For Poisson sources with mean rates

λ < κ (β) = \frac{log (1 + \frac{β}{β + 2})}{log (1 + \frac{β^{2}}{4 (β + 1)})}

(28)

it holds

E (λ) > E_{G} (λ; β)

and hence

P_{R} (y | x, β)

is on average guaranteed to approximate (in the Kullback–Leibler divergence sense) the underlying source better than the original

P_{A C} (y | x)

.

Proof: It was shown in [11] that for the original

P_{A C} (y | x)

,

\begin{aligned} E (λ) &= λ (log λ - log e + 2 log 2) + log 2 \\ &\quad + F (λ, 0) - F (2 λ, 0) . \end{aligned}

(29)

From (27) and (29) we have that the difference between the expected divergences of the original and generalized forms of

P_{A C} (y | x)

is

\begin{aligned} E (λ) - E_{G} (λ; β) &= log 2 - log (\frac{β + 2}{β + 1}) + λ [2 log 2 - log (\frac{{(β + 2)}^{2}}{β + 1})] \\ &= log \frac{2 (β + 1)}{β + 2} + λ log \frac{4 (β + 1)}{{(β + 2)}^{2}} \end{aligned}

(30)

The result follows from solving for

E (λ) > E_{G} (λ; β)

.

The graph (in log-log scale) of

κ (β)

is shown in Figure 3. An alternative way of data-driven setting of parameter β is suggested by the fact that

κ (β)

is lower bounded by

β^{- 1}

. If the experimental setting is such that most counts are expected not to exceed some

x_{\max}

, β can be set to

β = 1 / x_{\max}

, so that

P_{R} (y | x, β)

is preferable to

P_{A C} (y | x)

.
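The threshold of Theorem 5 and the rule of thumb for β are easy to evaluate numerically. The following is a minimal sketch, not part of the original study; the function names `kappa` and `beta_from_max_count` are hypothetical, and only Equation (28) is assumed.

```python
import math


def kappa(beta: float) -> float:
    """Threshold kappa(beta) of Eq. (28); defined for beta > 0.

    The ratio of logarithms does not depend on the logarithm base.
    """
    numerator = math.log1p(beta / (beta + 2.0))
    denominator = math.log1p(beta ** 2 / (4.0 * (beta + 1.0)))
    return numerator / denominator


def beta_from_max_count(x_max: float) -> float:
    """Rule of thumb from the text: beta = 1 / x_max."""
    return 1.0 / x_max


# Tabulate the threshold for a few hypothetical maximal counts.
for x_max in (10, 50, 100):
    beta = beta_from_max_count(x_max)
    print(f"x_max={x_max:4d}  beta={beta:.4f}  kappa(beta)={kappa(beta):.2f}")
```

For each tabulated value, kappa(beta) can be compared directly with x_max to see how much slack the lower bound leaves.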

In Figure 4 we present the expected divergences

E_{G} (λ; β)

(solid line) and

E (λ)

(dashed line) for

β = 0.2

(left) and

β = 0.01

(right). As expected, for underlying sources with small mean counts λ the advantage of using the regularized form

P_{R} (y | x, β)

(as opposed to the original

P_{A C} (y | x)

) is more pronounced. However, for larger λ there is a heavy price to be paid in terms of inaccurate modelling by

P_{R} (y | x, β)

.
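The curves discussed above can be reproduced approximately by direct summation. The sketch below is illustrative only. It assumes that the prior with α = 1 is a Gamma density with rate parameter β, so that P_R(y | x, β) is negative binomial with shape x + 1 and success probability (β + 1)/(β + 2); this form reduces to P_AC for β = 0 and is consistent with Equation (30). All function names and the truncation rule for the infinite sums are hypothetical choices.

```python
import math


def log_poisson(y: int, lam: float) -> float:
    """Natural log of the Poisson probability P(y | lam), lam > 0."""
    return -lam + y * math.log(lam) - math.lgamma(y + 1)


def log_p_r(y: int, x: int, beta: float) -> float:
    """Natural log of P_R(y | x, beta) under the assumed negative binomial form."""
    log_binom = math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
    return (log_binom
            + (x + 1) * math.log((beta + 1.0) / (beta + 2.0))
            - y * math.log(beta + 2.0))


def expected_divergence_bits(lam: float, beta: float) -> float:
    """E_G(lam; beta) in bits via truncated double summation (beta = 0 gives E(lam))."""
    n_max = int(lam + 12.0 * math.sqrt(lam) + 30.0)  # ad hoc truncation point
    log_p = [log_poisson(k, lam) for k in range(n_max + 1)]
    p = [math.exp(v) for v in log_p]
    total = 0.0
    for x in range(n_max + 1):
        kl = sum(p[y] * (log_p[y] - log_p_r(y, x, beta)) for y in range(n_max + 1))
        total += p[x] * kl
    return total / math.log(2.0)


def gap_closed_form_bits(lam: float, beta: float) -> float:
    """Right-hand side of Eq. (30), expressed in bits."""
    gap = (math.log(2.0 * (beta + 1.0) / (beta + 2.0))
           + lam * math.log(4.0 * (beta + 1.0) / (beta + 2.0) ** 2))
    return gap / math.log(2.0)


for beta in (0.2, 0.01):
    for lam in (1.0, 5.0, 20.0):
        e = expected_divergence_bits(lam, 0.0)
        e_g = expected_divergence_bits(lam, beta)
        print(f"beta={beta}  lam={lam}  E={e:.4f}  E_G={e_g:.4f}")
```

Comparing `expected_divergence_bits(lam, 0.0) - expected_divergence_bits(lam, beta)` with `gap_closed_form_bits(lam, beta)` gives a numerical cross-check of Equation (30).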

Figure 3. Graph of

κ (β)

. For Poisson sources with mean rates

λ < κ (β)

,

E (λ) > E_{G} (λ; β)

and hence

P_{R} (y | x, β)

is on average guaranteed to approximate the underlying source better than the original

P_{A C} (y | x)

.

Figure 4. Expected divergences

E_{G} (λ; β)

(solid line) and

E (λ)

(dashed line) for

β = 0.2

(left) and

β = 0.01

(right).

6. Empirical Investigations

To investigate the potential value of the more sophisticated Bayesian approach in the original and the generalized Audic–Claverie frameworks (Section 2.1 and Section 3, respectively) against the baseline of simple (regularized) maximum likelihood estimation (Section 2.3), we conducted a series of simple illustrative experiments. In the generalized Audic–Claverie framework developed in this study, we used the two schemes for setting the regularization parameter β suggested in Section 3 and Section 5. In the regularized maximum likelihood approach

P_{M L} (y | x)

we set

ϵ = 1

. From Figure 1, it appears that the biggest difference between the expected divergences from the true underlying Poisson source

P (x | λ)

to the original

P_{A C} (\cdot | x)

and the maximum likelihood estimate occurs for small mean rates λ roughly around

λ = 5

. We therefore ran the experiments with

λ = 5

.

For illustration purposes, we follow the data generation mechanism used in [13] to compare methods for distinguishing between differential expression of genes associated with two treatment regimes. We stress that we in no way suggest that our experiments have strong relevance for bioinformatics, nor do we claim that the framework of [13] is the best test bed for assessing differential gene expression detection algorithms. We use the framework of [13] merely to illustrate whether the sophistication of the Bayesian approach (as opposed to simple (regularized) maximum likelihood) can bring benefits in a practical situation with low-count data.

Gene counts are simulated across the two treatment groups

T_{1}

and

T_{2}

. The tests are assessed by comparing false positive and true positive rates. In each experiment 10,000 gene pair counts

(x_{1, j}, x_{2, j})

,

j = 1, 2, \ldots, 10{,}000

, were produced, counts

x_{1, j}

and

x_{2, j}

associated with regimes

T_{1}

and

T_{2}

, respectively. As specified above, the sampling rate for

T_{1}

was fixed at

λ_{1} = 5

throughout the experiment. We varied the mean

{log}_{2}

fold change (LFC) between

T_{1}

and

T_{2}

from −2 to 2. Each gene pair count

(x_{1, j}, x_{2, j})

,

j = 1, 2, \ldots, 10{,}000

, was obtained through a generative process specified in [13] and described in detail in Appendix A.

Having generated the gene pair counts, we used methods considered in this study to make a decision for each

j = 1, 2, \ldots, 10{,}000

, whether the counts

x_{1, j}, x_{2, j}

originated from the same underlying source, i.e., whether when generating

y_{1, j}

and

y_{2, j}

, the mean rates in the two regimes

T_{1}

and

T_{2}

were identical (

\mathrm{LFC}_{j} = 0

). Given the “test distribution”

Q (y | x)

and a confidence level

ϑ \in [0, 1]

, we guess that

x_{1, j}, x_{2, j}

originated from the same source if the

(1 - ϑ)

-quantile around the mean of

Q (y | x_{1, j})

contains

x_{2, j}

and vice-versa, i.e., if the

(1 - ϑ)

-quantile around the mean of

Q (y | x_{2, j})

contains

x_{1, j}

. In place of

Q (y | x)

we used

P_{A C} (y | x)

, its regularized form

P_{R} (y | x, β)

and the regularized maximum likelihood estimate

P_{M L} (y | x)

with

ϵ = 1

.
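The decision rule can be sketched in a few lines. This is one possible reading rather than the implementation used in the experiments: the "(1 − ϑ)-quantile around the mean" is interpreted here as an equal-tailed central interval, and P_R(y | x, β) is assumed to have the negative binomial form with shape x + 1 and success probability (β + 1)/(β + 2) (β = 0 gives P_AC). The names `p_r_pmf`, `central_interval` and `same_source` are hypothetical.

```python
import math


def p_r_pmf(y: int, x: int, beta: float = 0.0) -> float:
    """P_R(y | x, beta) under the assumed negative binomial form."""
    log_binom = math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
    return math.exp(log_binom
                    + (x + 1) * math.log((beta + 1.0) / (beta + 2.0))
                    - y * math.log(beta + 2.0))


def central_interval(x: int, theta: float, beta: float = 0.0, y_cap: int = 100_000):
    """Equal-tailed interval holding at least 1 - theta of the mass of Q(. | x).

    ASSUMPTION: equal tails of theta / 2; the text does not fix this detail.
    """
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    lower = None
    cdf = 0.0
    for y in range(y_cap + 1):
        cdf += p_r_pmf(y, x, beta)
        if lower is None and cdf >= theta / 2.0:
            lower = y
        if cdf >= 1.0 - theta / 2.0:
            return lower, y
    return lower, y_cap  # safety cap, not expected to be reached


def same_source(x1: int, x2: int, theta: float, beta: float = 0.0) -> bool:
    """True if each count falls inside the interval built from the other count."""
    lo1, hi1 = central_interval(x1, theta, beta)
    lo2, hi2 = central_interval(x2, theta, beta)
    return lo1 <= x2 <= hi1 and lo2 <= x1 <= hi2


print(central_interval(5, 0.05), same_source(5, 7, 0.05))
```

The symmetric check in `same_source` mirrors the "and vice-versa" condition in the text: a pair is attributed to a single source only if both inclusions hold.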

For a given confidence level

ϑ \in [0, 1]

and test distribution

Q (y | x)

we calculate the false positive rate (type I error rate) as the proportion of times a gene count pair

(x_{1, j}, x_{2, j})

was declared to have originated from two different underlying sources (differentially expressed gene) when in fact

\mathrm{LFC}_{j}

was zero. The true positive rate (statistical power) was determined as the proportion of times a gene was correctly declared differentially expressed, i.e.,

(x_{1, j}, x_{2, j})

was declared to have originated from two different underlying sources and

\mathrm{LFC}_{j} \neq 0

.
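As a small illustration of these two definitions, the following sketch computes both rates from a list of decisions and the corresponding log fold changes. The function name `error_rates` is hypothetical, and the denominators (number of pairs with zero and nonzero LFC, respectively) follow the usual convention, which is assumed here.

```python
def error_rates(declared_different, lfc):
    """Return (false positive rate, true positive rate).

    declared_different[j] is True if pair j was declared differentially expressed;
    lfc[j] is the log fold change used to generate pair j.
    ASSUMPTION: rates are normalized by the number of true negatives / positives.
    """
    negatives = sum(1 for value in lfc if value == 0)
    positives = len(lfc) - negatives
    false_pos = sum(1 for d, value in zip(declared_different, lfc) if d and value == 0)
    true_pos = sum(1 for d, value in zip(declared_different, lfc) if d and value != 0)
    fpr = false_pos / negatives if negatives else float("nan")
    tpr = true_pos / positives if positives else float("nan")
    return fpr, tpr


print(error_rates([True, False, True, False], [0.0, 0.0, 1.5, -2.0]))
```

Evaluating `error_rates` over a range of confidence levels ϑ yields the points of one ROC curve.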

A plot of the false positive rate vs. the true positive rate obtained for different values of ϑ constitutes a receiver operating characteristic (ROC) curve. If the ROC curve for one test distribution is always above that of another, this suggests its superiority in classifying genes as differentially expressed. Trivial classification of genes as differentially expressed using a completely random guess would yield the identity (diagonal) ROC curve. ROC curves for the maximum likelihood method (

ϵ = 1

, red dashed line) and the soft regularization model

P_{R} (y | x, β)

,

β = 1 / 50, 1 / 100

(solid lines) are plotted in Figure 5. Not surprisingly, the Bayesian approach (solid lines) outperforms the penalized maximum likelihood one (red dashed line). However, the original

P_{A C} (y | x)

(

β = 0

, black line) and the soft regularization model (color solid lines) achieve almost identical performances. In this challenging setting (single observations at low mean rate with additional noise), the scheme for setting the regularization parameter β suggested in Section 5 has little effect on the resulting classification performance. We also ran experiments to test the “dynamic” scheme for setting β introduced in Section 3, but no significant performance improvements were achieved.

Figure 5. ROC curves for test distributions

P_{A C} (y | x) = P_{R} (y | x, β \to 0)

(solid black line),

P_{R} (y | x, β = 1 / 100)

(solid blue line),

P_{R} (y | x, β = 1 / 50)

(solid green line) and

P_{M L} (y | x)

with

ϵ = 1

(dashed red line). Mean rate of the underlying Poisson source was fixed at

λ = 5

.

Finally, we devised yet another scheme for determining the hyper-parameters α and β of the prior

P (λ | α, β)

from the data. In the spirit of type II maximum likelihood, we find the most likely values of

α, β

, given the observed counts

C = \{x_{1}, x_{2}, \ldots, x_{n}\}

, using

P (C | α, β) = \prod_{i = 1}^{n} P (x_{i} | α, β),

where

P (x_{i} | α, β) = \int_{0}^{\infty} P (x_{i} | λ) p (λ | α, β) d λ

(31)

Using this method, we first optimize the prior hyperparameters on the observed data. The “optimized” prior

P (λ | α_{*}, β_{*})

now reflects the possible ranges of mean counts λ one can expect given the data. We then repeated the experiments using the generalized model

P_{G} (y | x, α_{*}, β_{*})

derived from the optimized prior. In this way we can assess to what degree the relatively minor performance differences between the generalized and maximum likelihood models in Figure 5 are due to constraining α to

α = 1

(in

P_{R} (y | x, β)

), or due to the inherent difficulty of learning from single counts. The resulting ROC analysis is shown in Figure 6. The data-driven setting of the hyperparameters

α, β

leads to a slight improvement over

P_{A C} (y | x)

and

P_{R} (y | x, β)

.
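The type II maximum likelihood step of Equation (31) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the prior is taken to be a Gamma density with shape α and rate β, for which the integral in (31) has the closed negative binomial form used in `log_marginal`, and the optimizer is a plain grid search chosen only for simplicity. All function names and grid values are hypothetical.

```python
import math


def log_marginal(x: int, alpha: float, beta: float) -> float:
    """log P(x | alpha, beta) for a Gamma(shape=alpha, rate=beta) prior on lambda.

    Closed form of the integral in Eq. (31):
    Gamma(x + alpha) / (Gamma(alpha) x!) * (beta / (beta + 1))^alpha * (beta + 1)^(-x)
    """
    return (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
            + alpha * math.log(beta / (beta + 1.0))
            - x * math.log(beta + 1.0))


def log_evidence(counts, alpha: float, beta: float) -> float:
    """log P(C | alpha, beta) for independent counts C."""
    return sum(log_marginal(x, alpha, beta) for x in counts)


def fit_hyperparameters(counts, alphas, betas):
    """Grid search for the (alpha, beta) pair maximizing the evidence."""
    candidates = [(a, b) for a in alphas for b in betas]
    return max(candidates, key=lambda ab: log_evidence(counts, ab[0], ab[1]))


counts = [3, 7, 5, 4, 6, 5, 8, 2, 11, 0]        # made-up example counts
alphas = [0.5, 1.0, 2.0, 5.0, 10.0]             # illustrative grid
betas = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]         # illustrative grid
print(fit_hyperparameters(counts, alphas, betas))
```

In practice a continuous optimizer over (α, β) would replace the grid; the grid keeps the sketch free of external dependencies.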

Figure 6. ROC curves for test distributions

P_{A C} (y | x) = P_{R} (y | x, β \to 0)

(solid black line) and

P_{G} (y | x, α_{*}, β_{*})

(dashed red line). Mean rate of the underlying Poisson source was fixed at

λ = 5

.

7. Discussion and Conclusion

Studies of learning algorithms traditionally concentrate on situations where a potentially ever increasing number of training examples is available. However, there are situations where only extremely small samples can be used to perform an inference. In this contribution we concentrated on the extreme case of low count data governed by a Poisson distribution, where only a single observation is available. We performed a rigorous theoretical investigation of the appropriateness of various model estimators based on the single observation. We considered a Bayesian approach along the lines of [2], where the model built on the basis of a single observed count is no longer Poisson, even though we know that the generating source is Poisson (but do not know the mean rate).

We showed that the Bayesian approach is preferable to regularized maximum likelihood, in the sense that the expected Kullback–Leibler divergence from the source to the model is smaller for the Bayesian approach. Furthermore, we generalized the original model of [2] to account for possible prior information on expected expression counts. A detailed information theoretic study of the learning capabilities of such a generalized model was conducted for the case of low count data. We also quantified the effect of Bayesian averaging on its first two moments.

We demonstrated both theoretically and empirically that the Bayesian model averaging on the generalized model can be potentially beneficial. For large λ, the expected divergence

Υ (λ, ϵ)

of the maximum likelihood estimator from the true Poisson source is dominated by the term

λ (log λ - \sum_{x = 1}^{\infty} P (x | λ) log x)

since

{lim}_{λ \to \infty} e^{- λ} (ϵ - λ log ϵ) = 0

. We empirically determined that for

λ \geq 10

,

Υ (λ, ϵ = 1)

expressed in bits is bounded by

0.7 < Υ (λ, ϵ = 1) < 0.8

. Hence, for mean Poisson rates

λ \geq 10

, the difference between the expected divergences of the Audic–Claverie and ML estimates from the true source is never less than 0.2 bits and never more than 0.3 bits. In other words,

0.2 < Υ (λ, ϵ = 1) - E (λ) < 0.3, λ \geq 10
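The empirical bounds on Υ(λ, ϵ = 1) can be checked with a short numerical sketch. It assumes that the regularized maximum likelihood model is a Poisson distribution with mean x for x ≥ 1 and mean ϵ for x = 0, which reproduces the two terms quoted above; the computation is done with natural logarithms and converted to bits at the end. The function name `upsilon_bits` and the truncation rule are hypothetical.

```python
import math


def upsilon_bits(lam: float, eps: float = 1.0) -> float:
    """Expected divergence of the regularized ML model from Poisson(lam), in bits."""
    n_max = int(lam + 12.0 * math.sqrt(lam) + 30.0)  # ad hoc truncation point
    # sum_{x >= 1} P(x | lam) * log(x), with probabilities computed in log space
    mean_log_x = sum(
        math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1)) * math.log(x)
        for x in range(1, n_max + 1)
    )
    nats = (lam * (math.log(lam) - mean_log_x)
            + math.exp(-lam) * (eps - lam * math.log(eps)))
    return nats / math.log(2.0)


for lam in (10.0, 20.0, 50.0, 100.0):
    print(f"lam={lam:6.1f}  Upsilon={upsilon_bits(lam):.4f} bits")
```

Subtracting the corresponding values of E(λ) from these numbers gives the gap between the maximum likelihood and Audic–Claverie estimates discussed above.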

Acknowledgements

This work was supported by a BBSRC grant (no. BB/H012508/1).

Appendix A

In the generative process of [13], each gene pair count

(x_{1, j}, x_{2, j})

,

j = 1, 2, \ldots, 10{,}000

, was obtained as follows:

1. The sampling rate $λ_{2, j}$ for the treatment group $T_{2}$ is obtained as

$λ_{2, j} = 2^{({log}_{2} λ_{1}) - \mathrm{LFC}_{j}}$

$\mathrm{LFC}_{j} \sim \mathrm{Uniform} \{- 2.0, - 1.5, - 1.0, \ldots, 1.5, 2.0\}$

2. A pair of gene counts $(y_{1, j}, y_{2, j})$ is sampled with respect to $\mathrm{Poisson} (λ_{1})$ and $\mathrm{Poisson} (λ_{2, j})$,

$y_{1, j} \sim \mathrm{Poisson} (λ_{1}), \quad y_{2, j} \sim \mathrm{Poisson} (λ_{2, j})$

3. Zero mean Gaussian noise is then added to each gene count (rounding to the nearest integer using the rounding operator $[\cdot]$):

$y_{i, j}^{'} = y_{i, j} + [η_{j}], \quad i = 1, 2$

$η_{j} \sim N (0, σ_{j} = \frac{v_{j}}{ψ})$

$v_{j} = \frac{λ_{1} + λ_{2, j}}{2}$

where $ψ = 10$.

4. The batch and lane effects are simulated as follows. Batch effects are accounted for by adding Gaussian noise to each noisy count $y_{i, j}^{'}$,

$y_{i, j}^{''} = y_{i, j}^{'} + [η_{i, j}^{'}]$

$η_{i, j}^{'} \sim N (0, \frac{y_{i, j}^{'}}{10})$

Lane effects are simulated by Poisson sampling from $y_{1, j}^{''}$ and $y_{2, j}^{''}$ at different rates varying between lanes,

$x_{i, j} \sim \mathrm{Poisson} (δ_{j} \cdot y_{i, j}^{''})$

$δ_{j} \sim \mathrm{Uniform} \{0.65, 0.8, 0.95\}$
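The generative steps above can be sketched as a small simulator. Several details are not fixed by the description and are filled in here as explicit assumptions: the second argument of N(·, ·) is treated as a standard deviation, negative intermediate counts are clamped at zero so that later steps stay well defined, rounding uses Python's built-in `round`, and Poisson variates are drawn with Knuth's multiplication method. All function names are hypothetical.

```python
import math
import random

LFC_VALUES = [-2.0 + 0.5 * k for k in range(9)]   # -2.0, -1.5, ..., 2.0
LANE_RATES = [0.65, 0.8, 0.95]


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small rates used here."""
    if rate <= 0.0:
        return 0
    limit = math.exp(-rate)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_gene_pair(rng: random.Random, lam1: float = 5.0, psi: float = 10.0):
    """Return (x_1j, x_2j, LFC_j) for one simulated gene."""
    lfc = rng.choice(LFC_VALUES)
    lam2 = 2.0 ** (math.log2(lam1) - lfc)              # rate for T_2
    sigma = 0.5 * (lam1 + lam2) / psi                  # ASSUMPTION: sigma_j is a std. dev.
    eta = round(rng.gauss(0.0, sigma))                 # eta_j is shared by both counts
    delta = rng.choice(LANE_RATES)                     # delta_j is shared by both counts
    counts = []
    for rate in (lam1, lam2):
        y = sample_poisson(rate, rng)
        y_noisy = max(y + eta, 0)                      # ASSUMPTION: clamp at zero
        batch = round(rng.gauss(0.0, y_noisy / 10.0))  # ASSUMPTION: y'/10 is a std. dev.
        y_batch = max(y_noisy + batch, 0)              # ASSUMPTION: clamp at zero
        counts.append(sample_poisson(delta * y_batch, rng))
    return counts[0], counts[1], lfc


rng = random.Random(0)
pairs = [sample_gene_pair(rng) for _ in range(10_000)]
print(pairs[:5])
```

Pairs generated in this way can be fed to a decision rule together with their LFC values to estimate false and true positive rates.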

References

1. Varuzza, L.; Gruber, A.; de B. Pereira, C. Significance tests for comparing digital gene expression profiles. Nat. Preced. 2008.
2. Audic, S.; Claverie, J. The significance of digital expression profiles. Genome Res. 1997, 7, 986–995.
3. Medina, C.; Rotter, B.; Horres, R.; Udupa, S.; Besser, B.; Bellarmino, L.; Baum, M.; Matsumura, H.; Terauchi, R.; Kahl, G.; et al. SuperSAGE: The drought stress-responsive transcriptome of chickpea roots. BMC Genomics 2008, 9, e553.
4. Kim, H.; Baek, K.; Lee, S.; Kim, J.; Lee, B.; Cho, H.; Kim, W.; Choi, D.; Hur, C. Pepper EST database: Comprehensive in silico tool for analyzing the chili pepper (Capsicum annuum) transcriptome. BMC Plant Biol. 2008, 8, e101.
5. Cervigni, G.; Paniego, N.; Pessino, S.; Selva, J.; Diaz, M.; Spangenberg, G.; Echenique, V. Gene expression in diplosporous and sexual Eragrostis curvula genotypes with differing ploidy levels. BMC Plant Biol. 2008, 67, e11.
6. Miles, J.; Blomberg, A.; Krisher, R.; Everts, R.; Sonstegard, T.; Tassell, C.V.; Zeulke, K. Comparative transcriptome analysis of in vivo and in vitro-produced porcine blastocysts by small amplified RNA-serial analysis of gene expression (SAR-SAGE). Mol. Reprod. Dev. 2008, 75, 976–988.
7. Cuevas-Tello, J.C.; Tiňo, P.; Raychaudhury, S. How accurate are the time delay estimates in gravitational lensing? Astron. Astrophys. 2006, 454, 695–706.
8. Cuevas-Tello, J.C.; Tiňo, P.; Raychaudhury, S.; Yao, X.; Harva, M. Uncovering delayed patterns in noisy and irregularly sampled time series: An astronomy application. Pattern Recognit. 2010, 43, 1165–1179.
9. Pelt, J.; Hjorth, J.; Refsdal, S.; Schild, R.; Stabell, R. Estimation of multiple time delays in complex gravitational lens systems. Astron. Astrophys. 1998, 337, 681–684.
10. Press, W.; Rybicki, G.; Hewitt, J. The time delay of gravitational lens 0957+561, I. Methodology and analysis of optical photometric data. Astrophys. J. 1992, 385, 404–415.
11. Tiňo, P. Basic properties and information theory of Audic-Claverie statistic for analyzing cDNA arrays. BMC Bioinform. 2009, 10, e310.
12. Tiňo, P. One-shot Learning of Poisson Distributions in cDNA Array Analysis. In Advances in Neural Networks, Proceedings of the 8th International Symposium on Neural Networks (ISNN 2011), Guilin, China, 29 May – 1 June, 2011; Liu, D., Zhang, H., Polycarpou, M., Alippi, C., He, H., Eds.; Lecture Notes in Computer Science (LNCS 6676). Springer-Verlag: Berlin, Heidelberg, Germany, 2011; pp. 37–46.
13. Auer, P.; Doerge, R. Statistical design and analysis of RNA sequencing data. Genetics 2010, 185, 405–416.

© 2013 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
