1. Introduction
The variational method of approximating intractable computations has its roots in the calculus of variations and is reviewed in Stephenson [1]. Ormerod and Wand [2] gave an overview of how variational approximations facilitate approximate inference for the parameters in complex statistical models and provide a fast, deterministic alternative to Monte Carlo methods. Moreover, Teschendorff et al. [3] and Flandin and Penny [4] applied the variational method to model gene-expression data and fMRI data, respectively. Hall et al. [5] and Hall et al. [6] applied the variational method to approximate the likelihood function of a Poisson linear mixed model with only one predictor and derived the maximum likelihood estimators and their asymptotic distributions.
The main idea of the variational method for approximating the density of a statistic from a complex model is to find a density from a pre-determined family of distributions such that the approximating density and the target density have the smallest Kullback–Leibler (KL) divergence. Mathematically, let p be the target density and q be a specific density from a pre-determined family of distributions Q. Then, the variational approximation of p is the member of Q that minimizes the KL divergence from p. When Q is the Gaussian family, the variational approximation is referred to as the Gaussian variational approximation (GVA).
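As a concrete illustration of this definition (not code from the paper), the following minimal R sketch approximates a log-gamma target density by a Gaussian, minimizing the KL divergence over the Gaussian mean and standard deviation with numerical integration and a general-purpose optimizer; the target and all starting values are purely illustrative.

## Gaussian variational approximation of an illustrative target density:
## p is the density of log Y with Y ~ Gamma(shape = 3, rate = 1).
log_p <- function(x) 3 * x - exp(x) - lgamma(3)        # log of the target density
kl_q_p <- function(par) {                              # KL(q || p) with q = N(mu, sd^2)
  mu <- par[1]; sd <- exp(par[2])                      # log-parameterize sd so that sd > 0
  integrand <- function(x)
    dnorm(x, mu, sd) * (dnorm(x, mu, sd, log = TRUE) - log_p(x))
  integrate(integrand, -Inf, Inf)$value
}
fit <- optim(c(0, 0), kl_q_p)                          # minimize KL over (mu, log sd)
c(mu = fit$par[1], sd = exp(fit$par[2]))               # the Gaussian variational approximation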
Recent advances in GVA include links to Bayesian posterior inference [7,8] and Bayesian Gaussian graphical model selection [9]; see also applications to particle inference [10], GVA with a factorial covariance structure [11], and variational approximation in a Poisson-lognormal model [12].
In this paper, we consider the Poisson lognormal mixed model studied in [5] but with more than one predictor. The GVA method is applied to obtain the maximum likelihood estimators of the parameters in the model, and the proposed method offers several advantages over the method discussed in [5].
The rest of the paper is organized as follows. Section 2 examines a special member of the exponential family. For this specific distribution, it is shown that the first three moments can be obtained directly from the normalizing constant; although the explicit form of the normalizing constant is unavailable, it can be approximated by the GVA method. Section 3 reviews the Poisson lognormal mixed model and shows that its density can be expressed in the form of the specific exponential model introduced in Section 2. Thus, the likelihood function of the Poisson lognormal mixed model is explicitly available, and the maximum likelihood estimators and their limiting distributions are derived. A real-life example is presented in Section 4 to demonstrate the application of the proposed method, and results from simulation studies are also presented there to illustrate its accuracy. Finally, we give concluding remarks and a discussion in Section 5.
2. A Special Member of the Exponential Family Model
Consider a special exponential family model with the density given in (1), where α and β are non-negative real numbers and the remaining factor is the normalizing constant. The first two moments of X then follow directly from the normalizing constant, and the variance of X can be obtained from them; following the same argument, the third moment can be obtained as well. Note that, in one limiting regime, Model (1) converges to a normal model with the corresponding mean and variance, while, in another limiting regime, it converges to a log-gamma model with the corresponding shape and rate parameters.
Although Model (1) is not commonly discussed in the statistics literature, it arises naturally in information theory. In particular, the concept of relative entropy in information theory, generally known as the Kullback–Leibler (KL) divergence in mathematics, is a measure of the closeness of two probability distributions P and Q. If both P and Q are continuous distributions with density functions p and q, respectively, the KL divergence is defined as KL(P ∥ Q) = ∫ p(x) log{p(x)/q(x)} dx. Note that, by Jensen's inequality, KL(P ∥ Q) is always non-negative.
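For completeness, the Jensen's-inequality argument is a single line; since −log is convex and q integrates to one,

\[
\mathrm{KL}(P \,\|\, Q) \;=\; -\int p(x)\,\log\frac{q(x)}{p(x)}\,dx \;\ge\; -\log\int p(x)\,\frac{q(x)}{p(x)}\,dx \;=\; -\log\int q(x)\,dx \;=\; 0 .
\]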
The variational method seeks the density that is closest to Model (1) in the sense of having the smallest KL divergence. When the approximating density is restricted to the family of normal distributions with mean μ and variance σ², the KL divergence can be written out explicitly. Mathematically, the variational method gives (3), and, since the KL divergence is always non-negative, it also gives the lower bound in (4). Note that Equation (3) indicates that minimizing the KL divergence is the same as minimizing the gap between the log of the normalizing constant and its lower bounds obtained from Jensen's inequality. Moreover, Equation (4) shows that (3) is equivalent to maximizing an expectation with respect to μ and σ².
The main idea of GVA is to approximate the log of an intractable integral by the maximum of all possible lower bounds derived from Jensen's inequality, and all such lower bounds on the log of the normalizing constant of Model (1) can be obtained in this way. Taking the approximating density to be normal with mean μ and variance σ² yields one such bound. The Jensen gap between the log of the normalizing constant and its lower bounds can be narrowed further by maximizing the bound over all values of μ and σ². Finally, we have the following theorem:
Theorem 1. The log of the normalizing constant approximated by GVA is given by (5), where the variational mean and variance satisfy an accompanying nonlinear system of equations; in particular, for large α and β, the explicit second-order approximation in (6) holds.

It is important to note that the drawback of Equation (5) is that a numerical procedure is needed to solve the nonlinear system for the estimated mean and variance. In contrast, Equation (6) has an explicit form with second-order accuracy. A comparison of these two equations is given in Figure 1. Inference for the Poisson lognormal mixed model is provided in Section 3.
Proof of Theorem 1. Suppose the variational mean and variance maximize the lower bound. Then they satisfy the first-order conditions in (7), and (7) yields expressions for both quantities. Substituting the expression obtained from (7) into (8) and solving for the remaining unknown gives (11), and substituting (11) back into (9) gives (12). Finally, replacing the variational mean and variance in the lower bound by (9) and (11), respectively, gives (14). With α and β large, substituting (13) and (12) into (14) yields the approximation stated in (6). □
To demonstrate the numerical accuracy of the GVA, we consider the density in (1) with specified values of α and β. The exact value of the log of the normalizing constant is obtained by numerical integration, and the results are plotted in Figure 1. From Figure 1, we observe that GVA by (5) is accurate for small parameter values, while GVA by (6) gives a good approximation for large parameter values. Overall, the explicit approximation in Equation (6) is preferable to the numerical approximation required by Equation (5).
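The following R sketch illustrates the mechanics behind Theorem 1 and Figure 1 in a case where the exact answer is known. Because the density in (1) is not reproduced here, the sketch uses the log-gamma kernel exp(a x − b eˣ), whose normalizing constant is Γ(a)/bᵃ, and maximizes the Jensen lower bound over a Gaussian mean and variance; the parameter values are illustrative only.

## Exact log normalizing constant versus its maximized Jensen (GVA) lower bound
a <- 5; b <- 2
exact <- lgamma(a) - a * log(b)                      # log of Gamma(a) / b^a
lower_bound <- function(par) {                       # E_q[log h(X)] + entropy of q,
  mu <- par[1]; s2 <- exp(par[2])                    # with q = N(mu, s2) and h(x) = exp(a*x - b*exp(x))
  a * mu - b * exp(mu + s2 / 2) + 0.5 * log(2 * pi * s2) + 0.5
}
opt <- optim(c(0, 0), lower_bound, control = list(fnscale = -1))  # maximize over (mu, log s2)
c(exact = exact, gva = opt$value)                    # the GVA value never exceeds the exact value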
3. GVA for Poisson Lognormal Mixed Model
In this section, we apply the results of Theorem 1 to obtain estimates of the parameters in a Poisson lognormal mixed model. The estimates are simpler to obtain than those in Hall et al. [5,6].
Let the response, which takes non-negative integer values, be the number of occurrences of an event for subject i at time t, where i indexes the m subjects and t indexes the repeated observations within a subject. Moreover, let the covariates for subject i at time t be the powers of a single covariate. Then, the conditional Poisson model is (16), where the mean of the conditional Poisson model is determined by the regression coefficients. An unobservable latent random variable for the ith subject is introduced into Model (16) to capture the within-subject correlation. The resulting model, which is known as the Poisson generalized linear mixed model, can be written as (17), where the latent random variables are assumed to be distributed independently of the covariates. Furthermore, if the latent random variables are assumed to be independently and identically distributed normal random variables with mean zero, the model is referred to as the Poisson lognormal mixed model, although [5] referred to it as the Poisson linear mixed model. McCulloch et al. [13] gave a detailed discussion of the connections between the Poisson lognormal mixed model and the models studied in longitudinal data analysis.
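To fix ideas, the following R sketch simulates data of this form; the sample sizes and parameter values are illustrative only and are not taken from the paper.

## Simulating from a Poisson lognormal mixed model with one covariate
set.seed(1)
m <- 50; n <- 5                                   # m subjects, n observations per subject
beta <- c(0.5, 1.0); sigma <- 0.8                 # intercept, slope, and random-effect SD
x <- matrix(runif(m * n), m, n)                   # covariate values, one row per subject
u <- rnorm(m, 0, sigma)                           # latent subject effects, U_i ~ N(0, sigma^2)
eta <- beta[1] + beta[2] * x + u                  # linear predictor; row i receives u[i]
y <- matrix(rpois(m * n, exp(eta)), m, n)         # Y_it | U_i ~ Poisson(exp(eta_it))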
It is challenging to obtain estimators of the regression coefficients and the variance of the latent variables because the likelihood involves the unobservable latent variables. Rue et al. [14] suggested eliminating the effect of the unobservable latent variables by using the marginal log-likelihood function, which can be written as (18). The development of likelihood-based inference for the parameters is hindered by the last term in (18), which involves a sum of m integrals. Hall et al. [5] applied the GVA method to overcome this integration problem; however, they considered only the model with one covariate, and the resulting estimating equations involved nuisance parameters. Due to the complexity of their method, it is difficult, if not impossible, to extend it to models with more than one covariate.
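To make the intractability concrete, the following R sketch evaluates one of the m integrals in (18) by numerical quadrature for a single subject; the counts, linear predictors, and variance below are hypothetical illustration values.

## One subject's likelihood contribution: Poisson likelihood integrated against the latent normal density
y_i <- c(2, 0, 3); eta_i <- c(0.4, -0.1, 0.8)     # counts and fixed-effect parts of the linear predictor
sigma <- 0.8                                       # standard deviation of the latent variable
integrand <- function(u)
  sapply(u, function(ui) exp(sum(dpois(y_i, exp(eta_i + ui), log = TRUE))) * dnorm(ui, 0, sigma))
log(integrate(integrand, -Inf, Inf)$value)         # log of this subject's contribution to (18)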
Using the notation introduced in Section 2, the marginal log-likelihood function (18) can be rewritten as (19). From (6), the log of each normalizing constant appearing in (19) admits an explicit second-order approximation, so the log-likelihood function (19) can be approximated in closed form. For each subject, denote the associated variational quantities as in Section 2; the partial derivatives of the approximated log-likelihood function with respect to the model parameters and the variational quantities are then available in closed form, and setting them to zero yields the estimating Equations (22)–(24). Although the maximum likelihood estimators do not have a closed-form solution, Equation (23) is sufficiently well structured that a numerical procedure, such as the Newton–Raphson method or the EM algorithm, can be applied to obtain a numerical solution.
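As a concrete illustration of such a numerical procedure, the following R sketch maximizes a Gaussian variational lower bound for the data simulated above. It does not reproduce Equations (19)–(24), which are not shown here; instead it uses the standard GVA lower bound for a Poisson mixed model, with one variational mean and one variational variance per subject, and a general-purpose optimizer in place of Newton–Raphson.

## Joint maximization of the standard GVA lower bound over (beta0, beta1, log sigma^2, mu_i, log lambda_i)
gva_objective <- function(theta) {
  b0 <- theta[1]; b1 <- theta[2]; s2 <- exp(theta[3])
  mu <- theta[4:(3 + m)]; lam <- exp(theta[(4 + m):(3 + 2 * m)])
  eta <- b0 + b1 * x + mu                          # E_q of the linear predictor; row i receives mu[i]
  sum(y * eta - exp(eta + lam / 2)) -              # E_q of the Poisson log-likelihood (up to log y! terms)
    sum((mu^2 + lam) / (2 * s2)) -                 # E_q of the latent normal log-density ...
    m * log(s2) / 2 + sum(log(lam)) / 2            # ... plus the Gaussian entropy, up to constants
}
start <- c(0, 0, 0, rep(0, 2 * m))
fit <- optim(start, gva_objective, method = "BFGS",
             control = list(fnscale = -1, maxit = 500))
fit$par[1:3]                                       # estimates of beta0, beta1, and log sigma^2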
To study the asymptotic properties of the maximum likelihood estimators, we need to make the following three assumptions.
Assumption 1. Let … as m → ∞, where the …'s are independent and have the same distribution as X, whose moment generating function (MGF), …, is well defined on the whole real line; … and … (the entries of …) are defined as …, where … and ….

Assumption 2. For …, ….

Assumption 3. For …, …, where …, …, …, and … are the solutions obtained from (22).

We give a few remarks on each assumption.
Remark 1. Assumption 1 is implied by the Central Limit Theorem; thus, we only need to calculate the moments, namely the expectation and the covariance. Similarly, from the definitions of … and …, we have …. To be more specific, when …, ….

Remark 2. In Assumption 2, …, and thus the variance is …. On account of Markov's inequality, we need to show …. Note that …. Deriving the second-order Taylor expansion of the ratio about the value … and using it yields ….
Remark 3. In Assumption 3, we obtain the …-order Taylor expansions of … and … about the value …, which yield …. If …, then …. Furthermore, we have …, where …. Note that … agrees with Equation (3.5) in Theorem 3.1 of Hall et al. [5]. After tedious calculations, we obtain the simplified versions of … and ….

Theorem 2. Under Assumptions 1–3, for …, …, where the …'s are the solutions obtained from (22), … denotes the entry in the jth row and kth column of the matrix …, and …. Thus, as m → ∞, … is a consistent estimator whose asymptotic distribution is …, where …. Moreover, … is a consistent estimator with asymptotic distribution …, where … is the solution provided in (23). Finally, … is also a consistent estimator, with asymptotic distribution …, where … is the solution presented in (24).

Proof of Theorem 2. From (23), we consider the following algebraic manipulation:
where the first term is …, the second term is …, and the last term is zero. As … and …, we have … for each …; by continuity, … for each …. This shows that the estimators are consistent. In addition, …, which gives the asymptotic result below by the definition of …. Therefore, by the Central Limit Theorem, we have …, where ….

Moreover, we have …, where …. From the estimator of … in (22), …. Therefore, …, and hence … is a consistent estimator. Moreover, by the Central Limit Theorem, we have ….

From the estimator of … in (24), combining the consistency of … and (25), we have …, where …. Hence, …, and thus … is a consistent estimator. Again, by the Central Limit Theorem, we have …. □
Remark 4. It is natural to consider independent covariates, in which case … can be modeled as …, where … and … are independent for …. We still have …, where …, and … if …; furthermore, … and … if …. Other modifications of … can be handled similarly.

Remark 5. According to Theorem 2, we can construct an approximate confidence interval (CI) for each parameter based on its limiting distribution. The intervals are Wald-type intervals, obtained as the estimate plus or minus the appropriate quantile of Φ times the estimated asymptotic standard error, where Φ denotes the distribution function of N(0, 1). Note that, when …, these confidence intervals agree with those given in Hall, Pham, Wand, and Wang (2011).
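As a small illustration of Remark 5, the helper below computes such a Wald-type interval; the function name and the numerical values are hypothetical, not taken from the paper.

## Wald-type confidence interval from an estimate and its asymptotic standard error
wald_ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)                  # quantile of the standard normal distribution
  c(lower = est - z * se, upper = est + z * se)
}
wald_ci(est = 1.02, se = 0.11)                     # e.g., an interval for one regression coefficient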
5. Conclusions and Discussion
We proposed a special exponential family model and studied the GVA of its normalizing constant by connecting the KL divergence with Jensen's inequality. A closed-form expansion of the normalizing constant was given up to the second order, which simplifies both the calculations and the theoretical justification of inference for the Poisson lognormal mixed model. The real-data analysis demonstrated the importance of extending the model to multiple predictors, and the simulation studies showed that the empirical coverage of the proposed confidence intervals was consistent with the nominal level.
The proposed method was implemented in R and is available from the authors upon request. A flowchart of the proposed GVA algorithm is shown in Figure 6. Note that the glmer function in the R package lme4 performs similar calculations; however, our proposed method has two advantages over glmer.
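For reference, a minimal glmer call for a model of this form is shown below; the data frame layout (columns y, x, and id) is illustrative and is built from the matrices in the simulation sketch of Section 3.

## Fitting the Poisson mixed model with a random intercept per subject using lme4
library(lme4)
dat <- data.frame(y = as.vector(t(y)), x = as.vector(t(x)),
                  id = factor(rep(1:m, each = n)))   # long format: n rows per subject
fit_glmer <- glmer(y ~ x + (1 | id), family = poisson, data = dat)
summary(fit_glmer)                                   # fixed effects and random-effect variance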
In summary, the Gaussian variational approximation method was developed in this paper for the Poisson lognormal mixed model with one or more predictors. The explicit forms of the estimators of the parameters and their limiting distributions were derived, and hence inference for the parameters was obtained. A real-life example demonstrated the applicability of the proposed method, and simulation studies illustrated its accuracy. Although the same problem was studied in Hall et al. [5] and Hall et al. [6], their studies were restricted to one-predictor models and are difficult to generalize to multiple-predictor models. The proposed method gives results comparable to theirs but is applicable to multiple-predictor models.