1. Introduction
The human microbiome is a gathering of microbes that inhabit different sites of the human body, including bacteria, archaea, and fung. The research on low-price high-throughput sequencing analyses has substantially facilitated the characterization of the compositions and diversities of the human microbiome. It provides a chance to carry out large-scale population-based microbiome studies to identify novel risk factors and to improve the risk prediction of diseases. Over the past two decades, increased research has focused on studying the structure, function, and dynamics of microbiome data, referring to all of the taxa and their genes in a well-defined environment [
1]. Technical advancements and reduced costs in sequencing have led to large-scale studies such as the human microbiome project (HMP) and the American gut project (AGP), which characterize the microbiome of largely healthy individuals [
2,
3]. Most of the early research focused on microbial diversity and taxonomic classification. More recently, research has focused on differential abundance analysis and understanding how the host environment is associated with the microbiome [
4,
5,
6,
7]. The human microbiome is known to be associated with complex diseases, such as obesity, inflammatory bowel disease, and rheumatoid arthritis [
8,
9,
10].
Despite these advances, little is known about inter-microbial interactions. Much of the lack of information on microbial interactions comes from the fact that most standard statistical methods for correlation or network analysis cannot be directly applied to such high-dimensional, compositional, and sparse data. To analyze microbiome data, there are two methods in the literature for modeling the distributions of the relative abundance data of many taxa. One method is utilizing multivariate normal distributions to model correlation structures of taxa by means of the centred log ratio transformation (CLR) or alternative log ratio transformations (ALR), and the other is using Dirichlet multinomial distribution to model taxa counts [
11,
12]. Dirichlet distributions are not quite flexible enough to fit many real data sets in microbiome studies. Moreover, the lack of dependence structure analysis hinders further statistical inferences in microbiome studies [
13].
For modeling microbial dependence structures with multivariate normal distributions, two of the most common methods, SparCC and SPIEC-EASI, are used. SparCC uses the additive log-ratio transformations and assumes that the underlying unobserved counts are correlated through log-ratio [
14]. In contrast, SPIEC-EASI uses the CLR and assumes that the network of interactions is generated from a precision matrix [
15]. While the log-ratio transformations are common in compositional data analysis, they are not particularly fitting for data with excessive zeros [
16], as is the case with microbial sequencing data. The normality assumption of such data often does not hold (see, e.g.,
Figure 1a below), and they require the use of pseudo-counts, thus forcing the assumption that the true absolute abundance (AA) for every taxa is non-zero in each sample.
Figure 1 shows that there are large departures between the estimated contour plots of samples and the contour plots based on the estimated normal distribution with CLR or the estimated Dirichlet distribution. Furthermore, the corresponding results from such transformations are difficult to interpret and are sensitive to the choice of reference groups; as Feng et al. [
17] pointed out, the results of standard statistical tests performed on log-transformed data are often not relevant for the original, non-transformed data.
To model the high-dimensional and compositional data in microbiome studies, we propose two classes of multivariate distributions on a simplex. A new class of multivariate distributions on a simplex is composed of variable transformation of inverse gamma distributions since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution. Another new class of multivariate distributions on simplex is constructed by means of copula techniques. Copulas are functions that join univariate uniform distribution functions to form multivariate distributions and are powerful for characterizing the dependence structures of random variables [
18].
Furthermore, human microbiota can be influenced by environment changes and life-style changes. Microbiome data are compositional and sparse. A few taxa are abundant, most taxa are rare (present in <10% subjects), and there are zero counts for some taxa. This could be caused by sampling zero (low sequence depth) or structural zeros (absent taxa). To deal with sampling zeros, we could enlarge the sequencing depth. However, in our settings, we ignore sampling zeros and only deal with structural zeros by zero-inflated distributions.
The remainder of this paper is organized as follows.
Section 2 constructs the new class of multivariate distributions on a simplex by variable transformation of inverse gamma distributions. In
Section 3, using the copula techniques, we propose a new class of multivariate distributions on a simplex with a Gaussian copula and a Farlie–Gumbel–Morgenstern copula. The real data from the American gut project (AGP) is analyzed to demonstrate the effectiveness of the proposed new multivariate distributions in
Section 4, and some conclusions are given in
Section 5.
2. Multivariate Distributions on Simplex via Inverse Gamma Distributions
Suppose there are
microbial taxa, and denote
as the true relative abundance (RA) of microbial taxa. The RA has a compositional structure with
, and the vector
lies on the
k-dimensional simplex
,
To construct the distributions of
, the inverse Gamma distributions on
are employed since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution (IG). Let
be independent random variables on
and have the inverse Gamma distributions
with its density function
where
is the shape parameter and
are the scale parameters for
. For a given subject, we assume that the
k taxa have the same shape parameter
, but the scale parameters
vary. With the following transformation,
The RA of microbial taxa
has the multivariate distribution on the
k-dimensional simplex
with its density function as
for
and
,
for
. We denote
. The special case of this multivariate distribution when
was first introduced by Carlton [
19] and further discussed in Favaro et al. [
20].
For the density function (
4), the marginal density functions are
for
.
Figure 2 shows the estimated marginal density functions by (
5) are fitting well for the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP.
In the studies of the human microbiome, it is more expedient to consider the average total positivity [
21] (and reverse regularity) of order two microbial taxa. Kendall’s correlation coefficient is a measure of the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables. It is the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables [
22,
23]. Let
with the density given in (
4). Because of the transformation (
3), the corresponding Kendall correlation coefficient of
and
is given by
where
is an independent copy of
and
is an independent copy of
for
,
. Since
then, the Kendall correlation coefficient of
and
is the same as that of two independent IG random variables
and
, and we have that
which is only dependent on the parameter
, for
,
.
2.1. Paramter Estimation
Let
be
i.i.d samples from
. The log-likelihood function of parameters
is given by
and we obtain the score functions
where
is the derivative function of
. The maximum likelihood estimation (MLE) of
can be obtained by an iterative algorithm.
2.2. Multivariate Goodness–of–Fit Testing
For given
n samples
in the
k-dimensional simplex
, we can obtain the estimated parameters
. A goodness-of-testing procedure is necessary to test the hypothesis that the samples
come from the distribution
. For one-dimensional or low-dimensional data, there are a wide variety of useful and powerful goodness-of-fit and two-sample testing procedures, such as Kolmogorov–Smirnov (K-S) test. However, for high-dimensional (
) data, those tests rapidly lose statistical power because all finite samples are sparse in high-dimensional settings owing to the “curse of dimensionality” [
24]. In this paper, a machine learning method for two-sample testing is employed, which was proposed by Friedman [
24] and further developed by Gretton et al. [
25]. For the given observed samples
, we can obtain a set of score values
, where
is the cdf of
. At same time, with the “Monte Carlo” samples
from the distribution
, we can obtain another set of score values
. Then, a univariate two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, can be used to test the goodness of fit based on the scores. The testing procedure is as follows:
- Step 1.
Generating
n samples
from the distribution
for
. With the transform (
3), we can obtain
n samples
from the distribution
such that
,
,
, for
, and
.
- Step 2.
By denoting the cumulative distribution function (cdf) of , we obtain two samples and .
- Step 3.
Using the one-dimensional two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, to test the hypothesis that the two samples and have the same distribution.
- Step 4.
Repeating N (for example, ) times of Steps 1–3, we cannot reject the null hypothesis if the percent of rejecting times of testing in Step 3 (i.e., the p-value < 0.05) is less than 5%.
Friedman showed that the power of this testing procedurecan be highly sensitive to the learning machine methods that are employed [
24].
2.3. Zero-Inflated Distribution with Zero Counts for Absent Taxa
The above multivariate distribution assumes that all taxa have positive proportions. However, for a given taxonomy
j, they may be absent in some subjects. To consider the zero-inflated structures, we introduce the zero-inflated inverse Gamma distributions [
26]. Let
be independent random variables on
and
, where
follows a Bernoulli distribution
with
and
, and
follows a inverse Gamma distribution
for
. Then,
has a zero-inflated inverse Gamma distribution
with the probability density
for
. With the transformation
The RA of microbial taxa
has the multivariate distribution on the
k-dimensional simplex
with its density function as
where
for
and
,
for
.
3. Multivariate Distributions on Simplex Using Copula Dependence Structures
Since the
microbial taxa
lie on the
k-dimensional simplex
defined in (
1),
is a
k-dimensional random vector. To characterize the dependence structure of the
k-dimensional random vector, copula techniques are usually employed [
18,
23]. Copulas are multivariate distributions with uniformly distributed marginals on
. Utilizing copulas, the density function of a multivariate distribution can be decomposed into the density weighting function and the product of marginal densities [
27]. For a
d-variate random vector
with joint cdf
and marginal strictly increasing cdf
(
), Sklars’s Theorem [
28] states that there a
d-variate copula cdf
on
exists such that
for all
and
, where
is the inverse function of
. Let
and
be the density function and the marginal density functions of
, respectively. The corresponding copula density and the density function of
have the following relationships:
The distributions of individual taxonomy are easily estimated, for example, Qiu et al. [
13] proposed two-parametric simplex distribution with its density function,
where
are shaping parameters for
. In this paper, we propose two families of copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to characterize the dependence structures of microbial taxa such that the corresponding multivariate distributions are good-of-fit for a given microbial data.
3.1. Gaussian Copula
Many new multivariate distributions are constructed based on the dependence structures of multivariate normal distributions, which are called Gaussian copulas. Let
with mean vector
and the
covariance matrix
R,
Denote the cdf of
by
; the corresponding copula of
is given by
for
, where
is the cdf of standard normal distribution
.
Suppose that the marginal distributions of the
microbial taxa
are
the cdf of
for
, and
, the distribution function of
can be constructed with the copula (
16) to be
Fang et al. [
18] show that Kendall’s correlation coefficient
of
and
is given by
which depends only on
. Thus, the estimates of
can be obtained by the estimators of Kendall’s correlation coefficient
of
and
.
3.2. Farlie–Gumbel–Morgenstern (FGM) Copula
As an alternative to a multivariate normal distribution, the FGM copula has been widely studied and applied to statistical modeling in various research fields [
29,
30,
31]. Let
be a random vector that follows a
k-dimensional FGM copula with its cdf
,
where
and
are dependence parameters since
means that the random variables
are obviously independent. According to Johnson and Kotz [
30], the
parameters
have the following constraints:
for
. With the cdf (
19), we have the joint-density function of
given by
For the given marginal distributions
of the cdf of
for
, the distribution function of microbial taxa
can be constructed with the copula (
19) to be
where
for
, and
. The Kendall’s correlation coefficient
of
and
is given by
which depends only on
.
Thus, the estimates of
can be obtained by the estimators of Kendall’s correlation coefficient
of
and
, but all of the parameters
should satisfy the constraints in (
20). Shih et al. [
32,
33] provided the maximum likelihood estimates of
with the constrains in (
20).
3.3. Multivariate Distributions with Zero-Inflated Marginal Distributions
Let the
microbial taxa
lie on the
k-dimensional simplex
defined in (
1). When there are proportions with zero counts for absent taxa, the zero and the non-zero responses are assumed to occur from two different processes; then, we could use two-part models to characterize the marginal distributions. Define that
and
. Let
and
be the pdf and cdf of
, respectively, where
is a parameter (or a parametric vector). Assume that
with
and
. Then, the marginal random variable
has the probability density,
for
. To construct the multivariate distribution of
, we assume that
have the multinomial distribution with its probability mass function
and the joint cdf of
is given by
where
is the density function of a copula
. Denote
or
, where
and
R are the parameters in the given copula function. Thus, the joint probability function of
is given by
where
for
,
is the density function of
at
for
and
. Note that
and the marginal cdf
are also a copula.
3.4. Parameter Estimation and Goodness–of–Fit Testing
Let
be
i.i.d samples from the multivariate distribution with its pdf in (
27). To obtain the maximum likelihood estimates of parameters
, a two-stage approach is employed [
34]. The marginal distributions are first estimated based on partial maximum likelihood methods. Given these marginal fits, the estimated cumulative distribution functions are then obtained using the estimates of the parameters in copula. Denote
for
,
. The estimates of
’s can be obtained by
and the estimates of
’s maximize the log-likelihood functions.
Then, the estimates of the parameters in copula can be obtained by maximizing the likelihood function
When
is an FGM copula, Shih et al. [
33] provides the maximum likelihood estimates of
. If
is a Gaussian copula, the parameters
can be estimated by the corresponding estimates of Kendall’s correlation coefficient
based on (
23).
When we obtain the estimates of parameters
, the machine learning method given in
Section 2.1 can be used for goodness-of-fit testing of the proposed multivariate distributions. Shih et al. [
33] provides a procedure of generating samples from FGM copulas, and Fang et al. [
18] gives a procedure of generating samples from Gaussian copulas.
4. Real Data Analysis
Now we apply our proposed methods to the datasets from AGP, a self-selected, open-platform cohort [
2]. The cohort consists of individuals mostly from the United States, with some from the United Kingdom and Australia, who opted into the study by providing informed consent and paying a fee to offset the cost of processing and sequencing. The data, both 16S rRNA gene sequencing and self-reported meta-data, are publicly available in The European Bioinformatics Institute repository under the accession ERP012803.
We focus on the fecal microbiome samples and restrict our samples to individuals who reported not having inflammatory bowel disease or diabetes. We filtered data that showed that OTUs were removed if present in fewer than 37% of samples. Samples were removed if total sequencing depth fell below 10,800 sequence reads [
15]. After filtering, there were 127 OTUs and 289 samples remaining in our dataset.
To illustrate our methods, the analysis results of two OTUs are shown only in this section. As shown in
Figure 2, the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP are fitted well by the marginal distributions of MSIG distribution, where the parameters
are estimated to be
. The parameters for the fitted Dirichlet distribution are estimated to be
,
.
Figure 3 shows the contour plots from the estimated Dirichlet distribution (DD) and MSIG, respectively. The p-values for the GOF test are 0.001 versus 0.567 between DD and MSIG, which show that the proposed MSIG is goodness-of-fitting the data. Furthermore, the Kendall’s correlation coefficient between OTU1 and OTU2 is estimated to be
based on the samples, but the corresponding Kendall’s
are calculated to be
with DD and
with MSIG, respectively.
Black lines in both panels are the estimated contour plots for samples. Red lines in the left panel are the contour plots of the estimated Dirichlet distribution. Blue lines in the right panel are the contour plots of the estimated MSIG. These figures show that MSIG is a goodness-of-fit for the data.
Some individual taxa can not be fitted by the marginal distribution of MSIG.
Figure 4a,b shows that taxa Clostridium (OTU3) and Escherichia (OTU4) are fitted by the two-parametric simplex distributions [
13], where the corresponding parameters are estimated to be
and
. The Kendall’s correlation coefficient between OTU3 and OTU4 is estimated to be
based on the samples. We use the Gaussian copula (GC) to fit the joint distribution of OTU3 and OTU4 with
. The contour plots in
Figure 4c show that the proposed multivariate distribution via Gaussian copula fitted the joint distribution of OTU3 and OTU4 well, and the
p-value for GOF test is 0.647. However,
Figure 1 (the right panel) shows that the contour plots of samples are a departure from that of the Dirichlet distribution, of which the corresponding parameters are estimated to be
; the Kendall’s
, and the
p-value for the GOF test is 0.001.
5. Discussions
The human microbiome has been recently shown to be associated with disease risks and has important implications in risk stratification and precision medicine. The existing methods for modeling microbiome dynamics focus on abundant taxa and ignore rare taxa. Modeling both will be crucial to provide information about microbial causal interactions and information for designing effective large-scale epidemiology. In this paper, we propose two classes of multivariate distributions to characterize the dependence structures among taxa in microbiome studies. Multivariate distributions on simplex via inverse Gamma distributions (MSIG) are useful in regression analysis due to the close-form of their density functions, but all of the Kendall’s correlation coefficients between two taxa are assumed to be the same. The copula techniques are very useful to characterize the dependence structures of random variables. We proposed two most popular copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to construct the multivariate distributions of taxa. The data analysis shows that our methods perform well.
In this paper, we consider only the dependence structures of taxa at the fixed time points. Some new methodologies for longitudinal observations to study the dynamics of the microbiome community have to be developed, which will investigated in the future.