Next Article in Journal
On Coefficient Estimates for a Certain Class of Analytic Functions
Next Article in Special Issue
A Method for Augmenting Supersaturated Designs with Newly Added Factors
Previous Article in Journal
Handling Irregular Many-Objective Optimization Problems via Performing Local Searches on External Archives
Previous Article in Special Issue
Testing Multivariate Normality Based on F-Representative Points
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Dependence Structure Analysis and Its Application in Human Microbiome

1
Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC 20057, USA
2
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD 20892, USA
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(1), 9; https://doi.org/10.3390/math11010009
Submission received: 29 November 2022 / Revised: 13 December 2022 / Accepted: 14 December 2022 / Published: 20 December 2022
(This article belongs to the Special Issue Distribution Theory and Application)

Abstract

:
The human microbiome has been recently shown to be associated with disease risks and has important implications in risk stratification and precision medicine. Due to abundant taxa in the human body, microbiome data are high-dimensional and compositional. Dirichlet distributions and their generalization are used to characterize the dependence structures of microbial data. Another existing method for fitting microbiome data employed Gaussian graphical model using the centered log-transformation (CLR). However, Dirichlet distributions are not able to infer networks or to estimate some extremely rare probabilities. On the other hand, it is hard to interpret the network analysis results using CLR. Furthermore, the data analysis showed that there is a lack of efficient multivariate distributions for fitting microbiome data, which results in inadequate statistical inferences. In this paper, we propose new multivariate distributions for modeling the dependence structures of the high dimensional and compositional microbiome data using inverse gamma distributions and copula techniques. The data analysis in the American gut project shows our proposed methods perform well.

1. Introduction

The human microbiome is a gathering of microbes that inhabit different sites of the human body, including bacteria, archaea, and fung. The research on low-price high-throughput sequencing analyses has substantially facilitated the characterization of the compositions and diversities of the human microbiome. It provides a chance to carry out large-scale population-based microbiome studies to identify novel risk factors and to improve the risk prediction of diseases. Over the past two decades, increased research has focused on studying the structure, function, and dynamics of microbiome data, referring to all of the taxa and their genes in a well-defined environment [1]. Technical advancements and reduced costs in sequencing have led to large-scale studies such as the human microbiome project (HMP) and the American gut project (AGP), which characterize the microbiome of largely healthy individuals [2,3]. Most of the early research focused on microbial diversity and taxonomic classification. More recently, research has focused on differential abundance analysis and understanding how the host environment is associated with the microbiome [4,5,6,7]. The human microbiome is known to be associated with complex diseases, such as obesity, inflammatory bowel disease, and rheumatoid arthritis [8,9,10].
Despite these advances, little is known about inter-microbial interactions. Much of the lack of information on microbial interactions comes from the fact that most standard statistical methods for correlation or network analysis cannot be directly applied to such high-dimensional, compositional, and sparse data. To analyze microbiome data, there are two methods in the literature for modeling the distributions of the relative abundance data of many taxa. One method is utilizing multivariate normal distributions to model correlation structures of taxa by means of the centred log ratio transformation (CLR) or alternative log ratio transformations (ALR), and the other is using Dirichlet multinomial distribution to model taxa counts [11,12]. Dirichlet distributions are not quite flexible enough to fit many real data sets in microbiome studies. Moreover, the lack of dependence structure analysis hinders further statistical inferences in microbiome studies [13].
For modeling microbial dependence structures with multivariate normal distributions, two of the most common methods, SparCC and SPIEC-EASI, are used. SparCC uses the additive log-ratio transformations and assumes that the underlying unobserved counts are correlated through log-ratio [14]. In contrast, SPIEC-EASI uses the CLR and assumes that the network of interactions is generated from a precision matrix [15]. While the log-ratio transformations are common in compositional data analysis, they are not particularly fitting for data with excessive zeros [16], as is the case with microbial sequencing data. The normality assumption of such data often does not hold (see, e.g., Figure 1a below), and they require the use of pseudo-counts, thus forcing the assumption that the true absolute abundance (AA) for every taxa is non-zero in each sample. Figure 1 shows that there are large departures between the estimated contour plots of samples and the contour plots based on the estimated normal distribution with CLR or the estimated Dirichlet distribution. Furthermore, the corresponding results from such transformations are difficult to interpret and are sensitive to the choice of reference groups; as Feng et al. [17] pointed out, the results of standard statistical tests performed on log-transformed data are often not relevant for the original, non-transformed data.
To model the high-dimensional and compositional data in microbiome studies, we propose two classes of multivariate distributions on a simplex. A new class of multivariate distributions on a simplex is composed of variable transformation of inverse gamma distributions since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution. Another new class of multivariate distributions on simplex is constructed by means of copula techniques. Copulas are functions that join univariate uniform distribution functions to form multivariate distributions and are powerful for characterizing the dependence structures of random variables [18].
Furthermore, human microbiota can be influenced by environment changes and life-style changes. Microbiome data are compositional and sparse. A few taxa are abundant, most taxa are rare (present in <10% subjects), and there are zero counts for some taxa. This could be caused by sampling zero (low sequence depth) or structural zeros (absent taxa). To deal with sampling zeros, we could enlarge the sequencing depth. However, in our settings, we ignore sampling zeros and only deal with structural zeros by zero-inflated distributions.
The remainder of this paper is organized as follows. Section 2 constructs the new class of multivariate distributions on a simplex by variable transformation of inverse gamma distributions. In Section 3, using the copula techniques, we propose a new class of multivariate distributions on a simplex with a Gaussian copula and a Farlie–Gumbel–Morgenstern copula. The real data from the American gut project (AGP) is analyzed to demonstrate the effectiveness of the proposed new multivariate distributions in Section 4, and some conclusions are given in Section 5.

2. Multivariate Distributions on Simplex via Inverse Gamma Distributions

Suppose there are k + 1 microbial taxa, and denote Y = ( Y 1 , , Y k + 1 ) T as the true relative abundance (RA) of microbial taxa. The RA has a compositional structure with j = 1 k + 1 Y j = 1 , and the vector Y lies on the k-dimensional simplex S k ,
S k : = ( y 1 , , y k ) T : y i > 0 , i = 1 , , k , i = 1 k y i 1
To construct the distributions of Y , the inverse Gamma distributions on ( 0 , ) are employed since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution (IG). Let X 1 , , X k + 1 be independent random variables on ( 0 , ) and have the inverse Gamma distributions X j I G ( α , β j ) with its density function
f ( x ; α , β j ) = β j α Γ ( α ) x α 1 e x p ( β j x ) , for x > 0 ,
where α > 0 is the shape parameter and β j > 0 are the scale parameters for j = 1 , , k + 1 . For a given subject, we assume that the k taxa have the same shape parameter α , but the scale parameters β j vary. With the following transformation,
Y j = X j j = 1 k + 1 X j , Y k + 1 = 1 j = 1 k Y j , for j = 1 , , k ,
The RA of microbial taxa Y has the multivariate distribution on the k-dimensional simplex S k with its density function as
f ( y ; α , β 1 , , β k + 1 ) = Γ ( ( k + 1 ) α ) Γ ( α ) k + 1 j = 1 k + 1 β j α j = 1 k + 1 y j α 1 j = 1 k + 1 β j y j ( k + 1 ) α ,
for y = ( y 1 , , y k + 1 ) T and 0 < y j < 1 , j = 1 k + 1 y j = 1 for j = 1 , , k + 1 . We denote Y = ( Y 1 , , Y k + 1 ) T M S I G ( α ; β 1 , , β k + 1 ) . The special case of this multivariate distribution when α = 1 2 was first introduced by Carlton [19] and further discussed in Favaro et al. [20].
For the density function (4), the marginal density functions are
f ( y j ; α ; β 1 , , β k + 1 ) = β j α Γ ( α ) ( 1 y j ) α + 3 y j α + 1 0 0 exp β j ( 1 y j ) y j ( i j x i ) i j β i x i ( i j x i ) α i j x i α + 1 i j d x i ,
for j = 1 , , k + 1 . Figure 2 shows the estimated marginal density functions by (5) are fitting well for the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP.
In the studies of the human microbiome, it is more expedient to consider the average total positivity [21] (and reverse regularity) of order two microbial taxa. Kendall’s correlation coefficient is a measure of the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables. It is the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables [22,23]. Let Y = ( Y 1 , , Y k + 1 ) T M S I G ( α ; β 1 , , β k + 1 ) with the density given in (4). Because of the transformation (3), the corresponding Kendall correlation coefficient of Y i and Y j is given by
τ ( Y i , Y j ) = E ( s i g n ( Y i Y i ) s i g n ( Y j Y j ) ) = E ( s i g n ( X i X i ) s i g n ( X j X j ) )
where ( Y i , Y j ) is an independent copy of ( Y i , Y j ) and ( X i , X j ) is an independent copy of ( X i , X j ) for i j , i , j = 1 , , k + 1 . Since
P ( X i X i < 0 ) = 0 x β i α Γ ( α ) x α 1 e x p ( β i x ) β i α Γ ( α ) y α 1 e x p ( β i y ) d y d x ,
then, the Kendall correlation coefficient of Y i and Y j is the same as that of two independent IG random variables X i and X j , and we have that
τ ( Y i , Y j ) = P ( X i > X i ) P ( X j > X j ) + P ( X i < X i ) P ( X j < X j ) P ( X i > X i ) P ( X j < X j ) P ( X i < X i ) P ( X j > X j ) = 2 1 2 2 α Γ ( α ) l = 0 Γ ( 2 α + l ) 2 l Γ ( α + l + 1 ) 1 2 2 α + 1 Γ ( α ) l = 0 Γ ( 2 α + l ) 2 l Γ ( α + l + 1 ) ,
which is only dependent on the parameter α , for i j , i , j = 1 , , k + 1 .

2.1. Paramter Estimation

Let y ( 1 ) , , y ( n ) be i.i.d samples from Y M S I G ( α ; β 1 , , β k + 1 ) . The log-likelihood function of parameters Θ = ( α ; β 1 , , β k + 1 ) is given by
( Θ | y ( 1 ) , , y ( n ) ) n α j = 1 k + 1 log β j + n log ( Γ ( ( k + 1 ) α ) ) n ( k + 1 ) log ( Γ ( α ) ) i = 1 n ( α + 1 ) j = 1 k + 1 log y j ( i ) + ( k + 1 ) α log j = 1 k + 1 β j y j ( i ) ,
and we obtain the score functions
k β h = i = 1 n j h k + 1 β j y j ( i ) β h + j h k + 1 β j y j ( i ) i = 1 n 1 β h + j h k + 1 β j y j ( i ) 1 , h = 1 , , k + 1 ; n ( k + 1 ) Γ ( ( k + 1 ) α ) Γ ( ( k + 1 ) α ) Γ ( α ) Γ ( α ) = i = 1 n j = 1 k + 1 log y j ( i ) + log j = 1 k + 1 β j y j ( i ) ( k + 1 ) n j = 1 k + 1 log β j ,
where Γ ( · ) is the derivative function of Γ ( · ) . The maximum likelihood estimation (MLE) of Θ = ( α ; β 1 , , β k + 1 ) can be obtained by an iterative algorithm.

2.2. Multivariate Goodness–of–Fit Testing

For given n samples y ( 1 ) , , y ( n ) in the k-dimensional simplex S k , we can obtain the estimated parameters Θ ^ = ( α ^ ; β ^ 1 , , β ^ k + 1 ) . A goodness-of-testing procedure is necessary to test the hypothesis that the samples y ( i ) i = 1 n come from the distribution Y M S I G ( α ^ ; β ^ 1 , , β ^ k + 1 ) . For one-dimensional or low-dimensional data, there are a wide variety of useful and powerful goodness-of-fit and two-sample testing procedures, such as Kolmogorov–Smirnov (K-S) test. However, for high-dimensional ( k > 3 ) data, those tests rapidly lose statistical power because all finite samples are sparse in high-dimensional settings owing to the “curse of dimensionality” [24]. In this paper, a machine learning method for two-sample testing is employed, which was proposed by Friedman [24] and further developed by Gretton et al. [25]. For the given observed samples y ( i ) i = 1 n , we can obtain a set of score values v i = F ( y ( i ) ) i = 1 n , where F ( y ) is the cdf of M S I G ( α ^ ; β ^ 1 , , β ^ k + 1 ) . At same time, with the “Monte Carlo” samples z ( i ) i = 1 n from the distribution M S I G ( α ^ ; β ^ 1 , , β ^ k + 1 ) , we can obtain another set of score values w i = F ( z ( i ) ) i = 1 n . Then, a univariate two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, can be used to test the goodness of fit based on the scores. The testing procedure is as follows:
Step 1.
Generating n samples x i j i = 1 n from the distribution I G ( α ^ , β ^ j ) for j = 1 , , k + 1 . With the transform (3), we can obtain n samples z ( i ) i = 1 n from the distribution M S I G ( α ^ ; β ^ 1 , , β ^ k + 1 ) such that z ( i ) = ( z i 1 , , z i , k + 1 ) T , z i j = x i j j = 1 k + 1 x i j , z i , k + 1 = 1 j = 1 k z i j , for j = 1 , , k , and i = 1 , , n .
Step 2.
By denoting F ( y ) the cumulative distribution function (cdf) of M S I G ( α ^ ; β ^ 1 , , β ^ k + 1 ) , we obtain two samples w i = F ( z ( i ) ) i = 1 n and v i = F ( y ( i ) ) i = 1 n .
Step 3.
Using the one-dimensional two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, to test the hypothesis that the two samples w i i = 1 n and v i i = 1 n have the same distribution.
Step 4.
Repeating N (for example, N = 5000 ) times of Steps 1–3, we cannot reject the null hypothesis if the percent of rejecting times of testing in Step 3 (i.e., the p-value < 0.05) is less than 5%.
Friedman showed that the power of this testing procedurecan be highly sensitive to the learning machine methods that are employed [24].

2.3. Zero-Inflated Distribution with Zero Counts for Absent Taxa

The above multivariate distribution assumes that all taxa have positive proportions. However, for a given taxonomy j, they may be absent in some subjects. To consider the zero-inflated structures, we introduce the zero-inflated inverse Gamma distributions [26]. Let X 1 * , , X k + 1 * be independent random variables on [ 0 , ) and X j * = ( 1 Δ j ) + Δ j X j , where Δ j follows a Bernoulli distribution Δ j B e r n o u l l i ( π j ) with P ( Δ j = 1 ) = π j and P ( Δ j = 0 ) = 1 π j , and X j follows a inverse Gamma distribution X j I G ( α , β j ) for j = 1 , , k + 1 . Then, X j * has a zero-inflated inverse Gamma distribution X j * Z I I G ( π j , α , β j ) with the probability density
f ( x | π j , α , β j ) = π j δ ( 1 π ) β j α Γ ( α ) x α 1 e x p ( β j x ) 1 δ , δ = I { x = 0 } ,
for x [ 0 , ) . With the transformation
Y j = X j * j = 1 k + 1 X j * , Y k + 1 = 1 j = 1 k Y j , for j = 1 , , k ,
The RA of microbial taxa Y has the multivariate distribution on the k-dimensional simplex S k with its density function as
f ( y ; α , β 1 , , β k + 1 ) = j = 1 k + 1 π j δ j ( 1 π j ) 1 δ j Γ ( j = 1 k + 1 ( 1 δ j ) α ) Γ ( α ) j = 1 k + 1 ( 1 δ j ) j = 1 k + 1 β j ( 1 δ j ) α j = 1 k + 1 y j ( α 1 ) ( 1 δ j ) j = 1 k + 1 ( 1 δ j ) β j y j j = 1 k + 1 ( 1 δ j ) α ,
where δ j = I { y j = 0 } for y = ( y 1 , , y k + 1 ) T and 0 y j < 1 , j = 1 k + 1 y j = 1 for j = 1 , , k + 1 .

3. Multivariate Distributions on Simplex Using Copula Dependence Structures

Since the k + 1 microbial taxa Y = ( Y 1 , , Y k + 1 ) T lie on the k-dimensional simplex S k defined in (1), Y is a k-dimensional random vector. To characterize the dependence structure of the k-dimensional random vector, copula techniques are usually employed [18,23]. Copulas are multivariate distributions with uniformly distributed marginals on [ 0 , 1 ] . Utilizing copulas, the density function of a multivariate distribution can be decomposed into the density weighting function and the product of marginal densities [27]. For a d-variate random vector X = ( X 1 , , X d ) with joint cdf F ( · ) and marginal strictly increasing cdf F i ( · ) ( i = 1 , , d ), Sklars’s Theorem [28] states that there a d-variate copula cdf C ( · ) on [ 0 , 1 ] d exists such that
F ( x 1 , , x d ) = C ( F 1 ( x 1 ) , , F d ( x d ) ) , C ( u 1 , , u d ) = F ( F 1 1 ( u 1 ) , , F d 1 ( u d ) ) ,
for all x = ( x 1 , , x d ) T R d and u = ( u 1 , , u d ) T [ 0 , 1 ] d , where F 1 ( · ) is the inverse function of F ( · ) . Let f ( x 1 , , x d ) and f i ( x i ) ( i = 1 , , d ) be the density function and the marginal density functions of X = ( X 1 , , X d ) , respectively. The corresponding copula density and the density function of X have the following relationships:
c ( u 1 , , u d ) = d C ( u 1 , , u d ) u 1 , , u d = f ( F 1 1 ( u 1 ) , , F d 1 ( u d ) ) f 1 ( F 1 1 ( u 1 ) ) , , f d ( F d 1 ( u d ) ) , f ( x 1 , , x d ) = c ( F 1 ( x 1 ) , , F d ( x d ) ) j = 1 d f j ( x j ) .
The distributions of individual taxonomy are easily estimated, for example, Qiu et al. [13] proposed two-parametric simplex distribution with its density function,
f ( y ) = ( 2 π s ( y ( 1 y ) ) 3 ) 1 / 2 e x p ( ( y m ) 2 2 m ( 1 m ) 2 y ( 1 y ) s ) )
where ( s , m ) are shaping parameters for 0 < y < 1 . In this paper, we propose two families of copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to characterize the dependence structures of microbial taxa such that the corresponding multivariate distributions are good-of-fit for a given microbial data.

3.1. Gaussian Copula

Many new multivariate distributions are constructed based on the dependence structures of multivariate normal distributions, which are called Gaussian copulas. Let Z = ( Z 1 , , Z k ) T N ( 0 , R ) with mean vector 0 and the k × k covariance matrix R,
R = ρ i j : ρ i i = 1 , 1 < ρ i j < 1 for i j , ρ i j = ρ j i ; i , j = 1 , , k .
Denote the cdf of Z by F Z ( z 1 , , z k ) ; the corresponding copula of Z is given by
C Z ( u 1 , , u k ) = F Z ( Φ 1 ( u 1 ) , , Φ 1 ( u k ) ) ,
for ( u 1 , , u k ) T ( 0 , 1 ) k , where Φ ( · ) is the cdf of standard normal distribution N ( 0 , 1 ) .
Suppose that the marginal distributions of the k + 1 microbial taxa Y = ( Y 1 , , Y k + 1 ) T are F j ( y j ) the cdf of Y j for j = 1 , , k , and Y k + 1 = 1 j = 1 k Y j , the distribution function of Y can be constructed with the copula (16) to be
F Y ( y 1 , , y k ; R ) = P Y 1 y 1 , , Y k y k = F Z ( Φ 1 ( F 1 ( y 1 ) ) , , Φ 1 ( F k ( y k ) ) ) .
Fang et al. [18] show that Kendall’s correlation coefficient τ ( Y i , Y j ) of Y i and Y j is given by
τ ( Y i , Y j ) = 2 π arcsin ( ρ i j ) , for i j , i , j = 1 , , k ,
which depends only on ρ i j . Thus, the estimates of ρ i j can be obtained by the estimators of Kendall’s correlation coefficient τ ( Y i , Y j ) of Y i and Y j .

3.2. Farlie–Gumbel–Morgenstern (FGM) Copula

As an alternative to a multivariate normal distribution, the FGM copula has been widely studied and applied to statistical modeling in various research fields [29,30,31]. Let U = ( U 1 , , U k ) T be a random vector that follows a k-dimensional FGM copula with its cdf C F G M ( u 1 , , u k ; Θ ) ,
C F G M ( u 1 , , u k ; Θ ) = P U 1 u 1 , , U k u k = j = 1 k u j 1 + i = 2 k 1 j 1 < < j i k θ j 1 j i ( 1 u j 1 ) ( 1 u j i ) ,
where ( u 1 , , u k ) T [ 0 , 1 ] k and Θ = { θ j 1 j i : 1 j 1 < < j i k , i = 2 , , k } are dependence parameters since Θ 0 means that the random variables U 1 , , U k are obviously independent. According to Johnson and Kotz [30], the 2 k k 1 parameters Θ have the following constraints:
1 + i = 2 k 1 j 1 < < j i k θ j 1 j i δ j 1 δ j i 0 , δ j = ± 1 ,
for j = 1 , , k . With the cdf (19), we have the joint-density function of U given by
c F G M ( u 1 , , u k ; Θ ) = k u 1 u k C F G M ( u 1 , , u k ; Θ ) = 1 + i = 2 k 1 j 1 < < j i k θ j 1 j i ( 1 2 u j 1 ) ( 1 2 u j i ) .
For the given marginal distributions F j ( y j ) of the cdf of Y j for j = 1 , , k , the distribution function of microbial taxa Y = ( Y 1 , , Y k + 1 ) T can be constructed with the copula (19) to be
F Y ( y 1 , , y k ; Θ ) = P Y 1 y 1 , , Y k y k = j = 1 k F j ( y j ) 1 + i = 2 k 1 j 1 < < j i k θ j 1 j i ( 1 F j 1 ( y j 1 ) ) ( 1 F j i ( y j i ) ) ,
where 0 < y j < 1 for j = 1 , , k , and y k + 1 = 1 j = 1 k y j . The Kendall’s correlation coefficient τ ( Y i , Y j ) of Y i and Y j is given by
τ ( Y i , Y j ) = 4 [ 0 , 1 ] 2 C ( u i , u j ) d C ( u i , u j ) 1 = θ i j 9 ( θ i j 8 ) , for i j , i , j = 1 , , k ,
which depends only on θ i j .
Thus, the estimates of θ i j can be obtained by the estimators of Kendall’s correlation coefficient τ ( Y i , Y j ) of Y i and Y j , but all of the parameters Θ = { θ j 1 j i : 1 j 1 < < j i k , i = 2 , , k } should satisfy the constraints in (20). Shih et al. [32,33] provided the maximum likelihood estimates of Θ with the constrains in (20).

3.3. Multivariate Distributions with Zero-Inflated Marginal Distributions

Let the k + 1 microbial taxa Y = ( Y 1 , , Y k + 1 ) T lie on the k-dimensional simplex S k defined in (1). When there are proportions with zero counts for absent taxa, the zero and the non-zero responses are assumed to occur from two different processes; then, we could use two-part models to characterize the marginal distributions. Define that Δ j = I { Y j = 0 } and Y j * = Y j | ( Y j > 0 ) . Let f j * ( y , ω j ) and F j * ( y , ω j ) be the pdf and cdf of Y j * , respectively, where ω j is a parameter (or a parametric vector). Assume that Δ j B e r n o u l l i ( π j ) with P ( Δ j = 1 ) = π j and P ( Δ j = 0 ) = 1 π j . Then, the marginal random variable Y j has the probability density,
f j ( y ) = π j δ j ( 1 π j ) f j * ( y , ω j ) 1 δ j , δ j = I { y = 0 } ,
for j = 1 , , k . To construct the multivariate distribution of Y , we assume that Δ 1 , , Δ k have the multinomial distribution with its probability mass function
g ( δ 1 , , δ k ; π 1 , , π k ) = P { Δ 1 = δ 1 , , Δ k = δ k } = j = 1 k π j δ j × j = 1 k ( 1 π j ) 1 δ j ,
and the joint cdf of ( Y 1 * , , Y k * ) is given by
f Y * ( y 1 , , y k ; ω 1 , , ω k ) = c ( F 1 * ( y 1 , ω 1 ) , , F k * ( y k , ω k ) ) j = 1 k f j * ( y j , ω j ) ,
where c ( u 1 , , u k ) is the density function of a copula C ( u 1 , , u k ) . Denote Υ = ( π 1 , , π k ; ω 1 , , ω k ; Θ ) or Υ = ( π 1 , , π k ; ω 1 , , ω k ; R ) , where Θ and R are the parameters in the given copula function. Thus, the joint probability function of Y is given by
f Y ( y 1 , , y k ; Υ ) = j = 1 k π j δ j + j = 1 k ( 1 π j ) 1 δ j j = 1 k f j * ( y j , ω j ) 1 δ j c ˜ [ F 1 * ( y 1 , ω 1 ) ] 1 δ 1 , , [ F k * ( y k , ω k ) ] 1 δ k ,
where δ j = I { y j = 0 } for j = 1 , , k , c ˜ [ F 1 * ( y 1 , ω 1 ) ] 1 δ 1 , , [ F k * ( y k , ω k ) ] 1 δ k is the density function of C ( u 1 1 δ 1 , , u k 1 δ k ) at u j = F j * ( y j , ω j ) for δ j = 0 and j = 1 , , k . Note that a 0 = 1 and the marginal cdf C ( u 1 1 δ 1 , , u k 1 δ k ) are also a copula.

3.4. Parameter Estimation and Goodness–of–Fit Testing

Let y ( 1 ) , , y ( n ) be i.i.d samples from the multivariate distribution with its pdf in (27). To obtain the maximum likelihood estimates of parameters Υ , a two-stage approach is employed [34]. The marginal distributions are first estimated based on partial maximum likelihood methods. Given these marginal fits, the estimated cumulative distribution functions are then obtained using the estimates of the parameters in copula. Denote δ i j = I { y j ( i ) = 0 } for j = 1 , , k , i = 1 , , n . The estimates of π ’s can be obtained by
π ^ j = i = 1 n δ i j / n , j = 1 , , k ,
and the estimates of ω ’s maximize the log-likelihood functions.
( ω j ) = i = 1 n ( 1 δ i j ) log ( f j * ( y j ( i ) , ω j ) ) , j = 1 , , k .
Then, the estimates of the parameters in copula can be obtained by maximizing the likelihood function
L i k ( Υ ) = i = 1 n c ˜ [ F 1 * ( y 1 ( i ) , ω ^ 1 ) ] 1 δ i 1 , , [ F k * ( y k ( i ) , ω ^ k ) ] 1 δ i k .
When C ( u 1 , , u k ) is an FGM copula, Shih et al. [33] provides the maximum likelihood estimates of Θ . If C ( u 1 , , u k ) is a Gaussian copula, the parameters R = ( ρ i j ) can be estimated by the corresponding estimates of Kendall’s correlation coefficient τ ( Y i * , Y j * ) based on (23).
When we obtain the estimates of parameters Υ , the machine learning method given in Section 2.1 can be used for goodness-of-fit testing of the proposed multivariate distributions. Shih et al. [33] provides a procedure of generating samples from FGM copulas, and Fang et al. [18] gives a procedure of generating samples from Gaussian copulas.

4. Real Data Analysis

Now we apply our proposed methods to the datasets from AGP, a self-selected, open-platform cohort [2]. The cohort consists of individuals mostly from the United States, with some from the United Kingdom and Australia, who opted into the study by providing informed consent and paying a fee to offset the cost of processing and sequencing. The data, both 16S rRNA gene sequencing and self-reported meta-data, are publicly available in The European Bioinformatics Institute repository under the accession ERP012803.
We focus on the fecal microbiome samples and restrict our samples to individuals who reported not having inflammatory bowel disease or diabetes. We filtered data that showed that OTUs were removed if present in fewer than 37% of samples. Samples were removed if total sequencing depth fell below 10,800 sequence reads [15]. After filtering, there were 127 OTUs and 289 samples remaining in our dataset.
To illustrate our methods, the analysis results of two OTUs are shown only in this section. As shown in Figure 2, the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP are fitted well by the marginal distributions of MSIG distribution, where the parameters ( α , β 1 , β 2 ) are estimated to be ( α ^ , β ^ 1 , β ^ 2 ) ) = ( 0.351 , 0.089 , 0.036 ) . The parameters for the fitted Dirichlet distribution are estimated to be ( α ^ 1 , α ^ 2 ) = ( 0.0791 , 0.0562 ) . Figure 3 shows the contour plots from the estimated Dirichlet distribution (DD) and MSIG, respectively. The p-values for the GOF test are 0.001 versus 0.567 between DD and MSIG, which show that the proposed MSIG is goodness-of-fitting the data. Furthermore, the Kendall’s correlation coefficient between OTU1 and OTU2 is estimated to be τ = 0.061 based on the samples, but the corresponding Kendall’s τ are calculated to be 0.102 with DD and 0.057 with MSIG, respectively.
Black lines in both panels are the estimated contour plots for samples. Red lines in the left panel are the contour plots of the estimated Dirichlet distribution. Blue lines in the right panel are the contour plots of the estimated MSIG. These figures show that MSIG is a goodness-of-fit for the data.
Some individual taxa can not be fitted by the marginal distribution of MSIG. Figure 4a,b shows that taxa Clostridium (OTU3) and Escherichia (OTU4) are fitted by the two-parametric simplex distributions [13], where the corresponding parameters are estimated to be ( m ^ 1 , s ^ 1 ) = ( 0.249 , 5.801 ) and ( m ^ 2 , s ^ 2 ) = ( 0.125 , 8.610 ) . The Kendall’s correlation coefficient between OTU3 and OTU4 is estimated to be τ = 0.448 based on the samples. We use the Gaussian copula (GC) to fit the joint distribution of OTU3 and OTU4 with ρ ^ = 0.071 . The contour plots in Figure 4c show that the proposed multivariate distribution via Gaussian copula fitted the joint distribution of OTU3 and OTU4 well, and the p-value for GOF test is 0.647. However, Figure 1 (the right panel) shows that the contour plots of samples are a departure from that of the Dirichlet distribution, of which the corresponding parameters are estimated to be ( α ^ 1 , α ^ 2 ) = ( 0.444 , 0.556 ) ; the Kendall’s τ = 0.407 , and the p-value for the GOF test is 0.001.

5. Discussions

The human microbiome has been recently shown to be associated with disease risks and has important implications in risk stratification and precision medicine. The existing methods for modeling microbiome dynamics focus on abundant taxa and ignore rare taxa. Modeling both will be crucial to provide information about microbial causal interactions and information for designing effective large-scale epidemiology. In this paper, we propose two classes of multivariate distributions to characterize the dependence structures among taxa in microbiome studies. Multivariate distributions on simplex via inverse Gamma distributions (MSIG) are useful in regression analysis due to the close-form of their density functions, but all of the Kendall’s correlation coefficients between two taxa are assumed to be the same. The copula techniques are very useful to characterize the dependence structures of random variables. We proposed two most popular copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to construct the multivariate distributions of taxa. The data analysis shows that our methods perform well.
In this paper, we consider only the dependence structures of taxa at the fixed time points. Some new methodologies for longitudinal observations to study the dynamics of the microbiome community have to be developed, which will investigated in the future.

Author Contributions

Conceptualization, H.-B.F. and P.A.; methodology, S.L.; validation, J.S.; formal analysis, S.L. and J.S.; writing—original draft preparation, S.L.; and writing—review and editing, H.-B.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Cancer Institute (NCI) grant P30CA051008 and the intramural Research Program of National Institute of Health.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data included in this study are available upon request by contact with the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lederberg, J.; Mccray, A.T. “Ome Sweet” Omics-a genealogical treasury of words. Scientist 2001, 15, 8. [Google Scholar]
  2. McDonald, D.; Vázquez-Baeza, Y.; Koslicki, D.; McClelland, J.; Reeve, N.; Xu, Z.; Gonzalez, A.; Knight, R. Striped UniFrac: Enabling microbiome analysis at unprecedented scale. Nat. Methods 2018, 15, 847–848. [Google Scholar] [CrossRef] [PubMed]
  3. Turnbaugh, P.J.; Ley, R.E.; Hamady, M.; Fraser-Liggett, C.M.; Knight, R.; Gordon, J.I. The human microbiome project. Nature 2007, 449, 804–810. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. McMurdie, P.J.; Holmes, S. Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 2014, 10, e1003531. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Paulson, J.N.; Stine, O.C.; Bravo, H.C.; Pop, M. Differential abundance analysis for microbial marker-gene surveys. Nat. Methods 2013, 10, 1200–1202. [Google Scholar] [CrossRef] [Green Version]
  6. Peng, S.; Yin, J.; Liu, X.; Jia, B.; Chang, Z.; Lu, H.; Jiang, N.; Chen, Q. First insights into the microbial diversity in the omasum and reticulum of bovine using Illumina sequencing. J. Appl. Genet. 2015, 56, 393–401. [Google Scholar] [CrossRef] [Green Version]
  7. White, J.R.; Nagarajan, N.; Pop, M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput. Biol. 2009, 5, e1000352. [Google Scholar] [CrossRef]
  8. Greenblum, S.; Turnbaugh, P.J.; Borenstein, E. Metagenomic systems biology of the human gut microbiome reveals topological shifts associated with obesity and inflammatory bowel disease. Proc. Natl. Acad. Sci. USA 2012, 109, 594–599. [Google Scholar] [CrossRef] [Green Version]
  9. Scher, J.U.; Abramson, S.B. The microbiome and rheumatoid arthritis. Nat. Rev. Rheumatol. 2011, 7, 569–578. [Google Scholar] [CrossRef]
  10. Taneja, V. Arthritis susceptibility and the gut microbiome. FEBS Lett. 2014, 588, 4244–4249. [Google Scholar] [CrossRef] [Green Version]
  11. Harrison, J.G.; Calder, W.J.; Shastry, V.; Buerkle, C.A. Dirichlet-multinomial modeling outperforms alternatives for analysis of microbiome and other ecological count data. Mol. Ecol. Resour. 2020, 20, 481–497. [Google Scholar] [CrossRef] [PubMed]
  12. Metwally, A.A.; Aldirawi, H.; Yang, J. A review on probabilistic models used in microbiome studies. Commun. Inf. Syst. 2018, 18, 173–191. [Google Scholar] [CrossRef]
  13. Qiu, Z.; Song, P.X.K.; Tan, M. Simplex mixed-effects models for longitudinal proportional data. Scand. J. Stat. 2008, 35, 577–596. [Google Scholar] [CrossRef]
  14. Friedman, J.; Alm, E.J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 2012, 8, e1002687. [Google Scholar] [CrossRef] [Green Version]
  15. Kurtz, Z.D.; Müller, C.L.; Miraldi, E.R.; Littman, D.R.; Blaser, M.J.; Bonneau, R.A. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol. 2015, 11, e1004226. [Google Scholar] [CrossRef] [Green Version]
  16. Aitchison, J. The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B 1982, 44, 139–160. [Google Scholar] [CrossRef]
  17. Feng, C.; Wang, H.; Lu, N.; Tu, X.M. Log transformation: Application and interpretation in biomedical research. Stat. Med. 2013, 32, 230–239. [Google Scholar] [CrossRef]
  18. Fang, H.B.; Fang, K.T.; Kotz, S. The meta-elliptical distributions with given marginals. J. Multivar. Anal. 2002, 82, 2017. [Google Scholar] [CrossRef] [Green Version]
  19. Carlton, M.A. A family of densities derived from the three-parameter Dirichlet process. J. Appl. Probab. 2002, 39, 764–774. [Google Scholar] [CrossRef]
  20. Favaro, S.; Hadjicharalambous, G.; Prünster, I. On a class of distributions on the simplex. J. Stat. Plan. Inference 2011, 141, 2987–3004. [Google Scholar] [CrossRef] [Green Version]
  21. Erdélyi, A.; Karlin, S. Total Positivity, Vol. I Stanford University Press; London: Oxford University Press, xi 576 pp. 166s. 6d. Proc. Edinb. Math. Soc. 1968, 17, 110. [Google Scholar] [CrossRef]
  22. Kruskal, W.H. Ordinal measures of association. J. Am. Stat. Assoc. 1958, 53, 814–861. [Google Scholar] [CrossRef]
  23. Nelsen, R.B. An Introduction to Copulas; Springer: Berlin, Germany, 2006. [Google Scholar]
  24. Friedman, J. On multivariate goodness-of-fit and two-sample testing. In Technical Report SLACPUB-10325; University of Stanford Statistics Department: Stanford, CA, USA, 2003. [Google Scholar]
  25. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  26. Liu, L.; Shih, Y.-C.T.; Strawderman, R.L.; Zhang, D.; Johnson, B.A.; Chai, H. Statistical analysis of zero-inflated nonnegative continuous data: A review. Stat. Sci. 2019, 34, 253–279. [Google Scholar] [CrossRef]
  27. DallÁglio, G.; Kotz, S.; Salinetti, G. Advance in Probability Distributions with Given Marginals; Kluwer Academic: Dordrecht, The Netherlands, 1991. [Google Scholar]
  28. Sklar, A. Fonctions de répartition á n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 1959, 8, 229–231. [Google Scholar]
  29. Jaworski, P.; Durante, F.; Härdle, W.; Rychlik, T. Copula Theory and Its Applications; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  30. Johnson, N.L.; Kotz, S. On some generalized Farlie–Gumbel–Morgenstern distributions. Commun. Stat. 1975, 4, 415–427. [Google Scholar] [CrossRef]
  31. Navarro, J.; Durante, F. Copula-based representations for the reliability of the residual lifetimes of coherent systems with dependent components. J. Multivar. Anal. 2017, 158, 87–102. [Google Scholar] [CrossRef]
  32. Ota, S.; Kimura, M. Effective estimation algorithm for parameters of multivariate Farlie–Gumbel–Morgenstern copula. Jpn. J. Stat. Data Sci. 2021, 4, 1049–1078. [Google Scholar] [CrossRef]
  33. Shih, J.H.; Chang, Y.T.; Konno, Y.; Emura, T. Estimation of a common mean vector in bivariate meta-analysis under the FGM copula. Statistics 2019, 53, 673–695. [Google Scholar] [CrossRef]
  34. Joe, H.; Xu, J.J. The estimation method of inference functions for margins for multivariate models. In Technical Report 166; Department of Statistics, University of British Columbia: Vancouver, BC, Canada, 1996. [Google Scholar]
Figure 1. The contour plots of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP in the upper panel and taxa Clostridium (OTU3) and Escherichia (OTU4) on the lower panel. In (a), AA is the absolute abundance that describes the OTU counts for taxa. The black lines are the estimated contour plots of samples, and the red lines are the contour plots of estimated distributions, which shows that there are large departures between the estimated contour plots of samples and the contour plots based on the estimated normal distribution with CLR (a) or the estimated Dirichlet distribution (b).
Figure 1. The contour plots of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP in the upper panel and taxa Clostridium (OTU3) and Escherichia (OTU4) on the lower panel. In (a), AA is the absolute abundance that describes the OTU counts for taxa. The black lines are the estimated contour plots of samples, and the red lines are the contour plots of estimated distributions, which shows that there are large departures between the estimated contour plots of samples and the contour plots based on the estimated normal distribution with CLR (a) or the estimated Dirichlet distribution (b).
Mathematics 11 00009 g001
Figure 2. The estimated marginal density functions: the black line is estimated by the samples, the gray line by Dirichlet distributions, and the red line is by MSIG distribution.
Figure 2. The estimated marginal density functions: the black line is estimated by the samples, the gray line by Dirichlet distributions, and the red line is by MSIG distribution.
Mathematics 11 00009 g002
Figure 3. A comparison of contour plots between MSIG and Dirichlet distribution. The black lines are the estimated contour plots of samples. The red lines are the contour plots of the estimated Dirichlet distribution (a). The blue lines are the contour plots of the estimated MSIG (b).
Figure 3. A comparison of contour plots between MSIG and Dirichlet distribution. The black lines are the estimated contour plots of samples. The red lines are the contour plots of the estimated Dirichlet distribution (a). The blue lines are the contour plots of the estimated MSIG (b).
Mathematics 11 00009 g003
Figure 4. Multivariate distribution via Gaussian copula of taxa Clostridium and Escherichia. The upper panel (a,b) illustrates the estimated marginal density functions: the black line is estimated by the samples, the gray line by Dirichlet distributions, and the orange line is by Gaussian copulas. The lower panel (c) shows the contour plots of the estimated Gaussian copulas.
Figure 4. Multivariate distribution via Gaussian copula of taxa Clostridium and Escherichia. The upper panel (a,b) illustrates the estimated marginal density functions: the black line is estimated by the samples, the gray line by Dirichlet distributions, and the orange line is by Gaussian copulas. The lower panel (c) shows the contour plots of the estimated Gaussian copulas.
Mathematics 11 00009 g004
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, S.; Shi, J.; Albert, P.; Fang, H.-B. Dependence Structure Analysis and Its Application in Human Microbiome. Mathematics 2023, 11, 9. https://doi.org/10.3390/math11010009

AMA Style

Li S, Shi J, Albert P, Fang H-B. Dependence Structure Analysis and Its Application in Human Microbiome. Mathematics. 2023; 11(1):9. https://doi.org/10.3390/math11010009

Chicago/Turabian Style

Li, Shilan, Jianxin Shi, Paul Albert, and Hong-Bin Fang. 2023. "Dependence Structure Analysis and Its Application in Human Microbiome" Mathematics 11, no. 1: 9. https://doi.org/10.3390/math11010009

APA Style

Li, S., Shi, J., Albert, P., & Fang, H. -B. (2023). Dependence Structure Analysis and Its Application in Human Microbiome. Mathematics, 11(1), 9. https://doi.org/10.3390/math11010009

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop