1. Introduction
The family of skew-symmetric distributions has been increasingly recognized for its flexibility and efficacy in modeling real-world data by transforming symmetric probability density functions (PDFs) with specific generators. This family is defined by the following PDF:
$$2 f(z)\, G(w(z)), \quad z \in \mathbb{R}, \qquad (1)$$
where
f is a symmetric PDF centered at 0 and
G is the cumulative distribution function (CDF) of a continuous random variable that is symmetric around 0. The function
w is required to be odd and continuous, meaning $w(-z) = -w(z)$ for all $z$.
This framework was initially developed by Azzalini [
1,
2], who introduced the skew-normal distribution by setting
, the PDF of the standard normal distribution, and
. This construction allows the skew-symmetric distribution to encapsulate both symmetric and skewed data through the parameter
. Over time, various researchers (e.g., Gupta et al. [
3], Ma and Genton [
4], and Arellano-Valle et al. [
5]) have expanded this model to include different forms of
w, such as
and
, broadening its applicability within the skew-symmetric framework.
The general form (
1) encompasses a wide range of submodels, from the symmetric density
f (when
) to the highly skewed half-
f densities (as
). These models can capture varying degrees of skewness in data, making them valuable in many statistical applications. For example, Pewsey [
6] examined a subfamily where
for
. However, the model
,
, with location parameter
and scale parameter
, encounters significant challenges in maximum likelihood estimation (MLE) when
. Specifically, when
, the expected information matrix becomes singular, complicating the estimation process. Pewsey also noted that the observed information matrix fails to have an inverse for the CDF
G when
.
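The source of this singularity can be seen in a short worked derivation for the skew-normal location-scale case. The sketch below assumes the usual parametrization with location $\xi$, scale $\eta$, and skewness $\lambda$; these symbols are chosen here for illustration and may differ from the notation used elsewhere in the paper.

```latex
% One observation x, with standardized value z = (x - \xi)/\eta.
\begin{align*}
\ell(\xi,\eta,\lambda) &= \log 2 - \log\eta + \log\phi(z) + \log\Phi(\lambda z), \\
\frac{\partial \ell}{\partial \xi} &= \frac{1}{\eta}\left[ z - \lambda\,\frac{\phi(\lambda z)}{\Phi(\lambda z)} \right],
\qquad
\frac{\partial \ell}{\partial \lambda} = z\,\frac{\phi(\lambda z)}{\Phi(\lambda z)}. \\
\intertext{At $\lambda = 0$, using $\phi(0)/\Phi(0) = \sqrt{2/\pi}$:}
\left.\frac{\partial \ell}{\partial \xi}\right|_{\lambda=0} &= \frac{z}{\eta},
\qquad
\left.\frac{\partial \ell}{\partial \lambda}\right|_{\lambda=0} = \sqrt{\tfrac{2}{\pi}}\, z
= \eta\sqrt{\tfrac{2}{\pi}}\,\left.\frac{\partial \ell}{\partial \xi}\right|_{\lambda=0}.
\end{align*}
```

Because the two score components are proportional at $\lambda = 0$, the covariance matrix of the score vector, that is, the expected information matrix, has linearly dependent rows and is therefore singular at that point.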
To address these challenges and further enhance the applicability of skew-symmetric distributions, we propose a novel
w function that not only avoids the singularity issues at
but also enables the modeling of bimodal data, which is crucial in many practical fields, including medicine. Specifically, we consider
for
and
for
. This function, equivalent to
, where
denotes the sign function, meets the necessary properties of
w and introduces bimodality into the model. Some models in the literature that do not present singularity issues in the information matrix are as follows: Bakouch et al. [
7] introduce a family of skewed distributions and explore the bimodal skew-normal distribution; Salinas et al. [
8] present a two-piece normal distribution for modeling biaxial fatigue data; and Khorsheed et al. [
9] propose a flexible form of three-parameter skew-normal distributions, enhancing flexibility for practical and industrial applications.
The proposed w function ensures that the model satisfies the regularity conditions required for deriving the asymptotic distribution of the maximum likelihood estimator of the parameter vector. Importantly, the observed information matrix remains non-singular when the asymmetry parameter equals zero, overcoming a significant limitation of previous models. The motivation for introducing this function is twofold: to remove the singularity at that point and to incorporate bimodality, which enhances the model’s capability to represent complex data distributions, particularly in fields such as medical research.
The primary objectives of the proposed bimodal skew-symmetric distributions are to provide a flexible framework for accurately modeling data with bimodal and skewed characteristics, which are common in various practical applications. The goals include extending the existing skew-normal distributions to accommodate bimodal features, enhancing the ability to model asymmetric data, and developing practical tools for parameter estimation and goodness-of-fit tests. Additionally, we aim to demonstrate the effectiveness of the proposed model through empirical analyses, such as on protein data from cancer cells, to illustrate its practical value and encourage its adoption in relevant fields.
These distributions are highly relevant in various fields where data exhibit bimodal and skewed characteristics. For example, in biology, they can model the distribution of protein expression levels in cancer cells, where distinct subpopulations of cells exhibit different expression patterns. In finance, these distributions can describe the returns of assets that have two predominant regimes, such as bullish and bearish market conditions, while also accounting for skewness due to market asymmetries. In environmental science, they are useful for modeling pollutant concentrations that show bimodal behavior due to varying sources and conditions. By providing a flexible framework that captures these complex data structures, the proposed distribution offers significant advantages for accurate modeling and inference in real-world applications.
This paper is structured as follows.
Section 2 defines the bimodal skew-symmetric distribution and examines its key probabilistic properties as well as certain inferential issues.
Section 3 introduces the bimodal skew-normal (BSN) distribution as a special case of the bimodal skew-symmetric family and discusses its properties and estimation. In
Section 4, we demonstrate the adaptability of this class of distributions by analyzing data on proteins in cancer cells. Finally,
Section 5 provides concluding remarks.
2. Bimodal Skew-Symmetric Family
In this paper, we investigate a family of distributions, called bimodal skew-symmetric distributions, that is generated by Equation (
1) using
, where
. We start by presenting a lemma that characterizes this class of distributions and then proceed to derive several important properties. These properties are relevant for understanding the behavior of this family of distributions and for developing inferential procedures for fitting the model to data.
Lemma 1. Let f be a symmetric PDF about 0 and let G be the CDF of a continuous random variable that is symmetric around 0. We define the function
which is a PDF for any value of . A random variable Z with a bimodal skew-symmetric distribution and a PDF given by (2) is denoted by .
Proof. Let
. We aim to prove that
for all
. In fact,
□
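As a numerical companion to Lemma 1, the following minimal sketch checks that a density of the general skew-symmetric form 2 f(z) G(w(z)) integrates to one for several odd functions w. It assumes f and G are the standard normal PDF and CDF; the three w functions and the helper names (`skew_symmetric_pdf`, `simpson`) are illustrative choices made here, not the specific w proposed in this paper.

```python
import math

def norm_pdf(z):
    """Standard normal PDF, used here as the symmetric base density f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, used here as the symmetric CDF G."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def skew_symmetric_pdf(z, w, f=norm_pdf, G=norm_cdf):
    """Density of the general skew-symmetric form 2 f(z) G(w(z))."""
    return 2.0 * f(z) * G(w(z))

def simpson(func, a, b, n=4000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

# Illustrative odd, continuous functions (assumptions, not taken from the paper).
odd_ws = {
    "linear": lambda z: 2.0 * z,
    "cubic": lambda z: z ** 3 - z,
    "signed square": lambda z: 1.5 * z * abs(z),
}

# The normal tails beyond |z| = 10 are negligible, so [-10, 10] suffices.
totals = {
    name: simpson(lambda z, w=w: skew_symmetric_pdf(z, w), -10.0, 10.0)
    for name, w in odd_ws.items()
}
for name, total in totals.items():
    print(f"{name}: integral = {total:.8f}")
```

The check works for any odd w because G(w(-z)) = 1 - G(w(z)) when G is symmetric, so the contributions at z and -z always sum to 2 f(z).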
2.1. Cumulative Distribution Function
The cumulative distribution function corresponding to the density in (
2) is given by
Proposition 1. Suppose as given in (3), then the following properties are obtained: - (i)
, where F is the CDF of f.
- (ii)
.
- (iii)
.
- (iv)
,
where is the indicator function.
Proof. - (i)
, where F is the CDF of f.
- (ii)
- (iii)
- (iv)
□
2.2. Properties
2.2.1. Basic Properties
The following properties are directly derived from Lemma 1.
Proposition 2. Using the previous notations, the following properties hold:
- (i)
.
- (ii)
.
- (iii)
.
- (iv)
.
- (v)
.
- (vi)
.
Proof. Property (i) of Proposition 2 shows that the f distribution belongs to the family of distributions. Properties (ii)–(iv) indicate the distributions of the variables , , and , respectively. Properties (v) and (vi) show the distributions that follow by considering the limiting values of . □
2.2.2. Bimodality Property
The bimodality property of the random variable
Z when it follows a BSf distribution with
is presented in Proposition 3. To prove this, we differentiate Equation (
2) with respect to
z and equate it to zero, which yields two different solutions,
and
. The first solution
corresponds to a negative modal point, and the second solution
corresponds to a positive modal point. Thus, the random variable
Z is bimodal, with two distinct modes at
and
. This property is useful in modeling real-life situations that exhibit two distinct peaks in their data distribution.
Proposition 3. Suppose , then the random variable Z is bimodal for .
Proof. Differentiating Equation (
2) with respect to
z and equating to zero implies
where
and
. Therefore,
and
are distinct modal points. Hence, the random variable
Z is bimodal. □
2.3. Stochastic Representation of the Random Variable
Proposition 4. Suppose Z∼ with . Then Z can be represented as , where S and Y are dependent random variables with and .
Proof. Let S and Y be defined as in the statement of the proposition. Using the joint distribution of and the Jacobian method, the marginal distribution of Z is obtained as follows:
If
, then
and
. Therefore, we have
On the other hand, if
, then
and
,
Therefore, from (
4) and (
5), we obtain
.
□
This proof shows that a random variable Z that follows a BSf distribution with location parameter can be represented as a combination of two dependent random variables S and Y. The variable Y has a density function that is twice the absolute value of the density function of Z for positive values of Z and is zero for negative values of Z. The variable S takes the value of 1 with the probability given by the value of the cumulative distribution function of G evaluated at , and the value of with the complement of this probability.
This representation is useful because it provides a way to generate random samples from the BSf distribution using the joint distribution of S and Y. Additionally, it allows for the computation of various statistics and moments of the distribution using the properties of S and Y.
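The sampling idea can be sketched as follows for a generic density of the form 2 f(z) G(w(z)) with odd w. The sketch assumes that Y is the absolute value of a draw from f (taken here to be standard normal) and that, given Y = y, the sign S equals 1 with probability G(w(y)) and -1 otherwise. The function name `sample_skew_symmetric`, the linear w, and the value `lam = 2.0` are hypothetical; the linear w reproduces the classical skew-normal, whose mean is known in closed form, so the output can be checked.

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF, used here as the symmetric CDF G."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_skew_symmetric(n, w, G=norm_cdf, rng=None):
    """Draw n values of Z = S * Y.

    Y = |X| with X standard normal, so Y has density 2*phi(y) on y > 0.
    Given Y = y, S = +1 with probability G(w(y)) and S = -1 otherwise.
    """
    rng = rng or random.Random()
    draws = []
    for _ in range(n):
        y = abs(rng.gauss(0.0, 1.0))
        s = 1 if rng.random() < G(w(y)) else -1
        draws.append(s * y)
    return draws

# Illustrative check with w(y) = lam * y (classical skew-normal; an assumption).
lam = 2.0
draws = sample_skew_symmetric(100_000, lambda y: lam * y, rng=random.Random(12345))
sample_mean = sum(draws) / len(draws)
exact_mean = lam / math.sqrt(1.0 + lam ** 2) * math.sqrt(2.0 / math.pi)
print(f"sample mean = {sample_mean:.4f}, skew-normal mean = {exact_mean:.4f}")
```

Because S depends on Y only through G(w(Y)), swapping in a different odd w changes a single argument and leaves the sampler otherwise unchanged.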
2.4. Calculation of Moments for the Distribution
The random variable Z can be represented as a combination of two dependent random variables S and Y, as shown in Proposition 4. In this section, we derive a formula for computing the r-th moment of a random variable X that follows the distribution, where and , with .
Proposition 5. The r-th moment of X is given by
where is given by
and is the random variable in the stochastic representation of Z as given in Proposition 4.
Proof. By utilizing the stochastic representation provided in Proposition 4 and applying the properties of conditional expectation, we can derive the required expression.
The above leads to the conclusion that if k is even, then . On the other hand, if k is odd, then . To obtain , it is possible to apply the binomial theorem along with the basic properties of the expectation. □
The mean and variance of a random variable X with BSf distribution can be easily calculated using the following corollary:
Corollary 1. Suppose and . Then, the mean and variance of X are given by
where and for .
This result provides a straightforward way to compute the expected value and variance of a BSf-distributed random variable X, where , , , and G are parameters of the distribution. The integrals and can be numerically evaluated, making the calculation of and feasible in practice.
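The numerical evaluation mentioned above can be illustrated with a small quadrature sketch. It assumes a location-scale variable X = loc + scale * Z, where Z has a density of the form 2 f(z) G(w(z)) with standard normal f and G. The names `loc`, `scale`, `raw_moment_z`, and `mean_and_variance` are hypothetical, and the linear w is used only because the skew-normal mean and variance are known in closed form and give a reference value.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simpson(func, a, b, n=6000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def raw_moment_z(r, w, bound=12.0):
    """r-th raw moment of Z with density 2*phi(z)*Phi(w(z)), by quadrature."""
    integrand = lambda z: z ** r * 2.0 * norm_pdf(z) * norm_cdf(w(z))
    return simpson(integrand, -bound, bound)

def mean_and_variance(loc, scale, w):
    """Mean and variance of X = loc + scale * Z from the first two moments of Z."""
    m1 = raw_moment_z(1, w)
    m2 = raw_moment_z(2, w)
    return loc + scale * m1, scale ** 2 * (m2 - m1 ** 2)

# Reference case: w(z) = lam * z gives the skew-normal (illustrative assumption).
lam = 2.0
mean, var = mean_and_variance(1.0, 3.0, lambda z: lam * z)
delta = lam / math.sqrt(1.0 + lam ** 2)
exact_mean = 1.0 + 3.0 * delta * math.sqrt(2.0 / math.pi)
exact_var = 9.0 * (1.0 - 2.0 * delta ** 2 / math.pi)
print(f"mean: {mean:.6f} (closed form {exact_mean:.6f})")
print(f"variance: {var:.6f} (closed form {exact_var:.6f})")
```

For any odd w and symmetric G, the even-order moments of Z coincide with those of f, so only the odd-order moments depend on the skewing function.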
2.5. Observed Information Matrix for the Location–Scale BSf Distribution
Proposition 6 states that if is a random sample from a distribution with a continuous and differentiable symmetric univariate probability density function f and cumulative distribution function G, where , then the solution to the score equations is , , and , and the observed information matrix is non-singular when and and are continuous functions.
Proposition 6. Let be a realization of the random sample , where are independent and identically distributed random variables following a distribution. Assume that f and G are continuous and a differentiable symmetric univariate probability density function and cumulative distribution function, respectively, with .
- (i)
The solution to the score equations is , , and .
- (ii)
When , the observed information matrix is non-singular.
Proof. - (i)
Let
be the log-likelihood function. Assuming
exists, and denoting
, the first-order partial derivatives of the log-likelihood are as follows:
where
and
, which is defined to be
. Note that the log-likelihood function
depends on the parameter
. Therefore, the partial derivative
measures the sensitivity of the log-likelihood with respect to changes in
.
The score equations for the family
are given by
where
and
.
Solving these equations yields , , and for any solution. If , then the score equations require . In this case, we have and . Thus, , , and are a solution to the score equations of the family , regardless of the choice of G.
We observe that the condition
and
are a solution to the score equations only if we can select a density
f such that
. Therefore, we conclude that the estimators of the family
for
and
coincide with the class
studied by Pewsey [
6].
- (ii)
Assuming that
and
exist, we can obtain the second-order partial derivatives of the log-likelihood by defining
and
. With these definitions, the partial derivatives can be computed as follows:
From the score equations, we can see that
and
. Moreover, if there exists a solution to these equations such that
, then we have
,
,
, and
for any solution.
Note that many symmetric densities around zero are differentiable at this point, including popular ones such as the normal, logistic, and Student’s t densities. This means that for these distributions, we have . However, there are exceptions to this rule, such as the double exponential density, which is not differentiable at zero.
When we set and , we can calculate that and . We can then define standardized scores as , which have a mean of zero and a variance of one: and .
Using these standardized scores, we can express the first derivative of as . This gives us , , , and .
Conversely, we can find the second derivative of by using the formula . We can calculate that and . Additionally, we have .
The second-order partial derivatives for this solution are given by
where
is the mean absolute deviation and
is the kurtosis. This leads to the observed information matrix:
which is always non-singular, except when
g is not differentiable at the origin. This result ensures the regularity conditions necessary to obtain the asymptotic distribution of the MLE for
. It should be noted that this condition was not met with the distribution
studied by Azzalini and others. □
Remark 1. The functions discussed in Proposition 6 are crucial for addressing the singularity issue in statistical models. They are designed to ensure the non-singularity of the observed information matrix, which is essential for accurate parameter estimation and model performance.
The function introduced in this research paper, denoted as , is defined as for and for . This function is equivalent to , where denotes the sign function.
The significance of this function lies in its ability to introduce bimodality into the model while avoiding singularity issues at . By incorporating bimodality, the model can more accurately represent complex data distributions, particularly in fields like medical research.
The proposed w function ensures that the model satisfies the regularity conditions required for deriving the asymptotic distribution of the estimator of the parameter vector. It overcomes the singularity that arises when the asymmetry parameter equals zero, a significant limitation of previous models.
These functions provide a robust solution to the singularity problem by maintaining the non-singularity of the observed information matrix except when g is not differentiable at the origin.
By resolving the singularity issue, these functions enhance the model’s reliability and accuracy in estimating parameters, making it a valuable tool for analyzing complex datasets, such as the protein data from cancer cells studied in this research paper.
Overall, these functions not only address the singularity problem but also contribute to the model’s capability to handle bimodal data effectively, showcasing their significance in statistical modeling and data analysis.
4. Practical Data Analysis
In this section, we illustrate the modeling capabilities of the BSN distribution by fitting it to 118 observations of the Homo sapiens PIG7 data in Çankaya [
13] using the MLE method. To ensure computational stability, we scaled the data by
before fitting the distributions. The data are left-skewed, with a Pearson’s moment coefficient of skewness of
and appear to be bimodal, as seen in the empirical density plot in
Figure 3a. Some of the descriptive statistics of the data include a minimum value of
, a maximum value of
, a mean value of
, a variance value of
, a median value of
, a first quartile value of
, and a third quartile value of
. The resulting MLE of
is
, with a corresponding SE of
. To assess the goodness-of-fit of the BSN distribution to the empirical data, we employed the Kolmogorov–Smirnov (K-S) test, with a test statistic defined as
, where
is the
ith data value. For large sample size
n, the
p-value of the K-S test is given by
, where
is the estimated CDF of the theoretical distribution, see Kolmogorov [
14] and Smirnov [
15]. The K-S test measures the disparity between the empirical and estimated cumulative distribution functions (CDFs), with a smaller difference indicating a better fit. In general, if the
p-value of the K-S test is greater than
0.05, we conclude that the model provides a good fit for the data. The fitted BSN distribution gives a K-S statistic of
with a
p-value of 0.1337 (>0.05). Therefore, based on this evidence and visual inspection through the plot of the CDFs in
Figure 3b, we conclude that the one-parameter BSN distribution provides a good fit for the data.
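For readers who want to reproduce this kind of check, the sketch below computes a one-sample K-S statistic and a large-sample p-value. It uses the standard textbook forms, namely the largest gap between the empirical CDF and the fitted CDF at the ordered data points, and the Kolmogorov series 2 Σ (-1)^(k-1) exp(-2 k² n D²); these are assumed to match the expressions intended in the text. The function names and the toy inputs are hypothetical, and the PIG7 data are not reproduced here.

```python
import math

def ks_statistic(data, cdf):
    """One-sample K-S statistic: sup |F_n(x) - F(x)| over the ordered sample."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at the i-th order statistic.
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d

def ks_asymptotic_pvalue(d, n, terms=100):
    """Large-sample p-value from the Kolmogorov series."""
    t = math.sqrt(n) * d
    if t < 0.2:
        # The alternating series converges slowly near zero; the p-value is
        # equal to 1 up to roughly 1e-12 in this range.
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))

# Toy example (not the paper's data): three points against a Uniform(0, 1) CDF.
toy_d = ks_statistic([0.4, 0.1, 0.7], lambda x: x)
# Sanity check: sqrt(n) * D = 1.36 is the familiar 5% critical value.
p_at_critical = ks_asymptotic_pvalue(0.136, 100)
print(f"toy D = {toy_d:.4f}, p-value at sqrt(n)*D = 1.36: {p_at_critical:.4f}")
```

Note that when the parameters of the fitted CDF are estimated from the same data, the asymptotic p-value is only approximate and tends to be conservative.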
We compare the fit of the BSN distribution with that of four other distributions, namely, the normal distribution, double Lindley distribution [
16], Laplace distribution, and Student’s
t-distribution. The PDFs of these distributions are as follows:
- (i)
Normal distribution with PDF given by
where
is the standard normal PDF.
- (ii)
Double Lindley distribution with PDF given by
- (iii)
Laplace distribution with PDF given by
- (iv)
Student’s
t-distribution with PDF given by
where
is the beta function.
We used the K-S test to compare the goodness-of-fit of these distributions with that of the BSN distribution. The K-S test statistic measures the maximum distance between the empirical CDF of the data and the CDF of the fitted distribution, with a smaller test statistic indicating a better fit. The p-values of the K-S tests for each distribution were computed, and if the p-value was larger than 0.05, we concluded that the distribution provided a good fit for the data.
To ensure a fair comparison between the fits of the normal distribution, Laplace distribution, and BSN distribution, it is important to center the BSN distribution about the mean (), as both the normal and Laplace distributions are centered around the mean. To accomplish this, we introduce an additional parameter to the BSN distribution, resulting in a centered BSN distribution with PDF for . To determine the best-fitting model for the data, we use the information criteria listed below along with the K-S test:
- (i)
Akaike information criterion (AIC), given by AIC = $-2\ell(\hat{\theta}) + 2k$.
- (ii)
Bayesian information criterion (BIC), given by BIC = $-2\ell(\hat{\theta}) + k \ln n$.
- (iii)
Corrected AIC (AICc), given by AICc = AIC + $\frac{2k(k+1)}{n-k-1}$.
Here, $\ell(\hat{\theta})$ denotes the log-likelihood evaluated at the estimate $\hat{\theta}$ of the unknown parameter $\theta$, n represents the number of data points, and k indicates the number of parameters in the model. A smaller value of the information criterion indicates a better fit.
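The three criteria can be computed with a few lines of code. The sketch below uses the standard definitions AIC = -2ℓ + 2k, BIC = -2ℓ + k ln n, and AICc = AIC + 2k(k+1)/(n-k-1). The function name `information_criteria` is hypothetical, and the log-likelihood value of -150.0 is a made-up input for illustration, not a result from this study; only n = 118 is taken from the text.

```python
import math

def information_criteria(loglik, k, n):
    """Return AIC, BIC, and AICc for a fitted model.

    loglik: maximized log-likelihood
    k:      number of estimated parameters
    n:      number of observations (AICc requires n > k + 1)
    """
    if n <= k + 1:
        raise ValueError("AICc requires n > k + 1")
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    aicc = aic + 2.0 * k * (k + 1) / (n - k - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}

# Hypothetical inputs: a two-parameter model fitted to n = 118 observations.
example = information_criteria(loglik=-150.0, k=2, n=118)
for name, value in example.items():
    print(f"{name} = {value:.4f}")
```

With n = 118, ln n is about 4.77, so BIC penalizes each additional parameter more heavily than AIC, while the AICc correction is small because n is large relative to k.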
Table 2 and
Table 3 present the results of the fitted distributions. Based on these tables, we can observe that the centered BSN distribution outperformed all other considered distributions, with the smallest K-S statistic, largest
p-value of K-S, and smallest values of AIC, BIC, and AICc. This is also evident from
Figure 3c, where we can see that the CDF of the estimated centered BSN distribution closely mimics the empirical CDF.
In
Table 4, descriptive statistics obtained from the empirical distribution and the estimated centered BSN distribution are compared. From the results, we can conclude that the fitted centered BSN distribution accurately captured the important features of the empirical distribution, as the first three moments and the standard deviation (std) of the estimated centered BSN distribution are similar to those of the empirical distribution. It is noteworthy that the direction of skewness is the same for both distributions. However, a slight difference in skewness values is observed, which may be due to rounding errors in the numerical integration of the
k-th-order moments. The
code used to compute the descriptive statistics is provided in
Appendix A.
5. Concluding Remarks
In this study, we introduced a new family of continuous distributions known as bimodal skew-symmetric distributions. The BSN distribution, which is essential to this family, is distinguished by the single parameter that causes its asymmetry. The statistical properties of this distribution have been thoroughly discussed, emphasizing its flexibility and applicability.
Utilizing the MLE method, we estimated this sole asymmetry parameter, demonstrating the practicality and effectiveness of the BSN model when applied to real-world data. The analysis highlights the BSN distribution’s capability to adeptly model data features, such as skewness and bimodality, which are often encountered in practical datasets but are challenging to address with more traditional models.
To compare the BSN distribution fairly with conventional two-parameter models centered about the mean, such as the normal and Laplace distributions, we extended it with an additional centering parameter. This adjustment makes the BSN distribution directly comparable to these models and provides a fair basis for performance evaluation.
The results from this study are promising: for the dataset considered, the two-parameter (centered) BSN distribution outperforms the four competing distributions, with the smallest K-S statistic, the largest K-S p-value, and the smallest AIC, BIC, and AICc values. This performance underscores the potential of the BSN distribution as a versatile tool in statistical modeling, particularly for real-world data that exhibit asymmetry and bimodality.
This study applies bimodal skew-symmetric distributions to the analysis of cancer cell protein data, addressing the bimodality and asymmetry of such data. By using a w function that keeps the observed information matrix non-singular and accommodates two modes, the proposed model offers a flexible statistical representation and supports reliable parameter estimation. It thereby provides a useful tool for describing subpopulations and characterizing protein profiles in cancer cells. These contributions may also have practical implications for biomarker discovery and proteomics analysis.
The implications of these findings are significant, suggesting that the BSN distribution can serve as an alternative to traditional models, offering enhanced flexibility and better fit for specific types of data. Future studies will focus on further developing this model, improving its statistical inference procedures, and extending its application to a broader range of datasets.