1. Introduction
The modeling and analysis of real-life data are essential to understand important features of random phenomena and to draw suitable conclusions. In particular, this requires the choice of statistical models based on probability distributions, whose adequacy with respect to the observations strongly influences the relevance of the outputs. The analysis of recent data in applied sciences (environmental sciences, engineering, finance, etc.) has shown the limitations of the classical distributions, whose lack of flexibility does not allow revealing some important details. To overcome these limitations, new distributions, often grouped into specific families of distributions, have been created. A short list of the notable families is the following: the skew-normal family (see [1]), Marshall–Olkin-G family (see [2]), exponentiated-G family (see [3]), beta-G family (see [4]), order statistics-G family (see [5]), sinh-arcsinh-G family (see [6]), transmuted-G (see [7]), gamma-G family (see [8]), Kumaraswamy-G (see [9]), Topp–Leone-G (see [10]), and ratio-exponentiated-G (see [11]). The global motivation behind them is to extend the modeling properties of a classical baseline distribution by adding one or more tuning parameters through the use of various flexible transformations (power, beta, gamma, ratio, etc.).
Among all the proposed families, the M family of continuous distributions introduced by [12] stands out from the others due to its original construction; it is defined by a cumulative distribution function (cdf) based on a ratio involving two baseline cdfs, with possibly different characteristics. More specifically, the corresponding cdf is defined by:
where
and
are two cdfs of continuous distributions with sets of parameters represented by
and
, respectively. These two baseline cdfs can be chosen independently of each other, without a particular condition. However, for practical purposes, in order to avoid the over-parametrization phenomenon, it is recommended not to have too many parameters involved; one can reduce
and
to a unique parameter, or take
and
as two different parameters, or
can be chosen as a subset of parameters of
, or vice versa. Clearly, the M family contains a plethora of ratio distributions and models, since a multitude of choices for
and
is possible. However, to the best of our knowledge, this versatile aspect has not been fully explored yet. Indeed, in the former work of [
12],
was presented as (
1), with the proof that it satisfied the properties of a valid cdf. Then, as a direct application, a new two-parameter lifetime distribution was defined by the cdf (
1) under the following simple configuration:
, and
was chosen as the cdf of the Weibull distribution, i.e.,
,
. Thus, the corresponding cdf is given by:
As a main application, it was shown that the related model fit the failure times of the air conditioning system data from [13] better than the exponentiated exponential, Weibull, and gamma models. This result justified adding the M family to the short list above. However, for the special configuration
, the M family loses its intrinsic originality for the following reasons: (i) it does not mix different features for the baseline cdfs; (ii) it is included in the well-known Marshall–Olkin family since we can express
as:
with
. That is, the general form of
is exploited at its minimum; the M family has not revealed all of its potential.
Based on the previous setting, the contributions of the paper can be summarized as follows: (i) We introduce a simple and natural extension of the M family by the use of the power transform, called the EM family. (ii) We provide some mathematical results on this family, which are also new and applicable to the former M family. (iii) We consider two baseline cdfs of different natures, i.e., exponential and inverse exponential, to create a new promising three-parameter lifetime distribution, which demonstrates a high modeling ability for data fitting; versatile shapes are observed for the main functions. (iv) We investigate the estimation of the model parameters by the maximum likelihood method, which is well established for its efficiency. (v) We apply this model to an actual dataset of COVID-19 cases observed in Pakistan during the year 2020. As a main result, for these data of particular interest, the proposed model fits better than 15 other competing models from the literature, which supports the importance of these findings.
The remainder of the paper is organized as follows. In Section 2, the EM family is introduced, and some of its mathematical results are proven. A special distribution of interest is presented in Section 3, with discussions. The estimation of the related model parameters is studied in Section 4. The application to a COVID-19 dataset is presented in Section 5. The conclusion is given in Section 6.
4. Parameter Estimation
Here, we derive the maximum likelihood estimates (MLEs) of the EMIEE model parameters and conduct a simulation study to illustrate their practical interest. We recall that, under standard regularity conditions, the MLEs have the following desirable properties: they are (i) asymptotically efficient, (ii) consistent, (iii) asymptotically normal, and (iv) easy to handle in practice. For these, we refer the reader to [23]. The mathematical basis of this method in the setting of the EMIEE distribution is given below.
Let
be
n independent realizations of a random variable following the EMIEE distribution with parameters
,
, and
. Then, the MLEs of
,
, and
are defined as the “argmax of the likelihood function with respect to
,
, and
”. Thus, by denoting them as
,
, and
, they are defined by:
where, based on (
8),
denotes the likelihood function given as:
To avoid the complicated product form of the likelihood function, one can also define the MLEs as
, where
denotes the log-likelihood function given by:
Thus,
,
, and
satisfy
,
, and
, whose extended forms are the following:
and:
From this last equation, one can express
according to
and
as:
(and plugging this expression into the first two equations, now depending on
and
only). Since the above equations are complicated to solve analytically, the MLEs have no closed-form expression. However, they can be approximated by standard optimization algorithms, such as the Newton–Raphson or quasi-Newton Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithms, using statistical software. By expressing the second partial derivatives of the log-likelihood function with respect to
,
, and
, we can determine the observed Fisher information matrix, allowing us to obtain the asymptotic variances, covariances, and standard errors (SEs) of the ML estimators for
,
, and
, among others.
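The strategy described above (express one parameter through the others, then solve the remaining likelihood equations numerically) can be sketched on a simpler model. The following Python code is an illustration only: it uses the two-parameter Weibull distribution as a stand-in, not the EMIEE distribution, and the function name `weibull_profile_mle`, the seed, and the simulated sample are hypothetical choices. The scale is eliminated in closed form, and the remaining equation in the shape is solved by Newton–Raphson.

```python
import math
import random


def weibull_profile_mle(xs, k0=1.0, tol=1e-10, max_iter=100):
    """MLE of a Weibull(shape k, scale lam) sample (stand-in model).

    The score equation in lam gives lam**k = mean(x**k). Plugging this into
    the score equation in k leaves one equation g(k) = 0, solved here by
    Newton-Raphson.
    """
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n
    k = k0
    for _ in range(max_iter):
        powers = [x ** k for x in xs]
        s0 = sum(powers)
        s1 = sum(p * l for p, l in zip(powers, logs))
        s2 = sum(p * l * l for p, l in zip(powers, logs))
        score = s1 / s0 - 1.0 / k - mean_log                      # g(k)
        slope = (s2 * s0 - s1 * s1) / (s0 * s0) + 1.0 / (k * k)   # g'(k) > 0
        step = score / slope
        while k - step <= 0.0:          # keep the shape estimate positive
            step /= 2.0
        k -= step
        if abs(step) < tol:
            break
    lam = (sum(x ** k for x in xs) / n) ** (1.0 / k)
    return k, lam


# Hypothetical check on simulated data: true scale 2.0, true shape 1.5.
rng = random.Random(2020)
sample = [rng.weibullvariate(2.0, 1.5) for _ in range(5000)]
k_hat, lam_hat = weibull_profile_mle(sample)
print(f"shape estimate: {k_hat:.3f}, scale estimate: {lam_hat:.3f}")
```

For a three-parameter model such as the EMIEE model, the same elimination step would reduce the problem from three unknowns to two, which is then typically handed to a quasi-Newton routine such as BFGS.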
Now, let us illustrate the practical behavior of the MLEs by a simulation study using the R software (see [24]) with the BFGS algorithm. We generated
N = 10,000 replications of samples
with
from a random variable following the EMIEE distribution defined with the six following sets of parameters, in turn:
(
),
(
),
(
),
(
),
(
), and
(
). Then, for each of these sets, we determined the average MLEs of the parameters defined as, for
,
where
denotes the MLE of
obtained at the
kth replication, and the corresponding empirical mean squared errors (MSEs) defined as, for
and
,
The results in Table 5 and Table 6 show the numerical efficiency of the maximum likelihood method for the EMIEE model. Indeed, the MLEs are relatively close to the true values of the parameters and, overall, the MSEs decrease as n increases. This “numerical convergence” illustrates the well-known theoretical convergence properties of the MLEs.
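The structure of such a simulation study (replicate, estimate, then average the estimates and their squared errors) can be sketched as follows. This is a minimal Python illustration under loud assumptions: the exponential distribution is used as a stand-in because its MLE is available in closed form, the number of replications is reduced to 1000 for speed, and the function name `simulate`, the rate value, the seed, and the sample sizes are hypothetical. It does not reproduce the EMIEE results of the paper.

```python
import random
import statistics


def simulate(rate, n, n_rep, seed):
    """Average MLE and empirical MSE of the exponential rate (stand-in model)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        sample = [rng.expovariate(rate) for _ in range(n)]
        estimates.append(1.0 / statistics.fmean(sample))   # closed-form MLE
    average_mle = statistics.fmean(estimates)
    mse = statistics.fmean([(e - rate) ** 2 for e in estimates])
    return average_mle, mse


true_rate = 2.0                      # hypothetical true parameter value
results = {n: simulate(true_rate, n, n_rep=1000, seed=1) for n in (50, 200, 800)}
for n, (average_mle, mse) in results.items():
    print(f"n = {n:4d}  average MLE = {average_mle:.4f}  MSE = {mse:.6f}")
```

In this stand-in setting, the empirical MSE shrinks as the sample size grows, which is the same qualitative behavior that the tables report for the EMIEE model.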
5. Application to a COVID-19 Dataset
Here, we propose a concrete application with an actual dataset to assess the usefulness of the EMIEE model. The considered data, called the COVID-19 dataset, are presented below.
COVID-19, which can be renamed as “the flu of 2020”, is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Sadly, it spread quickly at the beginning of the year 2020, claiming thousands of victims and obliging governments to take exceptional measures to protect their people. The situation of this pandemic as of 29 May 2020 can be found in [25,26,27]. Naturally, the overall comprehension of COVID-19 is a challenge for all scientists, but necessary for the sake of future generations. In this section, we modestly contribute to the subject by applying the EMIEE model to fit data of daily new COVID-19 confirmed cases in Pakistan from 21 March to 29 May 2020 (inclusive), showing that it is very efficient in this regard. We thus assume that the new COVID-19 (confirmed) cases in Pakistan can be modeled by a continuous variable (since a discrete variable with a wide range of values can be considered as such) and provide a new statistical model that can be relevant for the following points:
- (i) Provide a precise estimation for some measures of interest related to COVID-19 cases in Pakistan (mean of cases, probability to have a certain number of cases, and so on),
- (ii) Compare the distribution of the number of COVID-19 cases in Pakistan with those in other countries,
- (iii) Propose an efficient strategy for fitting data on COVID-19 cases in other countries,
- (iv) In a more challenging way, model the distribution of the number of cases for any pandemic with similar features and under a similar environment (with comparable populations, comparable climate, sanitary system, etc.).
The dataset was obtained from the following electronic address:
http://covid.gov.pk/stats/pakistan. It is given as follows: {112, 157, 89, 108, 102, 133, 170, 121, 99, 236, 178, 250, 161, 258, 172, 407, 577, 210, 243, 281, 186, 254, 336, 342, 269, 543, 488, 463, 514, 427, 796, 555, 742, 642, 785, 783, 605, 751, 806, 942, 990, 1297, 989, 1083, 1315, 1049, 1523, 1764, 1637, 1991, 1476, 1140, 2255, 1452, 1430, 1581, 1352, 1974, 1841, 1932, 2193, 2603, 1743, 2164, 1748, 1356, 1446, 2241, 2636, 2429} corresponding to the dates {21 March 2020, 22 March 2020, …, 29 May 2020}, respectively.
To identify the possible shape of the unknown hazard rate function (hrf) underlying these data, we display the total time on test (TTT) plot in Figure 3 (see [28] for further details on the use of TTT plots in data analysis).
In Figure 3, since the red line is first convex and then concave, the unknown hrf probably has a bathtub shape. Therefore, the EMIEE distribution, whose hrf can be bathtub-shaped, is a suitable candidate to fit the data.
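For readers who wish to reproduce the curve, the scaled TTT transform can be computed directly from the data listed above. The following Python sketch uses the standard empirical definition, namely the sum of the r smallest observations plus (n - r) times the r-th smallest one, divided by the total sum; the function name `scaled_ttt` is a hypothetical choice, and plotting is left out.

```python
# Daily new confirmed COVID-19 cases in Pakistan, 21 March to 29 May 2020.
cases = [
    112, 157, 89, 108, 102, 133, 170, 121, 99, 236, 178, 250, 161, 258,
    172, 407, 577, 210, 243, 281, 186, 254, 336, 342, 269, 543, 488, 463,
    514, 427, 796, 555, 742, 642, 785, 783, 605, 751, 806, 942, 990, 1297,
    989, 1083, 1315, 1049, 1523, 1764, 1637, 1991, 1476, 1140, 2255, 1452,
    1430, 1581, 1352, 1974, 1841, 1932, 2193, 2603, 1743, 2164, 1748, 1356,
    1446, 2241, 2636, 2429,
]


def scaled_ttt(xs):
    """Return the points (r/n, T(r/n)) of the empirical scaled TTT transform."""
    ordered = sorted(xs)
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for r, x in enumerate(ordered, start=1):
        running += x                                   # sum of the r smallest values
        points.append((r / n, (running + (n - r) * x) / total))
    return points


ttt_points = scaled_ttt(cases)
for u, t in ttt_points[::10]:
    print(f"r/n = {u:.3f}  T = {t:.3f}")
```

Plotting these points against the diagonal gives the TTT plot: a concave curve above the diagonal suggests an increasing hrf, a convex curve below it suggests a decreasing hrf, and a convex-then-concave curve suggests a bathtub shape.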
Next, we compare the fit of the EMIEE model with that of 15 competing models from the literature: (i) the Weibull-exponential (WE) model by [29], (ii) the Lomax-exponential (LE) model by [30], (iii) the gamma-exponentiated exponential (GaE) model by [31], (iv) the beta Weibull (BW) model by [32], (v) the Kumaraswamy exponential (KE) model by [33], (vi) the Burr X-exponential (BXE) model by [34], (vii) the exponentiated exponential (EE) model by [35], (viii) the CS transformation of exponential (CE) model by [36], (ix) the standard exponential (E) model (see [37], among others), (x) the alpha-power inverse Weibull (AIW) model by [38], (xi) the Gompertz inverse exponential (GomIE) model by [39], (xii) the Weibull-inverse exponential (WIE) model by [40], (xiii) the inverse Weibull-inverse exponential (IWIE) model by [41], (xiv) the inverse exponential (IE) model by [20], and, last but not least, (xv) the “unexponentiated” version of the proposed EMIEE model, i.e., the MIEE model. We refer to the above references for the precise definitions of the related cdfs and pdfs, along with the Greek letters used for the parameters.
Then, the model parameters were estimated by the maximum likelihood method (with the BFGS algorithm) using the R software. The MLEs and SEs of all the model parameters are provided in Table 7.
Among the information provided by
Table 7, the parameters of the EMIEE model, i.e.,
,
, and
, are estimated as:
Therefore, based on (
8), the corresponding estimated pdf is given by:
Thus, the estimated pdf approximates the unobservable underlying pdf of the number of daily COVID-19 cases in Pakistan, and it can be used to estimate quantities of interest. Some basic ones are presented below. Let X denote the random variable modeling the daily COVID-19 confirmed cases in Pakistan during the epidemic. The probability that X belongs to a chosen interval can be estimated by the integral of the estimated pdf over this interval. For instance, the probability that the number of COVID-19 cases in Pakistan is less than a certain value c is estimated by the integral of the estimated pdf from 0 to c. More generally, the mean of a certain transformation of X can be estimated by the integral of this transformation multiplied by the estimated pdf over the support of the distribution. For instance, the average number of COVID-19 cases in Pakistan is obtained by taking the identity transformation, and so on.
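The plug-in estimates described above reduce to one-dimensional integrals of a fitted pdf, which can be evaluated numerically. The Python sketch below illustrates the mechanics only: the fitted pdf is replaced by an exponential pdf with a hypothetical rate (an assumption, not the estimated EMIEE pdf), the interval bounds and the value of c are arbitrary, and the names `simpson` and `pdf_hat` are hypothetical. The exponential stand-in is convenient because the closed-form answers are known and can be compared with the numerical ones.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule for the integral of f over [a, b]."""
    if n % 2:
        n += 1                                  # Simpson needs an even number of intervals
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


rate = 1.0 / 1000.0                             # hypothetical fitted parameter


def pdf_hat(x):
    """Stand-in for an estimated pdf (exponential, NOT the EMIEE pdf)."""
    return rate * math.exp(-rate * x)


# Probability that X falls in a chosen interval, here [500, 1500].
prob_interval = simpson(pdf_hat, 500.0, 1500.0)
# Probability that X is less than c = 1000.
prob_below_c = simpson(pdf_hat, 0.0, 1000.0)
# Mean of X: identity transformation, integral truncated far in the tail.
mean_hat = simpson(lambda x: x * pdf_hat(x), 0.0, 40.0 / rate, n=20000)

print(f"P(500 < X < 1500) = {prob_interval:.6f}")
print(f"P(X < 1000)       = {prob_below_c:.6f}")
print(f"E(X)              = {mean_hat:.3f}")
```

With any other fitted pdf, only `pdf_hat` and the truncation point of the last integral would need to change.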
As planned, the models were compared in terms of fit. The best model was selected using the following statistical measures: minus complete log-likelihood function (
), Akaike information criterion (AIC), Bayesian information criterion (BIC), Cramér–von Mises (W) criterion, and Anderson–Darling (A) criterion. Furthermore, we considered the value of the Kolmogorov–Smirnov (KS) statistic and its
p-value. The best model was the one having the smallest
, AIC, BIC, W, A, and KS and the largest KS
p-value. For the considered data, the obtained values are shown in
Table 8.
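To make these criteria concrete, the following Python sketch computes the minus log-likelihood, AIC, BIC, and KS statistic for the simplest competitor in the list, the standard exponential (E) model, whose MLE is available in closed form. It is an illustration of the definitions only: the function name `exponential_fit_criteria` is hypothetical, the W, A, and KS p-value computations are omitted, and the printed values are not claimed to match those reported in the tables, which were obtained with the R software.

```python
import math

# Daily new confirmed COVID-19 cases in Pakistan, 21 March to 29 May 2020.
cases = [
    112, 157, 89, 108, 102, 133, 170, 121, 99, 236, 178, 250, 161, 258,
    172, 407, 577, 210, 243, 281, 186, 254, 336, 342, 269, 543, 488, 463,
    514, 427, 796, 555, 742, 642, 785, 783, 605, 751, 806, 942, 990, 1297,
    989, 1083, 1315, 1049, 1523, 1764, 1637, 1991, 1476, 1140, 2255, 1452,
    1430, 1581, 1352, 1974, 1841, 1932, 2193, 2603, 1743, 2164, 1748, 1356,
    1446, 2241, 2636, 2429,
]


def exponential_fit_criteria(xs):
    """Fit the exponential model by maximum likelihood and return fit criteria."""
    n = len(xs)
    rate = n / sum(xs)                                   # closed-form MLE
    neg_loglik = -(n * math.log(rate) - rate * sum(xs))  # minus log-likelihood at the MLE
    k = 1                                                # number of estimated parameters
    aic = 2 * k + 2 * neg_loglik
    bic = k * math.log(n) + 2 * neg_loglik
    # KS statistic: largest gap between the empirical cdf and the fitted cdf.
    ks = 0.0
    for i, x in enumerate(sorted(xs), start=1):
        cdf = 1.0 - math.exp(-rate * x)
        ks = max(ks, i / n - cdf, cdf - (i - 1) / n)
    return {"rate": rate, "neg_loglik": neg_loglik, "aic": aic, "bic": bic, "ks": ks}


criteria = exponential_fit_criteria(cases)
for name, value in criteria.items():
    print(f"{name:>10s} = {value:.6g}")
```

For a model with more parameters, only the log-likelihood, the fitted cdf, and the value of k change; smaller values of each criterion indicate a better fit.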
From
Table 8, we see that the EMIEE model was the best among all the considered models, with the following numerical criteria:
, AIC
, BIC
, W
, A
, KS
, and KS
p-value
. One can notice that the EMIEE model outperformed the baseline E and IE models, as well as the MIEE model derived from the former M family, which supports the use of the exponentiated transform for fitting purposes.
Figure 4 shows the estimated pdf as described in (
10) over the histogram of the data.
Figure 5 presents the estimated cdf, i.e., based on (
7),
, over the empirical cdf of the data. The probability-probability (P-P) plot in
Figure 6 shows how closely the estimated and empirical cdfs agreed.
In all the graphics, the red curves closely follow the black data objects, supporting the relevance of the EMIEE model in the analysis of the COVID-19 dataset. We end this application by displaying the estimated hrf of the EMIEE model in Figure 7.
We see that the estimated hrf has a bathtub shape, which is consistent with what was inferred from Figure 3.