1. Introduction
There can be no doubt that the outbreak of the global pandemic caused by the SARS-CoV-2 virus has produced a strong shock worldwide, the repercussions of which are still far from being seen. Obviously, the main concern is to contain the spread as soon as possible, thus putting a stop to the interminable trickle of deaths left in its wake. The data provided in real time by Johns Hopkins University [
1] quantify the number of the infected, as of 22 July 2021, at around 192 million, while the deceased amount to more than 4.1 million. Fortunately, vaccination campaigns are beginning to bear fruit, although there is still a long way to go, especially in developing countries. To add to the problem, severe economic losses are being felt as a result of the pandemic. It is still too early to evaluate their full impact, although some studies are already looking into the matter (see for example [
2]).
Since the beginning of the pandemic, multiple efforts have been made to model its evolution and understand its behavior. Knowledge of applicable models can help, on the one hand, to make correct decisions to mitigate the spread of the virus and, on the other, to design actions against future health crises of the same nature. For this reason, the scientific community is making strenuous efforts to apply existing techniques and develop new ones capable of modeling and predicting the evolution of the epidemic.
In this regard, compartmental epidemiological models have been a main focus of attention. Among them, SIR models stand out, which are based on the assumption that the population can be classified into three independent compartmentalized groups (susceptible, infected, and recovered). The number and type of compartmentalized groups can be modified to better reflect the specific dynamics of the disease, giving way, for example, to SEIR models (susceptible, exposed, infected, and recovered). In both cases, the models represent how individuals can progress from one compartmentalized group to the next (see [
3] for more details about the compartmental models).
Although such models have been widely applied, the specifics of the present pandemic has led researchers to explore how certain modifications can be introduced in the models. For example, Gounane et al. [
4] developed a non-linear SIR model that includes the effect of social distancing imposed by governments around the world. Khan et al. [
5] introduced the so-called SQUIDER model in which new compartments are incorporated; specifically infected and undetected cases (U), undetected recovered cases (E), quarantined cases affected by social distancing (Q), and deaths from contagion (D). Furthermore, Ianni et al. [
6] introduced temporal dependence on the parameters of the classical SIR model, thus affecting the behavior of the basic reproductive number
over time. This is intended to take into account containment measures, such as lockdowns, social distancing, and limitations on commercial activities. The resulting models are called SIR-T and SIRD-T. While all the efforts mentioned above address compartmental models from a deterministic point of view, some research has looked into the development of stochastic models, including fractional versions of the same (see for instance [
7]).
Apart from these epidemiological models, several authors have considered other approaches. Among others we can cite the work of Maleki et al. [
8] in which they considered ARMA processes based on two-piece scale mixture normal distributions; Ünlu and Namb [
9] used machine learning techniques and addressed various epidemiological models from a Bayesian perspective. Special mention must be made of functional data techniques, which have been applied to several lines of research. For example, Ref. [
10] used principal component analysis to discover patterns of pandemic progression as well as to identify the countries that better represent these archetypes. Similarly, Acal et al. [
11] proposed a principal components multiple function-on-function regression model for the imputation of missing data in the COVID-19 hospitalized and intensive care curves of several Spanish regions.
The evolution dynamics of the current pandemic, as in previous ones, point at the importance of considering models based on growth curves, mainly those of the sigmoidal type. Some of them, such as the Richards curve, have already been used in previous situations of this nature (Hsi et al. [
12], Wang et al. [
13]). Concerning COVID-19 specifically, mention must be made of Català et al. [
14], where the authors employed the Gompertz curve; and of Li et al. [
15], a comparative study performed using the Richards, logistic, Von Bertalanffy, and Gompertz curves, among others. However, such curves present a fairly rigid behavior, so it is necessary to resort to more flexible models. In this sense, Tovissodé et al. [
16] have used the generic growth model introduced by Turner et al. [
17]. A fundamental aspect that must be taken into consideration is that practically all the studies in this line of research focus on modeling the first wave of the pandemic. This is due to these curves presenting a sigmoidal behavior with only one inflection point. Sadly, this does not allow for a more in-depth analysis of the evolution of the pandemic as a whole. Therefore, if more complex behaviors are to be modeled, it becomes necessary to consider multi-sigmoidal curves. Likewise, a large part of the studies have been carried out from a deterministic point of view, so they do not take into account possible random fluctuations as well as influences external to the system. The fact that growth curves have their origin, for the most part, in the solution of an ordinary differential equation naturally leads to the consideration of stochastic differential equations whose solutions are, under certain conditions, diffusion processes. Some works have recently been published which combine both aspects, that is, the multi-sigmoidal and the stochastic character of the model. Román-Román et al. [
18] considered a diffusion process associated with a plurisigmoidal Gompertz curve, while Di Crescenzo et al. [
19] proposed two growth and death processes and a diffusion process whose mean functions are logistic curves that present several inflection points.
In summary, the starting motivations of the present work can be summarized as:
The need for the broadest possible knowledge of the evolution dynamics of the pandemic, focusing on determining points of contagion peaks, inflections in its evolution, duration, and the succession of waves.
The need to consider models that reproduce the observed multi-sigmoidal behavior of the pandemic.
The need to consider stochastic models that allow the aspects mentioned above to be studied from a probabilistic perspective.
With these ideas in mind, the present paper is a proposal to employ diffusion processes, whose mean functions are multi-sigmoidal curves, in order to model the evolution of the pandemic through various waves by considering both the number of infected individuals and the number of deaths. Specifically, the Gompertz process introduced in [
18] is considered, as well as a version of the generalized Weibull-type process that was proposed by Barrera et al. [
20]. In both cases, the processes are obtained by including polynomial-type functions in the ordinary differential equations that give rise to the Gompertz and Weibull curves, respectively. Both processes can be viewed from a common perspective such as that provided by the inhomogeneous lognormal diffusion process. Stochastic diffusion processes provides advantages over other commonly-used models for the study of epidemiological phenomena, such as SIR models based on differential equations. The probabilistic nature of the model allows the researcher to evaluate the variability of the model estimates, considering, for example, confidence intervals for the estimates and predictions made. In addition, in the specific case of diffusion processes such as the ones considered in this paper, they can be used to study several temporary variables of notable interest. Specifically, in the present case the determination of the time instants in which the infection and death peaks are reached can be approached by calculating first-passage times, which provides a valuable probabilistic tool for its study and analysis.
Under these general considerations, this paper is structured as follows:
Section 2 is dedicated to introducing the multi-sigmoidal Gompertz and Weibull curves, while
Section 3 introduces the processes associated with each of the previous curves, based on the inhomogeneous lognormal diffusion process. These two sections cover both models in a unified way, which will help simplify the expressions to be derived later on.
Section 4 is dedicated to the estimation of the parameters of the models being considered, which is carried out by maximum likelihood. This section includes some strategies aimed at the selection of initial solutions for the numerical methods necessary for the estimation, and describes a strategy for the selection of the optimal polynomial in each case. Finally,
Section 5 uses both processes to model pandemic-related data originating from several Spanish regions during the first two waves of infection, and proposes a strategy for selecting the model that best fits the observed data. For the selected model, a study of the inflection times is considered, which determine the infection and deaths peaks in each wave. This analysis is carried out from the point estimation of said instants and from the study of the first-passage times of the process through the values where inflections occur. Finally, some conclusions are drawn.
2. Gompertz and Weibull Multi-Sigmoidal Curves
In this section, we introduce two multi-sigmoidal versions of the Gompertz and Weibull curves. Both are obtained through the inclusion of polynomial functions in the ordinary differential equations whose solutions are said curves. More in detail, let
be a polynomial of degree
, where
is a real parametric vector such that
. Let us consider
and define the multi-sigmoidal Gompertz function as
and the multi-sigmoidal Weibull functions as
where
and the parameters
are the asymptotic values of every curve, i.e., the carrying capacity of every model. Here we denote with
g and
w those expressions related to the Gompertz and Weibull models, respectively. Each of such functions can be viewed as the solution of the Malthusian-type linear differential equation
where the key function
is defined as follows:
Here, is the polynomial satisfying for all t (analogously for the Weibull case with ). From the definition of it follows that the growth periods of the curves are related with the roots of and . Indeed, at every root the derivative vanishes and the process stops growing. This is also directly related with the inflection points of the curves, which are discussed at the end of this section.
Other differential equations can be derived from (
3) by taking into account the original expressions (
1) and (
2) for Gompertz and Weibull multi-sigmoidal models, respectively. Indeed,
leads to solutions, provided the initial condition
for all
,
By taking limits when
t goes to infinity, such functions tend to values (the carrying capacity) which do not depend on the initial values. This is the main difference with respect to the model solution of Equation (
3). As a matter of fact, by considering
for
, Equation (
3) leads to curves
where
and
for
and
. Note that, provided such reformulation, the Weibull model now has one less parameter. From now on, all definitions depending on
will be considered for values
and
, instead of
and
.
Related with the above equations, note that the expression of
is simplified after the reformulation, being now
Furthermore, note that the limit value of such models depends on initial values
and
, that is,
The use of the latter models is justified in the cases of phenomena with multiple sample paths of observations, each starting at different points and having different limit values. Due to this more general character, in what follows we will consider the expressions of
given in (
4).
As mentioned above, inflection points are of great importance in studying the behavior of this type of curves. As such points make the second derivative of the curve vanish, expressions can be obtained by differentiating (
3) and taking into account expressions of
. Therefore, in the case of the multi-sigmoidal Gompertz function, condition
leads to equation
whereas, in the case of the multi-sigmoidal Weibull function, it follows that
The solutions of these equations are the inflection points of the curves. Many growth phenomena usually present at least one inflection, representing the moment when the growth rate changes. Multi-sigmoidal models may have multiple inflection points, and are therefore useful in modeling such types of growth phenomena.
In order to illustrate the behavior related with inflection points,
Figure 1 shows multi-sigmoidal Gompertz and Weibull curves for different values of the polynomial coefficients when
and
, as well as its first and second derivatives, all according to the values of
Table 1.
3. Gompertz and Weibull Multi-Sigmoidal Diffusion Processes
The aforementioned deterministic models are useful to describe growth phenomena in their most basic terms. However, in order to build a more precise model able to take into account other factors and uncertainties, their stochastic counterparts are recommended. Random fluctuations appear in every dynamical behavior, whether as a result of the nature of the phenomenon under study or as the uncertainty derived from observation and measurement instruments. They must therefore be included in any model attempting to describe the growth phenomena in any significant depth.
A traditional way to build stochastic models is including a random term (usually a white noise) in an ordinary differential equation. Nevertheless, such approach may lead to intractable models due to the potentially high complexity of the deterministic base model.
For this reason, it might be useful to consider other approaches, such as adding a random perturbation to a parametric function. In this case, following (
3), it is possible to introduce a white noise
with spectral density
, where
will be the diffusion coefficient of the stochastic model, in order to randomly perturb function
(from (
5)) and to obtain a function
. Please note that, for the sake of clarity and following the unified notation of the previous section, from now on we will omit the subscript in the expression for
.
After replacing
with
in (
3), such expression becomes a stochastic differential equation; concretely
where
is the standard Wiener process, independent of initial value
for
. Here,
is the initial time and
where
is the parametric space of the process.
Taking into account that
is continuous in
, Equation (
6) verifies the conditions for the existence and uniqueness of a solution. Said solution is a stochastic diffusion process, known as inhomogeneous lognormal diffusion process, taking values on
and characterized by infinitesimal moments
and
. In addition, a closed-form expression for the solution can be provided. In fact, by considering initial condition
, being
a random variable whose distribution must be either degenerate or lognormal (as further explained later in this section), we have
where function
is defined, for
, as
Note that parameter takes the form or when or , respectively, where and .
Function
depends on the integral of function
, defined in the previous section. Such integral can be computed easily for every multi-sigmoidal model, leading to
where parameter
depends on
in the sense that
and
.
Such results can be expressed in a more compact form as
where function
is defined as
The inhomogeneous lognormal process has been the subject of many studies, theoretical as well as applied. In particular, Román and Torres [
21] analyzed their versatility when modeling phenomena governed by growth curves such as those considered here (for example Richards [
22] and Hubbert [
23], among other curves). Gutiérrez et al. [
24] carried out a detailed analysis of that process from the perspective of SDE as well as from the point of view of Kolmogorov’s partial differential equations, including the study of the distribution of the process and characteristics of interest, such as the moment and percentile functions.
As regards the distribution of the process, if is either degenerate or lognormal, then the finite-dimensional distributions of the process are distributed according to a lognormal law. Indeed, random vector , where for all , has an n-dimensional lognormal distribution where is a vector of elements and is a matrix of elements , for . Here, and are the parameters of the initial distribution . Note that, if is degenerate at a point , i.e., a.s., then and in the previous expressions.
The transition probability distribution is particularly useful for inferential purposes. From the two-dimensional distributions
,
, the transitions of the process can be obtained, which are also lognormal. Concretely,
Once the distribution of the process has been established, different characteristics associated with it can be calculated, including the mean and conditioned mean functions, whose expressions are
and
In addition, taking into account (
4) and (
7) it can be verified that
thus the two mean functions are of the type considered in the previous section. This supports the use of the processes introduced to model phenomena whose behavior is of the multi-sigmoidal type. This provides good reason to study the inference of the processes, which is the subject of the next section.
4. Inference
In this section, the maximum likelihood estimation of the parameters of the two multi-sigmoidal processes is discussed, following the general methodology suggested by Román-Román et al. [
25].
Let us now consider a discrete sampling of d paths observed at times for and . For the sake of simplicity, we will consider for all i, i.e., all the paths are observed at the same time instants. Furthermore, let be the vector corresponding to the -th sample path (). Then, .
Let us assume a lognormal initial distribution, i.e.,
. Then, for a fixed value
, the log-likelihood function is
where
,
is the vector containing the parameters of the initial distribution and
where
(note that, in the case of equidistant times, this term becomes constant).
Provided the functional independence of parametric vectors
and
, the estimate of the former is
while the estimate of
is obtained from the solution of the system of equations (see [
25] for details)
where
The maximum-likelihood equations use the vector derivative of
, which is done by the following expressions, according to the compact form presented previously:
where, for
,
where
and
are the indicator function and the Kronecker delta, respectively. Note that, provided the degree
p of the polynomial, that is, the number of parameters of the curve, then
.
The system of Equation (
9) can not be solved explicitly, and it is therefore necessary to use numerical methods for which an initial solution is required. However, it is not possible to carry out a general study of the system of equations in order to check the conditions of convergence of the chosen numerical method, since the system is dependent on sample data which may therefore lead to unforeseeable behavior. One alternative would be using optimization procedures on the log-likelihood (
8), for which it is usually necessary to bound the parametric space. These two questions will be covered in
Section 4.2.
4.1. About
Before applying the inference procedure, it is necessary to discuss a modification involving the time interval where the processes are considered. Although such processes have been defined on a generic interval , two main reasons motivate the standardization of such interval to . The first one is the analytical complexity of the multi-sigmoidal lognormal model. When , many exponential functions turn to 1, making calculus more tractable than the general case . The second reason is the numerical precision of computational operations. When , fluctuations of the objective function (derived from the log-likelihood function) allow for a good performance of the numerical algorithms.
Such standardization does not, nevertheless, modify the nature of the process. Indeed, the two multi-sigmoidal lognormal diffusion processes being considered remain the same when the parametric space is transformed from
to
. In particular, infinitesimal moments fully characterizing the process are equal up to a term coming from the standardization procedure. A more formal explanation is shown in
Appendix A.
4.2. Initial Solutions
As mentioned earlier, the use of numerical methods to solve the system of Equation (
9), or of optimization methods in the case of maximizing (
8), requires having initial solutions in the first case or limiting the parametric space in the second.
In order to obtain initial solutions, we propose a simple linear regression. Taking
in agreement with the remark made in the previous subsection, from (
4) it follows, for the Gompertz case, that
where
, whereas for the Weibull case it has
which, taking into account relation
, becomes
Both Equations (
10) and (
11) can be viewed as linear regression models over
, and
, where
is the time vector and
is the vector of elements
for the Gompertz case, and
the vector of elements
for the Weibull case (
). Here,
are the final and
i-th observed value of the sample mean, respectively, to be modeled by Gompertz and Weibull processes. With all of this, initial values of polynomial parameters
and
follow from linear regression. The initial values for
and
are obtained from the relationships between these parameters and limit values
and
.
A similar procedure can be applied to obtain an initial value of . The equation for the regression comes from having a lognormal distribution. Indeed, it is known that, given a random sample from a lognormal distribution , the quotient between the arithmetic mean and the geometric one provides an estimation of . By applying this to the distribution of , we obtain, for each , an estimate of ; that is, , being and the geometric mean at time for Gompertz and Weibull cases. The initial value of is calculated by performing a simple linear regression of the values against . Note that if is a degenerate distribution, then . Otherwise, is previously estimated from the values of the sample paths at .
Regarding the maximization of the likelihood function, not all specialized software requires bounding the parametric space. However, for those that do, we suggest the following procedure:
Since high values of lead to sample paths with great variability around the mean of the process, we consider , so that the multi-sigmoidal behavior is advisable.
For the Gompertz case, we propose considering confidence intervals provided by the linear regression previously performed in order to find the initial solutions. It is advisable to use a high level of confidence, i.e., 0.999.
For the Weibull case, we consider the confidence intervals for parameters
, whereas for
, and since it is verified that
, we suggest taking interval
, where
4.3. Degree of Polynomial
The choice of the degree of the polynomial included in the infinitesimal mean of the multi-sigmoidal Gompertz and Weibull models must obviously be based on the data. Thus, we propose setting a high degree of polynomial in advance, q, and selecting:
For the multi-sigmoidal Gompertz model, the optimal polynomial regression model in for the data set being the class of polynomial regression models of degree less than or equal to q;
For the multi-sigmoidal Weibull model, the optimal polynomial regression model in for the data set being the class of polynomial regression models of degree less than or equal to q without the intercept term.
In order to do this, we will employ the usual Bayesian procedure for variable selection in normal regression models using intrinsic priors for model parameters and the uniform prior for models (see Moreno et al. [
26,
27,
28] for details). The optimal model will be the one having the highest posterior probability.
A slight adaptation of the Bayesian procedure for variable selection is required in our case:
For the multi-sigmoidal Gompertz model, by we denote the polynomial regression model of degree j, .
Given data set
, which comes from a model in
, the posterior probability of model
is given by
where
and
is the Bayes factor for comparing models
and
for the intrinsic priors, which depends on
, the ratio of the square sum of the residuals of models
and
.
For the multi-sigmoidal Weibull model, by we denote the polynomial regression model of degree j without the intercept term, .
Given data set
, which comes from a model in
, the posterior probability of model
becomes
where
and
is the Bayes factor for comparing models
and
for the intrinsic priors, which depends on
, the ratio of the square sum of the residuals of models
and
.
5. Application to the Description of the Evolution of the COVID-19 Pandemic
In this section, the stochastic models introduced earlier will be used to describe the evolution of COVID-19 in Spain during the first two waves of the pandemic. For the purposes of this analysis, which comprises two successive periods of infection, multi-sigmoidal models become particularly necessary.
5.1. About the Data
Since the beginning of the pandemic, the National Epidemiology Center (CNE), dependent on the Carlos III Health Institute, coordinates information related to the evolution of the disease in Spain. The CNE works at the service of public health, contributing to the control of diseases and risks in collaboration with the autonomous communities, the Ministry of Health, Consumer Affairs, and Social Welfare, and the rest of the national administrations with health-related attributions. Through the National Epidemiological Surveillance Network, this center collects the data obtained from the epidemiological survey that each Spanish autonomous community completes upon the identification of a COVID-19 case.
As is well known, keeping count of the number of positive cases posed many problems during the early stages of the epidemic because there was insufficient capacity for detection, mainly due to the shortage of diagnostic tests and the administrative chaos resulting from the collapse of the healthcare system. For this reason, the studies initially carried out were mostly based on data concerning hospitalizations and deaths, and such data were used to estimate the number of infected. This made it difficult to carry out a detailed analysis of the real evolution of the disease, and also to determine the occurrence of certain key moments such as the peak of infection. As time went by, the protocols to determine infected cases were improved, so that the data reported for the second and successive waves can be considered more reliable than those reported in the first.
The data considered are based on the number of infected and deceased individuals reported every 4 days by 15 Spanish regions between 8 March and 21 December 2020, a period that covers the first two waves of the pandemic as officially reported. Data have been extracted from
https://cnecovid.isciii.es/covid19/#documentacion-y-datos (last accessed on 24 July 2021). Each data series was modified by dividing by the value observed at the first time instant, so that the data processed represent, at each
t, how many times the value of confirmed cases and deaths multiplies the value initially observed.
Figure 2 shows the data. Note that the average value of the data (the solid black line) exhibits a multi-sigmoidal-type growth.
5.2. Fitting the Number of Infected Individuals
This section presents the application of the models introduced to adjust the data on the number of infected individuals. In the developed application we have addressed the following issues in sequence:
The choice of the degree of the polynomials included in the infinitesimal mean of the multi-sigmoidal Gompertz and Weibull models.
The maximum likelihood estimation of the parameters of the models.
The determination of the best model.
The study of the inflection time instants.
Results are presented next for each of these issues, according to the methodology described in previous sections. Likewise, and following the comments made in
Section 4.1, the time instants have been rescaled to the
interval.
5.2.1. The Choice of the Degree of Polynomials
This subsection follows the methodology presented in
Section 4.3. For the Gompertz model, and after choosing a maximum degree
,
Table 2 summarizes the posterior probabilities of the polynomial regression model of degree
j,
, given data set
coming from a model in
. The highest posterior probability is 0.8167759, corresponding to the polynomial regression model of degree 5.
In the same way, for
in the multi-sigmoidal Weibull model,
Table 3 summarizes the posterior probabilities of the polynomial regression model of degree
j without intercept term,
, given data set
coming from a model in
. Again, the optimal regression polynomial model is that of degree 5, which presents a posterior probability of
.
5.2.2. The Maximum Likelihood Estimation of the Parameters in Each Model
We address now the maximum likelihood estimation of the multi-sigmoidal Gompertz and Weibull models. In agreement with the conclusions drawn earlier, in both cases a polynomial of degree 5 will be considered in the infinitesimal mean of each process. Although in both cases an attempt has been made to solve the non-linear system of likelihood equations, arithmetic overflows in the calculus have prevented the likelihood equations for the multi-sigmoidal Gompertz process to be established. For this reason, in this case we have chosen to directly maximize the log-likelihood function.
The estimation process has provided the following results, where and are the elements of vectors and , respectively:
By using the spectral projected gradient method, implemented in the
spg R function of the
BB R-package [
29], the values of the parameters that maximize the logarithm of the likelihood for the multi-sigmoidal Gompertz model are:
,
,
,
,
,
, and
.
Applying the Newton method, implemented in the
nleqslv R function of the
nleqslv R-package [
30], in order to solve the non-linear system of likelihood equations for the multi-sigmoidal Weibull model, the following estimates of the model parameters have been obtained:
,
,
,
,
,
, and
.
5.2.3. Determination of the Best Model
Once the models have been estimated from the observed data, the question arises of determining which of the two models is the most appropriate to describe the global behavior of the phenomenon under study.
Figure 3 displays, for each model, the sample and estimated mean functions and the 95% confidence band for the mean function together with the sample paths.
In view of the plots in
Figure 3, it is not easy to decide on one model or another, although we must point out that, while the multi-sigmoidal Weibull model does not replicate the data trend very well at the beginning, the multi-sigmoidal Gompertz model does. However, the confidence band plots also indicate that both models have difficulties in fitting the data at the beginning.
A global measure of how well the sample mean is fit by the mean of the estimated model is the absolute relative error given by
This measure presents values 0.06537265 and 0.2657263, respectively, for the multi-sigmoidal Gompertz and Weibull models. In this sense, the first model appears to be more suitable.
Another relevant question is which of the two models provides a better estimate of the sample distribution at each instant of time. This can be done by determining the resistor-average distance [
31] between the sample and estimated distributions at each time instant.
The resistor-average distance is a symmetrized Kullback–Leibler distance defined as the harmonic sum (half the harmonic mean) of the component Kullback–Leibler distances, that is,
where
denotes the Kullback–Leibler distance between the sample distribution (
) and that of the estimated model (
).
For the models under consideration, the component Kullback–Leibler distances at time instant
are given by
and
where
, being
and
, respectively, the sample mean and sample geometric mean at
.
The values of resistor-average distances allow us to appreciate the time periods in which the estimated distribution moves away or approaches the sample distribution, and how close or far it is from said distribution. Furthermore, measures such as the mean or median of the resistor-average distances values allow us to globally assess the goodness of fit and select the best model.
For the estimated models,
Figure 4 shows the resistor-average distance between sample and estimated distributions as a function of time. An initial period of time is observed in which the Weibull model estimates the sample distributions much worse than the Gompertz model, and only at a later time does the Weibull model estimate the sample distributions somewhat better than the Gompertz model.
Table 4 collects some measures of interest to summarize the values of resistor-average distances. All such measures point at the multi-sigmoidal Gompertz model as the best choice.
5.2.4. The Study of Inflection Time Instants
A matter of great interest when describing the behavior of a set of growth-related data is to establish when the growing pattern changes, which is revealed by the inflection time instants. In the specific case that we are considering, these instants of time mark the moments in which infection peaks are reached in each wave.
Once we have determined and estimated an appropriate model to explain the evolution of the COVID-19 data, inflection times can be estimated using the inflection points of the estimated mean function of the selected model, .
By setting
, we have been able to estimate three inflection time instants.
Table 5 contains the estimated values of the inflections times,
, together with values
for
. We can conclude that changes in the average growth pattern occur on 2 April, 30 May, and 6 November.
Figure 5a depicts the second derivative of the estimated mean function. The vertical lines are located on the estimated inflection time instants.
Figure 5b locates the estimated inflection time instants (vertical lines) on the graph of the estimated mean function. The horizontal lines are placed on the values of the estimated mean function at those time instants. It should be noted that the second inflection moment obtained does not seem to correspond to a peak of infection proper. Indeed,
Figure 2 shows how between instants 0.2 and 0.4 (corresponding to the dates between 4 May and 30 June, approximately) the evolution of the pandemic slowed down considerably, in such a way that a plateau is observed on the graph. In that period of time, and despite fitting the observed average quite well, the estimated average shows a very slight decrease that motivates the appearance of this new inflection, without it being linked to a new wave of infection. Regarding the other two dates, they can be related to actions promoted by the Spanish government as well as the governments of the different autonomous communities. Indeed, on 14 March a state of alarm and nationwide lockdown and quarantine were imposed to control the spread of the virus, which meant that around 14 days later the peak of the first wave was reached. Similarly, around the third week of October an increase in the number of infected was observed. The reason for this must be found in the holidays that occurred in Spain on the occasion of the celebration of the National Day (12 October) and the mobility that this motivated among the population. For this reason, government actions took place, such as restricting attendance at university teaching, the closure of activities related to nightlife, and the limitation of opening hours for bars and restaurants. Again, around 14 days later, the peak of infection in this second wave was observed.
Furthermore, and regarding each of the inflection points, we can consider the first-passage-time variable, defined as the time required for data to reach the mean growth at the inflection point for the first time. By studying the distribution of this random variable we may determine, for example, the probability that the change in the growth pattern occurs in a certain period of time as well as the average or most frequent time instant in which the growth pattern changes.
The first-passage-time variable associated with each inflection time instant can be approximated by the first-passage-time variable defined as the time required for the estimated model to reach its mean growth at the estimated inflection point for the first time, that is, the time variable
The probability density function of the first-passage-time variable can be obtained as the solution of a Volterra integral equation of the second kind (see Gutiérrez et al. [
32,
33]). Nevertheless, and apart from some particular processes and boundaries, closed-form solutions for the integral equation are not available, as is the one we are considering in this application. For this reason, for these cases numerical procedures are needed. The most usual methods are based on numerical quadrature procedures, as the composite trapezoid method (see [
34,
35]).
Considering the comment made earlier about the second inflection time obtained, we will now focus on the other two, for which we will obtain the first-passage-time variables using the
fptdApprox R-package [
36,
37,
38].
Figure 6 contains the density functions of random variables
,
, (note that we have kept the same notation used in the point estimation of the inflection time points).
The approximation procedure of the density functions provided the following information about changes in the growth pattern of the COVID-19 data:
The first change occurs, with a probability of 0.999, in [0.055306, 0.185625], that is, between 23 March and 30 April.
The second relevant change (remember the above remark) happens in time interval [0.251004, 1], that is, between 19 May and 21 December, with a probability of 0.656171. It should be noted that in this case there is a part of the range of the first-passage-time variable that exceeds the temporal limits considered in the data (see
Figure 6b), which means that the complete probability mass is not confined to this interval. This is due to the fact that some regions reached the established level of contagion after 21 December.
Table 6 summarizes some of the main numerical characteristics of variables
,
, concretely the mean, variance, and mode, as well as some of the most relevant percentiles of the distributions. Note that in the case of variable
, neither the mean nor the variance have been calculated since the observation interval does not contain the complete probability mass. The percentiles falling into said interval are shown. From the values taken by these characteristics we can draw some conclusions, namely:
The first change in the growth pattern of the COVID-19 data happened, on average, at time instant 0.091806, that is, on 3 April, and more frequently at time instant 0.086358, that is, on 1 April. Of all regions, 50% reached the peak of infection before 2 April, while by 4 April they had already exceeded the peak by around 75%.
Regarding the second relevant change, it occurs more frequently at time instant 0.801248, that is, on 24 October, while 50% of regions peaked before 17 November.
5.3. Fitting the Number of Deaths
Next we show the results about the fitting of the number of deaths, for which the same methodology that has been considered with the number of infected has been followed. The study of the number of deaths has shown to be of great interest to epidemiologists. Indeed, as mentioned above, the collection of data on those infected during the first waves of the epidemic, especially during the first, has presented quite a few drawbacks meaning that the data collected does not show, most likely, the reality of the number of infected. This has given rise to an extensive literature in which procedures are shown to approximate the data of infected from that of deceased.
Since the methodology and procedures used have been described in detail in the previous application, below we summarize the main results obtained. Note that in this case it has not been necessary to rescale the time space associated with the process.
Regarding the degree of the polynomial, the selection method chooses degree 7 for both the Gompertz and Weibull models as can be deduced from
Table 7.
Once the degree of the polynomial has been selected, the estimation of the parameters of both models provide the following results:
for Gompertz model, , , and ,
for Weibull model, and ,
from which the mean functions are estimated as well as the 95% confidence bands, whose graphs can be seen in
Figure 7.
Regarding the selection of the optimal model, the results provided by the resistor-average distance (see
Figure 8 and
Table 8) indicate that both models fit the variable of interest quite well, although the Gompertz multi-sigmoidal model offers a better fit in the initial phase, which makes the measurements calculated from the resistor-average distance bias the decision towards said model.
Regarding the determination of the death peaks, the point estimation determines that these were reached on 31 March and 22 November, respectively. Finally, the probability density functions of the first-passage time of the process through the barriers defined by the value of the mean estimated in said time instants (
and
, respectively) have been approximated.
Figure 9 shows the graph of such density functions.
Finally, by applying the proposed methodology, we can deduce the following:
The first-passage-time variable through is quite symmetric. In fact, it can be seen how the mean coincides with both the mode and the median. This leads to establishing 31 March as the date on which the peak of deceased was reached with a high probability. Furthermore, the variable is quite concentrated, observing that on 1 April, 75% of the regions had already reached the peak, while 97.5% did so on 2 April.
Regarding the second peak, which more frequent date is 20 November, while 50% of regions peaked before 27 November. Furthermore, at the end of the observed period, 64.4% of the regions have reached the peak.
6. Conclusions
In real phenomena governed by growth curves there are situations in which the maximum level of growth is reached after successive stages, in each of which there is a slowdown followed by an exponential explosion. A clear example of this is provided by the evolution of pandemics, as is the case of the current one caused by the SARS-CoV-2 virus, where the appearance of successive waves of infection has been (and continues to be) observed. This motivates the use of sigmoidal curves with more than one inflection point.
In this work, two stochastic diffusion processes are presented, their main characteristic being that their means are multi-sigmoidal growth curves derived from classical curves such as the Gompertz and the Weibull. In both cases, these curves have been generated by introducing polynomial functions in their classic expressions.
Starting from a global formulation for both processes, this paper studied the problem of estimation using the method of maximum likelihood. Parameter estimates have been obtained by solving the system of likelihood equations as well as by direct maximization of the likelihood function. This entails analyzing how to obtain initial solutions in the first case and the delimitation of the parametric space in the second. Determining the degree of the polynomial before approaching the estimation of the parameters is a fundamental question in real-world applications. To do this, a Bayesian approach to the model selection problem has been considered based on the methodology derived from intrinsic prior distributions.
The stochastic models introduced here have been applied to data concerning the evolution of the COVID-19 epidemic in Spain during the first two officially recognized waves. For this, sample trajectories have been considered corresponding to the data of infected and deceased individuals reported by 15 Spanish regions between 8 March and 21 December 2020. The method of selecting the degree of the polynomial leads to the choice of degree 5 as optimal in the case of the number of infected, while for that of the deceased the selected degree is 7. After estimating the models, we proceeded to select the optimal one for the description of the evolution of the pandemic. Although the two diffusion processes considered are good models to describe the phenomenon under study, one of them was selected as the best choice. For this, two criteria were used: the absolute relative error between the observed and the estimated mean, and the average resistor distance, which measures the discrepancy between the sample and the estimated distributions. Both criteria favor the choice of the model based on the multi-sigmoidal Gompertz curve for both the number of infected and the number of deaths.
The study of the inflection times of the estimated mean function provides an estimate of the moments when the peak of the epidemic was reached. Our conclusions were drawn considering the total number of infected and death at the national level. By including in our analysis data from several regions we can account for the variability derived from their specific characteristics. Regarding the point estimation of the inflection instants, the results indicate that, for the number of infected, the peak of the first wave was reached around 2 April, while the second peak occurred around 6 November. In the case of the number of deaths, such values are around 31 March and 22 November, respectively.
Making a comparison with the information provided by other studies, we can see how the already refined statistics from the National Epidemiology Center (CNE) of Spain indicate that the nation-wide first-wave infection peak occurred on 20 March, 6 days after the national lockdown was implemented, and the cases of COVID-19 steadily decreased after then (see [
39] for details). Other studies place the peak on 26 March (Guirao, [
40]), while in the work of Mora et al. [
41] an interval is established that goes from 29 March to 3 April. Since the first week of July, Spain experienced an increasing trend in the cumulative incidence. The data provided by the CNE place the peak of the second wave around 27 October, two weeks after the National Day holidays (21 October), which led to a large increase in the mobility of the population. Regarding the number of deaths, the peak of the first wave is placed on 30 March, whereas the corresponding to the second wave is placed on 17 November.
Another perspective for the analysis of the tipping peaks is provided by studying the time required for data to reach, for the first time, the mean growth values at an inflection point. Our results indicate that 50% of regions reached the first peak of infection before 2 April, while by 4 April they had already exceeded the peak by around 75%. Regarding the second peak it occurred more frequently on 24 October, with 50% of regions reaching peak values before 17 November. In the case of the number of deaths, 50% of the regions reached the first peak before 31 March, while on 2 April it was 97.5% the regions that had reached it. On the other hand, 20 November is the most frequent date in which the regions reached the second peak, with 26 November being the date on which 50% of the regions had already reached it.
With the present work as a starting point, several lines of research can be suggested. Among them, the models may be modified by introducing exogenous variables that represent the therapeutic and non-therapeutic actions promoted by the authorities, and thereby explore their impact on the evolution of the pandemic. Other studies may look into how the number of death and/or hospitalized can be introduced as exogenous variables with the aim of improving the estimates of the real number of infected. Such research initiatives may be undertaken by introducing temporal variables in the infinitesimal moments of the diffusion processes being considered. Other research lines can be derived from the incorporation of this type of multisigmoidal curves to SIR-type models as well as considering diffusion processes with delays.
Finally, it is important to highlight the fact that the models presented in this work have allowed a good adjustment of both the number of infected and the number of deaths. This is of special interest given the vast literature that has been generated around the estimation of the number of infected from the number of deaths given the low quality of the data on infected during, especially, the first wave.