1.1. Study Background
The coronavirus epidemic began in Wuhan City (China) on 31 December 2019, and became a pandemic. The incidence of novel COVID-19 infections dramatically increased due to the absence of antiviral medications and vaccines, resulting in enormous economic losses, panic, and many deaths.
Using different statistical models to analyse epidemic data has emerged as a critical study field for predicting the number of COVID-19 deaths and infected individuals.
Statistical models represent the numerical data relevant to specific samples or groups. To assess trends in the data shown, these models frequently take the form of line graphs and scatterplots. While statistical models may display data in various scenarios, those dealing with COVID-19 are particularly valued at present since they provide numerical information about this pandemic, such as the number of cases and deaths brought on by COVID-19. These models have also proved very helpful in localizing cases to specific nations, regions, cities, and specific areas within cities, enabling the authorities in these locations to respond appropriately to the infection. Additionally, models have focused on various crucial traits among individuals who present with COVID-19, such as age, race, sex, and preexisting diseases. This enables researchers to determine which populations are most at risk of infection [1].
Artificial intelligence (AI) techniques built on machine learning (ML) and mathematical models have been used to evaluate the epidemic’s progress in each country and to identify factors that might amplify or mitigate its effects [2].
1.2. Literature Review
Building on earlier research, [3] examined the relationship between dependent and independent variables to determine the current rate of spread of COVID-19. This research statistically analysed the relationship of factors such as region, sex, birth year, infection date, and recovery or symptom relief date with the recorded number of recovered and deceased patients. The findings revealed that region, infection date, and sex were associated with the number of recovered and deceased patients, whereas birth year was only associated with the number of deceased patients. Furthermore, no deaths from COVID-19 were noted among recovered patients, whereas 11.3% of patients who died were confirmed to be COVID-19 positive after their deaths. In South Korea, the main factor associated with the number of infections was the number of patients infected by an unknown source, representing more than 33% of the total number of infected patients.
The association between the overall number of COVID-19 infections and the number of recovered people in various countries was studied and analysed by [4] using the chain-binomial variant of Bailey’s model. They also noted that most studies have investigated COVID-19 cases with the regression and time series models commonly used to assess the trend or growth of any illness.
The relationship between the transmission of viral infections and human migration was investigated by [5]. They concluded that the intensity of pedestrian traffic in the research period impacted virus spread after 15–20 days on average.
[6] aimed to create a time series-based system for tracking epidemics. In the first stage, the author used univariate time series models to show the evolution of the reported cases. He then combined the models to obtain more precise and reliable results and analysed statistical probability distributions to generate hypothetical future scenarios. In the last stage, the “time series susceptible-infected-recovered” (tsiR) model was developed and applied, and its epidemiological ratio (R0) was calculated to determine when the epidemic ended. The time series models comprised traditional exponential smoothing, ARIMA techniques, feed-forward artificial neural networks (ANNs), and multivariate adaptive regression splines (MARS) from the ML toolbox. The combinations included the primary mean and the Granger–Newbold and Bates–Granger techniques. The tsiR model and the R0 ratio were applied to assess the spread and containment of the epidemic. The recommended method was used to monitor the COVID-19 outbreak in Greece.
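To make the idea of combining forecasts concrete, the following is a minimal sketch, not the procedure used in [6]. It contrasts an equal-weight mean with inverse-MSE weights, one simple scheme in the spirit of Bates–Granger combination that ignores correlations between the models' errors. The function name `combine_forecasts` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def combine_forecasts(forecasts, past_errors=None):
    """Combine forecasts from several models into one series.

    forecasts:   array of shape (n_models, horizon)
    past_errors: optional array of shape (n_models, n_past) holding each
                 model's historical forecast errors. If omitted, all models
                 receive equal weight (simple mean).
    Returns (combined_forecast, weights).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    n_models = forecasts.shape[0]
    if past_errors is None:
        weights = np.full(n_models, 1.0 / n_models)
    else:
        # Inverse-MSE weights: models with smaller past errors count more.
        mse = np.mean(np.asarray(past_errors, dtype=float) ** 2, axis=1)
        inverse = 1.0 / mse
        weights = inverse / inverse.sum()
    return weights @ forecasts, weights

# Illustrative (made-up) forecasts from two models over two periods.
forecasts = [[10.0, 12.0], [14.0, 16.0]]
past_errors = [[1.0, -1.0], [2.0, -2.0]]  # model 1 was more accurate

mean_combo, mean_w = combine_forecasts(forecasts)
mse_combo, mse_w = combine_forecasts(forecasts, past_errors)
print("equal weights:", mean_w, "->", mean_combo)
print("inverse-MSE weights:", mse_w, "->", mse_combo)
```

With these made-up inputs, the inverse-MSE scheme pulls the combined forecast toward the model whose past errors were smaller, whereas the simple mean treats both models alike.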
Using Bailey’s model and secondary data, [7] calculated the removal rate, or the percentage of deceased individuals in the infected population. Additionally, regression analysis was performed to demonstrate the linear association between this indicator and the frequencies of all infections. Finally, they discussed the connection between the model and decision-making.
By carefully analysing the cases reported in the country up to 22 April 2020, [8] used exploratory data analysis to create a statistical model to help people understand COVID-19 in India. The study’s findings illustrated the daily and weekly effects of COVID-19 in India and drew comparisons between that nation, its neighbours, and other badly afflicted nations.
The impact of travel history and interaction with travellers on the dissemination of COVID-19 in Nigeria was evaluated by [9] using the ordinary least squares (OLS) estimator. They created predictions using data extracted from the Nigeria Centre for Disease Control (NCDC) website for the period from 31 March 2020 to 29 May 2020. The model evaluated the time before and after the Nigerian federal government imposed travel restrictions. Based on the diagnostic checks performed, the fitted model exhibited an excellent fit for the dataset with no validity violations. With travel history and contact with travellers observed to increase the likelihood of COVID-19 infection by 85% and 88%, respectively, the results demonstrated that the government made the right choice in enforcing travel restrictions. The authors concluded that the government must enforce this policy to contain the spread of COVID-19.
Using stochastic modelling, [10] forecasted COVID-19 prevalence trends in East African countries, focusing on Somalia, Sudan, Djibouti, and Ethiopia. The study’s findings indicated that, under the average rate scenario, the number of COVID-19-positive individuals in Ethiopia would increase, ranging between 5846 and 56,610 within four months after 30 June 2020.
An autoregressive distributed lag model and limited cointegration tests were used by [11] to evaluate the long-term equilibrium relationship between the cumulative number of new COVID-19 infections (X) and the cumulative number of deaths due to COVID-19 (Y). The stability of the estimated model was also assessed, and the consistency of the model parameters was evaluated using the cumulative sum (CUSUM) of recursive residuals and CUSUM of squares tests.
The dynamic relationship between the number of cases and deaths was examined by [12] using the vector error correction model (VECM), the Johansen–Fisher cointegration test, and the Granger causality test. From 1 April 2020 to 26 December 2020, data on daily new COVID-19 cases and COVID-19-related deaths in India, Ukraine, Canada, and the USA were obtained from the website. Summary figures showed that the USA had the largest number of COVID-19 cases, followed by India, Canada, and Ukraine. The USA also had the highest number of COVID-19-related deaths, followed by India, Ukraine, and Canada. Canada had the highest death rate, followed by the USA, Ukraine, and India. The results of the Johansen–Fisher cointegration test indicated that there was only one cointegration equation. The Granger causality test and the VECM demonstrated short- and long-term causal relationships between COVID-19 infection and mortality. The rate of adjustment was 9.9%.
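One way to read an adjustment rate of this size is through the implied half-life of a deviation from the long-run equilibrium. The sketch below is an interpretation aid only and is not taken from [12]: it assumes that a constant 9.9% of the remaining disequilibrium is corrected in each period (one day, given the daily data), and the helper name `half_life` is hypothetical.

```python
import math

def half_life(adjustment_rate):
    """Number of periods needed to close half of a deviation when a fixed
    share `adjustment_rate` of the remaining gap is corrected each period."""
    if not 0.0 < adjustment_rate < 1.0:
        raise ValueError("adjustment_rate must lie strictly between 0 and 1")
    # After k periods the remaining gap is (1 - rate)**k; solve for 0.5.
    return math.log(0.5) / math.log(1.0 - adjustment_rate)

periods = half_life(0.099)  # 9.9% adjustment per period (assumed daily)
print(f"Half of a deviation is corrected after about {periods:.1f} periods")
```

Under these assumptions, half of any short-run deviation would be eliminated in a little under a week.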
1.4. Panel Data Model
Panel data include observations of events gathered over multiple time periods for the same group of people, entities, or units. Econometric panel data, in a nutshell, are multidimensional data collected over a certain period.
A simple regression model of panel data is defined as

$$Y_{it} = \alpha + \beta X_{it} + u_{it},$$

where $u_{it}$ represents the residuals (error term) of the panel regression, $Y$ represents the dependent variable, $X$ denotes the explanatory or independent variable, $\alpha$ and $\beta$ indicate the intercept and slope, respectively, $t$ represents the $t$th period, and $i$ represents the $i$th cross-sectional unit. $X$ is assumed to be non-stochastic, and the error term is assumed to follow the classical assumptions, i.e., $u_{it} \sim \mathrm{IID}(0, \sigma^2)$. In the present research paper, the number of cross-sections (districts) was 37 (i = 1, 2, 3, …, 37), and the number of time points was 30 (t = 1, 2, 3, …, 30).
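As an illustration of how such a model can be estimated, the sketch below fits a pooled OLS regression on a synthetic panel with the same dimensions as in this paper (37 units, 30 periods). It assumes the simple panel regression is written as Y_it = α + βX_it + u_it; the symbols α, β, and u, the function name `pooled_ols`, and the simulated data are assumptions made for the example and are not the study's data or necessarily its estimator.

```python
import numpy as np

def pooled_ols(y, x):
    """Estimate alpha and beta in y_it = alpha + beta * x_it + u_it by
    pooled OLS, stacking all units and periods into one sample.

    y, x: arrays of shape (n_units, n_periods).
    Returns (alpha_hat, beta_hat, residuals).
    """
    y_flat = np.asarray(y, dtype=float).ravel()
    x_flat = np.asarray(x, dtype=float).ravel()
    # Design matrix with a column of ones for the intercept.
    design = np.column_stack([np.ones_like(x_flat), x_flat])
    coef, *_ = np.linalg.lstsq(design, y_flat, rcond=None)
    residuals = y_flat - design @ coef
    return coef[0], coef[1], residuals

# Synthetic panel: 37 cross-sectional units observed over 30 periods.
rng = np.random.default_rng(0)
n_units, n_periods = 37, 30
x = rng.uniform(0.0, 100.0, size=(n_units, n_periods))
u = rng.normal(0.0, 1.0, size=(n_units, n_periods))  # classical errors
y = 2.0 + 0.5 * x + u                                # assumed true alpha, beta

alpha_hat, beta_hat, resid = pooled_ols(y, x)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"pooled sample size = {resid.size}")
```

Pooling gives 37 × 30 = 1110 observations for two parameters, which is the sense in which a panel offers more degrees of freedom than a single 30-point time series. Pooled OLS ignores unit-specific effects; fixed- or random-effects estimators relax that restriction.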
Detailed discussions of panel data modelling can be found in [13,14,15,16,17].
Panel data provide “more informative data, more variability, less collinearity among variables, more degrees of freedom and more efficiency” because they combine time series of cross-sectional observations [14].