1. Introduction
The COVID-19 epidemic started in Hubei Province, China, around December 2019. Since then, the disease has spread to all continents and countries of the world, being categorized as a pandemic by World Health Organization on 11 March 2020.
In recent months, contributions have been made that analyze the evolution of the pandemic in different countries, implementing mathematical models to predict their evolution. Traditional predictive models for infectious diseases mainly include models for predicting differential equations and models for predicting time series based on statistics and random processes.
For example, in [
1] a methodology with the aim of estimating the actual number of people infected with COVID-19 in France is presented, since according to the authors, the number of screening tests carried out and the methodology do not directly calculate the actual number of cases and infection mortality rate (IFR). A mechanistic–statistical approach was developed that combines an epidemiological SIR model that describes these unobserved epidemiological dynamics, a probabilistic model that describes the data collection process and a method of statistical inference.
The logistic growth model, the generalized logistic growth model, the generalized growth model and the generalized Richards model were used to model the number of infected cases in the 29 provinces of China (and several countries), performing a detailed analysis on the heterogeneous situations by four phases of the outbreak in China [
2].
In [
3] the Kermack–McKendrick SEIR model (Susceptible, Exposed, Infectious and Recovered) is presented to analyze the effects of behavioral changes on the reduction in community transmission in Mexico. A variable contact rate over time is proposed and the consequences of disease spread in an affected population of non-essential activities is analyzed.
The behavior of the virus in Japan has also been analyzed [
4]. By 29 February 2020, in addition to the 619 confirmed cases (passengers and crew members) infected with COVID-19 in a cruise ship (near Tokyo), 215 locally transmitted cases had been also confirmed in Japan. To evaluate the effectiveness of reaction strategies based on avoiding large accumulations or crowded areas and to predict the spread of COVID-19 infections in Japan, in [
4] a stochastic transmission model produced by expanding the epidemiological model based on SIR (Susceptible-Infected-Removed) had been presented. The simulation results showed that the number of Infected and Removed patients will increase rapidly if there is no reduction in the time spent in crowded zone.
In [
5] using the Maximum-Hasting (MH) parameter estimation method and the SEIR model, the spread of COVID-19 and its prediction in South Africa, Egypt, Nigeria, Senegal, Kenya, and Algeria under three intervention scenarios (suppression, mitigation, mildness) is presented.
In addition to the most relevant epidemiological models used in the literature, models typically based on time series have also been used to analyze the behavior of the pandemic in different countries. The autoregressive integrated moving average (ARIMA) model is a mathematical model widely studied in the context of time series that has been successfully applied in the field of health (estimate the incidence and prevalence of influenza mortality, malaria incidence, hepatitis, and other infectious diseases) as well as in different fields in the past due to its simple structure, fast applicability and ability to explain the data set.
In [
6], ten Brazilian states are analyzed using the autoregressive integrated moving average (ARIMA), the cubistic regression (CUBIST), the random forest (RF), ridge regression (RIDGE), the support vector regression (SVR) and the stacking-ensemble learning in the task of time series forecasting of the number of patients infected with COVID-19 with one, three, and six-days ahead. A forecasting model based on ARIMA has also been presented in [
7] for Pakistan, presenting the high exponential growth in the number of confirmed cases, deaths and recoveries. In [
8], ARIMA time series models were applied to forecast the total confirmed cases of COVID-19 for the next ten days using the model ARIMA (0,2,1), ARIMA (1,2,0) and ARIMA (0,2,1) for Italy, Spain, and France, respectively.
Currently, the analysis of the evolution of COVID-19 in America is of great importance due to the impact of this epidemic on this continent. In this contribution we will focus on the United States. The first patient detected in the United States was a travel-associated case from Washington state on 19 January 2020. The preponderance of initial cases of infected patients with COVID-19 in the United States were correlated with travel to a “high-risk” country or close contacts of previously identified cases corresponding to the testing criteria adopted by the Centers for Disease Control and Prevention (CDC) (
https://www.cdc.gov/, accessed March 2020). From 1 to 31 March 2020, the number of reported COVID-19 cases in the United States rapidly increased from 30 to 188,172, the number of deaths rising from 1 to 5531, and the virus being detected in all the states. At the end of April, the number of infected reached 1,069,424 and the number of deceased stood at 62,996. At the time of writing this contribution (14 June 2020) the number of infected is more than 2 × 10
6 and more than 100.000 deaths, the United States being one of the countries that is suffering the most from COVID-19.
In a recent paper [
9], an attempt was made to estimate the actual number of infected people, even if they have not been counted. It was estimated that the true number of COVID-19 cases in the United States is likely in the tens of thousands, suggesting substantial undetected infections and spread within the country [
10].
This contribution presents a methodology to analyze the evolution patterns of COVID-19 in the states of United States (including Puerto Rico and the District of Columbia). A parametric similarity measure is presented, based on a robust distance measure between time series, the dynamic time warping distance (DTW), with which the number of infected and dead in each of the states can be compared simultaneously, even though the start of the epidemic originated on different dates in each zone (therefore, the time series that need to be compared have different lengths).
To the best of our knowledge, this contribution is the first study that tries to develop a hierarchical clustering time series algorithm in order to globally compare and classify the behavior of all the states of United State simultaneously in their evolution of infected and deceased patients suffering COVID-19. Carrying out this classification is very useful, since it will allow the establishment of similarities and patterns in the evolution of the pandemic among the states of the United States.
2. Material and Methods
A time series is a sequence of numerical (temporal) data points in successive order, which is naturally high dimensional and large in data size. There are two main operations that could be performed when working with time-series with its sequential data: (a) the analysis of a single time series; (b) the analysis of multiple time series simultaneously. This contribution is concentrated on the analysis of multiple time series for all the states of US suffering COVID-19, with the purpose of finding similarities between multiple time series by performing a clustering time-series methodology.
Clustering such complex objects is particularly advantageous because it may lead to the discovery of interesting patterns in time-series datasets, which contributes to a better understanding of the COVID-19 spread in different regions of the United States.
Clustering of time-series sequences has received noteworthy attention [
11,
12], not only as a formidable exploratory method and powerful tool for discovering patters, but also as a pre-processing step or subroutine for other tasks [
13].
In this section, the database used is presented first (
Section 2.1). Subsequently, a review of the most popular distance measures for time series is described (
Section 2.2) and a new parametric distance is proposed.
2.1. Data Set
The COVID-19 epidemic data set used in this contribution was collected from the Johns Hopkins University [
14]. In this platform, the number of confirmed, deaths and recovered cases until 14 June 2020 for different countries are presented. For the United States, two additional .csv files are provided, in with detail of administration and province/state is reported (including Puerto Rico and District of Columbia). In order to compare countries behavior, the time-series data are divided by state population.
2.2. Similarity/Distance Measure in Time Series
In a simplified way, the similarity of two simple time series with the same number of points (denoted by
m), and defined by
X = {
x1,
x2,….
xm} and
Y = {
y1,
y2,….
ym}, can be achieved by simply calculating the Minkowski (or Euclidean) distance (shortest path between two points) between points on both time series that happen at the same time. This distance is the measure of similarity, denoted as
d(
X,
Y), and it is a function that takes both times series (
X,
Y) as input and calculates their distance “
d”, defined as:
when
k = 2, the distance between two series is called Euclidean Distance. Using the Minkowski distance is a good metric to analyze the similarity of two time series, if these time series are synchronized (that is, all similar events in both time series occur at exactly the same time) and have the same length.
The evolution of time series in the different states of the United States present a different start date, both for the number of confirmed and death cases, and therefore its length is also different. Suppose as an analogy the time series of the sound of a mother’s voice when she speaks slowly to her child. If the mother says the same phrase quickly, the child will most likely recognize that she is still his mother. However, if the Euclidean distance between both series were used as a metric, these two time series would have a very low similarity and would not be considered fundamentally equal. This would lead to the conclusion that the two voices did not come from the same person. To solve this problem, the dynamic time warping distance (DTW) method is frequently used in the bibliography [
15].
DTW is a technique that can be considered as an extension of the Euclidean Distance between series [
16], that calculates an optimal match between two given time series with certain restriction, performing non-linearly in the series (by stretching or shrinking along its time-axis). This distortion (denoted as warping) between two time series is used to find corresponding regions and determine the similarity between them.
The DTW of two series X and Y, defined as
X = {
x1,
x2,….
xn} and
Y = {
y1,
y2,….
ym} is computed in the following way. An
n-by
-m matrix D is computed with the (
i.
j)th element, defining the local distance of two elements by:
The point-to-point alignment between series
X and
Y can be represented by a time warping path
W, defined as:
where
p is the length of the warping path
W, and
wx (
k) and
wy (
k) represent the indexes in time series
X and
Y, respectively. The warping path
indicates that the
wx (
k)th element in time series
X maps to the
wy (
k)th element in time series
Y. There are some constraints and rules for the construction of the warping path:
Every index from the first time series must be matched with one or more indices from the other time series (and vice versa)
The first (the same for the last index) index from the first time series must be matched (not only this match) with the first (last) index from the other time series. That is, the warping path should start at W(1) = (1,1) and end up at W (p) = (n,m).
The mapping of the indices from the first time series to indices from the other time serie must be monotonically increasing, and vice versa. The adjacent elements of path W, W(k) and W(k + 1) must be subject to wx (k + 1) − wx (k) ≥ 0 and wy (k + 1) − wy (k) ≥ 0.
The warping path should also have the property of continuity, mathematically expressed as adjacent elements of path W, W(k) and W(k + 1) must be subject to wx (k + 1) − wx (k) ≤1 and wy (k + 1) − wy (k) ≤ 1.
The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values. The DTW (minimal distance and optimal warping path) could be found using a dynamic programming algorithm:
where
is the minimal cumulative distance from (0, 0) to (
i,
j) in matrix
D. In the methodology proposed in this paper, for each of the states analyzed, both the time series of the number of infected and the time series of deaths will be simultaneously taken into account.
If each of these time series needs to be weighted differently, the following parametric metric,
is defined:
that measures the similarity in the evolution of the COVID time series for two states of the United States (
SA y
SB),
TSCA and
TSCB represent the time series of the number of infected,
TSDA and
TSDB represent the time series of the number of deaths for the states
SA and
SB, respectively. The parameter α (with 0 ≤
α ≤ 1) indicates the relative relevance given to the similarity measure, taking into account the time series of infected or deaths.
2.3. Clustering Method for Time Series
Clustering is a data mining technique in which similar data are divided into related or homogeneous groups, in an unsupervised way, that is, without a priori advanced knowledge of the data. For the problem presented in this contribution, working with time series of states of the United States suffering COVID-19, given a set of individual time series data, the objective is to group similar time series into the same cluster.
The problem of grouping time series data is formally defined by, given a dataset of
N time series data
Q = {
X1,
X2… XN}, finding in an unsupervised way a partition of
Q into
K cluster, denoted as
C={
C1,
C2,
… Ck}, taking into account that homogeneous or similar series are grouped together based on a certain similarity/distance measure. In this paper, the parametric metric
is used and there is no intersection between clusters, therefore:
The methods used in the area of time series clustering [
11,
17] are usually based in a conventional clustering algorithm by substituting standard distance measurements with a more suitable distance to compare time series (raw methods) or converting series into normal data and using directly classical algorithms (feature-based methods and models).
Among the most popular clustering algorithms, the hierarchical clustering and the k-means algorithm are widely used in time series clustering. In this contribution the hierarchical clustering is used, mainly due to its great visualization power and its simple and intuitive interpretation.
Hierarchical clustering creates a nested hierarchy of similar time series, according to a pair-wise distance matrix of the time series analyzed. The similarity measure used is therefore essential in this time-series clustering process.
The most widely used linkage criteria, such as single, average and complete linkage variants [
18], were analyzed. Hierarchical clustering can be converted into a partitional clustering, with
k cluster, by cutting the first
k links.
3. Results and Discussion
To evaluate the performance of the proposed method, several experiments are conducted in this section for three values of the parameter α in the distance metric
. The time series of the states of the United States was taken from John Hopkins database. Data range from 22 January 2020 to 14 June 2020, covering all the states (including Puerto Rico and District of Columbia). For the computation of the distance metric, a threshold
Imin was defined, defining the minimum number of infected people to start the time series, being for this study
Imin = 5 (the number of confirmed was greater than 5). Therefore, the length of the time series is different for each state, being on average 101 days (
Figure 1). The index of each of the states is presented in
Table 1.
The values of the parameter α analyzed will be {0, 0.5, 1}. This section of results begins with the value of α = 0.5, that is, the information of the confirmed patients time series with the same relevance as the time series of deaths for the final computation of the distance
. The distance matrix between the different states is presented in
Figure 2 (for a better visual representation, the distance matrix was multiplied by a constant and the states are ordered according to the cluster to which each one belongs).
It is important to highlight the existence of various clusters with only one state (corresponding with District of Columbia, New Jersey and Rhode Island). Cluster 7 (New Jersey, listed as 48 in
Table 1) links directly to cluster 8 ((49) Connecticut, (50) Massachusetts, (51) New York), which denote similar behavior between these states. For cluster 6 (District of Columbia, number 47), the linkage is performed for both cluster 3 ((32) Michigan, (33) Pennsylvania) and cluster 4 ((34) Delaware, (35) Illinois, (36) Louisiana, (37) Maryland). There are two large clusters (cluster 2 and cluster 5) that contain 29 and 9 states, respectively. Its linkage is performed through cluster 1, which contains only two states (Nebraska and South Dakota).
The similarities and distances between the different states and clusters obtained can be analyzed using the results presented in the hierarchical clustering (
Figure 3) and distance matrix (
Figure 2).
Figure 4 presents a geographical representation of all the states according to the cluster grouping obtained in
Table 1, using the similarity between the clusters obtained using the dendogram in
Figure 3.
The hierarchical clustering previously presented was obtained taking into account the time series of the number of confirmed and death cases simultaneously (α = 0.5).