1. Introduction
In time series analysis, in addition to trend, seasonality, periodicity and stationarity, another concept of great significance is the similarity between two or more time series, that is, the study of the features and changes that the series have in common. Over the years, different techniques have been proposed for measuring similarity, such as simple mathematical measures (see, e.g., [1,2]), data transformations (see, e.g., [3]), algorithmic methods (see, e.g., [4]) and measures of divergence (see, e.g., [5,6,7,8]). Finally, measures of dissimilarity have been thoroughly studied (see, e.g., [9,10,11]).
Data series analysis and, in particular, time series analysis involve, among others, pattern matching, anomaly identification and frequent pattern detection. All these tasks are directly associated with time series similarity techniques, some of which are mentioned above. We frequently observe that data series similarity is visualization-dependent. Indeed, it is quite common, for instance, that neuroscientists manually inspect the electroencephalogram (EEG) data of their patients, using visual analysis tools, so as to identify patterns of interest (see, e.g., [12]). Surveillance systems also rely on visual tools to monitor incidence data and compare the disease behavior in various regions for the purpose of predicting or, hopefully, preventing an epidemic. Finally, physiologic time-series databases often require finding temporal patterns of physiological responses that resemble those of a prototype case (see, e.g., [13]). Detection of these complex physiological patterns not only enables demarcation of important clinical events but can also elucidate hidden dynamical structures that may be suggestive of disease processes. In all such cases, it is important to have similarity techniques which, in conjunction with visual analysis tools, will enable analysts to complete their tasks quickly and accurately.
In this work, our goal is to present and discuss similarity techniques for ordered observations between time series and non-time dependent data. The coupling analysis can be obtained via a direct method ([1]), which is a simple squared measure of distance between two series.
Since we are focusing on similarities between two series, the comparison can be achieved more effectively via indices, advanced indices and the associated (index) matrices. More specifically, as we will see below, the comparison concentrates on the extreme parts of the series and the index quantifies the degree of similarity between these specific parts of the series. In general, such indices, referred to in this work as
indices, compare the simultaneous time pairing of the
K maximum and/or minimum values (denoted by the parameter
M) between
N time series based on some statistical function denoted by
μ. A standard
index was formally defined in [1]. Applications of such indices include the measurement of displacements, rotations, moments and forces for two types of floating wind turbines ([14]) and epidemiological data ([15]).
For details on the origin of these ideas, one may refer to a review paper by Makris et al. ([16]).
In this work, we present the evolution of the idea of the similarity indices and propose advanced dimensionless indices and the associated index matrices, which are both easily interpreted and provide a more effective comparison of the series involved than the one achieved by the standard indices previously proposed. The rest of the paper is organized as follows. In Section 2, we present some preliminary definitions and review results about standard indices, illustrated by an example. A generalization to the multivariate case is also presented. In Section 3, we propose new advanced dimensionless indices and index matrices for the efficient comparison of time series. A discussion on the related concept of cointegration is also included. In Section 4, a dataset on influenza-like-illness (ILI) cases in Greece is examined to illustrate the usefulness and the importance of the proposed methodology. In Section 5, we discuss the parameter μ and present an application of the indices in economics and marketing, defining a novel elasticity. In Section 6, we study the effect of the parameter M and in Section 7, the indices are defined for non-time dependent data and a modified direct measure for this type of data is also discussed. Finally, in Section 8, we discuss the parameter N and its possible reduction in size. The paper concludes with some general comments and conclusions.
2. Preliminary Definitions
Most of the quantities defined in this work depend on three parameters which are denoted by M, K and N where
M takes two values and depending on whether we are dealing with the maximum or the minimum values of a series or a data set,
K represents the number of ordered observations used for the analysis, where n is the sample size and
N is the number of time series or data sets involved in the analysis with .
A class of dimensionless indices was recently defined by Makris and Vonta [1] that depends on a basic statistical characteristic μ, such as the mean (average), the variance, the correlation, etc., of K ordered (largest or smallest) observations of a series. More specifically, a basic characteristic of K ordered observations is compared (through division) with a basic characteristic of all (total) observations involved in the analysis, with the latter not necessarily the same as the one used for the K ordered observations. The indices are considered to be dimensionless since we divide the same type of quantities (the same statistical characteristic). Even if two different characteristics are used, the index will remain dimensionless as long as the characteristics are in the same unit of measurement (e.g., mean for the numerator and standard deviation for the denominator or vice versa).
For the definition, consider two time series i and j and let the time point at which the kth ordered observation of the series i has occurred. Having available K time points corresponding to the K largest (or smallest) ordered observations of a time series i, which plays the role of the basis for the index evaluation, we proceed and calculate the basic statistical characteristic of the K observations of the second series j conditionally on the time points that the baseline time series i displays its K largest (or smallest) values.
Definition 1 (see [1]). For two time series i and j, the index is defined by
where the notation denotes the K time points where the series i displays its K smallest values and the K time points where the series i displays its K largest values. In addition, is the notation for all the observations of the time series j. Observe that the above index can be obtained using as a basis the series
j, resulting in the index
. Naturally the index is not symmetric since the time points of the
K ordered observations of one series do not necessarily occur at the exact same time points of the other series. Extending the idea of the above index one could also evaluate the index for the same series obtaining the indices
and
. The combination of the above indices can be represented in a
matrix which, in general, is denoted by
where
represents the statistical function (characteristic) to be used in the analysis. The matrix created below refers to the case of the
K smallest observations of the baseline series
i when the basic characteristic
is the standard deviation (
Std):
From the matrix above, observe that the diagonal elements refer to the baseline time series itself while the off-diagonal elements refer to one time series conditionally on the ordered time points of the other. Thus, the evaluation of the statistical characteristic (the standard deviation here) for each row k of the matrix is done conditionally on the time series k.
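The verbal description above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function names (`ordered_time_points`, `similarity_index`, `index_matrix`) and the toy data are hypothetical, the characteristic is passed as a plain function `mu`, and ties are broken arbitrarily by sort order rather than by the rule of Remark 2.

```python
import numpy as np

def ordered_time_points(series, K, largest=True):
    """Time points (array positions) of the K largest or K smallest values."""
    order = np.argsort(np.asarray(series, dtype=float), kind="stable")
    return order[-K:][::-1] if largest else order[:K]

def similarity_index(base, other, K, largest=True, mu=np.mean, mu_total=None):
    """Characteristic of `other` at the extreme time points of `base`,
    divided by a characteristic of all observations of `other`."""
    other = np.asarray(other, dtype=float)
    mu_total = mu if mu_total is None else mu_total  # denominator may differ
    t = ordered_time_points(base, K, largest)
    return mu(other[t]) / mu_total(other)

def index_matrix(series_list, K, largest=True, mu=np.mean):
    """Row i holds the indices of every series conditional on series i."""
    N = len(series_list)
    out = np.empty((N, N))
    for i, base in enumerate(series_list):
        for j, other in enumerate(series_list):
            out[i, j] = similarity_index(base, other, K, largest, mu)
    return out

# Toy data (hypothetical), K = 2 largest values, characteristic = average.
a = [1, 5, 2, 8, 3, 7]
b = [2, 4, 1, 9, 3, 3]
matrix = index_matrix([a, b], K=2)
```

The resulting matrix is not symmetric in general, because the extreme time points of `a` and `b` need not coincide.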
Remark 1. Note that in some instances the numerator will always be smaller or larger than the denominator. Indeed, if the mean (average) is used as the statistical characteristic, then for the diagonal elements the numerator will always be smaller than the denominator if the K smallest observations are used and larger if the K largest observations are used. In addition, in the case where the maximum values are studied, the indices on the diagonal are most often larger than the off-diagonal elements in the same row, because the values in the same row are calculated conditionally on the time points where the diagonal elements are defined (see Example 1 below). In the case where the minimum values are examined, the diagonal elements most often take the minimum value in each row due to the same conditional argument.
The general form of the matrix for a general statistical characteristic μ for N time series, given in [1], is of dimension N × N and is presented as follows:
Remark 2. In the case where the maximum values are studied () and there are ties, for example when a value, say , is tied with other values within the range under investigation, e.g., , then we select among those time points the one that maximizes the index in the remaining time series because, this way, we are being more conservative in terms of the similarity between the N series. Notice that the time point that provides the maximum index for each of the remaining time series might be different for each series. This remark will become clear in the following example. Analogously, for , we choose each time the time point that minimizes the index in the remaining time series.
Example 1. For a better understanding of the previous Remarks, consider three time series and C consisting of 20 values (see Figure 1): and
For the time series
A, we observe that
and occurs at
while
and appears twice in the series at time points 4 and 11, namely
. Obviously, for
and for the characteristic
being the average (AV),
while for
,
. Observe now that for
,
if the 4th observation of the series
B is used and
if the 11th observation of the series
B is used. Similarly, we get
if the 4th observation of the series
C is used. Instead, if the 11th observation is used in the calculation then
. The results are summarized in
Table 1 for
.
Based on Remark 2, the first row of the matrix
is given as
Analogously, we will deal now with the case of the minimum values of the time series. For the time series
A, we observe that
and occurs at
while
and appears twice in the series at time points 9 and 17, namely
. Obviously, for
and for the characteristic
being the average (AV),
while for
,
. Observe now that for
,
if the 9th observation of the series
B is used and
if the 17th observation of the series
B is used. Similarly we get
if the 9th observation of the series
C is used. Instead, if the 17th observation is used in the calculation then
. The results are summarized in
Table 2 for
.
Based on Remark 2, the first row of the matrix
is given as
Multivariate Indices
For a better comparison between time series, a multivariate index could be used based not on a single but rather on multiple statistical characteristics. Thus, μ
could be increased in dimension in order to include more than one statistical function and be a vector of higher dimension. For example, for the control of four statistical characteristics simultaneously (i.e., average
, standard deviation
, coefficient of variation
and covariance
), the index
can have dimension four creating a generalized matrix of dimension
defined in (
3) with
:
We should stress that each element of the matrix (3) is itself a vector of dimension 4, which contains the values of the four indices that correspond to the four statistical measures, for the case under consideration.
Multivariate indices could become extremely useful in stochastic ordering and majorization both of which play a key role in many areas of statistics as for example, in reliability theory and engineering. For instance, in regard to majorization, an
n−dimensional vector
is said to be majorized by another
n−dimensional vector
, denoted by
if
where
and
refer to the ordered elements of the two vectors. We use the term
strict majorization if in the above we use strict inequality. Furthermore, if the inequality holds for all
js including the case
, we are dealing with
weak majorization. For details about majorization, please see [17].
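As an illustration of the majorization concept just described, the sketch below checks the standard textbook conditions on partial sums. It assumes the usual convention in which the elements of both vectors are arranged in decreasing order, which may differ from the ordering convention of the displayed definition; the function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def _decreasing_partial_sums(v):
    """Cumulative sums of the elements of v arranged in decreasing order."""
    return np.cumsum(np.sort(np.asarray(v, dtype=float))[::-1])

def is_majorized(x, y, tol=1e-12):
    """x is majorized by y: partial sums of x never exceed those of y
    for j = 1, ..., n-1, and the two total sums are equal."""
    cx, cy = _decreasing_partial_sums(x), _decreasing_partial_sums(y)
    if cx.shape != cy.shape:
        raise ValueError("vectors must have the same dimension")
    return bool(np.all(cx[:-1] <= cy[:-1] + tol) and abs(cx[-1] - cy[-1]) <= tol)

def is_weakly_majorized(x, y, tol=1e-12):
    """Weak version: the inequality is required for every j, including j = n,
    and the total sums need not be equal."""
    cx, cy = _decreasing_partial_sums(x), _decreasing_partial_sums(y)
    if cx.shape != cy.shape:
        raise ValueError("vectors must have the same dimension")
    return bool(np.all(cx <= cy + tol))
```

For example, the vector (1, 1, 1) is majorized by (3, 0, 0), while (1, 1, 0) is only weakly majorized by (2, 1, 0) because the totals differ.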
Before closing the section and moving to the next one where some new advanced indices will be proposed, it is important to mention that the indices discussed in this work are applicable to both stationary and non-stationary processes. Although the series to be compared are expected to be of the same nature and as such the comparison is meaningful, in general, the use of stationary or non-stationary data and whether a differencing should be applied, is a challenging problem within the framework of similarity measures, that goes beyond the scope of the present work and is left as an open problem for a future project.
3. Advanced Dimensionless Indices
For a better and more efficient comparison between series, we propose in this section two new advanced index matrices that depend on the same three parameters M, K and N as the ones in Definition 1. The first index matrix is obtained when each element of each row in matrix (2) is divided by the diagonal element of the same row. The resulting new index matrix is denoted by
. As expected, this matrix has all its diagonal elements equal to one while all off-diagonal elements take non-zero values.
In a similar way, we propose the 2nd index matrix
where each element of each column in matrix (2) is divided by the diagonal element of the same column. Both index matrices can be viewed as similarity measures between the series involved. The proposed index matrix measures are given below:
and
The above proposed advanced indices (index matrices) are dimensionless, while their interpretation is much clearer than that of the standard indices presented in Definition 1 because they provide an efficient pairwise comparison between series and could be considered as the percentage of similarity between two time series i and j. For instance, if for the matrix the value of the (1,2)-element is 0.75, this means that the second time series is 75% similar as compared with the first series, or in other words, the index of the second series is 75% of the value of the corresponding index of the first series, when the calculations are conditional on the time points where the first time series achieves its K maximum or minimum values. The conclusion is the same if this index turns out to be equal to 1.75.
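The two normalizations described in this section (division by the diagonal element of the same row, and division by the diagonal element of the same column) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the characteristic is the average, the K largest values are used, and ties are broken by sort order.

```python
import numpy as np

def standard_index_matrix(series_list, K, mu=np.mean):
    """Standard index matrix for the K largest values: entry (i, j) is mu of
    series j at the K extreme time points of series i, over mu of all of j."""
    data = [np.asarray(s, dtype=float) for s in series_list]
    out = np.empty((len(data), len(data)))
    for i, base in enumerate(data):
        t = np.argsort(base, kind="stable")[-K:]
        for j, other in enumerate(data):
            out[i, j] = mu(other[t]) / mu(other)
    return out

def row_normalised(matrix):
    """Each element divided by the diagonal element of its own row."""
    m = np.asarray(matrix, dtype=float)
    return m / np.diag(m)[:, None]

def column_normalised(matrix):
    """Each element divided by the diagonal element of its own column."""
    m = np.asarray(matrix, dtype=float)
    return m / np.diag(m)[None, :]

a = [1, 5, 2, 8, 3, 7]            # toy data, hypothetical
b = [2, 4, 1, 9, 3, 3]
m = standard_index_matrix([a, b], K=2)
by_row, by_col = row_normalised(m), column_normalised(m)
```

With the average as the characteristic, the column-normalized entry (i, j) reduces to the average of series j at the extreme time points of series i divided by the average of the K largest values of series j itself, so it equals 1 when the two sets of time points coincide.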
The comparison through the matrix measure is indirect and becomes more apparent when the values of one time series are multiples of the values of the other, because the element and therefore the statistical function based on the values of the time series j in the numerator at the time points of maximum (or minimum) values of the time series i is compared against the statistical function based on the values of the time series i in the denominator. That is, the closer the indices in value to the proportionality parameter (or its inverse depending on the case) between two time series, the more confident we are about their similarity in terms of the occurrence of their maximum (or equivalently minimum) values.
In the case of the matrix index , observe that the comparison is direct as the index and therefore the statistical function of the time series j in the numerator is based on the time points of maximum (or minimum) values of the time series i and compared against the statistical function based on itself in the denominator.
Thus, if for two time series 1 and 2 the series 2 has an index value
, then this means that the second time series presents its
K maximum values at the exact same time points as the first time series, and vice versa, if time series 1 has index
, then the first series presents its maximum values at the same time points as the second series. In fact, in general, we have:
Although this work is devoted to similarity, we cannot overlook the fact that it is, at the same time, interrelated with the concepts of causality and cointegration, which are briefly mentioned below for the sake of completeness. The notion of causality is rather common in economic time series, although causality issues could also be found in reliability or engineering. The difficulties of establishing a causal relationship between economic variables led Granger ([18]) to develop the economic concept of causality known as Granger Causality. On the other hand, the concept of cointegration was established later and discussed thoroughly in [19], where the associated statistical inference, including tests used to identify the long-term relationships between two or more series, was explored. The phenomenon is also quite common since, in general, economic theory forces certain pairs of series to stay close to each other and move together. For testing cointegration, one could use the Engle–Granger Augmented Dickey–Fuller test for cointegration (EG-ADF test, [19]), based on the classical Dickey–Fuller test, if two series are involved, or the Johansen test if more than two series are involved ([20]). For further reading, the interested reader is referred to [21,22] and to the interesting review article by Hubrich et al. [23].
In Microeconomics, another concept closely related to the above is the concept of elasticity, which is a measure of the sensitivity of a variable to a change in another variable. For instance, although the prices of some goods are inelastic, this is not always the case. This issue, which is of great importance in marketing, is explored through an example in Section 5, where a special price elasticity index is presented and discussed.
In the sections that follow, the quantities that play a key role in the analysis of the proposed index matrices of this section, namely the function (or characteristic) μ and the parameters M, K and N, will be thoroughly explored.
4. An Epidemiological Application
In this section, we examine time series on influenza-like-illness (ILI) rates for the purpose of identifying differences between (geographical) regions as well as differences between each region-rate and the country-rate. Usually, the purpose of such analyses is the identification of areas for which further monitoring and actions for reducing the spread of a disease are needed.
The data have been drawn from the Sentinel system of the Hellenic National Public Health Organization (EODY) for the period 2004–2014. From the data entered into the system by the physicians, the ILI rate is calculated weekly as the number of ILI cases per 1000 visits (1000*cases/visits), a rate that displays the spread of the disease to the community.
In this work, we will present and compare six time series which report the weekly ILI-rate for the time period of the 28th week of 2004 up to and including the 39th week of 2014 in four geographical regions in which Greece is divided. The time series which report the ILI-rate for the four regions will be denoted by ILI-1 to ILI-4. For the entire country, two series are available, the overall ILI-rate as a weighted average of ILI-1 through ILI-4 (based on appropriate weights depending on the population size per region), denoted by ILI and the ILI-total that reports the total number of cases per 1000 visits (1000*cases/visits). Through the matrix
given in (2), we compare the similarity of the six time series based on the
indices where the parameter
is the average,
,
and the parameter
K takes two values,
(
Table 3) and
(
Table 4).
The
element of the matrix in
Table 3 is equal to 8.6051, which means that the average of the 10 largest observations of the series ILI is about 8.6 times larger than the average of all ILI observations. Observe that the indices of the series ILI-1, ILI-2 and ILI-3 (8.2869, 8.4281, 8.4036), evaluated conditionally on the time points where the country-rate displays its 10 largest observations, are very similar, but the same is not true for the fourth region, where a much lower value is observed (7.6249). This implies that the rate of the disease in that specific region (Aegean Sea islands and Crete) is not as high as in all other regions. The time series ILI-total displays an even smaller similarity (7.36) with the time series ILI, a fact that is probably attributed to the better definition of the ILI-rate as a weighted average of ILI-1 through ILI-4.
From the results of
Table 4 (for
), we observe that all diagonal elements as well as almost all off-diagonal elements are reduced in size as compared to the elements of
Table 3. Observe that as
K increases, the diagonal elements approach 1.
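The observation that the diagonal elements approach 1 as K increases can be checked numerically. The sketch below uses a synthetic positive series (not the ILI data) and the average as the characteristic; the function name `diagonal_index` and the chosen values of K are hypothetical.

```python
import numpy as np

def diagonal_index(x, K, largest=True):
    """Average of the K largest (or smallest) values over the overall average."""
    x = np.asarray(x, dtype=float)
    ordered = np.sort(x)
    chosen = ordered[-K:] if largest else ordered[:K]
    return chosen.mean() / x.mean()

rng = np.random.default_rng(0)
rates = rng.gamma(shape=2.0, scale=10.0, size=200)   # synthetic, positive values

ks = (10, 50, 100, 200)
values = [diagonal_index(rates, K) for K in ks]
for K, v in zip(ks, values):
    print(f"K = {K:3d}: diagonal index = {v:.4f}")
```

Since the average of the K largest values cannot increase when further, smaller observations are added, the diagonal index decreases towards 1 and equals 1 when K reaches the sample size.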
The matrix
given in (4) for
being the average and
is presented in
Table 5. According to the definition of
, the diagonal elements are all equal to 1. Observe that the first three geographical regions display very similar behavior as compared with the country-rate (ILI) with values at least equal to 96%, while the fourth region is much less similar (88% similarity). In
Table 6 for
, we observe that the differences between all time series are alleviated and the indices approach 1. The fourth geographical region constitutes an exception since it reaches about 88% similarity as compared with the whole country and even less similarity as compared with the other three regions.
Epidemiological differences accurately identified among regional series are useful to health officials, since they provide a tool for identifying, as early as possible, disease outbreaks in certain regions. This is beneficial to society in general, both for the early detection of extreme, possibly harmful, events and for the prevention of their spread.
5. The Function μ
The measure which refers to the ratio of the statistical functions considered is an important factor in the analysis of the data through the indices.
In previous sections, μ
was chosen to be a basic statistical function like the average, the standard deviation, the coefficient of variation, etc. In this section, we deal with the case where μ
is a differential (backshift) operator denoted by
d referring to the first differences of a time series. We denote by
the differential based on first differences between the
K maximum (or minimum) values of a time series. Thus,
is a
dimensional vector defined by
while
denotes the operator applied to all (total) values of the time series and is of dimension
. The corresponding index is defined below.
with an alternative version being defined as
where the denominator relies on any statistical function including any ordered observation. Indeed, in (
7) with
and
, the operator
d can be the classical first difference between the two largest observations for the series
i, namely
divided or weighted by either the mean of all observations (see (
8)) or the maximum of the two largest values (see (
9)) or the minimum among the two largest observations (see (
10)). Observe that the last one is nothing but the well-known
relative or percentage change. The relevant expressions are provided below:
and
In contrast to the above definitions, the index (
11) calculates the ratio of two indices. More specifically it calculates the ratio of the percentage change of the time series
i to the percentage change of the time series
j conditionally on the time points the time series
i displays its
K maximum (or minimum) values. A comment would be that the function
could be equal to 1 or any other constant. Its existence is to achieve weighting and to ensure the dimensionless property of the indices.
We define the new index as
Example 2. For the case of the index defined in (11), and for the case of three maximum values of the time series i (i.e., ), a vector of dimension 2 can be derived in the numerator of the index and therefore the index is actually the 2-dimensional vector
while for and we have
Example 3. (Price elasticity of demand). The index defined in (11) has many applications, mainly in economics and marketing. For example, if the time series i in (11) is the demand for a good, say A, measured in Q units and if the time series j reports the corresponding price values, say P, of the good, then the resulting index is a measure of the response of the maximum (or minimum) quantity demanded of a good, relative to the change in its price, with all other factors considered constant (see index (15) below). In other words, the index expresses the ratio of the percentage change of the maximum (or minimum) quantity demanded of the good to the percentage change of its price, known as the price elasticity of the maximum (or minimum) demand denoted by , i.e., a novel form of the known price elasticity of demand ([24]). For the classical case with for two series, we have
which can also be denoted by
where stands for the ratio . Note that in the general K case we have
It should be noted that if the price is replaced by income, then the resulting index will be a novel elasticity of demand for income, while if the price is substituted by the demand for a complementary good (relative to good A), the resulting index will be a novel cross-elasticity of demand.
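A numerical sketch of the elasticity of the maximum demand may help fix ideas. It rests on several assumptions that are not taken from the displayed formulas: percentage changes are computed between consecutive ordered demand values with the smaller one as the base, the price changes are evaluated at the same time points, and the function name and data are hypothetical.

```python
import numpy as np

def max_demand_elasticity(quantity, price, K=2):
    """Vector of K-1 ratios: percentage change of demand between consecutive
    ordered (largest) demand values over the percentage change of price at
    the same time points. A zero price change yields an infinite ratio."""
    q = np.asarray(quantity, dtype=float)
    p = np.asarray(price, dtype=float)
    t = np.argsort(q, kind="stable")[::-1][:K]    # time points, largest first
    pct_q = (q[t[:-1]] - q[t[1:]]) / q[t[1:]]     # relative change in demand
    pct_p = (p[t[:-1]] - p[t[1:]]) / p[t[1:]]     # relative change in price
    return pct_q / pct_p

# Hypothetical demand (units) and price series observed at the same times.
demand = [100, 120, 90, 150, 110]
price = [10.0, 9.0, 11.0, 8.0, 9.5]

e2 = max_demand_elasticity(demand, price, K=2)    # one component
e3 = max_demand_elasticity(demand, price, K=3)    # two components
```

In this toy example the two largest demands are 150 and 120, observed at prices 8 and 9, so the single component for K = 2 is 0.25 / (−1/9) = −2.25, a negative value as expected when demand rises while price falls.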
6. Parameter M and Cross-Correlation Indices
The two values of the parameter M, one involving the minimum and one the maximum values of a data set, can be used in combination, connecting maximum and minimum values between time series simultaneously. This proposal, that is, to connect the occurrence of maximum and minimum values between time series, is inspired by variables that are complementary, such as the prices of two substitute goods (such as matches and lighters), where, as is well known, when the price of a good rises, the price of the substitute good goes down, with the result that at the time points where one good attains its maximum prices, the other (namely the substitute good) attains its minimum prices.
For two time series
i and
j, when we are interested in comparing the information from the maximum and the minimum values simultaneously, we propose the definition below with the notation
for the new index. More specifically this definition entails two parts, one for the case
and one for the case
. In the first case, the
K values of the time series
j used to calculate the function
are the time points where the
K maximum values of the time series
j are presented, whereas for the case
the
K values of the time series
j used in order to calculate the function
are the time points where the
K minimum values of the time series
i are presented. In this way, a cross-coupling of the time points of occurrence of the maximum and minimum values of the time series
i and
j is established. Obviously, another definition arises when we replace in the above the maximum values with the minimum values and vice versa.
Based on the definition (17), a matrix defined in (18) can be created and denoted by
. Through this matrix and based on the values of each of
N time series in general, it can be seen whether a cross relationship of maximum and minimum values between them exists.
As an example, consider two time series 1 and 2. If the cell value (which is calculated based on the K maximum values of the time series 1) is the same or close to the value of the cell (which is calculated on the values of the time series 2 conditionally on the time points where the K minimum values of the time series 1 are presented), then this means that wherever the time series 1 presents its K maximum values, the time series 2 will present its K minimum values. Alternatively, if the value of the cell is very close to the value of the cell , then this means that wherever the time series 2 presents its maximum values, the time series 1 presents its minimum values. One of the aforementioned results does not imply the second and vice versa, but when both are valid then there will be a time cross correlation between the two time series in terms of their K maximum and minimum values.
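The comparison of cells described in this example can be sketched numerically. The construction below is an assumption based on the verbal description only, not on the displayed definition: the diagonal cell (i, i) uses the time points of the K maximum values of series i, while the off-diagonal cell (i, j) uses the values of series j at the time points of the K minimum values of series i. The function name and the data are hypothetical.

```python
import numpy as np

def cross_coupling_matrix(series_list, K, mu=np.mean):
    """Assumed layout: diagonal cells use own-maximum time points, and
    off-diagonal cells use the minimum time points of the row series."""
    data = [np.asarray(s, dtype=float) for s in series_list]
    N = len(data)
    out = np.empty((N, N))
    for i, base in enumerate(data):
        order = np.argsort(base, kind="stable")
        t_max, t_min = order[-K:], order[:K]
        for j, other in enumerate(data):
            t = t_max if i == j else t_min
            out[i, j] = mu(other[t]) / mu(other)
    return out

# Two perfectly mirrored toy series: one peaks where the other bottoms out.
s1 = [1, 2, 3, 4, 5]
s2 = [5, 4, 3, 2, 1]
cross = cross_coupling_matrix([s1, s2], K=2)
```

For these mirrored series every cell equals 1.5, so the diagonal and off-diagonal cells of each row agree, which is the pattern the text associates with a cross relationship between maximum and minimum values.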
Caution is required regarding the possibility of spurious correlation. Since the cross-correlation mentioned earlier may be due to the presence of a hidden, confounding factor, certain measures should be taken to investigate such a possibility. It should be noted that spurious correlation is not uncommon and it surfaces not only in economics and finance but also in the behavioral sciences.
8. Parameter N
The parameter N stands for the number of data sets (which may be time dependent or not) that are analyzed, thus creating in each case an index matrix of dimension N × N.
In case many data sets are to be analyzed, namely the value of N is large and therefore the Matrix is difficult to deal with, there is a need to discard some data sets that are not important to be present in the analysis (e.g., when the values of the indicators are very small). We introduce therefore an additional parameter and we have a new notation which displays the dependence of N on . Our purpose for doing that is to discard from the final analysis those data sets that are not significant based on a criterion.
The criterion for keeping a data set is its closeness to another data set which can be measured by a typical distance, e.g., the Euclidean distance or the absolute value based on the indices
. The selection of the data sets that remain in the matrix is performed in two steps. In the first step, the
smallest absolute differences for
are kept in each row with the result being that the initial
matrix becomes an
matrix. Let us, for simplicity, denote these differences by
for each row
. In the second stage, the absolute differences on each row are summed up as
Finally, the
smallest of those sums are selected and the corresponding rows are kept into the matrix (so the
matrix reduces to the final
matrix) and gives rise to a matrix denoted by
. To be more explicit, the resulting matrix contains the rows with row sum
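The two-step selection can be sketched as follows. Several details are assumptions, since they are not recoverable from the text: the absolute differences are taken between each off-diagonal index and the diagonal index of the same row, the number of data sets to keep is passed as the hypothetical argument `n_keep`, and the function returns the retained row positions together with the row sums rather than a rebuilt matrix.

```python
import numpy as np

def reduce_datasets(index_matrix, n_keep):
    """Step 1: keep the n_keep smallest absolute differences in each row.
    Step 2: sum them per row and keep the n_keep rows with the smallest sums."""
    a = np.asarray(index_matrix, dtype=float)
    N = a.shape[0]
    if a.shape != (N, N) or not 1 <= n_keep <= N - 1:
        raise ValueError("need a square matrix and 1 <= n_keep <= N - 1")
    diffs = np.abs(a - np.diag(a)[:, None])   # assumed: distance to own diagonal
    np.fill_diagonal(diffs, np.inf)           # a data set is not compared with itself
    kept = np.sort(diffs, axis=1)[:, :n_keep]           # N x n_keep
    row_sums = kept.sum(axis=1)
    rows = np.sort(np.argsort(row_sums, kind="stable")[:n_keep])
    return rows, row_sums

# Hypothetical 4 x 4 index matrix with unit diagonal.
A = [[1.00, 0.90, 0.50, 0.95],
     [0.92, 1.00, 0.40, 0.90],
     [0.30, 0.20, 1.00, 0.25],
     [0.97, 0.93, 0.45, 1.00]]
rows, sums = reduce_datasets(A, n_keep=2)
```

In this toy case the third data set is far from all the others and is discarded, while the first and fourth data sets, which have the smallest row sums, are retained.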
9. Conclusions
In this work, a method of data analysis was presented based on the proposed indices, which are calculated as a ratio of two statistical functions. We have studied the parameters involved in the indices and how these parameters affect the indices. More specifically, we examined the parameter (function) μ with various examples and an application to economics and marketing on the price elasticity of demand. We have also studied the parameter M and how we can use the maximum and minimum values simultaneously to perform a cross-correlation of times. Furthermore, we examined the reduction of the size of the parameter N by discussing how some data sets that are not significant (important) to be present in the analysis can be discarded (e.g., when the values of their indices are very big), namely, we reduce the size of the parameter N and thus the number of the data sets kept in the analysis. Finally, the indices for data independent of time (namely random variables) are discussed through explanatory examples.
The proposed indices can be used as powerful statistical tools in similarity matching problems involving pattern matching, anomaly identification and/or frequent pattern detection. The applicability of the proposed methodology goes beyond neuroscience, physiology and epidemiology, all of which have been mentioned in this work. Indeed, such techniques can play a vital role in various scientific fields like financial mathematics, economics, management, geosciences, stylometry or music retrieval. Examples include, among many others, the identification of companies with similar growth patterns, products with similar selling patterns and dissimilar seismic waves for spotting geological irregularities. Finally, music retrieval and the detection of plagiarism in literature and music will greatly benefit from the implementation of the proposed methodology.