1. Introduction
The problem of time series forecasting is to estimate the future values of a process from a sequence of its observations. The task arises in many practical applications, from predicting future stock prices to forecasting air temperature. Many different approaches to solving it exist today. Classical statistical models such as exponential smoothing and the autoregressive integrated moving average (ARIMA) model are very popular, highly accurate, and relatively easy to use; a detailed description of these methods can be found in [1,2]. Neural networks [3,4,5] are also widely used, especially on large datasets. However, no single method is best in all situations, and the development of new forecasting techniques remains relevant.
This work is based on an information-theoretic approach to time series forecasting. As was shown in [6], the problems of data compression and prediction are closely related, and an asymptotically optimal method for predicting stationary stochastic processes can be based on a universal code (see also [7]). In [8] it was shown how any lossless data compression algorithm can be used to predict finite-alphabet and real-valued time series.
In this paper, we do not focus on the asymptotic properties of algorithms but consider the use of this approach in practical situations. The main contributions of the work are summarized as follows. First, we describe how to use arbitrary data compressors for time series forecasting. Such compressors are a promising tool for forecasting because many of them implement universal codes with numerous modifications that increase the level of compression in real-world situations. It is important to note that modern data compression techniques are based on several different approaches to universal coding: the Burrows-Wheeler Transform (BWT) [9], the Prediction by Partial Matching (PPM) algorithm [10], and the Lempel-Ziv family of algorithms [11,12], among others. Some algorithms [13,14,15] search for a compact context-free grammar that unambiguously represents the sequence to be compressed; one can see this approach as a kind of artificial intelligence. Previous works either described the theoretical possibility of using universal codes for prediction without practical applications [8] or presented the use of a single universal code for prediction [7].
Secondly, we propose an adaptive approach to time series forecasting, which is useful in situations when we do not know in advance which data compressor is optimal for a given time series.
Thirdly, we describe how the proposed techniques can be used to predict multivariate data. To test the proposed techniques, we make forecasts for real-world data such as sunspot numbers and some social indicators of the Novosibirsk region, Russia.
It should be noted that our approach can be complemented with well-known time series transformation and adjustment techniques. For example, in this work, we use differencing and smoothing in the computational experiments.
The rest of the paper is structured as follows. In the next section, we describe the mathematical foundations of using arbitrary data compression techniques for forecasting finite-alphabet time series. After that, we describe generalizations of the model to real-valued and multivariate cases. Then we give some examples of the practical use of the presented techniques. Next, we propose an adaptive algorithm that can significantly reduce computation time when multiple data compressors are used. Further, we use the proposed algorithm to predict sunspot numbers. At the end of the paper, we describe the limitations of the proposed approach and make some conclusions.
2. Data Compression and Prediction
2.1. Predicting Finite-Alphabet Time Series
Time series with finite alphabets are the most convenient for forecasting using data compression algorithms. Suppose we have a sequence $X = x_1 \ldots x_t$, $x_i \in A$, where $A$ is a finite set (an alphabet), and we want to give a prediction for $x_{t+1} \ldots x_{t+h}$, $h \geq 1$. Denote as $A^n$ the set of all sequences of length $n$ over $A$ and $A^* = \bigcup_{n=1}^{\infty} A^n$. A uniquely decodable data compression method (or code) $\varphi$ is a set of mappings $\varphi_n : A^n \to \{0,1\}^*$, $n = 1, 2, \ldots$, such that for any sequence of words $X_1, X_2, \ldots, X_r$, $X_i \in A^n$, $r \geq 1$, the sequence $\varphi_n(X_1)\varphi_n(X_2)\ldots\varphi_n(X_r)$ can be uniquely decoded as $X_1 X_2 \ldots X_r$. The compressed size (in bits) of a sequence $X$ we denote as $|\varphi(X)|$. We can get a probability distribution on $A^t$ using $\varphi$ by

$$P_{\varphi}(X) = \frac{2^{-|\varphi(X)|}}{\sum_{Y \in A^t} 2^{-|\varphi(Y)|}}. \qquad (1)$$
A code $\varphi$ is called universal if for any stationary and ergodic measure $\mu$

$$\lim_{t \to \infty} \frac{1}{t}\,|\varphi(x_1 \ldots x_t)| = h(\mu)$$

with probability 1, and

$$\lim_{t \to \infty} \frac{1}{t}\,E_{\mu}\bigl(|\varphi(x_1 \ldots x_t)|\bigr) = h(\mu),$$

where $h(\mu)$ is the entropy rate [16] of $\mu$ and $E_{\mu}(f)$ is the expected value of $f$. In [8] it was shown that if $x_1 \ldots x_t$ is generated by a stationary and ergodic stochastic process and $\varphi$ is a universal code, then (1) is, in a certain sense, a nonparametric estimate of the unknown probability measure $\mu$. More precisely, the following theorem was proved.
Theorem 1. If $\mu$ is a stationary ergodic measure and $\varphi$ is a universal code, then the following equalities hold:

1. $\lim_{t \to \infty} \frac{1}{t} \log \frac{\mu(x_1 \ldots x_t)}{P_{\varphi}(x_1 \ldots x_t)} = 0$ with probability 1,
2. $\lim_{t \to \infty} \frac{1}{t}\, E_{\mu}\!\left( \log \frac{\mu(x_1 \ldots x_t)}{P_{\varphi}(x_1 \ldots x_t)} \right) = 0$,
3. $\lim_{t \to \infty} \frac{1}{t} \sum_{i=1}^{t} E_{\mu}\!\left( \sum_{a \in A} \bigl( \mu(x_i = a \mid x_1 \ldots x_{i-1}) - P_{\varphi}(x_i = a \mid x_1 \ldots x_{i-1}) \bigr)^2 \right) = 0$.
We can use (1) to estimate the conditional probability that $x_{t+1} \ldots x_{t+h} = Y$ for some $Y \in A^h$ as

$$P_{\varphi}(x_{t+1} \ldots x_{t+h} = Y \mid x_1 \ldots x_t) = \frac{2^{-|\varphi(XY)|}}{\sum_{Z \in A^h} 2^{-|\varphi(XZ)|}}. \qquad (2)$$

As we can see from (2), the conditional probability of $Y$ depends on how well $Y$ can be compressed after $X$ (relative to any other $Z \in A^h$).

Suppose that $A$ is a finite set of integers. We can give a point forecast for step $j$, $1 \le j \le h$, as

$$\hat{x}_{t+j} = \sum_{a \in A} a \cdot P_{\varphi}(x_{t+j} = a \mid x_1 \ldots x_t),$$

i.e., we compute the mean over the marginal distribution for step $j$.
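To make (2) operational, the sketch below estimates the conditional distribution and the point forecast with a general-purpose compressor. It is a minimal illustration, assuming Python's standard zlib module as $\varphi$ and a small integer alphabet; the helper names are ours, and any other lossless compressor could be plugged in.

```python
import itertools
import zlib

def code_length(seq, compress=lambda b: zlib.compress(b, 9)):
    """|phi(seq)| in bits for a sequence of small integers (zlib as phi)."""
    return 8 * len(compress(bytes(seq)))

def conditional_distribution(x, alphabet, h, compress=lambda b: zlib.compress(b, 9)):
    """Estimate P(x_{t+1}..x_{t+h} = Y | X) for every Y in A^h via Equation (2)."""
    lengths = {y: code_length(list(x) + list(y), compress)
               for y in itertools.product(alphabet, repeat=h)}
    lmin = min(lengths.values())  # shift by the minimum to avoid underflow
    weights = {y: 2.0 ** (lmin - l) for y, l in lengths.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def point_forecast(dist, h):
    """Point forecast for each step: the mean of the marginal distribution."""
    return [sum(y[j] * p for y, p in dist.items()) for j in range(h)]

# Toy usage: an alternating 0/1 series, 2-step forecast.
x = [0, 1] * 50
dist = conditional_distribution(x, alphabet=(0, 1), h=2)
print(point_forecast(dist, 2))  # favors continuations that compress well after x
```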
Example 1. Consider a sequence $X = x_1 \ldots x_t$ over a known finite alphabet $A$, and suppose we want to make a forecast for its next two values using gzip [17] as $\varphi$. We need to compress every sequence of the form $XY$, where $Y \in A^2$. The compression results, along with the conditional probabilities calculated from them via (2), are presented in Table 1.

Suppose we have a finite set of data compression algorithms $F = \{\varphi_1, \ldots, \varphi_k\}$. If we do not know which $\varphi_i \in F$ is the best predictor for a given series $X$, we can mix the conditional probability distributions yielded by each compressor from $F$ by

$$P_{F}(x_{t+1} \ldots x_{t+h} = Y \mid x_1 \ldots x_t) = \sum_{i=1}^{k} w_i\, P_{\varphi_i}(x_{t+1} \ldots x_{t+h} = Y \mid x_1 \ldots x_t), \qquad (3)$$

where $w_i > 0$, $\sum_{i=1}^{k} w_i = 1$.

Note that (3) works in such a way that the highest probability is given to the continuation $Y$ for which the code length $|\varphi_i(XY)|$ is the smallest for some $\varphi_i \in F$, i.e., (3) selects the best compressor “automatically”.
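As an illustration of (3), the following sketch mixes the distributions produced by several compressors from the Python standard library with equal weights. These three are stand-ins for the set $F$, not the exact algorithms used in our experiments; conditional_distribution is the helper defined above.

```python
import bz2
import lzma
import zlib

# Stand-ins for F = {phi_1, ..., phi_k}: each maps bytes to compressed bytes.
COMPRESSORS = [
    lambda b: zlib.compress(b, 9),
    lambda b: bz2.compress(b, 9),
    lambda b: lzma.compress(b),
]

def mixed_distribution(x, alphabet, h, compressors=COMPRESSORS):
    """Equal-weight mixture of conditional distributions, as in Equation (3)."""
    w = 1.0 / len(compressors)
    dists = [conditional_distribution(x, alphabet, h, compress=c)
             for c in compressors]
    return {y: w * sum(d[y] for d in dists) for y in dists[0]}
```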
2.2. Predicting Real-Valued Time Series
Many time series encountered in practice are sequences of real numbers. To predict such a series using data compression algorithms, we need to convert it into a sequence of integers (with loss of information, obviously). This process of conversion is known as quantization. Consider a sequence $X = x_1 \ldots x_t$, where $x_i \in \mathbb{R}$. Denote its minimal and maximal elements as $m$ and $M$, respectively: $m = \min_{1 \le i \le t} x_i$, $M = \max_{1 \le i \le t} x_i$. Probably the simplest way of conversion is to split the interval $[m, M]$ into a finite number $n$ of disjoint numbered intervals of equal length and replace each $x_i$ with its corresponding interval number: if $x_i$ falls into interval $j$, then we replace $x_i$ with $j$. Later, we can perform the inverse conversion, replacing the indices, for example, with the medians of the corresponding intervals.
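A minimal quantizer along these lines (equal-width intervals, inverse conversion to interval midpoints, i.e., their medians) might look as follows; the function names are ours:

```python
def quantize(xs, n):
    """Replace each real value with the index (0..n-1) of its equal-width interval."""
    m, M = min(xs), max(xs)
    width = (M - m) / n  # assumes M > m
    # The maximum M is assigned to the last interval, index n - 1.
    return [min(int((v - m) / width), n - 1) for v in xs], (m, width)

def dequantize(indices, m, width):
    """Inverse conversion: replace each index with the median of its interval."""
    return [m + (j + 0.5) * width for j in indices]
```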
Now, we consider the question of how to select the number of intervals $n$. On the one hand, if this value is too small, some important regularities may be missed in the converted series. On the other hand, if $n$ is too large, a data compressor may not be able to capture the regularities due to noise in the data. One possible solution to this problem is to employ an approach similar to that used in (3). Let $n$ be some positive power of 2: $n = 2^c$, where $c$ is an integer greater than zero. We can mix the probability distributions obtained using partitions into $2^1, 2^2, \ldots, 2^c$ intervals with some weights. The partition that yields the smallest code length will have the greatest impact on the final result. Denote the number of the interval that contains $x_i$ in the partition into $2^j$ intervals as $x_i^{(j)}$. Then the mixed probability distribution can be defined as

$$P(x_{t+1} \in B \mid x_1 \ldots x_t) = \sum_{j=1}^{c} w_j\, P_{\varphi}\bigl(x_{t+1}^{(j)} \in B^{(j)} \,\big|\, x_1^{(j)} \ldots x_t^{(j)}\bigr), \qquad (4)$$

where $B$ is a set of interval numbers in the finest partition, $B^{(j)}$ denotes the corresponding interval numbers in the partition into $2^j$ intervals, the $w_j$ are non-negative weights, and $\sum_{j=1}^{c} w_j = 1$.
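The sketch below implements one possible reading of (4): it quantizes the series at every resolution $2, 4, \ldots, 2^c$, forecasts the next interval at each resolution, and spreads each coarse interval's probability uniformly over the finest intervals it contains (the uniform spreading is our assumption). It reuses quantize and conditional_distribution from the earlier sketches.

```python
def mixed_over_partitions(xs, c, w=None):
    """Mix one-step forecasts over partitions into 2, 4, ..., 2**c intervals."""
    n_fine = 2 ** c
    w = w or [1.0 / c] * c            # equal weights by default
    mixed = [0.0] * n_fine            # probabilities over the finest intervals
    for j in range(1, c + 1):
        n = 2 ** j
        q, _ = quantize(xs, n)
        dist = conditional_distribution(q, alphabet=range(n), h=1)
        scale = n_fine // n           # finest intervals per coarse interval
        for (a,), p in dist.items():
            for b in range(a * scale, (a + 1) * scale):
                mixed[b] += w[j - 1] * p / scale
    return mixed
```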
2.3. Predicting Multivariate Data
In some practical situations, it is required to predict several related series. In such cases, we can try to utilize the connection between them to improve the accuracy of our forecasts. For example, air temperature and barometric pressure are related, and it might make sense to predict them together. Many univariate time series forecasting techniques have generalizations to the multivariate (or vector) case [
18,
19,
20,
21].
The approach based on data compression algorithms can be used in a multivariate setting too. Suppose we have a vector time series $X = \mathbf{x}_1 \ldots \mathbf{x}_t$, where $\mathbf{x}_i = (x_i^1, \ldots, x_i^d)$, $x_i^j \in \mathbb{R}$, $d > 1$. Let us find the minimal and maximal values for each coordinate: $m_j = \min_{1 \le i \le t} x_i^j$, $M_j = \max_{1 \le i \le t} x_i^j$, $j = 1, \ldots, d$. As in the previous section, we split each interval $[m_j, M_j]$ into a finite number $n$ of intervals and thus obtain $n^d$ $d$-cubes. Then we need to number these cubes and replace each point with the number of the cube it falls into. As a result, we get an integer sequence that can be predicted using the previously described methods.
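A sketch of this conversion follows (our helper name; the cube numbering is mixed-radix, base $n$ per coordinate):

```python
def quantize_vectors(points, n):
    """Replace each d-dimensional point with the number (0..n**d - 1) of its cube."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    width = [(max(p[i] for p in points) - lo[i]) / n for i in range(d)]
    out = []
    for p in points:
        idx = 0
        for i in range(d):  # base-n digits, one per coordinate
            j = min(int((p[i] - lo[i]) / width[i]), n - 1)
            idx = idx * n + j
        out.append(idx)
    return out  # keep n**d <= 256 if feeding code_length() above
```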
2.4. Experiments
In this section, we use the proposed method to predict real-world data. Along with the point forecasts (which are the means of the corresponding distributions), we provide confidence intervals for them. To estimate these intervals, we used the following strategy. Suppose that we want to make an $h$-step-ahead prediction for a time series $X = x_1 \ldots x_t$. Let $s$ be some starting index, $1 \le s \le t - h$. For each series $x_1 \ldots x_i$, where $s \le i \le t - h$, we made an $h$-step forecast and calculated the residuals for each step: $e_j^i = x_{i+j} - \hat{x}_{i+j}$, $j = 1, \ldots, h$. Then we computed the standard deviations of the residuals as $\sigma_j = \sqrt{\frac{1}{N} \sum_{i=s}^{t-h} (e_j^i)^2}$, where $N = t - h - s + 1$. The confidence interval for $\hat{x}_{t+j}$ was calculated as $\hat{x}_{t+j} \pm 1.96\,\sigma_j$.
Example 2. Suppose $X = x_1 \ldots x_t$ and we want to make a 2-step forecast. As explained above, we make 2-step-ahead predictions for the series $x_1 \ldots x_i$, $s \le i \le t - 2$, and obtain the residuals $e_1^i = x_{i+1} - \hat{x}_{i+1}$ and $e_2^i = x_{i+2} - \hat{x}_{i+2}$ for each of them. From these residuals, we compute the standard deviations $\sigma_1$ and $\sigma_2$ for steps 1 and 2. If our forecast for $X$ is $(\hat{x}_{t+1}, \hat{x}_{t+2})$, the confidence interval for the first step is $\hat{x}_{t+1} \pm 1.96\,\sigma_1$ and for the second step is $\hat{x}_{t+2} \pm 1.96\,\sigma_2$.
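In code, the interval estimation could be sketched as follows; forecaster stands for any $h$-step point-forecast procedure, e.g., one built from (2), and the names are ours:

```python
import math

def interval_halfwidths(x, h, s, forecaster, z=1.96):
    """Per-step half-widths z * sigma_j from historical h-step residuals."""
    residuals = [[] for _ in range(h)]
    for i in range(s, len(x) - h + 1):
        f = forecaster(x[:i], h)              # forecast x_{i+1} .. x_{i+h}
        for j in range(h):
            residuals[j].append(x[i + j] - f[j])
    sigmas = [math.sqrt(sum(e * e for e in r) / len(r)) for r in residuals]
    return [z * sg for sg in sigmas]
```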
To get all the predictions presented below, we used a set of eight data compression algorithms, combined using (3). Among them are:

- zlib (https://zlib.net/)—a data compression library that implements the DEFLATE algorithm;
- ppmd—an implementation of the Prediction by Partial Matching (PPM) algorithm [10];
- zstd (https://github.com/facebook/zstd)—an implementation of the Zstandard algorithm.

We mixed these compressors with equal weights, that is, in (3) $w_1 = \ldots = w_k = 1/k$.
Several simple data transformation techniques were also used. First, to remove trends in the data, we took the first difference: $y_i = x_i - x_{i-1}$. Secondly, we used smoothing with a simple moving average.
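These transformations are one-liners; a sketch follows (the window size $k$ is an assumption of ours):

```python
def difference(xs):
    """First difference: y_i = x_i - x_(i-1)."""
    return [b - a for a, b in zip(xs, xs[1:])]

def undifference(first, ys):
    """Invert differencing, given the first original value."""
    out = [first]
    for d in ys:
        out.append(out[-1] + d)
    return out

def smooth(xs, k=3):
    """Simple moving average over a window of k values."""
    return [sum(xs[i - k + 1:i + 1]) / k for i in range(k - 1, len(xs))]
```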
Let us move on to forecasting some indicators of the Novosibirsk region. In the first example, we predict the annual number of unemployed persons in the region. Figure 1a shows this series along with our 4-step forecast and its confidence intervals. Since this series is real-valued, we used partitions of its range of values into 2, 4 and 8 intervals in (4). According to our forecast, in the next four years the indicator will be higher than the current level.
In the second example, we predict the average annual monetary income of the population in the region. The results are shown in Figure 1b. The forecast shows that in the next 4 years the trend will continue and by 2023 the indicator will be near 35,000 rubles per year.
3. Adaptive Method of Forecasting
As noted in Section 2.1, multiple data compression algorithms can be combined into one forecasting method (almost) without loss of accuracy using (3). The only issue is computation time: we need to compress all sequences with every data compression algorithm included in the combination. In this section, we consider a way to significantly reduce computation time while maintaining near-optimal accuracy. This can be achieved using the so-called time-universal codes.
3.1. Time-Universal Codes
Suppose we want to compress a sequence $X = x_1 \ldots x_t$, where each $x_i$ belongs to some finite alphabet $A$. Also suppose we have a finite set of data compression algorithms $F = \{\varphi_1, \ldots, \varphi_k\}$, and we want to compress $X$ with the $\varphi_s \in F$ that yields the smallest code length. The obvious way to find $\varphi_s$ is to compress $X$ with each $\varphi_i \in F$ and then choose the best one. After that, we need to store the index $s$ along with $\varphi_s(X)$ in order to be able to decompress the original data. The binary representation of any integer $q$ from $0$ to $k-1$ requires no more than $\lceil \log_2 k \rceil$ bits; therefore, our final code word would be $\langle s \rangle \varphi_s(X)$ with a length of $|\varphi_s(X)| + \lceil \log_2 k \rceil$ bits (for example, with $k = 8$ compressors the index adds only 3 bits). This approach works well but requires additional time proportional to $k$. The goal is to compress $X$ with the best compressor using relatively little additional time.
Time-universal codes were proposed in [25] and allow making the portion of extra time asymptotically as small as desired. Let $\tau_i$ be the time $\varphi_i$ spends on encoding a single letter of $X$, $i = 1, \ldots, k$, and let $\tau_{\max} = \max_{1 \le i \le k} \tau_i$. Thus, an upper bound for the time needed to encode $X$ using any $\varphi_i \in F$ is $T = t\,\tau_{\max}$. Denote as $\delta T$ the amount of extra time we have to select a close-to-optimal data compressor from $F$ for $X$, where $\delta$ is a positive constant. The total time of selection and compression then becomes no more than $(1 + \delta)T$. Time-adaptive and time-universal codes are defined as follows.
Definition 1. Any method of data compression $\hat{\Phi}$ that encodes a sequence $x_1 \ldots x_t$, $x_i \in A$, by a binary word of the length $|\varphi_s(x_1 \ldots x_t)| + \lceil \log_2 k \rceil$ for some $\varphi_s \in F$, and whose time of encoding is not greater than $(1 + \delta)\,t\,\tau_{\max}$, is called a time-adaptive code and denoted as $\hat{\Phi}_{\delta}$.
Definition 2. If for a time-adaptive code $\hat{\Phi}_{\delta}$ the following equation is valid:

$$\lim_{t \to \infty} \frac{1}{t} \left( |\hat{\Phi}_{\delta}(x_1 \ldots x_t)| - \min_{\varphi_i \in F} |\varphi_i(x_1 \ldots x_t)| \right) = 0,$$

this code is called time-universal.
In [25], the following simple algorithm was proposed and proved to yield a time-universal code (a sketch of it in code is given after this list):

1. Calculate $r = \lceil \delta t / k \rceil$;
2. Find $\varphi_s \in F$ such that $|\varphi_s(x_1 \ldots x_r)| = \min_{\varphi_i \in F} |\varphi_i(x_1 \ldots x_r)|$, i.e., compress the prefix $x_1 \ldots x_r$ with every compressor from $F$ and select the best one;
3. Compress $x_1 \ldots x_t$ using $\varphi_s$ and make the code word $\langle s \rangle \varphi_s(x_1 \ldots x_t)$.
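A direct rendering of these three steps might look as follows; it reuses the COMPRESSORS list above as $F$, and the header is the $\lceil \log_2 k \rceil$-bit index of the selected compressor:

```python
import math

def time_universal_compress(x, compressors, delta):
    """Select a compressor on a prefix of length r = ceil(delta*t/k),
    then compress the whole sequence with it."""
    t, k = len(x), len(compressors)
    r = min(t, math.ceil(delta * t / k))
    prefix = bytes(x[:r])
    s = min(range(k), key=lambda i: len(compressors[i](prefix)))
    header_bits = math.ceil(math.log2(k)) if k > 1 else 1  # encodes <s>
    return s, header_bits, compressors[s](bytes(x))

# E.g., selecting on 10% of the data with k = 8 compressors means delta = 0.8:
# s, hdr, code = time_universal_compress([0, 1] * 500, COMPRESSORS, delta=0.8)
```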
In this work, we implemented this algorithm and used it to predict real-world time series.
3.2. Experiments
In this section, we consider sunspot number forecasting. Sunspots are temporary spots on the Sun with reduced surface temperature that appear darker than the surrounding areas. The sunspot number time series is provided by the WDC-SILSO (World Data Center—Sunspot Index and Long-term Solar Observations) [26]. The daily, monthly, yearly and 13-month smoothed monthly total sunspot numbers can be found on the SILSO site: http://www.sidc.be/silso/datafiles. Here, we used the monthly mean total sunspot number series.
We accessed the SILSO site on 20 June 2020; at that time, the series had 3257 observations. The entire series is shown in Figure 2. It has an obvious cyclical component, and it is known that the approximate cycle length is 11 years [27].
The Space Weather Services (SWS) of the Australian Bureau of Meteorology issues forecasts for the WDC-SILSO sunspot numbers and maintains an archive of these forecasts (http://listserver.ips.gov.au/pipermail/ips-ssn-predictions/), so we can compare our forecasts with them. In July 2015, SILSO adjusted the original data, and SWS moved to the new version of the data in its forecasts in February 2016. We started making predictions from that date; thus, after each element with a number greater than 3205 (which corresponds to January 2016), before adding it to the end of the series, we made a 4-step forecast.
In order to select a (close to) optimal compressor for this series, we tried to use from 10% to 100% of its values, with a step of 10%. For instance, when 10% of the values were used, $\delta$ was 8 × 0.1 = 0.8, since we had 8 algorithms. In Section 2.4, the maximal number of intervals we considered when quantizing was 8; in this section, we increased this value to 16. In everything else, we used the same methodology as in Section 2.4.
For brevity, let us denote the combination of all 8 algorithms, obtained using (3), as the joint method. Code lengths for different compressors, obtained by compressing the entire series (100% of the values), are presented in Table 2. We can see that by code length, zstd is the best compressor for this series, and hence it will have the greatest contribution to the result of the joint method. For 10% through 40% of the series values, the best compressor by code length is ppmd; for 50–100%, zstd becomes the best compressor. The code lengths for ppmd and zstd are shown in Table 3.
To assess the accuracy of our forecasts, we calculated the mean absolute error (MAE), defined as the mean of the absolute values of the residuals, for each of steps 1–4. The MAEs of ppmd's and zstd's forecasts are presented in Table 4. We can see that ppmd was a bit more accurate than zstd. We also tried to use more values for prediction (from 2800 to 3205); in that case, zstd was slightly more accurate for steps 1–3 (the mean absolute errors are presented in Table 5), while ppmd was more accurate for step 4.
We ran the program 5 times using all 8 algorithms and 5 more times using the adaptive algorithm (50% of the values were used to select the best compressor). The average time required to compute one 4-step forecast was 9945.46 s in the former case and 564.63 s in the latter. Thus, we reduced the computation time by more than 17 times without loss of accuracy.
4. Model Limitations
The presented model has some limitations. First, it is suitable for predicting data with no trend pattern. However, as was shown in Section 2.4, some additional techniques, such as differencing or time series decomposition [28], can be used to overcome this limitation. Secondly, the computational complexity of the model is high: if we make a forecast for $h$ steps ahead, $|A|^h$ sequences have to be compressed. This means that the model cannot be directly applied to long-term forecasting or in settings where predictions must be computed very quickly. While sacrificing accuracy, we can still make long-term forecasts using the following approach. Suppose we have a series $X = x_1 \ldots x_t$ and we want to predict its next $h$ values. For simplicity, assume that $t$ and $h$ are even numbers (the generalization is obvious). We can split $X$ into two separate time series $X_1 = x_1 x_3 \ldots x_{t-1}$ and $X_2 = x_2 x_4 \ldots x_t$. Then we predict $x_{t+1} x_{t+3} \ldots x_{t+h-1}$ using $X_1$ and $x_{t+2} x_{t+4} \ldots x_{t+h}$ using $X_2$. This allows us to reduce the number of sequences to compress from $|A|^h$ to $2|A|^{h/2}$. It is clear that we can go further and split $X$ into more than two series. Another obvious way to speed up computations is to choose a small $n$ when quantizing.
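A sketch of this splitting trick for even $t$ and $h$ follows (forecaster is any $h$-step point-forecast procedure, as before):

```python
def split_forecast(x, h, forecaster):
    """Predict the odd- and even-indexed subseries separately and re-interleave,
    reducing the number of sequences to compress from |A|**h to 2*|A|**(h/2)."""
    x1, x2 = x[0::2], x[1::2]            # x_1 x_3 ... x_{t-1} and x_2 x_4 ... x_t
    f1 = forecaster(x1, h // 2)          # x_{t+1}, x_{t+3}, ..., x_{t+h-1}
    f2 = forecaster(x2, h // 2)          # x_{t+2}, x_{t+4}, ..., x_{t+h}
    out = []
    for a, b in zip(f1, f2):
        out.extend([a, b])
    return out
```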
5. Discussion
In our opinion, the results of computations show that the proposed method has good accuracy. The adaptive approach to forecasting can significantly reduce computation time without loss in effectiveness if multiple compressors are used.
It is important to note that some time series can contain complex regularities. For example, the dynamics of financial time series can be influenced by the participants of the corresponding processes, and this can lead to the emergence of subtle patterns in the data. Another example is the time series arising in the study of space objects. Such series can contain various nonstationarities like attenuation and non-periodic changes. Thus, in our opinion, forecasting methods that allow finding unusual patterns may be of interest to many researchers. Patterns of the indicated types that occur in text data can be found by modern archivers. Therefore, we believe that the use of data compressors implicitly allows the use of various techniques of finding patterns, including artificial intelligence methods beyond neural networks.
In further research, a more elaborate strategy for optimal algorithms selection in the adaptive method can be developed. For example, one can use multidimensional optimization techniques for this purpose.
6. Conclusions
It turns out that from a mathematical point of view, data compression and prediction can be thought of together, allowing ideas from one area to be used in another. From a practical point of view, many of the data compression techniques implemented in software can be used in time series forecasting. Efficient data compression algorithms to be developed in the future can be combined with existing ones using the described approach.
Author Contributions
Conceptualization, B.R.; methodology, B.R.; software, K.C.; validation, B.R. and K.C.; writing—review and editing, K.C. and B.R. Both authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by RFBR, project numbers 19-37-90009, 19-47-540001.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
- Hyndman, R.; Koehler, A.B.; Ord, J.K.; Snyder, R.D. Forecasting with Exponential Smoothing: The State Space Approach; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
- Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications: With R Examples; Springer International Publishing: Cham, Switzerland, 2017. [Google Scholar]
- Tang, Z.; De Almeida, C.; Fishwick, P.A. Time series forecasting using neural networks vs. Box-Jenkins methodology. Simulation 1991, 57, 303–310. [Google Scholar] [CrossRef]
- Zhang, G.P.; Kline, D.M. Quarterly time-series forecasting with neural networks. IEEE Trans. Neural Netw. 2007, 18, 1800–1814. [Google Scholar] [CrossRef]
- Hewamalage, H.; Bergmeir, C.; Bandara, K. Recurrent neural networks for time series forecasting: Current status and future directions. Int. J. Forecast. 2020, 37, 388–427. [Google Scholar] [CrossRef]
- Ryabko, B.Y. Prediction of random sequences and universal coding. Probl. Inf. Transm. 1988, 24, 87–96. [Google Scholar]
- Ryabko, B.; Astola, J.; Malyutov, M. Compression-Based Methods of Statistical Analysis and Prediction of Time Series; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
- Ryabko, B. Compression-Based Methods for Nonparametric Prediction and Estimation of Some Characteristics of Time Series. IEEE Trans. Inf. Theory 2009, 55, 4309–4315. [Google Scholar] [CrossRef]
- Burrows, M.; Wheeler, D.J. A block-sorting lossless data compression algorithm. In Tech. Rept. 124; Digital SRC: Palo Alto, CA, USA, 1994. [Google Scholar]
- Cleary, J.; Witten, I. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 1984, 32, 396–402. [Google Scholar] [CrossRef] [Green Version]
- Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343. [Google Scholar] [CrossRef] [Green Version]
- Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 1978, 24, 530–536. [Google Scholar] [CrossRef] [Green Version]
- Kieffer, J.C.; Yang, E.H. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 2000, 46, 737–754. [Google Scholar] [CrossRef]
- Nevill-Manning, C.G.; Witten, I.H. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. 1997, 7, 67–82. [Google Scholar] [CrossRef] [Green Version]
- Sakamoto, H.; Maruyama, S.; Kida, T.; Shimozono, S. A space-saving approximation algorithm for grammar-based compression. IEICE Trans. Inf. Syst. 2009, 92, 158–165. [Google Scholar] [CrossRef] [Green Version]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
- GNU Project. Gzip. Free Software Foundation. Available online: https://www.gnu.org/software/gzip/ (accessed on 20 June 2020).
- Tsay, R.S. Multivariate Time Series Analysis: With R and Financial Applications; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
- De Silva, A.; Hyndman, R.J.; Snyder, R. The vector innovations structural time series framework: A simple approach to multivariate forecasting. Stat. Model. 2010, 10, 353–374. [Google Scholar] [CrossRef] [Green Version]
- Shih, S.Y.; Sun, F.K.; Lee, H.Y. Temporal pattern attention for multivariate time series forecasting. Mach. Learn. 2019, 108, 1421–1441. [Google Scholar] [CrossRef] [Green Version]
- Wang, K.; Li, K.; Zhou, L.; Hu, Y.; Cheng, Z.; Liu, J.; Chen, C. Multiple convolutional neural networks for multivariate time series prediction. Neurocomputing 2019, 360, 107–119. [Google Scholar] [CrossRef]
- Maruyama, S.; Sakamoto, H.; Takeda, M. An online algorithm for lightweight grammar-based compression. Algorithms 2012, 5, 214–235. [Google Scholar] [CrossRef]
- Bille, P.; Gørtz, I.L.; Prezza, N. Space-efficient re-pair compression. In Proceedings of the 2017 Data Compression Conference (DCC), Snowbird, UT, USA, 4–7 April 2017; pp. 171–180. [Google Scholar]
- Chirikhin, K.S.; Ryabko, B.Y. Application of artificial intelligence and data compression methods to time series forecasting. In Proceedings of the Applied Methods of Statistical Analysis, Statistical Computation and Simulation-AMSA’2019, Novosibirsk, Russia, 18–20 September 2019; pp. 553–560. [Google Scholar]
- Ryabko, B. Time-Universal Data Compression. Algorithms 2019, 12, 116. [Google Scholar] [CrossRef] [Green Version]
- SILSO, World Data Center. Sunspot Number and Long-Term Solar Observations. Royal Observatory of Belgium, On-Line Sunspot Number Catalogue. 1749–2020. Available online: http://www.sidc.be/SILSO/ (accessed on 20 June 2020).
- Hathaway, D.H. The solar cycle. Living Rev. Sol. Phys. 2015, 12, 4. [Google Scholar] [CrossRef] [PubMed]
- Cleveland, R.B.; Cleveland, W.S.; McRae, J.E.; Terpenning, I.J. STL: A seasonal-trend decomposition procedure based on loess. J. Off. Stat. 1990, 6, 3–33. [Google Scholar]