1. Introduction
The time series considered in this paper has a multi-scale structure: the coarse level and the fine level. We have observations $\mathbf{y}_1, \ldots, \mathbf{y}_N$, where each $\mathbf{y}_i = (y_{i,1}, \ldots, y_{i,L})$ is itself a time series of length $L$. The whole time series is arranged as
$$y_{1,1}, \ldots, y_{1,L}, \; y_{2,1}, \ldots, y_{2,L}, \; \ldots, \; y_{N,1}, \ldots, y_{N,L}.$$
The subscripts $i \in \{1, \ldots, N\}$ are called coarse-level indices, while the subscripts $l \in \{1, \ldots, L\}$ are called fine-level indices. Throughout this paper, we take the electricity load dataset as a concrete example, as illustrated in Figure 1. The electricity load dataset consists of $N$ consecutive daily records, and in each day, there are $L = 96$ samples recorded every quarter-hour. In this example, the coarse-level indices denote “day”, while the fine-level indices correspond to the time resolution of 15 min. The aim is to forecast both short-term and long-term electricity loads based on historical records. There may be partial observations $y_{N+1,1}, \ldots, y_{N+1,L_0}$ with $0 \le L_0 < L$, so the entire observed time series has the form:
$$y_{1,1}, \ldots, y_{1,L}, \; \ldots, \; y_{N,1}, \ldots, y_{N,L}, \; y_{N+1,1}, \ldots, y_{N+1,L_0}.$$
The task is to predict the future responses $y_{N+s,l}$, where $s$ and $l$ are positive integers.
The coarse level and the fine level provide different structural information about the data generation process. At the coarse level, each $\mathbf{y}_i$ can be regarded as a time series, and there is a certain cluster structure [1,2] underlying these time series: we can divide $\mathbf{y}_1, \ldots, \mathbf{y}_N$ into groups such that the time series within each group share a similar evolving trend. Returning to the electricity load dataset, such groups correspond to different electricity consumption patterns. We use $z_i$ to denote the cluster label of $\mathbf{y}_i$. At the fine level, the observations $y_{i,1}, \ldots, y_{i,L}$ can be regarded as a realization of a stochastic process, and the properties of this stochastic process are determined by the cluster label $z_i$.
The mixture of Gaussian process functional regressions (mix-GPFR) model [1,3] is powerful for analyzing functional data or batch data, and it is applicable to the multi-scale time series forecasting task. Mix-GPFR assumes that there are $K$ Gaussian process functional regression (GPFR) [4] components and that, associated with each $\mathbf{y}_i$, there is a latent variable $z_i$ indicating which GPFR component generates $\mathbf{y}_i$. Since GPFR is good at capturing temporal dependency, this model successfully utilizes the structural information at the fine level. However, the temporal information at the coarse level is totally ignored, since mix-GPFR assumes that the $z_i$ are i.i.d.
In this work, we propose to model the temporal dependency at the coarse level by a hidden Markov model, which characterizes the switching dynamics of $z_i$ through the transition probability matrix. We refer to the proposed model as HM-GPFR. Mix-GPFR is able to predict effectively when $L_0 > 0$: to predict the responses $y_{N+1,L_0+1}, \ldots, y_{N+1,L}$, we must determine the cluster label $z_{N+1}$ based on the observations $y_{N+1,1}, \ldots, y_{N+1,L_0}$; otherwise, we do not know which evolving pattern governs $\mathbf{y}_{N+1}$. If there is no observation at day $N+1$ (i.e., $L_0 = 0$), then mix-GPFR fails to identify the stochastic process that generates $\mathbf{y}_{N+1}$. For the same reason, mix-GPFR is not suitable for long-term forecasting (predicting $\mathbf{y}_{N+s}$ for $s > 1$). On the other hand, HM-GPFR is able to infer $z_{N+s}$ for any $s \ge 1$ based on the transition probabilities of the hidden Markov model, even when $L_0 = 0$. Therefore, HM-GPFR makes use of coarse-level temporal information and solves the cold start problem of mix-GPFR. Besides, when a new day’s records have been fully observed, one needs to re-train a mix-GPFR model to utilize $\mathbf{y}_{N+1}$, while HM-GPFR can adjust its parameters incrementally without retraining the model.
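To make this mechanism concrete, the following sketch shows how a distribution over cluster labels can be propagated $s$ steps ahead through the transition matrix and turned into a forecast. It is a minimal illustration under assumed values: `P`, `belief`, and `means` are hypothetical stand-ins, not quantities estimated by the actual model.

```python
import numpy as np

# Hypothetical transition matrix of the hidden Markov chain over cluster
# labels; row k holds p(z_{i+1} = j | z_i = k).
P = np.array([[0.8, 0.15, 0.05],
              [0.2, 0.60, 0.20],
              [0.1, 0.30, 0.60]])

# Assumed posterior over the cluster label of the last fully observed day.
belief = np.array([0.7, 0.2, 0.1])

# Illustrative mean curves of the K GPFR components (L = 96 points per day).
L = 96
t = np.linspace(0.0, 1.0, L)
means = np.stack([np.sin(2 * np.pi * t) + c for c in (1.0, 2.0, 3.0)])

def forecast_day(belief, P, means, s):
    """Predict day N+s as the mixture of component mean curves, weighted
    by p(z_{N+s} = k) = belief @ P^s; no observation from day N+s is
    needed, which is how the cold start problem is avoided."""
    weights = belief @ np.linalg.matrix_power(P, s)
    return weights @ means

# Even with no records observed on the target day (L0 = 0), a forecast exists.
y_hat = forecast_day(belief, P, means, s=1)
```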
2. Related Works
The Gaussian process (GP) [5] is a powerful non-parametric Bayesian model. In [6,7,8], the GP was applied to time series forecasting. Shi et al. proposed the GPFR model to process batch data [4]. To effectively model multi-modal data, a mixture structure was further introduced into GPFR, and the mix-GPFR model was proposed [1,3]. In [2,9,10], GP-related methods for electricity load prediction were evaluated thoroughly. However, in these works, daily records were treated as i.i.d. samples, and the temporal information at the coarse level was ignored.
Multi-scale time series were proposed in [11,12], and further developments in this direction have been achieved in recent years. However, the time series considered in this work differs from those multi-scale time series since, at the coarse level, there is no aggregated observation computed from the samples at the fine level. In this paper, we mainly emphasize the multi-scale structure of the time series.
6. Experimental Results
6.1. Experiment Settings
In this section, we used the electricity load dataset issued by the State Grid of China for a city in northwest China. The dataset records electricity loads every 15 min; thus, there are 96 records per day. Using the electricity load records of 2010 for training, we predicted the subsequent $S$-step electricity loads in a time series prediction fashion, where $S$ ranges from short to long horizons (see Table 1). This setting allowed both short-term and long-term predictions to be evaluated. For a more comprehensive and accurate assessment of the performance, we rolled the time series for 100 rounds: based on the electricity loads of 2010, the $r$-th round also puts the first $r$ records of 2011 into the training set. In each round, we predicted the subsequent $S$-step electricity loads. In the $r$-th round, suppose the ground-truths are $y^{(r)}_1, \ldots, y^{(r)}_S$ and the predictions are $\hat{y}^{(r)}_1, \ldots, \hat{y}^{(r)}_S$; we used the mean absolute percentage error (MAPE) to evaluate the prediction results. Specifically,
$$\mathrm{MAPE}^{(r)} = \frac{1}{S} \sum_{s=1}^{S} \frac{\big| y^{(r)}_s - \hat{y}^{(r)}_s \big|}{\big| y^{(r)}_s \big|}.$$
For the overall evaluation, we report the average of the 100 per-round MAPEs, $\overline{\mathrm{MAPE}} = \frac{1}{100} \sum_{r=1}^{100} \mathrm{MAPE}^{(r)}$. Since the algorithms are influenced by randomness, we repeated each algorithm for 10 runs and report the average results.
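For concreteness, the evaluation protocol described above can be outlined as follows. This is an illustrative sketch, not the MATLAB code used in the paper; `predict` is a hypothetical $S$-step forecaster, and the data arrays are stand-ins.

```python
import numpy as np

def mape(y_true, y_pred):
    # MAPE^{(r)} = (1/S) * sum_s |y_s - yhat_s| / |y_s|
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def rolling_evaluation(loads_2010, loads_2011, predict, S, rounds=100):
    """Average the per-round MAPEs over the rolling rounds: round r adds
    the first r records of 2011 to the training set and predicts S steps."""
    scores = []
    for r in range(1, rounds + 1):
        train = np.concatenate([loads_2010, loads_2011[:r]])
        y_true = loads_2011[r:r + S]
        y_pred = predict(train, S)
        scores.append(mape(y_true, y_pred))
    return np.mean(scores)
```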
We implemented HM-GPFR and BHM-GPFR in MATLAB and compared them with other time series forecasting methods. Specifically:
Classical time series forecasting methods: auto-regressive (AR), moving average (MA), auto-regressive moving average (ARMA), auto-regressive integrated moving average (ARIMA), seasonal auto-regressive moving average (SARMA).
Machine learning methods: long short-term memory (LSTM), feedforward neural network (FNN), support vector regression (SVR), enhanced Gaussian process mixture model (EGPM).
GPFR-related methods: the mixture of Gaussian process functional regressions (mix-GPFR), the mixture of Gaussian processes with nonparametric mean functions (mix-GPNM), Dirichlet-process-based mixture of Gaussian process functional regressions (DPM-GPFR).
For AR, MA, ARMA, ARIMA, and SARMA, we selected the model order $L$ from a set of candidate values. For SARMA, the seasonal length was set to 96 since there are 96 records per day, which implicitly assumes that the overall time series exhibits daily periodicity. LSTM, FNN, SVR, and EGPM transform the time series prediction problem into a regression problem, i.e., they use the latest $L$ observations to predict the output at the next point and then use the regression method to train and predict; in the experiments, the window length $L$ was likewise chosen from a set of candidate values. The neural network in the FNN has two hidden layers with 10 and 5 neurons, respectively. The kernel function in SVR is the Gaussian kernel, whose scale parameters were adaptively selected by cross-validation. The number of components for EGPM was chosen from a set of candidate values. In addition, we used the recursive method [20] for multi-step prediction (see the sketch after this paragraph). For the comparison algorithms implemented using MATLAB toolboxes (including AR, MA, ARMA, ARIMA, SARMA, LSTM, FNN, and SVR), the settings of the hyper-parameters were adaptively tuned by the toolboxes. For the other comparison algorithms (EGPM, mix-GPFR, mix-GPNM), the parameters were set according to the original papers. For mix-GPFR, mix-GPNM, and DPM-GPFR, we first converted the time series data into curve datasets and then used these methods to make the predictions. The number of components $K$ in mix-GPFR and mix-GPNM was set to 5, and the number of B-spline basis functions $D$ in mix-GPFR and DPM-GPFR was set to 30. The settings of $K$ and $D$ will be further discussed in Section 6.6.
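The recursive multi-step scheme can be sketched as below: a one-step regressor is applied repeatedly, feeding each prediction back into the input window. The `model` object is a hypothetical regressor with a scikit-learn-style `predict` method; this is a generic illustration, not the toolbox implementations used in the experiments.

```python
import numpy as np

def recursive_forecast(model, history, L, S):
    """S-step forecast by recursion: predict one step ahead from the latest
    L observations, append the prediction to the window, and repeat. Early
    errors are fed back into later inputs, so they accumulate with S."""
    window = list(history[-L:])
    preds = []
    for _ in range(S):
        y_next = model.predict(np.array(window[-L:])[None, :])[0]
        preds.append(y_next)
        window.append(y_next)
    return np.array(preds)
```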
6.2. Performance Evaluation and Model Explanation
The prediction results of various methods on the electricity load dataset are shown in
Table 1. From the table, we can see that the prediction accuracy of the classical time series forecasting methods decreased significantly as the prediction step increased. Among them, SARMA outperformed AR, MA, ARMA, and ARIMA, because SARMA takes the periodicity of the data into consideration and can fit the data more effectively. The machine learning methods LSTM, FNN, SVR, and EGPM showed a similar phenomenon: when $S$ was small, the prediction accuracy was high, and when $S$ was large, the prediction accuracy was low. This observation indicates that these methods are not suitable for long-term prediction. In addition, the machine learning methods are also sensitive to the settings of their parameters. For example, the results of FNN and SVR were better under suitable settings of $L$, coming close to SARMA, while the long-term prediction accuracy of EGPM decreased significantly when $L$ was relatively large. It is challenging to set the hyper-parameters appropriately in practice. When making a long-term prediction, classical time series prediction methods and machine learning methods need to recursively predict subsequent values based on previously estimated values, which causes the accumulation and amplification of errors. On the other hand, GPFR-related methods first make predictions according to the mean function and then finely correct these predictions based on the observed data. The mean function part can better describe the evolution law of the data, which enables more effective use of the historical and structural information in the data. Mix-GPFR, mix-GPNM, and DPM-GPFR obtained results similar to SARMA in long-term prediction and could even achieve the best results in short-term prediction. This observation demonstrates the effectiveness of GPFR-related methods. However, these methods cannot handle long-term prediction tasks well due to the “cold start” problem. Overall, the performances of the proposed HM-GPFR and BHM-GPFR were more comprehensive. For medium-term and short-term prediction, the results of HM-GPFR and BHM-GPFR were slightly worse than those of mix-GPFR, mix-GPNM, and DPM-GPFR, but they still enjoyed significant advantages over the other comparison methods. In terms of long-term forecasting, HM-GPFR and BHM-GPFR outperformed mix-GPFR, mix-GPNM, and DPM-GPFR, which shows that considering the multi-scale temporal structure between daily electricity load time series can effectively improve the accuracy of long-term forecasting. In addition, BHM-GPFR was generally better than HM-GPFR, which shows that giving prior distributions to the parameters and learning in a fully Bayesian way can further increase the robustness of the model and improve the prediction accuracy.
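The “predict with the mean function, then correct with observed data” behavior of the GPFR-related methods can be illustrated with standard GP regression formulas under an explicit prior mean. The kernel, its parameters, and the noise level below are assumptions made for illustration, not the fitted values from the paper.

```python
import numpy as np

def rbf(a, b, scale=0.1, var=1.0):
    # Squared-exponential kernel evaluated between two sets of inputs.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / scale) ** 2)

def gp_corrected_prediction(t_obs, y_obs, t_new, mean_fn, noise=1e-2):
    """GP posterior mean with prior mean mean_fn: start from the mean curve
    and correct it by the kernel-smoothed residuals of the observations.
    With no observations (cold start), it falls back to mean_fn alone."""
    if len(t_obs) == 0:
        return mean_fn(t_new)
    C = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    resid = y_obs - mean_fn(t_obs)
    return mean_fn(t_new) + rbf(t_new, t_obs) @ np.linalg.solve(C, resid)
```

Because the correction vanishes far from the observed points, long-horizon forecasts are dominated by the mean function, which is why these methods degrade more gracefully than purely recursive predictors.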
HM-GPFR and BHM-GPFR have strong interpretability. Specifically, the estimated values of the hidden variables $z_i$ obtained after training divide the daily electricity load records into $K$ categories according to their evolution law. Each evolution pattern can be represented by the mean function of the corresponding GPFR component, and these evolution patterns transfer to each other with certain probabilities; the transfer law is characterized by the transition probability matrix of the model. In Figure 3, we visualize the evolution patterns and transfer laws learned by HM-GPFR and BHM-GPFR. We call the evolution law corresponding to the mean function represented by the orange curve (at the top of the figure) Mode 1, and we label the five evolution modes Mode 1 to Mode 5, respectively, in clockwise order. Combined with the practical application background, some meaningful laws can be found from the learned models. Examples are as follows:
The electricity load of Mode 1 was the lowest. Besides, Mode 1 was relatively stable: when the system was in this evolution pattern, it would stay in this state in the next step with high probability. In the case of a state transition, the probability of transferring to the mode with the second-lowest load (Mode 2 in Figure 3a and Mode 3 in Figure 3b) was high, while the probability of transferring to the mode with the highest load (Mode 5 in Figure 3a and Mode 2 and Mode 5 in Figure 3b) was relatively low;
The evolution laws of Mode 2 and Mode 5 in Figure 3b are very similar, but their probabilities of transferring to other modes are different. From the perspective of the electricity load alone, both can be regarded as the mode with the highest load. When the system was in the mode with the highest load (Mode 5 in Figure 3a and Mode 2 and Mode 5 in Figure 3b), the probability of remaining in this state in the next step was about the same as that of transferring to the mode with the lowest load (Mode 1);
When the system was in the mode with the second-highest load (Mode 3 in
Figure 3a and Mode 4 in
Figure 3b), the probability of remaining in this state in the next step was low, while the probabilities of transferring to the modes with the lowest load and the highest load were high.
These laws help us understand the model, have a certain guiding significance for production practice, and can be further analyzed in combination with expert knowledge.
The case of $S = 1$ in Table 1 is the most common in practical applications, that is, a one-step-ahead rolling forecast. As discussed in the section introducing HM-GPFR, when making a rolling prediction, HM-GPFR and BHM-GPFR can dynamically adjust the model incrementally after collecting new data, without retraining. The results of the one-step-ahead rolling prediction of HM-GPFR and BHM-GPFR on the electricity load dataset are shown in Figure 4. It can be seen that the predicted values of HM-GPFR and BHM-GPFR were very close to the ground-truths, indicating that they are effective for rolling prediction. In the figure, the color of each point is the weighted average of the colors corresponding to the modes in Figure 3, weighted by the posterior probabilities of the modes. Note that there are color changes in some electricity load curves in Figure 4a,b. Taking the time series in Figure 4a in the range of about 1100–1200 as an example, when there were few observations on that day, HM-GPFR believed that the electricity load evolution pattern of that day was more likely to belong to Mode 3. With the gradual accumulation of observations, the model came to believe that the pattern belonged to Mode 5, and then tended to Mode 3 again. This shows that HM-GPFR and BHM-GPFR can adjust the estimate of $z_i$ in a timely manner according to the latest information during the rolling prediction.
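This within-day adjustment can be read as a Bayes update that combines the HMM-predicted label distribution with the likelihood of the records observed so far. The sketch below assumes, for simplicity, independent Gaussian likelihoods around the component mean curves; the actual model uses the full GPFR likelihood.

```python
import numpy as np

def update_label_posterior(prior, means, y_partial, sigma=0.5):
    """p(z = k | y_{1:L0}) is proportional to prior_k * N(y_{1:L0} |
    mu_k[:L0], sigma^2 I), where `prior` is the HMM-predicted label
    distribution for the day, `means` holds the K component mean curves,
    and `y_partial` contains the first L0 records of the day."""
    L0 = len(y_partial)
    log_lik = -0.5 * np.sum((y_partial - means[:, :L0]) ** 2, axis=1) / sigma**2
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

As more records of the day arrive, `L0` grows and the posterior can shift between modes, which matches the color-change behavior visible in Figure 4.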
6.3. Clustering Structure
The estimated values of the latent variables $z_i$ also indicate the evolution mode corresponding to the data of the $i$-th day. Figure 5 visualizes some training data, with different colors indicating different evolution modes, so the multi-scale structure in the electricity load time series can be seen intuitively. According to the learned transition probabilities, we can obtain the stationary distributions of the Markov chains of HM-GPFR and BHM-GPFR; the proportion of each mode in Figure 5 is roughly consistent with these stationary distributions.
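The stationary distribution can be computed from the learned transition matrix as the left eigenvector associated with eigenvalue 1; a generic sketch for any row-stochastic matrix `P` follows.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi @ P = pi with sum(pi) = 1 via the eigendecomposition of P^T."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()
```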
6.4. Ablation Study
In this section, we mainly compared HM-GPFR and BHM-GPFR with mix-GPFR, mix-GPNM, and DPM-GPFR to explore the impact of introducing the coarse-grained temporal structure on the prediction performance. The MAPEs reported in Table 1 are averaged with respect to the 100 rounds, while in this section, we paid special attention to the cold start case. In this case, the observed data are the electricity load records in 2010, and there are no partial observations on January 1, 2011 (i.e., $L_0 = 0$ in Equation (2)). Therefore, mix-GPFR, mix-GPNM, and DPM-GPFR will encounter the cold start problem. Table 2 reports the MAPEs of these methods at different prediction steps when $L_0 = 0$. It can be seen from the table that the prediction accuracy of HM-GPFR and BHM-GPFR was higher than that of mix-GPFR, mix-GPNM, and DPM-GPFR at almost every step, which shows that coarse-grained temporal information helps improve the prediction performance and that using a Markov chain to model the transfer law of electricity load evolution patterns makes effective use of this information.
Figure 6 further shows the multi-step prediction results of these methods on the electricity load dataset. This is also the “cold start” case ($L_0 = 0$), and we predicted the electricity loads of the next 10 days (960 time points in total). It can be seen from the figure that these methods can effectively utilize the periodic structure in the time series, and the prediction results showed periodicity, but the prediction results of HM-GPFR and BHM-GPFR differed slightly from the other methods. Due to the “cold start” problem, the predictions of mix-GPFR, mix-GPNM, and DPM-GPFR were the same for every day, namely a weighted average of the component mean functions with fixed mixing proportions, while HM-GPFR and BHM-GPFR use coarse-grained temporal information when making predictions and adjust the predicted values of each day accordingly. Compared with the predicted values of the other methods, it can be seen from the figure that the predicted values of HM-GPFR and BHM-GPFR on the first day were higher, and, with the increase in step size, the predicted values tended to the weighted average of the mean functions of the GPFR components.
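In symbols (using the notation from Section 1; a schematic summary rather than an equation quoted from the model sections), the cold-start forecasts of the two model families differ as follows:
$$\hat{y}^{\text{mix}}_{N+s,\,l} = \sum_{k=1}^{K} \pi_k\, \mu_k(t_l), \qquad \hat{y}^{\text{HM}}_{N+s,\,l} = \sum_{k=1}^{K} \big[\boldsymbol{p}_N \mathbf{A}^{s}\big]_k\, \mu_k(t_l) \xrightarrow{\; s \to \infty \;} \sum_{k=1}^{K} \pi^{\star}_k\, \mu_k(t_l),$$
where $\pi_k$ denotes the mixing proportions of mix-GPFR, $\mu_k$ the mean function of the $k$-th GPFR component, $\mathbf{A}$ the transition matrix, $\boldsymbol{p}_N$ the label distribution of the last observed day, and $\pi^{\star}$ the stationary distribution. This is why the HM-GPFR and BHM-GPFR forecasts vary over the first few days and then settle to a weighted average of the mean functions.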
6.5. Multi-Step Prediction under Cold Start Setting
In order to see more clearly the role of the Markov chain structure of the hidden variables in the cold start setting, in Figures 7 and 8, we show the predicted values $\hat{\mathbf{y}}_{N+1}, \ldots, \hat{\mathbf{y}}_{N+5}$ of HM-GPFR and BHM-GPFR for the electricity load in the next five days, together with the distributions of the latent variables $z_{N+s}$ conditioned on the historical observations. It can be seen from the figures that HM-GPFR and BHM-GPFR give different predictions for each day’s electricity load, adjusted according to the transition probabilities between the evolution laws. For example, in Figure 7, when the distribution of $z_{N+s}$ concentrates on a low-load mode, the predicted electricity load of HM-GPFR on the $(N+s)$-th day is also low; when the distribution concentrates on a higher-load mode, the predicted value on the $(N+s)$-th day is also higher. Figure 8 shows a similar phenomenon. In addition, it can be seen that, as $s$ increases, $p(z_{N+s})$ quickly converges to the stationary distribution of the Markov chain, and the predicted value $\hat{\mathbf{y}}_{N+s}$ also tends to the weighted average of the mean functions of the GPFR components. In conclusion, these phenomena demonstrate that HM-GPFR and BHM-GPFR can effectively use the coarse-grained temporal structure to adjust the prediction for each day.
6.6. Sensitivity of Hyper-Parameters
There are two main hyper-parameters in HM-GPFR and BHM-GPFR: the number of B-spline basis functions $D$ and the number of GPFR components $K$. Here, we mainly focused on the selection of $K$. We varied $K$ over a range of values, trained HM-GPFR and BHM-GPFR, respectively, and report the results in Table 3. For HM-GPFR, the prediction performance tended to deteriorate as $K$ increased: in short-term prediction, the MAPE increased significantly, while in long-term prediction, the MAPE changed less. As $K$ increases, the number of parameters in the model also increases, and the model tends to suffer from overfitting. For BHM-GPFR, as $K$ increased, its long-term prediction performance decreased significantly, while the medium-term and short-term prediction results did not change much. This shows that, after introducing the prior distributions over the parameters, BHM-GPFR can prevent overfitting to a certain extent. In addition, we also note that, for small values of $K$, the differences between the results corresponding to different $K$ were not significant, so a small $K$ is a more reasonable choice. From the perspective of applications, we set $K = 5$ in the experiments, which takes both the expressive ability and the interpretability of the model into consideration.
7. Conclusions and Discussions
In this paper, we proposed the concept of multi-scale time series, which have temporal structures at two granularities. We established the HM-GPFR model for multi-scale time series forecasting and designed an effective learning algorithm. In addition, we placed prior distributions on the model parameters and obtained the more robust BHM-GPFR model. Compared with conventional GPFR-related methods (mix-GPFR, mix-GPNM, DPM-GPFR), the proposed methods can effectively use the temporal information at both the fine level and the coarse level, alleviate the “cold start” problem, and perform well in both short-term and long-term prediction. HM-GPFR and BHM-GPFR not only achieved high prediction accuracy, but also had good interpretability: combined with the practical problem background and domain knowledge, we can interpret the state transition laws learned by the model.
In practice, the number of hidden states $K$ in HM-GPFR/BHM-GPFR can be set by expert knowledge. However, how to set $K$ in a data-driven way is an interesting direction; this is usually referred to as the model selection problem, which is both important and challenging. One can run the algorithms with different $K$ and use certain criteria (such as the AIC or BIC) to choose the best one; however, this procedure is time-consuming, and the obtained result is generally unstable. For mix-GPFR, the Dirichlet process has been utilized to tackle the model selection problem [21]. We suggest hierarchical-Dirichlet-process-based hidden Markov models [22,23,24] as a promising approach to the model selection of HM-GPFR/BHM-GPFR. It is also promising to reduce the computational cost by introducing inducing points [25,26,27] into our proposed models, but how to balance the trade-off between performance and computational cost generally depends on the particular application scenario.