1. Introduction
Time series are sometimes observed at irregular times. Such irregular sampling occurs naturally in climate research [
1], in astronomy [
2], in heart rate analysis [
3] and even in financial time series [
4]. The typical way to deal with irregular samples is to ignore the irregularity and simply close up the gaps. However, a missing-value imputation (or gap-filling) strategy can be informative and can provide valuable knowledge for the subsequent stochastic analysis [
5]. This is particularly true for financial time series. Although price time series are typically non-stationary, log-return time series, computed as the difference in log-prices between two consecutive observations, are better behaved [
6]. In the presence of missing data, log-returns computed over different time intervals may carry different information content. In fact, information affecting the dynamics of power prices can be released while the market is closed [
7,
8]. Moreover, gaps make it difficult to detect seasonality in market prices or to compare markets with different closure-day patterns.
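The point about log-returns computed over unequal intervals can be made concrete with a minimal sketch. The prices and dates below are hypothetical, and the helper name `log_returns_with_gaps` is introduced here only for illustration; it pairs each log-return with the number of calendar days it spans, so that a Friday-to-Monday return is not silently treated as a one-day return.

```python
import math
from datetime import date

def log_returns_with_gaps(observations):
    """Return (date, log_return, gap_days) for consecutive observations.

    `observations` is a list of (date, price) pairs sorted by date.
    """
    out = []
    for (d0, p0), (d1, p1) in zip(observations, observations[1:]):
        gap_days = (d1 - d0).days  # calendar days spanned by this return
        out.append((d1, math.log(p1) - math.log(p0), gap_days))
    return out

# Hypothetical prices: the Friday-to-Monday return spans a closed weekend.
prices = [
    (date(2021, 3, 4), 30.0),  # Thursday
    (date(2021, 3, 5), 31.5),  # Friday
    (date(2021, 3, 8), 29.0),  # Monday
]
returns = log_returns_with_gaps(prices)
for day, r, gap in returns:
    print(day, round(r, 4), f"{gap}-day return")
```

If the gaps are simply closed up, the two returns above are treated as equivalent even though the second one accumulates three days of information.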
In this paper, we focus on the electricity prices of US power markets. US electricity price time series are irregularly sampled (daily data points are missing) as a result of weekends, holidays and other gaps due to market-specific reasons.
Existing methods for analyzing irregular time series can be categorized into three main directions [
9]: (i) the repair approach in which missing observations are recovered via smoothing or imputation [
10,
11,
12,
13,
14]—also implemented, especially in recent years, by machine learning methods [
15,
16,
17,
18]; (ii) the generalization of spectral analysis tools [
19,
20], such as wavelets [
21,
22,
23,
24]; (iii) kernel methods [
25,
26]. In this paper, we adopt a repair approach whose input preparation step is based on machine learning. We first lay a regular sampling grid over the original time series and then compute a value for each new grid point from the available samples, which turns the task into an equidistant missing-data problem [
5]. For such an imputation process, we chose the missForest algorithm [
27], which is completely agnostic about the data distribution. We verified that this machine learning strategy for gap filling improves data quality efficiently, especially compared to traditional methods. Because the approach is predominantly data-driven by design, we could rely on training data alone and on very few parameters to reconstruct complete time series. Once the observed power price time series has been filled, the anomaly detection problem is addressed.
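The grid-laying step described above can be sketched as follows. This is only an illustration under stated assumptions: the grid is daily, the observations are hypothetical, and the helper name `to_daily_grid` is not taken from the paper. The imputation itself is not shown; the sketch only produces the equidistant series with explicit gaps that an imputer such as missForest would then fill.

```python
from datetime import date, timedelta

def to_daily_grid(observations):
    """Lay a regular daily grid over irregular (date, price) observations.

    Days without an observation get None, to be filled later by an imputer.
    """
    known = dict(observations)
    start, end = min(known), max(known)
    n_days = (end - start).days + 1
    grid = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, known.get(day)) for day in grid]

# Hypothetical observations with a weekend gap.
observed = [
    (date(2021, 3, 4), 30.0),
    (date(2021, 3, 5), 31.5),
    (date(2021, 3, 8), 29.0),
]
grid = to_daily_grid(observed)
missing_days = [day for day, price in grid if price is None]
print(missing_days)
```

After this step every day between the first and last observation appears exactly once, so the problem reduces to filling marked gaps on an equidistant grid.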
“An anomaly is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism” [
28]. Electricity price time series are prone to anomalies, which can occur as a consequence of excess demand, power outages, communication failures, activation of circuit breakers at substations, meter malfunctions and other causes [
29,
30]. The liberalization process of the electricity sector has significantly increased the price volatility [
31]. Electricity price time series behave erratically: power prices show high and stochastic volatility, jumps and pronounced spikes, and a strong mean-reversion component that pulls prices back after a jump or a spike has occurred [
32]. Specifically, electricity price time series are characterized by normal stable periods in which they fluctuate around a long-run mean and turbulent price movements in which the dynamics are affected by jumps and short-lived spikes of large magnitude. These complex dynamics produce non-normal empirical distributions of log-returns with high volatility, non-zero skewness and high kurtosis [
33].
In this paper we provide a general methodology to detect the stable price dynamics and decouple them from the turbulent dynamics in which jumps and spikes, i.e., anomalous price movements, occur. Our starting point is to consider the reconstructed no-gap time series, filled by the missForest algorithm, as affected by anomalies that we are going to identify and remove. The anomaly identification process is carried out on the filled original time series of electricity prices by using the isolation forest (or iForest) algorithm [
34], an anomaly detection method that isolates anomalies instead of profiling normal data points, as in the most common techniques [
35]. Using this unsupervised method, we can detect abrupt changes or novelties in price time series without relying on a “universal” definition, since no “standard” reference for anomalies in electricity price time series is available. Given the lack of “good” (non-anomalous) benchmark time series, we prefer an agnostic approach. Moreover, because we want to minimize the impact of parameter settings on the anomaly detection process, iForest is a particularly suitable algorithm [
36,
37,
38]. Once identified, anomalies can be removed from the dynamics. At the end of this process, the additional gaps created by removing the anomalous regions of the dynamics are filled again by the missForest imputation algorithm. In this way, we obtain: (i) a complete and clean time series describing the stable dynamics of power prices; (ii) a separation between the stable and the turbulent dynamics, which feeds the subsequent stochastic analysis. This is the first contribution of the present paper to the literature.
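To illustrate the isolation principle, the following is a simplified, one-dimensional sketch and not the implementation used in this paper. It assumes scalar inputs, grows a fresh random tree lazily for each point instead of storing a forest, and uses hypothetical data; the names `c_factor`, `path_length` and `anomaly_scores` are introduced here for illustration. Points that are separated from the rest after very few random splits receive scores close to 1. In practice a library implementation, such as `IsolationForest` in scikit-learn, would normally be preferred.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(x, sample, rng, depth, max_depth):
    """Number of random splits needed to isolate x within a 1-D sample."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth + c_factor(len(sample))
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth + c_factor(len(sample))
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as x.
    side = [v for v in sample if (v < split) == (x < split)]
    return path_length(x, side, rng, depth + 1, max_depth)

def anomaly_scores(values, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1]; values near 1 indicate easily isolated points."""
    rng = random.Random(seed)
    psi = min(sample_size, len(values))
    max_depth = math.ceil(math.log2(psi))
    scores = []
    for x in values:
        total = 0.0
        for _ in range(n_trees):
            sample = rng.sample(values, psi)
            total += path_length(x, sample, rng, 0, max_depth)
        scores.append(2.0 ** (-(total / n_trees) / c_factor(psi)))
    return scores

# Hypothetical log-returns: 60 small fluctuations followed by one spike.
log_returns = [0.01 * math.sin(i) for i in range(60)] + [0.9]
scores = anomaly_scores(log_returns)
most_anomalous = scores.index(max(scores))
print(most_anomalous, round(scores[most_anomalous], 2))
```

No threshold on the size of a price move is specified anywhere in the sketch: the spike stands out only because it is isolated after far fewer splits than the other points.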
Several models have been proposed in the literature to describe the dynamics of power prices observed in real markets. Since the seminal paper by Lucia and Schwartz [
39], the literature on this topic has grown exponentially. Mean-reverting jump-diffusion processes have been proposed [
40,
41] to account for the jumpy and spiky behavior of power prices. Regime-switching processes [
42] have also been used with the aim of modeling the stable dynamics and the turbulent dynamics of power prices separately [
43,
44,
45]. Compared to more complex regime-switching models, jump-diffusion models offer a good compromise between mathematical tractability and the physical description of the price dynamics. Their use can be considered as the simplest modeling methodology to describe non-Gaussian processes with stochastic volatility. However, the estimation of jump-diffusion models on market data requires some care in order to account properly for the various components of the dynamics [
32]. When estimating a jump-diffusion model, the main difficulty is to determine which price variations are caused by jumps and which ones are caused by the diffusion component of the process. The easiest and most common way to deal with this problem is to fix a threshold according to which price variations are considered to be caused by jumps and spikes [
44]. In this case, the threshold must be set according to some well-defined (but arbitrary) criteria [
46]. An alternative approach is to estimate the jump-diffusion model by maximum likelihood without filtering jumps first [
40]. However, this technique reproduces the standard deviation of log-returns well but underestimates their kurtosis [
45]. The iForest algorithm is suitable for overcoming these difficulties. The decoupling of the price dynamics into stable and turbulent motion, obtained with the machine learning techniques proposed in this paper, allows us to provide a suitable estimation procedure for both the diffusion component and the jump component of the model, one that makes use of the full information contained in both the stable and the turbulent dynamics. The estimation results show an interesting agreement with market data. This is the second contribution to the literature.
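For contrast, the threshold-based filtering mentioned above can be sketched as a recursive filter. This is an illustration of that conventional approach, not of the method proposed in this paper; the multiplier `k = 3.0`, the function name `threshold_jump_filter` and the data are assumptions made for the example.

```python
import math
import statistics

def threshold_jump_filter(returns, k=3.0, max_iter=50):
    """Flag returns farther than k standard deviations from the mean.

    The mean and standard deviation are re-estimated on the unflagged
    returns until no new jump is flagged.
    """
    is_jump = [False] * len(returns)
    for _ in range(max_iter):
        kept = [r for r, j in zip(returns, is_jump) if not j]
        mu = statistics.mean(kept)
        sigma = statistics.pstdev(kept)
        changed = False
        for i, r in enumerate(returns):
            if not is_jump[i] and abs(r - mu) > k * sigma:
                is_jump[i] = True
                changed = True
        if not changed:
            break
    return is_jump

# Hypothetical log-returns: small fluctuations with one spike (up, then down).
small = [0.02 * math.sin(i) for i in range(40)]
returns = small[:20] + [0.8, -0.75] + small[20:]
flags = threshold_jump_filter(returns)
jump_indices = [i for i, flagged in enumerate(flags) if flagged]
print(jump_indices)
```

The outcome depends directly on the choice of `k`, which is exactly the arbitrariness that an unsupervised detector is meant to avoid.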
To our knowledge, this is the first study in which unsupervised machine learning techniques have been employed to detect jumps and spikes in power price time series, thereby making it possible to describe the observed dynamics accurately using jump-diffusion models. The workflow of the whole methodology is depicted in
Figure 1.
The proposed approach has several advantages and potential applications. Specifically, our methodology makes it possible to improve data quality by using a data-driven approach, i.e., an unsupervised technique having as few parameters as possible, for the imputation of missing data and for the detection of anomalies in the dynamics [
47,
48,
49]. Moreover, accurately modeling electricity price dynamics using simple models which can be easily calibrated on high quality data is essential for all the power market players [
31]. The jump-diffusion model proposed in this paper is a short-term model and the short-term modeling of electricity prices is a central topic for both traders and producers in their attempts to hedge financial risk due to the unpredictability of power prices [
50] by using power derivatives as well [
51]. In this regard, having good short-term models of electricity prices capturing the first four central moments of log-return empirical distributions is of crucial importance for pricing power derivatives [
52]. Moreover, modeling power prices over longer time horizons, ranging from a few years to decades, is strategically important for energy companies, in their efforts toward evaluating investments in capacity expansion and generating new technologies, and for policy makers involved in the energy planning decision making processes. In this regard, the proposed methodology can be employed as a long-term forecasting approach that allows us to derive the long-run behavior of power prices from their short-term dynamics. In the presence of a mean-reverting component, in fact, the probability distributions of power prices tend toward stationary long-run probability distributions [
53]. In this way, the proposed approach also develops a robust link between the short-term and the long-term behavior of electricity prices.
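The link between short-term dynamics and long-run behavior can be illustrated in the simplest mean-reverting case. The sketch below assumes a pure Ornstein-Uhlenbeck diffusion without jumps, dX = kappa (mu - X) dt + sigma dW, which is a simplification and not the full model of this paper; the parameter values are hypothetical. For this process the conditional mean and variance are known in closed form, and they converge to mu and sigma^2 / (2 kappa) as the horizon grows.

```python
import math

def ou_moments(x0, mu, kappa, sigma, t):
    """Conditional mean and variance of an Ornstein-Uhlenbeck process at time t."""
    mean = mu + (x0 - mu) * math.exp(-kappa * t)
    var = sigma ** 2 / (2.0 * kappa) * (1.0 - math.exp(-2.0 * kappa * t))
    return mean, var

# Hypothetical parameters (log-price units, time in years).
x0, mu, kappa, sigma = 5.0, 3.5, 2.0, 0.8
long_run_var = sigma ** 2 / (2.0 * kappa)

for t in (0.1, 1.0, 10.0):
    mean, var = ou_moments(x0, mu, kappa, sigma, t)
    print(t, round(mean, 4), round(var, 4))
```

The starting value x0 is forgotten at rate kappa, so parameters estimated from short-term dynamics determine the stationary long-run distribution.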
The paper is organized as follows.
Section 2 discusses the data processing methodology.
Section 3 illustrates in some detail the mean-reverting jump-diffusion model used to describe the dynamics of power prices and the estimation procedure.
Section 4 concludes. A comparison between the use of the missForest algorithm for gap filling purposes and a more traditional approach based on moving average techniques is provided in
Appendix A.
4. Concluding Remarks
In this paper we provided a general methodology to fill missing data in time series with irregular observation times and to detect anomalies in the dynamics. Our approach is based on machine learning ensemble techniques. In particular, the missForest imputation algorithm was used to fill in the gaps of the time series, and the isolation forest algorithm was used to detect anomalies in the time behavior. Moreover, the missForest algorithm was also used to fill the additional gaps created by removing anomalies, in order to obtain a complete and clean time series describing the stable dynamics of power prices. The decoupling of the price dynamics into stable and turbulent motion allowed us to define a suitable mean-reverting jump-diffusion model of power prices and a two-step estimation procedure for the model parameters that uses the full information contained in both the stable time series and the anomalous regions of the dynamics. The same two-step procedure was used to estimate both the short-term and the long-term models.
The filling and decoupling technique proposed in this paper appears to be a powerful tool for investigating the complex dynamics of power prices observed in real markets. It allows one to distinguish normal periods in which prices fluctuate around the long-run mean from turbulent movements of power prices characterized by jumps and spikes. Within this framework, the decoupling technique makes it possible to estimate jump-diffusion stochastic models of power prices accurately. The obtained results show interesting agreement with empirical data.
Moreover, ensemble methods allowed us to highlight some similarities in the electricity price dynamics observed in different power markets. From this point of view, unsupervised machine learning techniques can be used to study the dynamics of power market prices as a whole, instead of taking them individually, thereby considering common factors and similarities. We leave these topics to future investigations.
Finally, let us remark that, although our analysis focused on power market prices, the proposed methodology is general and can be applied to very different contexts ranging from physical to social sciences.