This section presents the results of the experiments investigating different aspects of the hybrid VMD-BiLSTM model over short- and mid-term forecasting horizons.
5.2. Stationarity Analysis
The stationarity of a time series can be assessed using the Augmented Dickey–Fuller (ADF) [54] and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) [55] tests. The ADF test assumes a unit root in the series (null hypothesis), rejecting this if the test statistic is below a critical value, thereby indicating stationarity. Conversely, the KPSS test assumes trend stationarity (null hypothesis), rejecting it for a non-stationary unit root process if the test statistic exceeds the critical value.
Table 4 presents the average results of the ADF and KPSS tests run on the pre-processed SIV sequences (i.e., the main input data to the forecasting model) for the fall/winter and spring/summer seasons. The tests were run using sequences without decomposition, as well as with EMD and VMD decomposition. We investigated the stationarity of the decomposed sequences using different multiples of eight for the VMD decomposition level (K), corresponding to 7, 15, 23, 31, 39, 47, 55, 63, 71, and 79 IMFs in addition to the residual (only specific K values are reported in the table). All reported p-values and test statistic values were averaged over all the sequences of the same case (e.g., the test statistic was averaged over 8819 sequences corresponding to the no-decomposition case for fall/winter). The results show that raw volume sequences were usually not stationary (refer to the test results on sequences from the fall/winter season using both tests and on spring/summer sequences using the KPSS). Additionally, decomposing using VMD can provide stationary sequences. For instance, the test statistics using ADF were lower than their corresponding critical values for all considered K values using the fall/winter sequences and for using the spring/summer sequences. Using KPSS, the test statistics were lower than the critical values for for both seasons. Further, the higher the K, the wider the difference between the two values until a certain K. This demonstrates that greater degrees of stationarity can result from higher decomposition levels.
In the following subsections, the forecasting performance reported for all conducted experiments is discussed.
5.3. Benchmarking the Proposed Hybrid Model
As can be seen in Table 5, the historical mean model generally achieved superior forecasting performance compared to all considered baseline conventional models. In particular, the GBDT and SVR models yielded the worst performance among all models for both forecasting horizons. In addition, the historical mean outperformed all deep learning forecasting models in the absence of information about the SIV, as evidenced in Table 6 (i.e., time only). We note that this specific case, where the deep learning models were trained using time-only inputs, can be considered another baseline that more advanced models should always outperform. Among these deep learning models, TST and InceptionTime consistently exhibited the poorest performance for short-term and mid-term forecasting, respectively. The superiority of the historical mean model highlights the usefulness of immediate past values in predicting future ones for both forecasting horizons.
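The historical mean baseline can be sketched in a few lines of numpy; since the paper does not spell out its averaging scheme, a day-wise climatological mean over past years is assumed here for illustration:

```python
import numpy as np

def historical_mean_forecast(history: np.ndarray) -> np.ndarray:
    """Forecast the next 'year' as the day-wise mean of previous years.

    history: array of shape (n_years, n_days) of past SIV observations.
    Returns an array of shape (n_days,): the climatological daily mean.
    """
    return history.mean(axis=0)

# Toy example: 3 "years" of 4 daily values each.
past = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.2, 2.2, 3.2, 4.2],
                 [0.8, 1.8, 2.8, 3.8]])
forecast = historical_mean_forecast(past)
print(forecast)  # ≈ [1. 2. 3. 4.]
```

Despite its simplicity, such a climatological forecast is a demanding reference whenever the target series has a strong, repeatable seasonal cycle, as the results above show.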
In the absence of a decomposition technique, better performance can be achieved using the SIV and time information. In this case (i.e., volume and time), all deep learning models significantly outperformed all baselines. Regardless of the forecasting horizon and season, these models could learn from the patterns in the historical SIV and time data, thus achieving overall average improvements of , , , , and in terms of the RMSE, MAPE, CV, ACC, and SS compared with the previous time-only case. More specifically, BiLSTM performed the best with these input data during the fall/winter season, with overall average improvements of , , and in terms of the RMSE, MAPE, and CV for short-term forecasting and , , and in terms of the same metrics for mid-term forecasting. During the spring/summer season, TST performed the best, with overall average improvements of , , and in terms of the RMSE, MAPE, and CV for short-term forecasting and , , and in terms of the same metrics for mid-term forecasting.
When processing the SIV with a decomposition technique, even better performance was generally achieved. The overall average improvements achieved using either decomposition technique were , , and in terms of the RMSE, MAPE, and CV over all the forecasting horizons, seasons, and models. Nevertheless, the use of EMD added few to no improvements to the forecasting performance, generally yielding performance comparable to the previous case (i.e., volume and time). This was most evident with short-term forecasting, where the smallest overall average improvements were achieved, with , , and in terms of the RMSE, MAPE, and CV during fall/winter over all three models. During the spring/summer season, deteriorations from the same previous case were observed, with overall averages of , , and in terms of the RMSE, MAPE, and CV over all three models. However, longer forecasting horizons seemed to better exploit the EMD-decomposed input, as generally larger improvements were achieved, with , , and during the fall/winter season and , , and during the spring/summer season in terms of the same metrics over the three models. Nevertheless, errors often accumulated with longer forecasting horizons. These underwhelming performance values can be attributed to EMD’s drawbacks, such as noise sensitivity and the inability to separate closely spaced high-frequency components in the input data. Even so, the extracted IMFs still provided valuable information to the forecasting model for further forecasting horizons.
The best performance was achieved using the VMD-processed inputs (i.e., and time). Similar to the second case (i.e., volume and time), all models could better learn the variations in the SIV. Notably, compared with the same second case, overall average improvements of , , , , and were achieved in terms of the RMSE, MAPE, CV, ACC, and SS over all the forecasting horizons, seasons, and models. These results showcase the capabilities of VMD in capturing intrinsic time-frequency patterns, which is essential for better forecasts. Similar to the and time case, better overall performance was observed with longer forecasting horizons. In particular, the overall average improvements for short-term forecasting were , , and in terms of the RMSE, MAPE, and CV during fall/winter and , , and during spring/summer over all three models. Mid-term forecasting saw overall average improvements of , , and during fall/winter and , , and during spring/summer in terms of the RMSE, MAPE, and CV over all three models. Nevertheless, higher errors were associated with longer horizons.
Specifically, the proposed BiLSTM model consistently achieved the best performance in this case ( and time) for both seasons and forecasting horizons. In particular, BiLSTM achieved the highest overall average improvements compared with the first case (i.e., time only), with for short-term horizons and for mid-term horizons in terms of the RMSE, MAPE, and CV metrics. TST and InceptionTime followed BiLSTM with smaller improvements, where TST achieved average improvements of for short-term forecasting and for mid-term forecasting for the same metrics and over all seasons. InceptionTime achieved average improvements of for short-term forecasting and for mid-term forecasting for the same metrics and seasons. In addition, both of these models appeared to struggle to efficiently assimilate patterns from the decomposed sequences and accurately forecast SIV anomalies, yielding ACC averages that were consistently less than for TST and for InceptionTime for all considered input cases. A variety of factors contributed to these lower performance values. For instance, although it incorporates a multihead self-attention mechanism that can process past and future information in the input, TST usually requires a larger amount of training data than that required by LSTM to learn the underlying patterns and make accurate predictions. Nonetheless, these results emphasize the importance of the BiLSTM network’s memory cells in capturing and retaining long-term information from the input.
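For readers unfamiliar with the architecture, the sketch below shows an illustrative BiLSTM regressor in PyTorch that maps a window of past values (plus time features) to a multi-step forecast; the layer sizes and horizon are placeholders, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BiLSTMForecaster(nn.Module):
    """Illustrative BiLSTM regressor: maps an input window of past values
    (plus optional time features) to a multi-step forecast."""

    def __init__(self, n_features: int, hidden: int = 64, horizon: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # 2 * hidden because forward and backward hidden states are concatenated.
        self.head = nn.Linear(2 * hidden, horizon)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)          # (batch, seq_len, 2 * hidden)
        return self.head(out[:, -1])   # forecast from the last time step

model = BiLSTMForecaster(n_features=3, horizon=7)
x = torch.randn(16, 30, 3)             # 16 windows of 30 days, 3 features each
y_hat = model(x)
print(y_hat.shape)                     # torch.Size([16, 7])
```

The bidirectional pass lets the recurrent cells integrate the input window from both directions, which is the property the discussion above credits for BiLSTM's ability to capture and retain long-term information.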
These observations can be made about the data in Figure 8, where different examples of forecasting the SIV over short- and mid-term horizons using the proposed hybrid VMD-BiLSTM model are showcased. In particular, the forecasts in both seasons closely follow all target variations over multiple horizons. We note that in Figure 8a, the forecasts using the EMD variation of the proposed technique follow the general trend of the target SIV sequence for September 2016. However, the proposed technique still performs better qualitatively, as it was able to produce forecasts following the change in the slope of the target.
In this study, we only considered the historical data of the pan-Arctic SIV and time information. However, as daily weather features such as air temperature and humidity are readily available in open datasets, such data could be employed as additional inputs to the forecasting model to improve its performance over shorter or longer horizons. Nevertheless, the proposed hybrid model has proven to be more accurate than the other forecasting techniques; however, the VMD decomposition level used within it has yet to be optimized. The next subsection discusses the results of the experiment identifying the optimal K value.
5.4. Experiment I: Impact of VMD Decomposition Level on Forecasting Performance
As can be seen in Figure 9, the forecasting performance of the proposed hybrid model was impacted by the VMD decomposition level (i.e., K). The forecasting performance generally improved with higher values of K up to a certain value, after which it worsened. This pattern can be seen for every forecasting horizon and season. The best forecasting performance during the fall/winter season necessitated higher values than , which was employed in the corresponding case in Experiment I (i.e., and time), with for short-term forecasting and for mid-term forecasting. These decomposition levels enabled overall average improvements of , , and for short-term forecasting and , , and for mid-term forecasting in terms of the RMSE, MAPE, and CV compared with the original case of (i.e., and time). The opposite trend can be seen in the other season, where fewer decomposition levels were sufficient to achieve the best performance. In particular, in spring/summer, a decomposition level of for short-term forecasting achieved overall average improvements of , , and , compared with the same corresponding case (i.e., and time) in terms of the RMSE, MAPE, and CV. For mid-term forecasting during the same season, the initial was sufficient to achieve the best performance.
Forecasting longer horizons generally necessitated higher VMD decomposition levels. The forecasting performance was improved with a smaller value of K for short-term forecasting, with and for the fall/winter and spring/summer seasons, respectively. Almost double those values were needed for mid-term forecasting, with and for the fall/winter and spring/summer seasons, respectively. This observation can be attributed to the increased uncertainty related to increased forecasting horizons. Thus, further horizon forecasting requires more detailed time-frequency patterns that can be extracted using VMD to achieve optimal performance.
As higher decomposition levels were investigated, this experiment required more time and computational resources. In particular, the time and computation costs needed to train the model with no decomposition involved were multiplied by , the number of VMD decompositions (and residual) chosen, to train the same model with VMD-decomposed inputs using the “divide and conquer” strategy (refer to the results of Experiment II for more details). Nevertheless, this experiment must be conducted whenever such a decomposition technique is integrated into a learning process to ensure optimal results. Moreover, the choice of K values is critical. In this experiment, multiples of eight were selected based on previous works in the literature and several preliminary trials. Specifically, values were initially investigated but were dropped after observing few to no improvements in forecasting performance. Hence, multiples of eight provided interpretable results with sufficiently large improvements as the decomposition level steadily increased.
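The selection procedure described above amounts to a grid search over candidate K values on a validation set. The sketch below uses hypothetical error values (illustrative only, not the paper's numbers) to show the bookkeeping:

```python
# Hypothetical validation RMSEs for candidate decomposition levels K
# (illustrative values only; the paper's actual errors are not reproduced here).
val_rmse = {7: 0.42, 15: 0.33, 23: 0.28, 31: 0.25, 39: 0.26,
            47: 0.30, 55: 0.34, 63: 0.37, 71: 0.41, 79: 0.45}

# Error falls with K up to an optimum, then rises again, as in Figure 9:
# pick the K minimizing the validation error.
best_k = min(val_rmse, key=val_rmse.get)
print(best_k)  # 31
```

In practice, each entry of `val_rmse` requires a full training run of the hybrid model, which is why the experiment's cost scales with the number of candidate levels evaluated.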
The proposed hybrid forecasting model can provide better forecasting performance with an optimized VMD decomposition level. However, it is important to conduct a robustness check to quantify the usefulness of employing the “divide and conquer” strategy.
5.5. Experiment II: Evaluating the Robustness of VMD-Based Forecasting Strategies
From Table 7 and Figure 10, it is apparent that using the “divide and conquer” strategy produced better overall forecasting performance compared to the “all-in-one” strategy. In particular, average degradations of , , and in terms of the RMSE, MAPE, and CV were observed with the latter, regardless of the forecasting horizon and season. These results provide empirical proof of the usefulness of dividing the forecasting task into sub-forecasting tasks using separate deep learning models. Moreover, the errors over each horizon showcase how the model using the “all-in-one” strategy struggled to deal with the increased uncertainty in further horizons (e.g., mid-term forecasting during the spring/summer season). Although a similar observation can be made for the other case, it was significantly attenuated, as only minor errors accumulated over the horizons. Such outcomes prove that dividing the forecasting task into specific and separate tasks can reduce the related uncertainty and generate more accurate forecasts.
However, the proposed hybrid technique implements the “divide and conquer” strategy, which requires more time for model learning. More specifically, training the hybrid model using this strategy can be seen as equivalent to training the same model multiple times using the “all-in-one” strategy. As seen in Figure 6, this increase in time and complexity scales with , the number of decompositions (and residual), reflecting the training of the models with partial input data (e.g., an ith IMF with a corresponding time sequence). Nevertheless, the granularity of the data (e.g., daily, in the case of this study) and the availability of higher-performance computation devices can enable the successful application of such a forecasting strategy in real-world scenarios.
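The bookkeeping of the two strategies can be sketched in numpy. Here, naive persistence forecasters stand in for the trained models (an assumption for illustration only): under “divide and conquer”, one forecaster handles each component and the component forecasts are summed, whereas “all-in-one” applies a single forecaster to the raw series.

```python
import numpy as np

rng = np.random.default_rng(1)
K, T = 4, 200
# Stand-in for VMD output: K components that sum to the original series.
components = rng.normal(size=(K, T))
signal = components.sum(axis=0)

def persistence_forecast(series: np.ndarray) -> float:
    """Naive stand-in for a trained model: predict the last observed value."""
    return series[-1]

# "Divide and conquer": one forecaster per component, then aggregate.
divide_and_conquer = sum(persistence_forecast(c) for c in components)

# "All-in-one": a single forecaster on the raw series.
all_in_one = persistence_forecast(signal)

# With a linear forecaster the two coincide exactly; trained nonlinear
# models differ, which is where the performance gap in Table 7 arises.
print(np.isclose(divide_and_conquer, all_in_one))
```

The sketch also makes the cost argument concrete: the per-component strategy runs the forecaster K (plus residual) times per prediction, matching the training-cost multiplication discussed above.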
Further results are reported in Table 8, which displays the performance of the proposed hybrid model, computed and averaged for each month of the year over the whole testing set. Several notable trends can be observed. In the case of short-term forecasts, the model exhibited superior performance for January, as evidenced by the lowest RMSE, MAPE, and CV values. This result indicates high accuracy and consistency in the model’s forecasts for this month. However, it is worth highlighting March and September, when the model’s performance dipped. During these transitional months, when sea ice begins to melt in March and starts to form in September, forecasting can be particularly challenging due to the high variability and dynamics of the sea ice processes. While experiencing relatively higher errors during these periods, the model still managed to provide reasonably accurate forecasts, illustrating its resilience in handling complex Arctic conditions.
For mid-term forecasts, similar trends can be observed. The model continued to demonstrate the strongest performance for January with the lowest RMSE and CV values, whereas for February, it yielded the lowest MAPE, signifying minimal relative errors. Despite the elevated error values for August, September, November, and December, the model still maintained high ACC and SS values, especially for June and July.
The high ACC values for the majority of months indicate that the model correctly captured the directionality of the anomalies. Moreover, the SS values suggest that the proposed technique performed well compared to climatology-based reference forecasts, especially during the summer months when the sea ice melt is in its advanced stage.
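For reference, the two skill metrics can be computed as in the sketch below, which assumes their standard definitions (ACC as the correlation between forecast and observed anomalies relative to climatology, and SS as one minus the ratio of the forecast MSE to the climatological MSE); the paper's exact formulas are assumed to match:

```python
import numpy as np

def acc(forecast, observed, climatology):
    """Anomaly correlation coefficient: correlation between forecast
    and observed anomalies relative to the climatological reference."""
    fa, oa = forecast - climatology, observed - climatology
    return (fa @ oa) / np.sqrt((fa @ fa) * (oa @ oa))

def skill_score(forecast, observed, climatology):
    """Skill score against climatology: 1 - MSE(forecast) / MSE(climatology).
    Positive values mean the model beats the climatological forecast."""
    mse_f = np.mean((forecast - observed) ** 2)
    mse_c = np.mean((climatology - observed) ** 2)
    return 1.0 - mse_f / mse_c

clim = np.array([10.0, 9.0, 8.0, 7.0])   # toy daily climatology
obs = np.array([10.5, 9.2, 7.5, 6.8])    # toy observations
perfect = obs.copy()                      # a perfect forecast

print(acc(perfect, obs, clim))            # 1.0 (anomalies match exactly)
print(skill_score(perfect, obs, clim))    # 1.0 (zero forecast error)
```

A positive ACC indicates that the forecast anomalies point in the same direction as the observed ones, which is exactly the directionality property credited to the model above; SS > 0 quantifies the margin over the climatology-based reference.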