Temporal Attention-Enhanced Stacking Networks: Revolutionizing Multi-Step Bitcoin Forecasting
Abstract
1. Introduction
2. Literature Review
2.1. Cryptocurrency Prediction Models
2.2. Ensemble Methods in Financial Forecasting
2.3. Attention Mechanisms in Time–Series Forecasting
3. Methodology
3.1. Data Collection and Preprocessing
- Data Cleaning: The dataset was verified to contain no missing values, ensuring its completeness. The Date column was converted into a time–series index to preserve the temporal structure, which is critical for sequential modeling.
- Feature Scaling: All features were normalized using min–max scaling, $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, which maps each feature to the range $[0, 1]$.
- Sequence Creation: A sliding window approach was utilized to transform the dataset into input–output sequences. For a window of length $w$ and forecast horizon $h$, the input at timestamp $t$ was constructed as $X_t = [x_{t-w+1}, \dots, x_t]$, with the corresponding target $y_{t+h}$ (a sketch of this step follows the list).
- Data Splitting: The dataset was divided into training (70%), validation (15%), and test (15%) sets, maintaining chronological order to prevent information leakage and ensure robust evaluation.
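For illustration, a minimal sketch of this preprocessing pipeline (scaling, windowing, and chronological splitting) is given below. The file name, the `Close` target column, the 30-day window, and the 1-day horizon are assumptions made for the sketch rather than values restated from this section.

```python
# Minimal sketch of the preprocessing pipeline described above.
# The window length (w=30), forecast horizon (h=1), file name, and column names
# are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("btc_daily.csv", parse_dates=["Date"]).set_index("Date").sort_index()

# Min-max scaling of all features to [0, 1]
# (for a stricter setup, fit the scaler on the training portion only)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df.values)

def make_sequences(data, target_col, window=30, horizon=1):
    """Sliding-window transform: X[i] = rows t-window+1..t, y[i] = target at t+horizon."""
    X, y = [], []
    for t in range(window - 1, len(data) - horizon):
        X.append(data[t - window + 1 : t + 1])
        y.append(data[t + horizon, target_col])
    return np.array(X), np.array(y)

X, y = make_sequences(scaled, target_col=df.columns.get_loc("Close"))

# Chronological 70/15/15 split (no shuffling) to avoid look-ahead leakage
n = len(X)
i_train, i_val = int(0.70 * n), int(0.85 * n)
X_train, y_train = X[:i_train], y[:i_train]
X_val,   y_val   = X[i_train:i_val], y[i_train:i_val]
X_test,  y_test  = X[i_val:], y[i_val:]
```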
3.2. Model Architecture
3.2.1. Base Learners
- Long Short-Term Memory (LSTM): LSTM networks capture long-term dependencies by maintaining a cell state across time steps. The gate equations are
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f),\quad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i),\quad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o),$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c),\quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\quad h_t = o_t \odot \tanh(c_t).$$
- Gated Recurrent Unit (GRU): The GRU simplifies the LSTM by merging the hidden and cell states into a single state:
$$z_t = \sigma(W_z[h_{t-1}, x_t] + b_z),\quad r_t = \sigma(W_r[h_{t-1}, x_t] + b_r),$$
$$\tilde{h}_t = \tanh(W_h[r_t \odot h_{t-1}, x_t] + b_h),\quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.$$
- One-Dimensional Convolutional Neural Network (CNN): CNNs extract local temporal patterns by sliding convolutional filters over the input sequence:
$$y_t = f\!\left(\sum_{i=0}^{k-1} w_i \, x_{t-i} + b\right),$$
where $k$ is the kernel size and $f$ is a non-linear activation.
- Temporal Convolutional Network (TCN): TCNs use dilated causal convolutions to model both short-term and long-term dependencies:
$$y_t = f\!\left(\sum_{i=0}^{k-1} w_i \, x_{t - d \cdot i} + b\right),$$
where $d$ is the dilation factor, which grows with network depth to enlarge the receptive field.
3.2.2. Temporal Attention Mechanism
3.2.3. Stacking Framework
3.3. Training Procedure
- Base Learner Training: Each base learner was trained independently to minimize the mean squared error (MSE):
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2.$$
- Meta-Learner Training: The meta-learner was trained on the stacked base-learner predictions $\hat{y}_i^{(1)}, \dots, \hat{y}_i^{(M)}$ to minimize
$$\mathcal{L}_{\text{meta}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - g\!\left(\hat{y}_i^{(1)}, \dots, \hat{y}_i^{(M)}\right)\right)^2,$$
where $g(\cdot)$ denotes the meta-learner (a sketch of this stacking step follows the list).
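The stacking step can be sketched as follows, assuming the four base learners have already been trained as Keras models and that the meta-learner is fit on their held-out (validation-set) predictions; the variable names are illustrative, and this is not a verbatim reproduction of the original training code.

```python
# Minimal stacking sketch: base-learner forecasts become meta-features for XGBoost.
# lstm_model, gru_model, cnn_model, tcn_model are assumed to be trained Keras models;
# X_val/X_test and y_val/y_test come from the chronological split above.
import numpy as np
from xgboost import XGBRegressor

base_learners = {"lstm": lstm_model, "gru": gru_model, "cnn": cnn_model, "tcn": tcn_model}

def stacked_predictions(models, X):
    """Column-stack each base learner's forecasts into one meta-feature matrix."""
    return np.column_stack([m.predict(X, verbose=0).ravel() for m in models.values()])

Z_val = stacked_predictions(base_learners, X_val)    # meta-training inputs
Z_test = stacked_predictions(base_learners, X_test)  # meta-evaluation inputs

# Meta-learner trained on held-out predictions to avoid re-fitting the training-set fit
meta_learner = XGBRegressor(objective="reg:squarederror", random_state=42)
meta_learner.fit(Z_val, y_val)
y_meta = meta_learner.predict(Z_test)
```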
3.4. Evaluation Metrics
- Root Mean Squared Error (RMSE):
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$
- Mean Absolute Error (MAE):
$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$
(both metrics are computed in the short snippet after this list).
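Both metrics follow directly from their definitions; a short NumPy implementation is shown below.

```python
# RMSE and MAE as defined above (y_true, y_pred are 1-D NumPy arrays).
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))
```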
3.5. Hyperparameter Tuning
3.5.1. Grid Search Procedure
- Define the Hyperparameter Search Space: For each model, a hyperparameter grid was defined based on its architecture and requirements. For the meta-learner (XGBoost), the search space included the learning rate, maximum depth, number of estimators, subsample ratio, and column sampling rate. The base learners (LSTM, GRU, CNN, TCN) had their own tailored search spaces, covering, for example, the number of hidden units, dropout rates, and convolutional filter sizes.
- Grid Search Across Combinations: A grid search was performed over all combinations of hyperparameters in the defined search space, ensuring an exhaustive evaluation of model configurations.
- Evaluate Using k-Fold Cross-Validation: For each hyperparameter combination, the training data were split into three folds. The model was trained on two folds and validated on the remaining fold, and this process was repeated so that each fold served once as the validation fold; the validation losses were then averaged.
- Select Optimal Hyperparameters: The configuration that minimized the average validation loss was selected. These optimal hyperparameters were then used to train the final model on the full training dataset before evaluation on the test set.
3.5.2. Implementation Details
```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the model
model = XGBRegressor(random_state=42, objective="reg:squarederror")

# Define the hyperparameter grid
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
    "subsample": [0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}

# Set up GridSearchCV with 3-fold cross-validation
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # RMSE as the evaluation metric
    cv=3,          # 3-fold cross-validation
    verbose=1,
    n_jobs=-1,     # use all available processors
)

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, Y_train)

# Retrieve the best hyperparameters and corresponding RMSE
best_params = grid_search.best_params_
best_rmse = -grid_search.best_score_

print("Best Hyperparameters:", best_params)
print("Best RMSE on Validation Set:", best_rmse)

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_test_pred = best_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(Y_test, y_test_pred))

print("Test RMSE with Best Hyperparameters:", test_rmse)
```
3.5.3. Loss Function
3.6. Experimental Setup
4. Results
4.1. Data Summary
4.2. Model Architectures
4.2.1. LSTM Architecture
- LSTM Layer: The LSTM layer consists of 50 units, enabling it to capture both short-term and long-term dependencies in the sequential data. The total number of trainable parameters in an LSTM layer with $n$ units and $m$ input features is
$$\text{Params}_{\text{LSTM}} = 4 \times \left[(m + n) \times n + n\right].$$
- Dropout Layer: A dropout rate of 20% is applied to prevent overfitting by randomly deactivating neurons during training. This layer does not introduce any additional parameters.
- Dense Layer: The final dense layer outputs a single predicted value, with $n \times 1 + 1$ parameters (weights plus bias); for this model, the dense layer contains 51 parameters. A Keras sketch of this architecture follows the list.
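A Keras sketch consistent with the layer description above (50 LSTM units, 20% dropout, single-output dense layer) is shown below; the window length and number of input features passed to `input_shape` are assumptions.

```python
# Sketch of the LSTM base learner described above (TensorFlow/Keras).
# Units (50), dropout (0.2), and the single-output dense layer follow the text;
# the window length and feature count in input_shape are assumptions.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_lstm(window=30, n_features=5):
    model = Sequential([
        LSTM(50, input_shape=(window, n_features)),
        Dropout(0.2),   # no trainable parameters
        Dense(1),       # 50 weights + 1 bias = 51 parameters
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

build_lstm().summary()  # prints the per-layer parameter counts discussed above
```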
4.2.2. GRU Architecture
- GRU Layer: The GRU layer has 50 units and uses reset and update gates to regulate the flow of information. The trainable parameters of a GRU layer with $n$ units and $m$ input features are
$$\text{Params}_{\text{GRU}} = 3 \times \left[(m + n) \times n + n\right]$$
(implementations with a separate reset-gate bias, such as the Keras default, add a further $3n$).
- Dropout Layer: A dropout rate of 20% is applied for regularization.
- Dense Layer: The final dense layer outputs the prediction, with 51 trainable parameters.
4.2.3. CNN Architecture
- One-Dimensional Convolutional Layers: The first convolutional layer uses 32 filters with a kernel size of 3, and the second uses 64 filters. The parameter count of a Conv1D layer is
$$\text{Params}_{\text{Conv1D}} = (\text{kernel size} \times \text{input channels} + 1) \times \text{number of filters}.$$
For the first layer, the parameter count is 512, and for the second, it is 12,352.
- Max Pooling and Dropout Layers: Max pooling reduces the dimensionality of the feature maps, while dropout prevents overfitting.
- Dense Layers: A hidden dense layer with 64 units has 28,736 parameters, while the final dense layer contains 65 parameters.
4.2.4. TCN Architecture
- Causal Convolutional Layers: The first layer uses 64 filters, with parameters counted in the same way as for a standard Conv1D layer, $(\text{kernel size} \times \text{input channels} + 1) \times \text{filters}$; dilation does not change the count. The parameter count is 1024 for the first layer and 12,352 for the second (dilated) layer. A sketch of this architecture follows the list.
- Dropout Layer: Regularization is applied to prevent overfitting.
- Dense Layers: The hidden dense layer contains 122,944 parameters, while the final dense layer has 65 parameters.
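The sketch below is a simplified stand-in for the TCN base learner, using stacked causal Conv1D layers with increasing dilation; residual blocks and weight normalization from the full TCN formulation are omitted. The kernel size of 3, window length of 30, and 5 input features are assumptions, although with these settings the layer parameter counts match the figures quoted above (1024, 12,352, 122,944, and 65).

```python
# Simplified stand-in for the TCN base learner: causal, dilated Conv1D layers.
# Filter counts follow the text; kernel size, dilation rates, window length, and
# feature count are assumptions.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, Dropout, Flatten, Dense

def build_tcn(window=30, n_features=5):
    model = Sequential([
        Conv1D(64, kernel_size=3, padding="causal", dilation_rate=1,
               activation="relu", input_shape=(window, n_features)),   # 1024 params
        Conv1D(64, kernel_size=3, padding="causal", dilation_rate=2,
               activation="relu"),                                     # 12,352 params
        Dropout(0.2),
        Flatten(),
        Dense(64, activation="relu"),                                  # 122,944 params
        Dense(1),                                                      # 65 params
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```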
4.2.5. Summary of Architectures
4.3. Model Performance Comparison Across Horizons
One-Day Horizon
4.4. Three-Day Horizon
4.5. Seven-Day Horizon
4.6. Temporal Attention-Enhanced Stacking Network (TAESN)
4.6.1. Attention Model Architecture
- Input Layer: The stacked predictions from the base learners serve as the input to the model. The input shape corresponds to the number of base learners.
- Dense Layers: Two fully connected layers with 64 and 32 neurons, respectively, are applied. These layers introduce non-linearity and learn representations that capture the interactions between the stacked predictions.
- Attention Weights Layer: A dense layer with a softmax activation function computes the attention weights, ensuring that they sum to one.
- Attention Multiply Layer: The computed weights are applied to the input predictions to form a weighted combination.
- Output Layer: A single neuron outputs the final aggregated prediction. A minimal sketch of this attention head follows the list.
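A minimal Keras sketch of this attention head, matching the layer description above (an input of four stacked predictions, 64- and 32-unit dense layers, softmax attention weights, multiplicative weighting, and a single output neuron), is given below; the optimizer and activation choices are assumptions.

```python
# Sketch of the attention-based aggregator described above (TensorFlow/Keras).
# Layer sizes follow the text; activations and optimizer are assumptions.
from tensorflow.keras import layers, Model

n_base_learners = 4  # LSTM, GRU, CNN, TCN

inputs = layers.Input(shape=(n_base_learners,), name="stacked_predictions")
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
attention_weights = layers.Dense(n_base_learners, activation="softmax",
                                 name="attention_weights")(x)   # weights sum to one
weighted = layers.Multiply(name="attention_multiply")([inputs, attention_weights])
output = layers.Dense(1, name="aggregated_prediction")(weighted)

taesn_head = Model(inputs, output)
taesn_head.compile(optimizer="adam", loss="mse")
```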
4.6.2. Attention Mechanism
4.6.3. TAESN Performance
4.6.4. Attention Weights Analysis
- One-Day Horizon (Figure 10a): Attention scores reveal a strong focus on LSTM and GRU models, which are well-suited to capturing short-term sequential dependencies. Actionable Insight: Short-term forecasts are primarily influenced by recent price movements, making LSTM and GRU effective for capturing these patterns.
- Three-Day Horizon (Figure 10b): Attention weights are more evenly distributed across all base learners (LSTM, GRU, CNN, and TCN), indicating that both short-term and medium-term patterns contribute significantly. Actionable Insight: Balanced contributions suggest the importance of combining local temporal features (CNN) with broader sequential patterns (LSTM/GRU) and long-term trends (TCN) for medium-term forecasts.
- Seven-Day Horizon (Figure 10c): The TCN model dominates the attention weights, highlighting its strength in modeling long-range temporal dependencies. Actionable Insight: Long-term forecasts benefit most from models that smooth out short-term noise and identify overarching price trends.
4.6.5. Actual vs. Predicted Analysis for TAESN
4.7. Meta-Learner
4.7.1. Implementation Details
4.7.2. Performance Evaluation
4.7.3. Comparison with TAESN
4.7.4. Insights from Actual vs. Predicted Analysis
- For the 1-day horizon, the meta-learner closely followed the actual price trends, with only minor deviations.
- For the 3-day horizon, the meta-learner captured medium-term trends but struggled with periods of volatility.
- For the 7-day horizon, the meta-learner exhibited noticeable lags during rapid price changes, indicating its limitations in modeling long-term dependencies.
4.7.5. Comparative Analysis Across All Models
4.7.6. Discussion of Results
- Superior Short-Term Performance by Meta-Learner (XGB) and GRU: For the 1-day horizon, the Meta-Learner (XGB) achieved the lowest RMSE (98.7) and MAE (67.26), highlighting its effectiveness in combining predictions from multiple base learners to optimize short-term forecasting accuracy. The GRU model, with an RMSE of 100.7 and MAE of 69.4, closely followed and demonstrated strong performance as a single base learner, effectively capturing short-term temporal dependencies in cryptocurrency price data.
- TAESN Consistency Across Horizons: TAESN demonstrated consistent performance across all horizons, maintaining competitive RMSE and MAE values. However, it was outperformed by GRU for the 1-day horizon and TCN for the 7-day horizon, indicating room for optimization in leveraging short- and long-term dependencies effectively. For the 1-day horizon, the underperformance can be attributed to the nature of attention mechanisms, which are optimized to capture dependencies over longer temporal sequences. Short-term horizons are dominated by rapid fluctuations and high noise levels, which reduce the relative benefit of dynamic attention weighting. This suggests that attention mechanisms may struggle to adapt effectively to the high-frequency variations characteristic of 1-day predictions.
- CNN Limitations: CNN consistently underperformed across all horizons, showing higher RMSE and MAE values compared to other models. This underscores its limitations in handling sequential dependencies in time–series forecasting.
- TCN Strength in Long-Term Dependencies: The TCN model performed well for the 7-day horizon, leveraging its architectural design to model long-term temporal patterns. However, its performance for shorter horizons was suboptimal, indicating a specialization for extended predictions.
- Meta-Learner Performance Variability: The meta-learner displayed strong performance for the 1-day horizon but struggled significantly for longer horizons, as evidenced by its high RMSE and MAE values for the 3-day and 7-day forecasts. This suggests that its static weighting mechanism limits adaptability to horizon-specific dynamics.
- Overall Robustness of Attention Mechanisms: Models with attention mechanisms, such as TAESN, showcased robust performance, particularly for multi-horizon forecasting. The dynamic weighting capability provided by attention mechanisms allows these models to adapt to varying temporal patterns effectively.
- Random Walk and ARIMA as Traditional Time–Series Baselines: Both Random Walk and ARIMA produced low error metrics but failed to capture the complexity of the price data; although they perform well in terms of RMSE and MAE, they cannot model the volatility and dependencies inherent in the series. This highlights the need for more sophisticated models such as TAESN to forecast Bitcoin prices effectively.
4.8. Statistical Significance of TAESN’s Performance
5. Conclusions
6. Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Base learner performance for the 1-day forecast horizon.

| Model | RMSE | MAE |
|---|---|---|
| LSTM | 109.8 | 71.5 |
| GRU | 100.7 | 69.4 |
| CNN | 129.9 | 91.0 |
| TCN | 290.0 | 266.4 |
Base learner performance for the 3-day forecast horizon.

| Model | RMSE | MAE |
|---|---|---|
| LSTM | 165.4 | 113.3 |
| GRU | 142.6 | 100.9 |
| CNN | 191.8 | 130.2 |
| TCN | 193.9 | 140.4 |
Base learner performance for the 7-day forecast horizon.

| Model | RMSE | MAE |
|---|---|---|
| LSTM | 229.0 | 174.2 |
| GRU | 220.1 | 155.3 |
| CNN | 286.0 | 218.2 |
| TCN | 193.9 | 140.4 |
TAESN performance across forecast horizons.

| Horizon | RMSE | MAE |
|---|---|---|
| 1-Day | 140.3 | 70.3 |
| 3-Day | 171.0 | 120.1 |
| 7-Day | 282.3 | 205.0 |
Meta-learner (XGBoost) performance across forecast horizons.

| Horizon | RMSE | MAE |
|---|---|---|
| 1-Day | 98.7 | 67.26 |
| 3-Day | 238.0 | 194.7 |
| 7-Day | 324.8 | 255.8 |
| Model | 1-Day RMSE | 1-Day MAE | 3-Day RMSE | 3-Day MAE | 7-Day RMSE | 7-Day MAE |
|---|---|---|---|---|---|---|
| Random Walk | 0.02 | 0.01 | 0.02 | 0.01 | 0.02 | 0.01 |
| ARIMA | 0.27 | 0.22 | 0.27 | 0.22 | 0.29 | 0.24 |
| LSTM | 109.8 | 71.5 | 165.4 | 113.3 | 229.0 | 174.2 |
| GRU | 100.7 | 69.4 | 142.6 | 100.9 | 220.1 | 155.3 |
| CNN | 129.9 | 91.0 | 191.8 | 130.2 | 286.0 | 218.2 |
| TCN | 290.0 | 266.4 | 191.8 | 130.2 | 193.9 | 140.4 |
| Meta-Learner (XGB) | 98.7 | 67.26 | 238.0 | 194.7 | 324.8 | 255.8 |
| TAESN | 140.3 | 70.3 | 171.0 | 120.1 | 283.3 | 205.0 |
| Horizon | Comparison | t-Statistic | p-Value |
|---|---|---|---|
| 1-Day | TAESN vs. Base Models | 2.1490 | 0.0323 |
| 3-Day | TAESN vs. Base Models | 10.2249 | 0.0000 |
| 7-Day | TAESN vs. Base Models | 7.1728 | 0.0000 |
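The exact test procedure behind these comparisons is not restated here; one common choice consistent with reporting a t-statistic and p-value is a paired t-test on per-step absolute errors, sketched below with illustrative variable names.

```python
# Hypothetical sketch of a paired significance test between TAESN and a baseline:
# a paired t-test on per-step absolute errors (scipy.stats.ttest_rel). This is an
# assumed procedure, not a reproduction of the paper's exact test; y_test,
# y_pred_taesn, and y_pred_base are illustrative arrays of equal length.
import numpy as np
from scipy.stats import ttest_rel

abs_err_taesn = np.abs(y_test - y_pred_taesn)
abs_err_base = np.abs(y_test - y_pred_base)   # e.g., errors of one base learner

t_stat, p_value = ttest_rel(abs_err_base, abs_err_taesn)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```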