1. Implication and Contributions of the Research
Time-series data prediction has perennially occupied a position of paramount importance, manifesting widespread applications across diverse domains, such as finance, weather forecasting, traffic planning, and sales forecasting. In the financial sector, the prediction of time-series data serves as the bedrock for investment decisions. Investors heavily lean on precise forecasts of time-series data, encompassing variables like stock prices, exchange rates, and commodity prices, to formulate astute buying and selling strategies. Concurrently, governmental and regulatory bodies utilize accurate time-series forecasts to vigilantly monitor the stability of the financial markets. In meteorology, the prediction of time-series data assumes critical significance in the accurate forecasting of weather patterns, climate changes, and the onset of natural disasters. Within the domain of traffic planning, the prediction of traffic flow is instrumental in aiding urban planners to adeptly manage traffic congestion and enhance overall traffic efficiency. Equally vital is sales forecasting, which empowers retailers to ascertain optimal inventory requirements, ensuring timely product supply while minimizing inventory costs.
The challenges associated with feature selection in time-series data primarily stem from the unique characteristics of such data. Time-series data exhibit temporal correlations, sequence patterns, and seasonality, often making traditional feature selection methods less suitable. Dealing with time-series data requires consideration of various complex factors, including the influence of historical information on current observations, the extraction of temporal features, and addressing issues such as missing values and noise. Consequently, developing feature selection methods tailored for time-series data is a crucial research area with the goal of constructing more precise, efficient, and interpretable time-series prediction models.
Against this backdrop, this paper introduces a pioneering innovation designed to improve the efficacy of feature selection in time-series data prediction. We leverage an ensemble learning approach that seamlessly integrates five distinct feature selection methods. This integrative framework aims to redress the limitations inherent in various feature selection methods, consequently yielding more precise and resilient feature selection outcomes. The chosen five feature selection methods span diverse dimensions, each endowed with its distinct strengths. Through the amalgamation of these methods, our objective was to comprehensively capture information and patterns within time-series data, thereby enhancing the performance and resilience of the prediction model.
The specific innovations and contributions of this paper are outlined as follows:
This paper introduces a feature selection method grounded in ensemble learning. It furnishes a formal definition of the application of ensemble learning to feature selection and, adhering to the principle of “good but different”, identifies five feature selection methods for integration. The weights of these methods were determined through K-fold cross-validation when applied to specific datasets. This ensemble approach considers the outcomes of multiple feature selection methods, consolidating them into a numerical outcome. This process aids in mitigating potential biases in traditional methods, thereby enhancing the accuracy and comprehensiveness of feature selection;
This paper deploys an LSTM model, incorporating features selected through ensemble learning and those identified by five different feature selection methods as inputs to the model. A series of experiments was conducted to validate the effectiveness of this approach. Through the practical applications in time-series prediction tasks, the paper presents concrete data and results to demonstrate the performance and efficacy of the proposed feature selection method in real-world scenarios.
By introducing ensemble learning into the field of feature selection, this study broadens the application scope of ensemble learning and empirically supports its effectiveness in feature selection tasks. We conducted robust comparisons with traditional methods, highlighting the innovative aspects of ensemble learning in enhancing both the accuracy and comprehensiveness of feature selection. Adopting an ensemble learning approach, our research offers a more comprehensive consideration of the strengths of various methods compared to traditional single-method approaches. The experimental results on different datasets demonstrate the superior performance of ensemble learning in terms of MAE and MSE metrics, validating its theoretical value in enhancing the robustness and effectiveness of feature selection.
2. Background
Time-series prediction has always been a daunting challenge. Scholars in this domain have diligently delved into the intrinsic laws governing time-series data through extensive exploration and research. Consequently, they have amassed a considerable repertoire of prediction methods based on the evolving patterns of this data, broadly categorized into linear and nonlinear prediction methods.
During the initial stages, prediction methods heavily leaned towards linear approaches, employing classic algorithms like the exponential smoothing method [1,2] and the autoregressive integrated moving average (ARIMA) prediction method [3,4,5]. These methods boasted advantages such as simplicity, reduced computational demands, and superior performance in short-term predictions. However, they fell short in capturing the inherent nonlinear relationships within financial time-series data, particularly when tackling long-term predictions. Consequently, they exhibited certain limitations in such scenarios.
To overcome the limitations of linear methods, subsequent research proposed the integration of nonlinear models to enhance the comprehension of complex data, thereby giving rise to nonlinear prediction methods. Prominent among these methods are BP neural networks [6,7,8], support vector machines, recurrent neural networks [9,10,11], generative adversarial networks [12,13], and reinforcement learning [14,15,16]. By employing these methodologies, researchers have achieved a more comprehensive capture of the nonlinear relationships embedded in financial time-series data, leading to relatively accurate prediction outcomes. This direction represents the focal point of future research and the prevailing trend in the field of financial time-series data.
In recent years, deep learning has attracted considerable attention from researchers in various fields. Deep learning methods, which have undergone extensive development and widespread application, have demonstrated remarkable performance compared to traditional algorithms in time-series prediction tasks. Of particular note, deep neural networks possess superior capabilities in extracting both linear and nonlinear features, outperforming shallow neural networks in this regard. This advantage enables them to capture underlying patterns that may be overlooked by their shallower counterparts, making them well suited for high-precision prediction tasks [17]. In light of these advancements, this section aims to introduce three primary categories of deep learning models that are particularly suitable for addressing challenges in time-series forecasting.
Convolutional Neural Networks (CNNs) represent a class of deep feed-forward neural networks that center around convolution and pooling operations. Originally developed for image recognition in the domain of computer vision [18,19], CNNs have since demonstrated their versatility in various fields. In 2018, Shaojie Bai et al. [20] proposed an innovative architecture called Temporal Convolutional Networks (TCNs), a variant of CNNs designed with reduced memory consumption and increased parallelizability. TCNs introduced causal convolution to ensure that future information is not accessed during training, thus mitigating issues related to gradient vanishing and gradient explosion. Additionally, the backpropagation path in TCNs differs from the temporal direction, providing added stability. To address the problem of information loss caused by an excessive number of layers in CNNs, TCNs incorporate residual connectivity, facilitating seamless information transfer across layers within the network.
Recurrent Neural Networks (RNNs) are a form of deep learning model introduced by M. I. Jordan in 1990, specifically designed to capture time-dimensional features. Later, in 1997, Mike Schuster et al. [21] extended the RNN architecture, leading to the creation of Bidirectional Recurrent Neural Networks (Bi-RNNs).
To address some limitations of RNN models, Hochreiter and Schmidhuber proposed Long Short-Term Memory (LSTM) in 1997 [22]. Subsequently, in 2005, A. Graves et al. [23] further expanded LSTM to create Bidirectional Long Short-Term Memory (BiLSTM). The structure of BiLSTM closely resembles that of Bi-RNN, incorporating two independent LSTM units concatenated together. By doing so, the BiLSTM model effectively addresses the limitation of LSTM’s inability to incorporate future information, enabling feature data obtained at time t to encompass both past and future information [24].
Vaswani et al. [25] introduced the Transformer in 2017 as an innovative deep learning framework, distinct from the conventional structures of CNNs or RNNs. The Transformer relies entirely on the attention mechanism to capture global dependencies between model inputs and outputs. This remarkable ability to handle long-term dependencies and interactions renders the Transformer well suited for time-series modeling tasks, leading to high performance in various time series-related endeavors [26].
To address specific limitations of the Transformer in long time-series prediction, Haoyi Zhou et al. [27] proposed the Informer model in 2021. Building upon the classical Transformer encoder–decoder structure, the Informer model aims to tackle challenges encountered in long time-series prediction tasks. In the same year, Lim et al. [28] presented Temporal Fusion Transformers (TFTs), implementing a multiscale prediction model with a static covariate encoder, a gated feature selection module, and a temporally self-attentive decoder. TFTs not only deliver accurate predictions but also retain interpretability, accounting for global and temporal dependencies as well as events.
3. Feature Selection Based on Ensemble Learning
The incorporation of ensemble learning in feature selection endeavors to improve the robustness, comprehensiveness, and stability of models, thereby mitigating the risk of overfitting and enhancing predictive accuracy. These advantages position ensemble learning as a potent tool in time-series prediction tasks, contributing significantly to the enhancement of model performance and reliability.
Given a dataset $D$, consisting of $m$ samples and $p$ features, we propose a set of feature selection methods $F = \{F_1, F_2, \ldots, F_n\}$. Each method $F_i$ employs a specific feature selection rule, assigning a score $s_{ij}$ to each feature, where $i$ denotes the index of the method and $j$ represents the index of the feature.
For each method $F_i$, a binary function $F_i(X, Y)$ can be defined, where $X$ represents the original feature set and $Y$ represents the target variable. This function returns a subset containing the selected features, denoted as $S_i$. The specific definition is as follows:

$$S_i = F_i(X, Y), \qquad i = 1, 2, \ldots, n.$$

Each $S_i$ is a subset of the original feature set $X$, representing the features selected by method $F_i$. Combining the selection results of all methods forms a feature selection set $\{S_1, S_2, \ldots, S_n\}$.
The objective of ensemble learning methods is to combine these individual methods to select the final feature subset $S$, maximizing a performance metric $P$, typically representing the performance of a predictive model. The formal expression is as follows:

$$S = \arg\max_{S \subseteq X} P(S), \qquad P(S) = \sum_{i=1}^{n} w_i \, P_i(s_{ij}),$$

where $S$ represents the final feature subset, i.e., the indices of the selected features; $P$ is the performance metric function, which could be accuracy, mean squared error, etc., measuring the model’s performance; $P_i$ is the performance metric function for the $i$-th feature selection method, assessing the performance of the feature subset based on the scores $s_{ij}$; and $w_i$ is the weight used to balance different methods and can be determined based on the performance of each method.

This formula expresses that the objective of ensemble learning is to optimize the weights so as to select the final feature subset $S$ that achieves the best performance metric $P$. The optimization of weights can be realized through the combination and adjustment of various methods, tailored to meet the requirements of the problem and the characteristics of the data.
Therefore, to obtain a feature selection subset that maximally enhances the accuracy of a time-series prediction model, i.e., maximizing the performance metric function $P$ for $S$, we need to evaluate the weights $w_i$ for each feature selection method and calculate the importance scores $s_{ij}$ for each feature. For the former, we must determine an appropriate method for weight calculation that comprehensively considers the varying performances of different feature selection methods when dealing with diverse types of datasets. Regarding the latter, we need to establish a set of alternative feature selection methods $F = \{F_1, F_2, \ldots, F_n\}$, covering diverse feature selection strategies to meet the requirements of different scenarios. For each method $F_i$, performance metrics are employed to assess its effectiveness, resulting in feature importance scores $s_{ij}$. Finally, adopting a feature scoring weighting approach using $w_i$, we generate the ultimate feature subset $S$. The entire ensemble learning-based feature selection process is illustrated in Figure 1.
3.1. Determination of Ensemble Learning Strategies
Given the set of feature selection methods $F = \{F_1, F_2, \ldots, F_n\}$, an appropriate ensemble learning strategy is needed to obtain the final postfiltered feature subset $S$. To integrate information from multiple feature selection methods, mitigate dependence on a single method, and enhance overall robustness, this study adopts the feature scoring weighting approach as its ensemble learning strategy. The feature scoring weighting approach exhibits comprehensive, flexible, and performance-enhancing characteristics across various problems and data contexts. This method facilitates the amalgamation of information from multiple feature selection methods, thereby improving model performance and robustness and diminishing the risk of overfitting, all while upholding a certain level of interpretability. The specific implementation steps are outlined as follows:
Firstly, through each feature selection method $F_i$, the score $s_{ij}$ for each feature is calculated according to its respective feature scoring strategy. Here, $s_{ij}$ represents the score assigned by method $F_i$ to feature $j$.
To ensure comparability among importance scores under different feature selection methods, it is imperative to constrain the scores of features within the range of 0 to 1. Therefore, normalization of the feature scores provided by each method is requisite. This paper employs a softmax transformation for the normalization of importance scores.
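In its standard form, the softmax transformation applied to the raw scores is

$$\tilde{s}_{ij} = \frac{\exp(s_{ij})}{\sum_{k=1}^{p} \exp(s_{ik})}, \qquad j = 1, 2, \ldots, p,$$

where $\tilde{s}_{ij}$ denotes the normalized score of feature $j$ under method $F_i$.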
In this way, the raw feature score $s_{ij}$ is transformed into a probability-like value $\tilde{s}_{ij}$ for each feature, ensuring that the scores of each method range between 0 and 1, with a total sum of 1. These normalized scores can be utilized as comparable values across different feature selection methods.
Different feature selection methods may have varying applicability in different domains and for different types of time-series data. In order to comprehensively consider the scores from multiple methods, this paper introduces a weight vector $W = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ represents the weight assigned to method $F_i$. These weights are utilized to adjust the relative contributions of each method.
This paper employs the K-fold cross-validation method to determine the weights for each feature selection method. This method divides the dataset into K folds and performs the following steps for each fold: First, select K − 1 folds as the training set, and reserve one fold as the test set. Then, apply each feature selection method $F_i$ on the training set, obtaining the feature selection results $S_i$ for each method. Next, train the regression model using the filtered feature subset and evaluate the model’s performance on the test set. Mean Squared Error (MSE) is used as the performance metric for each method and is recorded. For each feature selection method $F_i$, based on the performance metric results from K-fold cross-validation, calculate the average performance metric $\overline{P}_i$. These average performance metrics serve as indicators for the weights $w_i$. The weights $w_i$ are then normalized based on the average performance metrics to ensure that they sum up to 1, i.e., $\sum_{i=1}^{n} w_i = 1$.
Finally, the learned weights $w_i$ are utilized in the output of the feature scoring weighting method, forming the ultimate feature selection results:

$$S = \sum_{i=1}^{n} w_i \, \tilde{S}_i,$$

where $S$ represents the ultimate feature subset, obtained as the weighted sum of the normalized scores $\tilde{S}_i = (\tilde{s}_{i1}, \ldots, \tilde{s}_{ip})$ from each method. The weight vector $W$ governs the relative importance of each method. The final feature subset $S$ incorporates information from multiple method scores, representing the features ultimately selected.
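The complete strategy can be summarized in a short Python sketch. This is a minimal illustration rather than the authors' implementation; the placeholder names (score_matrix, softmax_normalize, cv_mse, ensemble_select), the linear regression used as the evaluation model, and the reciprocal-MSE weighting scheme are assumptions made for clarity.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def softmax_normalize(scores):
    """Map one method's raw feature scores to values in (0, 1) that sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

def cv_mse(feature_idx, X, y, k=5):
    """Average test MSE of a regression model trained on a feature subset (K-fold CV)."""
    mses = []
    for train_idx, test_idx in KFold(n_splits=k).split(X):
        model = LinearRegression().fit(X[train_idx][:, feature_idx], y[train_idx])
        pred = model.predict(X[test_idx][:, feature_idx])
        mses.append(mean_squared_error(y[test_idx], pred))
    return float(np.mean(mses))

def ensemble_select(score_matrix, X, y, top_k=5):
    """score_matrix: (n_methods, n_features) raw importance scores from the base methods."""
    norm_scores = np.vstack([softmax_normalize(s) for s in score_matrix])
    # Evaluate each method through the subset formed by its own top_k features.
    subsets = [np.argsort(s)[::-1][:top_k] for s in norm_scores]
    avg_mse = np.array([cv_mse(idx, X, y) for idx in subsets])
    # Lower average MSE -> larger weight; one simple scheme is the normalized reciprocal.
    weights = (1.0 / avg_mse) / (1.0 / avg_mse).sum()
    # Feature scoring weighting: weighted sum of normalized scores, then keep the top_k.
    combined = weights @ norm_scores
    selected = np.argsort(combined)[::-1][:top_k]
    return selected, weights
```

Given a 5 × 20 matrix of raw scores from the five base methods, ensemble_select would return the indices of the five highest weighted-score features together with the learned method weights.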
3.2. Determination of Feature Selection Methods
To establish a suitable set of feature selection methods $F = \{F_1, F_2, \ldots, F_n\}$, this paper introduces a “good but different” principle. According to this principle, individual learners should contribute to performance and exhibit differences among themselves. This ensures mutual complementarity and deficiency compensation during the ensemble process, ultimately enhancing the overall performance.
When applying ensemble learning to feature selection, it is equally crucial to ensure that the feature selection methods meet the “good but different” criteria. This implies that they should exhibit diversity, independence, stability, reliability, efficiency, adaptability, and robustness. Adhering to these requirements is essential in guaranteeing that feature selection methods can offer robust support for ensemble learning, thereby improving the performance and reliability of the model.
Grounded in the principles outlined above, this paper selects the Pearson correlation coefficient method, recursive feature elimination method, random forest method, gradient boosting decision tree, and XGBoost algorithm as the foundational feature selection methods. The specific rationale was as follows:
The Pearson correlation coefficient method (Pearson) is a statistical approach employed to measure the linear correlation between two continuous variables. It was deemed suitable as a feature selection method owing to its capability to identify the strength and direction of the linear relationship between features and the target variable. This method proves particularly valuable for features that manifest a linear relationship with the target variable. The Pearson correlation coefficient is calculated using the following formula:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where $r$ represents the Pearson correlation coefficient, $x_i$ and $y_i$ denote the $i$-th observations of the two variables, and $\bar{x}$ and $\bar{y}$ represent the means of the two variables, respectively.
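For illustration, the per-feature Pearson scores with respect to the closing price can be computed with pandas; the synthetic data and column names below are placeholders, not the study's dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in feature table; in the study the columns would come from Table 1.
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["open", "vol", "pe"])
df["close"] = 0.8 * df["open"] + rng.normal(scale=0.1, size=200)

# Absolute Pearson correlation of every feature with the prediction target.
pearson_scores = df.drop(columns=["close"]).corrwith(df["close"]).abs()
print(pearson_scores.sort_values(ascending=False))
```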
Recursive Feature Elimination (RFE) is a stepwise feature selection method that identifies the most informative feature subset by iteratively training models and eliminating the least important features. It is well suited as a feature selection method due to its ability to automatically identify and select important features, mitigate overfitting, enhance model interpretability, and the fact that it requires minimal manual intervention. Let $X$ represent the feature set, and $M$ be the metric used to evaluate the performance of the model. The process of Recursive Feature Elimination (RFE) can be expressed as follows:

$$X \leftarrow \text{remove\_least\_important\_feature}(X), \quad \text{repeated until } |X| \text{ reaches the desired subset size},$$

where remove_least_important_feature($X$) denotes the operation of removing the least influential feature in terms of model performance, as measured by $M$, from the feature set $X$.
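A minimal scikit-learn sketch of this procedure; the linear estimator, the synthetic data, and the choice of five retained features are assumptions made for illustration:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                      # 20 candidate features
y = 0.5 * X[:, 0] - X[:, 3] + rng.normal(size=200)  # synthetic target

selector = RFE(estimator=LinearRegression(), n_features_to_select=5, step=1)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the five retained features
print(selector.ranking_)   # rank 1 = kept; larger ranks were eliminated earlier
```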
The Random Forest method (RF) is an ensemble learning approach that improves overall performance by aggregating predictions from multiple decision trees. It is well suited as a feature selection method due to its capability to estimate feature importance and utilize a voting method for classification problems. Random forests exhibit a notable level of robustness against outliers and noise, rendering them suitable for addressing complex data scenarios. The importance calculation for feature $x_i$ is as follows:

$$\text{Importance}(x_i) = \frac{1}{N} \sum_{j=1}^{N} \Delta I_j(x_i),$$

where $N$ is the number of decision trees in the random forest, and $\Delta I_j(x_i)$ represents the decrease in impurity in the $j$-th decision tree due to the introduction of the feature $x_i$.
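As a sketch, impurity-based importances of this kind are exposed directly by scikit-learn's random forest (synthetic data; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = 2.0 * X[:, 3] - X[:, 7] + rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_       # mean decrease in impurity per feature
print(np.argsort(importances)[::-1][:5])    # indices of the five highest-scoring features
```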
Gradient Boosting Decision Trees (GBDTs) are ensemble learning algorithms that amalgamate the principles of decision trees and gradient boosting, consistently enhancing model performance through iterative training of multiple decision trees. They are well suited as feature selection methods owing to their ability to automatically estimate the importance of features, demonstrate adaptability to high-dimensional and large-scale datasets, and furnish insights and understanding of the data. The update rule for gradient boosting decision trees is as follows:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x),$$

where $F_m(x)$ represents the model’s prediction at the $m$-th round, $\eta$ is the learning rate, and $h_m(x)$ is the decision tree fitted in the $m$-th iteration.
XGBoost is an efficient, flexible, and scalable machine learning algorithm based on the gradient boosting framework. It consistently enhances predictive performance through the iterative training of multiple decision tree models. In comparison to traditional gradient boosting methods, XGBoost introduces additional regularization terms and tree depth limitations, thereby improving model stability and generalization. It is well suited as a feature selection method due to its notable advantages in both performance and robustness, while also aiding in mitigating the risk of overfitting. The objective function of XGBoost comprises the loss function, regularization term, and model complexity term. For regression problems, the objective function is as follows:

$$\mathcal{L} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),$$

where $L$ is the loss function, $\Omega$ is the regularization term, and $f_k$ represents the $k$-th decision tree model.
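The boosting-based scores can be extracted in a similar way; the sketch below uses scikit-learn for the GBDT and the xgboost package for XGBoost, with synthetic data and illustrative hyperparameters:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, 1] + 0.5 * X[:, 4] + rng.normal(scale=0.2, size=200)

gbdt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05).fit(X, y)
xgb = XGBRegressor(n_estimators=200, learning_rate=0.05, max_depth=3).fit(X, y)

print(gbdt.feature_importances_)   # impurity-based importance per feature
print(xgb.feature_importances_)    # importance per feature from the xgboost sklearn API
```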
The five methods mentioned above each exhibit distinct characteristics in feature selection and all adhere to the “good but different” principle. From the linear relationship measurement of the Pearson correlation coefficient method to the iterative feature elimination of the recursive feature elimination method, and from the robust performance of the random forest and gradient boosting decision tree in handling outliers and noise to the efficiency and versatility of XGBoost, each method is guided by the principle of being “good but different”. The goal was to select features that contribute significantly while possessing unique advantages to adapt to the varied requirements of different problems and datasets.
By integrating these five diverse feature selection methods, this paper aims to derive more powerful, robust, and comprehensive feature selection results, thereby contributing to the construction of enhanced time-series prediction models.
4. Experimental Results and Analysis of Feature Selection Based on Ensemble Learning
4.1. Experimental Preparation
4.1.1. Initial Feature Set
To better illustrate the performance enhancement achieved by the ensemble learning-based feature selection method for time-series data prediction models, this paper utilizes financial time-series data to construct the initial feature set for subsequent experiments. Financial time-series data typically encompass multiple features such as stock prices, trading volume, market capitalization, financial indicators, etc. Financial time-series data are characterized by their richness in information, high dimensionality, temporal nature, and real-time aspects. These characteristics render them well suited for applying ensemble learning methods to feature selection, thereby improving model performance, mitigating risks, and fostering a more profound understanding of financial markets.
In theory, a broader range of features in the initial feature set implies a more comprehensive coverage of information, leading to more accurate predictive results for the model. It is imperative to choose features that are highly correlated with stock prices or returns to construct the initial feature set for the prediction model. This paper proposes selecting a total of 20 candidate features from three major categories: market-related, trading-related, and market capitalization-related features, with the objective of comprehensively considering various types of features.
The present study incorporates a set of market-related features, namely, high, open, low, pre_close, pct_change, change, and avg_price, totaling seven features. These features denote the daily stock trading metrics of highest price, opening price, lowest price, previous day’s closing price, percentage change, and stock price change, respectively. Market-related features can reflect short-term fluctuations and trends in the market, aiding in capturing instantaneous changes in stock prices.
For trading-related features, we selected vol_ratio, vol, turn_over, amount, selling, and buying, constituting six features. These features represent volume ratio, trading volume, turnover rate, transaction amount, selling transactions, and buying transactions. Analyzing these trading-related features can provide insights into market activity, trading volume, price fluctuations, and other aspects, contributing to a better understanding of the overall market conditions.
In terms of market value-related features, the study includes pe, float_mv, total_mv, swing, activity, strength, and attack, totaling seven features. These features signify price-to-earnings ratio, float market value, total market value, amplitude, activity level, strength, and market aggressiveness, respectively. Changes in stock market value may be correlated with investor sentiment, market cycles, and other factors, making market value-related features crucial for predicting stock returns.
By comprehensively considering these three major categories of features, the initial feature set can encompass information from different aspects, thereby enhancing the predictive accuracy of the model.
Table 1 displays the initial feature set selected from these three major categories.
4.1.2. Experimental Dataset
In terms of dataset selection, to ensure diversity and representativeness, this paper chose stock time-series data from three distinct industries: finance, power, and technology. Specifically, the stock data from three datasets, namely, China Industrial and Commercial Bank (ICBC), GD Power Development (GD Power), and China Unicom, were selected. This approach enabled us to comprehensively explore and evaluate the applicability of ensemble learning-based feature selection across different industry datasets.
Each stock dataset encompasses a total of 716 trading days, spanning from 1 January 2020 to 1 January 2023. Specifically, each dataset is composed of the 20 features from the initial feature set in Table 1, along with the closing prices used for prediction, forming a 716 × 21 dimensional data matrix. The datasets were sequentially split into training and testing sets in an 8:2 ratio. Descriptive statistics for the closing prices in each dataset, including mean, standard deviation, minimum, quartiles, and maximum values, are presented in Table 2.
4.1.3. Data Preprocessing
To ensure the quality of the subsequent feature selection and model construction, data preprocessing is an indispensable step. In this paper, three specific data preprocessing methods were employed, namely, missing value handling, outlier treatment, and standardization. This series of preprocessing steps significantly enhanced the data quality, reduced potential sources of errors in the models, facilitated a better understanding and interpretation of the data, and also led to a reduction in the computational burden, thereby improving computational efficiency. The specific procedures are outlined below.
Stock feature data may exhibit issues such as missing values, format errors, or precision discrepancies due to network problems, time periods, or the absence of original data. Typically, it is necessary to address missing values, which are often represented as NaN or other placeholders and can be detected by examining the dataset. In this study, forward filling was employed to address missing values, where previous values are used to fill the gaps, preserving the continuity of the time series.
Individual feature data may contain exceptionally large deviations, which can impact the standard deviation of the data and even lead to the distortion of the overall dataset. To address this issue, this paper employed the Median Absolute Deviation (MAD) method. For each feature, the median and MAD were calculated, where the median was the middle value, and MAD was the median of the absolute differences between each data point and the median. Outliers are typically defined as data points deviating from the median by a certain extent, and in this study, the threshold for identifying outliers was set at three times the MAD. Subsequently, each data point for each feature was examined, comparing the absolute difference between each data point and the median with the threshold for outliers. If the absolute difference was greater than the threshold, the data point was flagged as an outlier. Finally, each identified outlier was set to the median to mitigate its impact on the data. Removing outliers contributes to enhancing the robustness and accuracy of the model.
Stock feature data may exhibit differences in magnitude, necessitating standardization to ensure comparability among different features. This paper employed the z-score normalization method due to its simplicity in calculation and its suitability for data approaching a normal distribution. This standardization method helps ensure that the magnitudes of different features do not adversely affect the interpretability of the model and facilitates the exploration of relationships between features and stock price trends. By transforming the data into a standard normal distribution with a mean of 0 and a standard deviation of 1, the values of different features share the same scale, making them suitable for comparison and modeling.
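The three steps can be summarized in a compact pandas sketch; the function below follows the thresholds stated in the text (forward filling, a three-MAD outlier rule with replacement by the median, and z-score standardization), while the function name and structure are illustrative:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Missing values: forward fill to preserve the continuity of the time series.
    df = df.ffill()

    # 2. Outliers: replace points deviating from the median by more than 3 * MAD with the median.
    for col in df.columns:
        median = df[col].median()
        mad = (df[col] - median).abs().median()
        outliers = (df[col] - median).abs() > 3 * mad
        df.loc[outliers, col] = median

    # 3. Standardization: z-score so every feature has mean 0 and standard deviation 1.
    return (df - df.mean()) / df.std()
```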
4.2. Experimental Results and Analysis
Next, we employed the five feature selection methods determined in Section 3.2 to assess the importance scores of each feature in the initial feature set for different stocks. The results are presented in Figure 2.
Two main observations are clearly evident from the graph. Firstly, different feature selection methods exhibited significant variations when analyzing the same stock data. Using ICBC as an example, various methods assigned relatively high scores to the “low” feature, but there were substantial score differences for the “pe”, “float_mv”, and “avg_price” features among different methods. This indicates a noteworthy variability in the impact of stock features under different feature selection methods.
Secondly, concerning stock data from different industry sectors, there were notable variations in the importance score distributions for each feature. Using the GD Power dataset as an example, the “low” and “avg_price” features received relatively high scores across various evaluation methods, while other features had comparatively lower scores. In contrast, in the dataset for ICBC, features such as “float_mv” and “pe” obtained higher scores. This emphasizes the distinct importance of various features in different industry sectors.
Therefore, to fully leverage the advantages of various feature selection methods on different data types, it is essential to employ ensemble learning methods to integrate the scores from different methods. Through the combination of scores from different methods, a more comprehensive consideration of the importance of different features under diverse data contexts can be achieved, thereby enhancing the model’s robustness and performance.
After obtaining the importance scores for each feature, weights for each feature selection method were calculated using the K-fold cross-validation method, as outlined in Section 3.1. In this study, K was set to 5, indicating the use of 5-fold cross-validation. The conclusive results are presented in Table 3.
According to the data in Table 3, it is evident that the same feature selection method carries different weights across various types of stock data. Higher weights indicate that the features selected by that method exhibit superior predictive performance on the corresponding stock data. Consequently, the method is more suitable for this type of data.
By assigning distinct weights to these methods, ensemble learning can select the most suitable feature selection method for each time-series data context and conduct comprehensive screening. Ultimately, this approach can achieve superior predictive performance and higher robustness when facing diverse types of time-series data requirements.
After obtaining the importance scores $s_{ij}$ for each feature and the weights $w_i$ for each feature selection method, the feature score weighting method was applied to calculate the top five ranked features for each stock dataset. The final feature subset $S$ is presented in Table 4.
Ultimately, to validate the effectiveness of the ensemble learning-based feature selection method, this study applied the five features selected by each feature selection method and the final five features chosen by the ensemble method to the task of stock price time-series prediction. The dataset, again, included the stock data of three companies: ICBC, GD Power, and China Unicom. The time range remained from 1 January 2020 to 1 January 2023.
Long Short-Term Memory (LSTM) was chosen as the specific prediction model to ensure accuracy in the forecasting task. LSTM networks are a variant of recurrent neural networks specifically designed for processing and learning from time-series data. The core components of an LSTM network include cells and gates, with three main gates: the input gate, forget gate, and output gate. The memory cell is the heart of the LSTM network and is responsible for storing and passing information. The input gate determines which information will be written to the memory cell, the forget gate decides which information will be removed from the memory cell, and the output gate determines which information will be extracted from the memory cell. These gates govern the flow of information in and out, and the updating of the memory state within the cell. The primary computational processes of an LSTM network can be represented by the following equations:
$$\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right), \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \\
h_t &= o_t \odot \tanh\left(C_t\right),
\end{aligned}$$

where $f_t$ represents the output of the forget gate, $i_t$ is the output of the input gate, $\tilde{C}_t$ is the new candidate memory, $C_t$ is the current state of the memory cell, $o_t$ is the output of the output gate, $h_t$ is the hidden state of the LSTM, $W_f$, $W_i$, $W_C$, and $W_o$ are weight matrices, while $b_f$, $b_i$, $b_C$, and $b_o$ are bias vectors. The symbol $\sigma$ denotes the sigmoid activation function. These equations describe the primary computational processes of the LSTM network, allowing the network to more effectively capture long-term dependencies when handling time-series data.
For the evaluation of prediction performance, we utilized widely used metrics, namely, Mean Absolute Error (MAE) and Mean Squared Error (MSE). The model parameters are specified as outlined in Table 5.
The parameter input_size represents the dimensionality of the input features, indicating the number of features input to the model at each time step. In the context of this predictive task, each time step comprised five features, making this parameter equal to 5. Hidden_size denotes the number of hidden units. In LSTM, these units capture patterns and relationships in time-series data. With 64 hidden units, the model exhibited a more complex learning capacity. Num_layers determines the depth of the network, i.e., the number of stacked LSTM layers. Here, two LSTM layers were stacked together, each with its own hidden state. Learning_rate is the learning rate, controlling the step size of model parameter updates. A smaller learning rate promotes model stability. Batch_size indicates the number of samples input to the model in each update, with larger batch sizes enhancing training efficiency. Num_epochs specifies the number of iterations the model underwent over the entire training dataset. Seq_length is the sequence length, representing the temporal span of historical data considered at each time step. In this case, the model utilized data from the past 5 days to predict the closing price on the 6th day; hence, this parameter was set to 5. The model employed the Adam optimizer and was implemented using the PyTorch framework.
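A minimal PyTorch sketch of an LSTM regressor consistent with these settings is given below; the class name, the learning rate value, the batch size, and the random data are illustrative assumptions rather than the authors' code:

```python
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, input_size=5, hidden_size=64, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)   # map the last hidden state to the closing price

    def forward(self, x):                      # x: (batch, seq_length=5, input_size=5)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])          # prediction for the next day

model = LSTMRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate value is an assumption
criterion = nn.MSELoss()

# One illustrative training step on random data shaped like the described input windows.
x = torch.randn(32, 5, 5)                      # batch of 32 windows of 5 days x 5 features
y = torch.randn(32, 1)
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```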
The final prediction results are presented in Table 6, with the best-performing metrics highlighted in bold. It is evident from the table that the gradient boosting decision tree method exhibited the best performance on one dataset, indicating that this method effectively identified the most influential features in that dataset. On the other hand, the recursive feature elimination and random forest methods achieved the best performance on two different datasets each, emphasizing their effectiveness in specific contexts. Moreover, ensemble learning methods demonstrated optimal performance on all three different datasets, underscoring their comprehensive applicability across various data types.
These findings underscore the crucial role of ensemble learning-based feature selection methods in enhancing the accuracy of feature selection and optimizing the performance of time-series predictions. Employing features selected through ensemble learning for modeling demonstrated improved performance in terms of both the MAE and MSE metrics, effectively enhancing the effectiveness of time-series prediction models. This outcome emphasizes the potential of ensemble learning methods in improving the accuracy of feature selection and predictive outcomes, providing robust support for research and applications in the field of time-series prediction.