1. Introduction
With global climate warming and worsening ecological degradation, natural disasters have become increasingly frequent; floods in river regions in particular result in severe losses [
1]. Hydrological forecasting explores the patterns of hydrological phenomena and is essential for flood prevention, drought mitigation, and efficient water resource utilization [
2]. Hydrological forecasting typically includes flow forecasting, water level forecasting, sediment concentration forecasting, and water quality forecasting, with water level forecasting being central to flood control, navigation management, and water conservancy safety. However, due to the non-stationarity of hydrological processes, the complexity of characteristic variables, and noisy data, accurate water level forecasting has remained a challenging problem in related research [
3].
To date, water level forecasting models can be broadly divided into two main categories. The first comprises physical models and mathematical-statistical models. Vicente et al. [
4] developed a forecasting model for the Estana Lake basin based on rainfall characteristics, basin location, and spatiotemporal scale changes. This model used eight different runoff accumulation algorithms and fifteen different spatial patterns to predict runoff and analyze the results, also predicting lake water level trends under different threshold runoff conditions. Liu Zhiyu et al. [
5] introduced a distributed hydrological model to address the rapid onset, fast flow convergence, and short forecast lead times associated with floods in small and medium rivers, and applied it to watersheds with limited data. These models, primarily based on statistical methods, have made varying degrees of progress in water level forecasting, but they struggle to handle non-linear relationships and show insufficient generalization ability.
Another approach is based on machine learning and hybrid models for water level forecasting. With the rapid development of big data and artificial intelligence technologies, many hydrologists have utilized machine learning, data mining, and deep learning methods [
6,
7] to improve existing water level forecasting methods and models, proposing a series of feasible forecasting models. Nguyen et al. [
8] used Support Vector Machines (SVM) for river water level forecasting. Although their evaluation metrics yielded favorable outcomes, the factors considered were not exhaustive. Puttinaovarat et al. [
9] introduced an adaptive machine learning method that uses learning strategies to drive data for predicting different flood levels. This method achieved good forecasting results through extensive applications. Khalaf et al. [
10] developed a forecasting model combining the Internet of Things and machine learning to prevent river flooding. Garcia et al. [
11] used historical water level and rainfall data with a Random Forest method to predict water level changes. While traditional machine learning methods are commonly employed, their accuracy requires further enhancement. Seo et al. [
12] conducted simulation experiments using daily water level data from Korean hydrological stations, combining wavelet decomposition with neural networks and fuzzy inference to construct river water level forecasting models. Comparison with original methods showed that the hybrid model improved the accuracy of river water level forecasting. Indrastanti et al. [
13] used a Long Short-Term Memory (LSTM) network model to predict downstream river water levels using time series of precipitation and water levels from upstream and downstream points. Moishin et al. [
14] developed a flood forecasting model using the deep learning method ConvLSTM to assess the likelihood of future flood events based on flood index forecasting. Although this model achieved certain results, it did not effectively handle correlations between data. In 2014, Cho and his team [
15] presented the Gated Recurrent Unit (GRU), a model that streamlines the internal neuron operations and improves training efficiency, while maintaining an accuracy level similar to that of LSTM models. Water level forecasting is influenced by various factors and features nonlinearity, temporality, and complexity. Single neural network models often do not meet the accuracy requirements for forecasting [
16]. Therefore, Liu Weifei et al. [
17] combined the GRU with a Back Propagation (BP) neural network to predict lake water levels and examined how varying training sets affect model accuracy. Results showed that the GRU-BP hybrid model maintained high forecasting accuracy even with limited sample data. Hu Hao et al. [
18] introduced a hybrid model based on a modified weighted DRSN-LSTM for multi-timescale forecasting of downstream water levels at the Xiangjiaba Dam. By using a newly constructed error weight correction function and cross-entropy function to adjust water level response factors, and optimizing LSTM network parameters with the Archimedes optimization algorithm, the model demonstrated good accuracy and efficiency in water level forecasting. Nie Qingqing et al. [
19] proposed a forecasting model combining an improved Grey Wolf Optimization (MGWO) algorithm with a Temporal Convolutional Network (TCN), using the improved Grey Wolf algorithm to enhance TCN’s forecasting performance.
These forecasting models can handle the nonlinearity and temporality of precipitation data sequences, but their performance often declines due to non-stationary elements and substantial noise within the sequences [
20]. Hence, it is essential to preprocess the sequences prior to predictive analysis. Singular Spectrum Analysis (SSA) is a non-parametric statistical method used for time series analysis, signal processing, and image processing [
21]. It extracts trends, periodic components, and noise from the original data to analyze and predict time series [
22]. SSA is particularly effective at removing or reducing random noise in complex hydrological sequences; preliminary decomposition with SSA can therefore separate signal from noise and provide a cleaner, less noisy data foundation for subsequent analysis. The CEEMDAN method is one of the most commonly used data-denoising techniques currently [
23]. Guo et al. [
24] used CEEMDAN to process Zhengzhou annual precipitation time series data, and the CEEMDAN-LSTM model improved the accuracy of single LSTM models in precipitation time series forecasting. Tao et al. [
25] utilized CEEMDAN to decompose historical water level time series and employed a CEEMDAN-GRU model to predict water levels after IMF reorganization. The findings indicated that the optimized CEEMDAN-GRU model achieved better forecasting accuracy than the LSTM and CEEMDAN-LSTM models. Despite this, previous research overlooked the significance of the different frequency components: different frequency components exert different influences on forecasting results, with long-term forecasting being more strongly affected by low-frequency components. Moreover, the GRU model frequently shows larger deviations when training on and forecasting low-frequency components, which degrades overall forecasting precision. Research on forecasting models that treat the frequency components separately for long-term forecasting is still lacking. Therefore, to address the shortcomings of single models in parameter optimization and forecasting accuracy, we propose a “decompose-divide frequency domain-predict” forecasting model. The permutation entropy algorithm classifies the components obtained from CEEMDAN, including both the IMFs and the residual (Res), into high- and low-frequency categories. The low-frequency components are modeled using a TCN–Transformer hybrid model, while the high-frequency components are trained using a GRU model. This approach captures trends from subtle variations in the data, leading to improved prediction accuracy. The Transformer model, originally developed for natural language processing (NLP), is widely applied to tasks such as machine translation [
26], time series forecasting [
27], and question-answering systems [
28]. TCN (Temporal Convolutional Network) is a deep learning model specifically designed for processing sequence data and capturing temporal dependencies through convolutional layers [
29]. Given the distinct advantages of both TCN and Transformer models, this paper couples the two. The coupled model leverages their complementary strengths in handling sequential data, enhancing the ability to process time series, particularly in capturing long-term dependencies and understanding complex sequence dynamics. This makes it especially suitable for complex time series forecasting tasks, such as the gate-front water level data in this study.
Based on the above, to improve the accuracy of gate-front water level forecasting, this study uses daily water level monitoring data from the Mengcheng Water Conservancy Hub from 2018 to 2022 and proposes a method combining Singular Spectrum Analysis, Complete Ensemble Empirical Mode Decomposition with Adaptive Noise, a Gated Recurrent Unit neural network, a Temporal Convolutional Network, and a self-attention-based Transformer. The resulting GRU–TCN–Transformer water level forecasting model is used to simulate and predict the hub's gate-front water levels over the same period. The forecasting results are compared with those of several other models and with observed data to demonstrate the model's effectiveness and to provide a new reference for improving the accuracy of gate-front water level forecasting.
2. Research Methodology
2.1. GRU
The GRU network improves upon traditional LSTM and RNN models. It effectively captures dependencies in varied temporal sequences while addressing the gradient information vanishing issue inherent in RNNs [
30]. The GRU’s internal design is relatively straightforward. Like LSTM, GRU introduces gating mechanisms to mitigate the gradient problems found in RNNs, but it addresses the shortcomings of both earlier architectures with only two gates in its recurrent mechanism: an update gate and a reset gate. The update gate determines how much previous information is kept in the current state, while the reset gate controls how information from earlier and current time steps is combined. Consequently, GRU trains faster while maintaining high predictive accuracy. Employing GRU for temporal data analysis enables the identification of patterns across different intervals, making it especially advantageous for forecasting tasks.
Figure 1 depicts the fundamental architecture of the GRU neural network.
Figure 1 illustrates the flow path of data using arrows. Within the equations, σ denotes the Sigmoid activation function, while tanh represents the hyperbolic tangent activation function. The variables z_t and r_t correspond to the update gate and reset gate, respectively. The input is denoted as x_t, and h_{t−1} represents the output of the preceding GRU unit. The candidate state h̃_t synthesizes information from x_t and h_{t−1}, with h_t being the final output of an individual GRU unit.
The computation process for the GRU unit is explained below [
15]:
- (1)
The update gate is computed to obtain the update vector z_t: z_t = σ(W_z · [h_{t−1}, x_t]).
- (2)
The reset gate is computed from the current time step’s input vector x_t and the previous time step’s output vector h_{t−1}, yielding the reset vector r_t for the current time step: r_t = σ(W_r · [h_{t−1}, x_t]).
- (3)
The candidate state h̃_t is updated with the current time step’s data: h̃_t = tanh(W · [r_t ⊙ h_{t−1}, x_t]).
- (4)
The output vector for the present time step is obtained by combining the update gate vector with the retained information: h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.
The GRU architecture provides an effective approach for capturing complex relationships in time series data across multiple temporal scales, showing strong adaptability and accuracy when modeling high-frequency components. Compared to LSTM, the GRU extracts features from long sequences effectively, requires fewer training parameters, and offers better computational efficiency. To enhance forecasting accuracy, this study therefore uses the GRU to process and learn the high-frequency intrinsic mode functions (IMFs) derived from the CEEMDAN decomposition of the water level data.
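To make the role of the GRU concrete, the following minimal PyTorch sketch (with illustrative layer sizes and window length, not the configuration used in this study) maps a sliding window of a high-frequency IMF to a one-step-ahead forecast:

```python
import torch
import torch.nn as nn

class GRUForecaster(nn.Module):
    """One-step-ahead forecaster for a single high-frequency IMF component."""
    def __init__(self, input_size=1, hidden_size=32, num_layers=1):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, window_length, 1) -- a sliding window of past IMF values
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])   # last hidden state produces the forecast

# Example: forecast the next value from a 30-step window (illustrative sizes)
model = GRUForecaster()
window = torch.randn(16, 30, 1)        # batch of 16 windows
prediction = model(window)             # shape: (16, 1)
```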
2.2. TCN
TCN (Temporal Convolutional Network) is a deep learning model specifically designed for handling sequential data. It captures temporal dependencies in time series data through convolutional layers [
31]. TCN demonstrates excellent performance in various sequence forecasting and classification tasks due to its series of innovative structural designs, particularly in addressing long-term dependency issues more effectively and efficiently than traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. The structure of the TCN network is shown in
Figure 2.
Its development is a natural extension of deep learning applications in time series analysis. It stems from recognizing the advantages of Convolutional Neural Networks (CNNs) in handling sequential data, particularly in capturing long-term dependencies and avoiding the complexities associated with recurrent structures. The conceptualization and implementation of TCN integrate the benefits of CNNs, dilated convolutions, residual connections, and causal convolutions, creating a powerful tool for time series analysis. By leveraging the advantages of convolutional networks and optimizing specifically for time series data, TCN offers an efficient and robust framework for various sequence modeling tasks and is extensively used in areas such as natural language processing, audio analysis, and time series forecasting. Its main features include the following:
Causal Convolution: TCN uses causal convolution to ensure that when predicting the output at the current time step, the model can only utilize information from the current time step and prior steps, thereby preventing the leakage of future information. This is achieved by designing the convolutional kernel to only be connected to past data.
Causal convolution ensures that when predicting the output at time step t, the model relies only on inputs from time step t and before. Mathematically, given an input sequence x = (x_1, x_2, …, x_T), the causal convolution operation can be represented as [29]:

y_t = Σ_{i=0}^{k−1} w_i · x_{t−i}

where y_t is the output at time step t, w_i represents the weights of the convolutional kernel, and k is the size of the convolutional kernel. To ensure the causality of the model, it is required that t − i ≥ 0, which means that when calculating y_t, the convolution operation does not use any future inputs.
Dilated Convolution: To capture long-term dependencies, TCN employs dilated convolution techniques. By increasing the spacing between elements in the convolutional kernel, dilated convolution expands the receptive field of the convolutional layer, allowing the network to cover longer input sequences without significantly increasing computational complexity.
Dilated convolution introduces a dilation factor d to increase the spacing between the weights of the convolutional kernel, thereby expanding the receptive field of the convolutional layer and capturing long-term dependencies. The operation of dilated convolution can be represented as [29]:

y_t = Σ_{i=0}^{k−1} w_i · x_{t−d·i}

where d is the dilation factor used to control the spacing between input elements. By adjusting the value of d, the model can cover a longer range of input sequences without significantly increasing the number of parameters.
Residual Connections: TCN typically includes residual connections, which help alleviate the problems of vanishing or exploding gradients during the training of deep networks, allowing the network to learn deeper sequence features. In TCN, the output of the residual module includes not only the output of the convolutional layers but also the input itself, as represented by the following formula [29]:

o = Activation(x + F(x))

where F(x) is the output of the convolutional layers (which may include multiple convolution operations), x is the input to the residual module, and Activation(·) is an activation function, such as ReLU. Residual connections allow gradients to flow directly through the network, thereby improving training stability and efficiency.
Variable Sequence Lengths: Unlike recurrent neural network structures, TCN can handle sequences of varying lengths, making it more flexible when dealing with time series data of different lengths.
By incorporating these concepts, TCN can effectively capture both short-term and long-term dependencies in time series data. Specifically, TCN achieves learning of long-term sequence dependencies by stacking multiple dilated convolutional layers (with dilation factors typically increasing exponentially for each layer), while using residual connections to enhance the training efficiency and stability of the model. These designs enable TCN to perform excellently in various time series-related tasks, such as text generation and time series forecasting.
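To illustrate how these three ideas fit together, the sketch below (illustrative channel count, kernel size, and dilations; a simplified stand-in rather than the exact TCN used here) implements a single TCN residual block with a causal, dilated convolution obtained by left-padding the input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    """One residual block of a TCN: causal dilated convolution + residual connection."""
    def __init__(self, channels=32, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding keeps the convolution causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time)
        y = F.pad(x, (self.pad, 0))                      # pad only on the left: no future inputs are used
        y = self.conv(y)                                 # dilated convolution enlarges the receptive field
        return F.relu(x + y)                             # residual connection: Activation(x + F(x))

# Stacking blocks with exponentially growing dilation (1, 2, 4, ...) extends the receptive field
tcn = nn.Sequential(TCNBlock(dilation=1), TCNBlock(dilation=2), TCNBlock(dilation=4))
out = tcn(torch.randn(8, 32, 100))                       # output keeps the input length: (8, 32, 100)
```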
2.3. Transformer
In 2017, the team at Google introduced the Transformer architecture [
32], which is built entirely on attention mechanisms and includes both encoder and decoder components. With a richer architecture than conventional attention methods, it demonstrates superior capability in feature extraction. In contrast to the GRU, the Transformer employs self-attention to analyze the full sequence, enabling parallel processing. It also handles global information well, making it particularly effective for detecting periodic variations in the low-frequency components.
The Transformer model is divided into two main components: the encoder on the left and the decoder on the right. The encoder comprises N = 6 identical layers, each containing two sub-layers: a multi-headed self-attention mechanism and a fully connected feed-forward neural network. The decoder replicates this structure with six identical layers; however, each layer in the decoder incorporates three sub-layers. The first two sub-layers are akin to those in the encoder, while the third sub-layer focuses on attention between the encoding and decoding processes. Residual connections and layer normalization are applied to each sub-layer in the Transformer.
Figure 3 illustrates the Transformer model’s architecture.
The Transformer model employs an attention mechanism built on three types of vectors: Query, Key, and Value, where Query denotes the feature matrix of the query, Key the feature matrix of the key, and Value the feature matrix of the value. The attention weights are normalized with the softmax function and then used to produce the output. The output matrix is derived using the formula presented in Equation (8) [32]:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V    (8)

Here, d_k represents the dimension of the Key vector in each head, while d_q denotes the dimension of the Query vector.
In practical applications, the Transformer network primarily employs the multi-head self-attention framework. This method projects the Query, Key, and Value vectors into different subspaces using linear transformations, and the results of the h attention heads are concatenated to produce the final output. The computation is outlined in Equations (9) and (10) [30]:

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (9)

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O    (10)

Here, W_i^Q, W_i^K, and W_i^V denote the weight matrices of the i-th head, W^O represents the projection matrix, and h indicates the number of heads in the multi-head attention mechanism.
To apply the Transformer model to time series analysis, the key components are the multi-head self-attention mechanism and the positional encoding. These allow the input time series to be processed in parallel, which speeds up training and enables effective modeling of dependencies over both long and short time frames. Furthermore, by analyzing trends in the low-frequency components obtained after decomposition, the Transformer model improves the accuracy of the water level predictions.
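As a concrete illustration of Equation (8), the following short sketch (assuming PyTorch, with illustrative tensor shapes) computes scaled dot-product attention for a batch of time series embeddings:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Equation (8)."""
    d_k = K.size(-1)                                    # dimension of the Key vectors
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # attention weights per query position
    return weights @ V                                   # weighted sum of the Value vectors

# Example: a batch of 8 sequences, 30 time steps, embedding dimension 16 (illustrative)
Q = K = V = torch.randn(8, 30, 16)
output = scaled_dot_product_attention(Q, K, V)           # shape: (8, 30, 16)
```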
2.4. TCN–Transformer Coupled Model
Due to the distinct advantages of both the TCN and Transformer models, this paper considers coupling the two. This coupled model leverages the complementary strengths of both models in handling sequential data, thereby enhancing the model’s ability to process time series data, particularly in capturing long-term dependencies and understanding complex sequence dynamics. This makes it especially suitable for handling complex time series forecasting tasks, such as the upstream water levels addressed in this study.
From the perspective of long-term dependencies, water level data typically contain long-term seasonal and cyclical patterns. TCN can effectively capture these long-term dependencies through its dilated convolutions, while the Transformer’s self-attention mechanism can learn dependencies between time points on a global scale, enabling the model to understand and exploit these complex temporal dynamics. From the perspective of local patterns, the convolutional structure of TCN is well suited to capturing local time series behavior, such as short-term water level fluctuations; these randomly occurring local patterns are crucial for understanding water level characteristics and making reasonable forecasts, especially during extreme weather events. In terms of accuracy and robustness, coupling TCN and Transformer makes the model more flexible and robust in the face of different forecasting challenges: TCN provides efficient learning of local features in the time series, while the Transformer enhances the understanding of global dependencies. From the perspective of reducing training difficulty, the inclusion of TCN alleviates the burden on the Transformer when processing long sequences, reducing training difficulty and improving training efficiency; TCN reduces the length and complexity of the sequence through local convolution operations, enabling the Transformer to learn the key information in the sequence more effectively. In summary, the TCN–Transformer coupled model can capture the complex dynamics of water level changes more comprehensively and accurately, effectively predicting upstream water levels and providing important support for water resource management, flood warning, and water conservancy planning.
Figure 4 illustrates the TCN–Transformer coupled model.
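One possible way to wire such a coupling is sketched below (layer sizes, depth, and the specific arrangement are illustrative assumptions, not the architecture reported in this paper): a TCN-style causal convolution first extracts local features from a low-frequency sub-series window, a Transformer encoder then models global dependencies over the resulting feature sequence, and a linear head outputs the forecast.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNTransformer(nn.Module):
    """TCN front-end for local features + Transformer encoder for global dependencies."""
    def __init__(self, d_model=32, kernel_size=3, nhead=4, num_layers=2):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(1, d_model, kernel_size)           # causal conv via left padding
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        # x: (batch, window_length, 1) -- one low-frequency sub-series window
        y = F.pad(x.transpose(1, 2), (self.pad, 0))              # (batch, 1, time), left-padded
        y = F.relu(self.conv(y)).transpose(1, 2)                 # (batch, time, d_model) local features
        y = self.encoder(y)                                      # self-attention over the feature sequence
        return self.head(y[:, -1, :])                            # one-step-ahead forecast

model = TCNTransformer()
forecast = model(torch.randn(16, 30, 1))                         # shape: (16, 1)
```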
2.5. GRU–TCN–Transformer Coupled Model
To achieve a deeper decomposition of the upstream water levels and enhance forecasting accuracy, this research proposes a forecasting method that integrates GRU neural networks, TCN deep learning models, and Transformer models, using daily historical water level data from the Mengcheng Water Conservancy Hub covering the years 2018 to 2022. The GRU–TCN–Transformer coupled forecasting model is constructed with an analysis method that combines SSA and CEEMDAN, which transforms the complex, non-stationary time series into multiple relatively stationary sub-series and thereby fully extracts the information in the original water level data. The permutation entropy algorithm is then employed to partition the stationary sub-series by frequency, separating them into high- and low-frequency components. For the high-frequency sub-series, a GRU neural network, which learns the features of long sequences effectively with fewer parameters and faster computation, is used for modeling and forecasting; this shortens training time and reduces resource consumption. For the low-frequency sub-series, the TCN–Transformer coupled model, which combines the advantages of both architectures, can comprehensively and accurately capture the complex dynamics of water level changes and therefore yields more accurate forecasts, so it is used for modeling and forecasting these components. Finally, the predicted results of all sub-series are combined to produce the upstream water level forecasts.
Figure 5 illustrates the GRU–TCN–Transformer coupled model framework.
The process is outlined below:
- (1)
Preprocessing and Decomposition of Data. First, outliers in the data are replaced and filled in as necessary. SSA is used to decompose the original data sequence. After removing noise components, the data are reconstructed. Subsequently, CEEMDAN performs a secondary decomposition on the reconstructed data. The appropriate number of modes is selected based on the central frequency of different data sequences, resulting in K intrinsic mode function (IMF) components.
- (2)
Frequency Division of Components. The components obtained from the CEEMDAN decomposition, including both the IMF and residual components, are partitioned. Permutation entropy (PE) is calculated for each component, facilitating the separation into high-frequency and low-frequency sub-series.
- (3)
Model Training. Each IMF component is modeled and tested independently. The components of the training dataset are used to train the deep learning models; once training is complete, the trained models generate forecasts for each IMF component over the test dataset. The forecasts for all sub-series are then aggregated to reconstruct the final simulated upstream water level.
- (4)
Model Evaluation. The simulated values from the GRU–TCN–Transformer coupled forecasting model are assessed against the actual values from the validation dataset using appropriate evaluation metrics.
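The workflow above can be summarized in a short sketch (the permutation-entropy threshold, the component data, and the persistence forecasters standing in for the trained GRU and TCN–Transformer models are all illustrative):

```python
import math
import numpy as np

def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy of a 1-D series (higher = more irregular/high-frequency)."""
    n = len(x) - (order - 1) * delay
    windows = np.array([x[i:i + (order - 1) * delay + 1:delay] for i in range(n)])
    patterns = np.argsort(windows, axis=1)                 # ordinal pattern of each window
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p)) / np.log2(math.factorial(order))

def predict_water_level(components, forecast_high, forecast_low, pe_threshold=0.6):
    """Route each IMF/residual to the high-frequency or low-frequency forecaster
    and sum the component forecasts; pe_threshold is an illustrative split point."""
    forecasts = []
    for c in components:
        if permutation_entropy(c) >= pe_threshold:
            forecasts.append(forecast_high(c))             # high-frequency component -> GRU
        else:
            forecasts.append(forecast_low(c))              # low-frequency component -> TCN-Transformer
    return sum(forecasts)                                  # reconstruct the water level forecast

# Illustrative usage with persistence forecasters standing in for the trained models
components = [np.random.randn(500), np.sin(np.linspace(0, 4 * np.pi, 500))]
print(predict_water_level(components, forecast_high=lambda c: c[-1], forecast_low=lambda c: c[-1]))
```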
Traditionally, data has been split into training and validation subsets based on established guidelines, where the training subset generally represents more than 60% of the entire dataset, while approximately 20% is allocated for validation. Depending on the context, researchers adopt various splitting ratios. For instance, Huang Chao and colleagues [
33] assigned 75% of their dataset to training and cross-validation, reserving 25% for independent testing while developing a model for summer precipitation forecasting in Hunan. Similarly, Ren Yufei and his team [
34] allocated 70% of their dataset for training and 30% for testing when developing a model for short-term wind speed forecasting along high-speed rail corridors. In this research, the Mann–Kendall (MK) mutation test revealed significant trend changes in the later stages of the data. To preserve the overall accuracy of the model, the initial 91% of the original data sequence was used for training, while the remaining 9% was set aside for validation during model construction.
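For reference, a chronological split like the 91%/9% division described above can be obtained as follows (the file name is hypothetical):

```python
import numpy as np

series = np.loadtxt("water_level.txt")      # hypothetical file of daily gate-front water levels
split = int(len(series) * 0.91)             # first 91% for training, last 9% for validation
train, valid = series[:split], series[split:]
```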
2.6. Model Assessment Metrics
To assess the accuracy of the water level forecasts from each model and to reflect the reliability and accuracy of the GRU–TCN–Transformer coupled model in comparison with others, we utilized several evaluation metrics: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²). The definitions of these metrics are as follows:
Mean Absolute Error (MAE):

MAE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|

Root Mean Square Error (RMSE):

RMSE = √[ (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² ]

Coefficient of Determination (R²):

R² = 1 − [ Σ_{i=1}^{N} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{N} (y_i − ȳ)² ]
In these metrics, y_i represents the actual values in the dataset, ŷ_i the predicted values generated by the model, ȳ the average of the actual values, and N the total number of data points. Mean Absolute Error (MAE) measures the average absolute error between predicted and true values, while Root Mean Square Error (RMSE) is the square root of the average squared difference between predicted and true values. Both MAE and RMSE measure model error and forecasting precision, and values closer to zero indicate greater accuracy and smaller error. The Coefficient of Determination (R²) assesses the goodness of fit of the regression model, where a higher value signifies a better fit; an R² value approaching 1 generally indicates a high degree of model fit.
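These three metrics can be computed directly from their definitions, as in the following NumPy sketch (the values shown are illustrative only, not results from this study):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for observed vs. predicted water levels."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mae, rmse, r2

mae, rmse, r2 = evaluate([21.3, 21.5, 21.8], [21.4, 21.6, 21.7])
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```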
5. Conclusions
To address the nonlinear and non-stationary characteristics of the upstream water level data, this study proposes an analysis method combining Singular Spectrum Analysis (SSA) and Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN). This approach involves a process of data decomposition, denoising, reconstruction, and further decomposition to obtain high-quality IMF subsequences. The PE algorithm is then used to divide each modal component into high-frequency and low-frequency sequences. The GRU and TCN–Transformer models are employed to predict these two parts separately, which reduces the forecasting error of each component, improves overall forecasting accuracy, and enhances forecasting stability. This method provides a scientific basis for water resource management and forecasting.
The daily upstream water level forecasting results for the Mengcheng Water Conservancy Hub from 2018 to 2022 indicate that decomposing the original data with an analysis method combining Singular Spectrum Analysis (SSA) and Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) can significantly improve forecasting performance. The GRU–TCN–Transformer model achieves higher accuracy in upstream water level forecasting than the comparison models, including TCN–Transformer, GRU, Transformer, and SVM. The forecasts achieve an R² value of 0.8076, demonstrating that combining TCN–Transformer and GRU exploits the advantages of each model, allowing the model to better capture the dynamics of water level changes in the individual components and yielding highly reliable forecasts.
The model proposed in this study effectively improves the forecasting accuracy of upstream water levels. However, the research did not consider incorporating additional variables that could affect upstream water level forecasting. Thus, a key focus for future research should be expanding data diversity by incorporating additional dimensions, such as meteorological factors, water demand, hydrology, and water quality. Using interdisciplinary approaches to integrate these variables more deeply into model construction and data analysis will enhance the model’s adaptability to complex environmental factors and the comprehensiveness of its forecasts.