1. Introduction
Flooding occurs when the water level exceeds the stream's flood stage. This threshold varies for each location and gage, and the severity of the flooding depends on how far the water has risen above this level, making the water level (also known as gage height or stage) the most prominent indicator of imminent flooding. In Missouri, water level oscillations are often caused by extreme, localized rainfall brought on by storms. With climate change leading to more frequent storms, water level fluctuations will become more abrupt. Predicting water levels enables early warning, allowing for timely evacuation, rerouting of transportation, managing road traffic, and deploying flood protection measures [1].
Studies in flood forecasting and water level prediction usually make use of statistical, heuristic, and machine learning models. Traditional statistical models such as autoregressive (AR) [2,3], moving average (MA), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA) models are among the most popular for flood forecasting and water level prediction [4,5,6]. Autoregressive models use past values to predict future values. Like other time series, water level changes are attributable to both short-term fluctuations (e.g., hourly) and long-term fluctuations (e.g., annual); water level data therefore exhibit both long-term and short-term dependencies. Although autoregressive methods have been popular in research, they fall short in distinguishing long-term from short-term dependencies and in leveraging them. These traditional models are a subset of linear regression models [7] and hence offer limited flexibility when forecasting non-linear time series. Additionally, these methods were reported not to lead to satisfactory results for short-term predictions, and at least a decade of historical data is required for them to produce a meaningful forecast [8]. To overcome these shortcomings when analyzing more complex time series problems, machine learning models have risen in popularity. According to a review study by [9], artificial neural networks (ANNs) are the most widely used machine learning methods due to their accuracy, high fault tolerance, and parallel processing, which make them well suited to the complex dependencies associated with flood prediction. ANN performance can be further improved when ANNs are combined with other methods, such as statistical techniques, or used in hybrid models in conjunction with other machine learning models [9]. Ghorpade et al. [10] reviewed different machine learning models (including linear regression, support vector machines, and decision trees) and concluded that long short-term memory networks (LSTMs) offer superior performance for flood prediction. LSTMs are a form of recurrent neural network that has proven effective at overcoming the problem of vanishing gradients [11]. LSTMs have been used in many studies for water level prediction. Gude et al. [12] compared the performance of a statistical model (autoregressive moving average) with that of an LSTM and concluded that LSTMs can provide more accurate water level predictions and uncertainty estimates. Li et al. [13] examined LSTMs and their variants for water level prediction, using an LSTM, bi-LSTM, CNN-LSTM, and CNN-attention-LSTM, with the latter outperforming the other models on all metrics. This suggests that an attention layer can improve the performance of an LSTM.
Although the use of attention mechanisms in tandem with other base models can be beneficial, recent research suggests that using pure attention can lead to even better results. The transformer architecture uses pure attention while having lower computational intensity, and it has achieved state-of-the-art results in sequential modeling for natural language processing [14]. Transformers have gained popularity since their introduction, providing superior performance while reducing the computational complexity of natural language processing applications. The architecture has been slightly tailored to suit other applications, including time series prediction [15,16], usually by adding a positional encoding component. Before deep neural transformers existed, the attention mechanism was applied on top of another architecture to increase the model's overall performance [17,18,19]. The transformer architecture, however, relies solely on the power of self-attention layers for capturing long-term and short-term dependencies [14]. Although transformers have recently been favored, there is some hesitation about using this architecture for time series forecasting. Zeng et al. [20] compared a transformer architecture with a one-layer linear model on nine benchmark datasets; surprisingly, the linear model outperformed the transformer in all but one case. The authors attribute this to the fact that the self-attention mechanism loses ordering information. This is not a major concern for semantic-rich applications such as natural language processing, but ordering plays a crucial role in modeling the temporal changes necessary for time series modeling [14,15,16,17,18,19,20].
Hybrid architectures that predict time series using LSTMs in conjunction with convolutional neural networks (CNNs) are promising. These models leverage the capabilities of CNNs for spatial feature extraction and of recurrent neural networks (RNNs) for temporal feature extraction. The long- and short-term time series network (LSTNet) is specially tailored for time series [21]. This model uses an RNN layer in conjunction with a skip-RNN layer; the latter can reach back through time to extract information from previous time steps. Additionally, the model integrates an autoregressive layer to mitigate the output scale insensitivity commonly encountered in artificial neural networks (ANNs). Several studies have benefited from the LSTNet architecture [22,23,24,25,26]. For instance, Lee et al. [23] used an LSTNet to forecast short-term load in power systems. While LSTNet is designed for time series data, its specific architecture has not been thoroughly explored or validated in the context of water level forecasting, leaving a gap in its applications within this domain. In this study, we chose LSTNet, transformers, and LSTMs to compare against an MLP baseline. The performance of LSTNet is compared to these popular deep learning models commonly used in water research to evaluate which model consistently performs well enough to be extended to all river gages in the state. In addition, we evaluate which of these models performs better under the challenges that accompany time series forecasting, such as data drift. Data drift is a common challenge that occurs when the distribution of the data changes over time, making it more difficult for a model to provide accurate predictions. A model that can overcome data drift can handle an unseen or holdout dataset with a different distribution with higher accuracy.
Section 2 describes the dataset and study area, data preprocessing, methodology, deep learning models, hyperparameter tuning, experiments and implementation, and evaluation metrics. Model performance comparison and result interpretation are presented in Section 3. Discussion and insights for future work are presented in Section 4. Finally, Section 5 summarizes the results and concludes this study.
2. Materials and Methods
2.1. Dataset and Study Area
The gage height dataset used in this research is extracted from the United States Geological Survey (USGS). USGS maintains an established network of active river gages collecting stage/gage height data throughout the state of Missouri. These gages provide stage/streamflow readings, enabling a comprehensive understanding of river conditions, facilitating the development of predictive models, and offering real-time updates on water conditions. The historical stage/gage height information is extracted from the USGS website in time series format at 30-min intervals for 20 selected gages. For training the deep learning models, the most recent 10,000 values, roughly six months of data ranging from 1 June 2021 to 30 December 2021, were chosen. Each catchment in the dataset has been assigned a unique index value ranging from 1 to 20, corresponding to the river gage location in the state of Missouri.
The locations of the river gages used in this study are shown on the map in Figure 1. Detailed information for each gage, including site number, site name, latitude (decimal degrees), and longitude (decimal degrees), can be found in Table 1.
2.2. Dataset Preprocessing
Prior to dividing the data into training and validation sets, preprocessing steps such as addressing missing data and normalizing the data are carried out. Given that the water level data exhibit continuous and uniform fluctuations, we used a uniform distribution for data imputation, with the minimum and maximum of the distribution determined by the preceding and succeeding available data points, respectively. Normalization and scaling are performed separately for the training, validation, and test data to prevent data leakage. Additionally, the dataset is transformed into windows of time steps: for a prediction n steps into the future, the model utilizes the preceding m observed water level values, such that if the value at time step t is denoted x_t, the input is the sequence (x_{t-m+1}, ..., x_t) and the predicted output is x_{t+n}. The value of n is set to 4, 6, 8, and 10 time steps, corresponding to prediction intervals of 2 to 5 h, while the value of m is determined through hyperparameter tuning. After completing these preprocessing procedures, the data are divided using the conventional train-validation-test split, with 20% of the data reserved as a holdout set for testing and the remaining 80% used for training and validation in a 9:1 ratio. In addition, we employed walk-forward cross-validation. This method is particularly suited to time series data, as it preserves the sequential order of information, which is crucial for capturing trends and patterns over time and for updating the model weights in case of drift. Using this method, we updated the models' trainable parameters on each validation fold, thereby utilizing all portions of the training data without risking data leakage.
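To make the preprocessing concrete, the sketch below illustrates the uniform-distribution imputation and the sliding-window construction described above. The synthetic series, the helper names (impute_uniform, make_windows), and the example window length m = 48 are assumptions for illustration rather than the study's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
raw_series = np.sin(np.linspace(0, 20, 500)) + 0.1 * rng.normal(size=500)
raw_series[100:105] = np.nan                      # simulate a gap of missing readings

def impute_uniform(series, rng=np.random.default_rng(0)):
    """Fill interior gaps with draws from a uniform distribution bounded by the
    nearest valid observations before and after each gap (as described above)."""
    x = series.astype(float).copy()
    valid = np.where(~np.isnan(x))[0]
    for i in np.where(np.isnan(x))[0]:
        prev_val = x[valid[valid < i][-1]]
        next_val = x[valid[valid > i][0]]
        lo, hi = min(prev_val, next_val), max(prev_val, next_val)
        x[i] = rng.uniform(lo, hi)
    return x

def make_windows(x, m, n):
    """Input: the preceding m observations; target: the value n steps ahead."""
    X, y = [], []
    for t in range(m - 1, len(x) - n):
        X.append(x[t - m + 1 : t + 1])
        y.append(x[t + n])
    return np.array(X)[..., None], np.array(y)    # X shape: (samples, m, 1)

x = impute_uniform(raw_series)
split = int(0.8 * len(x))                          # 20% holdout for testing
train_val, test = x[:split], x[split:]             # scaling would be applied per split
X_train, y_train = make_windows(train_val, m=48, n=4)   # n = 4 steps = 2 h ahead
X_test, y_test = make_windows(test, m=48, n=4)
```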
2.3. Methodology
Twenty datasets comprising water level values from selected USGS river gages of interest were chosen for analysis with our deep learning models. The model selection process was driven by validation performance, with the top-performing model on the validation dataset chosen as the best model for testing. Notably, the optimal hyperparameter combination and configuration choices, such as the time series window size, may vary among the best models for each gage. In addition, an extensive hyperparameter search employing the Keras random search algorithm was conducted. Parameter analysis was also performed for selecting configurations such as the time series window size (timesteps) and batch generation before feeding the preprocessed data into the model. Ultimately, the performance results are compared to determine the preferred model for future water level predictions. The code was written in the Python programming language, Version 3.10, using the TensorFlow library, Version 2.11 [27].
2.4. Deep Learning Models
The four deep learning models analyzed for predicting water levels in this study were the MLP, LSTM, LSTNet, and transformer. Among these, LSTNet and LSTM possess recurrent gates that facilitate the retention and omission of information, distinguishing them from the MLP and transformer, which lack this capacity. The positional encoding added to the transformer architecture is a compensatory mechanism to mitigate this limitation.
A multi-layer perceptron (MLP) is a fundamental type of feedforward neural network consisting of multiple layers of interconnected neurons [28,29,30]. A non-linear activation function is applied to the weighted sum of each neuron's inputs in the hidden layers, and the outputs of the hidden layers are passed through another activation function to generate the final output. MLPs are commonly used for a wide variety of tasks; however, they may not be able to extract long-range dependencies because they lack a gated unit, and they are prone to vanishing and exploding gradients. The number of hidden layers and the number of neurons in each layer were set through hyperparameter tuning [31]. The model's architecture is shown in Figure 2a.
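A minimal Keras sketch of such an MLP regressor over a flattened look-back window follows; the layer sizes are placeholders for the tuned values, and MSE is used here as a stand-in for the RMSE objective described later.

```python
import tensorflow as tf

def build_mlp(window_size, hidden_units=(64, 32)):
    """MLP over a flattened window of past water levels; sizes are placeholders
    for values found by hyperparameter tuning."""
    inputs = tf.keras.Input(shape=(window_size, 1))
    x = tf.keras.layers.Flatten()(inputs)
    for units in hidden_units:
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1)(x)          # water level n steps ahead
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")    # RMSE is the square root of this
    return model

model = build_mlp(window_size=48)
model.summary()
```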
Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture that has gained widespread popularity due to its ability to effectively process and model sequential data. The main advantage of LSTM lies in its capability to overcome the vanishing and exploding gradient problems typically associated with traditional RNNs [11]. LSTM cells are designed with a gating mechanism that selectively retains and forgets information, allowing the network to capture long-range dependencies and effectively handle sequences with varying time lags. The LSTM memory cell contains three main gates: the input gate, the forget gate, and the output gate. Together with the cell state, these gates enable the LSTM to remember important information over extended intervals, making it suitable for tasks such as time series analysis, natural language processing, and speech recognition. Despite its strengths, LSTM also has certain limitations. The architecture is computationally and memory intensive, which can hinder its application in resource-constrained environments. The complexity of LSTM networks also makes them more challenging to train and tune, often requiring larger amounts of data and more extensive hyperparameter optimization. Additionally, while LSTMs can handle long-range dependencies, they may still struggle to capture extremely long-term dependencies in sequences [32,33], which can result in less effective modeling of highly complex patterns that span vast time intervals [34]. The number of stacked LSTM layers, the number of hidden units in each layer, and the dropout rate of this model were adjusted using the Keras random search tuner. The architecture of this model is shown in Figure 2b [35].
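A compact Keras sketch of a stacked-LSTM regressor in this spirit is shown below; the number of layers, units, and dropout rate are placeholders for the values selected by the tuner.

```python
import tensorflow as tf

def build_lstm(window_size, layer_units=(64, 32), dropout=0.2):
    """Stacked-LSTM regressor; layer count, units, and dropout stand in for
    values chosen by the Keras random search tuner."""
    inputs = tf.keras.Input(shape=(window_size, 1))
    x = inputs
    for i, units in enumerate(layer_units):
        x = tf.keras.layers.LSTM(units,
                                 return_sequences=(i < len(layer_units) - 1),
                                 dropout=dropout)(x)
    outputs = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```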
LSTNet is a neural network architecture designed for time series forecasting [21]. It combines the strengths of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture short-term and long-term patterns in time series data. The key idea behind LSTNet is to use CNNs to model short-term dependencies efficiently and LSTMs to capture long-term dependencies. The model consists of two main components: a CNN-based first layer, which encodes the recent history of the time series, and LSTM-based layers, which process the encoded features to generate forecasts. LSTNet includes a skip-LSTM layer that receives the encoded information from the CNN layer, allowing the model to reach back through time and extract important information from previous steps. Finally, LSTNet addresses the output scale insensitivity of neural networks by adding the output of an autoregressive model to the output of the concatenated LSTM and skip-LSTM layers. In this model, the number of CNN filters and kernel sizes, the number of LSTM hidden units, the number of skip-LSTM hidden units, the number of skipped steps, and the dropout rates were adjusted through hyperparameter tuning. The architecture of the LSTNet model is shown in Figure 2c. Although the original paper uses a GRU as the recurrent unit, we found that an LSTM is better suited to our application, as it handles long-term dependencies more effectively than a GRU. Additionally, we allowed a dynamic number of stacked LSTM layers, with the number of layers and the LSTM units in each layer determined through hyperparameter tuning. The look-back window size was also determined through hyperparameter tuning.
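To make this architecture concrete, the sketch below builds a simplified LSTNet-style network in Keras with LSTM recurrent units, following the description above. The filter counts, unit counts, skip length, and autoregressive window are placeholder values, and the skip branch keeps only a single phase of the original skip-RNN, so this is an illustration of the idea rather than the study's exact implementation.

```python
import tensorflow as tf

def build_lstnet(window_size=48, n_filters=32, kernel_size=6, rnn_units=64,
                 skip_units=16, skip=4, ar_window=8, dropout=0.2):
    """Simplified LSTNet-style model: CNN for local patterns, LSTM for longer-term
    dependencies, a skip-LSTM over every `skip`-th step, and a linear autoregressive
    branch added to the output to preserve the output scale."""
    inputs = tf.keras.Input(shape=(window_size, 1))

    # Convolutional layer: short-term local patterns
    c = tf.keras.layers.Conv1D(n_filters, kernel_size, activation="relu")(inputs)
    c = tf.keras.layers.Dropout(dropout)(c)
    conv_len = window_size - kernel_size + 1

    # Recurrent layer over the convolved features
    r = tf.keras.layers.LSTM(rnn_units, dropout=dropout)(c)

    # Recurrent-skip layer: the recurrence links steps that are `skip` apart
    start = (conv_len - 1) % skip
    s = tf.keras.layers.Lambda(lambda z: z[:, start::skip, :])(c)
    s = tf.keras.layers.LSTM(skip_units, dropout=dropout)(s)

    nn_out = tf.keras.layers.Dense(1)(tf.keras.layers.Concatenate()([r, s]))

    # Autoregressive branch: linear model on the most recent raw observations
    ar = tf.keras.layers.Lambda(lambda z: z[:, -ar_window:, 0])(inputs)
    ar_out = tf.keras.layers.Dense(1)(ar)

    outputs = tf.keras.layers.Add()([nn_out, ar_out])
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_lstnet()
model.summary()
```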
The transformer is a deep learning architecture introduced by Vaswani et al. [14,34]. It represents a significant departure from traditional sequential models such as LSTMs by relying solely on self-attention mechanisms to handle long-range dependencies in sequential data. The core idea of the transformer is the attention mechanism, which allows the model to focus on relevant parts of the input sequence while processing each token. The transformer is parallelizable, leading to lower training times. A transformer encoder tailored for time series prediction is used in this study. The head size, number of multi-head attention layers, number of transformer blocks, number of feedforward layers, number of hidden units in each feedforward layer, and the dropout rates were optimized using hyperparameter tuning. The architecture of this model is shown in Figure 2d.
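A minimal encoder-only transformer for univariate forecasting, in the spirit of the description above, is sketched below. The sinusoidal positional encoding, the projection width d_model, and all block sizes are illustrative choices rather than the tuned configuration used in the study.

```python
import numpy as np
import tensorflow as tf

def transformer_block(x, head_size, num_heads, ff_dim, dropout):
    # Self-attention sub-layer with residual connection and layer normalization
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                              dropout=dropout)(x, x)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # Position-wise feedforward sub-layer
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ff = tf.keras.layers.Dense(x.shape[-1])(ff)
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ff)

def build_transformer(window_size=48, d_model=32, num_blocks=2,
                      head_size=16, num_heads=4, ff_dim=64, dropout=0.1):
    """Encoder-only transformer for univariate forecasting; sizes are placeholders."""
    inputs = tf.keras.Input(shape=(window_size, 1))
    x = tf.keras.layers.Dense(d_model)(inputs)          # project to model width
    # Fixed sinusoidal positional encoding to restore ordering information
    pos = np.arange(window_size)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles)).astype("float32")
    x = x + tf.constant(pe)
    for _ in range(num_blocks):
        x = transformer_block(x, head_size, num_heads, ff_dim, dropout)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```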
2.5. Hyperparameter Tuning
Given that the best performance of these models depends on suitable hyperparameter tuning and on the specific characteristics of the dataset, a hyperparameter tuning process was conducted using the Keras random search tuner. Random hyperparameter search has been shown to outperform grid search, requiring shorter computation time and reduced processing costs [35]. Although no hyperparameter tuning method guarantees optimal hyperparameters, and suboptimal selections remain possible, these methods help avoid poor hyperparameter configurations and can boost the performance of deep learning models by a large margin by preventing both underfitting and overfitting [36]. To determine suitable hyperparameter values, 20 distinct trials were executed for each of the deep learning architectures. Hyperparameters unique to a particular model, such as the number of transformer blocks or CNN filters, were fine-tuned to maximize performance. Hyperparameters common to all models, such as the time series window/step size, learning rate, batch size, maximum training epochs, and early stopping patience, were assigned consistent lower and upper bounds. This approach was adopted to ensure equitable experimental conditions for all models. Moreover, early stopping was implemented to prevent overfitting.
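The snippet below sketches how such a random search can be set up with the keras_tuner package; the search bounds, the dummy data, and the fixed window size are assumptions for illustration (in the study, the window size itself is part of the search space and 20 trials are run per architecture).

```python
import numpy as np
import tensorflow as tf
import keras_tuner as kt

WINDOW_SIZE = 48   # assumed fixed here; in the study it is part of the search space

def build_model(hp):
    """Search space for an LSTM regressor (bounds are illustrative)."""
    inputs = tf.keras.Input(shape=(WINDOW_SIZE, 1))
    x = tf.keras.layers.LSTM(hp.Int("units", 16, 128, step=16),
                             dropout=hp.Float("dropout", 0.0, 0.5, step=0.1))(inputs)
    outputs = tf.keras.layers.Dense(1)(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(
                      hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")),
                  loss="mse")
    return model

# Dummy data standing in for the windowed training and validation sets
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, WINDOW_SIZE, 1)), rng.normal(size=(800,))
X_val, y_val = rng.normal(size=(100, WINDOW_SIZE, 1)), rng.normal(size=(100,))

tuner = kt.RandomSearch(build_model, objective="val_loss",
                        max_trials=20, overwrite=True, directory="tuning")
early_stop = tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
tuner.search(X_train, y_train, validation_data=(X_val, y_val),
             epochs=100, callbacks=[early_stop])
best_model = tuner.get_best_models(num_models=1)[0]
```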
2.6. Experiments and Implementation Setting
All experiments were conducted using the Python programming language, the TensorFlow library [27], and the same general hyperparameter tuning procedure, on an NVIDIA GeForce RTX 3090 GPU. The models look back over "window size" time steps, where the window size results from the general hyperparameter tuning, and the water level is predicted for intervals of 4, 6, 8, and 10 future steps. Adam was used as the optimization algorithm. To ensure that performance superiority was not the result of random weight initialization, the random weight initialization was fixed with a constant seed value. The hyperparameters common across all models are presented in Table 2.
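A minimal way to fix the weight initialization in TensorFlow is shown below; the specific seed value is arbitrary and not taken from the paper.

```python
import tensorflow as tf

# Fix all relevant random seeds so that weight initialization is identical across
# runs and models; the seed value itself is arbitrary (not stated in the paper).
tf.keras.utils.set_random_seed(42)      # seeds Python, NumPy, and TensorFlow RNGs

optimizer = tf.keras.optimizers.Adam()  # Adam optimizer, as used in the experiments
```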
2.7. Evaluation Metrics
This study uses the Root Mean Squared Error (RMSE) as the loss function. RMSE is the square root of the Mean Squared Error (MSE) and has the same units as the target variable, making it easier to interpret, to compare across different scales, and to relate the magnitude of the error to the flood stage. RMSE emphasizes larger errors, which is vital for our use case, as we require a higher penalty for significant errors. Flooding typically results from abrupt changes in water level, and when such changes occur, models often tend to underpredict them due to the regression-toward-the-mean effect. By using RMSE, which squares the error, we increase the penalty for larger discrepancies between predicted and actual values, thereby penalizing these significant errors more effectively. We also employ two other performance metrics to assess the prediction models: Mean Absolute Error (MAE) and Pearson's correlation coefficient. RMSE and MAE measure the disparity between the actual values and the predictions, with smaller values indicating better performance. Pearson's correlation coefficient captures the strength of the linear relationship between the actual values and the predictions; the closer the coefficient is to 1, the stronger the correlation. The correlation coefficient may not accurately reflect the strength of the relationship when the relationship between variables is non-linear; therefore, it is best used in tandem with the other metrics.
Assuming that n is the total number of samples in the test dataset, y_i represents the actual value of the label for sample i, and p_i represents the corresponding predicted value, the performance metrics can be formally represented as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - p_i)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - p_i \rvert$$

$$r = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(p_i - \bar{p})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(p_i - \bar{p})^2}}$$
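For reference, the three metrics can be computed directly with NumPy as sketched below; the example arrays are hypothetical.

```python
import numpy as np

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

def mae(y, p):
    return np.mean(np.abs(y - p))

def pearson_r(y, p):
    return np.corrcoef(y, p)[0, 1]

# Hypothetical actual and predicted gage heights
y = np.array([1.2, 1.4, 1.5, 1.9, 2.3])
p = np.array([1.1, 1.5, 1.4, 2.0, 2.1])
print(rmse(y, p), mae(y, p), pearson_r(y, p))
```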
3. Results
The four models were trained on the 20 individual datasets with a diverse array of hyperparameters, encompassing the number of blocks of each architecture, hidden layers, dropout values, batch size, learning rate, training epochs, and early stopping patience. An iterative process using the Keras random search tuner was employed to efficiently optimize these hyperparameters and select the model configuration that yields the best validation performance. Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Pearson's correlation coefficient are used to assess model performance.
In addition, the results are evaluated using different prediction intervals to ensure that the performance results of the models are consistent.
Figure 3 shows the error frequency histogram for each model, averaged over the 20 datasets, for a prediction interval of four steps ahead. The red curves show the normal distribution plotted with the mean and standard deviation of the average errors for each model. From the histograms in Figure 3, it can be observed that the distribution is right-skewed, the average error is close to zero, and the mode is smaller than the average error, indicating that overprediction is more frequent. This right-skewness is expected in real-world datasets, as the frequency diagram of water levels for each individual gage is itself right-skewed. Moreover, the 2σ interval of LSTNet's normal curve is narrower than those of the other models. This deviation from a normal distribution suggests that the average values for each model may not be ideal for final comparisons; therefore, given the difficulty of comparing performance results across 20 datasets, we use boxplots to inspect the median and percentiles of the performance metrics for each individual dataset.
Figure 4 shows the performance results of the models across the different datasets. This visualization summarizes both the central tendencies and the spread of the performance metrics across the 20 datasets for different prediction intervals. Moreover, the boxplot enables the exclusion of outliers, which may result from random weight initialization or hyperparameter tuning. Each box contains the minimum, Q1, median, Q3, and maximum of the designated performance metric across the 20 datasets. The median is the middle value of the selected performance metric across all datasets when the values are ordered from smallest to largest; it separates the higher half of the values from the lower half. The first quartile, Q1, separates the lowest 25% of the ordered values and marks the bottom of the box. The third quartile, Q3, separates the lower 75% from the upper 25% of the ordered values and marks the top of the box. The maximum and minimum are the largest and smallest values excluding outliers (the river gages with errors above the maximum or below the minimum defined in Equations (5) and (6)). The interquartile range (IQR) and the outlier thresholds are calculated as follows:

$$\mathrm{IQR} = Q3 - Q1$$

$$\mathrm{Maximum} = Q3 + 1.5 \times \mathrm{IQR} \tag{5}$$

$$\mathrm{Minimum} = Q1 - 1.5 \times \mathrm{IQR} \tag{6}$$
In Figure 4, if we suppose the current time step is t, then t + n indicates the prediction interval, where n is the number of steps ahead in the future. Each step is equivalent to a 30-min interval. Figure 4a–c shows the RMSE, MAE, and correlation coefficient on the test dataset for four steps ahead, respectively. The median RMSE for MLP, LSTM, LSTNet, and transformer is 0.00985, 0.01300, 0.00724, and 0.00862, respectively, with LSTNet having the lowest median RMSE; LSTNet also has the smallest Q1, Q3, and maximum RMSE. The median MAE for MLP, LSTM, LSTNet, and transformer is 0.00784, 0.01115, 0.00555, and 0.00700, respectively, with LSTNet again having the lowest median MAE along with the lowest Q1, Q3, and maximum. Regarding the correlation coefficient (Figure 4c), where values fall between 0 and 1 and larger values indicate better performance, LSTNet has the highest minimum, Q1, median, Q3, and maximum, consistently showing the best performance among all models.
Figure 4d–l exhibits the performance metrics for prediction intervals of 6 (d–f), 8 (g–i), and 10 (j–l) steps. Based on Figure 4, the following conclusions can be drawn:
As the prediction interval increases, the performance of all models degrades, which was expected theoretically.
LSTNet shows better performance than the other models, and this advantage remains consistent as the prediction interval increases, highlighting the robustness of LSTNet across multiple datasets and prediction intervals.
Figure 5 illustrates the point-by-point normalized errors aggregated across the 20 datasets, with the errors scaled from −1 to 1 to facilitate comparison. These errors delineate the performance of LSTNet across the distinct datasets. If the error for dataset 1 at timestamp t is denoted e1, the error for dataset 2 at timestamp t is denoted e2, and so forth, the vector E(t) = {e1, …, e20} is formed. The point-by-point median, q1, and q3 errors at timestamp t are then computed as the median, q1, and q3 of E(t).
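As a concrete illustration of this aggregation, the short sketch below computes point-by-point statistics over a hypothetical error matrix; the array shape and values are made up for demonstration.

```python
import numpy as np

# Hypothetical normalized errors: one row per gage (20), one column per test timestep
rng = np.random.default_rng(0)
errors = rng.uniform(-1, 1, size=(20, 2000))

# For each timestep t, aggregate E(t) = {e1, ..., e20} across the 20 datasets
mean_err = errors.mean(axis=0)
median_err = np.median(errors, axis=0)
q1_err = np.percentile(errors, 25, axis=0)
q3_err = np.percentile(errors, 75, axis=0)
```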
Figure 6 presents the water level values from the various river gages. Each dataset consists of 10,000 timesteps; the initial 8000 values are used for training and validation, and the remaining 2000 values are reserved for testing, where data drift occurs. The error in Figure 5 remains negligible for the initial 1000 test timesteps, followed by a discernible spike in error from timestep 1000 to 1250, where the median, mean, q1, and q3 point-by-point errors are larger than in the preceding and subsequent timesteps. Notably, timestep 1000 aligns with the corresponding point on the dotted red line in Figure 6. As can be seen, the water level values of several river gages undergo sudden fluctuations simultaneously (red and orange dashed lines, where the red lines indicate water levels larger than any previously seen values). This points to a change in the data distribution caused by an external influence beyond the input feature itself, and the intensity of these fluctuations is unprecedented for some of the river gages. Since the water level prediction is performed on a univariate dataset, the model cannot capture the effects of data drift driven by external influences.
While Figure 6 shows that some gages underwent unprecedented fluctuations not observed in the training dataset, a Kolmogorov-Smirnov test was also used to examine this statistically. The Kolmogorov-Smirnov test is a non-parametric statistical test used to compare two distributions. It assesses whether two samples are drawn from the same distribution or whether one sample differs significantly from a reference distribution. In the context of drift detection, it can help identify whether the distribution of incoming data has changed compared to a baseline dataset. The null hypothesis (H0) of the Kolmogorov-Smirnov test states that the two samples come from the same distribution. A p-value of 0.05 or smaller indicates strong evidence against the null hypothesis, leading to its rejection and suggesting that distribution drift is present.
The Kolmogorov-Smirnov test was applied to the training and test data (Table 3). Based on these results, distribution drift is detected for all gages.
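The sketch below shows how such a per-gage test can be run with scipy.stats.ks_2samp; the synthetic series and its 8000/2000 train-test split are illustrative stand-ins for an actual gage record.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical stand-in for one gage's series: first 8000 points train, last 2000 test
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(2.0, 0.3, 8000), rng.normal(3.0, 0.6, 2000)])
train, test = series[:8000], series[8000:]

stat, p_value = ks_2samp(train, test)
drift_detected = p_value <= 0.05      # reject H0: both samples share one distribution
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}, drift={drift_detected}")
```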
The results indicated that LSTNet demonstrates overall better performance across our test datasets for all prediction intervals. Since all of the datasets have undergone distribution drift, we can conclude that LSTNet performs better under drift conditions.
A sensitivity test was performed to determine whether there is a relationship between specific ranges of hyperparameters and lower RMSE. First, an ordinary least squares (OLS) regression was performed, with the error as the dependent variable and the hyperparameters as the independent variables. The OLS test found no significant relationship between the hyperparameters and the error, suggesting that any existing relationships are likely non-linear or more intricate (Table 4). Table 4 reports the OLS regression of the LSTNet model's error on its hyperparameters; the results for the other models were similar.
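The sketch below illustrates how such a regression of error on hyperparameters can be set up with statsmodels; the hyperparameter names, trial count, and values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical tuning log: one row per trial, hyperparameters plus resulting RMSE
rng = np.random.default_rng(0)
trials = pd.DataFrame({
    "learning_rate": rng.uniform(1e-4, 1e-2, 20),
    "cnn_filters": rng.integers(16, 128, 20),
    "lstm_units": rng.integers(16, 128, 20),
    "dropout": rng.uniform(0.0, 0.5, 20),
})
rmse_vals = rng.uniform(0.005, 0.02, 20)

X = sm.add_constant(trials)           # add intercept term
ols = sm.OLS(rmse_vals, X).fit()
print(ols.summary())                  # coefficients, p-values, R^2
```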
Because the OLS test assumes linear relationships, it cannot detect more complex interactions. Given this limitation, we also conducted a Variance Inflation Factor (VIF) test to detect multicollinearity. Multicollinearity occurs when independent variables (here, the hyperparameters) are highly correlated, making it difficult to isolate their individual effects.
The results of the VIF test indicate moderate to high multicollinearity between different hyperparameters, which complicates the use of statistical tests that only detect linear dependencies. While some combinations of hyperparameters led to better model performance, the presence of multicollinearity and complex interrelationships made it difficult to identify those patterns using standard statistical methods.
In general, VIF values between 1 and 5 suggest moderate multicollinearity, while values above 5 indicate high multicollinearity. This high multicollinearity makes it difficult to isolate the individual impact of each hyperparameter on the error, as the variables may interact with each other in complex, non-linear ways.
Table 5 shows the results of the VIF test for LSTNet's hyperparameters. From these results, we can conclude that, with the exception of two hyperparameters that showed moderate multicollinearity, all other hyperparameters exhibited high multicollinearity.
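The following sketch shows how VIF values can be computed for a table of trial hyperparameters using statsmodels; the column names and values are hypothetical stand-ins for the LSTNet tuning log.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical hyperparameter table for LSTNet trials (same idea as Table 5)
rng = np.random.default_rng(0)
hparams = pd.DataFrame({
    "cnn_filters": rng.integers(16, 128, 20),
    "kernel_size": rng.integers(3, 9, 20),
    "lstm_units": rng.integers(16, 128, 20),
    "skip_units": rng.integers(8, 64, 20),
    "dropout": rng.uniform(0.0, 0.5, 20),
})

X = sm.add_constant(hparams).astype(float)
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
print(vif.drop("const"))    # VIF of 1 to 5: moderate; above 5: high multicollinearity
```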
The analysis revealed no specific combination or pattern of hyperparameters that consistently led to lower RMSE values. However, the improvement in RMSE resulting from hyperparameter tuning is evident. It can be reasoned that due to multicollinearity among hyperparameters, they are not independent; rather, they influence each other. This interdependence makes it challenging to identify which ranges of hyperparameters or specific relationships contribute to better accuracy. Nevertheless, it can be concluded that certain combinations of hyperparameters have resulted in improved model performance. The results were the same for all models.
The findings indicate that LSTNet performs more effectively for univariate water level forecasting under drift conditions. The combination of a CNN for extracting local dependencies, an LSTM for longer-term dependencies, a skip-RNN layer that enables the model to reach back through time, and an autoregressive layer that addresses output scale insensitivity has contributed to LSTNet's better performance.
4. Discussion
The findings of this study underscore the potential of the LSTNet model for improving water level predictions across multiple datasets and prediction intervals, highlighting its robustness and adaptability to various hydrological conditions. The LSTNet model demonstrated lower median RMSE and MAE values, as well as higher correlation coefficients, compared to the MLP, LSTM, and transformer models across 20 river gages and all prediction intervals (4, 6, 8, and 10 steps ahead). This performance advantage persisted as the prediction interval increased, suggesting that LSTNet is better equipped to capture both short-term and long-term dependencies in water level data. As expected, the prediction error increases as the prediction interval extends.
In the field of hydrology, overcoming data drift becomes crucial since hydrological phenomena are not solely dependent on a univariate input but are also influenced by various factors (e.g., climate change, frequent precipitation brought about by storms). Although including the other factors as input features might help in increasing the prediction accuracy, there might be some limitations such as the lack of historical datasets for such features or lack of computational resources to train models with more input features, as more features will subsequently result in more trainable weights and added computational complexity.
Data and concept drift are prevalent challenges encountered in real-world datasets. In our case, data drift is characterized by changes in the water level distribution, and concept drift by changes in the dynamics between future water levels (output) and historical values (input). Flooding occurs when water level values exceed a certain threshold known as the flood stage; this threshold varies for each location and gage, and the severity of the flood depends on how much water has surpassed this level. However, larger water level values are outliers; they do not occur with the same frequency as other values and deviate from the mean of the distribution, making them harder to predict accurately. This effect is called regression toward the mean: the frequency and variability of outliers are not consistently captured, so when the model predicts based on patterns in the data, it pulls these extreme values toward the average. Moreover, our models are optimized to minimize overall error; since extreme values contribute disproportionately to the error, the model may downweight them to fit the majority of the data. All of this contributes to the observed underprediction of larger values.
The problem of underpredicting extreme values is not limited to water level forecasting but is present in any time series forecasting problem regardless of the model used; outliers are inherently challenging to predict accurately. In other domains working with non-sequential data, solutions such as rebalancing an imbalanced dataset can be effective; in time series forecasting, however, under-sampling is not viable, as the order of the sequential information would be lost. There are nonetheless several remedies to diminish the effect. First, time series cross-validation can be used to incorporate all portions of the validation data into optimization, so that outlier data are not missed and data leakage is avoided. Second, a suitable loss function can be chosen to emphasize larger errors. In addition to these techniques, there are drift detection mechanisms such as ADWIN (Adaptive Windowing) [37,38,39] and DDM (Drift Detection Method) [40,41]. The former detects changes in the data distribution by calculating statistical properties within a specified window and comparing them to a reference distribution, while the latter monitors the performance metrics of the predictive model to identify concept drift. These mechanisms are typically employed in production environments where models are not continuously optimized with real-time data and predictions rely on pretrained models; when drift or a performance decrement is detected, the model is re-optimized on the new distribution to accommodate emerging patterns. Future studies can use these methods to implement adaptive sampling in the optimization process, ensuring that extreme values are captured while avoiding oversampling of frequent non-extreme data points. This approach may further improve prediction accuracy for applications such as flood detection, where accurate predictions of extreme values are vital.
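As a rough illustration of the windowed-comparison idea described above (not the actual ADWIN or DDM algorithms), the sketch below flags drift by comparing a sliding recent window against a reference window with a Kolmogorov-Smirnov test; the window sizes, significance level, and synthetic stream are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def monitor_stream(stream, ref_size=500, win_size=250, alpha=0.05):
    """Simplified window-based drift monitor (illustration only): compare each new
    window of observations against a reference window with a two-sample KS test."""
    reference = np.asarray(stream[:ref_size])
    alarms = []
    for end in range(ref_size + win_size, len(stream), win_size):
        recent = np.asarray(stream[end - win_size:end])
        _, p = ks_2samp(reference, recent)
        if p <= alpha:                  # distribution shift relative to the reference
            alarms.append(end)
            reference = recent          # re-baseline after the model is re-optimized
    return alarms

# Hypothetical stream whose level shifts partway through
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(2.0, 0.3, 3000), rng.normal(3.5, 0.8, 1000)])
print(monitor_stream(stream))           # indices at which drift was flagged
```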
The results of this study showed that although LSTNet has not been widely favored by researchers for water level forecasting, it is a capable model when modified and tuned. It is important to note that flooding in the state usually results from high-intensity rainfall brought about by localized storms, and an abrupt rise in water level is generally due to precipitation, whether local or from an upstream area. This highlights the importance of considering hydraulic connectivity and graph dynamics when predicting water levels. One limitation of this study was the treatment of gages as independent locations; using graph neural networks could take this work further. LSTNet is a capable model for predicting time series data, and its strength could be used within a graph model to enhance the performance of future models.
In conclusion, this study presents LSTNet as a promising tool for water level prediction, offering improved accuracy and robustness compared to other deep learning models examined in this paper. As climate change continues to impact hydrological systems, the development of such adaptive and accurate models becomes increasingly crucial for effective water resource management and flood risk mitigation.