1. Introduction
Water demand forecasting is an essential component of effective water resource management, particularly in urban areas where the balance between supply and demand is crucial for maintaining sustainable infrastructure [1,2]. State programs and regulatory frameworks are increasingly being implemented to govern the use of water resources, particularly in light of climate change and increasing urbanization. In the European Union, for example, the Water Framework Directive (2000/60/EC) emphasizes the sustainable use of water resources, urging member states to ensure the long-term availability of clean water by promoting efficient use and preventing waste [3]. Similarly, the EU’s Blueprint to Safeguard Europe’s Water Resources (2012) highlights the need for optimizing water usage, particularly through technological innovations that improve efficiency [4].
These regulatory frameworks place a responsibility on utility managers to make informed decisions regarding water distribution, ensuring both the efficient use of resources and compliance with sustainability targets. Accurate short-term forecasts, driven by advanced technologies, enable managers to optimize water distribution in line with regulatory requirements, minimizing waste while ensuring that demand is met [5]. Traditionally, time series models such as ARIMA (AutoRegressive Integrated Moving Average) and SARIMA (Seasonal AutoRegressive Integrated Moving Average) have been widely used in water demand forecasting due to their ability to model linear trends and seasonality [1,6]. However, these models often struggle with non-linearities and sudden fluctuations in water consumption caused by factors such as weather variations and sensor malfunctions [7].
Recent advances in machine learning, particularly deep learning models like Long Short-Term Memory (LSTM) networks, have shown significant promise in addressing the limitations of traditional methods [5,8]. LSTM networks are well suited for capturing long-term dependencies and complex temporal relationships in sequential data, which makes them particularly effective for modeling non-linear water demand patterns influenced by irregular and seasonal factors [5,9]. For instance, Smith et al. (2021) demonstrated that LSTM models significantly improved prediction accuracy over traditional ARIMA models when applied to water demand forecasting [10].
In addition to LSTM, Random Forest Regressors (RFRs) have gained attention as a robust alternative for forecasting tasks, especially in scenarios where data quality may be compromised by missing values or anomalies [11,12]. The RFR’s ability to handle noisy and imprecise data has made it an attractive choice for short-term water demand forecasting, as evidenced by its success in outperforming other regression models in recent studies [11]. Nevertheless, optimizing model performance remains a challenge, with recent research emphasizing the importance of a data-centric approach as a key innovation in enhancing the accuracy of predictive models, particularly in domains where data quality plays a critical role [7,13].
This data-centric approach prioritizes the preprocessing, cleaning, and augmentation of data to ensure that machine learning models are trained on high-quality inputs. In the context of water demand forecasting, a data-centric methodology is particularly valuable because real-world data often contain anomalies, such as sudden spikes or drops in water levels that may result from sensor malfunctions rather than genuine changes in water usage [7,14]. Models trained on well-preprocessed data—where anomalies are treated as outliers or imputed appropriately—consistently outperform those trained on raw data [12,15].
The proposed methodology builds on prior research in water demand forecasting while addressing key gaps in the literature, particularly regarding the handling of non-linear and anomalous data. By expanding the feature set and ensuring data quality, this study contributes to the growing body of literature on data-centric machine learning applications in water resource management [3,13].
Data imputation techniques play a critical role in the accuracy of water demand forecasting, especially when working with time series data prone to missing values due to sensor failures or data transmission errors. This study evaluates the impact of various imputation methods, such as bidirectional Long Short-Term Memory (bi-LSTM) networks, linear interpolation, polynomial interpolation, mean imputation, and K-Nearest Neighbors (KNN) imputation, on the forecasting performance of the models [8,12,16,17]. The goal is to determine the most reliable approach for improving data quality in water demand prediction.
The primary aim of this study is to integrate deep learning and machine learning forecasting techniques with traditional statistical models to improve the accuracy and reliability of water demand predictions. By analyzing tank levels rather than flow measurements, this research seeks to address a gap in the current methodology, particularly in areas where reservoirs are the main water source.
Advanced data preprocessing techniques, including anomaly detection, imputation, and feature selection, are employed to address the challenges posed by data anomalies and missing values [7,15]. By incorporating these techniques, the study aims to enhance the accuracy and reliability of short-term water demand forecasts and provide a robust framework for more effective management and distribution of water resources.
2. Materials and Methods
2.1. Dataset
The dataset used for this study consists of water level recordings from 21 reservoirs located in Eastern Thessaloniki, covering a total area of approximately 85 km². These recordings were provided by the water company EYATH S.A. and are univariate time series obtained from a SCADA (Supervisory Control and Data Acquisition) system. The SCADA system continuously receives water level signals from sensors installed in each reservoir at a high frequency of one measurement per minute. The data span the period from 1 November 2022 to 30 March 2024, resulting in 21 univariate time series. The dataset comprises approximately 13.9 million individual measurements, representing a rich dataset with a high temporal resolution. All the original data series from all the reservoirs for the corresponding period can be viewed in Figure A1 in Appendix A.
While the dataset is geographically focused on a specific region, it captures a diverse range of reservoirs characterized by different operational and urban contexts. The study area includes various suburban zones with varying population densities, land use patterns, and water consumption behaviors. The dataset encompasses two types of reservoirs:
Storage reservoirs: these directly supply water to residential, commercial, and industrial areas, playing a critical role in meeting regional water demands.
Intermediary reservoirs: these function as transfer stations, moving water to other reservoirs without directly supplying end consumers.
To further analyze the dataset, we employed a clustering approach to categorize the reservoirs based on their water level dynamics. Using the k-means clustering algorithm, the reservoirs were grouped into four distinct clusters according to patterns in their water level trends, seasonality, and overall behavior. This clustering revealed that reservoirs in each cluster exhibit unique water level dynamics, consistent with findings in similar studies on clustering for water management [18]:
Cluster 1: reservoirs with stable water levels and minimal seasonal variations.
Cluster 2: reservoirs with moderate seasonal variations, reflecting varying consumption patterns.
Cluster 3: reservoirs with high fluctuations in water levels, indicating significant variability in water usage.
Cluster 4: reservoirs with pronounced trends and seasonal variations, primarily serving as intermediary stations.
This clustering analysis provides a more granular understanding of the water distribution and consumption patterns across the studied area, enabling a deeper exploration of operational strategies and potential areas for optimization.
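A minimal sketch of this clustering step is given below, assuming each reservoir's minute-level series is first reduced to a few shape descriptors before k-means is applied; the helper summarize, the levels dictionary, and the chosen descriptors are illustrative assumptions, not the exact feature set used in the study.

```python
# Illustrative sketch: grouping reservoir level series with k-means (k = 4).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def summarize(series: pd.Series) -> list:
    """Reduce one reservoir's level series to simple shape descriptors."""
    daily = series.resample("D").mean().dropna()
    trend = np.polyfit(np.arange(len(daily)), daily.to_numpy(), 1)[0]  # linear slope
    weekly = daily.rolling(7).mean()
    seasonal = (daily - weekly).std() / daily.std()  # crude seasonality proxy
    return [daily.mean(), daily.std(), trend, seasonal]

# `levels` maps tank id -> minute-level pd.Series indexed by timestamp (assumed).
features = pd.DataFrame({tank: summarize(s) for tank, s in levels.items()}).T
X = StandardScaler().fit_transform(features.to_numpy())

labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
clusters = pd.Series(labels, index=features.index).sort_values()
print(clusters)  # tank id -> cluster assignment
```

Standardizing the descriptors before k-means prevents any single statistic from dominating the Euclidean distances.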
2.4. Data Imputation
The combination of the improved Interquartile Range (IQR) method with the Optuna framework and the moving standard deviation approach provides an accurate preprocessing step for managing and interpreting anomalies [9]. All anomalous values identified by these methods are replaced with NaN values. The time series for all tanks, prepared for imputation, are presented in Figure A2 in Appendix A.
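The sketch below illustrates one way this combined rule could be implemented, with Optuna searching the IQR multiplier and the rolling-window width; the objective heuristic and the names flag_anomalies and raw_series are assumptions for illustration rather than the study's exact criterion.

```python
# Hedged sketch: combined IQR + moving-std anomaly flagging, tuned with Optuna.
import numpy as np
import optuna
import pandas as pd

def flag_anomalies(s: pd.Series, k: float, window: int) -> pd.Series:
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    out_iqr = (s < q1 - k * iqr) | (s > q3 + k * iqr)          # IQR bounds
    dev = (s - s.rolling(window, center=True).median()).abs()
    out_std = dev > 3 * s.rolling(window, center=True).std()   # moving-std test
    return s.mask(out_iqr | out_std)                           # anomalies -> NaN

def objective(trial: optuna.Trial) -> float:
    k = trial.suggest_float("k", 1.0, 3.0)
    window = trial.suggest_int("window", 30, 720)
    cleaned = flag_anomalies(raw_series, k, window)
    # Proxy criterion (assumed heuristic): remove little data while strongly
    # reducing the variance of first differences.
    return cleaned.diff().var() + cleaned.isna().mean()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
clean = flag_anomalies(raw_series, **study.best_params)  # series with NaN gaps
```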
For the imputation of these NaN values, several imputation methods were employed to handle missing data, each varying in complexity and in its ability to capture underlying patterns in the data. The imputation methods include the following (an illustrative sketch is given after the list):
Linear and polynomial interpolation: Traditional methods that assume a linear or polynomial relationship between data points to fill in missing values. These methods are easy to implement but may not capture complex, non-linear patterns in the data.
Mean imputation: A simple approach that replaces missing values with the mean of the observed data. This method is less effective when the missing data pattern is not random.
K-Nearest Neighbors (KNN) imputation: A technique that fills missing values based on the values of the nearest neighbors in the feature space. KNN considers the similarity between data points, but its performance can degrade with higher percentages of missing data.
Bi-LSTM imputation: This advanced method utilizes a bidirectional Long Short-Term Memory network to learn from data sequences in both forward and backward directions, providing a comprehensive understanding of the temporal dynamics. It is particularly effective for time series data where capturing temporal dependencies is crucial.
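As an illustration, the baseline methods map directly onto standard pandas and scikit-learn calls; the lag featurization used to give KNN a neighborhood is a simplifying assumption.

```python
# Sketch of the baseline imputers compared in this study.
import pandas as pd
from sklearn.impute import KNNImputer

s = clean  # series in which anomalies/missing points are NaN (previous step)

linear_imp = s.interpolate(method="linear")
poly_imp = s.interpolate(method="polynomial", order=2)
mean_imp = s.fillna(s.mean())

# KNN imputation needs a feature matrix; lagged copies of the series give each
# point a neighborhood (an assumed, simple featurization).
lags = pd.concat({f"lag_{k}": s.shift(k) for k in range(3)}, axis=1)
filled = KNNImputer(n_neighbors=5).fit_transform(lags)
knn_imp = pd.Series(filled[:, 0], index=s.index)

# Each variant is then scored with MSE, RMSE, MAE, and R2 on held-out values
# (see Section 3.2) to quantify its effect on downstream forecasting.
```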
Figure 3 illustrates the process followed to impute data for NaN values.
The effectiveness of these imputation methods was evaluated using Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R2) to assess their impact on the forecasting performance of f-LSTM, Random Forest, and ARIMA models.
2.4.5. Data Reconstruction
Due to the extensive gaps observed in the data resulting from missing values, traditional imputation methods such as linear or polynomial interpolation, as well as forward- and backward-fill methods, demonstrate low accuracy. Consequently, an enhanced approach is employed: a bi-LSTM model is trained on normalized sequences of valid data, and its predictions are interpolated into the anomaly set. This data reconstruction method ensures that every NaN value is imputed from a sequence of adequate length, thereby avoiding the generation of insufficient or shorter sequences. By not selecting time-dependent indices as values for the sequences, this method circumvents the issue of long-term missing data leading to suboptimal sequence lengths.
Furthermore, this approach ensures that predictions for missing values (NaN) are carefully calibrated to avoid overlapping with existing valid values, thereby preserving the integrity of the temporal relationships within the time series. By doing so, the model can accurately capture and maintain the sequence of events, which is critical in time series forecasting. This meticulous handling of missing data allows the model to learn effectively from the available valid data and generalize well when predicting missing points, thereby significantly improving both the accuracy and robustness of the model.
The innovative imputation strategy employed in this study represents a considerable advancement over traditional methods, such as linear interpolation or forward fill, which often disrupt temporal coherence by introducing artificial trends that do not align with the underlying patterns of the data. Traditional methods typically fail to account for complex, non-linear dependencies and temporal dynamics, resulting in less reliable predictions, particularly in datasets with significant anomalies or irregular patterns [1,2].
By contrast, the bi-LSTM model-based imputation excels in preserving the natural flow of time series data. This is achieved by allowing the model to learn from the full temporal context—both forward and backward—while imputing missing values in a way that reflects the intrinsic behavior of the time series. Empirical evidence from recent studies suggests that this approach yields a 15–20% improvement in forecasting accuracy when compared to traditional imputation methods in water demand forecasting scenarios [8,10]. These results demonstrate the superior ability of machine learning techniques to maintain temporal coherence and effectively handle missing data, resulting in more accurate and reliable predictions.
Figure 4 illustrates the data reconstruction method, showcasing how the model reconstructs missing values while preserving the underlying temporal relationships. The use of this method enhances the model’s ability to generalize from incomplete datasets and produces more accurate forecasts, especially in complex, real-world datasets characterized by anomalies and irregular consumption patterns.
To predict the NaN values, which constitute the anomaly set, sequences of data of the same length as those used for training the model are interpolated before each NaN point. These sequences are derived from the valid data subset and are used to generate predictions for the anomalous NaN points.
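The following is a minimal, hedged sketch of this reconstruction under stated assumptions: the sequence length is fixed rather than Optuna-tuned, series is one tank's level series, and the padding rule is a simplified stand-in for the sequence selection described above.

```python
# Hedged sketch of the bi-LSTM data reconstruction (not the authors' exact code).
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler

SEQ_LEN = 48  # assumed; in the study this is tuned per tank by Optuna

values = series.to_numpy(dtype=float).reshape(-1, 1)
valid_mask = ~np.isnan(values[:, 0])
scaler = MinMaxScaler()                         # normalize valid data to [0, 1]
scaled = np.full_like(values, np.nan)
scaled[valid_mask] = scaler.fit_transform(values[valid_mask])

# Training pairs: windows drawn only from gap-free stretches of valid data.
X, y = [], []
for i in range(len(scaled) - SEQ_LEN):
    window, target = scaled[i:i + SEQ_LEN, 0], scaled[i + SEQ_LEN, 0]
    if not np.isnan(window).any() and not np.isnan(target):
        X.append(window)
        y.append(target)
X, y = np.array(X)[..., None], np.array(y)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, 1)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
model.fit(X, y, epochs=20, batch_size=64, validation_split=0.1)

# Impute each NaN from a sequence of valid values mapped to it, so predictions
# never overwrite existing valid data (per-point predict kept simple here).
for t in np.flatnonzero(np.isnan(scaled[:, 0])):
    seq = scaled[max(0, t - SEQ_LEN):t, 0]
    seq = seq[~np.isnan(seq)][-SEQ_LEN:]        # last valid points before t
    if len(seq) < SEQ_LEN:                      # pad from valid data if short
        pool = scaled[valid_mask, 0]
        seq = np.concatenate([pool[:SEQ_LEN - len(seq)], seq])
    scaled[t, 0] = model.predict(seq[None, :, None], verbose=0)[0, 0]

imputed = scaler.inverse_transform(scaled)      # back to original units
```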
Figure 5 shows the subset of the data prepared for imputation, consisting only of valid data used for training and validating the model. This subset was normalized to the range of 0 to 1 [22]. The data from tank df20 were selected as a representative dataset due to its numerous anomalies, as illustrated in Figure A1.
In Figure 6, a segment of valid data starting from the beginning is chosen. The length of this segment is calculated to ensure it covers the total NaN values of the anomaly set with the appropriate number of sequences. These sequences are then used to generate predictions for the NaN points.
Figure 7 shows the sequences interpolated with the NaN values. The method for each NaN value involves applying the trained LSTM model to predict the missing data. This is achieved by feeding the model with the sequence mapped to each NaN value.
Figure 8 shows the predicted values of each NaN data value with no overlap with existing valid values, ensuring that the temporal relationship within the time series is maintained.
3. Results
3.2. Comparative Analysis of Imputation Methods
The performance of different imputation methods was analyzed to understand their impact on subsequent forecasting accuracy.
Table 3 presents the performance metrics (MSE, RMSE, MAE, R2) for each imputation method across different datasets (tanks). The results indicate that the bi-LSTM imputation method consistently outperforms traditional methods, showing lower MSE and higher R2 values across all datasets.
The datasets analyzed in this study exhibit varying percentages of missing data before imputation, ranging from approximately 1.54% to 16.66%. This range allowed us to assess the sensitivity of different imputation methods across low, moderate, and high levels of missing data. For datasets with a low percentage of missing data (e.g., <5%), traditional methods such as mean imputation and linear interpolation generally provided reasonably accurate forecasts. However, even in these cases, the bi-LSTM method often outperformed simpler methods, though the margin of improvement was not as significant due to the limited amount of missing data. For instance, in the dataset df14 (1.54% missing), the difference in Mean Squared Error (MSE) between bi-LSTM and linear interpolation was relatively small.
As the percentage of missing data increased to a moderate level (e.g., 5–10%), the performance gap between advanced methods like bi-LSTM and traditional methods widened. Traditional methods, such as mean imputation or linear interpolation, tended to introduce more bias or variance in the reconstructed data, which adversely affected the accuracy of the forecasting models. For example, the dataset df18 (4.05% missing) demonstrated a clear improvement in forecast accuracy when using bi-LSTM compared to other methods, as evidenced by lower validation MSE and higher R2 values for both Random Forest and f-LSTM models.
At higher levels of missing data (e.g., >10%), the bi-LSTM method exhibited significantly better performance in maintaining forecast accuracy. This improvement is largely attributable to bi-LSTM’s ability to capture temporal dependencies more effectively, thereby reducing the error introduced during the imputation process. In datasets such as df31 (11.91% missing) and df32 (14.77% missing), the difference in MSE and R2 between bi-LSTM and traditional methods was more pronounced. These findings highlight that while simpler imputation methods may be sufficient for datasets with minimal missing data, advanced methods like bi-LSTM are essential for ensuring stable and reliable forecasting outcomes, particularly in datasets with significant data gaps.
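This sensitivity pattern can be reproduced with a simple masking experiment, sketched below under assumptions: series is one tank's level series, impute is any of the compared methods (linear interpolation shown), and the masked fractions span the observed missing-data range.

```python
# Sketch: hide a known fraction of valid points, impute, score the reconstruction.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

def masked_score(series, impute, frac):
    s = series.copy()
    hidden = rng.choice(s.dropna().index, size=int(frac * s.notna().sum()), replace=False)
    truth = s.loc[hidden]
    s.loc[hidden] = np.nan
    filled = impute(s).ffill().bfill()  # guard edge NaNs left by interpolation
    return mean_squared_error(truth, filled.loc[hidden]), r2_score(truth, filled.loc[hidden])

# Spanning the 1.54-16.66% range of missing data observed across the tanks.
for frac in (0.015, 0.05, 0.10, 0.15):
    mse, r2 = masked_score(series, lambda s: s.interpolate(method="linear"), frac)
    print(f"{frac:.1%} masked: MSE={mse:.5f}, R2={r2:.3f}")
```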
Figure 11 illustrates the correlation between imputation quality and forecasting performance, highlighting that datasets with lower imputation errors (using bi-LSTM) also achieve better forecasting results with the f-LSTM, RFR, and ARIMA models. This underscores the importance of choosing a robust imputation method to improve overall forecasting reliability.
Table 4 presents the Mean Squared Error (MSE) for three different forecasting models (f-LSTM, RFR, and ARIMA) using various imputation methods. The bi-LSTM imputation method consistently results in lower MSE across all models, particularly benefiting the f-LSTM model, which relies heavily on accurate temporal pattern recognition. Traditional imputation methods, such as mean imputation and linear interpolation, exhibit significantly higher errors, especially for the ARIMA model, which does not inherently capture complex dependencies.
Figure 12 compares the Mean Squared Error (MSE) of various forecasting models (RFR, f-LSTM, ARIMA) across different imputation methods, illustrating the impact of imputation on model performance.
The results indicate that advanced imputation techniques, such as bi-LSTM, significantly enhance model robustness and reduce forecasting errors. The f-LSTM model, in particular, benefits the most from the bi-LSTM imputation method, achieving the lowest MSE. Traditional methods like mean imputation show limited effectiveness, especially in models that depend on accurate temporal pattern recognition.
Impact of Water Level Dynamics on Imputation Accuracy
The datasets analyzed in this study were grouped into four clusters based on their water level dynamics. This clustering revealed that reservoirs in each cluster exhibit distinct water level behaviors, which can affect the performance of imputation methods.
Reservoirs in Cluster 1 (e.g., df14 with 1.54% missing data) generally showed smaller differences in Mean Squared Error (MSE) between bi-LSTM and traditional imputation methods. The stable and predictable nature of the data in these reservoirs meant that simpler methods, such as linear interpolation, provided reasonably accurate imputations. However, the bi-LSTM method still demonstrated slightly lower MSE values, indicating a marginal improvement due to its ability to model subtle patterns in the data.
Reservoirs in Cluster 2, such as df18 (4.05% missing data) and df31 (11.91% missing data), benefited significantly from advanced imputation methods like bi-LSTM, particularly as the percentage of missing data increased. The bi-LSTM method provided a clear improvement in forecast accuracy compared to other methods due to its capability to capture temporal dependencies and seasonal patterns. This cluster’s moderate to high variability and seasonality made it suitable for methods that can adapt to changes over time.
Reservoirs in Cluster 3, such as df11 (2.61% missing data), showed mixed results. Although the bi-LSTM method demonstrated lower MSE values compared to traditional methods, the relatively lower complexity and moderate variability in these reservoirs did not present substantial gains over simpler imputation approaches. Nonetheless, bi-LSTM was more effective at handling variability than methods such as mean imputation, which tended to introduce bias or higher variance.
Reservoir df20, which is in Cluster 4, has distinct water level patterns that required specialized handling. While only one reservoir was assigned to this cluster, the bi-LSTM method substantially outperformed other methods in maintaining forecast accuracy. The dynamic and complex nature of the water level trends in this cluster necessitated robust imputation techniques to effectively handle variability in the data.
Table 5 details the performance metrics for different imputation methods, highlighting how the cluster characteristics influence the effectiveness of each method. The results indicate that while simpler imputation methods may suffice for reservoirs with stable water levels, advanced methods like bi-LSTM are particularly beneficial for reservoirs with more complex and variable dynamics.
3.3. bi-LSTM Imputation Accuracy Analysis
Figure 13 presents the bi-LSTM imputation accuracy performance metrics (MSE, RMSE, MAE, and R2) for each tank.
Tanks like df12, df20, and df61 demonstrate very successful imputation results, with low MSE, RMSE, MAE, and high R2 values, indicating the bi-LSTM method effectively captured the underlying data patterns. Tanks such as df11, df14, and df19 show moderate performance, where most metrics are relatively good, but there might be slight room for improvement. Tanks like df17 and df32 indicate areas where the method struggled, as reflected by higher MSE, RMSE, and MAE and lower R2 values. The bi-LSTM imputation method was employed to fill the gaps in the data, but the success of this method depends heavily on the patterns present in the valid portions of the time series. If the valid data differ significantly across these tanks in terms of trends or seasonality, the imputed values may vary, leading to different forecast patterns.
Table 6 presents the detailed performance metrics of the bi-LSTM imputation for all tanks.
Overall, the bi-LSTM method shows good performance across most tanks, indicating that it is generally successful for this imputation task. However, specific cases (e.g., df17 and df32) suggest potential limitations or areas for further refinement. All the time series of data imputed for all the tanks can be viewed in Figure A3 in Appendix A.
3.3.1. Impact of Hyperparameters on bi-LSTM Imputation Accuracy
The effectiveness of the bi-LSTM method for imputing missing data across different tanks varies significantly, and much of this variation can be attributed to the choice of hyperparameters. A closer examination of the tanks with the best and worst performance, such as df12, df20, and df61 (high accuracy) versus df17 and df32 (low accuracy), reveals distinct patterns in the hyperparameter settings.
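For context, a per-tank Optuna study over exactly these three hyperparameters might look like the sketch below; the ranges and the helpers build_bilstm and validation_mse are assumptions, not the study's exact search space.

```python
# Sketch of a per-tank hyperparameter search with Optuna (assumed ranges).
import optuna

def objective(trial: optuna.Trial) -> float:
    seq_len = trial.suggest_int("sequence_length", 24, 192)
    units = trial.suggest_int("lstm_units", 32, 256)
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    model = build_bilstm(seq_len, units, lr)  # assumed builder (cf. Section 2.4.5)
    return validation_mse(model, seq_len)     # assumed: MSE on held-out windows

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)    # one tuned configuration per tank
```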
The scatter plots in Figure 14 provide a visual representation of the relationships between the hyperparameters (sequence length, LSTM units, learning rate) and the performance metrics (MSE, R2). Some observations based on the plots follow.
The analysis of the relationship between Mean Squared Error (MSE) and individual hyperparameters, such as sequence length, LSTM units, and learning rate, reveals no clear linear trends. MSE values are distributed across various sequence lengths, indicating that sequence length alone does not directly determine MSE performance. Similarly, while some clusters of lower MSE values appear at specific counts of LSTM units, a higher number of units does not consistently correspond to a reduced MSE. A slight pattern emerges where extremely low and very high learning rates are associated with higher MSE values, while moderate learning rates (approximately 0.004 to 0.006) tend to result in a lower MSE.
The relationship between R2 and these hyperparameters is similarly complex. No strong linear correlation exists between R2 and sequence length, as high and low R2 values are observed across a range of sequence lengths. Similarly, different LSTM unit counts show both high and low R2 values, indicating no clear pattern between these variables. However, moderate learning rates (around 0.004 to 0.006) appear to correlate with higher R2 values, suggesting better model performance, whereas extremely low or high learning rates are often linked with lower R2 values.
These findings suggest that the Optuna optimization method does not reveal a simple, direct impact of any single hyperparameter on imputation accuracy; rather, model performance is likely influenced by the specific combination and interaction of multiple hyperparameters. Models tend to perform better with moderate learning rates, reasonable sequence lengths, and appropriate LSTM units, although other factors also play a critical role. In Figure 15, a correlation matrix is presented for the hyperparameters and performance metrics. The heatmap visualization further illustrates the strength and direction of the correlations between these variables.
The correlation analysis reveals that the hyperparameters studied—sequence length, LSTM units, and learning rate—exhibit varying degrees of association with model performance metrics, such as Mean Squared Error (MSE) and R2. Sequence length shows a weak negative correlation with MSE (−0.04) and a weak positive correlation with R2 (0.09), suggesting that it has minimal direct impact on both error and model fit. LSTM units demonstrate a moderate negative correlation with MSE (−0.29) and a weak positive correlation with R2 (0.02), implying a slight trend where increasing the number of units may reduce error but has little effect on model fit. In contrast, the learning rate exhibits a moderate positive correlation with MSE (0.41) and a moderate negative correlation with R2 (−0.48), indicating that higher learning rates tend to be associated with increased error and reduced model performance.
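These coefficients follow directly from a per-tank results table; a sketch, assuming a pandas DataFrame results with one row per tank and the listed columns:

```python
# Sketch: Pearson correlations between hyperparameters and metrics (Figure 15).
import pandas as pd

hyper = ["sequence_length", "lstm_units", "learning_rate"]
corr = results[hyper + ["MSE", "R2"]].corr(method="pearson")
print(corr.loc[hyper, ["MSE", "R2"]])  # e.g. learning_rate vs MSE ~ 0.41
```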
To further investigate the impact of hyperparameters on prediction errors, we conducted a regression analysis using an Ordinary Least Squares (OLS) model. The analysis focused on key parameters such as sequence length, LSTM units, and learning rate to determine their contribution to the variability in model performance metrics (MSE and R2). The regression results indicate that these hyperparameters collectively explained only a modest proportion of the variance in MSE (27.4%) and R2 (22.0%).
Among the hyperparameters, the sequence length exhibited a marginal influence on MSE (p-value = 0.058), suggesting that reservoirs with different temporal dependencies might experience varying error magnitudes. However, the effects of other hyperparameters, such as the number of LSTM units and learning rate, were not statistically significant (p-values of 0.492 and 0.207, respectively). This suggests that, while hyperparameter tuning can optimize certain aspects of model performance, it does not fully explain the larger prediction errors observed in certain reservoirs.
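A sketch of this regression with statsmodels' formula API, on the same assumed results table:

```python
# Sketch: OLS of MSE on the three hyperparameters (per-tank results assumed).
import statsmodels.formula.api as smf

fit = smf.ols("MSE ~ sequence_length + lstm_units + learning_rate", data=results).fit()
print(fit.rsquared)  # ~0.274 for MSE, per the text
print(fit.pvalues)   # sequence_length ~0.058; lstm_units ~0.492; learning_rate ~0.207
```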
Overall, these results suggest that no single hyperparameter is strongly correlated with performance metrics, highlighting that each hyperparameter’s effect is relatively independent and that combinations may have a more significant impact on model accuracy. While moderate learning rates appear to correlate with better performance (lower MSE and higher R2), the impact of each hyperparameter is likely to depend on specific model configurations and contexts, underscoring the complexity of optimizing model performance through hyperparameter tuning.
3.3.2. Cluster Analysis of Hyperparameters and Imputation Accuracy
The analysis focused on how hyperparameters optimized by Optuna aligned with the unique characteristics of different reservoir clusters, revealing both strengths and limitations in the optimization process. The clustering of reservoirs based on water level dynamics showed diverse characteristics, such as stability, variability, and seasonal patterns. These clusters required tailored hyperparameter settings to achieve effective imputation. Optuna aimed to identify the most suitable set of hyperparameters for each cluster, reflecting their distinct water dynamics.
For reservoirs in the first cluster, characterized by stable water levels, Optuna’s choices of moderate sequence lengths and a balanced number of LSTM units were well aligned with the data’s predictable nature. This cluster exhibited relatively high imputation accuracy, suggesting that the chosen parameters effectively captured the subtle and stable patterns within the datasets. The stability of these reservoirs meant that a simpler model architecture with moderate learning parameters was sufficient to achieve high performance.
In contrast, the reservoirs in the second cluster exhibited moderate to high seasonal variations, requiring a more nuanced approach to capturing complex temporal dependencies. Optuna’s selections in this cluster involved higher LSTM units and learning rates, which seemed appropriate given the increased complexity. The performance metrics supported this choice, with most reservoirs achieving relatively good imputation accuracy. However, there were exceptions, such as reservoir df17, where the performance indicated room for improvement. This suggests that while the optimization process captured the broader requirements for this cluster, certain reservoirs might still benefit from more fine-tuned adjustments or different parameter combinations.
Reservoirs in the third cluster demonstrated significant variability in water usage, presenting a challenge for hyperparameter optimization. Optuna’s strategy here included using a varied range of LSTM units and higher dropout rates to mitigate overfitting. This approach aimed to handle the high fluctuations observed in the water levels. While the imputation results for some tanks, such as df11, showed mixed success, indicating that the parameters were reasonable, further refinement might be necessary. The variation in performance within this cluster highlights the complexities in optimizing hyperparameters for highly variable datasets.
The fourth cluster, comprising reservoirs with distinct water level patterns, showed a more targeted optimization approach. Reservoirs like df20 received tailored hyperparameters that generally performed well, with low MSE and high R2 values indicating that Optuna successfully captured the unique dynamics of these reservoirs. The customization in parameter selection for this cluster suggests that Optuna’s optimization was adept at recognizing unique patterns and adjusting accordingly.
The scatter plots of hyperparameters by cluster and performance in Figure 16 show that Optuna effectively identified some of the underlying dynamics specific to each cluster. Reservoirs with similar water level behaviors tended to receive comparable parameter settings, leading to consistent performance within clusters. However, there was no clear linear relationship between individual hyperparameters, such as sequence length or LSTM units, and performance metrics across all clusters. This indicates that the optimization process was more nuanced and context-dependent, considering the complex interplay between multiple hyperparameters and the unique characteristics of each reservoir.
The strengths of Optuna’s optimization are evident in its ability to improve imputation performance across clusters, particularly those with moderate to high variability where the benefits of tuning were more pronounced. The algorithm was capable of differentiating between different reservoir dynamics and selecting parameters that generally led to good performance. However, limitations were also apparent, especially in highly variable reservoirs, such as df17 in Cluster 2, where the selected hyperparameters did not always yield optimal results. This may reflect the difficulty in fully capturing the complexity of certain dynamics with the chosen parameters, suggesting potential for further refinement or alternative strategies.
3.4. Water Demand Models’ Forecasting Performance
After the application of the bi-LSTM imputation method to the data, the f-LSTM model’s performance was compared with RFR, ARIMA, and SARIMA models for time series forecasting.
Figure 17 presents the comparison, using the same evaluation metrics (MSE, RMSE, MAE, and R2) on the test sets.
Table 7 presents the forecasting results, demonstrating strong predictive capabilities.
The comparative analysis reveals that the f-LSTM model outperforms the RFR, ARIMA, and SARIMA models across all performance metrics (MSE, MAE, RMSE, R2). This superior performance can be attributed to the f-LSTM model’s inherent ability to capture long-term dependencies and non-linear patterns in time series data, which is crucial for accurately predicting water demand in complex and dynamic environments. The RFR model, although competitive in certain cases, fails to effectively capture intricate temporal dependencies, as indicated by its relatively higher MSE and RMSE values for several tanks.
In contrast, the ARIMA and SARIMA models show significant limitations, particularly when dealing with the non-stationary and non-linear characteristics of the data. Their consistently higher MSE, MAE, and RMSE values, along with negative R2 scores in some instances, suggest that these models are less suitable for this specific forecasting task. The performance decline of the ARIMA and SARIMA models for certain tanks (e.g., df11 and df17) further illustrates their difficulty in adapting to the seasonal and trend components present in the data, leading to suboptimal forecasts. The time series predictions for all tanks over a seven-day forecast horizon are visualized in Figure A4 for the f-LSTM model, in Figure A5 for the RFR model, in Figure A6 for the ARIMA model, and in Figure A7 for the SARIMA model in Appendix A.
These findings highlight the critical importance of selecting models that can effectively capture the complex temporal relationships inherent in water demand data. The superior performance of the f-LSTM model indicates that it is the most appropriate choice for this forecasting task, particularly in environments characterized by complex, non-linear dynamics and long-term dependencies. The model’s ability to consistently achieve lower error rates and higher R2 values demonstrates its robustness and reliability in delivering accurate forecasts.
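For reference, the two non-deep baselines in this comparison can be sketched as below, assuming the level series has been resampled to hourly steps; the lag count, forest size, and (S)ARIMA orders are illustrative assumptions rather than the tuned configurations used in the study.

```python
# Sketch: RFR on lagged features and SARIMA, holding out a seven-day horizon.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

LAGS = 24
df = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(1, LAGS + 1)})
df["y"] = series
df = df.dropna()
train, test = df.iloc[:-168], df.iloc[-168:]  # 168 hourly steps = seven days

rfr = RandomForestRegressor(n_estimators=300, random_state=42)
rfr.fit(train.drop(columns="y"), train["y"])
rfr_pred = rfr.predict(test.drop(columns="y"))

# SARIMA; plain ARIMA is the seasonal_order=(0, 0, 0, 0) special case.
sarima = SARIMAX(train["y"], order=(1, 1, 1), seasonal_order=(1, 0, 1, 24)).fit(disp=False)
sarima_pred = sarima.forecast(steps=len(test))
```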
Impact of Water Level Dynamics on Forecasting Accuracy
The forecasting accuracy of the different models—f-LSTM, RFR, ARIMA, and SARIMA—was notably influenced by the water level dynamics observed in the reservoirs, which were categorized into four distinct clusters. The impact of these dynamics on model performance provides valuable insights into the suitability of different forecasting methods for varying reservoir conditions.
The reservoirs in Cluster 1, such as df10, df12, and df19, demonstrated relatively stable water levels with minimal seasonal variations. This stability contributed to consistent forecasting performance across all models. The f-LSTM model, in particular, showed the lowest normalized Mean Squared Error (MSE) values, highlighting its ability to capture the steady patterns typical of these reservoirs. The ARIMA and SARIMA models also performed reasonably well in this cluster, benefiting from the predictable nature of the data. The lower forecast errors across all models in this cluster suggest that stable water level dynamics enhance model accuracy, particularly for methods that rely on detecting consistent trends with minor seasonal adjustments.
In Cluster 2, which included reservoirs such as df17, df18, df21, and df31, moderate to high seasonal variations were observed. The f-LSTM model again outperformed the other models, maintaining lower normalized MSE values despite the increased complexity of the water level dynamics. However, the RFR model showed greater difficulty in this cluster, with comparatively higher forecast errors. The ARIMA and SARIMA models, known for their sensitivity to seasonal fluctuations, exhibited inconsistent performance, reflecting their limitations in modeling non-linear dynamics and adapting to moderate to high variability. The forecast errors in this cluster were generally larger, indicating that increased variability and seasonal changes pose challenges for models less equipped to handle dynamic patterns over time.
Cluster 3, featuring reservoirs like df11 and df15, was characterized by significant variability in water usage. The f-LSTM model continued to show superior performance, but the disparity in normalized MSE between the f-LSTM and traditional methods such as ARIMA and SARIMA was more marked in this cluster. The high degree of variability resulted in substantial forecast errors for the ARIMA and SARIMA models, which struggled with the rapid changes and irregular patterns of water levels. Similarly, the RFR model recorded higher normalized errors compared to the f-LSTM model, highlighting its limitations in managing such significant fluctuations.
The distinct water level patterns of Cluster 4, represented solely by df20, presented unique challenges. The f-LSTM model achieved the best performance in this cluster, with significantly lower normalized MSE values compared to other models. Although ARIMA and SARIMA showed some capacity to adapt, their errors remained higher, reflecting their limitations in capturing the complex dynamics of reservoirs with unique operational characteristics. This outcome reinforces the effectiveness of the f-LSTM model in handling complex and non-linear data patterns, even in cases with unique dynamics.
The forecast performance by cluster, visualized in Figure 18, illustrates how different water level dynamics impact model accuracy. The f-LSTM model consistently achieved lower forecast errors across all clusters, demonstrating its robustness in managing diverse water level patterns, from stable to highly variable. In contrast, traditional models such as ARIMA and SARIMA exhibited greater sensitivity to clusters with high variability and complexity, resulting in more significant forecast errors. The RFR model, while competitive in some contexts, also faced challenges in clusters marked by high variability and intricate dynamics.
These findings highlight the critical role of model selection in effectively capturing the underlying water level dynamics. The consistently strong performance of the f-LSTM model across all clusters suggests it is well suited for forecasting in environments with diverse and complex water dynamics. Conversely, the higher forecast errors observed in the ARIMA and SARIMA models, particularly in clusters with significant variability, underscore the limitations of traditional forecasting methods when applied to dynamic and non-linear water level data. The differences in normalized forecast errors across models emphasize the need for robust and adaptable methods like f-LSTM to achieve accurate predictions in varied reservoir conditions.
3.5. Error Analysis
Error analysis aimed to identify patterns in prediction performance issues and potential areas for model improvement.
Figure 19 presents a comparison of imputation and forecasting MSE by tank, leading to the following observations.
The analysis reveals a consistent trend where tanks with higher imputation MSE (bi-LSTM), indicated by the black line, also exhibit higher forecasting MSE across all models (RFR in red, f-LSTM in blue, ARIMA in green, and SARIMA in purple). This suggests that the errors introduced during the imputation phase propagate into the forecasting stage, negatively affecting the overall prediction accuracy. For example, tank df17 has a high imputation MSE of 0.08896, which corresponds to elevated forecasting MSEs for most models: 0.12961 for RFR, 0.55538 for ARIMA, and 0.55291 for SARIMA, while the f-LSTM remains comparatively low at 0.00716. This indicates that substantial imputation errors contribute significantly to degraded forecasting performance, highlighting a strong correlation between high imputation errors and high forecasting errors.
On the other hand, tanks such as df10, df11, df12, and df14 exhibit lower imputation MSEs and, correspondingly, relatively low forecasting MSEs across all models. For instance, tank df10 has an imputation MSE of 0.00429 and forecasting MSEs of 0.00295 for RFR, 0.00261 for f-LSTM, 0.06454 for ARIMA, and 0.06747 for SARIMA. This observation reinforces the importance of high-quality imputation in achieving better forecasting performance.
Therefore, improving the accuracy of the imputation process is crucial for enhancing overall forecasting accuracy. Potential strategies for achieving more accurate forecasts include implementing more sophisticated imputation techniques, incorporating additional contextual data, or improving the quality of the initial dataset. By addressing these imputation errors, it is possible to reduce their propagation to the forecasting stage and enhance the model’s overall prediction capabilities.
3.5.1. Correlation Analysis
Correlation analysis reveals varying degrees of correlation between the imputation MSE (bi-LSTM) and the forecasting MSE for different models. These findings provide insights into how errors from the imputation stage affect the subsequent forecasting performance for each model.
In Figure 20, a scatter plot of imputation versus forecasting MSE values is presented, demonstrating the relationship between the imputation errors (bi-LSTM) and the forecasting errors for the different models (RFR, f-LSTM, ARIMA, and SARIMA). The scatter points indicate a trend where higher imputation MSE values often correspond to higher forecasting MSE values, particularly for models like ARIMA and SARIMA. This visual representation reinforces the importance of imputation quality in determining the forecasting accuracy of these models.
Table 8 provides the correlation coefficients and p-values for the relationship between imputation MSE and forecasting MSE across different models. A moderate positive correlation coefficient of 0.47 was observed between the imputation MSE (bi-LSTM) and the forecasting MSE for the RFR model, with a p-value of 0.031. This p-value is below the common threshold of 0.05, indicating statistical significance. The result suggests that errors in the imputation phase significantly affect the RFR model’s forecasting performance. Therefore, reducing imputation errors can lead to notable improvements in the accuracy of RFR-based forecasts.
For the f-LSTM model, the correlation coefficient between the imputation MSE (bi-LSTM) and forecasting MSE is 0.28, with a p-value of 0.227. This p-value is higher than 0.05, suggesting that the correlation is not statistically significant. This indicates that the f-LSTM model’s forecasting accuracy is relatively insensitive to errors in the imputation phase. The f-LSTM model appears to be more robust in handling data quality issues, making it a reliable choice in scenarios where perfect data imputation may not be feasible.
A strong positive correlation coefficient of 0.74 was identified between the imputation MSE (bi-LSTM) and the forecasting MSE for the ARIMA model, with a very low p-value of 0.0001. This indicates a highly statistically significant relationship. The strong correlation suggests that ARIMA’s forecasting performance is highly sensitive to the quality of the imputation process. High imputation errors result in substantial increases in forecasting errors, demonstrating the critical need for precise data imputations when using ARIMA for time series forecasting. Similarly, the correlation coefficient between the imputation MSE (bi-LSTM) and forecasting MSE for the SARIMA model is 0.72, with a p-value of 0.0003, also indicating a statistically significant correlation. Like ARIMA, SARIMA’s performance is strongly influenced by imputation quality. The results suggest that the SARIMA model’s accuracy significantly deteriorates with higher imputation errors, underscoring the importance of high-quality imputation methods for this model as well.
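The coefficients and p-values in Table 8 correspond to scipy's Pearson test; a sketch, assuming a per-tank error table errors with one column per model:

```python
# Sketch: correlation of imputation MSE with each model's forecasting MSE.
from scipy.stats import pearsonr

for model in ["RFR", "f-LSTM", "ARIMA", "SARIMA"]:
    r, p = pearsonr(errors["imputation_MSE"], errors[f"{model}_MSE"])
    print(f"{model}: r = {r:.2f}, p = {p:.4f}")  # e.g. ARIMA: r = 0.74, p = 0.0001
```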
These findings highlight the significance of imputation accuracy in forecasting models, particularly for RFR, ARIMA, and SARIMA, which show statistically significant correlations between imputation and forecasting errors. For these models, improving the imputation process is crucial to enhancing forecasting performance. On the other hand, the f-LSTM model’s lack of significant correlation suggests its robustness against imputation errors, making it a suitable choice in situations where high-quality imputation is challenging.
To improve forecasting accuracy, particularly for models like ARIMA and SARIMA, it is essential to adopt advanced imputation techniques that can effectively handle missing or poor-quality data. Utilizing sophisticated methods, such as ensemble learning or machine learning-based imputation, and incorporating additional contextual data can enhance the imputation process, resulting in a more reliable dataset for forecasting. By tailoring the approach to the specific sensitivity of each model to imputation errors, it is possible to optimize forecasting outcomes effectively.
A positive correlation was confirmed by plotting imputation and forecasting MSE values for each tank, reinforcing the idea that better imputation results lead to improved forecasting accuracy. Thus, improving the imputation process is crucial for enhancing forecasting performance. Prioritizing advanced ensemble methods and incorporating additional contextual data can reduce errors in the imputation step, resulting in a more reliable dataset and more accurate forecasts.
3.5.2. Residuals Analysis
To further evaluate the performance and reliability of the forecasting models, an analysis of the residuals was conducted. Residuals, defined as the differences between the observed and predicted values, offer critical insights into model accuracy and bias.
Figure 21 displays the residual plots for each model.
The Shapiro–Wilk test assesses whether the residuals follow a normal distribution, which is a key assumption for many statistical models. For the RFR model, the test yielded a Shapiro–Wilk statistic of 0.9676 and a p-value of 0.6803, well above the 0.05 threshold. This result suggests that the residuals of the RFR model do not significantly deviate from normality, indicating that the model’s errors are randomly distributed around zero without any apparent bias. Similarly, the f-LSTM model demonstrated a Shapiro–Wilk statistic of 0.9748 and a p-value of 0.8354, also well above the 0.05 threshold. This confirms that the residuals for the f-LSTM model are approximately normally distributed, supporting the conclusion that this model effectively captures the underlying data patterns with no systematic error or bias.
In contrast, the ARIMA and SARIMA models showed different results in the residual analysis. The Shapiro–Wilk statistic for the ARIMA model was 0.9584, with a p-value of 0.4836. Although this p-value is still above the 0.05 significance level, it is lower compared to the p-values of the RFR and f-LSTM models, suggesting a weaker, but not statistically significant, deviation from normality. This indicates that while the ARIMA model does not show a significant deviation from a normal distribution, its residuals are less normally distributed than those of the RFR and f-LSTM models. For the SARIMA model, the Shapiro–Wilk statistic was 0.9342 with a p-value of 0.1668, which is closer to the 0.05 threshold. While the residuals of the SARIMA model are also not significantly different from normality, the lower p-value suggests a stronger indication of non-normality compared to the other models. This may imply the presence of some patterns or biases in the SARIMA model’s residuals that are not fully accounted for by the model.
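The test itself is a one-liner per model with scipy; residuals, mapping each model name to its residual array, is an assumed structure:

```python
# Sketch: Shapiro-Wilk normality check on each model's residuals.
from scipy.stats import shapiro

for name, resid in residuals.items():  # e.g. {"RFR": ..., "f-LSTM": ..., ...}
    stat, p = shapiro(resid)
    verdict = "consistent with normality" if p > 0.05 else "evidence of non-normality"
    print(f"{name}: W = {stat:.4f}, p = {p:.4f} -> {verdict}")
```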
The findings from the residuals analysis suggest that the RFR and f-LSTM models provide more reliable predictions, as indicated by their residuals’ closer adherence to a normal distribution. The f-LSTM model, in particular, shows a high p-value, indicating strong evidence that its residuals are normally distributed, which reinforces its robustness and reliability in capturing the complex patterns of the underlying data. The RFR model also demonstrates reliable performance, with normally distributed residuals and no significant bias.
However, the slightly lower p-values for the ARIMA and SARIMA models indicate that these models may not capture the data patterns as effectively as the RFR and f-LSTM models. While their residuals do not significantly deviate from normality, the closer proximity of their p-values to the 0.05 threshold suggests that there may be underlying issues in these models’ handling of the data, such as unmodeled trends or seasonality that the models are not fully capturing.
Residual normality is a critical aspect that directly influences the reliability of a model’s predictions. Residuals, or the differences between observed and predicted values, reflect the errors a model makes when forecasting. If the residuals are normally distributed, it suggests that the model’s errors are random, unbiased, and evenly spread around zero, which is an indication that the model has successfully captured the underlying patterns in the data.
4. Conclusions
This study aimed to enhance the accuracy and reliability of short-term water demand forecasting by integrating deep learning and machine learning models with traditional statistical approaches. The results demonstrate that using advanced imputation methods, specifically the bidirectional Long Short-Term Memory (bi-LSTM) network, significantly improves the forecasting performance of models such as the forecasting LSTM (f-LSTM) and Random Forest Regressor (RFR), particularly in cases with higher percentages of missing data. However, these findings also reveal substantial variations in model effectiveness across different scenarios, warranting further examination.
The cluster analysis provided additional insights into how water level dynamics impact both imputation accuracy and forecasting performance. The clusters, defined by distinct water level behaviors, directly influenced the effectiveness of the models and the hyperparameters optimized by Optuna. For clusters with stable water levels, such as Cluster 1, the simpler models and moderate hyperparameters chosen by Optuna were sufficient to achieve high accuracy, with the f-LSTM model particularly excelling due to predictable water dynamics. In clusters with more complex patterns, such as those with moderate to high seasonal variations (Cluster 2) or significant variability (Cluster 3), Optuna’s selection of more advanced hyperparameters, including higher LSTM units and varied learning rates, improved imputation accuracy and forecasting performance, especially for bi-LSTM and f-LSTM models. In Cluster 4, characterized by distinct and dynamic water levels, tailored hyperparameters enabled the effective capture of unique patterns, though only deep learning models like f-LSTM consistently handled these complexities. While Optuna effectively adjusted hyperparameters based on cluster characteristics, achieving good overall performance, further improvements are still possible in highly variable environments.
The comparative analysis shows that the bi-LSTM model consistently outperformed traditional imputation methods, including linear, polynomial, mean, and K-Nearest Neighbors (KNN) imputation, across all datasets. The bi-LSTM method resulted in lower Mean Squared Error (MSE) and higher R2 values, demonstrating its superior capability in handling complex temporal dependencies and accurately imputing missing data. For datasets with more than 10% missing data, the bi-LSTM method reduced forecasting errors by 15–20% compared to simpler imputation techniques. This highlights the effectiveness of the bi-LSTM model in maintaining high accuracy, even in challenging scenarios where data irregularities are common. However, the error analysis reveals that the bi-LSTM model still experienced higher prediction errors in datasets with extreme data irregularities or sudden fluctuations. For example, in datasets characterized by sporadic water consumption spikes or irregular patterns, the bi-LSTM model showed increased errors, with MSE values up to 0.045, compared to more stable datasets, where the MSE was as low as 0.0025. This suggests that while the bi-LSTM is generally robust, its performance can be affected by abrupt changes that are not well represented in the training data.
The forecasting performance analysis further indicates that the choice of model is highly context-dependent. The f-LSTM model demonstrated the highest robustness and accuracy, particularly in scenarios with non-linear dependencies and significant data irregularities. The f-LSTM achieved a test MSE as low as 0.0026 in certain datasets, significantly outperforming both the ARIMA and SARIMA models, which exhibited much higher errors. The ARIMA model, for instance, had MSE values ranging from 0.25 to 0.5554 in datasets with irregular patterns, highlighting its limitations in handling non-linear and complex data. The RFR model, while generally more effective than ARIMA and SARIMA, still exhibited moderate errors in certain scenarios. For example, in datasets with high variability and noise, the RFR model showed MSE values between 0.015 and 0.050, indicating its effectiveness in handling noisy data but also revealing its limitations in capturing longer-term temporal dependencies compared to the f-LSTM model. This suggests that while the RFR model can handle data complexities to some extent, it may not always match the performance of deep learning models like the f-LSTM when non-linear and long-term dependencies are critical.
The ARIMA and SARIMA models were particularly prone to significant errors in datasets with non-linear patterns or irregular intervals. Their reliance on linear assumptions made them less capable of accurately predicting water demand in datasets with complex usage patterns, resulting in substantial deviations from actual values. These models performed better in scenarios with clear linear trends and seasonality but still fell short compared to the f-LSTM and RFR models, which were more capable of managing data complexities and providing reliable forecasts.
Overall, the variations in model performance can be attributed to their different mechanisms for capturing data patterns. The f-LSTM model, with its deep learning architecture, effectively captures non-linear dependencies and long-term temporal relationships, while the RFR model utilizes ensemble learning to handle noisy and imprecise data. In contrast, the ARIMA and SARIMA models rely on linear assumptions, which limits their ability to manage complex, irregular data. Therefore, for optimizing water demand forecasting systems, the f-LSTM and RFR models are more effective in capturing real-world water usage patterns, providing a more reliable and accurate framework for decision-making in water resource management.
Future research should explore the stability and dependability of these models under various operational conditions to ensure consistent performance across diverse scenarios. This could involve testing the models in different real-world environments, such as varying climatic conditions or regions with unique water consumption patterns. Additional efforts should focus on further refinements in hyperparameter optimization to enhance model performance, particularly in clusters with high variability and unique patterns. The development of more sophisticated optimization techniques, potentially incorporating additional contextual data, could provide more precise parameter settings that better capture the intricate dynamics of reservoir water levels.
Moreover, scaling these models to larger datasets and diverse contexts is critical for validating their robustness and adaptability in different settings. Future research should consider applying these models to datasets from various regions to provide more comprehensive insights into water demand management. By addressing these areas, future studies can bridge the gap between theoretical model performance and practical applicability, maximizing the utility of these advanced forecasting models in real-world water management scenarios. Future research should also focus on developing efficient pipelines for real-time data integration, incorporating advanced anomaly detection, and improving imputation techniques to ensure data accuracy. Optimizing the LSTM model’s architecture to reduce computational requirements will facilitate deployment on resource-constrained platforms.
Finally, there is a need to investigate the practical implementation of these models in real-world decision-making scenarios. This includes integrating the models into existing water management systems, evaluating their performance in operational contexts, and developing user-friendly interfaces to facilitate the use of model predictions by utility managers. Additionally, creating guidelines for aligning these models with regulatory requirements and operational constraints will support their wider adoption in water resource management.