2.1. Study Area
The Jucazinho Dam impounds the Capibaribe River in the municipality of Surubim, in the state of Pernambuco, Brazil, within the Capibaribe watershed, as shown in the location map in Figure 1. Its construction began in 1995 and was completed in 1998; at the time, more than two thousand hectares of riverside land were expropriated, affecting about 5000 people [16].
One of the reasons for the construction of the dam was the scarcity of water supply in rural Pernambuco. With the construction of Jucazinho, which has a maximum storage capacity of 245.26 million cubic meters of water at the maximum maximorum elevation of 295 m, 21 municipalities could be served, impacting the lives of approximately 800 thousand inhabitants. As shown in Figure 2, 92% of the dam's use is for human water supply. In addition, the dam is used for fish farming, livestock and agriculture.
The other reason for the construction of the dam was flood control in the Capibaribe basin, a plan that involved the construction of several dams to protect the metropolitan region of Recife (about 135 km away) from historical floods such as those that occurred between 1960 and 1980. Part of the dam's storage is reserved as a flood control volume.
The dam is a roller-compacted-concrete gravity structure with a central stepped spillway and a ski-jump-type dissipation basin. There are also two side spillways connected to the dam's abutments. A gallery, accessible from both abutments, extends along the entire length of the dam. Above the central spillway, a bridge connects the abutments. The dam has one water intake for supply and another for the release of the ecological flow downstream, both with a 2.0 m diameter pipe that reduces to 1.5 m. In Table 1, we provide data from the dam's technical file, with information provided by the Department of Water Resources of Pernambuco and data measured by Neves et al. [17].
2.2. Input Data
The “garbage in, garbage out” principle refers to the fundamental idea that the quality of the output of a data processing system is directly influenced by the quality of the input data. In other words, if inaccurate, incomplete or inadequate information is fed into a system, it is inevitable that the resulting output will also be inaccurate or of poor quality. This principle highlights the critical importance of reliable and high-quality data entry to ensure accurate and useful results in any computing or decision-making process.
The data for this study were obtained from the HIDROWEB portal, an online platform maintained by the Brazilian National Water Agency (ANA) that offers information on Brazil's water resources. The portal provides real-time data from a vast network of monitoring stations across the country, covering hydrometeorological, hydrographic and water quality aspects. In the Jucazinho Dam's basin, 32 rainfall monitoring stations and two river flow monitoring stations were identified, as shown in Figure 3.
To choose stations for the study, those with records covering the same time frame were initially selected. Fluviometric stations 39100000 and 39130000 had records overlapping those of pluviometric stations 735159, 736040, 736041, 736042 and 836092 and were the most recently updated, and they were thus chosen for the study.
A key aspect of hydrological modeling is the correlation between rainfall and flow data. Rainfall drives surface runoff and groundwater recharge, directly affecting river flow levels. Spearman's correlation coefficient ($r_s$), a robust non-parametric measure, was used to assess this correlation, as shown in Equation (1). It ranges from −1 to 1 and indicates negative ($r_s < 0$), positive ($r_s > 0$) or no correlation ($r_s = 0$):

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad (1)$$

where $d_i$ and $n$ are, respectively, the difference in ranks between the original series and the series sorted in ascending order for the $i$-th observation and the total number of observations.
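As an illustration, the coefficient can be computed directly with SciPy's spearmanr; the file and column names below are hypothetical placeholders for the HIDROWEB exports:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical file names; the daily series come from ANA's HIDROWEB portal.
rain = pd.read_csv("pluvio_736042.csv", parse_dates=["date"], index_col="date")["rain_mm"]
flow = pd.read_csv("fluvio_39130000.csv", parse_dates=["date"], index_col="date")["flow_m3s"]

# Align the two series on their common dates and drop days missing in either one.
paired = pd.concat([rain, flow], axis=1, join="inner").dropna()

# spearmanr ranks both series internally and applies Equation (1).
r_s, p_value = spearmanr(paired.iloc[:, 0], paired.iloc[:, 1])
print(f"Spearman r_s = {r_s:.2f} (p = {p_value:.3g})")
```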
The highest correlation was observed between pluviometric station 736042 and fluviometric station 39130000, as shown in Figure 4, with a correlation coefficient of 0.18, leading to their selection for the study.
Despite the low correlation, this research addresses a real situation in which the case study (the Jucazinho Dam) was selected by the study funder (COMPESA). The choice was due to the dam's importance to the region, its history of "collapse", and an operation historically based on the empirical judgment of its technicians. Precisely because of these data limitations, this study can contribute to the literature by evaluating artificial intelligence models in real situations with scarce data.
The general data of both stations are contained in Table 2. Given the start and end dates of the two series, the records from 1 January 1986 to 1 June 2023 were used to develop the hydrological model.
Regarding the flow data, to fill the gaps in the record of station 39130000, data from station 39100000 were used, multiplied by a factor of 1.57, the ratio between the drainage area of 2450 km² of station 39130000 and the drainage area of 1560 km² of station 39100000, both obtained from ANA's HIDROWEB portal.
It is noteworthy that both stations are located on the Capibaribe River, the main river of the Capibaribe Hydrographic Basin and the one impounded by the Jucazinho Dam, and that the number of gaps is insignificant compared to the total series, allowing this type of gap filling without major damage to the model.
The remaining missing data, for both flow and precipitation, at stations 736042 and 39130000 were interpolated linearly. The series with the gaps filled is presented in Figure 5 and covers a total of 13,668 days. Initially, stations 736042 and 39130000 had gaps on, respectively, 0.61% and 5.38% of these days. After using the data from station 39100000 multiplied by the factor of 1.57, the percentage of gaps at station 39130000 dropped to 4.15%. Finally, both series contained 13,668 days of records without gaps.
It should be noted that linear interpolation is not, in general, an appropriate method for filling daily gaps in a precipitation or flow series; however, because the gaps here fall within a period of low precipitation and flow, with values equal or close to zero, the method could be applied without introducing significant errors into the filled series.
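A minimal pandas sketch of this two-step gap filling, assuming a DataFrame `flows` with one column of daily flows (m³/s) per station (the variable and column names are hypothetical):

```python
import pandas as pd

AREA_RATIO = 2450.0 / 1560.0  # drainage-area ratio between the stations (≈ 1.57)

# Step 1: fill gaps at station 39130000 with area-scaled flows from 39100000.
flows["39130000"] = flows["39130000"].fillna(flows["39100000"] * AREA_RATIO)

# Step 2: linearly interpolate the remaining gaps (acceptable here only because
# they fall in low-flow periods with values at or near zero).
flows["39130000"] = flows["39130000"].interpolate(method="linear")
```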
It should also be noted that fluviometric station 39130000 is about 35.16 km from the Jucazinho Dam embankment and 14.64 km from the Jucazinho inundation area, whose side spillway crest elevation is about 295 m. Therefore, for practical application to reservoir management, the flow series generated by the artificial intelligence models trained on the data presented in Figure 5 should be multiplied by a factor of 1.69. This factor is the ratio between the drainage area of 4149.90 km² of the Jucazinho Dam and the drainage area of 2450.00 km² of station 39130000.
Figure 6 presents the monthly rainfall accumulations averaged over all years of record for the catchment basin. It shows that there is little rainfall in the area, with approximately four months of rain and eight months of drought, indicating that the Capibaribe watershed has no hydrological memory.
The hydrological memory of a watershed represents its ability to store and release water over time in response to climatic conditions. It is influenced by factors such as geology and land use, and basins with permeable soils tend to have greater hydrological memory.
Hydrological models face challenges in basins without hydrological memory because such basins have less predictable responses, impairing the model's ability to capture anomalous climate events, given that the temporal variability of water retention and release directly influences the hydrological response.
2.3. Model Construction
For the development of the model, the input variables are presented in Table 3; the model predicts a sequence of future steps from a sequence of past observations. As $Q_{t-1}$ represents the flow rate at the time step prior to $t$, the flows delayed by 1, 2, 3, …, 32 days with respect to $t$ are denoted, respectively, $Q_{t-1}$, $Q_{t-2}$, …, $Q_{t-32}$; likewise, the flows advanced by 1, 2, 3, …, 32 days with respect to $t$ are denoted, respectively, $Q_{t+1}$, $Q_{t+2}$, …, $Q_{t+32}$. The same nomenclature logic is used for precipitation ($P$).
Thus, in the first scenario, the prediction of the flow for the next day was made from the previous data of one day of flow and precipitation (C-1). In the second scenario, previous data from two days of flow and precipitation were considered in order to predict the next two days of flow (C-2). In the other scenarios, the same logic was used but with 4 (C-4), 8 (L-8), 16 (L-16) and 32 days of data (L-32). The models were grouped into:
(Group C): Short-term prediction for 1 (C-1), 2 (C-2) and 4 (C-4) days of prediction;
(Group L): Long-term prediction for 8 (L-8), 16 (L-16) and 32 days (L-32) of prediction.
The models were constructed in an orderly manner, without mixing the chronological order of the pairs containing the input and output variables and with values recurring across these pairs. Using model C-1 as an example, the pairs are shown in Table 4.
Note that several flow values are repeated in both the input and output sets. In this way, historical values are used both to predict future flow values and to provide information about past patterns that influence these predictions.
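A minimal sketch of how such ordered (input, output) pairs can be assembled, assuming `flow_series` and `rain_series` are NumPy arrays of the gap-filled daily records (both names hypothetical; the paper's exact pairing follows its Table 4):

```python
import numpy as np

def make_pairs(flow: np.ndarray, rain: np.ndarray, window: int):
    """Build (input, output) pairs: `window` past days of flow and rainfall
    predict the next `window` days of flow, preserving chronological order."""
    X, y = [], []
    for t in range(window, len(flow) - window + 1):
        past = np.concatenate([flow[t - window:t], rain[t - window:t]])  # Q_{t-w}..Q_{t-1}, P_{t-w}..P_{t-1}
        X.append(past)
        y.append(flow[t:t + window])  # Q_t ... Q_{t+w-1}
    return np.array(X), np.array(y)

# Example: scenario C-2 uses two past days to predict the next two days.
X, y = make_pairs(flow_series, rain_series, window=2)
```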
In a short-term context, usually covering periods of up to a week, flow prediction is essential for immediate decision-making. This includes real-time control of water flow, flood prevention, and reservoir water level management. The ability to anticipate intense weather events or sudden changes in hydrological conditions allows for rapid responses, such as the controlled release of water to prevent flooding or the immediate adjustment of the stored volume to meet current demand.
On the other hand, from a long-term perspective, usually covering periods longer than a week, flow forecasting is crucial for strategic planning. This allows for gradual adaptation to seasonal hydrological conditions, contributing to the sustainable management of water resources over time.
Regarding their classification, hydrological models based on artificial intelligence can be adapted to generate both probabilistic and deterministic predictions. Because the model generates unique and specific predictions based on initial conditions and defined parameters, it is reasonable to classify it as deterministic.
The input and output data were normalized between 0 and 1 to prevent the predominance of variables with high values, which is common in machine learning models. It is noteworthy that the normalization was done per variable: for example, the normalization of a given variable such as $Q_t$ considers only that variable's data series. For this process, the MinMaxScaler function of the Sklearn library was used, which performs the transformation through Equation (2):

$$x_{norm} = \frac{x_n - x_{min}}{x_{max} - x_{min}} \quad (2)$$

where $x_n$ is the $n$-th value of the data series, $x_{norm}$ is the value after normalization, and $x_{min}$ and $x_{max}$ are, respectively, the minimum and maximum values of the data series.
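A sketch of this per-variable scaling with scikit-learn, where `flow_series` again stands in for one variable's data as a NumPy array:

```python
from sklearn.preprocessing import MinMaxScaler

# One scaler per variable, so each series is scaled by its own min and max.
scaler_q = MinMaxScaler(feature_range=(0, 1))
q_norm = scaler_q.fit_transform(flow_series.reshape(-1, 1))  # applies Equation (2)

# After prediction, the transform is inverted to recover flows in m³/s.
q_back = scaler_q.inverse_transform(q_norm)
```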
To apply the machine learning models, the data were divided into training and testing sets using the train_test_split function from the Sklearn library, and a sensitivity analysis was performed considering the following proportions (a code sketch follows the list):
65% for training and 35% for testing (65–35);
70% for training and 30% for testing (70–30);
75% for training and 25% for testing (75–25);
80% for training and 20% for testing (80–20).
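A minimal sketch of one such split; shuffle=False is an assumption consistent with the ordered pair construction described above:

```python
from sklearn.model_selection import train_test_split

# shuffle=False keeps the chronological order of the (input, output) pairs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=False  # e.g., the 70-30 split
)
```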
2.3.1. Defining Hyperparameters
Hyperparameters are external parameters that are not learned directly by the model during training but need to be specified before the training process begins. These parameters influence the overall behavior of the model and affect training performance and effectiveness. In contrast, model parameters are the weights that the model adjusts during training to make predictions based on the data.
Because the time series is large, which significantly increases the computational cost, and because the aim was to obtain an initial reference point for evaluating the adaptability of machine learning models, the present work used the default values of the hyperparameters. Since these values are generic and work well in a variety of situations, they provide computational efficiency and avoid the initial need for extensive experimentation.
2.3.2. Support Vector Machine
In the SVM model, the SVR function of the Sklearn library was used, for which the list of hyperparameters with their respective variable types and default values is provided in Table 5.
The SVR machine learning model is an extension of the SVM algorithm for regression tasks. The original SVM was initially developed for classification problems, but the SVR variant has been adapted to handle the prediction of numerical values instead of classes.
The core logic behind SVR is the same as for SVM and involves searching for an optimal hyperplane that best fits the training data. However, unlike SVM classification, where the goal is to find a hyperplane that separates classes efficiently, SVR seeks a hyperplane that optimizes the prediction of continuous values.
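A sketch of fitting SVR with its default hyperparameters (RBF kernel, C=1.0, epsilon=0.1); because SVR predicts a single output, the wrapper shown for the multi-day scenarios (C-2 … L-32) is an assumption, not the paper's stated approach:

```python
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# One SVR per output day, wrapped so that multi-step targets can be fitted.
svr = MultiOutputRegressor(SVR())
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
```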
2.3.3. Random Forest
In the RF model, the RandomForestRegressor function of the Sklearn library was used, for which the list of hyperparameters with their respective types of variables and default values is provided in
Table 6.
The RandomForestRegressor machine learning model is based on the construction of an ensemble of decision trees, which are the fundamental component of RF. Each tree is trained independently using random sampling of both the dataset instances and the features considered at each node split. This random approach contributes to the diversity among the trees and, consequently, to the robustness of the model.
During the training process, each tree makes individual predictions for the dataset instances according to the decisions made in their structures. The final RandomForestRegressor prediction is obtained by averaging these predictions, resulting in a more stable estimate that is less susceptible to overfitting.
Overfitting is a common phenomenon in machine learning in which a model over-adapts to the specific details of the training data, losing the ability to generalize to new data. This occurs when the model is too complex relative to the inherent complexity of the data, capturing irrelevant patterns, noise, or specific variations of the training set.
As a result, the model’s performance may be excellent on the training data but may fail to tackle new data, hindering its ability to make accurate predictions in real-world situations.
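A sketch of fitting RandomForestRegressor with its default hyperparameters (100 trees in current scikit-learn releases); unlike SVR, it handles multi-output regression natively:

```python
from sklearn.ensemble import RandomForestRegressor

# Default hyperparameters; predictions are the average over all trees.
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
```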
2.3.4. Artificial Neural Network
In the ANN model, the MLPRegressor function from the Sklearn library was used, for which the list of hyperparameters with their respective types of variables and default values is provided in
Table 7.
The MLPRegressor machine learning model belongs to the category of ANNs known as MLPs. This model is specifically designed for regression tasks as it is able to perform predictions of numerical values based on input data.
The fundamental structure of MLPRegressor is composed of multiple layers of neurons, with each layer connected to its adjacent layers. This architecture allows the model to capture complex relationships between input and output variables. Unlike simple linear models, MLPRegressor is able to learn nonlinear patterns in data.
During training, MLPRegressor uses an iterative process known as backpropagation. This process involves forward-passing inputs through the network to generate predictions and then comparing those predictions with the actual values to calculate the error. The error is then backpropagated through the network, and the weights of the connections between neurons are adjusted to minimize the error in the next iteration.
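A sketch of fitting MLPRegressor with its default hyperparameters (one hidden layer of 100 neurons, ReLU activation, Adam solver); it also handles multi-output regression natively:

```python
from sklearn.neural_network import MLPRegressor

# Default hyperparameters; weights are adjusted by backpropagation.
ann = MLPRegressor()
ann.fit(X_train, y_train)
y_pred = ann.predict(X_test)
```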
2.4. Model Evaluation
The models were evaluated for their performance by means of a graphical analysis and statistical criteria. The purpose of the evaluation was to verify the quality of the calibration and validation results by comparing the flow data simulated by the models with the actual observed data.
The graphical analysis adopted sought to verify how the agreement between observed and predicted data evolved as flow increased and as the number of prediction days increased across the models.
The statistical criteria adopted were the Nash–Sutcliffe model efficiency coefficient (NSE) [18], the percentage of bias (PBIAS), the coefficient of determination (R²) and the RMSE–observations standard deviation ratio (RSR). The NSE ranges from −∞ to 1, with 1 representing the optimal value. Values between 0 and 1 are regarded as acceptable performance levels, whereas values ≤ 0 indicate that the observed mean is a better predictor than the simulated value, i.e., poor model performance.
The optimal value of the PBIAS is 0, and positive or negative values with low magnitudes represent good performance. Positive values indicate that the model underestimated measured values, while negative values indicate that the model overestimated measured values.
The R² estimates the correlation between the measured and simulated values and ranges from 0 to 1, with a value of 1 representing perfect agreement. The RSR ranges from the optimal value of 0, which indicates that the root mean square error (RMSE) is zero, to a large positive value. Therefore, the lower the RSR, the lower the RMSE and the better the simulation performance.
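For reference, these criteria take their standard forms, with $O_i$ the observed values, $S_i$ the simulated values and $\bar{O}$ the mean of the observed values:

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}(O_i - S_i)^2}{\sum_{i=1}^{n}(O_i - \bar{O})^2}, \qquad \mathrm{PBIAS} = 100 \cdot \frac{\sum_{i=1}^{n}(O_i - S_i)}{\sum_{i=1}^{n} O_i}, \qquad \mathrm{RSR} = \frac{\sqrt{\sum_{i=1}^{n}(O_i - S_i)^2}}{\sqrt{\sum_{i=1}^{n}(O_i - \bar{O})^2}}$$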
In the quantitative analysis, the classification of Moriasi et al. [19], presented in Table 8, was used. This classification was developed for the evaluation of monthly models; therefore, for simulations with daily values, NSE values above 0.36 can still be considered satisfactory.
Graphical analyses were performed using scatter plots for all models and graphs of readings over time for models C-1 and L-32. The graph of readings over time for the L-32 model was created considering only the first, fifteenth and last day of each pair, given that values are repeated across pairs. Both graphs depict the relationships between two variables visually.
The scatter plot displays observations with the independent variable on the x-axis and the dependent variable on the y-axis. The pattern of the points offers insights into the nature of the relationship, indicating whether it follows a linear trend and how well the model fits. The second graph shows readings over time, with each point comparing dates on the x-axis to flows, measured at the fluviometric station and predicted by the model, on the y-axis. The closer the two curves, the better the model fit. This graphical representation allows for the identification of difficulties in predicting highs or lows in the series and the detection of any lag between the curves.