1. Introduction
Although freshwater scarcity is a global problem, solutions must be formulated locally in order to understand the connections between water supply and demand and to respond adequately to local water shortages. Water shortages are being exacerbated by human activities, as manifested by population growth and the impacts of land-use changes driven by urbanization, agricultural activities, industrialization, and economic development. The cumulative effects of the intensification of land-use activities and climate change continue to introduce uncertainty into the availability of water resources, and result in intensified manipulation of the surface and groundwater hydrological regimes [1].
The monitoring of dam water levels is important not only to ensure efficient dam operations, but also for applications related to the integration of reservoir management schemes, identifying the main factors that influence dam water level variability, determining the impacts of global climate change on catchment hydrological systems, and ensuring sufficient freshwater supply [2,3]. In addition, the accurate monitoring and prediction of dam water levels is important because it relates to parameters such as inflows into the dams, dam water storage, water release from the dam reservoirs, evaporation, and infiltration. These parameters constitute the dam reservoir uncertainties and are important in dam operations and modelling.
Reliable models are required for the simulation, prediction and forecasting of dam water levels [4]. However, the variability of dam water levels results from complex non-linear processes involving factors such as precipitation, evaporation, discharge from tributaries, topographic structure, and land use. These influences become more complicated when the dam has several water supply sources, e.g., precipitation, rivers, wellfields and supplies from other dams. As such, the reliable and accurate prediction of dam water levels is challenging for hydrologists and water resource managers.
Numerous techniques have been developed to solve hydrological time series simulation and prediction problems. These include hydrodynamic models (e.g., MIKE21, CHAM and EFDC), time-series models such as ARMA and ARIMA, and soft computing approaches, e.g., Artificial Neural Networks (ANNs), Support Vector Regression (SVR), and model trees [5,6,7]. While hydrodynamic models have proven to be superior in simulating water levels, accurate and reliable predictions require detailed and calibrated data as well as complex boundary conditions and parameters as inputs, and the models are computationally expensive to implement [8,9,10,11].
To improve the prediction and forecasting of water levels under data scarcity, soft computing techniques have been recommended [12,13] because of their ability to capture complex and non-linear input-output relationships with no explicit knowledge of the physical processes [12,14]. Further, machine learning (ML) models have been considered because they can efficiently represent the complex non-linear relationships in temporally dynamic systems, which are not normally addressed in traditional mathematical models [13]. Additionally, ML models can handle large spatio-temporal datasets, offering scalability, multi-dimensionality, flexibility, efficiency and accuracy. As such, they can capture not only the primary exogenous parameters that influence dam water level variability, such as catchment land-use and land-cover, watershed characteristics, hydrological variables and climate factors, but also secondary factors, including reservoir operational decisions.
In recent decades, numerous machine learning models have been proposed and compared for predicting dam water levels. In [15], the support vector machine (SVM) and the adaptive network-based fuzzy inference system (ANFIS) were compared for forecasting daily reservoir water levels at the Klang Gate in Malaysia, with the conclusion that SVM was superior to the ANFIS model. For the Kenyir Dam, also in Malaysia, Reference [16] compared supervised Boosted Decision Tree Regression (BDTR), Decision Forest Regression (DFR), Bayesian Linear Regression (BLR) and Neural Network Regression (NNR), and showed that the BLR and the tree-based BDTR models were more accurate in predicting the reservoir water levels. Using wavelet decomposition with ANN and ANFIS, Reference [17] demonstrated that the hybrid WANN and WANFIS models were more suitable for predicting daily reservoir water levels. In addition, previous research [18] predicted the water level variability in the Chahnimeh reservoirs in Zabol from evaporation, wind speed and daily average temperature using ANN, ANFIS and Cuckoo optimization algorithms, and the results indicated that ANFIS was the best algorithm. For short-term reservoir water level predictions in Yaojiang, China, Reference [19] compared ANN, SVM and ANFIS, with the results showing that the three models had advantages in using all the predictor datasets, avoiding noisy information with lags of inputs, and detecting the peaks under extreme conditions, respectively.
Furthermore, Reference [20] predicted and estimated the daily reservoir levels for the Millers Ferry dam on the Alabama River using ANFIS, SVM, Radial Basis Neural Networks (RBNN), and Generalized Regression Neural Networks (GRNN), in comparison with the ARMA and Multilinear Regression (MLR) methods. The study concluded that, for the best input combinations, ANFIS produced better results. For the prediction of inflows into the Soyang River dam in South Korea, Reference [21] showed that combined ensemble forecasts using Random Forest (RF) and Gradient Boost Trees (GT) with Multilayer Perceptron (MLP) could give better results than the individual models. In predicting the water levels in the Upo wetland in South Korea, Reference [22] also concluded that the tree-based RF regression had the best prediction accuracy compared with ANN, decision trees (DT), and SVM. In addition, Reference [23] showed that MLR and M5P not only had higher accuracy than k-NN and ANN but were also faster to train than the Advanced Hydrologic Prediction System (AHPS).
Despite the accurate prediction results, which varied with the case study and the machine learning model used, some of the ML-based prediction algorithms have limitations. For example, ANN and ANFIS have the disadvantage of producing different results depending on the system complexity and the available data [19]. Some algorithms also tend to have low and unstable convergence rates, some tend to fall into the local optimum trap, and others require high computational time [24]. In addition to these drawbacks, most studies did not include a baseline model when evaluating the competing forecasting models. A baseline is particularly important for gauging the relative performance of the ML models, as it allows the results to be contextualized in relation to model complexity [25]. Further, most of the previous investigations tended to input all the exogenous predictor variables into the prediction without evaluating their significance and impact on model performance, on the assumption that the inclusion of additional variables improves the prediction accuracy [26].
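The role of a baseline in contextualizing model performance can be illustrated with a minimal sketch. It assumes a naive persistence forecast (last month's level) as the baseline, which is an illustrative choice and not the baseline used in this study (the study uses MLR). The level values, the candidate model output, and the function names `mape`, `r_squared` and `persistence_forecast` are all hypothetical.

```python
def mape(observed, predicted):
    """Mean absolute percentage error in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def persistence_forecast(series):
    """Naive baseline: each month's forecast is the previous month's observation."""
    return series[:-1]


# Hypothetical monthly dam levels (percent of capacity), not data from the study.
levels = [80.0, 82.0, 85.0, 83.0, 78.0, 75.0]
observed = levels[1:]
baseline_pred = persistence_forecast(levels)
# Hypothetical output of a candidate ML model for the same months.
model_pred = [81.5, 84.0, 84.0, 79.0, 76.0]

baseline_mape = mape(observed, baseline_pred)
model_mape = mape(observed, model_pred)
# Skill > 0 means the model beats the baseline; skill <= 0 means it does not.
skill = 1.0 - model_mape / baseline_mape

print(f"baseline: MAPE = {baseline_mape:.2f}%, R2 = {r_squared(observed, baseline_pred):.3f}")
print(f"model:    MAPE = {model_mape:.2f}%, R2 = {r_squared(observed, model_pred):.3f}")
print(f"MAPE skill relative to baseline = {skill:.2f}")
```

Reporting a skill score of this kind alongside the raw metrics shows how much of a complex model's accuracy is attributable to the model rather than to the persistence already present in the series.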
The drawbacks of previous studies on the prediction of dam water levels can be summarized as follows: (1) only a few studies have focused on the optimization of machine learning and stochastic models and their integration for the prediction of dam water levels; (2) most of the related studies focused on dam water level forecasting as influenced by flood stages and different reservoirs, rather than on dam water capacity predictions; and (3) the studies utilized few variables in dam water level forecasting, with the dam water level as the dependent variable and only rainfall and the dam water level itself as the independent variables.
To determine a suitable model for predicting the water levels in Botswana’s Limpopo River Basin from 2001 to 2019, this study uses the Gaborone dam and the Bokaa dam as case studies. To address the drawbacks in the previous studies, the aims of the current study are: (1) to determine the optimal machine learning model for the accurate prediction of monthly dam water levels by comparing the parametric Multivariate Linear Regression (MLR) as the baseline model, the stochastic Vector AutoRegressive (VAR) model, the ensemble Random Forest Regression (RFR) and the Multilayer Perceptron Neural Network (MLP-ANN); (2) to evaluate the effectiveness of the algorithms in learning and predicting the temporal trends in the dam water levels by comparing the performances of the optimized models; (3) to determine the significance of climate factors (rainfall and temperature), climate indices (Southern Oscillation Index (SOI), Niño 3.4, Aridity Index (AI) and Darwin Sea Level Pressure (DSLP)), and land-use land-cover (LULC) comprising built-up, cropland, water, forest, shrubland, grassland and bare-land, in the prediction of the dam water levels in the two dams; and (4) to derive the optimal model approach(es) for predicting the variability of dam water levels in the two dams. The main contribution of this work is the derivation of a hybrid model that combines stochastic and machine learning models for the accurate prediction of dam water levels in the two dams by integrating the LULC and the climate conditions within the dam catchments.
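The idea of combining a linear stochastic model with a non-linear learner can be sketched with a two-stage residual-correction scheme. The hybrid construction is not detailed in this section, so everything below is an assumption made for illustration: a univariate AR(1) fit stands in for the VAR stage, a binned-mean learner stands in for the ANN stage, and the rainfall values, bin edges and level series are synthetic. The names `fit_ar1`, `bin_index` and `fit_binned_residuals` are hypothetical.

```python
def fit_ar1(levels):
    """Ordinary least squares fit of level[t] = a + b * level[t-1]."""
    x, y = levels[:-1], levels[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def bin_index(value, edges):
    """Index of the bin containing value, given ascending bin edges."""
    return sum(value >= e for e in edges)


def fit_binned_residuals(predictor, residuals, edges):
    """Non-linear stand-in for the ANN stage: mean residual per predictor bin."""
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for p, r in zip(predictor, residuals):
        k = bin_index(p, edges)
        sums[k] += r
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def sse(observed, predicted):
    """Sum of squared errors."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


# Synthetic monthly rainfall (mm) and a synthetic level series in which
# rainfall contributes only above a 50 mm threshold (a non-linear effect).
rain = [0.0, 10.0, 40.0, 90.0, 20.0, 0.0, 60.0, 100.0, 5.0, 30.0, 80.0, 15.0]
levels = [60.0]
for r in rain:
    runoff = 0.0 if r < 50.0 else 0.2 * (r - 50.0)
    levels.append(0.8 * levels[-1] + 10.0 + runoff)
observed = levels[1:]

# Stage 1: linear autoregressive prediction from the previous level.
a, b = fit_ar1(levels)
linear_pred = [a + b * prev for prev in levels[:-1]]

# Stage 2: model what the linear stage missed as a non-linear function of rainfall.
residuals = [o - p for o, p in zip(observed, linear_pred)]
edges = [25.0, 50.0, 75.0]  # illustrative rainfall bin edges (mm)
bin_means = fit_binned_residuals(rain, residuals, edges)
hybrid_pred = [lp + bin_means[bin_index(r, edges)] for lp, r in zip(linear_pred, rain)]

linear_sse = sse(observed, linear_pred)
hybrid_sse = sse(observed, hybrid_pred)
print(f"in-sample SSE: linear = {linear_sse:.3f}, hybrid = {hybrid_sse:.3f}")
```

With this construction the in-sample error of the hybrid cannot exceed that of the linear stage, because each bin mean minimizes the squared residuals within its bin. That guarantee says nothing about out-of-sample skill, which would need a held-out evaluation period.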
5. Conclusions
Under the influence of climate change and the intensification of land-use activities, understanding dam water capacity variations is important for planning dam water supply regimes and management. In the present study, dam water level observations in the Bokaa dam and Gaborone dam, in semi-arid Botswana, were simulated and predicted using the Multivariate Linear Regression (MLR) and stochastic Vector AutoRegression (VAR) models, along with the Random Forest Regression (RFR) and Multilayer Perceptron Neural Network (MLP-ANN) techniques. Using LULC, climate factors (rainfall and temperature) and climate indices (DSLP, Aridity Index (AI), SOI and Niño 3.4) as the dam water predictor variables, the results show that the stochastic VAR was able to detect the variation of LULC with dam water levels better than MLR, RFR and MLP-ANN, while RFR and MLP-ANN captured the relationships with the climate conditions, with MLP-ANN performing better than RFR. The stochastic VAR was not able to correlate rainfall and temperature with the dam water levels, except when they were integrated with the four climate indices. RFR and MLP-ANN gave their highest dam water level prediction accuracy using rainfall, temperature, and the climate indices. MLP-ANN gave the best prediction results for the dam water level fluctuations for both dams, with the Gaborone dam predictions being more accurate than those for the Bokaa dam in terms of R², but slightly less accurate in terms of MAPE. The higher MAPE for the Gaborone dam confirmed that the dam does not rely entirely on precipitation, but also on conjunctive water sources, including periodic direct supply from the Bokaa dam and wellfields. The proposed VAR-ANN hybrid model improved the prediction accuracy of the dam water levels for both dams by integrating the linear and non-linear variabilities in the predictor datasets and the dam water levels.
To improve on the current study, two extensions are suggested: first, the temporal resolution of the LULC data should be increased to annual intervals in order to accurately capture the seasonal variabilities in the LULC; secondly, the contributions of water sources from wellfields and other dams should be incorporated into the prediction modelling. To address the low convergence in the simulation and prediction of the dam water levels, faster and hybrid tree-based machine learning algorithms are recommended for further investigation.