Dam Water Level Prediction Using Vector AutoRegression, Random Forest Regression and MLP-ANN Models Based on Land-Use and Climate Factors

Ouma, Yashon O.; Moalafhi, Ditiro B.; Anderson, George; Nkwae, Boipuso; Odirile, Phillimon; Parida, Bhagabat P.; Qi, Jiaguo

doi:10.3390/su142214934


by Yashon O. Ouma ¹,*, Ditiro B. Moalafhi ², George Anderson ³, Boipuso Nkwae ¹, Phillimon Odirile ¹, Bhagabat P. Parida ⁴ and Jiaguo Qi ⁵

¹ Department of Civil Engineering, University of Botswana, Private Bag UB 0061, Gaborone, Botswana

² Faculty of Natural Resources, Botswana University of Agriculture and Natural Resources (BUAN), Private Bag 0027, Gaborone, Botswana

³ Department of Computer Science, University of Botswana, Private Bag UB 0061, Gaborone, Botswana

⁴ Department of Civil and Environmental Engineering, Botswana International University of Science and Technology (BIUST), Private Bag 16, Palapye, Botswana

⁵ Center for Global Change and Earth Observations, Michigan State University, East Lansing, MI 48824, USA

* Author to whom correspondence should be addressed.

Sustainability 2022, 14(22), 14934; https://doi.org/10.3390/su142214934

Submission received: 6 October 2022 / Revised: 28 October 2022 / Accepted: 4 November 2022 / Published: 11 November 2022


Abstract:

To predict the variability of dam water levels, parametric Multivariate Linear Regression (MLR), stochastic Vector AutoRegressive (VAR), Random Forest Regression (RFR) and Multilayer Perceptron (MLP) Artificial Neural Network (ANN) models were compared based on the influences of climate factors (rainfall and temperature), climate indices (DSLP, Aridity Index (AI), SOI and Niño 3.4) and land-use land-cover (LULC) as the predictor variables. For the case study of the Gaborone dam and the Bokaa dam in the semi-arid Botswana, from 2001 to 2019, the prediction results showed that the linear MLR is not robust for predicting the complex non-linear variabilities of the dam water levels with the predictor variables. The stochastic VAR detected the relationship between LULC and the dam water levels with R² > 0.95; however, it was unable to sufficiently capture the influence of climate factors on the dam water levels. RFR and MLP-ANN showed significant correlations between the dam water levels and the climate factors and climate indices, with a higher R² value between 0.890 and 0.926, for the Gaborone dam, compared to 0.704–0.865 for the Bokaa dam. Using LULC for dam water predictions, RFR performed better than MLP-ANN, with higher accuracy results for the Bokaa dam. Based on the climate factors and climate indices, MLP-ANN provided the best prediction results for the dam water levels for both dams. To improve the prediction results, a VAR-ANN hybrid model was found to be more suitable for integrating LULC and the climate conditions and in predicting the variability of the linear and non-linear time-series components of the dam water levels for both dams.

Keywords:

Bokaa and Gaborone dams (Botswana); dam water levels; land-use land-cover; climate change; multivariate linear regression; Vector AutoRegressive (VAR); Random Forest Regression; Multilayer Perceptron ANN; VAR-Neural Network hybrid model

1. Introduction

Despite freshwater scarcity being a global problem, the solutions must be locally formulated in order to understand the connections between water supply and demand, and to adequately respond to the local water shortages. Water shortages are being exacerbated by human activities, as manifested by population growth and the impacts of land-use changes driven by urbanization, agricultural activities, industrialization, and economic development. The cumulative effects of the intensification of land-use activities and climate change continue to pose uncertainty on the availability of water resources, with the effect of intensified manipulation of the surface and groundwater hydrological regimes [1].

The monitoring of dam water levels is important, not only to ensure efficient dam operations, but also for applications related to the integration of reservoir management schemes, identifying the main factors that influence the dam water level variabilities, determining the impacts of global climate changes on catchment hydrological systems, and ensuring sufficient freshwater supply [2,3]. In addition, the accurate monitoring and prediction of dam water level is important as it relates to the parameters such as inflows into the dams, dam water storage and water release from the dam reservoirs, evaporation, and infiltration. These parameters constitute the dam reservoir uncertainties and are important in dam operations and modelling.

For the simulation, prediction and forecasting of the dam water levels, reliable models are required [4]. However, the variability of dam water levels results from complex non-linear processes, which include factors such as precipitation, evaporation, discharge from tributaries, topographic structure, land use, etc. These influences are more complicated when the dam has various water supply sources, e.g., precipitation, rivers, wellfields and supplies from other dams. As such, reliable and accurate prediction of dam water levels is challenging for hydrologists and water resource managers.

To solve the hydrological time series simulation and prediction problems, numerous techniques have been developed. Such models include the hydrodynamic models (e.g., MIKE21, CHAM and EFDC), time-series models using ARMA and ARIMA, and soft computing approaches, e.g., Artificial Neural Networks (ANNs), Support Vector Regression (SVR), and model trees [5,6,7]. While the hydrodynamic models have proven to be superior in simulating water levels, for accurate and reliable predictions, they require detailed and calibrated data, complex boundary conditions and parameters as input data, and are computationally expensive to implement [8,9,10,11].

To improve the prediction and forecasting of water levels under data scarcity, soft computing techniques have been recommended [12,13] because of their ability to capture complex and non-linear input-output relationships with no explicit knowledge of the physical processes [12,14]. Further, machine learning (ML) models have been considered as they can efficiently represent the complex non-linear relationships in the temporally dynamic system, which are not normally addressed in traditional mathematical models [13]. Additionally, machine learning models can deal with large spatial-temporal data in terms of scalability, multi-dimensionality, flexibility, efficiency and accuracy. As such, they can capture not only the primary exogenous parameters that influence the dam water level variabilities, such as the catchment land-use and land-cover, watershed characteristics, hydrological variables and climate factors but also the secondary factors, including the reservoir operational decisions.

In recent decades, numerous machine learning models have been proposed and compared for predicting dam water levels. In [15], the support vector machine (SVM) and the adaptive network-based fuzzy inference system (ANFIS) were compared for forecasting daily reservoir water levels at Klang Gate in Malaysia, with the conclusion that SVM was superior to the ANFIS model. For the Kenyir Dam in Malaysia, Reference [16] compared supervised Boosted Decision Tree Regression (BDTR), Decision Forest Regression (DFR), Bayesian Linear Regression (BLR) and Neural Network Regression (NNR) and showed that the BLR and BDTR models were more accurate in predicting the reservoir water levels. Using wavelet decomposition with ANN and ANFIS, Reference [17] demonstrated that the hybrid WANN and WANFIS models were more suitable for predicting daily reservoir water levels. In addition, previous research [18] predicted the water level variabilities in the Chahnimeh reservoirs in Zabol based on evaporation, wind speed and daily average temperature using ANN, ANFIS and Cuckoo optimization algorithms, and the results indicated that ANFIS was the best algorithm. For short-term reservoir water level predictions in Yaojiang, China, Reference [19] compared ANN, SVM and ANFIS, with the results showing that the three models had respective advantages in using all the predictor datasets, avoiding noisy information with lags of inputs, and detecting the peaks under extreme conditions.

Furthermore, Reference [20] predicted and estimated the daily reservoir levels for the Millers Ferry dam on the Alabama River using ANFIS, SVM, Radial Basis Neural Networks (RBNN), and Generalized Regression Neural Networks (GRNN) in comparison with the ARMA and Multilinear Regression (MLR) methods. The study concluded that, for the best input combinations, ANFIS produced better results. For the prediction of inflows into the Soyang River dam in South Korea, Reference [21] showed that, instead of individual models, combined ensemble forecasts using Random Forest (RF) and Gradient Boost Trees (GT) with Multilayer Perceptron (MLP) could give better results. In predicting the water levels in the Upo wetland in South Korea, Reference [22] also concluded that the RF regression tree-based ML model had the best prediction accuracy compared with ANN, decision trees (DT), and SVM. In addition, Reference [23] showed that MLR and M5P not only had higher accuracy than k-NN and ANN but were also faster to train than the Advanced Hydrologic Prediction System (AHPS).

Despite the accurate prediction results, which also varied according to the case studies and different machine learning models, there are also limitations with some of the machine learning based prediction algorithms. For example, ANN and ANFIS have shown the disadvantage of presenting different results that depend on the system complexity and the available data [19]. Some algorithms also tend to have low and unstable convergence rates, and some tend to fall into the local optimum trap, and other algorithms require high computational time [24]. In addition to these drawbacks, most implemented studies did not apply baseline evaluation methods in forecasting competition evaluation. This is particularly important in gauging the relative performance of the ML models to allow for better contextualization of the results in relation to the complexity between the models [25]. Further, most of the previous investigations tended to input all the exogenous predictor variables in the prediction without significance and impact evaluations on the performance of the models, with the assumption that the inclusion of additional variables improves the model prediction accuracy [26].

From previous studies, the following is a summary of the drawbacks in the prediction of dam water levels: (1) only a few studies have focused on the optimization of machine learning and stochastic models and their integration for the prediction of dam water levels; (2) most of the related studies focused on dam water level forecasting, as influenced by flood stages and different reservoirs rather than on the dam water capacity predictions, and (3) the studies utilized few variables in dam water level forecasting, with the dependent variable as dam water level, and only rainfall and dam water itself as the independent variables.

To determine a suitable model for predicting the water levels in Botswana’s Limpopo River Basin from 2001 to 2019, this study evaluates the results of the case study of the Gaborone dam and the Bokaa dam. To improve on the drawbacks in the previous studies, the aims of the current study are: (1) to determine the optimal machine learning model for the accurate prediction of monthly dam water levels by comparing the parametric Multivariate Linear Regression (MLR) as the baseline model, stochastic Vector AutoRegressive (VAR), ensemble Random Forest Regression (RFR) and Multilayer Perceptron (MLP) Neural Network (MLP-ANN); (2) to evaluate the effectiveness of the algorithms in learning and predicting the temporal trends in the dam water levels by comparing the performances of the optimized models; (3) to determine the significance of climate factors (rainfall and temperature), climate indices (Southern Oscillation Index (SOI), Niño 3.4, Aridity Index (AI) and Darwin Sea Level Pressure (DSLP)), and land-use land-cover (comprising built-up, cropland, water, forest, shrubland, grassland and bare-land) in the prediction of the dam water levels in the two dams; and (4) to derive the optimal model approach(es) for predicting the variability of dam water levels in the two dams. The main contribution of this work is the derivation of a hybrid model capable of combining stochastic and machine learning models for the accurate prediction of dam water levels in the two dams by integrating the LULC and the climate conditions within the dam catchments.

2. Materials and Methods

2.1. Study Area

The study area is located within Botswana’s Limpopo River Basin (BLRB). The larger Limpopo River Basin is a transboundary basin that covers an area of approximately 416,300 km² and straddles four southern African countries: South Africa (45%), Botswana (19%), Mozambique (21%), and Zimbabwe (15%). The basin is home to more than 18 million people, and Botswana has the highest percentage (61%) of its population living in the basin. As shown in Figure 1, the semi-arid Botswana relies on the following small-to-medium-sized dams, which are located within the BLRB: Gaborone (141.4 MCM); Letsibogo (100 MCM); Shashe (85 MCM); Dikgatlhong (400 MCM); Bokaa (18.5 MCM); Lotsane (42.35 MCM); Ntimbale (26.5 MCM), and Thune (90 MCM). The case study dams are the Bokaa dam and the Gaborone dam, located in the southern part of the BLRB (Figure 1). The two dams are approximately 40 km apart.

With the general scarcity of freshwater in the arid and semi-arid regions, water management problems tend to worsen, especially during extreme hydrological events, such as drought. For this reason, and to optimally manage the dam operations, continuous and accurate reservoir management schemes are essential, including predictions of the variability of the dam water capacities and determination of the influences of natural climatic phenomena and anthropogenic activities on the water resources. In most regions, predicting and forecasting dam water capacities is still challenging for water resource operators and managers. This is because, although reservoir water levels are directly regulated by the inflows and outflow releases, there are several uncertainties in the variables that determine the dam water levels, such as the temporal dynamics of climatic factors (e.g., rainfall and temperature) and the dam operation and management regimes, all of which are complex to model.

2.2. Data

2.2.1. Land-Use and Land-Cover (LULC)

For the multitemporal LULC classification, Landsat series data from Landsat 4 (L4-MSS), Landsat 5 (L5-TM), and Landsat 7 (L7-ETM+), acquired from 1986 to 2020, were used. Using the FLAASH (Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes) atmospheric correction algorithm and the Landsat rescaling coefficients, the multitemporal Landsat images were corrected to generate the surface reflectance imagery.

The LULC classification was carried out using Breiman’s random forest algorithm [28] and was implemented within the Google Earth Engine, as detailed in [29]. The mean, variance, and contrast gray-level co-occurrence matrix (GLCM) texture features were found to be the most significant and were included in the classification scheme to improve the classification accuracy. The LULC classification accuracy metrics are presented in Table 1, and the LULC area coverages are summarized in Table 2 for the Bokaa and Gaborone dam catchments. From the results, the Bokaa dam catchment occupies an area of approximately 3610 km² and the Gaborone catchment approximately 4344 km².

From the classification error matrix, the overall accuracy (OA) is determined from the ratio of the correctly classified pixels to the total number of sample pixels. Further, for each class, the User’s Accuracy (UA) is the ratio of the correctly classified pixels to all pixels assigned to that class by the classifier, while the Producer’s Accuracy (PA) is the ratio of the correctly classified pixels to all reference pixels of that class. For each year, the average OA, UA and PA are presented in Table 1. The results in Table 1 show that for both dam catchments, the LULC classification accuracies, as measured using the PA, UA and OA metrics, were higher than 80%, and the corresponding Kappa Index ranged between 0.75 and 0.87. The accuracy measures demonstrate that the LULC was derived with a high degree of accuracy for both dam catchments.
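As an illustration of how these accuracy measures follow from an error matrix, the sketch below computes OA, per-class UA and PA, and Cohen's kappa in plain Python. The function name `accuracy_metrics`, the matrix orientation (rows as reference classes, columns as mapped classes) and the three-class counts are assumptions made for the example; they are not taken from Table 1.

```python
def accuracy_metrics(matrix):
    """Return (OA, UA per class, PA per class, kappa) from an error matrix.

    Assumed layout: matrix[i][j] is the number of samples whose reference
    class is i and whose mapped (classified) class is j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    diag = [matrix[i][i] for i in range(k)]
    ref_tot = [sum(matrix[i]) for i in range(k)]                       # row totals
    map_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # column totals

    oa = sum(diag) / total
    pa = [diag[i] / ref_tot[i] for i in range(k)]   # producer's accuracy
    ua = [diag[j] / map_tot[j] for j in range(k)]   # user's accuracy
    # Chance agreement expected from the marginal totals
    pe = sum(ref_tot[i] * map_tot[i] for i in range(k)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, ua, pa, kappa


# Hypothetical three-class error matrix (illustrative counts only)
cm = [[50, 5, 5],
      [4, 40, 6],
      [6, 5, 79]]
oa, ua, pa, kappa = accuracy_metrics(cm)
print(f"OA={oa:.3f}, kappa={kappa:.3f}")
```

With these made-up counts, 169 of the 200 samples lie on the diagonal, so OA is 0.845 and kappa is roughly 0.76.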

In Table 2, it is observed that for both catchments, the built-up areas are increasing exponentially, while the vegetation and bare soil-covered areas increased and decreased, interchangeably, either due to activities in croplands or due to climate influences. Tree cover within the catchments is also observed to be increasing in coverage, while shrubland has decreased in extent over the years.

2.2.2. Climate Data

(1) Rainfall and Temperature

Monthly rainfall data from the Gaborone gauge station were used for both the Bokaa dam and Gaborone dam catchments because of their geographical proximity and climatic similarities, and because there is no gauge station within the Bokaa dam catchment. Figure 2 shows the observed rainfall patterns within the Bokaa and Gaborone dam catchments, and Figure 3 shows the minimum, average, and maximum temperature variabilities within the catchments. Over the 19 years of study, it is observed that the mean temperature is increasing while the amount of rainfall received in the two catchments is decreasing.

(2) Climate Indices

The climate indices considered were those that have teleconnections with rainfall over southern Africa, that is, DSLP, SOI, and Niño 3.4. The average March-June pressures at Darwin have been shown to have high positive sea level pressure (SLP) anomalies with teleconnections to droughts over southern Africa [30]. The SOI, which is the standardized sea-level pressure difference between Papeete and Darwin, is also related to rainfall over the sub-region. In addition to the three climate indices, the aridity index (AI) was derived using station rainfall and temperature data, as in Equation (1):

$$I_{i} = \frac{12 P_{i}}{T_{i} + 10} \qquad (1)$$

where $I_{i}$ = the aridity index for month $i$, $P_{i}$ = the monthly total precipitation (mm) and $T_{i}$ = mean near-surface temperature (°C).
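Equation (1) can be evaluated month by month directly from the station records. The following is a minimal sketch; the function name `aridity_index` and the sample rainfall and temperature values are hypothetical and serve only to show the arithmetic.

```python
def aridity_index(p_mm, t_c):
    """Monthly aridity index of Equation (1): I = 12 * P / (T + 10).

    p_mm: monthly total precipitation in mm
    t_c:  monthly mean near-surface temperature in degrees Celsius
    """
    if t_c <= -10.0:
        # The denominator is zero or negative; the index is not meaningful here.
        raise ValueError("Equation (1) requires T > -10 degrees Celsius")
    return 12.0 * p_mm / (t_c + 10.0)


# Hypothetical monthly values (not station data)
rain_mm = [50.0, 0.0, 12.5]
temp_c = [20.0, 15.0, 25.0]
ai = [aridity_index(p, t) for p, t in zip(rain_mm, temp_c)]
print(ai)
```

For example, a month with 50 mm of rainfall and a mean temperature of 20 °C gives 12 × 50 / 30 = 20, while a rainless month gives an index of zero regardless of temperature.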

2.2.3. Dam Reservoir Water Levels

The mean monthly dam water levels were used as the indicators of water availability in surface water storages, from 2001 to 2019, for the two dams. Figure 4a shows the variability of the dam water levels with rainfall, with the Bokaa dam exhibiting a marginally higher degree of correlation with rainfall than the Gaborone dam. The scatterplot regressions in Figure 4b depict very low correlations between the measured dam water levels and rainfall in the two catchments.

2.2.4. Data Statistics and Correlational Analysis

The summary of the mean monthly statistical descriptions of the study datasets, from 2001 to 2019, for the two dams is presented in Table 3.

In terms of the correlations presented in Figure 5, the Bokaa dam water levels exhibit the strongest, but inverse, correlation with tree cover, at −0.349, followed by maximum temperature, bare soil, grassland, average temperature and aridity index, respectively, at −0.243, 0.216, 0.175, 0.161, and 0.161. The Bokaa dam water level correlations were weakest with Niño 3.4 and DSLP, at −0.035 and −0.047, respectively. In general, the water levels in the Bokaa dam have positive but low correlations with LULC classes and low negative correlations with the climate factors. Comparatively, the Gaborone dam had higher correlations with the predictor variables (Figure 5). The strongest correlations for the Gaborone dam water levels were with grassland, water bodies, and shrubland, at 0.815, 0.761 and −0.730, respectively. The weakest correlations were with built-up, rainfall, and aridity index, at 0.013, 0.029 and 0.034, respectively. The Gaborone dam water levels display positive and higher correlations with dam surface area and grassland, but lower and negative correlations with climate factors and indices.
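Pairwise correlations of the kind reported in Figure 5 can be reproduced with a few lines of code. The sketch below assumes the Pearson coefficient (the text does not name the coefficient used) and uses made-up series; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative series only: water level (%) against a land-cover share (%)
water_level = [40.0, 55.0, 62.0, 70.0, 81.0]
tree_cover = [12.0, 11.0, 10.5, 9.0, 8.5]
r = pearson_r(water_level, tree_cover)
print(round(r, 3))  # negative: the level rises while tree cover falls
```

A negative value, as in this toy case, corresponds to the inverse relationships described above, and the magnitude indicates how closely the two series co-vary linearly.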

2.3. Methods

2.3.1. Multivariate Linear Regression (MLR)

MLR was utilized as a baseline for competition evaluation [24]. Linear regression models are simple parametric models in which the response is a linear function of the model coefficients. For small sample sizes, the parametric multilinear regression (MLR) models are able to establish the relationships between the predictor variables and the dependent variable using least squares fitting. In this study, the dam water levels depend on climate factors, climate indices, and LULC. The general MLR model is expressed as in Equation (2).

$$y_{i} = \beta_{0} + \beta_{1} x_{1i} + \beta_{2} x_{2i} + \cdots + \beta_{q} x_{qi} + \varepsilon_{i} \qquad (2)$$

where $y_{i}$ = observed dependent variable; $n$ = sample size with $i = 1, \dots, n$; $x_{1}, x_{2}, \dots, x_{q}$ = explanatory predictor variables; $x_{1i}, x_{2i}, \dots, x_{qi}$ = observed values of the predictor variables; $\varepsilon_{i}$ = residual or error for observation $i$; $\beta_{0}$ = constant; and $\beta_{1}, \beta_{2}, \dots, \beta_{q}$ = multiple regression coefficients. In Equation (2), dam water level (WL) is the dependent variable, $y$, determined by a set of predictor variables, as those in Figure 5 (RN, TMX, TMM, TMA, DSLP, AI, SOI, NINO, BUP, CL, WT, FR, SL, GL, BL).
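The least squares fit of Equation (2) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the study's implementation: the helper names `fit_mlr` and `predict_mlr` are hypothetical, and the three synthetic predictors merely stand in for the 15 variables listed above.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares estimate of [beta_0, beta_1, ..., beta_q]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


def predict_mlr(beta, X):
    """Evaluate Equation (2) without the error term."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]


# Synthetic, noise-free data so that the known coefficients are recovered
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.9, size=(60, 3))
y = 0.2 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * X[:, 2]
beta = fit_mlr(X, y)
print(np.round(beta, 3))
```

Because the synthetic response is an exact linear combination of the predictors, the fitted coefficients match the generating values; with real, non-linear water level data the residuals would indicate how much variability the linear baseline leaves unexplained.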

2.3.2. Vector AutoRegressive Model

VAR is a stochastic linear prediction model that predicts the current value of a variable based on its own previous values, while also taking into consideration the other predictor variables. Through dynamic analysis, VAR detects how changes in a particular variable affect changes in the other variables and in the lags of those variables. VAR thus extends the univariate autoregression to multiple time-series regression, with the lagged values of all series as regressors. For example, the VAR model of two variables $X_{t}$ and $Y_{t}$ ($k = 2$) with the lag order $p$ is defined as in Equations (3) and (4). The coefficients $\beta$ and $\gamma$ can be estimated using the ordinary least squares method.

$$Y_{t} = \beta_{10} + \beta_{11} Y_{t-1} + \cdots + \beta_{1p} Y_{t-p} + \gamma_{11} X_{t-1} + \cdots + \gamma_{1p} X_{t-p} + \mu_{1t} \qquad (3)$$

$$X_{t} = \beta_{20} + \beta_{21} Y_{t-1} + \cdots + \beta_{2p} Y_{t-p} + \gamma_{21} X_{t-1} + \cdots + \gamma_{2p} X_{t-p} + \mu_{2t} \qquad (4)$$

The lag order for the VAR(p) model is determined using lag-length selection criteria: VAR(p) models are fitted with orders p = 0, 1, …, p_max, and the value of p that minimizes the chosen model selection criterion is selected. The lag selection criteria in this study are Akaike’s Information Criterion (AIC_p), the Schwarz Bayesian Information Criterion (BIC_p), the Hannan-Quinn Criterion (HQC_p), and the Final Prediction Error (FPE_p). The traditional unrestricted VAR is unsuitable for non-stationary data with seasonality; therefore, this study imposed a priori differencing on the input datasets to achieve stationarity.

The implemented VAR model for dam water level time series prediction was developed with the following steps:

1. Testing for stationarity of the individual predictor variables using the augmented Dickey-Fuller (ADF) test.
2. Determining the lag for the VAR(p) model using lag-length selection. VAR(p) models are fitted with orders p = 0, 1, …, p_max, and the value of p that minimizes the model selection criteria described above is chosen. In this study, the lag orders are determined for the specific predictor variables.
3. Establishment of an optimal VAR model with appropriate lags for each parameter. For multivariate time series, the VAR model is constructed such that each variable, at a given time point, is a linear function of the recent lags of itself and of the other variables. The generalized VAR(p) form with p = 1, i.e., VAR(1), for the n = 15 predictor variables can be expressed as in Equation (5). Equation (5) is solved using ordinary least squares, where c represents the intercepts, A is the regression coefficient matrix, and e is the error in prediction at time t.

$$\begin{bmatrix} Y_{1,t} \\ \vdots \\ Y_{15,t} \end{bmatrix} = \begin{bmatrix} c_{1} \\ \vdots \\ c_{15} \end{bmatrix} + \begin{bmatrix} A_{1,1} & \cdots & A_{1,15} \\ \vdots & \ddots & \vdots \\ A_{15,1} & \cdots & A_{15,15} \end{bmatrix} \begin{bmatrix} Y_{1,t-1} \\ \vdots \\ Y_{15,t-1} \end{bmatrix} + \begin{bmatrix} e_{1,t} \\ \vdots \\ e_{15,t} \end{bmatrix} \qquad (5)$$

4. Residual autocorrelation assessment for goodness-of-fit. For the time series data, the autocorrelation of the residuals between the observed and the model-fitted values is used to determine the goodness-of-fit of the model. Accuracy assessment metrics, including R², RMSE, MAE and MAPE, are used.
5. VAR system stability assessment with the autoregressive (AR) roots graph. The VAR stability determines how well the model represents the time series over the sampling window. This is evaluated using the roots of the characteristic polynomial of the coefficient matrix A in Equation (5). If all the inverse roots (the eigenvalues of A) have a modulus of less than 1, that is, they lie inside the unit circle, the VAR model is considered stable.
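Steps 2, 3 and 5 above can be sketched compactly with NumPy: an OLS fit of a VAR(p), an AIC-style score for comparing lag orders, and a stability check on the companion matrix. The function names, the two-variable simulated series and the particular AIC form (log-determinant of the residual covariance plus a parameter penalty) are assumptions for illustration; the study's own software and settings are not stated here, and the ADF test and differencing steps are omitted.

```python
import numpy as np


def fit_var(data, p):
    """OLS fit of a VAR(p). data has shape (T, k).

    Returns the intercepts c (k,), a list of p coefficient matrices A_j (k, k)
    and the residuals (T - p, k).
    """
    data = np.asarray(data, dtype=float)
    T, k = data.shape
    # Regressor row for time t: [1, y_{t-1}, ..., y_{t-p}]
    lags = [data[p - j:T - j] for j in range(1, p + 1)]
    Z = np.column_stack([np.ones(T - p)] + lags)
    Y = data[p:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    resid = Y - Z @ B
    A = [B[1 + j * k:1 + (j + 1) * k].T for j in range(p)]
    return B[0], A, resid


def aic(resid, n_params):
    """One common AIC form: ln|Sigma| + 2 * (number of parameters) / n."""
    n = resid.shape[0]
    _, logdet = np.linalg.slogdet(resid.T @ resid / n)
    return logdet + 2.0 * n_params / n


def is_stable(A):
    """True if all eigenvalues of the companion matrix lie inside the unit circle."""
    k, p = A[0].shape[0], len(A)
    comp = np.zeros((k * p, k * p))
    comp[:k, :] = np.hstack(A)
    if p > 1:
        comp[k:, :-k] = np.eye(k * (p - 1))
    return bool(np.all(np.abs(np.linalg.eigvals(comp)) < 1.0))


# Simulated two-variable VAR(1) series (synthetic, not dam data)
rng = np.random.default_rng(42)
A_true = np.array([[0.5, 0.1], [0.2, 0.3]])
c_true = np.array([0.1, 0.2])
y = np.zeros((5000, 2))
for t in range(1, len(y)):
    y[t] = c_true + A_true @ y[t - 1] + rng.normal(scale=0.05, size=2)

scores = {}
for p in range(1, 5):
    c_hat, A_hat, resid = fit_var(y, p)
    scores[p] = aic(resid, n_params=c_hat.size + p * A_hat[0].size)
best_p = min(scores, key=scores.get)

c1, A1, _ = fit_var(y, 1)
stable = is_stable(A1)
```

For p = 1 the companion matrix is simply A, so the check reduces to the eigenvalues of the coefficient matrix in Equation (5). In practice, the same loop would be run with BIC, HQC and FPE alongside AIC, and the series would first be differenced where the ADF test indicates non-stationarity.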

2.3.3. Random Forest Regression

RFR is an ensemble learning regression model based on a decision tree algorithm [28]. The RFR principle entails randomly generating different unpruned CART decision trees, in which the decrease in node impurity (the Gini index for classification, or the variance for regression) is regarded as the splitting criterion. As a bootstrap resampling and bagging approach, bootstrap samples are drawn from the training set, and an unpruned decision tree is fitted to each bootstrap sample. At the decision tree nodes, variable selection is made on small random subsets of the predictor variables, and the best split among these predictors is used to split the node. The outputs of the trees in the forest are averaged (for regression) or voted on (for classification) to generate a robust final model. In this study, the RFR was constructed through the following steps:

1. From the original data, nTree bootstrap samples are drawn.
2. For each bootstrap dataset, a tree is grown, and for each tree node, mTry variables are randomly selected for splitting.
3. The aggregated information from the nTree trees is used for new data prediction, in this case by averaging for regression.
4. The out-of-bag (OOB) error rate is computed using the data not included in the bootstrap sample.
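The four steps above can be made concrete with a deliberately simplified sketch in plain Python. To keep it short, each "tree" is a single-split regression stump rather than an unpruned CART tree, so this illustrates the bootstrap, random feature subset, averaging and OOB logic, and is not a full random forest. All names (`fit_stump`, `bagged_stumps`, `n_tree`, `m_try`) and the toy data are assumptions for illustration.

```python
import random


def fit_stump(X, y, features):
    """Best single split over the given features (minimum sum of squared errors)."""
    best = None
    for f in features:
        for thr in sorted(set(row[f] for row in X))[:-1]:
            left = [yi for row, yi in zip(X, y) if row[f] <= thr]
            right = [yi for row, yi in zip(X, y) if row[f] > thr]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, ml, mr)
    if best is None:  # every candidate feature is constant in this sample
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, thr, ml, mr = stump
    return ml if row[f] <= thr else mr


def bagged_stumps(X, y, n_tree=50, m_try=2, seed=0):
    rng = random.Random(seed)
    n, q = len(X), len(X[0])
    models, oob_sum, oob_cnt = [], [0.0] * n, [0] * n
    for _ in range(n_tree):
        idx = [rng.randrange(n) for _ in range(n)]   # step 1: bootstrap sample
        feats = rng.sample(range(q), m_try)          # step 2: random feature subset
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        models.append(stump)
        for i in set(range(n)) - set(idx):           # step 4: out-of-bag rows
            oob_sum[i] += predict_stump(stump, X[i])
            oob_cnt[i] += 1
    errs = [(oob_sum[i] / oob_cnt[i] - y[i]) ** 2 for i in range(n) if oob_cnt[i]]
    return models, sum(errs) / len(errs)


def predict_forest(models, row):
    """Step 3: average the predictions of all trees."""
    return sum(predict_stump(m, row) for m in models) / len(models)


# Toy data: the target depends on feature 0 only; features 1 and 2 are distractors
X = [[i / 20, (7 * i % 20) / 20, (3 * i % 20) / 20] for i in range(20)]
y = [0.0 if row[0] < 0.5 else 1.0 for row in X]
models, oob_mse = bagged_stumps(X, y)
low, high = predict_forest(models, [0.1, 0.5, 0.5]), predict_forest(models, [0.9, 0.5, 0.5])
```

In a full implementation each tree would be grown to depth with a fresh random feature subset at every node; for a stump the root is the only node, so drawing the subset once per tree is equivalent.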

RFR hyperparameters were tuned to determine the optimal lag-order, epochs, number of trees (n_estimators) and max_depth for predicting the dam water levels.
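Because the lag order is treated as a tunable hyperparameter, the time series must be arranged as a supervised table of lagged predictors before a regression model is trained. The sketch below shows one plausible arrangement; whether the lagged water level itself was included among the inputs is not stated, so `make_lagged` (a hypothetical helper) lags only the supplied predictor rows.

```python
def make_lagged(predictors, target, lag):
    """Build (X, y) where X[t] concatenates predictor rows t-1, ..., t-lag.

    predictors: list of rows, one row of predictor values per time step
    target:     list of target values aligned with the predictor rows
    lag:        number of past time steps used as inputs
    """
    X, y = [], []
    for t in range(lag, len(target)):
        features = []
        for j in range(1, lag + 1):
            features.extend(predictors[t - j])  # most recent lag first
        X.append(features)
        y.append(target[t])
    return X, y


# Tiny illustrative series with a single predictor
rain = [[1.0], [2.0], [3.0], [4.0]]
level = [10.0, 20.0, 30.0, 40.0]
X_lag, y_lag = make_lagged(rain, level, lag=2)
print(X_lag, y_lag)
```

Each additional lag shortens the usable sample by one time step and widens the feature vector, which is one reason the lag order needs to be tuned together with the other hyperparameters.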

2.3.4. Multilayer Perceptron (MLP) Neural Network

MLP-ANN is one of the most popular Neural Network models, comprising input, hidden, and output layers. The advantage of MLP-ANN is that, even with a single hidden layer and an arbitrary bounded and smooth activation function, the network can approximate a continuous non-linear system. The network adopted in this study was trained using Levenberg–Marquardt backpropagation, with a gradient scheme for weight adjustment, to minimize the errors between the predicted and observed data. The MLP-ANN model was implemented following the structure and detailed steps in [31].

2.4. Performance Evaluation Metrics

Four statistical measures were used to evaluate the prediction efficiency of the models: RMSE, R², MAE, and MAPE. The metrics are represented in Equations (6)–(9), respectively, where $h_{i}^{o}$ is the observed dam water level, $h_{i}^{s}$ is the simulated or predicted dam water level, $\bar{h}^{o}$ and $\bar{h}^{s}$ are their respective means, and $n$ is the number of observations. RMSE, MAE, and $h$ are measured in % of dam water level.

$$RMSE = \left[ \sum_{i=1}^{n} \frac{(h_{i}^{s} - h_{i}^{o})^{2}}{n} \right]^{0.5} \qquad (6)$$

$$R^{2} = \frac{\left( \sum_{i=1}^{n} (h_{i}^{o} - \bar{h}^{o}) (h_{i}^{s} - \bar{h}^{s}) \right)^{2}}{\sum_{i=1}^{n} (h_{i}^{o} - \bar{h}^{o})^{2} \sum_{i=1}^{n} (h_{i}^{s} - \bar{h}^{s})^{2}} \qquad (7)$$

$$MAE = \sum_{i=1}^{n} \frac{| h_{i}^{s} - h_{i}^{o} |}{n} \qquad (8)$$

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{h_{i}^{o} - h_{i}^{s}}{h_{i}^{o}} \right| \times 100\% \qquad (9)$$
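Equations (6)–(9) translate directly into code. The sketch below is a plain-Python rendering with hypothetical function names and made-up observed and simulated levels; note that R² here is the squared correlation of Equation (7), not the coefficient of determination based on residual sums of squares.

```python
import math


def rmse(obs, sim):
    """Equation (6): root mean square error."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def r_squared(obs, sim):
    """Equation (7): squared correlation between observed and simulated values."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)


def mae(obs, sim):
    """Equation (8): mean absolute error."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)


def mape(obs, sim):
    """Equation (9): mean absolute percentage error (observed values must be non-zero)."""
    return 100.0 * sum(abs((o - s) / o) for o, s in zip(obs, sim)) / len(obs)


# Illustrative water levels in % of dam capacity (not study data)
observed = [50.0, 60.0, 70.0, 80.0]
simulated = [52.0, 58.0, 73.0, 79.0]
print(rmse(observed, simulated), r_squared(observed, simulated),
      mae(observed, simulated), mape(observed, simulated))
```

For these four values the absolute errors are 2, 2, 3 and 1, giving an MAE of 2.0 and an RMSE of about 2.12, which shows how RMSE weights the larger error more heavily than MAE does.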

2.5. Data Normalization

The input datasets were normalized to the range [0.1, 0.9] using the minimum-maximum boundary, as expressed in Equation (10). The normalization minimizes biases, as all the input data receive the same attention.

$$f : h_{i}^{o} \to 0.1 + 0.8 \left( \frac{h_{i}^{o} - h_{\min}^{o}}{h_{\max}^{o} - h_{\min}^{o}} \right) \qquad (10)$$

where $h^{o} \in R^{n}$ is the vector of input data with elements $h_{i}^{o}$, $h_{\min}^{o} = \min (h_{i}^{o})$, and $h_{\max}^{o} = \max (h_{i}^{o})$. The datasets were divided into 70% for training (April 2001–May 2014) and 30% for testing (June 2014–December 2019).
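The normalization of Equation (10) and the chronological 70/30 split can be sketched as follows. The helper names and the sample values are hypothetical. Whether the minimum and maximum were taken over the full record or the training period only is not stated; the sketch scales whatever list it is given.

```python
def minmax_scale(values, lo=0.1, hi=0.9):
    """Equation (10): map values linearly onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("cannot scale a constant series")
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]


def chronological_split(rows, train_frac=0.7):
    """Split time-ordered rows into leading training and trailing test parts."""
    cut = int(round(len(rows) * train_frac))
    return rows[:cut], rows[cut:]


# Illustrative monthly values (not study data)
levels = [0.0, 5.0, 10.0, 2.5, 7.5, 10.0, 0.0, 5.0, 2.5, 7.5]
scaled = minmax_scale(levels)
train, test = chronological_split(scaled)
print(len(train), len(test))
```

Keeping the split chronological, rather than shuffling, means the models are always evaluated on months that follow the training period, which matches the forecasting setting described above.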

The predictor parameters were organized into predefined significant input sets comprising: Set-1: Climate Indices, Rainfall and Temperatures; Set-2: Min-Avg-Max Temperatures; Set-3: All Variables; Set-4: Rainfall; Set-5: Land-Use Land-Cover (LULC); Set-6: LULC, Rainfall, Minimum and Maximum Temperatures, Climate Indices; Set-7: Rainfall, Minimum-Maximum Temperatures; and Set-8: Climate Indices.

To evaluate the relative importance of the predictor variables, backward sensitivity analysis is adopted, where the significance of each input variable is determined by stepwise variable replacement and the measure of the MAE deviation.

3. Results

3.1. Hyperparameter Tuning for the Models

3.1.1. Parameter Lag Order Determination for VAR Model

The optimal lag orders for the Gaborone and Bokaa dams were determined based on the AIC, BIC, and HQIC measures. From the summary results in Table 4, rainfall (Set-4) had the lowest AIC, BIC, and HQIC values for the Gaborone dam, at −9.493, −5.990, and −8.070, respectively. Set-7, comprising rainfall and temperature, was the second lowest, followed by Set-8, consisting of all the temperatures, and the highest values were obtained from Set-3, comprising all the parameters. For the Gaborone dam, temperature and rainfall had the highest lag orders, at 43 and 40, respectively. For the Bokaa dam (Table 4), the rainfall factor (Set-4) gave the lowest AIC, BIC, and HQIC, at −9.061, −7.732 and −8.523, respectively, and the highest lag order of 20. This is followed by Set-7, combining rainfall and temperatures, with a lag order of 12. For the Bokaa dam, the respective optimal lag orders varied between 7 and 20, with temperature having the lowest lag order, in contrast to the Gaborone dam. In Table 4, the FPE values are not included since their magnitudes were all negligible.

The VAR training results show that the contributions of rainfall and temperatures were insignificant for both dams, with R² of less than 35%. The combination of the two climate factors in Set-7 only improved the training results for the Bokaa dam water levels, but did not influence the water levels in the Gaborone dam. Both dams responded well with LULC and the four regional climate indices, with R² of between 76% and 95% (Table 4).

3.1.2. Training for RF Regression

To determine the optimal RFR tuning hyperparameters, the models were trained with 70% of the data. The training results, based on lag order, max_depth and n_estimators, are presented in Figure 6 for the Bokaa dam, and the corresponding results for the four best predictor variables are presented in Table 5. For the Gaborone dam water level simulations and predictions, the results for the RFR model tuning parameters are also presented in Figure 6, with the statistics for the best predictor variables presented in Table 5.

The RFR hyperparameter tuning results show that the water level prediction in the Bokaa dam required significantly higher lag orders than the Gaborone dam, but relatively shallower trees (max_depth) and fewer trees (n_estimators) (Table 5). The RFR training results for the best datasets depict R² > 0.82, with the exception of the Gaborone dam, where the climate indices yield R² = 0.563.
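To make the role of the lag order concrete, the sketch below shows one way a time series could be framed as a supervised learning problem before fitting a regression model. It is an illustration under stated assumptions: the text does not say whether past water levels were included among the inputs or whether the 70% training portion was taken chronologically, so both choices, and the function names, are hypothetical.

```python
def make_lagged_samples(target, predictors, lag):
    """Build (features, label) pairs from the previous `lag` time steps.

    target: list of water levels, one per time step
    predictors: list of predictor rows (one list of values per time step)
    Each sample uses predictors and past levels from steps t-lag .. t-1
    to predict the level at step t (assumed framing).
    """
    rows, labels = [], []
    for t in range(lag, len(target)):
        features = []
        for step in range(t - lag, t):
            features.extend(predictors[step])
            features.append(target[step])
        rows.append(features)
        labels.append(target[t])
    return rows, labels


def chronological_split(rows, labels, train_fraction=0.7):
    """Keep the first 70% of samples for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return (rows[:cut], labels[:cut]), (rows[cut:], labels[cut:])
```

A higher lag order widens each feature row and shortens the usable record, which is one reason the optimal lag can differ between dams with different response times.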

3.1.3. Training of MLP-ANN Model

The training of the MLP-ANN for predicting the water levels in the two dams was based on the lag order, the network number of hidden layers, epochs, and batch sizes. The tuning results for the dams are illustrated in Figure 7 and the summary statistics for the best four predictor variables are presented in Table 6.

For MLP training, low lag orders, from 1 to 4, are required to train the ANN, with the number of hidden layers varying from 2 to 4 (Table 6). The Bokaa dam required more epochs, with relatively smaller batch sizes, to train the model compared to the Gaborone dam, with the exception of the data set comprising min-avg-max temperatures for the Gaborone dam. The difference between the MLP and RFR hyperparameter tuning is that MLP-ANN detected the direct impact of rainfall (Set-4) on the Bokaa dam water level variability, while RFR only detected it indirectly, in combination with temperature (Set-7). For the Gaborone dam, RFR detected the direct impact of climate indices (Set-8); however, this was only captured indirectly for the Bokaa dam using RFR with Set-1. The RFR and MLP-ANN results indicate that the temporal variability of the dam water levels within the two catchments is influenced by the climate indices and climate factors. The impact of LULC is not directly related to the water levels but may contribute to the determination of demand and dam operation regimes. The best predictor variables in Table 6 show high training output with R² > 0.83.

The observed variable responses in the hyperparameter tuning for water levels in both dams, using RFR and MLP-ANN, shown in Figure 6 and Figure 7, respectively, are attributed to the systematic one-parameter-at-a-time tuning approach. For both models, each hyperparameter was tuned with the previously determined optimal values of the other hyperparameters held fixed, so that the final tuning step minimizes the model errors and yields the best fit, as observed in the final tuning response curves in Figure 6 and Figure 7.
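The one-parameter-at-a-time approach can be sketched as a coordinate-wise search. This is a minimal illustration, not the authors' implementation: `tune_one_at_a_time` and `toy_error` are hypothetical names, and the toy error surface stands in for a real validation error such as RMSE.

```python
def tune_one_at_a_time(grid, start, score):
    """Sweep one hyperparameter at a time, keeping the best value found so far.

    grid: dict mapping hyperparameter name -> list of candidate values
    start: dict of initial values for every hyperparameter
    score: function(params) -> error to minimise
    """
    best = dict(start)
    for name, candidates in grid.items():
        trial_scores = {}
        for value in candidates:
            trial = dict(best)
            trial[name] = value
            trial_scores[value] = score(trial)
        # Fix this hyperparameter at its best value before tuning the next one.
        best[name] = min(trial_scores, key=trial_scores.get)
    return best


def toy_error(params):
    """Hypothetical error surface with its minimum at lag=4, max_depth=6, n_estimators=200."""
    return (
        (params["lag"] - 4) ** 2
        + (params["max_depth"] - 6) ** 2
        + ((params["n_estimators"] - 200) / 100) ** 2
    )
```

This search is cheap compared with a full grid, but it can miss the global optimum when hyperparameters interact strongly, which is worth keeping in mind when reading the tuning response curves.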

3.2. Dam Water Level Prediction Results

This section presents the standardized dam water prediction results for comparison between the two dams. The RMSE, MAE, and MAPE are calculated on the inverse of Equation (9) of the standardized datasets.

3.2.1. Prediction of Dam Water Levels Using MLR

The results for the prediction of the dam water levels using MLR are presented in Table 7 and show that, for both dams, Set-3 (all variables), Set-5 (LULC), and Set-6 (LULC, Rainfall, Min and Max Temperatures) were the best predictors. For the Bokaa dam, the highest R² was 0.583, from Set-3 and Set-6, while the same sets yielded R² = 0.841 for the Gaborone dam. LULC (Set-5) had R² of 0.785 for the Gaborone dam, compared to 0.489 for the Bokaa dam. The rest of the predictor variables predicted the time-series variability of the dam water levels at less than 50% accuracy in terms of R². Since the same regression fitting equation was used for training and testing the time-series dam water levels, the MLR results were found to be similar, with very low prediction accuracy. Using the same fit for the entire 19-year data gave better results, as presented in Figure 8, and demonstrated that more robust models, at both training and testing phases, are required in the prediction of dam water levels.

Despite the good predictions using LULC for the Gaborone dam, which also influence Set-3 and Set-6, the graphical plots in Figure 8 and the large RMSE, MAE, and MAPE show that the linear MLR is not suitable for simulating and predicting the complex, seasonal, and non-linear trends exhibited by the water levels in both dams. As such, the MLR results confirm the hypothesis that more robust regression models are necessary for predicting water levels in the dams.

3.2.2. VAR Prediction of Dam Water Levels

1. Bokaa Dam Water Level Prediction Using VAR

The dam water level predictions for the Bokaa dam were based on the predetermined optimal training results for each dataset, shown in Table 4. The prediction results show that only Sets-1, -3, -5, -6, and -8 presented the highest convergence for the Bokaa dam (Table 8). Set-5, comprising the LULC classes, gave the highest R², at 0.998. The second highest was Set-6 (R² = 0.975), followed by Set-1 (R² = 0.959), Set-3 (R² = 0.928) and Set-8 (R² = 0.916). In terms of climate indices and climate factors, Set-1 (R² = 0.959; RMSE = 3.3%; MAE = 2.7%; MAPE = 14.3%) and Set-8 (R² = 0.995; RMSE = 2.7%; MAE = 2.2%; MAPE = 36.9%) gave the best results. Without the climate indices, the long-term predictions of dam water levels using temperatures (Set-2), rainfall (Set-4), and their combination show low accuracy. The rainfall and temperature sets registered the highest MAPE errors, of more than 50%. The good performance of the LULC is attributed to the interpolation within the five-year intervals, which results in minimal variability within the input data and, therefore, high accuracy.

The best results for the water level predictions in the Bokaa dam are presented in Figure 9. The prediction results and the graphical fits show that, despite having the highest performance accuracy, the predictor sets that include LULC (Set-3 and Set-6) are not the best predictor variables. This is particularly due to the inability of the model to capture the dam water levels at the beginning of the prediction when LULC is used as the predictor factor. These differences are captured within the dotted boxes in Figure 9, depicting a lack of expected trends and patterns. From the graphical and statistical analysis, the best predictor variables for the Bokaa dam water levels are Set-1 and Set-8, where Set-1 was influenced by both climate factors and climate indices.

2. Gaborone Dam Water Level Prediction Using VAR

Using the VAR model, the prediction of the Gaborone dam water levels is found to be significant using the four climate indices (Set-8), as shown in Table 8 and Figure 10 (R² = 0.929; RMSE = 0.7%; MAE = 0.6%; MAPE = 8%). The rainfall and temperature climate factors performed marginally in predicting the Gaborone dam water levels, with R² of less than 0.3 and MAPE above 20%, while their combination in Set-7 yielded higher prediction accuracy. Similarly, high prediction results were obtained using the integration of the climate indices with rainfall and temperature in Set-1. The results for the climate-based predictors are presented in Figure 10 for the Gaborone dam, with Set-3 including all parameters. By visually assessing the trends of the predictions within the dotted boxes in Figure 10, it is observed that climate indices gave the best results. However, the results show that in the absence of climate factors, LULC can be used to predict the water levels in the dams with good accuracy (R² > 0.990; RMSE < 0.7%; MAE < 0.3%; MAPE < 3.5%).

3.2.3. RFR Simulation and Prediction

1. RFR Prediction of Bokaa Dam Water Levels

The RFR prediction results for the Bokaa dam show that all the datasets are suitable for predicting the water levels, with R² > 0.8. With RFR, LULC presented the lowest prediction accuracy (R² = 0.807), and the best four predictors were Set-2 of all the temperatures, followed by Set-3, Set-7, and Set-1, with R² of 0.836, 0.829, 0.824, and 0.820, respectively. The corresponding RMSE varied from 11.3% to 12.5%, with an MAE average of approximately 7% and MAPE of approximately 13%. Figure 11 presents the predictions for the four best predictor variable sets. Set-2 and Set-7 comprise temperatures and rainfall, and their results indicate that RFR captured the relationship between the dam water levels and the climate factors (rainfall and temperature). The analysis of the prediction trends confirms Set-2 and Set-7 as the most suitable for predicting the dam water levels, as illustrated within the dotted boxes, where the predictor variables are able to capture the temporal trends of the measured dam water levels.

2. RFR Prediction of Gaborone Dam Water Levels

Using the optimal RFR hyperparameters for predicting the water levels in the Gaborone dam, Table 9 shows Sets-2, -4, -7, and -8 presenting the best results, with R² values of 0.918, 0.819, 0.898, 0.897 and 0.890, respectively. The datasets comprise temperature, rainfall, their combination, and the climate indices, respectively. The RMSE is observed to be lower for the Gaborone dam than the Bokaa dam, ranging between 9.7% and 11.4%, while the MAE averages were at 6.5% of dam water levels and MAPE is higher, at between 23% and 38%. The LULC-based prediction results show that, despite the positive correlation of more than 65%, with the dam water levels, LULC does not capture the temporal seasonality and variability of the dam water levels (Figure 12). The results in Table 9 and Figure 12 depict that RFR is able to predict the water levels in the Gaborone dam using the climate factors, with the temperatures (Set-2) being the best climate factor, followed by rainfall (Set-4). The combination of temperature and rainfall marginally reduces the influence of the predictive ability of temperatures by nearly 10%, to R² of 0.898. The climate indices (Set-8) display a significant impact on the water levels in the Gaborone dam, with R² = 0.890. The dotted box regions in Figure 12 show the inability of RFR to accurately predict the temporal trends in the Gaborone dam water levels.

3.2.4. MLP-ANN Simulation and Prediction

1. Bokaa Dam Water Level Prediction Using MLP-ANN

With the rectified linear unit (ReLU) activation function, Adam optimizer, and a learning rate of 0.0003, the results for predicting water levels in the Bokaa dam are presented in Table 10. The local temperature is linked to the dam water levels with the highest R² of 0.865, and the lowest RMSE = 10.9% and MAE = 6.5%. The combination of temperature and rainfall (Set-7) is second, with R² of 0.850, followed by rainfall (R² = 0.829). Climate indices (Set-8) also influenced the dam water levels with R² of 0.805 and the least MAPE = 13.2%. LULC had the least influence on the dam water levels, with MAPE of 27.7%, and its combination with the other parameters in Set-6 further reduced the accuracy, with MAPE = 56.6% and R² = 0.449.
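For readers unfamiliar with the training settings named above, the following sketch shows the ReLU activation and a single Adam update for one scalar weight with the stated learning rate of 0.0003. The moment-decay rates and epsilon are the commonly used Adam defaults and are assumptions here, since the text reports only the learning rate; the function names are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return x if x > 0.0 else 0.0


def adam_step(param, grad, state, lr=0.0003, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar weight.

    state: tuple (m, v, t) holding the running first and second moment
    estimates and the step counter; start from (0.0, 0.0, 0).
    beta1, beta2 and eps are assumed defaults, not values from the paper.
    """
    m, v, t = state
    t += 1
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    new_param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_param, (m, v, t)
```

On the first step the update has a magnitude close to the learning rate regardless of the gradient's scale, which is why a small value such as 0.0003 gives slow but stable progress and tends to require more epochs.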

The performances for the best predictor variables within the box window time regions in Figure 13 show that local temperature (Set-2) and rainfall (Set-4) exhibit similar and best prediction trends with MAEs of approximately 6.5% and MAPE of 25%.

2. Gaborone Dam Water Level Prediction Using MLP-ANN

For the Gaborone dam, all the predictor datasets with LULC (Sets-3, -5, and -6) did not converge to predict the dam water levels (Table 10). This further confirms the observations in MLR and RFR, where LULC recorded low correlations with dam water levels. The best performing sets in predicting dam water levels for the Gaborone dam were Set-4, rainfall (0.926), performing equally with Set-2 (0.925), then Set-7 (0.920), and Set-1 (0.917). The results show a positive response of the dam water levels to rainfall, temperature, and to the climate indices with an average low RMSE of less than 10%, R² > 0.91, and the least MAE, >5% on average. The dotted boxes in Figure 14 show the differences in the dam water predictions for the Gaborone dam. In comparison to Set-7, Sets-1, -2, and -4 present good initial estimations of the dam water level. MLP-ANN improved on the ability of RFR to detect near-linear trends, with Set-2 (temperature) presenting the best empirical and statistical predictions (Figure 14).

3.3. Relative Importance of the Predictor Variables

For the Bokaa dam, Figure 15 presents the relative importance of each variable in the predictor groups and compares all the factors. Comparing the variables, the tree-cover and shrubland exhibited the highest correlation with dam water levels (slightly more than 50% influence), followed by the max temperature and Niño 3.4. The least contributions are from bare soil, built-up and grassland, with the significance of rainfall and aridity index being negligible. The significance of the predictor variables indicates that within the Bokaa catchment, the degree of vegetation index and the regional temperature have higher correlations with the Bokaa dam water capacity. For the Gaborone dam (Figure 15), grassland and water bodies exhibit the highest significance, followed by cropland and bare soil, with the rest of the parameters contributing less than 2% each. The aridity index and rainfall are observed to have the least contributions toward predicting the Gaborone dam water levels. While grassland has negligible contributions to dam water levels in the Bokaa dam, it has the highest significance for water capacity in the Gaborone dam, accounting for nearly 48% significance. Similar to the Bokaa dam, the significance of vegetation health is observed to have higher correlations with the dam water levels in the Gaborone dam.

Investigating the predictor data groups for the Bokaa dam, in terms of the catchment LULC, tree-cover has the most influence in predicting the dam water levels, accounting for more than 50%; built-up, bare-soil, and grassland have the least contribution, with the significance of each at less than 1%. Among the climate factors, maximum temperature has the highest contribution, at 34%, with rainfall at 22% for the Bokaa catchment. Among the climate indicators, Niño 3.4 has the highest contribution in predicting the dam water levels in the Bokaa dam, at 28%. For the Gaborone dam, water bodies and grassland are the most important LULC classes for predicting the dam water levels, with up to 32%. The climate factors exhibit competing significance, ranging from 21% to 25%, with minimum temperature and rainfall as the most significant climate factors. For the climate indices, Niño 3.4 has the highest significance, at 42%, with AI and DSLP being the least, with a nearly equal relative importance of 17%.
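The percentages above differ depending on whether a variable is compared against all factors or only against the members of its own predictor group. The sketch below illustrates that normalisation step only. How the raw importance scores were obtained is not stated in this section, so the scores in the usage note, and the function names, are hypothetical.

```python
def to_percent(scores):
    """Scale non-negative importance scores so that they sum to 100%."""
    total = sum(scores.values())
    return {name: 100.0 * value / total for name, value in scores.items()}


def percent_within_groups(scores, groups):
    """Re-normalise the importances inside each predictor group.

    groups: dict mapping group name -> list of variable names in that group
    """
    return {
        group: to_percent({name: scores[name] for name in members})
        for group, members in groups.items()
    }
```

For example, with hypothetical raw scores `{"tree_cover": 0.30, "shrubland": 0.20, "max_temp": 0.25, "rainfall": 0.25}`, tree cover accounts for 30% across all four variables but 60% within an LULC group made up of tree cover and shrubland.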

The relative importance measures, shown in Figure 16, depict the sensitivities of the predictor variables. The results show that for both dams, LULC forms part of the most significant predictor variables; therefore, more accurate catchment LULC data, in terms of high temporal resolution and classification accuracy, are important in predicting dam water levels for both dams. The parametric sensitivities in Figure 15 and Figure 16 also imply that the prediction model should be able to capture the influences of both the high and low significant variables.

4. Discussions

The present study compares the performance of the stochastic VAR and the machine learning RFR and MLP-ANN models. The performances of each prediction horizon are compared using the average MAE, RMSE, and MAPE estimates and the R² statistics as a goodness-of-fit of the models. The metrics are considered to adequately measure the prediction accuracy and depict how well the model generalizes the unseen or test data. To determine the best predictor variables and to gauge the sensitivity of the models to the inputs, different exogenous input combinations were explored, and the results were compared using the above statistical indicators.
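The four statistics named above can be written out as follows. This is a minimal sketch with hypothetical function names; R² is assumed here to be the coefficient of determination, 1 − SS_res/SS_tot, which the text does not specify, and MAPE assumes that no observed value is zero.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Because MAPE divides by the observed value, the same absolute error weighs more heavily when the dam level is low, which is one way a model can score a high R² and still show a large MAPE.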

4.1. Influence of the Predictor Variables on Dam Water Level Predictions

4.1.1. Impact of LULC on Water Level Predictions

The current study reveals the significance of LULC in predicting dam water levels as detected by the tested models. The assumption in the five-year time epoch used in the LULC temporal resolution is that there are insignificant changes in the natural land-covers such as water bodies, grasslands, shrublands, forests, bare soils, and land-use such as croplands. However, significant changes are expected in urban built-up, although at a slow spatial and temporal rate. Only the stochastic VAR detected the correlation and variability between the dam water levels and LULC, and predicted the dam water levels with LULC as the best predictor variable, with the highest accuracy of greater than 99%. The prediction results using MLR, RFR, and MLP-ANN showed that the LULC pattern, as interpolated over the 20-year period, may not be suitable for predicting the dam water levels for both dams as it exhibited high RMSE, MAE, and MAPE errors. For the Gaborone dam, the use of LULC resulted in a lack of convergence in prediction using the MLP-ANN. To improve the significance of LULC in dam water predictions, it is recommended to increase the temporal resolution of the LULC to annual.

4.1.2. Influence of Climate Factors and Climate Indices

In predicting Bokaa dam water levels using the VAR model, the combination of climate indices, rainfall, and temperature gave the best results (R² = 0.959, MAPE = 14.3%). This is attributed to the high correlation with the climate indices (R² = 0.916, MAPE = 16.2%), which resulted in good performance of all the parameters combined. Rainfall and temperature, however, did not give good results. RFR detected a higher relationship of the dam water levels using temperature (R² = 0.836, MAPE = 14.5%), the combination of temperature and rainfall (R² = 0.824, MAPE = 13.3%), the climate indices (R² = 0.808, MAPE = 18.3%), and the combination of climate indices, rainfall, and temperature (R² = 0.820, MAPE = 18.3%). The MLP-ANN results for the Bokaa dam water levels show temperature (R² = 0.865, MAPE = 24.6%), rainfall (R² = 0.829, MAPE = 25.3%), the combination of rainfall and temperature (R² = 0.850, MAPE = 17.7%), and climate indices (R² = 0.805, MAPE = 13.2%) are directly related to the Bokaa dam water levels.

For the Gaborone dam, VAR predicted the dam water levels using the combined influences of rainfall and temperature (R² = 0.858, MAPE = 15.4%), climate indices (R² = 0.929, MAPE = 8.0%), and climate indices, rainfall and temperatures (R² = 0.995, MAPE = 36.9%). Using RFR, the dam water level trends were best predicted using the local temperature observations (R² = 0.918, MAPE = 23.7%), rainfall (R² = 0.898, MAPE = 24.8%), integrated temperatures and rainfall (R² = 0.897, MAPE = 25.2%), and climate indices (R² = 0.890, MAPE = 23.6%). Using MLP-ANN, results similar to those of RFR were observed, with local temperatures (R² = 0.925, MAPE = 30.8%), rainfall (R² = 0.926, MAPE = 24.1%), integrated temperatures and rainfall (R² = 0.920, MAPE = 24.7%), and climate indices, rainfall and temperatures (R² = 0.917, MAPE = 24.2%).

While the VAR predictor variables are different for the two dams, with the exception of a combination of climate indices, rainfall and temperature, the predictor parameters for the Gaborone dam are observed to be similar to those of the Bokaa dam. It is observed that the predictions using RFR and MLP-ANN detected the variability of both dam water levels to be influenced by the same factors. For both dams using RFR and MLP-ANN, the results show that the climate factors and climate indices are the best predictors for dam water levels and are best modelled using MLP-ANN, which had the highest prediction accuracy, compared to RFR. The results further show that in the absence of reliable rainfall and temperature data, the water levels in both dams can reliably be predicted using the machine learning models based on the regional climate indices (DSLP, AI, SOI and Niño 3.4).

From the analysis of the significance of the predictor variables in Figure 15, the relatively lower contribution of rainfall in the prediction of dam water levels shows that precipitation and the resulting runoff within the catchment may not be the only main sources of dam water, and that there are also marginal contributions from conjunctive water sources, such as wellfields and other dams. As such, improvements in the prediction of the dam water levels should include the determination of the influences of the network of inter-reservoir water transfers.

4.1.3. Model Performances

In general, MLR was not able to detect and predict the variability of the dam water levels. The lower performance of VAR in detecting the influence of the seasonal climate factors and climate indices on the variability of the dam water levels is attributed to the low convergence rate, as the convergence tends to be unstable, and the predictions easily fall into the local optimum trap, with an increase in the computational time, especially for the non-stationary variables [32]. In contrast, the main advantage of the RFR machine learning model, which gave generally good results with all the variables, is its ability to detect and discard the outlier dam water levels with ease due to the improved grouping of water level data contained in the set of terminal nodes in the decision tree. The results from MLR, VAR and RFR imply that the fluctuations in the water level in the dams are difficult to capture using the stochastic linear models [33]. RFR was able to give relatively good results because it can handle non-linear and non-Gaussian data well, with minimal over-fitting problems as the number of trees increases [34].

MLP-ANN results support the suggestion that data-driven techniques tend to overcome the drawbacks of traditional models in terms of accuracy and the ability to model complex phenomena [35]. MLP-ANN was able to capture the influence of climate factors and climate indices with higher accuracy, though it had non-converging prediction using LULC. For the two dams, it is possible to infer that the MLP-ANN predictions adapted to the changing climate conditions. The advantages of the ANNs over other methods in predicting dam levels can be attributed to the fact that the ANN structure can detect and include the non-linear components of the system in the whole data set. Comparatively, in predicting reservoir water levels for the Angat dam in the Philippines, [25] tested the Naïve-persistence and Seasonal Mean methods as baselines against ARIMA, gradient boosting machines (GBM), and Deep Neural Networks based on LSTM, univariate (DNN-U) and multivariate models (DNN-M). The results showed that the prediction of the dam water levels was better performed using the data driven Deep Neural Network and not the traditional linear models.

4.2. VAR-ANN Hybrid Dam Water Prediction Model

The results show that neither the stochastic VAR, the decision tree based RFR, nor the MLP-ANN can independently detect the compounded impacts of LULC, climate factors, and climate indices in predicting the dam water levels. In particular, the stochastic VAR is observed to be more capable of predicting the dam water levels using LULC, which exhibits a linear trend from the five-year interval interpolations, while MLP-ANN performed better than RFR and VAR in predicting the dam water levels using the seasonal and non-linear climate factors and indices.

Since time-series hydrological data comprises different frequency components characterized by non-linear interactions, hybrid models have been proposed to improve the performance in hydrological prediction [36]. These approaches include Neural Networks based on Set Pair Analysis (SPA) and Principal Component Analysis (PCA) [37,38], Chaotic Neural Networks [39], Cluster Hybrid Neural Networks [40], and Bootstrapped Artificial Neural Networks [41,42].

From the prediction results, a hybrid dam water level prediction model comprising VAR-ANN is proposed as optimal in modeling the linear and non-linear components of the dam water levels. The VAR-ANN time-series representation of the dam water levels $WL_t$ is proposed to comprise the linear $L_t$ and non-linear $N_t$ predictor variables (Equation (11)).

$$WL_t = L_t + N_t \tag{11}$$

$$\varepsilon_t = WL_t - \hat{L}_t \tag{12}$$

$$\hat{N}_t = f(e_1, e_2, \dots, e_{t-p}, e_t) \tag{13}$$

$$W\hat{L}_t = \hat{L}_t + \hat{N}_t \tag{14}$$

In the implementation, VAR is fitted to the linear components and the outcome linear-based predictions $\hat{L}_t$ at time $t$ are derived. The residuals from the VAR, termed as $\varepsilon_t$ at time $t$, are determined as in Equation (12). The $\varepsilon_t$ dataset after VAR fitting is considered to contain the non-linear $N_t$ time-series components of the dam water levels $WL_t$ and can be modelled using the ANN. With $p$ input nodes, the ANN for residuals has the form in Equation (13), with $f$ as the non-linear function estimated by the ANN and $\varepsilon_t$ is the white noise. If $\hat{N}_t$ is the ANN prediction, then the hybrid prediction of $W\hat{L}_t$ at time $t$ is defined according to Equation (14). The hybrid VAR-ANN model is implemented as depicted in Figure 17.
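The data flow of Equations (11) to (14) can be sketched in a few lines. To keep the example dependency-free, a univariate AR(1) least-squares fit stands in for the VAR stage and a k-nearest-neighbour regression on the lagged residual (p = 1) stands in for the ANN; both substitutions, the in-sample evaluation, and all function names are assumptions made for illustration and are not the authors' implementation.

```python
def fit_ar1(series):
    """Least-squares fit of y_t = a + b * y_(t-1); linear stage standing in for VAR."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def knn_regress(train_x, train_y, query, k):
    """Non-linear stage standing in for the ANN: mean target of the k nearest inputs."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def hybrid_fit_predict(water_levels, k=3):
    """Return (linear, nonlinear, hybrid) in-sample predictions for t = 2 .. n-1."""
    intercept, slope = fit_ar1(water_levels)
    # Linear predictions L_hat_t for t = 1 .. n-1.
    linear = [intercept + slope * prev for prev in water_levels[:-1]]
    # Equation (12): residuals of the linear stage.
    resid = [obs - lin for obs, lin in zip(water_levels[1:], linear)]
    # Equation (13) with p = 1: predict each residual from the previous one.
    lagged, target = resid[:-1], resid[1:]
    nonlinear = [knn_regress(lagged, target, q, k) for q in lagged]
    # Equation (14): hybrid prediction is the sum of the two stages.
    hybrid = [lin + non for lin, non in zip(linear[1:], nonlinear)]
    return linear[1:], nonlinear, hybrid
```

The design point is that the second stage never sees the raw water levels, only what the linear stage failed to explain, so any structure it recovers is added on top of the linear prediction rather than competing with it.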

From the best predictor variables for both dams, the average results of the hybrid VAR-ANN model for the two dams, presented in Figure 18, show an overall improvement in the prediction accuracy of the dam water levels. The results show that the hybrid model integrates the linear and non-linear variabilities in the predictor datasets to accurately predict the dam water levels. The VAR-ANN produces positive predictions using rainfall, temperature, climate indices, and LULC, with an average R² > 0.84 and MAPE < 10%. The results show that the average prediction RMSE, MAE, and MAPE error measures for both dams are also significantly reduced. The results imply that the hybrid model is able to capture the parametric sensitivities of both the high and low significant variables that are depicted in Figure 15 and Figure 16.

4.3. Average Model Errors and ROC Area under Curve (AUC)

The average model prediction error E (%) in Equation (15) is determined as the average for both dams using the best predictor parameters with the highest R² and the least RMSE, MAE, and MAPE error measures. In Figure 19, for the average predicted dam water level errors for the four models, the combination of the VAR and ANN diminishes the magnitude of the prediction error between the predicted and observed dam water levels for the two dams, producing the least errors for the predicted time-series dam water levels, and thus improved consistency in predicting the water levels.

$$E = \frac{WL_{\mathrm{predicted}} - WL_{\mathrm{observed}}}{WL_{\mathrm{observed}}} \times 100\% \tag{15}$$

In the first months, the E (%) for VAR-ANN is observed to be between −5% and +8% of dam water levels and diminishes to nearly 0.01% for more than 70% of the predicted dam water levels. Even though MLP-ANN performs better than RFR and VAR, its prediction errors exhibit low convergence with sinusoidal patterns in time, and this could be attributed to the influence of LULC. RFR and VAR present higher degrees of error at about 5–10%, with VAR exhibiting random spikes in error with time.
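Equation (15) and the averaging over the two dams can be expressed directly in code. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the two dams' error series are aligned month by month and averaged with equal weight.

```python
def prediction_error_percent(predicted, observed):
    """Signed relative error E (%) of Equation (15) for each time step."""
    return [100.0 * (p - o) / o for p, o in zip(predicted, observed)]


def average_across_dams(errors_dam_a, errors_dam_b):
    """Equal-weight average of two aligned error series (assumed averaging scheme)."""
    return [(a + b) / 2.0 for a, b in zip(errors_dam_a, errors_dam_b)]
```

Because E keeps its sign, over-prediction and under-prediction remain distinguishable in the time series, which is what allows patterns such as spikes or sinusoidal drift to be seen.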

To further infer the significance of the models, the area under the receiver operating characteristic curve (AUC) scores were computed for the two dams, with the results in Figure 20. The AUC scores are also based on the average true positive rate (sensitivity) and false positive rate (1 − specificity) measures from the average of the best predictor variables for the dams. The results in Figure 20 show that for the Bokaa dam and Gaborone dam, VAR-ANN had the highest AUC scores, 0.89 and 0.93, performing better than MLP-ANN and RFR. The AUC scores for RFR were nearly equal, at 0.77 and 0.78, respectively, for the Bokaa dam and Gaborone dam, while VAR performance was at AUC < 0.7 for both dams. Despite the good performance from VAR-ANN, the MAPE measures for the Gaborone dam were observed to be higher than those of the Bokaa dam. The average AUC shows that the VAR-ANN has a higher ability to predict the dam water levels from all the predictor variables.
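An AUC score requires binary class labels and a continuous score, and this section does not describe how the continuous water levels were converted into classes. The sketch below therefore shows only the generic computation, using the pairwise-ranking definition of AUC; the function name is hypothetical, and the labels and scores in the usage note are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for the positive class, 0 for the negative class
    scores: model outputs, higher meaning more likely positive
    Returns the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75; a value of 0.5 corresponds to random ranking.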

5. Conclusions

Under the influence of climate change and the intensification of land-use activities, understanding dam water capacity variations is important for planning dam water supply regimes and management. In the present study, dam water level observations in the Bokaa dam and Gaborone dam, in semi-arid Botswana, were simulated and predicted using Multivariate Linear Regression (MLR) and stochastic Vector AutoRegression (VAR) models, along with Random Forest Regression (RFR) and Multilayer Perceptron Artificial Neural Network (MLP-ANN) techniques. Using LULC, climate factors (rainfall and temperature) and climate indices (DSLP, Aridity Index (AI), SOI and Niño 3.4) as the dam water predictor variables, the results show that the stochastic VAR was able to detect the variation of LULC with dam water levels better than MLR, RFR and MLP-ANN, while RFR and MLP-ANN captured the relationships with the climate conditions, with MLP-ANN performing better than RFR. The stochastic VAR was not able to correlate rainfall and temperature with the dam water levels, except when integrated with the four climate indices. RFR and MLP-ANN gave the highest dam water level prediction results using rainfall, temperature, and the climate indices. MLP-ANN gave the best prediction results for the dam water level fluctuations for both dams, with the Gaborone dam predictions being more accurate than those for the Bokaa dam in terms of R², but slightly less accurate when determined using MAPE. The higher MAPE for the Gaborone dam confirmed that the dam does not entirely rely on precipitation, but also on conjunctive water sources, including periodic direct supply from the Bokaa dam and wellfields. The proposed VAR-ANN hybrid model improved the prediction accuracy of the dam water levels for both dams by integrating the linear and non-linear variabilities in the predictor datasets and the dam water levels.
To improve on the current study, the temporal resolution of the LULC should be increased to annual in order to accurately capture the seasonal variabilities in the LULC; secondly, the contributions of water sources from wellfields and other dams should be incorporated into the prediction modeling. For the low convergence in the simulation and prediction of the dam water levels, using faster and hybrid tree-based machine learning algorithms is recommended for further investigations.

Author Contributions

Conceptualization, Y.O.O., D.B.M. and G.A.; Funding acquisition, Y.O.O. and J.Q.; Investigation, B.N., P.O. and B.P.P.; Methodology, Y.O.O., D.B.M. and G.A.; Project administration, B.N., P.O., B.P.P. and J.Q.; Resources, B.N., B.P.P. and J.Q.; Writing—original draft, Y.O.O. and D.B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was funded by the Office of Research and Development (ORD) of the University of Botswana and by USAID Partnerships for Enhanced Engagement in Research (PEER) under the PEER program cooperative agreement number: AID-OAA-A-11-00012.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data sources for this study are summarized as: (1) Landsat-image from USGS Earth Explorer (https://earthexplorer.usgs.gov/, accessed on 17 November 2021); (2) DEM from ALOS PALSAR (https://search.asf.alaska.edu/#/, accessed on 17 November 2021); (3) precipitation and temperature data from the Department of Meteorological Services (Botswana); and (4) dam reservoir water level data from the Department of Water and Sanitation (Botswana) and Water Utilities Corporation (WUC) (Botswana). The rest of the data used in the study are presented in this paper.

Acknowledgments

The authors wish to thank the Department of Water and Sanitation (Botswana) for providing the measured dam water levels.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Garza-Díaz, L.E.; DeVincentis, A.J.; Sandoval-Solis, S.; Azizipour, M.; Ortiz-Partida, J.P.; Mahlknecht, J.; Cahn, M.; Medellín-Azuara, J.; Zaccaria, D.; Kisekka, I. Land-use optimization for sustainable agricultural water management in Pajaro Valley, California. J. Water Resour. Plan. Manag. 2019, 145, 05019018.
2. Wantzen, K.M.; Rothhaupt, K.O.; Mörtl, M.; Cantonati, M.; Tóth, L.G.; Fischer, P. Ecological effects of water-level fluctuations in lakes: An urgent issue. In Ecological Effects of Water-Level Fluctuations in Lakes; Springer: Dordrecht, The Netherlands, 2008; pp. 1–4.
3. Hu, W.; Zhai, S.; Zhu, Z.; Han, H. Impacts of the Yangtze River water transfer on the restoration of Lake Taihu. Ecol. Eng. 2008, 34, 30–49.
4. Mosavi, A.; Ozturk, P.; Chau, K. Flood prediction using machine learning models: Literature review. Water 2018, 10, 1536.
5. Khan, M.S.; Coulibaly, P. Application of support vector machine in lake water level prediction. J. Hydrol. Eng. 2006, 11, 199–205.
6. Altunkaynak, A. Forecasting surface water level fluctuations of Lake Van by artificial neural networks. Water Resour. Manag. 2007, 21, 399–408.
7. Lai, X.; Jiang, J.; Liang, Q.; Huang, Q. Large-scale hydrodynamic modeling of the middle Yangtze River Basin with complex river–lake interactions. J. Hydrol. 2013, 492, 228–243.
8. Li, Y.; Zhang, Q.; Werner, A.; Yao, J. Investigating a complex lake–catchment–river system using artificial neural networks: Poyang Lake (China). Hydrol. Res. 2015, 46, 912–928.
9. Zaji, A.H.; Bonakdari, H.; Gharabaghi, B. Reservoir water level forecasting using group method of data handling. Acta Geophys. 2018, 66, 717–730.
10. Kumar, R.; Singh, M.P.; Roy, B.; Shahid, A.H. A comparative assessment of metaheuristic optimized extreme learning machine and deep neural network in multi-step-ahead long-term rainfall prediction for all-Indian regions. Water Resour. Manag. 2021, 35, 1927–1960.
11. Do Carmo, J.S.A. Physical Modelling vs. Numerical Modelling: Complementarity and Learning. 2020. Available online: https://www.preprints.org/manuscript/202007.0753/v2 (accessed on 17 November 2021).
12. Fotovatikhah, F.; Herrera, M.; Shamshirband, S.; Chau, K.; Faizollahzadeh, A.S.; Piran, M.J. Survey of computational intelligence as basis to big flood management: Challenges, research directions and future work. Eng. Appl. Comput. Fluid Mech. 2018, 12, 411–437.
13. Li, B.; Yang, G.; Wan, R.; Dai, X.; Zhang, Y. Comparison of random forests and other statistical methods for the prediction of lake water level: A case study of the Poyang Lake in China. Hydrol. Res. 2016, 47, 69–83.
14. Trichakis, I.C.; Nikolos, I.K.; Karatzas, G.P. Artificial Neural Network (ANN) Based Modeling for Karstic Groundwater Level Simulation. Water Resour. Manag. 2011, 25, 1143–1152.
15. Hipni, A.; El-shafie, A.; Najah, A.; Karim, O.A.; Hussain, A.; Mukhlisin, M. Daily forecasting of dam water levels: Comparing a Support Vector Machine (SVM) Model with Adaptive Neuro Fuzzy Inference System (ANFIS). Water Resour. Manag. 2013, 27, 3803–3823.
16. Sapitang, M.; Ridwan, W.M.; Faizal, F.K.; Najah, A.A.; El-Shafie, A. Machine learning Application in reservoir water level forecasting for sustainable hydropower generation strategy. Sustainability 2020, 12, 6121.
17. Seo, Y.; Kim, S.; Singh, V.P. Multistep-ahead flood forecasting using wavelet and data-driven methods. KSCE J. Civ. Eng. 2015, 19, 401–417.
18. Piri, J.; Kahkha, M.R.R. Prediction of water level fluctuations of chahnimeh reservoirs in Zabol using ANN, ANFIS and Cuckoo optimization algorithm. Iran. J. Health Saf. Environ. 2016, 4, 706–715.
19. Zhang, S.; Lu, L.; Yu, J.; Zhou, H. Short term water level prediction using different artificial intelligent models. In Proceedings of the 5th International Conference on Agro-geoinformatics (Agro-geoinformatics), Tianjin, China, 18–20 July 2016.
20. Üneş, F.; Demirci, M.; Taşar, B.; Kaya, Y.Z.; Varçin, H. Estimating dam reservoir level fluctuations using data-driven techniques. Pol. J. Environ. Stud. 2018, 28, 3451–3462.
21. Hong, J.; Lee, S.; Bae, J.H.; Lee, J.; Park, W.J.; Lee, D.; Kim, J.; Lim, K.J. Development and Evaluation of the Combined Machine Learning Models for the Prediction of Dam Inflow. Water 2020, 12, 2927.
22. Choi, C.; Kim, J.; Han, H.; Han, D.; Kim, H.S. Development of Water Level Prediction Models Using Machine Learning in Wetlands: A Case Study of Upo Wetland in South Korea. Water 2020, 12, 93.
23. Wang, Q.; Wang, S. Machine Learning-Based Water Level Prediction in Lake Erie. Water 2020, 12, 2654.
24. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M5 Accuracy Competition: Results, Findings and Conclusions. Int. J. Forecast. 2022, 38, 1365–1385.
25. Ibañez, S.C.; Dajac, C.V.G.; Liponhay, M.P.; Legara, E.F.T.; Esteban, J.M.H.; Monterola, C.P. Forecasting reservoir water levels using deep neural networks: A case study of Angat Dam in the Philippines. Water 2021, 14, 34.
26. Hyndman, R.J. A Brief History of Forecasting Competitions. Int. J. Forecast. 2020, 36, 7–14.
27. Ouma, Y.O.; Moalafhi, D.; Anderson, G.; Nkwae, B.; Odirile, P.; Parida, B.P.; Sebusang, N.; Nkgau, T.; Qi, J. Predicting the variability of dam water levels with land-use and climatic factors using Random Forest and Vector AutoRegression models. Proceedings of SPIE 12262, Remote Sensing for Agriculture, Ecosystems, and Hydrology XXIV, 122620J, Berlin, Germany, 5–7 September 2022.
28. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
29. Ouma, Y.; Nkwae, B.; Moalafhi, D.; Odirile, P.; Parida, B.; Anderson, G.; Qi, J. Comparison of Machine Learning Classifiers For Multitemporal and Multisensor Mapping of Urban LULC Features. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, 43, 681–689.
30. Manatsa, D.; Chingombe, W.; Matsikwa, H.; Matarira, C.H. The superior influence of Darwin Sea level pressure anomalies over ENSO as a simple drought predictor for Southern Africa. Theor. Appl. Climatol. 2008, 92, 1–14.
31. Ouma, Y.O.; Okuku, C.O.; Njau, E.N. Use of artificial neural networks and multiple linear regression model for the prediction of dissolved oxygen in rivers: Case study of hydrographic basin of River Nyando, Kenya. Complexity 2020, 2020, 9570789.
32. Ahmed, A.N.; Yafouz, A.; Birima, A.H.; Kisi, O.; Huang, Y.F.; Sherif, M.; Sefelnasr, A.; El-Shafie, A. Water level prediction using various machine learning algorithms: A case study of Durian Tunggal river, Malaysia. Eng. Appl. Comput. Fluid Mech. 2022, 16, 422–440.
33. Štefelová, N.; Alfons, A.; Palarea-Albaladejo, J.; Filzmoser, P.; Hron, K. Robust regression with compositional covariates including cellwise outliers. Adv. Data Anal. Classif. 2021, 15, 869–909.
34. Genuer, R.; Poggi, J.-M.; Tuleau-Malot, C. Variable selection using random forests. Pattern Recogn. Lett. 2010, 31, 2225–2236.
35. Allawi, M.F.; Binti Othman, F.; Afan, H.A.; Ahmed, A.N.; Hossain, M.S.; Fai, C.M.; El-Shafie, A. Reservoir Evaporation Prediction Modeling Based on Artificial Intelligence Methods. Water 2019, 11, 1226.
36. Okkan, U.; Serbes, Z.A. The combined use of wavelet transform and black box models in reservoir inflow modeling. J. Hydrol. Hydromech. 2013, 61, 112–119.
37. Wang, W.; Van Gelder, P.; Vrijling, J.K.; Ma, J. Forecasting daily streamflow using hybrid ANN models. J. Hydrol. 2006, 324, 383–399.
38. Wu, C.L.; Chau, K.W.; Li, Y.S. Predicting monthly streamflow using data-driven models coupled with data-preprocessing techniques. Water Resour. Res. 2009, 45, W08432.
39. Karunasinghe, D.S.K.; Liong, S.Y. Chaotic time series prediction with a global model: Artificial neural network. J. Hydrol. 2006, 323, 92–105.
40. Cigizoglu, H.K.; Kisi, O. Flow prediction by three back propagation techniques using k-fold partitioning of neural network training data. Nord. Hydrol. 2005, 36, 49–64.
41. Seo, Y.; Park, K.B.; Kim, S.; Singh, V.P. Application of bootstrap-based artificial neural networks to flood forecasting and uncertainty assessment. Proceedings of 6th International Perspective on Water Resources and the Environment, Izmir, Turkey, 7–9 January 2013; EWRI-ASCE: Reston, VA, USA, 2013.
42. Tiwari, M.K.; Chatterjee, C. Development of an accurate and reliable hourly flood forecasting model using wavelet-bootstrap-ANN (WBANN) hybrid approach. J. Hydrol. 2010, 394, 458–470.

Figure 1. Location map of the Limpopo River Basin (LRB), Botswana’s LRB, Bokaa and Gaborone dams and the dam catchment areas. Reprinted with permission from ref. [27]. Copyright 2022 Society of Photo-Optical Instrumentation Engineers.

Figure 2. Variability of rainfall and climate indices within Bokaa and Gaborone dam catchments (2001–2019).

Figure 3. Temperature variability within the Bokaa and Gaborone dam catchments (2001–2019).

Figure 4. (a) Variability of dam water levels in Bokaa and Gaborone dams with mean monthly precipitation. (b) Correlation between dam water levels and rainfall.

Figure 5. Correlation matrix heatmap of the predictor variables and dam water levels for (a) Bokaa dam and (b) Gaborone dam. The datasets are abbreviated as dam water level (WL), rainfall (RN), max temperature (TMX), min temperature (TMM), average temperature (TMA), Darwin Sea Level Pressure (DSLP), Aridity Index (AI), Southern Oscillation Index (SOI), Niño 3.4 (NINO), Built-Up (BUP), cropland (CL), water body (WT), tree-cover or forest (FR), shrubland (SL), grassland (GL) and bare-land or soil (BL).

Figure 6. Hyperparameter tuning response for dam water level predictions using RFR model based on lag order, max_depth and n_estimators for (a) Bokaa dam (top row), and (b) Gaborone dam (bottom row). Reprinted with permission from ref. [27]. Copyright 2022 Society of Photo-Optical Instrumentation Engineers.

Figure 7. Hyperparameter tuning for dam water level prediction using MLP-ANN model based on lag order, number of hidden layers, and epochs for (a) Bokaa dam (top row), and (b) Gaborone dam (bottom row).

Figure 8. Dam water level predictions for Bokaa dam (top) and Gaborone dam (bottom) using multivariate linear regression. Reprinted with permission from ref. [27]. Copyright 2022 Society of Photo-Optical Instrumentation Engineers.

Figure 13. MLP-ANN prediction of dam water levels for Bokaa dam.

Figure 14. MLP-ANN prediction of dam water levels for Gaborone dam.

Figure 15. Relative importance of the predictor variables for Bokaa and Gaborone dam water levels. Reprinted with permission from ref. [27]. Copyright 2022 Society of Photo-Optical Instrumentation Engineers.

Figure 16. Relative significance of predictor variables within LULC, climate factors and climate indices for Bokaa and Gaborone dam water levels.

Figure 17. VAR-ANN hybrid model for dam water level prediction.

Figure 18. VAR-ANN model average results for dam water level predictions in Bokaa and Gaborone dams.

Figure 19. Mean dam water level prediction errors E (%) from VAR-ANN, MLP-ANN, RFR and VAR.

Figure 20. ROC curves for VAR, RFR, ANN and VAR-ANN and the respective AUC scores.

Table 1. LULC classification accuracy. OA is average overall classification accuracy; PA is average producer accuracy and UA is the average user accuracy.

Bokaa Dam					Gaborone Dam
Year	PA (%)	UA (%)	OA (%)	Kappa Index	PA (%)	UA (%)	OA (%)	Kappa Index
1986	-	-	-	-	82.4%	80.0%	88.9%	0.82
1989	-	-	-	-	83.1%	88.1%	89.1%	0.84
1994	86.0%	86.2%	90.5%	0.857	75.8%	86.8%	83.4%	0.76
1999	87.5%	86.1%	85.5%	0.790	85.4%	86.6%	85.2%	0.78
2004	89.7%	87.8%	88.3%	0.837	86.7%	87.3%	89.6%	0.84
2009	90.1%	86.7%	89.3%	0.869	80.0%	84.3%	81.3%	0.75
2014	85.3%	88.2%	89.4%	0.858	87.8%	91.0%	85.2%	0.78
2019	91.3%	88.3%	88.0%	0.833	86.0%	83.6%	84.8%	0.80

Table 2. Spatial-temporal LULC in Bokaa and Gaborone catchments.

	Bokaa Catchment (Year/Area (km²))
LULC Class	2001	2004	2009	2014	2019
Tree Cover	206.14	477.58	609.53	776.80	507.08
Shrubland	2237.10	2066.95	2173.66	2175.63	2296.40
Grassland	492.00	487.99	453.44	236.40	248.72
Cropland	297.20	306.55	173.84	196.23	233.57
Water	1.76	7.15	8.93	4.04	5.23
Built-Up	70.30	76.42	96.53	99.59	108.79
Bare-soil	305.40	187.26	93.97	121.21	210.10
	Gaborone Catchment (Year/Area (km²))
LULC Class	2001	2004	2009	2014	2019
Tree Cover	752.46	870.96	1160.17	1202.26	1364.78
Shrubland	2782.85	2478.52	1805.56	2210.82	1778.31
Grassland	167.24	174.03	445.11	62.23	323.70
Cropland	451.62	454.66	588.12	526.62	268.74
Water	15.07	12.65	18.49	7.66	17.88
Built-Up	70.28	98.31	155.77	167.44	172.95
Bare-soil	104.73	255.12	171.03	167.22	417.89

Table 3. Descriptive statistics for the datasets. (BC = Bokaa dam catchment; GC = Gaborone dam catchment; SD = standard deviation; CV = coefficient of variation and SE = standard error).

Parameters			Min	Max	Median	Mean	SD	CV	SE
Dam water levels (WL) (%)	B-dam		2.00	105.00	47.00	46.46	28.50	61.34	2.74
Dam water levels (WL) (%)	G-dam		1.00	100.00	44.50	43.86	31.65	72.17	3.05
Rainfall (RN) (mm)			0.00	174.10	17.70	35.15	43.82	124.68	4.24
Min Temp (TMM) (°C)			1.60	23.30	15.15	13.49	6.03	44.73	0.58
Max Temp (TMX) (°C)			20.90	39.10	29.45	28.88	3.93	13.61	0.38
Avg Temp (TMA) (°C)			11.25	29.05	22.40	21.19	4.82	22.73	0.46
AI			10.00	96.15	19.21	27.36	21.71	79.35	2.09
DSLP			4.80	15.00	10.45	10.20	2.80	27.49	0.27
SOI			−6.55	8.04	0.25	0.40	3.16	791.38	0.30
Niño 3.4 (NINO)			25.00	29.42	27.18	27.17	0.99	3.64	0.10
TreeCover (FR)		BC	14.05%	21.52%	18.80%	18.41%	0.02	13.33	0.01
TreeCover (FR)		GC	27.10%	31.42%	28.40%	28.78%	0.02	5.41	0.01
Shrubland (SL)		BC	60.24%	63.61%	60.95%	61.38%	0.01	2.12	0.00
Shrubland (SL)		GC	40.93%	50.89%	46.81%	46.31%	0.03	6.80	0.01
Grassland (GL)		BC	6.55%	10.13%	6.82%	7.47%	0.01	17.05	0.00
Grassland (GL)		GC	1.43%	7.45%	5.00%	4.56%	0.02	45.01	0.01
Cropland (CL)		BC	5.07%	6.47%	5.65%	5.70%	0.00	8.65	0.00
Cropland (CL)		GC	6.19%	12.90%	11.00%	10.34%	0.02	23.93	0.01
Water body (WT)		BC	0.11%	0.19%	0.14%	0.14%	0.00	18.04	0.00
Water body (WT)		GC	0.18%	0.41%	0.27%	0.28%	0.00	26.34	0.00
Built-Up (BUP)		BC	2.71%	3.01%	2.81%	2.83%	0.00	3.88	0.00
Built-Up (BUP)		GC	3.68%	3.98%	3.88%	3.86%	0.00	2.60	0.00
Bare-soil (BL)		BC	2.90%	5.82%	3.85%	4.08%	0.01	26.03	0.00
Bare-soil (BL)		GC	3.85%	9.62%	5.00%	5.80%	0.02	38.22	0.01
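The CV and SE columns in Table 3 follow from the mean and SD columns. The sketch below, in Python, shows the usual definitions (CV as 100 × SD / mean, SE as SD divided by the square root of the sample size) and recomputes CV for a few rows using the rounded values printed in the table. The function names are hypothetical, and the sample size n is left as a parameter because it is not restated in this part of the paper.

```python
from math import sqrt


def coefficient_of_variation(sd, mean):
    """CV in percent: 100 * SD / mean (undefined for a zero mean)."""
    return 100.0 * sd / mean


def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n); n is an assumed input."""
    return sd / sqrt(n)


# (mean, SD, CV as printed) copied from Table 3.
table3_rows = {
    "WL Bokaa dam": (46.46, 28.50, 61.34),
    "WL Gaborone dam": (43.86, 31.65, 72.17),
    "Rainfall": (35.15, 43.82, 124.68),
    "SOI": (0.40, 3.16, 791.38),
}

recomputed = {}
for name, (mean, sd, cv_printed) in table3_rows.items():
    recomputed[name] = coefficient_of_variation(sd, mean)
    print(f"{name}: recomputed CV = {recomputed[name]:.2f}, printed CV = {cv_printed:.2f}")
```

The small differences from the printed values come from using rounded means and SDs. The SOI row also shows why CV becomes very large when the mean is close to zero, so that value mainly reflects the near-zero mean.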

Table 4. Optimal VAR(p) lag order determinants for Gaborone dam and Bokaa dam.

Gaborone Dam						Bokaa Dam
Dataset	Lag Order	AIC	BIC	HQIC	R²	Lag Order	AIC	BIC	HQIC	R²
Set-1	7	−45.9	−37.3	−42.5	0.761	7	−43.3	−35.1	−40.0	0.785
Set-2	43	−58.6	−43.8	−52.6	0.329	6	−57.6	−55.8	−56.9	0.203
Set-3	2	−142.6	−133.1	−138.7	0.810	8	−157.6	−125.2	−144.5	0.872
Set-4	40	−9.5	−5.9	−8.1	0.224	20	−9.1	−7.7	−8.5	0.256
Set-5	8	−62.8	−53.2	−58.9	0.952	7	−78.2	−69.9	−74.8	0.917
Set-6	2	−100.8	−92.5	−97.5	0.936	6	−116.9	−91.9	−106.8	0.860
Set-7	7	−20.8	−18.7	−19.9	0.121	12	−19.6	−16.5	−18.3	0.798
Set-8	2	−25.9	−24.9	−25.6	0.884	7	−25.4	−22.1	−24.1	0.902
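For readers less familiar with the three criteria reported in Table 4, the block below gives their usual textbook forms for a VAR(p) model. It is a hedged reference sketch rather than the authors' exact definitions. The symbols are notation assumed here: K is the number of series in the VAR, T is the effective number of observations and \hat{\Sigma}_u(p) is the residual covariance matrix of the fitted VAR(p). Software packages may also count intercept terms in the penalty, which shifts the values slightly.

```latex
\begin{aligned}
\mathrm{AIC}(p)  &= \ln\left|\hat{\Sigma}_u(p)\right| + \frac{2}{T}\, p K^{2},\\
\mathrm{BIC}(p)  &= \ln\left|\hat{\Sigma}_u(p)\right| + \frac{\ln T}{T}\, p K^{2},\\
\mathrm{HQIC}(p) &= \ln\left|\hat{\Sigma}_u(p)\right| + \frac{2 \ln\ln T}{T}\, p K^{2}.
\end{aligned}
```

Under these forms the preferred lag order is the one that minimises the chosen criterion. The log-determinant term is negative whenever the determinant of the residual covariance is below one, so negative criterion values such as those in Table 4 are not unusual.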

Table 5. Bokaa and Gaborone dam RFR optimal hyperparameters after tuning for best predictor datasets.

Best Descriptor Data Sets		Lag Order	Max_Depth	n_Estimators	R²
Bokaa dam
Set-1	Climate indices, Rainfall, Min-Avg-Max Temperatures	12	11	52	0.840
Set-2	Min-Avg-Max Temperatures	11	11	51	0.831
Set-3	All Variables	2	20	50	0.824
Set-7	Rainfall, Min-Max Temperatures	13	20	50	0.820
Gaborone dam
Set-2	Min-Avg-Max Temperatures	1	28	379	0.914
Set-4	Rainfall	1	28	379	0.817
Set-7	Rainfall, Min-Max Temperatures	1	20	100	0.819
Set-8	Climate Indices	2	20	350	0.563

Table 6. Bokaa and Gaborone dam optimal hyperparameters after tuning for MLP-ANN.

Best Descriptor Data Sets		Lag Order	Hidden_Layers	Epochs	Batch Size	R²
Bokaa dam
Set-2	Min-Avg-Max Temperatures	2	3	700	5	0.865
Set-3	All Variables	1	4	400	13	0.825
Set-4	Rainfall	1	2	400	5	0.829
Set-7	Rainfall, Min-Max Temperatures	1	5	300	5	0.850
Gaborone dam
Set-1	Climate indices, Rainfall, Min-Avg-Max Temperatures	3	2	100	1	0.882
Set-2	Min-Avg-Max Temperatures	1	2	600	13	0.914
Set-4	Rainfall	3	2	100	15	0.920
Set-7	Rainfall, Min-Max Temperatures	4	2	100	13	0.921

Table 7. Performance of the different datasets as water level predictors using MLR.

Predictor Set	RMSE (%)		R²		MAE (%)		MAPE (%)
Predictor Set	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam
Set-1	26.2	62.4	0.151	0.181	22.5	61.9	75.8	73.2
Set-2	26.9	26.6	0.098	0.013	22.9	22.8	58.2	92.0
Set-3	18.3	15.8	0.583	0.841	14.6	12.7	62.3	50.8
Set-4	28.2	26.9	0.015	0.001	24.2	22.8	60.6	93.2
Set-5	20.3	16.6	0.489	0.785	16.5	13.5	66.3	39.1
Set-6	18.3	15.8	0.583	0.841	14.6	12.7	62.3	60.8
Set-7	26.8	26.4	0.110	0.021	22.8	22.6	78.0	91.5
Set-8	27.5	26.3	0.058	0.177	23.4	22.9	88.9	73.7

Table 8. Accuracy statistics for water level predictions using the VAR model for Bokaa and Gaborone dams.

Predictor Set	RMSE (%)		R²		MAE (%)		MAPE (%)
Predictor Set	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam
Set-1	3.5	2.7	0.959	0.995	2.7	2.2	14.3	36.9
Set-2	27.9	52.9	0.157	0.181	23.1	49.9	87.4	88.3
Set-3	4.9	0.2	0.928	0.998	3.4	0.1	21.9	1.5
Set-4	43.7	86.7	0.167	0.116	34.9	73.7	41.0	45.9
Set-5	0.7	0.6	0.998	0.999	0.2	0.2	1.5	3.3
Set-6	4.4	0.2	0.975	0.876	3.0	0.1	28.9	1.5
Set-7	42.8	1.3	0.291	0.858	32.1	0.9	40.6	15.4
Set-8	3.4	0.7	0.916	0.929	2.8	0.6	16.2	8.0

Table 9. Accuracy statistics for water level predictions using the RFR model for Bokaa and Gaborone dams.

Predictor Set	RMSE (%)		R²		MAE (%)		MAPE (%)
Predictor Set	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam
Set-1	12.2	10.6	0.820	0.884	7.8	6.5	18.3	25.0
Set-2	11.3	9.8	0.836	0.918	7.1	5.6	14.5	23.7
Set-3	11.9	10.1	0.829	0.816	7.1	6.8	12.1	30.7
Set-4	12.5	10.9	0.811	0.898	7.9	6.1	12.9	24.8
Set-5	12.5	11.3	0.807	0.653	8.0	7.1	12.8	37.5
Set-6	12.3	10.9	0.815	0.782	7.9	6.9	17.6	30.3
Set-7	12.3	10.9	0.824	0.897	7.2	6.1	13.3	25.2
Set-8	12.6	11.3	0.808	0.890	8.0	6.4	18.3	23.6

Table 10. Performance accuracy for water level predictions using MLP-ANN for Bokaa and Gaborone dams.

Predictor Set	RMSE (%)		R²		MAE (%)		MAPE (%)
Predictor Set	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam	B-Dam	G-Dam
Set-1	14.6	9.8	0.717	0.917	10.7	5.6	17.1	24.2
Set-2	10.9	9.3	0.865	0.925	6.5	4.9	24.6	30.8
Set-3	15.2	45.4	0.704	−7.801	10.7	40.9	23.4	57.7
Set-4	11.6	9.3	0.829	0.926	6.7	4.5	25.3	24.1
Set-5	15.8	42.6	0.627	−5.790	11.7	38.7	27.7	76.9
Set-6	17.3	28.7	0.449	−2.049	13.5	24.9	56.6	38.9
Set-7	11.3	9.8	0.850	0.920	6.8	4.7	17.7	24.7
Set-8	12.1	18.2	0.805	0.407	8.4	13.7	13.2	38.0
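The negative R² entries for the Gaborone dam in Table 10 (Set-3, Set-5 and Set-6) can look like an error. Assuming R² is computed as the coefficient of determination rather than as a squared correlation (an assumption here, since the formula is not restated in this part of the paper), it can be written as follows, with y_t the observed level, \hat{y}_t the predicted level and \bar{y} the mean of the observations:

```latex
R^{2} = 1 - \frac{\sum_{t=1}^{n} (y_t - \hat{y}_t)^{2}}{\sum_{t=1}^{n} (y_t - \bar{y})^{2}},
\qquad
R^{2} < 0 \iff \sum_{t=1}^{n} (y_t - \hat{y}_t)^{2} > \sum_{t=1}^{n} (y_t - \bar{y})^{2}.
```

Under that definition, a negative value means the model predicts worse than a constant equal to the observed mean, which agrees with the large RMSE and MAE values on the same rows.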

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

