1. Introduction
River water quality plays a crucial role in ensuring the sustainability and health of freshwater ecosystems. Traditional monitoring methods often have limited spatial and temporal coverage, making it difficult to assess and manage water quality effectively [1]. However, recent developments in machine learning techniques have indicated the potential to predict water quality accurately based on catchment characteristics [1,2]. One of the underlying reasons for the growing interest in applying machine learning techniques to predict river water quality is the ability to simultaneously consider a wide range of catchment characteristics, including land use, soil properties, climate data, topography, and hydrological features [3]. These variables interact in complex ways, and their associations may not easily be distinguished using conventional analytical methods. In practice, different mathematical models with satisfactory accuracy can be used to model water quality parameters, as in [4,5]. A more comprehensive understanding of the factors influencing water quality can be achieved by applying machine learning algorithms, which can disclose hidden patterns and capture nonlinear relationships in large and diverse datasets [6].
The performance of supervised machine learning algorithms in predicting river water quality has been demonstrated in recent studies [7]. These algorithms are trained on historical water quality data, together with the associated catchment characteristics, to uncover the patterns and relationships between them [8]. Researchers have developed accurate predictive models by applying algorithms such as decision trees, random forests, support vector machines, and neural networks [9,10,11]. These models can then predict water quality parameters, such as nutrient concentrations, pollutant levels, and biological indicators, from catchment characteristics [12]. Such predictions can help identify potential pollution points, prioritize management actions, and support decision-making processes for water resource management.
In addition to catchment characteristics, the integration of remote sensing data has emerged as a valuable means of significantly enhancing the accuracy of water quality predictions [13,14], because remote sensing techniques, including satellite imagery and aerial photography, provide spatially explicit information about land cover/land use, vegetation, and surface characteristics. Researchers can improve the predictive performance of machine learning models by combining remote sensing data with catchment characteristics [15]. For example, satellite data can provide insight into vegetation dynamics, land use/land cover changes, and the extent of impervious surfaces in urban and semi-urban areas, all of which can influence water quality. Integrating these additional spatial data sources into machine learning models can yield more accurate and spatially explicit predictions, allowing a more comprehensive assessment of water quality dynamics across large river basins.
Furthermore, applying machine learning techniques to predict river water quality has led to collaborative efforts between researchers and stakeholders [16,17], which can be crucial for achieving the objectives of water resources management. These efforts aim to develop standardized frameworks and models that can be applied across different catchments and regions. Sharing data, methodologies, and best practices can collectively improve the accuracy and reliability of the predictive models that researchers develop. Collaborative initiatives also facilitate the identification of common challenges, such as data availability and quality issues, and foster the development of innovative solutions.
Advances in machine learning applications to predict river water quality underscore the growing importance of this field. By integrating advanced machine learning algorithms with catchment characteristics and remote sensing data, researchers are moving toward more sustainable and informed water resource management. These predictive models can support policymakers, water resource managers, and environmental authorities in making evidence-based decisions, implementing targeted pollution control measures, and maintaining the ecological integrity of river ecosystems.
Integrating machine learning techniques with catchment characteristics gives researchers, water resource engineers, planners, and managers immense potential to predict river water quality. By leveraging the power of advanced algorithms and incorporating diverse environmental data sources, these models can provide accurate and timely information to support water resource management, pollution mitigation efforts, and the preservation of freshwater ecosystems [18]. The advancements made in this field highlight the growing significance of machine learning in addressing the challenges associated with water quality prediction and paving the way for a more sustainable and informed management of our precious water resources.
Although machine learning applications in predicting river water quality based on catchment characteristics have shown promising findings, several gaps in this research field warrant further investigation. They include but are not limited to (1) data availability and quality, (2) the incorporation of temporal dynamics, (3) uncertainty estimation, (4) across-catchments transferability, (5) the integration of socio-economic factors, and (6) interpretability and transparency, which are briefly addressed as follows:
Data availability and quality, particularly for historical water quality records and comprehensive catchment characteristics, remain challenging in many regions. Limited data may lead to biased or incomplete models, limiting the accuracy and generality of predictions. Efforts should be made to improve data collection methods, establish standardized data protocols, and enhance data sharing among researchers and stakeholders.
Current machine learning models often ignore the temporal dimension and assume static relationships between catchment characteristics and water quality parameters [19]. Yet river water quality is affected by various temporal factors, such as seasonal variations, climate change, and short-term events such as precipitation or pollution incidents [20]. Integrating these temporal dynamics into machine learning models could increase their predictive capacity and allow for more accurate short-term and long-term water quality forecasts.
Machine learning models typically provide point predictions, but quantifying and communicating the uncertainties associated with these predictions is crucial for decision-making and risk assessment [21]. Developing methods to quantify and propagate uncertainties through the modeling process, considering sources of uncertainty such as data quality, model structure, and parameter estimation, would enhance the reliability and applicability of predictive models.
Deep learning models have also been considered for modeling changes in water reservoirs. In [22], the Long Short-Term Memory (LSTM) network methodology was applied, and accuracy was assessed using several criteria, including the coefficient of determination (R²), root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute deviation (MAD), and Nash–Sutcliffe efficiency (NSE). Satisfactory model accuracy was achieved on a series of samples covering the period from 2003 to 2025 for five basins in Saudi Arabia [22].
Research on the application of artificial intelligence techniques—artificial neural networks (ANN), the group method of data handling (GMDH), and support vector machines (SVM)—for predicting water quality components in the Tireh River, southwest Iran, showed that ANN and SVM models, using tansig and RBF functions, respectively, achieved satisfactory performance. The database included samples collected over a period of 55 years [23].
In addition, models developed for a particular catchment cannot be applied directly to another because of variations in catchment characteristics, land use/cover, soil, geology, and climatic conditions. The development of transferable models that account for catchment-specific variations while capturing general patterns would be valuable for managing water resources at larger scales [24]. On the other hand, characteristics related to human activities, such as agricultural practices, urbanization, and industrial activities, have a significant impact on water quality. Incorporating socio-economic factors into machine learning models can improve their predictive power and enable more comprehensive water quality assessments. However, integrating socio-economic data and understanding the complex interactions between human activity and water quality present challenges that must be addressed.
It should also be noted that the lack of interpretability and transparency in machine learning models can limit their adoption and acceptance by policymakers and stakeholders [25]. The development of interpretable machine learning techniques that provide insight into the model's decision-making process and highlight the most influential catchment characteristics would improve the reliability and usability of predictive models.
Addressing these gaps requires interdisciplinary collaborations among hydrologists, ecologists, data scientists, policymakers, and other relevant stakeholders. Furthermore, focusing on data-driven approaches, data-sharing initiatives, and advances in computational methods will be critical to advancing the field and harnessing the full potential of machine learning in predicting river water quality based on catchment characteristics. In this study, we investigate how spatial variations in catchment characteristics can be used to explain river water quality using machine learning techniques.
This paper explores the relationship between catchment attributes and in-stream water quality parameters using machine learning methods. It evaluates model accuracy with RMSE, MAE, R, and MAPE, identifies optimal models for 11 parameters, and determines significant influencing variables.
The predictive models developed for each water parameter demonstrate strong performance in most cases. The significant variables identified provide insights into the key factors influencing water quality in the studied catchment. This research, therefore, serves as a catalyst for fostering a nuanced and effective approach to water resource management, underpinned by the empirical foundation laid by the predictive models and the discerned influential variables. As a result, the integration of these findings into decision-making processes holds the potential to optimize resource allocation, mitigate environmental impacts, and ultimately contribute to the overarching goal of achieving sustainable and resilient water ecosystems. The study highlights the potential of artificial intelligence for quick and accurate water quality assessment, tailored to watershed attributes.
2. Materials and Methods
To establish relationships between the catchment attributes and water quality parameters, machine learning methods were employed. Multiple algorithms, such as regression trees, TreeBagger, random forests, boosted trees, and Gaussian process regression (GPR) models, were applied to construct predictive models for each water quality parameter.
2.1. Regression Tree Models
A regression tree (RT) is a simple and comprehensible machine learning model applicable to both regression and classification problems. It follows a tree-like structure composed of nodes and branches (Figure 1). The fundamental concept behind regression trees is to partition the input space into distinct regions and assign predictive values to these regions. This segmentation enables the model to make predictions based on the most relevant conditions and characteristics of the data.
Each node corresponds to a specific condition related to the input data, and this condition is evaluated at each node as the data progresses through the tree.
To predict an outcome for a given input, the starting point is the root node of the tree (Figure 1), where the initial condition associated with the input feature(s) is evaluated. Depending on whether this condition is true or false, the corresponding branch is followed to the next node. This process is repeated recursively until a leaf node is reached. The leaf node holds a value that serves as the predicted result for the input instance; for regression tasks, this value is typically a numeric prediction.
As the tree is traversed, the input space undergoes changes. Initially, all instances are part of a single set represented by the root node. However, as the algorithm progresses, the input space is gradually divided into smaller subsets. These divisions are based on conditions that help in tailoring predictions to different regions within the input space.
The process of constructing regression trees involves determining the optimal split variable ("j") and split point ("s") to partition the input space effectively. These are chosen by minimizing a specific expression (Equation (1)) over all input features, with the goal of minimizing the sum of squared differences between observed and predicted values in the resulting regions [27,28,29].
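Equation (1) did not reproduce here; in the standard CART formulation, which the description above follows, the split criterion can be written as

$$\min_{j,\,s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right] \qquad (1)$$

where $R_1(j,s) = \{x \mid x_j \le s\}$ and $R_2(j,s) = \{x \mid x_j > s\}$ are the two resulting regions, and the inner minima are attained by taking $c_1$ and $c_2$ as the mean responses within the respective regions.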
Once “j” and “s” are identified, the tree-building process continues by iteratively dividing regions. This process is referred to as a “greedy approach” because it prioritizes local optimality at each step. The binary recursive segmentation approach divides the input space into non-overlapping regions characterized by their mean values.
The depth of a regression tree serves as a critical factor in preventing overfitting (too much detail) or underfitting (too simplistic).
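As a minimal illustration of this trade-off, the following sketch (Python/scikit-learn on synthetic data, not the MATLAB workflow used in this study; all values are illustrative) shows how increasing the minimum leaf size restricts tree complexity:

```python
# Minimal sketch: how the minimum leaf size controls regression-tree complexity.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.3, size=200)  # noisy nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaf in (1, 5, 10):  # larger leaves -> shallower, smoother trees
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, tree.predict(X_te)) ** 0.5
    print(f"min_samples_leaf={leaf:2d}  test RMSE = {rmse:.3f}")
```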
2.2. Ensembles of Regression Trees: Bagging, Random Forest, and Boosted Trees
Bagging is an ensemble method that creates multiple subsets of the training dataset through random sampling with replacement (bootstrap samples). Bootstrap sampling randomly selects data points from the original dataset with replacement, meaning that the same data point can be selected multiple times while others may not be selected at all; in this way, subsets of the same size as the original dataset are formed and used to train the models. Each subset is used to train a separate regression tree, and the trees' predictions are aggregated so that each subset contributes to the final prediction (Figure 2). By averaging the predictions of multiple models, bagging reduces variance and enhances robustness and predictive accuracy, making it particularly effective when the base models are unstable or prone to overfitting [27,28,29].
Random forests, a variant of bagging, stand out by introducing diversity among the constituent models within the ensemble. This diversity is achieved by training each regression tree on a distinct bootstrap sample of the data and, before each split within these trees, considering only a randomly selected subset of the available features. This approach decorrelates the individual trees within the ensemble, further reducing variance. The ensemble's final prediction is generated by aggregating the predictions of these decorrelated trees, resulting in a robust and high-performing model [26,27,28,29].
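The difference between bagging and a random forest thus reduces to whether each split may consider all features or only a random subset. A minimal sketch (scikit-learn as a stand-in for MATLAB's TreeBagger; data and hyperparameter values are illustrative):

```python
# Sketch: bagging vs. random forest on the same synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=14, noise=10.0, random_state=1)

# Bagging: every split may consider all 14 features.
bag = BaggingRegressor(estimator=DecisionTreeRegressor(),  # 'estimator' in sklearn >= 1.2
                       n_estimators=500, random_state=1)

# Random forest: each split considers a random subset of features,
# which decorrelates the trees and further reduces ensemble variance.
rf = RandomForestRegressor(n_estimators=500, max_features=8, random_state=1)

for name, model in [("bagging", bag), ("random forest", rf)]:
    r = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean 5-fold R^2 = {r:.3f}")
```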
The boosted tree (BT) method trains models sequentially, and within this paradigm, gradient boosting stands out as a widely employed technique for enhancing overall model performance (Figure 3). In gradient boosting, submodels are introduced iteratively, with each new model selected for its capacity to estimate the residuals, or errors, of the preceding model in the sequence. The distinctive feature of gradient boosting lies in its commitment to minimizing these residuals during the iterative process.
By focusing on minimizing residuals, gradient boosting ensures that each new submodel added to the ensemble is adept at correcting the errors left by its predecessors. This emphasis on addressing the shortcomings of prior models leads to the creation of a robust and adaptive ensemble model. The iterative nature of gradient boosting allows it to systematically refine its predictions, making the final ensemble proficient in capturing intricate patterns and nuances within the data. The result is a powerful model capable of delivering highly accurate predictions by continuously learning and adapting to the complexities present in the dataset.
The fundamental concept is rooted in gradient-based optimization: the current solution to an optimization problem is refined by adding a vector proportional to the negative gradient of the function being minimized [31,32,33]. This approach is logical because the negative gradient points in the direction in which the function decreases. When applied to a quadratic error function, each subsequent model aims to correct the discrepancies left by its predecessors, essentially reinforcing and improving the ensemble with a focus on the residual errors from earlier stages.
In the context of gradient-boosting trees, the learning rate is a crucial hyperparameter that controls the contribution of each tree in the ensemble to the final prediction. It is often denoted as “lambda” (λ). The learning rate determines how quickly or slowly the model adapts to the errors from the previous trees during the boosting process. A lower learning rate means that the model adjusts more gradually and may require a larger number of trees to achieve the same level of accuracy, while a larger learning rate leads to faster adaptation but may risk overfitting with too few trees.
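The interplay between the learning rate λ and the number of trees can be illustrated with a short sketch (scikit-learn's GradientBoostingRegressor on synthetic data; the λ values mirror the range analyzed in Section 4, everything else is illustrative):

```python
# Sketch of the learning-rate (lambda) / number-of-trees trade-off
# in gradient boosting; synthetic data, illustrative settings.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=14, noise=10.0, random_state=2)

for lam in (0.001, 0.01, 0.1, 0.5, 1.0):
    gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=lam,
                                    max_depth=3, random_state=2)
    r = cross_val_score(gbt, X, y, cv=5, scoring="r2").mean()
    print(f"lambda = {lam:<5}  mean 5-fold R^2 = {r:.3f}")
```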
In terms of computational complexity, BT models are significantly more demanding because their trees are trained sequentially, whereas RT and RF models can be trained in parallel.
2.3. Gaussian Process Regression (GPR)
Gaussian processes provide a probabilistic framework for modeling functions, capturing uncertainties, and making predictions in regression tasks. The choice of covariance functions and hyperparameters allows for flexibility in modeling relationships among variables [34].
Gaussian process modeling involves estimating an unknown function f(∙) in nonlinear regression problems. It assumes that this function follows a Gaussian distribution characterized by a mean function μ(∙) and a covariance function k(∙,∙). The covariance matrix K is a fundamental component of GPR and is determined by the kernel function (k).
The kernel function (k) plays a pivotal role in capturing the relationships between input data points (x and x′). This function is essential for quantifying the covariance or similarity between the random values f(x) and f(x′). One of the most widely used kernels, the squared-exponential kernel, is defined by the following expression:

$$k(x, x') = \sigma_f^2 \exp\left( -\frac{(x - x')^2}{2 l^2} \right)$$

In this expression, several elements are critical:
$\sigma_f^2$ represents the signal variance, a model parameter that quantifies the overall variability or magnitude of the function values.
The exponential function "exp" models the similarity between x and x′. It decreases as the difference between x and x′ increases, capturing the idea that values close to each other are more strongly correlated.
The parameter l, known as the length scale, is another model parameter. It controls the smoothness and spatial extent of the correlation. A smaller l results in more rapid changes in the function, while a larger l leads to smoother variations.
The observations in a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ can be viewed as a sample from a multivariate Gaussian distribution. Gaussian processes are employed to model the relationship between input variables x and target variables y, considering the presence of additive noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$. The goal is to estimate the unknown function f(∙). The observations y are treated as a sample from a multivariate Gaussian distribution with mean vector μ and covariance matrix K, which captures the relationships among the data points. The conditional distribution of a test point's response value $y_*$, given the observed data $(X, y)$, is again Gaussian, with the following mean and variance:

$$\bar{y}_* = k_*^{\top} (K + \sigma_n^2 I)^{-1} y, \qquad \operatorname{var}(y_*) = k(x_*, x_*) - k_*^{\top} (K + \sigma_n^2 I)^{-1} k_* + \sigma_n^2$$

where $k_*$ denotes the vector of covariances between the test point $x_*$ and the training inputs.
In traditional GPR, a single length-scale parameter (l) and signal variance ($\sigma_f^2$) are used for all input dimensions. In contrast, the Automatic Relevance Determination (ARD) approach employs a separate length-scale parameter ($l_i$) for each input dimension, where 'i' denotes a specific dimension. This means that for a dataset with 'm' input dimensions, there are 'm' individual length-scale parameters [34].
The key advantage of ARD is that it automatically determines the relevance of each input dimension in the modeling process. By allowing each dimension to have its own length scale parameter, the model can assign different degrees of importance to each dimension. This means that the model can adapt and focus more on dimensions that are more relevant to the target variable and be less influenced by less relevant dimensions.
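A minimal GPR sketch with an ARD kernel (scikit-learn; synthetic data and values are illustrative). Note that the kernel hyperparameters are fitted by maximizing the log marginal likelihood, consistent with the procedure described in Section 4:

```python
# Sketch: GPR with an ARD kernel (one length scale per input dimension).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

X, y = make_regression(n_samples=150, n_features=3, noise=5.0, random_state=3)

# A vector length_scale turns the RBF kernel into its ARD variant;
# WhiteKernel models the additive noise variance sigma_n^2.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel(1.0)

gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

print("optimized kernel:", gpr.kernel_)          # fitted per-dimension length scales
mean, std = gpr.predict(X[:3], return_std=True)  # predictions with uncertainty
print("predictive mean:", mean)
print("predictive std: ", std)
```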
GPR involves matrix operations, and the computational complexity can become an issue for large datasets. Techniques such as sparse approximations or using specialized kernels can be employed to address these computational challenges. GPR is frequently used in Bayesian optimization problems where the goal is to optimize an unknown objective function that is expensive to evaluate.
3. Case Study of the Caspian Sea Basin
This study took place in the Caspian Sea catchment area in Northern Iran (Figure 4), covering approximately 618 km², with coordinates ranging from 49°48′ to 54°41′ E longitude and from 35°36′ to 37°19′ N latitude. The majority of this area, approximately 65.10%, is forested, while the rest consists of rangelands (24.41%), agricultural land (9.41%), urban areas (0.88%), water bodies (0.0126%), and bare land (0.186%) [35].
Initially, 108 water quality monitoring stations scattered across the southern basin of the Caspian Sea were selected for analysis (Figure 4). To define the upstream catchment boundaries, digital elevation models (DEMs) with a resolution of 30 m by 30 m from the USGS database were used, with boundary refinement achieved through manual digitizing. Macro-sized catchments, i.e., the 18 catchments exceeding 1000 square kilometers, were excluded from the modeling process due to their significant impact on hydrological dynamics.
Water quality data, including parameters such as SAR, Na⁺, Mg²⁺, Ca²⁺, SO₄²⁻, Cl⁻, HCO₃⁻, K⁺, pH, EC, and TDS, were obtained from the Iran Water Resource Management Company (WRMC) through monthly sampling. Collection adhered to the WRMC Guidelines for Surface Water Quality Monitoring (2009) and EPA-841-B-97-003 standards [36]. For statistical analysis, the 5-year means (1998–2002) of the water quality data were calculated. After scrutinizing for normality and identifying outliers, 88 stations were retained for the study. The geographic scope of the study area is illustrated in Figure 4.
A land cover dataset was created using a 2002 digital land cover map (scale 1:250,000) from the Forest, Ranges, and Watershed Management Organization of Iran. The original land cover categories were consolidated into six classes (bare land, water bodies, urban areas, agriculture, rangeland, and forests), following the land use and land cover classification system of [37]. Furthermore, digital geological and soil feature maps (1:250,000 scale) were obtained from the Geological Survey of Iran (www.gsi.ir, accessed on 24 April 2021). Detailed information about the characteristics of the catchments and their statistical attributes can be found in Table 1 and Table 2.
In this study, hydrologic soil groups and geological permeability classes were developed and applied in conjunction with land use/land cover types within the modeling process. Hydrologic soil groups reflect runoff potential and can be used to determine runoff curve numbers. They consist of four classes (A, B, C, and D), with A having the lowest runoff potential and D the highest. Notably, soil profiles can undergo significant alterations due to changes in land use/land cover; in such cases, the soil textures of the new surface soil can be employed to determine the hydrologic soil groups, as described in Table 1 [38]. Furthermore, the study incorporates geological permeability attributes of the catchments through three geological permeability classes: Low, Medium, and High. These classes are associated with various characteristics of geological formations, such as effective porosity, cavity type and size and their connectivity, rock density, pressure gradient, and fluid properties such as viscosity.
The range and statistical properties of the training and test data play a fundamental role in the development and evaluation of machine learning models. They affect the model's generalization, robustness, fairness, and ability to perform effectively in diverse real-world scenarios. The statistical properties of the input and output data are given in Table 1 and Table 2.
The machine learning methods used in this paper were assessed using five-fold cross-validation. In this approach, the dataset was randomly divided into five subsets, with four of them dedicated to training the model and the remaining subset utilized for model validation (testing). This five-fold cross-validation process was repeated five times, ensuring that each subset was used exactly once for validation. Subsequently, the results from these five repetitions were averaged to produce a single estimation.
All models were trained and tested under identical conditions, ensuring a fair and consistent evaluation of their performance. This practice is essential in machine learning to provide a level playing field for comparing different algorithms and models.
When machine learning models are trained and tested under equal conditions, it means that they are exposed to the same datasets, preprocessing steps, and evaluation metrics.
The quality of the model was assessed using several evaluation and performance measures, which include RMSE, MAE, Pearson’s Linear Correlation Coefficient (R), and MAPE.
The RMSE criterion, expressed in the same units as the target values, serves as a measure of the model's general accuracy. It is calculated as the square root of the average squared difference between the actual values $y_i$ and the model's predictions $\hat{y}_i$ across the N samples:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

The MAE criterion represents the mean absolute error of the model, emphasizing absolute accuracy. It calculates the average absolute difference between the actual values and the model's predictions:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$

Pearson's linear correlation coefficient (R) provides a relative measure of accuracy. It quantifies the correlation between the actual values $y_i$ and the predictions $\hat{y}_i$ relative to their respective means $\bar{y}$ and $\bar{\hat{y}}$:

$$R = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}$$

Values of R greater than 0.75 indicate a strong correlation between the variables.

The MAPE is a relative criterion that evaluates accuracy as the average percentage difference between the actual values and the model's predictions:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
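For reference, the four criteria can be computed directly (a NumPy sketch; MAPE assumes nonzero actual values, and the data below are illustrative):

```python
# Sketch implementations of the four evaluation criteria.
import numpy as np

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))

def pearson_r(y, yhat):
    return float(np.corrcoef(y, yhat)[0, 1])  # correlation about the respective means

def mape(y, yhat):
    return float(np.mean(np.abs((y - yhat) / y)))  # assumes no zero actual values

y_true = np.array([1.2, 2.4, 3.1, 4.8])
y_pred = np.array([1.0, 2.6, 3.0, 5.1])
print(rmse(y_true, y_pred), mae(y_true, y_pred),
      pearson_r(y_true, y_pred), mape(y_true, y_pred))
```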
This research deals with a limited dataset, which carries a higher risk of overfitting, where a model performs well on the training data but fails to generalize to new, unseen data. Five-fold cross-validation helps mitigate overfitting by partitioning the dataset into five subsets, using four for training and one for testing in each iteration. This process allows for a more robust evaluation of the model's performance.
Five-fold cross-validation efficiently utilizes the available data by rotating through different subsets for training and testing, ensuring that each data point contributes to training and evaluation.
Moreover, cross-validation provides a more robust estimate of the model's performance by averaging the evaluation metrics across multiple folds. This helps ensure that our results are not overly dependent on a particular random split of the data. Additionally, cross-validation allows us to iteratively train and evaluate the model on different subsets, aiding in the fine-tuning of hyperparameters and ensuring that the model's performance is consistently reliable (Figure 5).
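A sketch of this protocol (Python/scikit-learn; the synthetic dataset is sized like the 88 stations used here, but is otherwise illustrative):

```python
# Sketch of the five-fold cross-validation protocol: five random folds,
# four used for training and one for testing, results averaged.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Synthetic stand-in sized like the 88 stations used in this study.
X, y = make_regression(n_samples=88, n_features=14, noise=5.0, random_state=4)

rmses = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=4).split(X):
    model = RandomForestRegressor(n_estimators=500, random_state=4)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)

print("RMSE per fold:", np.round(rmses, 2), " mean:", round(float(np.mean(rmses)), 2))
```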
4. Results
The paper analyzes the application of regression trees, bagging, RF, gradient boosting, and Gaussian process regression models using a systematic approach (Figure 5). For each model, the hyperparameters were varied over an appropriate range, and the optimal values were determined using a grid-search method (a sketch of this procedure is given after the hyperparameter lists below). The following values were analyzed:
Regression tree (RT) model: The minimum leaf size (the amount of data per terminal leaf) was varied, since tree depth is crucial in preventing overfitting, which occurs when the tree becomes too detailed and fits the training data too closely, as well as underfitting, which happens when the tree is too simplistic and fails to capture the underlying patterns.
TreeBagger (TB) model
Number of generated trees (B): values up to a maximum of 500 trees were investigated, with 100 being the standard setting in Matlab. The bootstrap aggregation method was employed, generating a specific number of samples in each iteration.
Minimum amount of data/samples per tree leaf: Analyzed values ranging from 2 to 15 samples per leaf, with a step size of 1 sample. The standard setting in Matlab is 5 samples per leaf for regression, but here, a broader range was examined to assess its impact on model generalization.
Boosted tree model:
Number of generated trees (B): analyzed within a range of 1–100 trees.
Learning rate (λ): explored a range, including 0.001, 0.01, 0.1, 0.5, 0.75, and 1.0.
Tree splitting levels (d): analyzed exponentially from 1 (a decision stump) to 2⁷ = 128.
Gaussian process regression (GPR) model
With the GPR method, different kernel functions were explored: exponential, squared exponential, Matérn 3/2, Matérn 5/2, and rational quadratic, together with their ARD variants (ARD exponential, ARD squared exponential, ARD Matérn 3/2, ARD Matérn 5/2, and ARD rational quadratic).
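A compact sketch of the grid search referred to above, shown for the RF hyperparameters (scikit-learn as a stand-in for the MATLAB implementation; the ranges follow the lists above, the data are synthetic):

```python
# Sketch: grid search over RF hyperparameters with five-fold CV.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=88, n_features=14, noise=5.0, random_state=5)

param_grid = {
    "n_estimators": [100, 300, 500],         # number of generated trees B
    "max_features": [4, 6, 8, 11, 14],       # size of the splitting subset
    "min_samples_leaf": list(range(1, 16)),  # samples per terminal leaf (2-15 in the study)
}

search = GridSearchCV(RandomForestRegressor(random_state=5), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error")
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```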
All models were first compared in terms of the mean square error, and the optimal model among those analyzed was then evaluated on the test data using the RMSE, MAE, MAPE, and R criteria.
The paper illustrates the detailed procedure for determining the optimal model for the prediction of the SAR parameter, while for all other parameters the procedure is presented in a more concise form. The accompanying results for the other models can be found in Appendix A (Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9 and Table A10).
4.1. Prediction of SAR Parameter Values
Table 3 illustrates the impact of the minimum leaf size on the accuracy of the regression tree (RT) model, showing how changing the minimum leaf size affects the key performance metrics RMSE, MAE, MAPE, and R. The leaf size is almost positively associated with the RMSE and inversely correlated with the R values. In this particular case, models with less complexity, i.e., with 10 samples per terminal leaf, achieved higher accuracy (Figure 6).
The application of the TB and RF models was analyzed simultaneously (Figure 7), which shows the dependence of the achieved model accuracy on the hyperparameter values. The TB model represents the borderline case of the RF model in which all variables are considered as split candidates. Among the optimal models in this group, the RF model with 500 generated trees proved to be the best: the model using a subset of eight variables with a minimum of one sample per terminal leaf has higher accuracy according to the RMSE and R criteria, while the model using a subset of six variables with a minimum of one sample per terminal leaf has higher accuracy according to the MAE and MAPE criteria (Table 4). Optimal values according to the different accuracy criteria are marked in bold in Table 4.
With the BT model (Figure 8), the highest accuracy was obtained with complex models with a large number of branches. The optimal model had a structure with 32 branches and a learning rate of 0.01.
The optimal parameter values for the applied models with different kernel functions were obtained using the marginal likelihood (Table 5 and Table 6).
The marginal likelihood is a function of the observed data (y(X)) and the model parameters {l, σ_f, σ_n}, and the model parameters are determined by maximizing this function. Importantly, taking the logarithm of the marginal likelihood yields the same optimum as optimizing the original likelihood. Therefore, model parameter optimization is typically carried out by applying gradient-based procedures to the log marginal likelihood expression, simplifying the optimization process without altering the final outcome. The comparative results of the implemented ML models are presented in Table 7, where optimal values according to the different accuracy criteria are marked in bold.
The values of all accuracy criteria on the test data set are shown in Table 7. According to the RMSE and R criteria, the RF model had the highest accuracy (it uses a subset of eight variables for splitting, with one sample per terminal leaf), while according to the MAE and MAPE criteria, the GPR model with an exponential kernel function stood out as the most accurate. On the optimal RF model, the significance of each input variable was determined by permuting the values of the considered variable within the training data and recalculating the out-of-bag error for the permuted data. The significance of the variable (Figure 9) is then determined by calculating the mean difference in error before and after permutation, divided by the standard deviation of these differences. Variables with higher values are ranked as more significant than those with smaller values.
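A sketch of this permutation-importance procedure (scikit-learn's permutation_importance scores a held-out set rather than the out-of-bag samples used here, so it is an analogue of the procedure, not a reproduction; data are synthetic):

```python
# Sketch of permutation-based variable importance.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=88, n_features=14, noise=5.0, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

rf = RandomForestRegressor(n_estimators=500, max_features=8,
                           random_state=6).fit(X_tr, y_tr)

res = permutation_importance(rf, X_te, y_te, n_repeats=30, random_state=6)
# Mean importance divided by its standard deviation, mirroring the ranking rule above.
scores = res.importances_mean / (res.importances_std + 1e-12)
print("variable ranking scores:", scores.round(2))
```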
4.2. Prediction of Na⁺ Parameter Values
RF models proved to be the optimal models for predicting sodium ion (Na⁺) concentrations; an accuracy analysis of all models is given in Appendix A (Table A1). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 10. Based on the defined accuracy criteria, four models with the following criteria values were selected (Table 8).
The “Weighted Sum Model”, or “Simple Multi-Criteria Ranking”, method was used to select the optimal model. For the minimization objectives (RMSE, MAE, MAPE), min-max normalization is applied, and for the maximization objective (R), max-min normalization is applied, ensuring that all metrics are on the same scale. Equal weights are assigned to the normalized evaluation metrics to indicate their relative importance in the decision-making process. The weighted sum method calculates an aggregated value for each model that considers all four normalized metrics, and the models are ranked by their aggregated values, with a lower aggregated value indicating better overall performance (Table 9).
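A small sketch of this ranking step (NumPy; the metric values are illustrative, not those of Table 8):

```python
# Sketch of the weighted-sum (simple multi-criteria) ranking: min-max
# normalize each metric, invert R (which is maximized), then average
# with equal weights.
import numpy as np

# rows = candidate models; columns = RMSE, MAE, MAPE, R
M = np.array([
    [0.52, 0.41, 0.19, 0.78],
    [0.55, 0.39, 0.18, 0.76],
    [0.50, 0.43, 0.21, 0.79],
    [0.53, 0.40, 0.20, 0.77],
])

norm = (M - M.min(axis=0)) / (M.max(axis=0) - M.min(axis=0))  # min-max per column
norm[:, 3] = 1.0 - norm[:, 3]      # R is a maximization objective, so invert it
weights = np.full(4, 0.25)         # equal weights for the four criteria
agg = norm @ weights               # aggregated value per model
print("aggregated values:", agg.round(3))
print("best model index:", int(agg.argmin()))  # lower aggregated value = better
```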
The optimal model was the RF model with 500 trees, which uses a subset of 11 variables with a minimum of six samples per leaf. The assessment of the significance of the individual input variables for prediction accuracy was performed on this model with the highest accuracy (Figure 11).
4.3. Prediction of Magnesium (Mg²⁺) Parameter Values
RF models proved to be the optimal models for predicting magnesium ion (Mg²⁺) concentrations. An accuracy analysis of all models is given in Appendix A (Table A2). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 12. Based on the defined accuracy criteria, three models with the following values were selected (Table 10). Optimal values according to the different accuracy criteria are marked in bold in Table 10.
“Simple Multi-Criteria Ranking” was applied again to select the optimal model (Table 11).
The optimal model was the RF model with 500 trees, which uses a subset of 12 variables with a minimum of one sample per leaf. The assessment of the importance of the individual input variables for prediction accuracy was performed on this model with the highest accuracy (Figure 13).
4.4. Prediction of Ca²⁺ Parameter Values
RF models proved to be the optimal models for predicting calcium ion (Ca²⁺) concentrations. An accuracy analysis of all models is given in Appendix A (Table A3). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 14. According to all the defined accuracy criteria, a single model stood out, with RMSE, MAE, MAPE, and R values of 0.5847, 0.4500, 0.2007, and 0.7496, respectively.
The assessment of the significance of the individual input variables for prediction accuracy was performed on this model with the highest accuracy (Figure 15).
4.5. Prediction of SO₄²⁻ Parameter Values
The RF models proved to be the optimal models for predicting SO₄²⁻ levels. An accuracy analysis of all models is given in Appendix A (Table A4). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 16.
According to all the defined accuracy criteria, a single model was singled out, with RMSE, MAE, MAPE, and R values of 0.5526, 0.3122, 0.5050, and 0.7381, respectively.
The assessment of the significance of the individual input variables for prediction accuracy was performed directly on this model with the highest accuracy (Figure 17).
4.6. Prediction of Cl− Parameter Values
RF models proved to be the optimal models for predicting Cl⁻ concentrations. An accuracy analysis of all models is given in Appendix A (Table A5). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 18. Based on the defined accuracy criteria, three models were selected (Table 12); optimal values according to the different accuracy criteria are marked in bold in Table 12.
The optimal model was the RF model with 500 trees, which uses a subset of 11 variables with a minimum of five samples per leaf (Table 13).
The assessment of the importance of the individual input variables for prediction accuracy was performed on this model with the highest accuracy (Figure 19).
4.7. Prediction of HCO₃⁻ Parameter Values
GPR models proved to be the optimal models for predicting HCO₃⁻ concentrations, with RF models giving very similar accuracy. However, since the difference between the GPR and RF models is practically negligible, and since the significance of the individual input variables cannot be obtained from the selected GPR model because it uses the same length-scale parameter for all variables, the RF models were used for the analysis. An accuracy analysis of all models is given in Appendix A (Table A6).
The dependence of the adopted accuracy criteria on the RF model parameters is shown in Figure 20.
In this case, two RF models were distinguished: a model using a subset of ten variables with one sample per terminal leaf, which is optimal according to the RMSE, MAE, and MAPE criteria, and a model using a subset of 13 variables with two samples per terminal leaf, which is optimal according to the R criterion. Since the first model is optimal according to three of the adopted accuracy criteria (RMSE, MAE, and MAPE) and its shortfall in the R criterion is practically negligible, it can be considered the optimal model. It has RMSE, MAE, MAPE, and R values of 0.5174, 0.4252, 0.1822, and 0.7721, respectively.
The assessment of the importance of the individual input variables for prediction accuracy was performed on this model with the highest accuracy (Figure 21).
4.8. Prediction of K⁺ Parameter Values
The RF models proved to be the optimal models for predicting K⁺ levels. An accuracy analysis of all models is given in Appendix A (Table A7). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 22.
In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Table 14 and Table 15). Optimal values according to the different accuracy criteria are marked in bold in Table 14.
The analysis of the significance of the individual input variables was performed on the optimal RF model, whose hyperparameter values for the number of trees, the subset of splitting variables, and the amount of data per terminal leaf are 500, 12, and 4, respectively (Table 15). The assessment of the importance of the individual input variables was performed on this model with the highest accuracy (Figure 23).
4.9. Prediction of pH Parameter Values
The RF models proved to be the optimal models for predicting pH levels. An accuracy analysis of all models is given in Appendix A (Table A8). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 24.
In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Table 16 and Table 17). Optimal values according to the different accuracy criteria are marked in bold in Table 16. Using the weighted sum method, an aggregated value that takes into account all four normalized metrics was calculated for each model.
The analysis of the significance of the individual input variables was performed on the optimal RF model and is shown in Figure 25.
4.10. Prediction of EC Parameter Values
RF models proved to be the optimal models for EC parameter prediction. An accuracy analysis of all models is given in Appendix A (Table A9). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 26.
According to all the defined accuracy criteria, a single model was singled out, with RMSE, MAE, MAPE, and R values of 271.5346, 149.3192, 0.2779, and 0.7665, respectively.
The obtained hyperparameter values for the number of trees, the subset of splitting variables, and the minimum amount of data per leaf are 500, 6, and 1, respectively. The analysis of the significance of the individual input variables was performed on the optimal RF model and is shown in Figure 27.
4.11. Prediction of TDS Parameter Values
The RF models proved to be the optimal models for predicting TDS levels. An accuracy analysis of all models is given in Appendix A (Table A10). The dependence of the adopted accuracy criteria on the model parameters is shown in Figure 28.
In terms of accuracy, three models were singled out, and the optimal model was obtained by applying the Simple Multi-Criteria Ranking method (Table 18 and Table 19). Optimal values according to the different accuracy criteria are marked in bold in Table 18.
The analysis of the significance of the individual input variables was performed on the optimal RF model and is shown in Figure 29.
5. Discussion
In our research, most models demonstrated satisfactory accuracy, meeting the predefined criteria, although a subset of models fell short on specific criteria. To gauge accuracy effectively, we relied on relative metrics, notably the correlation coefficient (R) and the mean absolute percentage error (MAPE), as they offer more insightful perspectives than absolute criteria such as RMSE and MAE (Table 20).
Table 20 highlights the accuracy of the machine learning models in predicting individual water parameters. Notably, the RF model emerged as the best performer across various parameters, underscoring its efficacy.
Analyzing the R values reveals the overall satisfactory performance of most models, except for the pH prediction model. Examining the MAPE values identified five models—SAR, Na⁺, SO₄²⁻, Cl⁻, and TDS—for which this metric is relatively higher than for the others. Despite these nuances, our primary research focus was unraveling the significance of the individual input variables within the constraints of limited data.
When we delve into the significance of the individual input variables, our conclusions (Table 21) unveil the following crucial insights:
Forest Cover (‘F’): Forest areas significantly influence diverse water quality parameters. Trees and vegetation in forests contribute organic matter to water bodies, influencing ion concentrations. The root systems of trees can affect the uptake of certain ions. Forests strongly impact the concentrations of sodium, magnesium, calcium, chloride, sulfate, bicarbonate, and potassium ions. Also, forests act as natural filters, reducing the transport of sediments and pollutants into water bodies. Cleaner water, with fewer suspended solids, tends to have lower TDS and EC. Additionally, forest areas often have minimal human activities compared to urban or agricultural areas.
Rangeland (RL) is essential for predicting water sulfate ion concentrations. This suggests that the characteristics associated with the rangeland, such as land cover and land use patterns, significantly influence sulfate levels. Additionally, rangeland strongly affects SAR by influencing sodium concentrations, vital for evaluating water’s suitability for irrigation and soil health. Also, the notable impact on magnesium levels showcases rangeland’s role in shaping water quality. Rangeland’s influence on pH highlights its role in determining water acidity or alkalinity, which is crucial for aquatic ecosystems and nutrient availability. Additionally, rangeland significantly influences electrical conductivity, providing insights into water quality and dissolved ion content, essential for understanding overall water composition. While having a somewhat lesser impact, rangeland still plays a discernible role in shaping sodium concentrations, contributing to insights into water salinity and its ecological implications.
Urban Area (‘UA’): Urban areas have a moderate impact on the magnesium, chloride, bicarbonate, and SAR parameters, owing to urbanization and land use changes that introduce contaminants and alter water chemistry. Their impact on the calcium, sulfate, and EC parameters is smaller.
The Agricultural Area (AA) substantially impacts potassium, SAR, and sodium, with a moderate impact on calcium, TDS, and magnesium. The influence of AA on these parameters can be explained by the agricultural areas’ use of potassium-containing fertilizers, leading to elevated potassium concentrations in water. Cultivation practices and nutrient management contribute to increased potassium levels. Additionally, agricultural activities often involve irrigation, and water with high sodium content can increase SAR. Sodium in the soil can be introduced through irrigation water, affecting sodium levels in the water. Moreover, agricultural runoff can introduce calcium, magnesium, and other dissolved solids into water sources.
Catchment Area (‘CA’): The size of catchment areas plays a moderate role in ion transport, particularly affecting SAR, sodium, bicarbonate, calcium, and sulfate levels. The size of the catchment area could moderately impact SAR, as larger areas may interact with more diverse geological and soil features, affecting sodium adsorption ratios.
Considering different soil types (HSGA, HSGB, HSGC, HSGD) and geological permeability (GHGM, GHGN, GHGT) underscores their impact on ion retention and release. Sandy soils facilitate easier ion movement, while clayey soils retain ions. Geological permeability influences potassium, magnesium, calcium, and bicarbonate levels, showcasing the interconnectedness of soil and geological characteristics with water parameters.
6. Conclusions
Our study demonstrates the effectiveness of machine learning methods in predicting and assessing water quality parameters within a catchment area. The Random Forest (RF) model was the standout performer, providing a robust tool for efficient and accurate water quality evaluation.
While certain models fall short on specific criteria, a nuanced evaluation leveraging relative criteria such as the correlation coefficient (R) and the mean absolute percentage error (MAPE) underscores the overall robustness of the predictive models. Table 20 encapsulates the detailed results, highlighting the efficacy of the RF model across various water parameters. Evaluation of the R values showcases the satisfactory performance of all models except for pH prediction. Despite marginally elevated MAPE values in five models (SAR, Na⁺, SO₄²⁻, Cl⁻, TDS), the core research objective—unraveling the importance of individual input variables within data constraints—was largely achieved.
This accomplishment paves the way for selecting and implementing optimal models from a broader ML spectrum. To further elevate model accuracy, future research will focus on dataset expansion, a strategic initiative to address current limitations and achieve heightened accuracy, particularly in parameters exhibiting slight deviations.
The significance of the individual input variables, as outlined in Table 21, provides crucial insights into their roles in influencing water parameters. Forest cover, catchment area characteristics, stream order, barren land, and urban areas are pivotal factors shaping water quality.
Incorporating these research insights into decision-making processes presents transformative opportunities for strategic resource allocation and environmental impact mitigation. Furthermore, integrating these outcomes empowers decision-makers to adopt targeted strategies for fostering environmental sustainability, contributing to the broader goal of cultivating resilient water ecosystems. This integration signifies a practical pathway toward achieving a delicate balance between human activities and environmental preservation, actively contributing to sustainable water ecosystems.