1. Introduction
Salt-affected soils occur mainly in arid and semiarid regions and comprise saline and/or sodic soils. Saline soils contain significant amounts of soluble salts, which consist of major ions such as sodium (Na⁺), potassium (K⁺), calcium (Ca²⁺), magnesium (Mg²⁺), bicarbonate (HCO₃⁻), chloride (Cl⁻), carbonate (CO₃²⁻), and sulfate (SO₄²⁻). Sodic soils have an excess of exchangeable Na⁺ in the cation exchange complex, as well as in the soil solution. Soluble salts and Na⁺ normally either originate from natural processes such as weathering (primary salinity/sodicity) or are induced by human activities such as the inappropriate management of land and water resources (secondary salinity/sodicity). Soil salinity negatively affects root growth and crop yield through the osmotic effect caused by the high concentration of soluble salts. Soil sodicity causes adverse effects such as an increase in soil pH, loss of soil physical structure (clay dispersion, swelling, and plugging of soil pores), and the deterioration of soil–water relations (decreases in infiltration, hydraulic conductivity, retention, and drainage), leading to soil erosion, crusting, compaction, runoff, waterlogging, nutrient imbalances, and specific ion effects on plants [1,2,3,4,5,6,7].
Salinity levels can be expressed as total soluble salts (TSS) or as the soil electrical conductivity (EC) of a saturated extract (ECe) or of soil–water suspensions. Sodicity levels are usually determined as the exchangeable sodium percentage (ESP), that is, the amount of exchangeable Na⁺ as a proportion of either the cation exchange capacity (CEC) or the sum of exchangeable cations [4,8], or as the sodium adsorption ratio (SAR), calculated from the soluble Na⁺ relative to the soluble Ca²⁺ + Mg²⁺ concentrations in the soil solution using the formula proposed by Richards et al. [9]. The widely used salt-affected soil classification from the US Salinity Laboratory (USSL), based on threshold values of a soil ECe of 4 dS m⁻¹, an ESP of 15%, and a pH of 8.5, generates four classes: normal, saline, saline–sodic, and sodic soil. The Australian classification is analogous to the USSL criteria, except that it uses a soil ESP threshold of 6% and takes the pH levels into account [10]. Furthermore, the presence of neutral versus alkali salts determines the distinction between sodicity and alkalinity, so alkali soils normally have an excess of exchangeable Na⁺ and carbonates in addition to a pH above 8 [11]. On that basis, Chhabra et al. [12] proposed an alternative classification that includes the ion ratios (2CO₃²⁻ + HCO₃⁻)/(Cl⁻ + 2SO₄²⁻) and Na⁺/(Cl⁻ + 2SO₄²⁻), with concentrations expressed in mol m⁻³, in addition to soil ECe and ESP, to facilitate the specific management and reclamation of salt-affected soils.
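The sodicity indices and the USSL thresholds described above can be illustrated with a minimal Python sketch. It assumes the commonly cited form of the SAR formula, SAR = Na⁺/√((Ca²⁺ + Mg²⁺)/2), with soluble concentrations in mmolc L⁻¹, and ESP = 100 × exchangeable Na⁺/CEC. The function names, the sample values, the treatment of a value exactly at a threshold (counted here as exceeding it), and the omission of the pH criterion are assumptions made for illustration, not the procedure used in this study.

```python
import math


def sar(na, ca, mg):
    """Sodium adsorption ratio from soluble Na+, Ca2+, Mg2+ (assumed in mmolc/L)."""
    return na / math.sqrt((ca + mg) / 2.0)


def esp(exch_na, cec):
    """Exchangeable sodium percentage; both arguments in the same units (e.g. cmolc/kg)."""
    return 100.0 * exch_na / cec


def ussl_class(ece, esp_value, ec_threshold=4.0, esp_threshold=15.0):
    """Four-class USSL scheme from ECe (dS/m) and ESP (%).

    Assumptions: a value equal to a threshold counts as exceeding it,
    and the pH criterion (8.5) is not used.
    """
    saline = ece >= ec_threshold
    sodic = esp_value >= esp_threshold
    if saline and sodic:
        return "saline-sodic"
    if saline:
        return "saline"
    if sodic:
        return "sodic"
    return "normal"


# Hypothetical sample values, for illustration only
print(round(sar(na=20.0, ca=5.0, mg=3.0), 2))   # 10.0
print(round(esp(exch_na=3.0, cec=15.0), 1))     # 20.0
print(ussl_class(ece=2.5, esp_value=20.0))      # sodic
```

Passing `esp_threshold=6.0` would reproduce only the ESP part of the Australian criteria mentioned above; the pH component of that scheme is not modelled here.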
Data mining can be described as the identification of patterns in data to establish relationships and models through data analysis, and machine learning (ML) is a process in which a system learns from experience and improves itself based on the resulting information. Supervised learning, in turn, models the relationships and dependencies between the target output and the input features in order to predict the output values for new data. Partial least squares-discriminant analysis (PLS-DA) is a ‘supervised’ version of principal component analysis (PCA) that achieves dimensionality reduction with full awareness of the classes, arriving at a linear transformation that converts the data to a lower-dimensional space with as small an error as possible [13]. In addition, PLS regression combines features of PCA and multiple regression, reducing the dimensionality while focusing on the covariance between predictors and response. Support vector machines (SVM) seek a decision surface (hyperplane) that separates the classes with the widest possible margin; the hyperplane is defined by the support vectors, the training samples closest to it. An SVM with a linear kernel function therefore fits an optimal hyperplane between the classes and is suited to small, linearly separable samples [14], while support vector regression fits the hyperplane so that as many points as possible lie within its margin. Breiman and Cutler’s Random Forests (RF) algorithm is a tree-based ensemble that builds trees on resampled subsets of the data, with each tree depending on a collection of random variables. RF combines the trees by unweighted voting and chooses the class with the most votes if the response is categorical, or combines them by unweighted averaging if the response is continuous [15,16].
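The aggregation step of RF described above can be sketched in a few lines of Python. The per-tree predictions below are hypothetical placeholders rather than outputs of fitted trees, and the function names are illustrative. Ties in the vote are resolved in favour of the class encountered first, which is an arbitrary choice of this sketch.

```python
from collections import Counter
from statistics import mean


def rf_vote(tree_predictions):
    """Unweighted majority vote over the class predicted by each tree."""
    counts = Counter(tree_predictions)
    # most_common(1) returns the class with the highest count;
    # among tied classes, the one encountered first wins.
    return counts.most_common(1)[0][0]


def rf_average(tree_predictions):
    """Unweighted average of the value predicted by each tree."""
    return mean(tree_predictions)


# Hypothetical predictions of five trees for one soil sample
class_votes = ["saline", "saline-sodic", "saline", "normal", "saline"]
ece_values = [4.0, 5.0, 4.5, 3.5, 5.5]  # dS/m

print(rf_vote(class_votes))     # saline
print(rf_average(ece_values))   # 4.5
```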
ML methods have been used to classify soils based on various features, such as chemical, physical, and biological variables, as well as on specific criteria. Within the framework of ML algorithms, many methods have been progressively developed to automate the soil classification process, such as Decision Trees, k-Nearest Neighbors, Artificial Neural Networks, and SVM [17]; in that context, several investigations have classified various soil types using ML methods [18,19,20,21]. The review on ML and soil sciences by Padarian et al. [22] shows that the modelling of continuous and categorical soil properties is based on their relationships with environmental covariates and is mainly focused on mapping. Key findings of the compilation by Motia and Reddy [23] were that more ML methods have been applied to soil classification than to soil regression; ML still contributes little to the assessment of soil salinity; SVM and RF techniques are widely used in ML predictions of soil parameters and classifications; and the root mean square error (RMSE) and the coefficient of determination (R²) are the top metrics used for the performance evaluation of ML prediction models in soil analysis.
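For reference, the two metrics named above can be computed as follows. This sketch assumes the usual definitions, RMSE = √(mean((y − ŷ)²)) and R² = 1 − SS_res/SS_tot. Some software reports R² as the squared correlation between observed and predicted values instead, and the text does not state which variant the reviewed studies used. The sample numbers are made up.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted ECe values (dS/m)
obs = [2.0, 4.0, 6.0, 8.0]
pred = [3.0, 3.0, 7.0, 7.0]

print(round(rmse(obs, pred), 3))       # 1.0
print(round(r_squared(obs, pred), 3))  # 0.8
```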
Apart from simple/multivariate regression-based models, most studies that use ML methods to predict and map salinity rely on variables from remote sensing (spectral bands and derived indices) [24,25,26,27,28,29], alone or combined with other environmental covariates (elevation, geology, hydrology, morphometry, and climate) [30,31,32,33,34]. Field-measured data (physical and chemical soil–water properties), which are used to a lesser extent, may improve the prediction performance for soil salinity, even more so if alternative salt-related parameters are considered. Moreover, determining the content of exchangeable cations, and thus the soil ESP, is usually less cost-effective and more time-consuming than determining soluble ion concentrations, which are often used to estimate salinity/sodicity indirectly. Therefore, this study aimed to evaluate and compare the prediction performance of three ML regression and classification algorithms (PLS, SVM, and RF) for estimating the soil ECe and ESP and for classifying salt-affected soils from soluble salt ions. The results may contribute alternative covariates for modelling, as well as to the characterization and management of salt-affected soils in the study area.
4. Conclusions
The performances of ML classification and regression algorithms (PLS, SVM, and RF) in predicting soil ECe, ESP, and salt-affected soil classes were evaluated and compared. Among the assessed ML regressions, SVM and RF performed best in predicting the soil ECe, whereas the RF model was superior for estimating the soil ESP. The RF classification algorithm showed the best prediction accuracy (87%) with a kappa value of 82%, followed by SVM and PLS-DA. Soluble Na⁺ was the most important explanatory variable for all the prediction models, followed by Ca²⁺, Mg²⁺, Cl⁻, and HCO₃⁻, which were important for both classification and regression. The sodic class was poorly predicted, and the resampling applied to overcome its imbalance did not significantly improve the classification performance. The stability analysis showed that the amount of training data had less impact on the RF regression models, whereas SVM and PLS-DA were more stable than RF for classification. Additional explanatory variables somewhat improved the PLS and SVM regressions for predicting ESP, as well as the effectiveness of the RF classification. It can be concluded that RF or SVM regression is suitable for estimating the soil ECe, and RF regression for estimating the soil ESP. In addition, the RF and SVM classification models can be appropriate for predicting salt-affected soil classes from soluble salt ions. Additional samples and explanatory features could be included in the dataset to improve the prediction performance. The assessed models might contribute significantly to the monitoring, mapping, and management of salt-affected soils in the study area.
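The accuracy and kappa values reported above are both derived from a confusion matrix. The sketch below assumes the standard Cohen's kappa, κ = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement (overall accuracy) and p_e is the agreement expected by chance. The 2 × 2 matrix is invented for illustration and is not taken from this study.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: share of samples on the diagonal
    p_o = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement from the marginal totals
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (p_o - p_e) / (1.0 - p_e)
    return p_o, kappa


# Invented two-class example
matrix = [[20, 5],
          [10, 15]]
acc, kappa = accuracy_and_kappa(matrix)
print(round(acc, 2), round(kappa, 2))  # 0.7 0.4
```

Because kappa discounts the agreement expected by chance, it is generally lower than the overall accuracy, as in the values reported for the RF classifier.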