1. Introduction
Ferralsols cover 32.9% of Brazil’s land area [
1]. Ferralsols are characterized by low levels of plant nutrients such as Ca, Mg, and P and high levels of Al [
2]. Soybean is most commonly grown in the Ferralsols of Brazil [
3].
Modern agriculture requires automation for analysis and management processes. Agricultural automation is well known for the ability to diagnose and observe pests and diseases [
4,
5]. Automation in soil analysis processes is complex and requires further study.
The assessment of soil properties involves a series of chemical processes and aims to determine the ability of the soil to provide the specific nutrients necessary for the plant growth cycle. The most common approach consists of chemical extraction with extractants that simulate the absorption of nutrients by plants. However, this traditional method of soil analysis raises environmental challenges, notably owing to the inadequate disposal of the waste generated during the analysis [
6].
The current method not only presents risks for the professionals involved but also has adverse environmental impacts due to chemical residues, incurs high costs for the acquisition of chemical reagents, and requires considerable time. In this context, sustainable and efficient analytical approaches that minimize environmental impacts and optimize the effectiveness of the soil assessment process are needed to guarantee more efficient agricultural practices [
7].
Multispectral soil analysis involves the use of various spectroscopic techniques to evaluate soil properties [
8]. Different methods have been used for soil analysis, such as microspectrophotometry, X-ray fluorescence spectroscopy, and laser-induced breakdown spectroscopy [
9]. These techniques provide information on the composition, purity, and elemental content of soils, helping to discriminate between different types and sources of soils [
10]. Data fusion methodologies have been applied to improve classification accuracy, especially when combining information from several spectroscopic analyzers. Multispectral analysis can capture the spectral dimensionality of soils, providing valuable information on the variability of soil elements, despite limitations in the resolution of narrowband absorption.
Soil spectral analysis calibration is crucial for accurately predicting soil properties. Soil parameters such as potassium, phosphorus, and organic matter are already being evaluated by visible-near-infrared (Vis-NIR) spectroscopy, but it requires calibration [
11]. Calibration methods involve preprocessing transformations, variable selection techniques, and regression algorithms to increase prediction accuracy [
10,
12]. Using spectral libraries and reducing sample processing levels have shown potential for lowering costs and time implications for predicting soil properties such as organic carbon, clay, and pH [
13]. Overall, proper calibration methods are essential for leveraging soil spectral analysis to monitor soil properties effectively and contribute to precision agriculture.
The use of indirect analysis through multispectral sensors has enabled expeditious, economically viable, and ecologically sustainable monitoring of soil element levels. Moreover, the integration of machine learning (ML) algorithms has proven crucial for obtaining reliable estimates in this context [
14]. The synergistic combination of these approaches provides effective and agile monitoring of the soil nutrient content, which is highly relevant for agricultural soil productivity, food security, and the promotion of sustainable agricultural development [
15]. The convergence of these techniques not only optimizes the speed and efficiency of monitoring, but also contributes to mitigating the environmental challenges inherent to traditional soil analyses, such as the production of chemical residues.
The central problem of this study is the difficulty in quickly and accurately assessing and monitoring soil physicochemical attributes, which are crucial for the proper and sustainable management of agricultural resources. Traditional soil analysis methods, such as laboratory collection and analysis, are generally time-consuming, expensive, and limited in terms of spatial coverage, which makes large-scale and real-time monitoring difficult. In this context, the use of spectral variables as indirect indicators of soil properties emerges as a promising alternative. However, an important issue is the complexity of the relationship between spectral variables and soil physicochemical attributes, which can vary depending on the type of soil, moisture, and presence of organic matter, among other factors. Therefore, there is a technical challenge in building machine learning models capable of capturing these relationships in a robust and generalizable way so that they can be applied in different scenarios. The objective of the current investigation was to analyze the associations between the spectral and physicochemical variables of soil in addition to predicting the physicochemical attribute levels of soil via the use of spectral variables as inputs into machine learning models.
2. Materials and Methods
2.1. Sample Collection and Determination of Physicochemical Properties
Soil samples were collected at 0 to 20 cm depth from the municipalities of Cassilândia, Chapadão do Sul, Costa Rica, and Paraíso das Águas (18°46′26″ S 52°37′28″ W, average altitude of 810 m above sea level), with a coverage area of 16,130.84 km², located in the State of Mato Grosso do Sul (MS), Brazil. The regional climate is classified as humid tropical, with a rainy season in summer and a dry season in winter, an average annual rainfall of 1,850 mm, an average annual temperature of 20.5 °C, and a variation of 7.5 °C.
The soil in the region is mostly classified as Rhodic Ferralsol [
16]. A total of 33% of the 1000 samples analyzed were characterized as sandy, 25% as sandy loam, 24% as clay loam, and 20% as clay. The soil samples were collected with different augers, i.e., probe-type augers (20 mm diameter) and screw-type augers, at depths of 0–0.20 m. The soil samples were sieved through a 2 mm mesh and air-dried. The elements Ca, Mg, and K were analyzed in the Exata Brasil Laboratory located in Chapadão do Sul-MS.
A KCl solution (1 mol L⁻¹) at a ratio of 1/10 (soil:solution) was used to extract Ca and Mg from the soil. Potassium (K) was extracted with Mehlich-1 solution (0.05 mol L⁻¹ HCl + 0.0125 mol L⁻¹ H₂SO₄) at a ratio of 1/10 (soil:solution). An ammonium acetate solution, in a proportion of 10 g of soil to 25 mL of solution, was used to extract S from the soil. The Ca, Mg, K, and S contents in the soil extracts were measured via inductively coupled plasma optical emission spectrometry (ICP-OES) (Perkin Elmer, Waltham, MA, USA).
Multispectral evaluations were carried out in a 20 g aliquot of each sieved, dried, and homogenized soil sample, which was subsequently added to a Petri dish for spectral measurements (
Figure 1). The Petri dish was placed on a flat bench, and the sensor was installed 8 cm from the soil surface. The area of incidence of the spectral beam was 3 cm². Two external 50 W halogen lamps were positioned 35 cm from the Petri dish at a zenith angle of 30°, forming a 90° angle to each other, following the method described by Franceschini et al. [
14].
The reflectance spectra were obtained with a CROP CIRCLE ACS-470 instrument (Holland Scientific, Inc., Lincoln, NE, USA). The three spectral bands used were green (532–550 nm), red (670–700 nm), and red edge (730–760 nm). The sensor was calibrated via FieldCal SC-1. Readings were taken on the surface of each soil sample in 100 replicates for each band. Reflectances were recorded in spreadsheets, and reflectance averages were calculated for each spectral band.
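The averaging of replicate readings described above can be sketched as follows. This is a minimal illustration, not the authors' spreadsheet procedure: the reflectance values are hypothetical, only three replicates are shown instead of 100, and the dictionary keys simply reuse the band names listed in the text.

```python
from statistics import mean

# Hypothetical replicate reflectance readings for one soil sample.
# The study recorded 100 replicates per band; three are shown for brevity.
readings = {
    "green": [0.112, 0.115, 0.110],
    "red": [0.158, 0.161, 0.155],
    "red_edge": [0.190, 0.188, 0.192],
}


def band_means(replicates):
    """Return the mean reflectance of each band from its replicate readings."""
    return {band: mean(values) for band, values in replicates.items()}


# One averaged value per band becomes the input vector for that sample.
sample_features = band_means(readings)
```

Each soil sample is thus reduced to one mean reflectance per band, which is the form of input the prediction models receive.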
2.2. Data Analysis via Computational Intelligence
The data were analyzed in WEKA (Waikato Environment for Knowledge Analysis) software, version 3.9.3, run on a computer with an AMD Phenom™ II X4 B97 3.20 GHz processor, 4 GB of RAM, and a 32-bit Windows 7 operating system. Cross-validation with 10 folds (K = 10) and 10 repetitions (100 runs) was applied to the spectral data of 690 samples, with the wavelength data as input values and the macronutrients Ca, Mg, and K as output values to be predicted. The prediction analysis for S used 370 samples.
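To make the validation scheme concrete, the sketch below generates the index splits for 10-fold cross-validation repeated 10 times on 690 samples. It is a plain-Python illustration of the general technique, not WEKA's internal implementation; the function name `repeated_kfold` and the seed are assumptions introduced here.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_idx, test_idx) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for each repetition
        # Taking every k-th index gives k folds whose sizes differ by at most 1.
        folds = [indices[i::k] for i in range(k)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


# 10 folds x 10 repetitions = 100 train/test runs, as in the text.
splits = list(repeated_kfold(690))
```

Each of the 100 runs trains on 621 samples and tests on the remaining 69, so every sample is used for testing exactly once per repetition.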
The models tested were random forest (RF), multilayer perceptron (MLP), M5P model tree (M5P), REPTree (REPT), and random tree (RT). All the parameters were kept at the default software configuration. The models were selected on the basis of their applicability in other agronomic studies [
13,
14]. MLP is a type of neural network that excels at solving supervised learning problems with multiple inputs. It consists of layers of neurons (or perceptrons) organized into an input layer, one or more hidden layers, and an output layer. Each neuron in a layer is connected to the neurons in the next layer by adjustable weights. In Weka software, the default MLP configuration includes a single hidden layer with a number of neurons defined by the average between the number of attributes and the number of classes, which generally provides a good balance between learning capacity and computational efficiency.
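As a small worked example of the default hidden-layer rule described above, the sketch below computes the number of hidden neurons as the average of the number of attributes and the number of classes. Two points are assumptions made here rather than statements from the text: that the three spectral bands are the only input attributes, and that a numeric target counts as a single class with the average rounded down.

```python
def default_hidden_units(n_attributes, n_classes):
    """Average of input attributes and classes, rounded down (assumed rounding)."""
    return (n_attributes + n_classes) // 2


# Assumed setup: three band reflectances as inputs, one numeric nutrient target.
hidden = default_hidden_units(n_attributes=3, n_classes=1)
```

Under these assumptions the default network would have two neurons in its single hidden layer, which illustrates how compact the default configuration is for a three-band input.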
M5P combines features of decision trees with linear regression, making it particularly useful for continuous and mixed-category data. It builds decision trees in which the leaves contain linear equations that are used for prediction. This allows M5P to handle both categorical and continuous variables and to manage missing values, and it yields a more interpretable model by exposing the underlying mathematical relationships between variables.
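The idea of a tree whose leaves hold linear equations can be illustrated with a toy example. The split threshold and all coefficients below are hypothetical values chosen for illustration; they are not fitted to the study's data and do not reproduce WEKA's M5P output.

```python
def model_tree_predict(red, red_edge):
    """Toy model tree: one split on red reflectance, one linear model per leaf."""
    if red <= 0.15:
        # Leaf 1: hypothetical linear equation on red-edge reflectance only
        return 1.2 + 4.0 * red_edge
    # Leaf 2: hypothetical linear equation on both bands
    return 0.5 + 2.0 * red + 3.0 * red_edge


darker_sample = model_tree_predict(red=0.10, red_edge=0.20)
brighter_sample = model_tree_predict(red=0.20, red_edge=0.20)
```

Because each leaf is an explicit equation, the contribution of every band to the prediction can be read directly from the model, which is the interpretability advantage mentioned above.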
The random forest (RF) algorithm builds multiple decision trees independently and combines their predictions. It is particularly effective for large-scale problems because of its robustness against overfitting, and it facilitates data interpretation by allowing the assessment of variable importance. A random tree (RT) is a single decision tree in which each node is split using a randomly selected subset of the attributes.
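For regression, combining the trees of a random forest amounts to averaging their individual predictions. The sketch below shows only this averaging step; the three "trees" are hypothetical hand-written rules standing in for trees that would normally be trained on bootstrap samples of the data.

```python
def forest_predict(trees, features):
    """Average the predictions of all trees in the ensemble."""
    predictions = [tree(features) for tree in trees]
    return sum(predictions) / len(predictions)


# Three hypothetical stand-in trees, each a simple rule on one band.
trees = [
    lambda x: 2.0 if x["red"] <= 0.15 else 3.0,
    lambda x: 2.5 if x["green"] <= 0.12 else 3.5,
    lambda x: 1.5 if x["red_edge"] <= 0.18 else 2.5,
]

estimate = forest_predict(trees, {"green": 0.11, "red": 0.16, "red_edge": 0.19})
```

Averaging smooths out the errors of individual trees, which is one reason an ensemble tends to be more stable than a single random tree.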
The REPTree algorithm builds a regression tree by choosing splits that reduce the variance of the target and then simplifies the tree by reduced-error pruning, which improves the generalizability of the model. The means of the measured and predicted S, Mg, Ca, and K contents of the randomly selected samples were compared in scatter and line graphs via SigmaPlot 11.0 software.
The accuracy of the prediction models was evaluated by the correlation between the predicted and observed values (r) and the mean absolute error (MAE). The accuracy values of each of the tested models were subjected to analysis of variance to verify the existence of significant differences between the machine learning models. Subsequently, boxplots were generated for r and MAE for each model in the case of macronutrient prediction.
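The two accuracy metrics can be computed as in the following sketch, which implements the standard definitions of the Pearson correlation coefficient and the mean absolute error. The observed and predicted values are hypothetical and serve only to exercise the functions.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_o * var_p)


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical nutrient contents for four samples
observed = [2.0, 3.5, 1.8, 4.2]
predicted = [2.4, 3.1, 2.0, 3.9]

r = pearson_r(observed, predicted)
mae = mean_absolute_error(observed, predicted)
```

The two metrics are complementary: r measures how well predictions track the ordering and trend of the observations, whereas MAE measures the size of the errors in the units of the nutrient, so a model can have a high r and still a large MAE.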
The means of the performance parameters were grouped via the Scott–Knott test at 5% probability. The boxplots and groupings of means were generated via the ggplot2 and ExpDes.pt packages of R software.
For the Pearson correlation coefficient (r), we applied the criterion adapted by Figueiredo Filho and Silva Júnior [
15], which classifies the correlation into three categories: low when r is approximately 0.10–0.30, moderate when r is between 0.40 and 0.60, and high when r is 0.70–1.
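The classification criterion can be written as a small lookup function. This sketch uses only the ranges quoted above; the handling of values that fall between the quoted ranges (for example 0.35 or 0.65) is not specified in the text, so returning a separate label for them is an assumption made here.

```python
def classify_r(r):
    """Label the magnitude of r using the ranges quoted in the text."""
    magnitude = abs(r)
    if 0.10 <= magnitude <= 0.30:
        return "low"
    if 0.40 <= magnitude <= 0.60:
        return "moderate"
    if 0.70 <= magnitude <= 1.0:
        return "high"
    # Values outside the quoted ranges are left unlabeled (assumption).
    return "between categories"


labels = [classify_r(value) for value in (0.2, 0.5, 0.8)]
```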
3. Results
The mean S, Mg, K, and Ca contents measured in the analyzed soil differed from the predicted values owing to accumulated error (
Figure 2). The S content in the soil predicted by machine learning analyses via MLP, M5P, RF, RT, and REPT exhibited significant dispersion compared with the chemically analyzed mean content (
Figure 2A). Conversely, the Mg and K contents predicted by MLP, M5P, RF, and REPT showed low dispersion around the mean content but did not align with the values obtained through chemical analysis (
Figure 2B,C). In this context, the use of RT resulted in greater dispersion in the prediction of Mg, K, and Ca contents. However, the mean Ca content predicted by RT was the closest to the real value found in the chemical analyses (
Figure 2D).
With respect to the prediction of the sulfur content, the M5P and RF algorithms outperformed the other algorithms (
Figure 3), presenting high r values (higher than 0.6). These values indicate that both algorithms estimate the sulfur content accurately on the basis of spectral reflectance. Both algorithms also had low MAE values, indicating smaller prediction errors and greater precision when predicting S contents.
With respect to the performance of the algorithms in predicting the magnesium (Mg) content, the results revealed that the random forest (RF) algorithm was superior in terms of accuracy (
Figure 4). This could be translated into greater consistency between the predictions and observed values, as revealed by the correlation coefficient (r), which surpassed those of the other algorithms. Additionally, the RF demonstrated a lower mean absolute error (MAE) value, denoting a significantly high precision in its estimates.
The RT algorithm exhibited similar behavior to that of the RF when evaluated in relation to the accuracy indicator (r). However, it is important to highlight that RT revealed a significantly high MAE value, which indicates that although it presents relative consistency in predictions, it is not an accurate algorithm for predicting magnesium content.
Statistically, the algorithms had the same behavior for the correlation coefficient (r) in potassium prediction. There was also no significant difference in terms of error. Therefore, using algorithms that maintain better performance for the other elements facilitates processing because, in the case of potassium, all the models have the same performance (
Figure 5). The correlation coefficient was approximately 0.3, a moderate value that may be considered adequate given the variability and dynamics of K in the soil.
On the other hand, in the analysis of the prediction of the calcium (Ca) content, the M5P algorithm demonstrated superior performance in relation to the other algorithms, as evidenced by correlation coefficient values (r) that approached 0.3 (
Figure 6). Furthermore, the M5P algorithm achieved notably lower mean absolute error (MAE) values, with a maximum of approximately 1.50 for this nutrient. These results emphasize the ability of the M5P algorithm to generate estimates with a relatively low level of error, which is relevant for estimating the calcium content in soil samples via multispectral data.
In short, the M5P and RF algorithms performed satisfactorily in terms of the correlation coefficient (r). Their predictions were closest to the actual values for sulfur, for which r was close to 0.8. For the other elements analyzed, r remained at approximately 0.3, a moderate value which, given the dynamism and complexity of these soil elements, suggests that the use of multispectral reflectance to determine them is promising. Furthermore, the M5P and RF algorithms yielded lower MAE values, reinforcing the reliability and precision of the estimates obtained via multispectral reflectance.
4. Discussion
The results indicate a significant disparity between the predicted concentrations of sulfur (S), magnesium (Mg), potassium (K), and calcium (Ca) and the measurements derived from chemical analysis. The inconsistency of the predicted data is influenced by data quality and noise, which affect the accuracy of the predictive models [
17]. Data variations amplify imprecision, particularly in more complex predictive models [
18]. However, it is essential to validate data analyses with accurate predictive models. In this context, the RT predictive model provided mean S and Ca contents closest to the chemically analyzed values.
Traditional methods for estimating soil nutrients, such as laboratory analysis, are recognized for their accuracy, but they have significant limitations in terms of time and cost [
10]. These methods require specialized labor, and the use of chemical reagents, in addition to being expensive, can pose environmental risks due to the generation of potentially contaminating waste [
10]. This process becomes unfeasible for large-scale and timely application, which contrasts with the growing demand for fast, economical, and sustainable methodologies in the agricultural sector. In this context, the use of multispectral sensors combined with machine learning techniques has emerged as a promising alternative for estimating soil nutrients in a more efficient and environmentally friendly way. The combination of these tools allows data to be obtained noninvasively and in real time, facilitating continuous and large-scale monitoring of soil attributes. In this study, three specific multispectral bands were used to predict soil nutrient levels, employing different machine learning algorithms [
6]. This approach offers a potentially faster and more accessible methodology that can contribute to more sustainable and precise agricultural practices, enabling more effective soil management in response to the needs of agricultural production [
7].
Among the investigated nutrients, S had the highest predictive value, close to 0.80 for the correlation coefficient (r), and notably low values of the mean absolute error (MAE), confirming that highly accurate predictions can be obtained for this nutrient (
Figure 2). High prediction values were achieved by the M5P and RF algorithms, highlighting the robustness and reliability of the algorithms in the task of predicting S content. Both algorithms performed well because of their high r and low MAE values, and their use in other agricultural tasks, such as predicting soil organic carbon, stands out [
19]. With satisfactory results in predicting soil nutrients, RF can be used to infer soil fertility [
20]. The soil nutrients significantly influence the distribution of soil organic carbon [
21].
The other predictions yielded intermediate values, highlighting the complexity of predicting potassium content with the evaluated algorithms. This is particularly relevant in agronomic and soil fertility contexts, where the precise estimation of Mg, K, and Ca contents is crucial for effective and sustainable agricultural practices. In magnesium prediction, the RF algorithm stands out as the superior choice, providing greater accuracy. These results highlight the relevance of carefully selecting the appropriate algorithm for a given task. Dharumarajan et al. [
22] reported that RF was the best model for most soil properties, from macro- to micronutrients, indicating that the RF model is better for solving multivariate adjustment problems since RF combines many trees to form an accurate prediction mechanism. In addition, the RF algorithm, when necessary, has fewer parameters to adjust.
In the potassium prediction results, there was homogeneity in the results regarding r, suggesting that the analyzed algorithms exhibited a moderate correlation with the real observations of potassium content, although they did not reach higher levels of correlation, indicating that the prediction of this element can be challenging. In a similar study, Forkuor et al. [
23] reported that no machine learning algorithm works best for all global situations, and models must be tested to calibrate them to identify an accurate model for predicting soil properties, optimizing data processing.
The analysis of the prediction of the calcium (Ca) content revealed that the M5P algorithm performed better than the other algorithms, as evidenced by correlation coefficient values that approached 0.3. This result suggests that M5P achieved a moderate and consistent correlation with actual observations of Ca content, indicating its ability to provide meaningful estimates.
The superior performance of M5P in predicting S and Ca highlights its applicability and usefulness in agronomic contexts, where accurate estimation of these nutrients is crucial for adequate soil management and the development of effective agricultural practices. This contributes to increased productivity and sustainability of agricultural activities. According to previous studies, this model performs well in several tasks in different areas, such as predicting cadmium in agricultural soils [
24]. M5P also presents good results in the physical and chemical prediction of soil and water due to its greater accuracy and speed than the regression model [
25]. This diversity of accuracies in different situations demonstrates the generalizability and robustness of the algorithm.
The complexity of S forms in soils contributes to variability in spectral measurements. Sulfur occurs in organic and inorganic forms in soil and is transformed by microorganisms in the soil [
26]. In this sense, organosulfur compounds are stable over time, whereas others may decompose or convert to other forms, leading to variability in spectral measurements [
27].
Ca, Mg, and K interact with each other and with the soil matrix, which changes their chemical bonding structures with other elements present and their spectral expressions [
28]. Calcium and magnesium have the capacity to generate carbonates and additional compounds that affect the reflectance of soil samples [
29]. Potassium influences the spectral properties of soil through interactions with clay [
28].
Our findings demonstrated the effectiveness of the M5P and RF algorithms in predicting soil nutrients, particularly the S content, where the accuracy reached notable levels. The ability of these models to provide estimates close to real values has significant implications for agriculture and soil management, promoting decision-making on the basis of reliable data and contributing to more efficient and sustainable agricultural practices. The greatest contribution of these technologies is to reduce the work involved in analyzing samples and the use of reagents in laboratory analyses, making this part of the procedure faster and dispensing with expensive reagents that require adequate disposal, which laboratories do not always provide. Future research applying the algorithms identified here may adapt these technologies to remote sensors or prototypes for in situ use. Soil samples from other locations and a larger number of samples can be used to increase the accuracy of the algorithms. The use of hyperspectral sensors can also improve predictive value and enable real-time monitoring in agricultural scenarios.
5. Conclusions
In the present study, the use of the CROP CIRCLE ACS-470 multispectral sensor associated with machine learning was demonstrated to be a promising approach for predicting soil macronutrients, especially sulfur, with correlations between actual and estimated values above 0.6. For the macronutrients K, Ca, and Mg, the correlation reached values of approximately 0.3, indicating a moderate and coherent correlation with actual observations; the development of more refined models is needed to improve these results. These findings highlight the reliability and accuracy of predictions, thus strengthening the usefulness and effectiveness of the sensor in the context of soil analysis and agricultural decision-making.
The use of multispectral sensors and data prediction analysis via the M5P and RF algorithms derived from our results are directly applicable to areas with characteristics similar to those of Rhodic Ferralsol and Arenosol soils. For regions with soils of different compositions, we suggest conducting complementary studies to adapt the proposed practices to local conditions.