1. Introduction
The agricultural sector operates at the nexus of soil health dynamics and property availability, where achieving an optimal equilibrium of key soil properties—such as organic matter, N, P, and K—profoundly influences crop productivity and yield outcomes [
1,
2,
3]. A precise evaluation of these essential properties holds paramount importance for informed fertilizer application strategies and the implementation of effective soil management practices [
4,
5]. Traditional chemical methodologies have long served as the bedrock of soil science and agronomy, offering meticulous insights into soil property estimates through a series of well-established laboratory procedures. These conventional techniques encompass soil sampling, chemical extraction processes, and quantitative analyses utilizing methods like colorimetric assays, spectrophotometry, and titration [
6]. Renowned for their high precision and reliability, these methods have undergone extensive validation across diverse soil types and property compositions [
7,
8,
9]. Despite the accuracy of traditional soil testing approaches, their reliance on labor-intensive and resource-intensive chemical analyses presents formidable challenges in terms of operational efficiency and scalability [
10]. Given these challenges, there is an urgent need to develop innovative technologies that can enhance the efficiency and effectiveness of soil property assessments.
In recent years, advancements in soil property detection technologies have increasingly highlighted the need for rapid, portable, and non-destructive methods. For instance, spectroscopic techniques, particularly visible–near-infrared (Vis-NIR) and MIR spectroscopy, leverage the unique spectral signatures of soil constituents to non-destructively evaluate the concentrations of soil properties, such as carbon, N, P, and K [
11,
12,
13,
14]. Among these, MIR spectroscopy, especially when coupled with attenuated total reflectance (ATR) probes, has shown promise in effectively analyzing soil components and contaminants due to its sensitivity to organic and inorganic compounds in soils. Studies have demonstrated the utility of MIR-ATR in capturing correlations between soil properties and the elemental composition, as seen in partial least-squares models for major soil elements, thus enhancing quantitative soil analyses [
15]. Additionally, MIR-ATR techniques have been successfully applied in diagnostic screening for urban soil contaminants and in rapidly assessing petroleum-contaminated soils, underscoring their versatility in environmental monitoring [
16,
17]. These methods have gained popularity due to their ability to provide real-time data with minimal soil disturbance. Nevertheless, significant challenges related to soil heterogeneity and variability in moisture content continue to affect the accuracy and reliability of these assessments [
18]. Remote sensing technologies, which utilize satellite and aerial imagery, offer the capability to monitor soil characteristics over extensive geographical areas, thus providing valuable spatial data for large-scale agricultural management [
19,
20,
21]. Despite their advantages, the accuracy of these remote sensing methods often depends on careful calibration with ground truth data, which may be limited in availability and may not adequately represent local soil conditions [
22,
23]. Moreover, emerging technologies such as electrochemical sensors and laser-induced breakdown spectroscopy (LIBS) facilitate rapid and direct analysis of soil elements, enabling timely decision making in agricultural practices [
24,
25,
26,
27,
28]. While these methods present clear advantages, they are not without challenges, issues related to sensor stability, environmental interference, and the need for frequent recalibration pose significant obstacles to their practical implementation [
29,
30]. Overall, while innovative technologies have made substantial strides in soil properties detection, the reliance on data from individual sensors still imposes limitations on the overall accuracy of soil assessments.
To address these limitations, there is a growing interest in developing innovative methods for assessing soil properties through the fusion of multi-sensor data combined with machine learning algorithms. Numerous studies have demonstrated the effectiveness of using multi-source data within fusion networks to enhance the accuracy of soil property predictions. For instance, combinations of remote sensing data with spectral data, spectral data with image data, and LIBS with MIR spectral data have all shown promise in improving predictive outcomes [
31,
32,
33]. These advancements provide a new direction for the assessment of soil properties. Consequently, it is essential to investigate the fusion methods of multi-source spectral data to maximize the potential of near-field soil sensing technologies for accurate soil property estimation. On the one hand, many of these approaches primarily focus on point predictions for individual properties, which limits their ability to comprehensively evaluate multiple key properties simultaneously [
34,
35,
36,
37]. Furthermore, there is a notable lack of research on interval prediction methods in this context. Unlike traditional point estimation, interval prediction generates a range of upper and lower limits for property values, which better reflects the inherent uncertainty and variability of soil properties [
38]. This approach leverages established chemometric methods to enhance prediction reliability within a comprehensive interval modeling framework, as demonstrated in prior studies by Rodionova and Pomerantsev [
39,
40]. By broadening the traditional model application, this method accounts for greater variability, which is particularly beneficial in agricultural and environmental research [
39]. Currently, most applications of interval prediction are concentrated in fields such as photovoltaic energy generation and water demand forecasting, with relatively few studies addressing the interval prediction and evaluation of soil property levels [
41,
42]. This gap presents an opportunity for further research to explore how interval prediction techniques can be effectively adapted and applied to soil property assessments. On the other hand, machine learning represents a pivotal component within the realm of multi-source data fusion technology. Models like partial least square regression (PLSR), random forest (RF), support vector machine (SVM), and deep learning (DL) exemplify the potential of amalgamating diverse data streams to enhance the precision and efficacy of property predictions [
31,
34,
43,
44,
45]. These methodologies offer a sophisticated framework for synthesizing heterogeneous data inputs, thereby elevating the granularity and reliability of soil property estimations. Despite the substantial progress made in leveraging machine learning for multi-source data fusion, considerable opportunities for advancement remain, particularly in the area of point-to-interval prediction of soil property [
46]. The transition from singular data points to comprehensive prediction intervals remains a frontier where further refinement and innovation are warranted.
In addition, the preprocessing and fusion methods applied to raw spectral data are critical factors influencing estimation accuracy [
47]. Various studies have employed techniques such as baseline correction, smoothing, and denoising for spectral preprocessing, with fractional order derivatives (FODs) also emerging as a particularly effective approach for enhancing the accuracy of Vis-NIR estimations [
48,
49,
50]. Despite these advancements, a standardized method for processing spectral data has yet to be established, and comparative studies assessing the performance of different preprocessing techniques remain scarce. The fusion of spectral data focuses on integrating information during the post-processing phase, and numerous studies have demonstrated its efficacy in enhancing both the accuracy and stability of soil property estimations [
51,
52,
53]. However, when performing multi-task predictions for soil properties, inconsistencies among the information provided by the different spectral bands can lead to distortions in the fusion results, particularly in scenarios characterized by strong spectral aliasing or noise [
54]. Consequently, effectively managing noise and preventing overfitting presents significant challenges in current research efforts. Addressing these issues is essential for optimizing the performance of predictive models in soil science.
In this study, we present a novel approach that integrates diverse spectral data to facilitate precise point-to-interval forecasts of critical soil property contents. Following the collection, preprocessing, and feature extraction of soil spectral data, we systematically designed and compared various fusion strategies to evaluate their impact on prediction accuracy and associated errors. Leveraging a PLSR model, we successfully achieved point predictions for five essential soil properties. Building on this work, we established prediction intervals at varying confidence levels by applying statistical insights derived from the errors associated with the point predictions. This comprehensive analysis not only allowed for the identification of uncertainties in predicting soil properties but also illuminated the underlying causes of these uncertainties.
2. Materials and Methods
2.1. Sample Preparation
The research site is located within a controlled experimental field at the Jilin Academy of Agricultural Sciences (JAAS) in Gongzhuling City, Jilin Province. This geographical zone is defined by a temperate continental climate, marked by an average annual precipitation of approximately 594.8mm and an average annual temperature of 5.6 °C. The pronounced seasonal variations observed in this region significantly contribute to the overall fertility of the soil, thereby fostering a conducive environment for a diverse range of agricultural practices. The nutrient profile within this locale is subject to the interplay of natural processes and agricultural interventions. Hence, point and interval monitoring of soil property contents is imperative for augmenting agricultural yields and sustainability within this region.
Field surveys and soil sampling were carried out in the study area during the fall of 2021. Considering the soil type, topography, and land-use characteristics, a plum sampling method was employed to collect samples from the 5–20 cm soil layer. In total, 109 soil samples were collected. Following the preparatory steps, each soil sample was partitioned into three distinct sections. The first section was dedicated to the collection of UV-Vis-NIR spectral data, and the second to the acquisition of MIR spectral data. The third component involves the chemical analysis of key soil properties, which is essential for accurately determining the actual concentrations of each property.
The chemical measurement of soil properties is conducted using internationally recognized methods. Specifically, soil organic matter was measured by the Walkley and Black dichromate oxidation method [
7]. In this process, the soil sample is oxidized with potassium dichromate and sulfuric acid. The organic matter content is then determined based on the consumption of dichromate, which correlates with the organic carbon content in the sample. For TN, we employed the Kjeldahl method, which involves digesting soil samples with sulfuric acid and a catalyst to convert nitrogen into ammonium [
55]. Following digestion, the ammonium was distilled and titrated to quantify the total nitrogen content in the sample. HN was assessed using a hydrochloric acid hydrolysis procedure [
56]. In this method, soil samples were treated with acidic solutions to release nitrogen from organic compounds. The resulting ammonium can then be quantified using colorimetric methods or ion-selective electrodes, providing a reliable measure of HN. The determination of AK was achieved through the ammonium acetate extraction-flame photometry method [
57]. During this procedure, the soil sample is shaken with an ammonium acetate solution, which displaces potassium ions from the soil matrix. The concentration of potassium in the extraction solution is subsequently measured using atomic absorption spectrophotometry or flame photometry. For AP, extraction was performed using the Olsen method, which employed a sodium bicarbonate solution [
58]. The concentration of phosphorus in the extract was determined through colorimetric methods, allowing for an accurate assessment of phosphorus availability in the soil.
2.2. UV-Vis-NIR and MIR Spectral Data Measurement
The UV-Vis-NIR spectral data of the 109 soil samples were measured in the laboratory using a UV-3600IPLUS, 220C (A12615900129) spectrometer (Shimadzu, Kyoto, Japan). The instrument is equipped with a deuterium lamp (200–330 nm) and a tungsten–halogen lamp (330–2500 nm), enabling it to cover the entire spectral range from UV to NIR. A spectral resolution of 2 nm was employed, generating a total of 1151 spectral data points. The type of photometric value was set to reflectance, which can indicate the presence of specific soil constituents, such as organic matter or minerals. Prior to the measurements, the instrument was calibrated using a barium sulfate (BaSO4) white reference plate with near 100% reflectance. The reflectance values of the soil samples in each spectral band were recorded by calculating the ratio of the sample’s reflectance intensity to the standard white reference, and the scanning was repeated three times for each sample, with the average taken as the final reflectance spectral data.
The MIR spectral data for the soil samples were acquired using an IRAffinity-1S Fourier-transform infrared (FTIR) spectrometer (Shimadzu, Kyoto, Japan). Spectral measurements were conducted across a wavenumber range of 4000–400 cm−1, with the spectral resolution set to 2 cm−1. In this study, transmission MIR spectroscopy was employed to obtain transmittance spectral data for soil samples, which is particularly suitable for capturing the detailed transmittance characteristics of soil components, providing distinct peaks that correlate to molecular vibrations. Similar to the UV-Vis-NIR measurements, a reference sample of potassium bromide (KBr) powder, exhibiting negligible absorption in the mid-infrared region, was utilized for background scanning. The intensity signals from both the measurement sample and the reference sample were converted to a digital format and subsequently subtracted, yielding the final MIR spectra. We opted to calculate the average of the three scans to minimize the variability caused by environmental factors during measurements and to provide a more stable representation of each sample’s spectral signature.
2.3. Spectral Data Preprocessing
The UV-Vis-NIR and MIR spectral profiles underwent smoothing via the Savitzky–Golay (SG) method employing a window size of 15 and a polynomial order of 2 [
59]. Subsequently, the raw spectra were subjected to 0-1 order derivative transformations at 0.2 intervals using the Grünwald–Letnikov fractional order derivative (FOD) approach [
60]. This fractional order derivative technique, unlike conventional integer order derivatives, such as first or second order, enhances the detection of subtle spectral absorption features, thereby facilitating the extraction of more informative data [
61]. To optimize the preprocessing of the spectra, a PLSR model was constructed using spectra data transformed by different FODs. A 10-fold cross-validation approach was employed, where the dataset was divided into ten subsets, with each subset iteratively serving as a testing set while the remaining subsets were used for training [
62]. This process involved averaging the results of each iteration and minimizing the prediction error, assessed through the root mean square error (RMSE) metric while maximizing the coefficient of determination (R
2) as much as possible. By systematically evaluating the performance of different FOD transformations, we were able to identify the most effective FOD for spectral enhancement and subsequent model development.
2.4. Feature Selection
Principal component analysis (PCA) is a powerful statistical technique used for dimensionality reduction and feature selection, which is particularly suited for high-dimensional spectral data, such as UV-Vis-NIR and MIR spectra. By transforming the original features into orthogonal principal components (PCs) that capture the maximum variance in the data, which are ordered by the amount of variance they explain in the data. The key steps include data standardization, covariance matrix computation, eigenvalue decomposition, selection of PCs based on variance explained, data transformation, and interpretation of the most informative features identified through the PCs [
63,
64]. For UV-Vis-NIR and MIR data, PCA can help identify the most informative wavelength regions that contribute the most to the overall variance in the data, which can be useful for tasks like quantitative analysis from points to intervals.
2.5. Data Fusion Strategy
2.5.1. Series Splicing (SS)
To construct the new feature matrix, we connected features selected from the UV-Vis-NIR and MIR spectral data that explain the maximization of the variance of the data. This approach involves initially identifying the most informative features from each spectral range using optimal feature selection methods. Once the features are determined, they are systematically combined in a sequential manner, as illustrated in
Figure 1, resulting in a comprehensive feature matrix that integrates the significant spectral characteristics from both UV-Vis-NIR and MIR spectra. This concatenated feature matrix is expected to enhance the overall predictive performance by leveraging the complementary information provided by the distinct spectral regions.
2.5.2. Outer-Product Analysis (OPA)
The principle of OPA involves creating a new feature set by calculating the outer product of the feature vectors from two sets of spectral data [
65]. The outer product is a matrix that captures the interactions between every pair of features from the two datasets, potentially enhancing the discriminative power of the combined data [
66]. The specific implementation methodology is detailed in
Figure 2. The band numbers of UV-Vis-NIR and MIR were set as n1 and n2, respectively, with the total number of samples being m. To generate a high-dimensional feature matrix, each sample undergoes OPA, which involves the outer product of the UV-Vis-NIR and MIR spectral matrices. Where U
std[i] and M
std[i] represent the UV-Vis-NIR and MIR spectral matrices of the ith sample, respectively, and * denotes the outer product operation. The outer product vectors of all samples are then stacked into a matrix to obtain the fused high-dimensional feature matrix Z
i.
2.5.3. Granger–Ramanathan Averaging (GRA)
The GRA algorithm for data fusion relies on predictions derived from single-sensor UV-Vis-NIR and MIR spectral data. This approach involves assigning weights to each spectral band based on the outcomes of the Granger causality test, which is typically executed through regression analysis [
67]. The final output is generated by constructing a linear model that correlates the fused prediction results with the individual predictions obtained from the different sensors. The structure of this is shown in
Figure 3.
Here, the fused prediction result is denoted as Y. The weights, represented by W1 and W2, are ascertained through the Granger causality test, encapsulating the significance of each band in the predictive process. Concurrently, X1 and X2 symbolize the projections of soil properties predicated on UV-Vis-NIR and MIR data correspondingly, W0 is the intercept, embodying the distinct informational contributions of these spectral ranges.
2.6. Point Prediction Model
The initial phase of modeling involves point prediction utilizing machine learning techniques. Prior to constructing the model, the Kennard–Stone (K-S) algorithm was employed to divide the dataset of all soil properties into a training set comprising 65 samples (60%), a testing set comprising 22 samples (20%), and a validation set comprising 22 samples (20%) [
68]. Specifically, the training set was employed to fit the model, allowing it to learn patterns and relationships within the data. Following this, the validation set was used to fine-tune the model’s hyperparameters and provide an initial assessment of its performance. The testing set was reserved for the ultimate stage of assessment, where it serves as the ultimate benchmark for validating the model’s robustness and ensuring that it can perform accurately in real-world scenarios. PLSR was then applied to develop a predictive model for estimating soil property contents at the point level. As one of the most prevalent modeling methods in spectral analysis, PLSR effectively addresses the high dimensionality and multicollinearity inherent in spectral data. It emphasizes the relationships among variables, performs well with small sample sizes, and possesses certain nonlinear modeling capabilities. A critical parameter in constructing a PLSR model is determining the optimal number of latent variables (LVs), which significantly influences the model’s performance. In this study, we focused on identifying the optimal number of principal component factors (PCFs). To achieve this, we employed a 10-fold cross-validation approach. For each model configuration, we incrementally increased the number of PCFs and evaluated the predictive performance using the test set. The primary goal was to minimize the RMSE, a key indicator of the model’s prediction accuracy. Through this iterative process, we were able to identify the optimal number of PCFs that resulted in the lowest prediction error.
2.7. Interval Prediction Model
In the next phase of modeling, probability interval prediction is achieved through the analysis of the absolute errors associated with point predictions. In this study, we aimed to validate the model’s effectiveness and the non-parametric estimation approach by employing both Gaussian probability interval prediction and Kernel density estimation (KDE) interval prediction for comparing interval prediction results at various confidence levels. Specifically, the Gaussian interval prediction utilizes statistical analysis of the standard deviation of the absolute errors from the validation set to derive the upper and lower bounds of the confidence intervals [
69].
KDE is fundamentally based on inferring the distribution of an entire dataset from a finite number of samples. It is a non-parametric method for estimating the probability density function of a random variable, particularly useful when the underlying distribution is unknown or does not conform to standard parametric distributions [
70,
71]. KDE is employed to predict probability intervals, which serve to estimate the upper and lower bounds of soil property contents at a specified confidence level. This approach offers valuable insights into the distribution and variability of property content across different soil samples.
In this context, f(x) represents the density function estimated at a specific point. The variable n denotes the total number of observations in the dataset, while Xi refers to the individual observed values. The parameter h indicates the bandwidth.
2.8. Model Construction and Evaluation
2.8.1. Evaluation Indicators of Point Prediction
The evaluation criteria for assessing the point prediction model include the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE) [
72]. During the modeling process, each dataset—training, validation, and testing—was evaluated using performance metrics. These stages form a coherent sequence. However, the final evaluation of the model is primarily based on its performance on the testing set, as this provides the most reliable measure of its predictive power when applied to unknown samples. It is worth noting that R
C2, RMSE
C, and MAE
C refer to the R
2, RMSE, and MAE for the training set. Similarly, R
V2, RMSE
V, and MAE
V represent the same metrics for the validation set, while R
P2, RMSE
P, and MAE
P indicate the corresponding values for the testing set. In this context, a higher R² value, approaching 1, alongside minimized RMSE and MAE values, signifies that the model is more accurate and performs better. Their mathematical expressions are as follows:
where y
i represents the true value, f
i denotes the predicted value,
denotes the measurement on average, and n indicates the number of data points.
2.8.2. Evaluation Indicators of Interval Prediction
The prediction interval coverage probability (PICP) and prediction interval normalized average width (PINAW) are widely employed metrics to evaluate the accuracy and quality of probabilistic interval forecasting results [
73]. The PICP represents the likelihood that an actual observation falls within the predicted interval, whereas PINAW serves as a measure to assess the normalized average width of the probabilistic interval prediction model. However, as these two metrics tend to exhibit a somewhat contradictory relationship, the coverage width criterion (CWC) is necessary to concurrently consider the degree of coverage and narrowness of the prediction intervals. Ideally, a high-quality prediction interval should achieve a PICP as close to 1 as possible while maintaining minimal values for both PINAW and CWC. The calculation formulae for CWC are as follows:
where η is a parameter greater than 0, with this paper using a value of 1, and μ is the given confidence level.
4. Conclusions
This study aimed to advance the fusion of multi-source spectral data with machine learning algorithms to facilitate precise point-to-interval prediction of unknown soil property concentrations. Initially, the UV-Vis-NIR and MIR datasets were established to organically integrate the relevant feature bands extracted through PCA. This integration provided comprehensive inputs for both point predictions and interval predictions of key soil properties. The prediction models were then validated using the integrated datasets to ensure their robustness and reliability. The results underscore the efficacy of the OPA fusion strategy in significantly enhancing the accuracy of point predictions across the three fusion methods, highlighting the critical role of the integration of diverse spectral data sources in improving prediction performance effectively. Within the OPA-PLSR framework, successful predictions were achieved for SOM, TN, and HN, with correlation coefficients of 0.91, 0.90, and 0.89, respectively. AK also demonstrated approximate quantitative detectability with an R2 value of 0.73, while AP exhibited a lower R2 of 0.53, indicating a requirement for further refinement in prediction accuracy. In terms of the interval prediction accuracy, the KDE interval prediction model outperformed the Gaussian probability interval prediction model, demonstrating better CWC at equivalent confidence levels. Reliable interval predictions were attained for SOM, TN, HN, and AK, characterized by narrow average bandwidths and high interval coverage rates. These outcomes effectively quantified the uncertainty surrounding key soil properties, offering a reasonable range of fluctuation in property concentrations. Another thing that should be emphasized is AP, characterized by higher prediction uncertainty. This research not only highlights the potential of advanced spectral fusion techniques in soil property analysis but also advances our understanding of soil property dynamics.
However, a significant limitation of this study is its reliance on a limited number of samples collected in a single year, which may restrict the generalizability of the results. Future research will prioritize expanding soil sample datasets to encompass multiple years, allowing for a more comprehensive understanding of property dynamics over time. For future data processing, priority will be placed on harmonizing the units of spectral data across different spectral bands to ensure consistency and comparability, and effort will be directed toward incorporating comparative analyses of averaged and replicated spectra. Additionally, there will be a focused effort on enhancing the prediction of available nutrients, particularly AK and AP. This will involve refining the related techniques and models to improve the accuracy and reliability of these predictions.