1. Introduction
Soil temperature (TS) is an essential environmental variable in disciplines such as engineering and hydraulic structure design, geotechnics, hydrological and meteorological studies, agriculture, forestry, and many other fields. Most chemical and physical characteristics of soil are temperature-dependent. For instance, fluctuations in TS directly affect water infiltration; thus, variations in TS and infiltration rates follow similar patterns [1,2,3]. Soil temperature may also affect other processes, such as groundwater discharge, land–atmosphere fluxes such as evapotranspiration, gas exchange, and surface runoff patterns [4,5,6]. Fluxes of greenhouse gases such as carbon dioxide (CO2) and nitrous oxide (N2O) depend directly on TS [7,8], which relates to global warming at a large scale. Any alteration to atmospheric constituents in turn affects soil temperature: because the thermal energy balance between the soil and the atmosphere is preserved, the soil can act as either a heat source or a heat sink [9,10,11,12]. Similarly, the impact of TS on the agricultural sector cannot be ignored, e.g., the dependency of crop growth on TS, irrigation planning, and hydrological drought monitoring [13,14].
Estimating the TS time series by using artificial intelligence (AI) models and exogenous variables is common. For instance, a multilayer perceptron (MLP), an artificial neural network (ANN), and a multivariate linear regression with air temperature (Ta), solar radiation (Rs), relative humidity (RH), and precipitation as inputs [15]; an MLP and an adaptive neuro–fuzzy inference system, genetic programming, and an ANN with Ta, Rs, RH, wind speed, and dew point as inputs [16,17,18]; a self-adaptive evolutionary extreme learning machine with Rs, Ta, and pressure [19]; and an emotional neural network and a least squares support vector machine [20] are some of the AI methods used in this field. Although these modeling methods are powerful for estimation, they are prone to uncertainties in model parameter tuning and input selection [21,22]. To address this problem, researchers have turned to remotely sensed data.
Further development of remote sensing tools and land surface models with extended historical temporal coverage has improved the quality and quantity of environmental data, particularly in regions with limited data availability. However, land cover, upwelling atmospheric emissions [23], cloud cover [24], satellite instrument failures, and the consequent errors in the inputs of land surface models are most often responsible for missing values in remotely sensed data. Therefore, many methods have been proposed to fill the gaps in these datasets [25].
Various methods have been proposed to fill gaps in remotely sensed data, including combining temporal steps to create a composite set. Weiss et al. [26] categorized data recovery methods into three classes: (a) spatial, (b) temporal sequences of images, and (c) spatio–temporal filling methods. Inverse distance weighting (IDW), kriging, cokriging, and their variants are widely recognized methods in the spatial class [27,28]. IDW and its variants, such as harmony search-IDW, genetic algorithm (GA)-IDW, and particle swarm-IDW, are commonly used and reliable approaches; a comparative study by Bărbulescu et al. [29] showed that the GA-IDW method outperformed the other methods in 70% of the cases studied. Principal component analysis, stochastic models, and k-nearest neighbors (kNN) are other popular statistical approaches for estimating missing values [30,31]. Weiss et al. used the spatio–temporal method to fill gaps in land surface temperature and vegetation index data, achieving an accuracy of R2 = 0.87; however, they could not evaluate the embedded uncertainties or account for temperature values affected by cloudy days, which adds to the inherent uncertainty of their model.
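As an illustration of the spatial class of gap-filling methods, the following is a minimal inverse distance weighting sketch in Python (a generic implementation, not the GA- or PSO-tuned variants cited above); the station coordinates, values, and power parameter are hypothetical.

```python
import numpy as np

def idw_fill(known_xy, known_vals, target_xy, power=2.0):
    """Estimate a missing value at target_xy as the inverse-distance-weighted
    mean of surrounding known observations (basic IDW with power parameter p)."""
    known_xy = np.asarray(known_xy, dtype=float)
    known_vals = np.asarray(known_vals, dtype=float)
    d = np.linalg.norm(known_xy - np.asarray(target_xy, dtype=float), axis=1)
    if np.any(d == 0):                      # target coincides with a station
        return float(known_vals[d == 0][0])
    w = 1.0 / d**power                      # closer stations receive larger weights
    return float(np.sum(w * known_vals) / np.sum(w))

# Hypothetical example: three stations with known soil temperature (deg C)
stations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [10.2, 11.0, 9.6]
print(idw_fill(stations, values, target_xy=(0.4, 0.4)))
```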
Integrating remotely sensed data with modeling backed by historical data reduces measurement limitations and the deficiencies of variable-based methods, namely input-variable uncertainties and error accumulation [18,22], where both non-linear and linear models are applicable. However, a method with a derivable explicit expression that is fast enough to produce real-time estimates of reasonable accuracy is preferred over other methods. In hydrology, the advanced state-space (SS) method is a relatively under-utilized linear method for modeling time series with intense periodic fluctuations. This model applies multiple in-line pre-processing methods to produce white-noise residuals. It can additionally be used to remove time series anomalies [32] and to forecast future steps using the model's out-of-sample forecasting ability. Studies on the SS model as a forecasting method are limited. Reservoir inflows, climatic data, catchment rainfall, and land hotspots are among the variables that have been modeled and forecasted with the SS method [33,34,35,36]. For example, Vijayakumar and Vennila [33] compared the autoregressive integrated moving average (ARIMA) model with the SS method. The model input consisted of reservoir inflows, and the results indicated that the SS method could overcome the deficiencies of the ARIMA model, which did not successfully handle the time series components observed in their data. For the forecast procedure, the trend component had to be removed before autocorrelation modeling; additionally, they reported that the ARIMA model could not forecast the "turning points". These deficiencies, alongside the linear nature of stochastic models and the preprocessing steps required before modeling, represent further limitations.
In a recent study by Zeynoddin et al. [37], a comprehensive exploration of techniques for predicting time series data was undertaken, encompassing variable-based models, AI modeling approaches, and linear methods. They proposed a linear methodology that pre-processes TS data using several methods and models the data dynamically via stochastic models. A comparative analysis with machine learning (ML) techniques [19] revealed that combining the SS method with the stochastic model yielded significantly more precise results. Building upon these findings, Zeynoddin and Bonakdari [38] introduced a structurally optimized deep learning method that integrated the SS method for long-term soil moisture forecasting. Remarkably, their study demonstrated that the SS method excelled at regenerating long-term time series features and substantially enhanced forecast accuracy compared with other methods. Moreover, their solution effectively mitigated the data size dependency problem often encountered in data-driven modeling approaches. While this proposed structure has limitations, such as its finite domain of model variables, its exclusivity to seasonally structured data, and the complexity of the deep learning method, the authors hypothesized that the comprehensive and generalizable structure of the SS method for seasonal data modeling could be leveraged to fill the temporal gap in the datasets of the Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System (FLDAS). These datasets, generated at a monthly resolution [39] with a two-month availability lag, as indicated on the Google Earth Engine (GEE) catalog website (Earth Engine Data Catalog: NASA_FLDAS_NOAH01_C_GL_M_V001#description), are influenced by the annual or monthly trends exhibited by the TS time series [10], which vary temporally and spatially due to atmospheric and subsoil processes affecting thermal equilibrium.
The objectives of this paper are (a) to introduce an extraction method for collecting monthly mean TS data for two agricultural areas in Quebec, Canada, and (b) to apply the powerful SS method for modeling. Following these, (c) a spatial validation analysis is carried out to assess how the model's accuracy and forecast patterns change when time series from different locations are used, and (d) a trend and long-term change analysis of the TS variable is carried out based on the model results and forecasts. Finally, (e) the results of the proposed FLDAS–state-space method are comprehensively discussed in the following sections. Furthermore, this paper outlines the practical application of these results, including their integration into future TS forecasts, thereby making a significant contribution to the field.
2. Time Series and Pre-Processing Review
Since this study uses a time series-based approach, a review of the methods required for time series preparation is presented. Time series can contain jumps, periods, and seasonal and non-seasonal trends, which are deemed deterministic. They can additionally have a stochastic component, consisting of the residuals after the deterministic characteristics are subtracted from the time series values. The deterministic component is the predictable part of the time series pattern and can be detected and extracted. Detecting the mentioned patterns and analyzing TS data entails employing several tests. These tests not only help to detect meaningful patterns but additionally help to extract statistics and interpret the time series. After pre-processing, models are applied to the prepared data for forecasting [40].
The first attribute of a time series to be checked is stationarity, i.e., the absence of meaningful changes in the mean, variance, and other statistics over time. Therefore, the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test is applied to TS to assess stationarity [41]. In the KPSS test, a regression equation is fitted to the data. The TS series is stationary if the variance of the random-walk component of that equation is null; it is trend-stationary if the deterministic trend term is not null and level-stationary if the deterministic trend term is equal to zero. Seasonal and non-seasonal trends in the TS time series, as one of the non-stationarity factors, are checked by the Mann–Kendall and seasonal Mann–Kendall trend (MK and SMK) tests [42,43].
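A minimal sketch of how these checks could be run in Python, assuming a monthly TS series stored in a pandas Series `ts`; the statsmodels KPSS implementation and the third-party pymannkendall package are used here as stand-ins for the exact test implementations applied in this study, and the series itself is synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import kpss
import pymannkendall as mk

# Hypothetical monthly soil temperature series (deg C), 1982-2020
idx = pd.date_range("1982-01-01", "2020-12-01", freq="MS")
ts = pd.Series(8 + 12 * np.sin(2 * np.pi * idx.month / 12)   # seasonal cycle
               + 0.002 * np.arange(len(idx))                  # weak warming trend
               + np.random.normal(0, 1, len(idx)), index=idx)

# KPSS test: null hypothesis = stationary (around a trend with regression="ct")
stat, pval, _, crit = kpss(ts, regression="ct", nlags="auto")
print(f"KPSS stat={stat:.3f}, p-value={pval:.3f}")

# Mann-Kendall and seasonal Mann-Kendall trend tests
print(mk.original_test(ts))                # non-seasonal MK
print(mk.seasonal_test(ts, period=12))     # SMK with monthly seasonality
```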
Jumps appear in a time series as sudden increases or decreases and can be induced by both human and natural sources. To evaluate this factor, the non-parametric Mann–Whitney (MW) test can be performed [42]. More details about the pre-processing steps can be found in Zeynoddin et al. [22].
The autocorrelation function (ACF) is a tool to evaluate a dataset's randomness and to assess whether a non-linear or linear model would be appropriate. In this way, it can be determined whether an observation is related to adjacent observations. Additionally, specific patterns in a time series, such as sinusoidal fluctuations interpreted as periodicity, can be detected, allowing the condition of the time series to be judged and interpreted. An ACF plot is an intuitive way to assess the data conveniently. The ACF shows the autocorrelation of data separated by a lag distance. Both positive and negative ACF values express the correlation between lags; however, if the ACF values at several consecutive lags exceed specific thresholds, in this case the 95% significance levels, those lags are interpreted as significantly correlated. These significant correlations should be reduced or removed from the time series to obtain independent variables and reduce the impact of the deterministic components on the modeling procedure. As mentioned earlier, the deterministic components are the predictable part of the series and are responsible for dependency in the time series. This part of the time series structure can be separated before modeling and restored to the modeled data; the stochastic component is the part of primary interest for modeling purposes. The significance intervals are defined based on the cumulative distribution function of the standard normal distribution at the 95% significance level.
5. Discussion
It was mentioned that a significant seasonal trend exists in these time series. To exhibit this seasonal trend and assess its significance, the monthly values of different years were separated from the series and the MK test was applied again. The test results indicated a trend in each month of the TS time series, some of which are significant. In Quebec City, the months of August, September, October, and November, and in Montreal, September, October, and November, had a significant positive trend for the different soil layer depths. Each location had a positive trend for seven months of the year, which means that the soil is generally getting warmer.
Figure 11 presents the time series of the first soil layer (0–10 cm) for August, September, October, and November in Quebec City, along with their linear trend lines, as examples of the progressive warming in the TS time series.
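A sketch of the per-month trend check described above, again using pymannkendall on the hypothetical `ts` series: the values of each calendar month are collected across years and tested separately.

```python
import pymannkendall as mk

# Test the across-year trend of each calendar month separately
for month in range(1, 13):
    monthly = ts[ts.index.month == month]          # e.g., all August values, 1982-2020
    res = mk.original_test(monthly)
    print(f"Month {month:2d}: trend={res.trend:>10s}  p={res.p:.3f}  slope={res.slope:.4f}")
```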
Qian et al. [56] performed extensive research on the TS trend and its association with trends in climatic variables across Canada. They investigated the TS trend at 30 stations across Canada from 1958 to 2008. Their research showed a positive trend in TS at 60% of the stations, significant in the spring and summer means. They additionally reported a positive correlation between the TS trend and the trends in snow cover thickness, air temperature, rain and snowfall, and total precipitation. The correlation between these parameters was additionally surveyed in other studies [10,57]. Later, Assani et al. [58] performed a similar investigation on maximum and minimum air temperature and rainfall, exclusively for Quebec province. They reported significant trends in these variables during summer, indicating that summer warming in southern Quebec is more pervasive than winter warming. Considering these findings, the progressive warming in soil temperature can be explained by reduced rainfall during the warm season and decreased soil moisture, especially in the mentioned months. Consequently, water exploitation typically increases during this season. The increase in air temperature and, at a large scale, the impact of global warming have caused changes in the overall trend of soil temperature. Some researchers have even reported a starker trend in soil temperature compared with air temperature [59,60], meaning that the trend in the former is more noticeable than that in the latter. Monitoring and controlling soil moisture can be a short-term solution to controlling the upward TS trend. However, unmanaged water exploitation can significantly reduce groundwater and decrease soil moisture. This factor is essential as it directly impacts the soil's thermal conductivity and can lower it to one-eighth of its original value [61]. Moreover, the heat-retaining capacity of the soil is negatively affected, since the specific heat of pure water (4.2 J/g°C) is much higher than that of clay minerals (0.76 J/g°C), a dominant component of the soil [10]. To understand the magnitude of the impact, it should be mentioned that raising the temperature of 1 g of dry clay soil by 1 °C requires about 1.1 J, whereas 1 g of wet clay soil requires on average 2.21 J, under laboratory conditions with 25% water content and a density of 1300 kg/m3. However, this value varies for different conditions and depends on soil texture, water content, and soil density [62].
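As a rough, idealized illustration of why wet soil stores markedly more heat, a simple mass-weighted mixing rule can be sketched as follows; this neglects density, texture, and bound-water effects, so it does not exactly reproduce the laboratory values reported in [62]:
\[
c_{\mathrm{bulk}} \approx (1 - w)\,c_{\mathrm{solid}} + w\,c_{\mathrm{water}}, \qquad
w = 0.25,\; c_{\mathrm{solid}} = 0.76\ \mathrm{J\,g^{-1}\,{}^{\circ}C^{-1}},\;
c_{\mathrm{water}} = 4.2\ \mathrm{J\,g^{-1}\,{}^{\circ}C^{-1}}
\;\Rightarrow\; c_{\mathrm{bulk}} \approx 1.62\ \mathrm{J\,g^{-1}\,{}^{\circ}C^{-1}}.
\]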
Soil moisture and temperature are the two essential parameters that affect soil respiration. The relationship between these parameters and respiration can be estimated using several models, e.g., quadratic and exponential models. As mentioned earlier, these parameters are strongly correlated, and several studies have been carried out to distinguish their effects from one another. In this regard, the estimation of soil moisture given soil temperature available at multiple or limited depths has been debated by many scholars, and several methods have been proposed [63,64,65,66]. These parameters are essential for the long-term prediction of drought and for food-security monitoring [39,67,68]. Other applications include irrigation planning, water resources management, flow forecasting, and hydrological land–atmosphere studies.
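As a brief sketch of the exponential type of model mentioned above, soil respiration R could be related to soil temperature as R = a·exp(b·TS); the coefficients, observations, and the use of SciPy's curve_fit here are illustrative assumptions, not values from this study.

```python
import numpy as np
from scipy.optimize import curve_fit

def resp_model(t_soil, a, b):
    """Exponential soil-respiration model R = a * exp(b * TS)."""
    return a * np.exp(b * t_soil)

# Hypothetical paired observations of TS (deg C) and soil respiration flux
t_obs = np.array([2.0, 6.0, 10.0, 14.0, 18.0, 22.0])
r_obs = np.array([0.6, 0.9, 1.4, 2.1, 3.2, 4.9])

(a_hat, b_hat), _ = curve_fit(resp_model, t_obs, r_obs, p0=(0.5, 0.1))
q10 = np.exp(10 * b_hat)          # Q10: factor of respiration increase per 10 deg C warming
print(f"a={a_hat:.2f}, b={b_hat:.3f}, Q10={q10:.2f}")
```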
Higher temperatures are linked to climate change [69]. Crop water demand and evapotranspiration increase as the temperature rises; consequently, soil moisture decreases, and soil droughts result from this phenomenon [70]. As the TS level increases over time at both Quebec City and Montreal, soil moisture reduction disrupts irrigation scheduling in terms of both the amount of water and the timing of applications. More water pumping is needed to compensate for the reduced soil moisture, and the additional supply of water leads to more energy consumption [71]. However, the interactions between crop water requirements, soil temperature and moisture, evapotranspiration, and other hydro–climatic factors are complex, and access to records of these parameters can additionally be limited. Thus, further studies of these factors and their interactions, and the use of new data collection methods, are required. Approaches such as the FLDAS, the development of land surface models, the integration of deep learning methods into the processing and simulation of different hydro–climatic factors, and access to satellite data (i.e., data-driven Earth system science) are needed [72,73,74].
6. Validation of the SS Method
Access to high-quality field data has always been problematic due to financial and operational resource constraints. Therefore, remotely sensed data and model-based simulations have been used for the past couple of decades. On the other hand, evaluating and validating these methods has been equally crucial for scholars and users; hence, various validation studies are routinely carried out. In our case, the Noah LSM, the central part of the FLDAS, has been the center of attention for scholarly work with field-data alternatives. For instance, Xia et al. [75] validated Noah soil temperature products by comparing them with field datasets from 137 stations in the United States over different time scales (e.g., daily, monthly, and annual) and different soil layer depths, from 0 to 200 cm. These authors declared that the model products generally emulated the observational field data. Although they found some bias at greater soil layer depths and a decreased accuracy of the model results, they stated that the Noah modeling skill is acceptable. Other articles additionally validated the Noah products for different soil conditions and locations [76,77]. The FLDAS inputs are additionally monitored and evaluated by the National Aeronautics and Space Administration (NASA) and many other scholars. Therefore, in this paper, we concentrated on validating the results of our model at different locations. Following the assessments of the obtained time series characteristics for these two locations, the proposed methodology was used to model the datasets. The results are provided in Table 5 and Table 6 for the dynamic and 12-step-ahead forecasts.
The same prediction pattern as for QC and MC is seen in these tables. The reduction in accuracy with an increase in soil layer depth is obvious. An intercomparison of the models reveals that the smoothing parameters related to the trend (β) and the seasonal factor (γ) decrease with increasing soil layer depth, as they are almost equal to 0 or very small for the 10–40 cm and 40–100 cm layers. For instance, in Table 3 and Table 4, the seasonal parameter for the 10–40 cm soil layer is equal to 0.008, and the trend factor is approximately 0.014 for the MC datasets. Nevertheless, after raising the seasonal parameter to 0.907 and increasing its involvement in the modeling equation, the correlation of the models was enhanced. In the QC results, this pattern can be seen for both the β and γ parameters; however, it is not as obvious. After modeling the datasets of the validation points, the same problem is seen in the models. This means that these important factors are not adequately involved in the modeling procedure, which can be due to the nature of the data or to the inability of the optimization method to find appropriate parameters. The reason for this problem can be investigated by using other datasets of a different nature (e.g., soil moisture) and/or by changing the optimization method and using metaheuristic optimization methods to find the best-suited parameters (e.g., a genetic algorithm, particle swarm, or teacher–learner optimization methods) [38]. In the end, as with the previous datasets, future forecasts based on the trained model parameters were conducted, and the results are displayed in Figure 12. Shaded areas in this figure are the out-of-sample forecasts (12-step-ahead (fs12)) from 1 November 2020 to 1 October 2021.
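For illustration, the following is a minimal sketch of an additive exponential smoothing (Holt–Winters) model in statsmodels, used here as a stand-in for the paper's SS implementation (the authors' exact state-space formulation and optimizer may differ). It shows how the seasonal smoothing weight γ could be fixed at a larger value instead of being left to the default optimizer, continuing with the hypothetical monthly series `ts` from the sketch in Section 2.

```python
from statsmodels.tsa.holtwinters import ExponentialSmoothing

model = ExponentialSmoothing(ts, trend="add", seasonal="add", seasonal_periods=12)

# Default fit: alpha (level), beta (trend), and gamma (seasonal) are all optimized
fit_auto = model.fit(optimized=True)

# Alternative fit: hold the seasonal smoothing weight gamma at a larger fixed value
fit_fixed = model.fit(smoothing_seasonal=0.907, optimized=True)

print(fit_auto.params["smoothing_seasonal"], fit_fixed.params["smoothing_seasonal"])
print(fit_fixed.forecast(12))     # 12-step-ahead out-of-sample forecast (fs12)
```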
7. Conclusions
Soil temperature (TS) is crucial in environmental processes. However, obtaining high-quality field data is challenging due to resource constraints. Remotely sensed data and model-based simulations have been used as alternatives; however, they have limitations such as missing data and temporal lags. The FLDAS project addresses this issue by providing reliable data for regions without soil temperature measurements, although it still has a temporal lag. This study focused on TS at two locations (QC and MC) across different soil layers. An SS method was developed to model the time series, and the proposed techniques were validated through 1-, 6-, and 12-step-ahead forecasts of TS data.
The results showed that the proposed SS model could fill the temporal gap with an average accuracy, across different depths at the QC and MC locations, of R2 = 0.88 and RMSE = 2.073 °C for the dynamic model, R2 = 0.834 and RMSE = 2.979 °C for the 6-step-ahead forecasts, and R2 = 0.921 and RMSE = 1.865 °C for the 12-step-ahead forecasts. The SS method can adjust its parameters as the model progresses, resulting in increased accuracy with more forecasting steps. However, the model's level component is sensitive to extremes, and the SS seasonal parameter is not adequately incorporated, leading to over- and underestimations. Future studies should explore alternative optimization methods and different types of time series to address this issue. Additionally, a trend assessment revealed a significant warming trend in the datasets of both locations, which emphasizes the need for close monitoring of TS in these areas and consideration of influencing factors such as groundwater level, surface soil moisture, air temperature, and evapotranspiration.
The proposed methodology does not require separate data preprocessing, as the SS model performs multiple preprocessing steps simultaneously, and it can be applied to any study location covered by the FLDAS. However, a major drawback is its reliance on historical data, and the FLDAS only provides monthly datasets starting from 1982. It is recommended to additionally model and compare the FLDAS data with other linear and machine learning models.