In this paper, a novel integrated BiLSTM model is proposed that combines the daily average ST prediction and the ST amplitude prediction to obtain the hourly ST. Compared with traditional machine learning models, the integrated BiLSTM exhibits clear advantages in predicting the hourly ST for various climates.
Our results are presented in four parts. First, the proposed algorithm was used to predict the ST at different observation stations, and the results were compared with those of other traditional algorithms. Second, the ST prediction performances of the algorithms at different soil depths were compared. Third, the ST prediction performances of the algorithms under different climate types were compared. Finally, the algorithm proposed in this paper was compared with algorithms reported in the literature.
4.1. Model Comparisons
The prediction performance of the different models at each observation station is shown in Figure 6, Figure 7 and Figure 8. As presented in Figure 6, the best performance is obtained by the integrated BiLSTM model, whose root mean squared error (RMSE) is lower than that of every benchmark algorithm at all 30 sites, whereas the worst performance is obtained by the LR algorithm, whose RMSE is the highest at all sites. Based on the RMSE index, the integrated BiLSTM is 4.8–18.7%, 8.8–23.6%, 20.9–25.7%, 12.8–25.1%, 17.0–45.0%, and 42.0–65.1% more accurate than LSTM, BiLSTM, DNN, RF, SVR, and LR, respectively.
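For reference, the two error metrics and the "percent more accurate" comparison can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names and the sample temperatures are hypothetical, and the relative-improvement formula (reduction in error divided by the baseline error) is an assumption about how such percentages are usually derived.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

def relative_improvement(err_model, err_baseline):
    """Percent reduction in error relative to a baseline (assumed definition)."""
    return 100.0 * (err_baseline - err_model) / err_baseline

# Hypothetical hourly soil temperatures in °C, for illustration only.
observed = [10.0, 11.0, 13.0, 12.0]
model_a = [10.5, 11.5, 12.5, 12.0]   # smaller errors
model_b = [11.0, 12.0, 12.0, 13.0]   # off by 1 °C at every hour

gain = relative_improvement(rmse(observed, model_a), rmse(observed, model_b))
print(f"model_a RMSE is {gain:.1f}% lower than model_b RMSE")
```

Because RMSE squares each residual before averaging, it penalizes occasional large errors more heavily than MAE, which is one reason the two indices can rank models differently at individual stations.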
The mean absolute error (MAE) of every model at each observation station is given in Figure 7. As can be seen, the integrated BiLSTM performs better than the benchmark models at 27 observation sites (90% of the total number of stations) and, based on the MAE index, is 2.6–18.7%, 5.7–21.5%, 20.3–25.7%, 9.8–22.1%, 17.5–44.1%, and 42.2–65.1% more accurate than LSTM, BiLSTM, DNN, RF, SVR, and LR, respectively. At station 2215, the BiLSTM algorithm obtains the best MAE value of 1.06 °C, whereas the MAE of the integrated BiLSTM algorithm is 1.07 °C, only 0.01 °C higher. Similarly, LSTM obtains the lowest MAE at stations 2031 and 2147, with values of 1.14 °C and 1.12 °C, which are 0.06 °C and 0.02 °C lower than those of the integrated BiLSTM, respectively. In other words, the difference between the integrated BiLSTM and the best algorithm at these three stations is small.
Figure 8 shows the R² generated by every model at each observation station. As the figure indicates, the integrated BiLSTM was the best model, with the highest R² value at each observation station. Based on the R² index, the integrated BiLSTM is 1.3–4.8%, 2.4–8.0%, 5.1–8.3%, 3.2–9.3%, 4.4–45.0%, and 23–57.4% more accurate than LSTM, BiLSTM, DNN, RF, SVR, and LR, respectively. The performance of the LSTM model is slightly better than that of the BiLSTM, and the performance of the LR model is still not ideal. We also noted that the R² value of the LR is less than zero at the two sites labeled 2218 and 2147, meaning that LR fits worse there than a constant prediction at the observed mean, which indicates that the LR model is unsuitable for processing data with a nonlinear correlation.
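A negative R² is possible because the coefficient of determination compares a model's squared error with that of a constant predictor equal to the mean of the observations. The sketch below uses the standard definition R² = 1 − SS_res/SS_tot; the function name and all data values are hypothetical and are not taken from the experiments.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations in °C (mean is 12.0), for illustration only.
observed = [10.0, 12.0, 14.0, 12.0]
close_fit = [10.5, 11.5, 13.5, 12.0]        # tracks the observations
mean_only = [12.0, 12.0, 12.0, 12.0]        # constant predictor at the mean
poor_fit = [14.0, 10.0, 10.0, 14.0]         # moves against the observations

for name, pred in [("close_fit", close_fit), ("mean_only", mean_only), ("poor_fit", poor_fit)]:
    print(name, r2_score(observed, pred))
```

The constant predictor scores exactly zero, so any model whose residual sum of squares exceeds the total sum of squares falls below zero.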
Although the RF model is not as good as the integrated BiLSTM, LSTM, and BiLSTM with respect to the three indicators (RMSE, MAE, and R²), it can be clearly seen from Figure 6, Figure 7 and Figure 8 that the results of the RF model agree well with those of these three deep learning models, and its predictions at multiple sites are better than those of the DNN, which confirms the potential of the RF model for estimating the hourly ST.
The best and worst statistical results for each model, together with the identification numbers of the observation sites that produced them, are given in Table 3 and Table 4, respectively. The RMSE, MAE, and coefficient of determination (R²) are used as the evaluation criteria. Table 3 and Table 4 clearly show that the integrated BiLSTM model developed in this study performed best. For the 30 observation stations involved in the experiment, the RMSE, MAE, and R² values obtained for the integrated BiLSTM are within the ranges of 0.95–2.53 °C, 0.76–1.99 °C, and 0.823–0.976, respectively.
Among the deep learning methods, the DNN algorithm is not as good as the integrated BiLSTM, LSTM, and BiLSTM for predicting the hourly ST. As mentioned above, the results of the DNN are sometimes even worse than those of the RF: as can be seen from Table 4, the worst R² obtained by the DNN is 0.613, whereas the worst R² of the RF is 0.656.
The statistical performance of each model in estimating the hourly ST is shown in Table 5; the values are averages over 10 runs of each model. According to Table 5, the maximum R² and the minimum RMSE and MAE (the best results), with values of 0.923, 1.53 °C, and 1.22 °C, respectively, were obtained using the integrated BiLSTM. By contrast, the minimum R² and the maximum RMSE and MAE, with values of 0.518, 3.43 °C, and 2.76 °C, respectively, were obtained using the LR. The integrated BiLSTM, BiLSTM, and LSTM perform better than the DNN, RF, SVR, and LR. The DNN is inferior to the LSTM-based methods in processing time series data. Among the deep learning methods, the integrated BiLSTM achieves the best prediction results for the hourly ST, whereas among the traditional machine learning methods, the random forest model achieves the best performance.
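As a quick arithmetic check on the Table 5 averages quoted above, the gap between the best and worst models can be expressed as a relative error reduction. The snippet is only a sketch: the dictionary layout and the `percent_reduction` helper are hypothetical, and the reduction formula (difference divided by the baseline error) is an assumed definition; the numeric inputs are the averages stated in the text.

```python
# Averages quoted in the text for Table 5 (RMSE and MAE in °C).
integrated_bilstm = {"rmse": 1.53, "mae": 1.22, "r2": 0.923}
lr = {"rmse": 3.43, "mae": 2.76, "r2": 0.518}

def percent_reduction(model_err, baseline_err):
    """Percent reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_err - model_err) / baseline_err

rmse_gain = percent_reduction(integrated_bilstm["rmse"], lr["rmse"])
mae_gain = percent_reduction(integrated_bilstm["mae"], lr["mae"])

print(f"RMSE reduction vs LR: {rmse_gain:.1f}%")
print(f"MAE reduction vs LR:  {mae_gain:.1f}%")
```

Under this assumed definition, both reductions come out at roughly 55%, which lies inside the 42.0–65.1% and 42.2–65.1% per-station ranges reported for LR in Section 4.1.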
4.2. Model Performance at Different Depths
Figure 9 shows the performance of each model in estimating the hourly ST at different soil depths. With respect to RMSE, MAE, and R², the integrated BiLSTM model generally shows the best performance at all soil depths, and the LSTM model is the second-best prediction algorithm.
Except for RF, the accuracy of the models (based on the RMSE and MAE) increases with depth. Taking the integrated BiLSTM model as an example, the highest RMSE and MAE values (the worst results) are obtained at 5 cm, whereas the lowest values (the best results) are obtained at 100 cm. With respect to R², the accuracy of the integrated BiLSTM model increases from 5 to 20 cm and decreases from 50 to 100 cm, whereas the R² values of the other models decrease with increasing depth. This result differs from the conclusions in [21] and [53], whose experimental results show that the accuracy of machine learning algorithms gradually decreases from a soil depth of 10 cm to 100 cm. This difference may be related to the different models used and the different climatic conditions.
We noticed that, at 5 cm, the integrated BiLSTM obtains the lowest RMSE of all models, 2.04 °C, which is 0.1 °C lower than that of the second-best model. The difference gradually increases to 0.31 °C at 50 cm and then decreases to 0.23 °C at 100 cm. In other words, from 5 to 50 cm, the gap between the integrated BiLSTM model and the second-best model widens, indicating that the integrated BiLSTM has potential for deeper ST prediction tasks.
4.3. Model Performance under Different Climates
The statistical performance of each algorithm in predicting the hourly ST under different climates is shown in Table 6, Table 7 and Table 8. As can be seen from these tables, the best performances were generally obtained using the integrated BiLSTM, which produced the best results for four climates (80% of the total number) for RMSE, five climates (100%) for MAE, and five climates (100%) for R². The LR was the worst model, producing the minimum R² and the maximum RMSE and MAE under all climate types involved in the experiment. With respect to the RMSE, the BiLSTM performed slightly better than the integrated BiLSTM under the BWh climate.
For the five models, namely, integrated BiLSTM, LSTM, BiLSTM, DNN, and RF, the lowest RMSE and MAE were obtained under the Dfa climate type, whereas the highest RMSE and MAE were generated under the BWh or Csa climate. This does not mean, however, that these models perform better under the Dfa climate. Based on the statistics, the average soil temperature measured in the Dfa climate is 12.21 °C, whereas the average soil temperatures of BWh and Csa are 21.68 °C and 21.72 °C, respectively. If the relative error (the ratio of the error to the true value) is calculated, the relative error of each algorithm is the highest in the Dfa climate. In other words, the performance of each model under the Dfa climate (snow areas) is not as good as that in warm or dry areas (BSk, BWh, Cfa, and Csa). It is also clear from Table 8 that the R² of these five models is the lowest under the Dfa climate type.
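The relative-error argument can be made concrete with the mean soil temperatures quoted above. The following sketch is illustrative only: it assumes that the relative error is approximated by dividing an absolute error by the climate's mean ST, the helper names are hypothetical, and the 1.0 °C error is an arbitrary example value rather than a result from Tables 6–8.

```python
# Mean soil temperatures quoted in the text (°C).
mean_st = {"Dfa": 12.21, "BWh": 21.68, "Csa": 21.72}

def relative_error(abs_error, mean_value):
    """Absolute error as a fraction of the mean observed value (assumed proxy)."""
    return abs_error / mean_value

def break_even_ratio(cold, warm):
    """Ratio err_cold / err_warm at which both climates have equal relative error."""
    return mean_st[cold] / mean_st[warm]

# The same arbitrary 1.0 °C error weighs more against a lower mean temperature.
for climate in ("Dfa", "BWh", "Csa"):
    print(climate, f"{100.0 * relative_error(1.0, mean_st[climate]):.1f}%")

ratio_bwh = break_even_ratio("Dfa", "BWh")
ratio_csa = break_even_ratio("Dfa", "Csa")
print(f"break-even vs BWh: {ratio_bwh:.3f}, vs Csa: {ratio_csa:.3f}")
```

Under this assumption, the absolute error in the Dfa climate would have to be below about 56% of the error in BWh or Csa before its relative error became the smaller of the two, so a lower RMSE or MAE alone does not establish better performance.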