1. Introduction
The global navigation satellite system (GNSS), which includes the US GPS, the European Union's Galileo, Russia's GLONASS, and China's BeiDou system, has achieved great success, with an unprecedented impact on all positioning-related areas. It not only provides global users with navigation, positioning, speed measurement, and timing services, but also offers a source of opportunity in the form of L-band microwave signals with high temporal resolution. With the further development of GNSS, the signals reflected from a target can also be received and utilized [
1,
2,
3]. GNSS reflected signals have since been employed to detect targets. This new remote sensing concept, called GNSS-reflectometry (GNSS-R), requires no dedicated radar transmitter. It is also a low-cost option with wide global coverage and a large volume of data acquisition, and it can be a powerful complement to traditional remote sensing methods.
GNSS-R can be regarded as a bistatic radar system. Over the past 20 years, theoretical [
4] and experimental [
5] studies have demonstrated the potential of GNSS-R in remote sensing measurements. There are two main types of GNSS-R application: altimetry and scatterometry. The GNSS-R technique was first proposed for ocean altimetry [
5], which remains one of its main applications. Altimetry uses the propagation delay of the reflected signals (from the waveform or the carrier phase) to measure the surface elevation [
6,
7]. The other main GNSS-R application is scatterometry, proposed by Hall and Cordey [
4], which uses the power and shape of the waveform (or of the delay-Doppler map, DDM) to characterize the surface roughness or reflectivity for wind speed retrieval [
8,
9,
10,
11], soil moisture measurement [
12,
13], or sea ice detection [
10,
14]. In addition, with its continuous development, GNSS-R remote sensing has been applied in many other fields, such as the measurement of snow depth [
15], tsunamis [
16], vegetation biomass [
17], flood inundation [
18], and inland water [
19,
20]. The experimental platform has also evolved from ground-based experiments [
21] to aircraft [
12], balloons [
22], and, most recently, low-Earth-orbit satellite [
23] platforms for measuring hurricanes.
In 2002, NASA took the lead in launching a series of soil moisture remote sensing flight experiments (SMEX02-03) using GPS reflected signals. The system effectively measured how the signal power varies with the soil moisture content [
12]. In this bistatic radar configuration, two antennas were used to receive the direct signal from the satellite and the signals reflected from the ground, respectively. A right-hand circularly polarized (RHCP) antenna was oriented toward the sky, while a left-hand circularly polarized (LHCP) antenna (single polarization), optionally complemented by a second RHCP antenna (dual polarization), was pointed perpendicular to the ground [
21]. The dielectric constant was solved from the soil reflectivity and the bistatic radar equation. The soil water content was then obtained from one of several permittivity to soil moisture inversion models. As an extension of the earlier work, a calibration process was added to the subsequent soil moisture remote sensing experiment, and a new reflectometer was used to record data from visible satellites at high elevation angles (greater than 65°). The results showed that the calibrated soil reflectivity could be detected and used to estimate the expected relationship between the dielectric constant and soil moisture [
24].
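The two-step retrieval described above (reflectivity to permittivity, then permittivity to soil moisture) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the experiments cited here: a smooth surface, a real-valued permittivity, the standard Fresnel cross-polarized (RHCP to LHCP) reflection coefficient, and the Topp et al. (1980) polynomial as a stand-in for the permittivity to soil moisture model. The function names and the 70° elevation angle are hypothetical, and antenna gains, surface roughness, and calibration are ignored.

```python
import math


def lhcp_reflectivity(eps, elev_deg):
    """Smooth-surface power reflectivity for an RHCP signal received as LHCP.

    eps is the (real) relative permittivity of the soil and elev_deg the
    satellite elevation angle in degrees.
    """
    s = math.sin(math.radians(elev_deg))
    q = math.sqrt(eps - (1.0 - s * s))        # sqrt(eps - cos^2(elevation))
    gamma_hh = (s - q) / (s + q)              # horizontal polarization
    gamma_vv = (eps * s - q) / (eps * s + q)  # vertical polarization
    gamma_lr = 0.5 * (gamma_vv - gamma_hh)    # cross-polarized RHCP -> LHCP
    return gamma_lr ** 2


def invert_permittivity(reflectivity, elev_deg, lo=1.0, hi=80.0, tol=1e-9):
    """Solve lhcp_reflectivity(eps) = reflectivity for eps by bisection.

    Assumes the reflectivity increases with eps on [lo, hi], which holds for
    the high elevation angles considered here.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhcp_reflectivity(mid, elev_deg) < reflectivity:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def topp_soil_moisture(eps):
    """Volumetric soil moisture (cm^3/cm^3) from the Topp et al. (1980) polynomial."""
    return -5.3e-2 + 2.92e-2 * eps - 5.5e-4 * eps ** 2 + 4.3e-6 * eps ** 3


# Round trip with hypothetical values: simulate a reflectivity, then invert it.
measured = lhcp_reflectivity(15.0, 70.0)
eps_hat = invert_permittivity(measured, 70.0)
print(round(eps_hat, 3), round(topp_soil_moisture(eps_hat), 3))
```

In a real campaign the reflectivity would come from the calibrated ratio of reflected to direct power rather than from a simulated value, and the choice of permittivity model depends on the soil type.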
Researchers subsequently proposed another approach, the interference pattern technique (IPT), to retrieve the soil moisture content [
25]. An LHCP or vertically polarized antenna, oriented towards the horizon, was used to receive the interference between the direct and reflected signal paths. The ground-based SMIGOL reflectometer measured the instantaneous power resulting from the interference of the direct signal and the signal reflected from the ground. The soil moisture was then determined from the position of the notch, that is, the point at which the amplitude of the fluctuation of the instantaneous power is smallest.
A similar approach uses GPS multipath signals to retrieve the soil moisture content with only one antenna and a conventional GNSS receiver [
26,
27,
28]. A representative result [
29] is from the University of Colorado, USA. The experiment used an RHCP antenna pointing to the sky and a geodetic-quality GPS receiver to receive the direct signals together with the land-surface reflected signals that cause multipath effects. The soil moisture content can be obtained from the signal-to-noise ratio (SNR) of the received signal, and the method can also be applied to sensing other phenomena, such as the inverted barometer effect and storms [
30].
At present, various space-based observation experiments are being actively carried out, and many countries are promoting related applications [
31]. Following the 2003 launch of the UK-DMC satellite, which carried equipment for receiving GPS reflected signals [
32], the international exploration of spaceborne GNSS-R observations has developed rapidly. For example, the UK TDS-1 satellite, launched from Kazakhstan in 2014, is equipped with SGR-ReSI (Space GNSS Receiver–Remote Sensing Instrument) sensors for GNSS-R measurements [
33], which are currently used for soil moisture inversion studies [
34]. NASA launched the CYGNSS observation constellation in December 2016.
In particular, significant results on soil moisture content (SMC) have been obtained from spaceborne data. For instance, the sensitivity of GNSS-R observables to soil moisture (SM) was studied in detail using TDS-1 data [
35]. The sensitivity of the calibrated GNSS-R reflectivity to surface soil moisture was found to be ~0.09 dB/% at an incidence angle of ~30° and to decrease as the angle of incidence increased. In another study, the first global-scale assessment of GNSS-R from the Soil Moisture Active Passive (SMAP) mission, the scattering properties over land were evaluated for soil moisture and biomass determination; the results showed sensitivity to the effects of the Earth’s topography and above-ground biomass (AGB), even over Amazonian and boreal forests [
36]. For the CYGNSS mission, the influence of the GNSS satellites’ elevation angle on the LHCP reflectivity was revealed as a function of SMC and the effective surface roughness parameter [
37]. The relationship between forward-scattered L-band GNSS signals recorded by the CYGNSS constellation and SMAP SM was also studied [
38]. That study showed that the sensitivity of CYGNSS to SM varies spatially and can be used to convert reflectivity into estimates of SM. The unbiased root-mean-square difference between daily averaged CYGNSS-derived SM and SMAP SM is 0.045 cm³/cm³, and it is similarly low between CYGNSS and in situ SM. The development of spaceborne sensors has greatly promoted related studies on a global scale.
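As an illustration of the metric quoted above, the unbiased root-mean-square difference (ubRMSD) removes the mean of each series before the differences are squared, so a constant offset between two products does not count as error. The following sketch assumes two equal-length series of daily values; the numbers are hypothetical and are not taken from CYGNSS or SMAP.

```python
import math


def rmsd(estimates, reference):
    """Root-mean-square difference between two equal-length series."""
    n = len(estimates)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimates, reference)) / n)


def ubrmsd(estimates, reference):
    """Unbiased RMSD: the RMSD after removing each series' own mean."""
    n = len(estimates)
    mean_e = sum(estimates) / n
    mean_r = sum(reference) / n
    return math.sqrt(sum(((e - mean_e) - (r - mean_r)) ** 2
                         for e, r in zip(estimates, reference)) / n)


# Hypothetical daily soil moisture values in cm^3/cm^3.
gnss_sm = [0.12, 0.18, 0.25, 0.31]
ref_sm = [0.10, 0.15, 0.24, 0.27]
print(rmsd(gnss_sm, ref_sm), ubrmsd(gnss_sm, ref_sm))
```

Because the squared ubRMSD equals the squared RMSD minus the squared mean bias, the ubRMSD never exceeds the RMSD.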
Meanwhile, many empirical and electromagnetic bistatic models have been developed [
39,
40,
41], enriching the knowledge of the scattering effects that take place in GNSS-R soil moisture retrieval. When building a model, it is crucial to select the features that have the greatest impact on the results so as to reduce the number of variables, yet this step is occasionally overlooked. Most studies focus only on the soil moisture retrieval algorithm itself. Moreover, the existing GNSS-R soil moisture retrieval methods are mostly based on analytical and semi-empirical models, which often need large amounts of experimental data and generalize poorly. The complex modeling process and the uncertainty of the experimental environment (such as inconsistency between the direct and reflected receiving channels and receiver noise) also directly affect the accuracy of the soil moisture estimate. There is therefore a clear need to evaluate the contribution and sensitivity of the input variables, which is valuable both for conducting experiments and for interpreting their behavior.
Soil moisture retrieval using GNSS-R can be regarded as a nonlinear regression problem in which the received data serve as the input features (variables). In addition to the traditional methods, XGBoost, a recent method based on the Boosting algorithm [
42] that is well suited to estimating variable importance, is introduced here to evaluate the contribution of each variable in GNSS-R.
The Boosting algorithm is a popular and effective ensemble learning algorithm in the field of data mining. By weighting and combining weak classifiers into a strong classifier, it effectively reduces the prediction error and yields more accurate classification results. Building on boosting, the Gradient Boosting algorithm was proposed, in which each new model is fitted along the gradient direction so as to further reduce the residuals of the previous model. An improved Gradient Boosting algorithm, Extreme Gradient Boosting (XGBoost), was subsequently proposed in 2015 [
42].
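The residual-fitting idea behind Gradient Boosting can be sketched in a few lines. The example below is a simplified illustration, not the XGBoost implementation: it assumes a squared-error loss (for which the negative gradient is exactly the residual), one-dimensional inputs, and single-split regression stumps as the weak learners. All function names and the toy data are hypothetical.

```python
import math


def fit_stump(x, residuals):
    """Least-squares regression stump (a single split) on 1-D inputs.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def gradient_boost(x, y, n_rounds=50, lr=0.1):
    """Additive model: mean of y plus lr times a sum of stumps fitted to residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return lambda v: base + lr * sum(s(v) for s in stumps)


def mse(model, x, y):
    """Mean squared error of a model on the given samples."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy nonlinear regression problem.
x = [i / 10 for i in range(20)]
y = [math.sin(v) for v in x]
baseline = sum(y) / len(y)
model = gradient_boost(x, y)
print(mse(lambda v: baseline, x, y), mse(model, x, y))
```

Each round shrinks the training error because the stump is fitted to what the current model still gets wrong; the learning rate `lr` controls how much of each correction is applied.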
In recent years, XGBoost has been widely used in store sales forecasting, hazard risk prediction, power load forecasting, and other fields [
43,
44,
45]. The most important reason for its success is its scalability in all scenarios. This scalability stems from several important system and algorithmic optimizations, including a new tree learning algorithm for handling sparse data and a weighted quantile sketch procedure that allows instance weights to be handled in approximate tree learning. In addition, parallel and distributed computing speeds up learning and thus enables faster model exploration. More importantly, XGBoost exploits out-of-core computing, enabling the user to process hundreds of millions of samples.
Unlike the traditional decision tree algorithm, XGBoost adds regularization terms, such as penalties on the leaf node weights and the size of the tree, to the cost function. These terms control the complexity of the model and thereby help to prevent over-fitting [
46]. It also uses a second-order Taylor expansion of the cost function, which brings the approximation of the objective function closer to its actual value and thus improves the prediction accuracy. In recent years, the XGBoost algorithm has achieved excellent results in machine learning and data mining thanks to its high operational efficiency and prediction accuracy [
47].
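For reference, the regularized objective and its second-order approximation can be written compactly. The notation below follows the commonly published XGBoost formulation and is an assumption here, since this section does not define symbols: $l$ is the loss, $f_k$ the $k$-th tree, $T$ its number of leaves, $w_j$ the weight of leaf $j$, $I_j$ the set of samples falling in leaf $j$, and $\gamma$, $\lambda$ the regularization coefficients.

```latex
\begin{align}
\mathcal{L} &= \sum_{i} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}, \\
\mathcal{L}^{(t)} &\approx \sum_{i} \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)
 + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^{2} \Bigr] + \Omega(f_t), \\
g_i &= \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}, \\
w_j^{*} &= -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
\end{align}
```

The last line gives the leaf weight that minimizes the approximated objective for a fixed tree structure; a larger $\lambda$ shrinks the leaf weights, while $\gamma$ penalizes every additional leaf.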
In this paper, the XGBoost learning method is used to help understand the behavior and contribution of the input variables of GNSS-R. By evaluating the contribution of the input variables (such as SNR, receiver noise…) with the XGBoost algorithm, the sensitivity of the retrieval results to the input variables is shown. In addition, ground-truth measurements (covering two typical soil types and different soil conditions) are used to confirm the XGBoost analysis and to investigate the performance of GNSS-R retrieval. The variation rate of the retrieved results with respect to the input variables is analyzed. This knowledge can support soil moisture retrieval and the modeling process. The paper is organized as follows: In
Section 2, the GNSS-R soil moisture retrieval and XGBoost algorithm are presented.
Section 3 focuses on the results obtained with XGBoost and presents the statistical analysis of the data from the ground-truth experiments. Finally, the discussion of the results and the conclusions are given in
Section 4.
4. Discussion
A major current focus of GNSS-R soil moisture research is to evaluate the sensitivity of different observables to SM. Previous work has mainly addressed satellite remote sensing of soil moisture, using datasets from the recently launched UK TechDemoSat-1 (TDS-1) and NASA CYGNSS (Cyclone Global Navigation Satellite System) satellites. The reflected power obtained from the spaceborne sensors was compared with SMAP/SMOS products. A strong, positive linear relationship was found between the reflected power/reflectivity and SM [
35,
38], as is also reported in this paper. The correlation of different GNSS-R observables with SM was found to be conclusive on a global scale. Experiments with in situ sensors at smaller spatial scales, however, require further study. Besides spaceborne missions, ground-truth experiments are a commonly used and favorable tool for GNSS-R applications.
We focus on evaluating the region of interest for different types of terrain using ground-truth measurements, in order to assess the effect on SM of the uncertainty in the received SNR and the elevation angle. From the perspective of GNSS-R bistatic retrieval, the GNSS-R parameters were regarded as the input data, and the TDR data were taken as the output of the linear fit. Note that, particularly in ground-truth measurements, several factors (e.g., interference from the equipment, the behavior of the radiation patterns, and complex environmental conditions) affect the received signal. The uncertainty of the input parameters (bias in the elevation angle and SNR) may lead to poor retrieval results that are sometimes hard to interpret.
The in situ GNSS-R measurements were carried out and the data were post-processed to obtain the permittivity and soil moisture content. We showed the correlation between the GNSS-R and TDR results. A good correlation between the TDR and SMC retrieval results was found for the satellite toward which the bar was directly pointing. As mentioned before, the TDR measurements were made in the footprint of the antenna, which corresponded to the Fresnel zone of the satellite that gave the expected results. Future studies could evaluate the permittivity with TDR equipment installed near or outside the footprint, in order to investigate the influence of the antenna pattern on SMC retrieval. In addition, the differences in wave propagation, penetration depth, and attenuation factor between the two sites need to be carefully considered before planning ground-based measurements. Clay minerals can hold “water” in their mineralogical network. This might be the reason why the expected SMC could not be retrieved, although many of these effects cause only a small bias. In particular, when the soil is saturated, GPS can sense only the top one or two centimeters of the soil [
29].
The most commonly used method of analyzing the quality of the soil moisture is a robust linear fit [
35]. To reveal the potential relationship between the incident angle, the SNR, and the SM, the incident angles of all satellites and the corresponding SNR values were collected and fitted with a robust linear fit to show the dependence of SM on these variables. The input of the linear fit is the GNSS-R input data, and the output is the TDR results. The highest observed sensitivity of SNR to SMC (TDR) was 3.84 dB/%, which is higher than previously reported [
35]; this may be reasonable, since all satellites were taken into account in this paper and the output TDR values, available for only two sites, were very critical. Unlike previous research, this paper aimed to investigate the degree to which the retrieval performance is influenced by the uncertainty of the input data. A further purpose, departing from the traditional approach, was to apply the XGBoost algorithm to GNSS-R data by adopting a data mining concept. Since machine learning algorithms attempt to extract implicit rules from large amounts of data, they can serve as a tool for uncovering a function, especially when that function is too complicated to be expressed formally. In this case, the input is the sample data, and the output is the expected result.
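A robust linear fit of the kind mentioned above can be sketched as iteratively reweighted least squares. The weighting function used in this paper is not stated in this section, so the example assumes Huber weights with the conventional tuning constant 1.345 and a simplified scale estimate (the median absolute residual divided by 0.6745). The function names and the data, which contain one deliberate outlier, are hypothetical.

```python
import statistics


def weighted_line(x, y, w):
    """Weighted least-squares line y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def robust_line(x, y, k=1.345, n_iter=20):
    """Iteratively reweighted least squares with Huber weights (an assumption)."""
    a, b = weighted_line(x, y, [1.0] * len(x))  # ordinary least-squares start
    for _ in range(n_iter):
        res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = statistics.median([abs(r) for r in res]) / 0.6745
        if scale == 0.0:
            break  # at least half of the points are fitted exactly
        # Points with large residuals are down-weighted, never discarded.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in res]
        a, b = weighted_line(x, y, w)
    return a, b


# Hypothetical data: a line with slope 0.5 and one gross outlier.
x = [float(i) for i in range(10)]
y = [2.0 + 0.5 * xi for xi in x]
y[9] += 20.0
a_ols, b_ols = weighted_line(x, y, [1.0] * len(x))
a_rob, b_rob = robust_line(x, y)
print(b_ols, b_rob)
```

The fitted slope plays the role of a sensitivity (the change in the output per unit change in the input), which is why a single outlying satellite pass can inflate an ordinary least-squares estimate while leaving the robust one largely unaffected.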
Several proven machine learning and neural network methods have been used to establish estimation models based on correlation-selected features and to retrieve soil moisture from SMOS data [
66,
67]. Machine learning and neural networks (the latter aimed at more complex problems and big data) are both methods of implementing artificial intelligence. Machine learning is a data modeling technique: it extracts an appropriate model from the given data in order to explain and predict. Like common statistical methods, it is a form of statistical learning, in which a computer derives a model from existing data and then uses that model to predict results. We also used the recently published Random Forest method to assess the variable importance, as shown in the following figures.
The results (
Figure 16,
Figure 17 and
Figure 18) obtained from the recently published Random Forest method [
66] were similar to those obtained with XGBoost. They also showed that SNR was the most sensitive of the input variables. The difference was that the importance values differed more between variables with XGBoost than with the Random Forest method. From the point of view of the algorithm mechanism, one reason could be that Random Forest uses majority voting for the final output, whereas XGBoost accumulates the results of every step. Another reason may be that the Random Forest method is not sensitive to parameter tuning, which suits beginners, whereas XGBoost requires time to be spent on optimization.
Compared with traditional statistical methods, machine learning algorithms are simpler and more flexible, and they are a good tool for finding the underlying rules and value of data, even in vast datasets. The strength of this study lies in using XGBoost, a recently developed ensemble machine learning method that is good at variable selection in data mining, to characterize the input variables in GNSS-R soil moisture retrieval. The XGBoost results showed a good correlation with the statistical analysis of the ground-truth measurements. Such an analysis is worthwhile before establishing models and can also help in understanding the underlying GNSS-R phenomena and interpreting the data.
5. Conclusions
In this paper, the performance of bistatic GNSS-R soil moisture retrieval was examined and analyzed with a machine learning aided method. We took a first step in using XGBoost, which offers high operating efficiency and prediction accuracy, to analyze the importance of the input variables in GNSS-R. A simulated data set was built and used for training and testing. The parameter ranges were set as close as possible to the experimental situation. Meanwhile, several optimization parameters (estimators, samples, and col-sample-tree) and different typical soil compositions were varied to verify the stability of the results. The SNR variable showed a higher contribution than the other variables (, , and ) in the GNSS-R input vectors, both when retrieving the permittivity and when obtaining the soil moisture content for different soil types. That is, the received SNR is the predominant variable, and the obtained permittivity and soil moisture content are much more sensitive to it, with an importance of at least 40% and at most 70%. Moreover, the variable showed the least importance (below 10%) in the GNSS-R soil moisture retrieval. In some extreme cases (after changing the algorithm parameters), the importance of variable is nearly zero, meaning that the obtained permittivity and soil moisture content are almost insensitive to it. However the algorithm parameters were adjusted, the order of the variable importance remained quite stable.
Note that low importance does not mean that a variable is unnecessary for the retrieval procedure. Rather, a variable with a higher contribution and importance is one whose accuracy is crucial for GNSS-R retrieval: the results are sensitive to it, and it matters for obtaining satisfactory results. Uncertainty in a variable of high importance causes a larger bias than uncertainty in a variable of low importance. From a practical perspective, this is significant for interpreting data and troubleshooting, particularly when a GNSS-R experiment yields unsatisfactory retrieval results.
In order to further validate and discriminate the characteristics of the different input variables, two GNSS-R ground-based campaigns with different soil conditions and compositions, corresponding to the soil compositions of the simulated data set, were carried out for performance analysis. The permittivity for the ground-truth measurements was given by TDR. The skyplot figure provides information about the elevation angles. Combining the information from the GNSS-R and TDR measurements, we used a polynomial regression method to fit the input variables (, ) to the permittivity and soil moisture results, respectively, in order to evaluate the variation rate of the retrieved results with respect to each input variable. It also showed that the input variable was a quite sensitive parameter, which affects the soil moisture results more than the variable . For the two typical soil types, another conclusion was that the rate of increase of SMC (or permittivity) with respect to in silty clay loam soil (higher permittivity condition) was higher than in loamy sand soil (lower permittivity condition).
This paper focused on understanding the importance of the input variables through the XGBoost algorithm and ground-truth measurements, in order to investigate the performance of the bistatic GNSS-R soil moisture retrieval method. Quantifying variable importance is not only important for constructing a soil moisture retrieval model but also critical in GNSS-R experiments for interpreting data and understanding the underlying phenomena. In particular, since the elevation angle determines the signal reception of the antennas, the findings of the paper are also helpful for GNSS-R receiver construction and impact analysis. These findings also improve our understanding of the input variables and extend the scope of machine learning applied in GNSS-R.
Further studies will be conducted to monitor a region over a long period of time, to take seasonal effects into account, and to evaluate the sensitivity of the different observables to SM on a regional scale. In addition, more types of terrain could be added, and the areas covered by the experiments should be expanded. The findings of the paper show the importance of the SNR, and further analysis of in situ data may provide more complete insight into how the received SNR can be used to retrieve SM [
38].