Comparison of Machine Learning Algorithms for Merging Gridded Satellite and Earth-Observed Precipitation Data

Papacharalampous, Georgia; Tyralis, Hristos; Doulamis, Anastasios; Doulamis, Nikolaos

doi:10.3390/w15040634

by Georgia Papacharalampous *, Hristos Tyralis, Anastasios Doulamis and Nikolaos Doulamis

Department of Topography, School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens, Iroon Polytechniou 5, 157 80 Zografou, Greece

* Author to whom correspondence should be addressed.

Water 2023, 15(4), 634; https://doi.org/10.3390/w15040634

Submission received: 17 December 2022 / Revised: 16 January 2023 / Accepted: 25 January 2023 / Published: 6 February 2023

(This article belongs to the Section New Sensors, New Technologies and Machine Learning in Water Sciences)


Abstract:

Gridded satellite precipitation datasets are useful in hydrological applications as they cover large regions with high density. However, they are not accurate in the sense that they do not agree with ground-based measurements. An established means for improving their accuracy is to correct them by adopting machine learning algorithms. This correction takes the form of a regression problem, in which the ground-based measurements have the role of the dependent variable and the satellite data are the predictor variables, together with topography factors (e.g., elevation). Most studies of this kind involve a limited number of machine learning algorithms and are conducted for a small region and for a limited time period. Thus, the results obtained through them are of local importance and do not provide more general guidance and best practices. To provide results that are generalizable and to contribute to the delivery of best practices, we here compare eight state-of-the-art machine learning algorithms in correcting satellite precipitation data for the entire contiguous United States and for a 15-year period. We use monthly data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset, together with monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm). The results suggest that extreme gradient boosting (XGBoost) and random forests are the most accurate in terms of the squared error scoring function. The remaining algorithms can be ordered as follows, from the best to the worst: Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks and linear regression.

Keywords:

benchmarking; big data; gradient boosting machines; PERSIANN; poly-MARS; random forests; remote sensing; satellite data correction; spatial interpolation; XGBoost

1. Introduction

Knowing the quantity of precipitation at a dense spatial grid and for an extensive time period is important in solving a variety of hydrological engineering and science problems, including many of the major unsolved problems listed in Blöschl et al. [1]. The main sources of precipitation data are ground-based gauge networks and satellites [2]. Data from ground-based gauge networks are precise; however, maintaining such a network with a high spatial density and for a long time period is costly. On the other hand, satellite precipitation data are cheap to obtain but not accurate [3,4,5,6].

By merging gridded satellite precipitation products and ground-based measurements, we can obtain data that are more accurate than the raw satellite data and, simultaneously, cover space with a much higher density compared to the ground-based measurements. This merging is practically a regression problem in a spatial setting, with the satellite data being the predictor variables and the ground-based data being the dependent variables. Such kinds of problems are also commonly referred to under the term “downscaling” and are special types of spatial interpolation. The latter problem is met in a variety of fields (see, e.g., the reviews by Bivand et al. [7], Li and Heap [8], Heuvelink and Webster [9], and Kopczewska [10]). Reviews of the relevant methods for the case of precipitation can be found in Hu et al. [11] and Abdollahipour et al. [12].

Spatial interpolation of precipitation by merging satellite precipitation products and ground-based measurements has been conducted at multiple temporal and spatial scales by using a variety of regression algorithms, including several machine learning ones. A non-exhaustive list of previous works on the topic and a summary of their methodological information can be found in Table 1. Notably, this table is indicative of the large diversity in the temporal and spatial scales examined and in the algorithms utilized.

Machine learning for spatial interpolation has gained prominence in various fields of environmental science [31]. These fields include, but are not limited to, the agricultural sciences [32], climate science [33,34], hydrology [35,36] and soil science [37,38]. Among the various machine learning algorithms, random forests seem to be the most frequently used ones (see the examples in Hengl et al. [39]). Notably, as machine learning algorithms do not model spatial dependence explicitly in their original form, efforts have been made to remedy this shortcoming, either directly [40] or indirectly [41,42,43,44]. By exploiting spatial dependence information, the algorithms become more accurate.

As noted earlier, machine learning algorithms constitute a major means for merging satellite products and ground-based measurements for obtaining precipitation data. However, their empirical properties are still not well known. This holds because most of the existing studies investigate a few algorithms and because their investigations may be limited in terms of the length of the time periods examined and the size of the geographical areas examined. Large-scale benchmark tests and comparisons that involve, by definition, many algorithms and, at the same time, are conducted over long time periods and large geographical regions could be useful in providing directions on which algorithm to implement in specific settings of practical interest; thus, they have started to appear in other hydrological sub-disciplines. Relevant examples are available in Papacharalampous et al. [45] and Tyralis et al. [46].

In this study, we work towards filling the above-identified gap. More precisely, we compare a larger number of machine learning algorithms than usual (see Table 1) with respect to how accurate they are in providing estimates of total monthly precipitation in spatial interpolation settings by merging gridded satellite products and ground-based measurements. In addition, the comparison is made for a long time period and for a large geographical area (again contrary to the most common strategy that appears currently in the literature), with this area also having a dense ground-based gauge network, thereby leading to trustworthy results for the monthly time scale. Moreover, proper evaluations are made according to theory and best practices from the field of statistics, with the methodological aspects developed in this endeavor contributing to the transfer of knowledge in the overall topic of spatial interpolation using machine and statistical learning algorithms.

The remainder of the paper is structured as follows: Section 2 describes the algorithms selected and the methodology followed for exploring the relevant regression setting. Section 3 presents the data and the validation procedure. Section 4 presents the results. Section 5 discusses the most important findings and provides recommendations for future research. Section 6 concludes the work.

2. Methods

2.1. Machine Learning Algorithms for Spatial Interpolation

Eight machine learning algorithms were implemented in this work for conducting spatial interpolation and were extensively compared with each other in the context of merging gridded satellite products and gauge-based measurements. In this section, we briefly describe these algorithms, while their detailed description can be found in Hastie et al. [47], James et al. [48] and Efron and Hastie [49]. Such a description is outside the scope of this work, as the implementations and documentation of the algorithms are already available in the R programming language. The R packages utilized are listed in Appendix A.

2.1.1. Linear Regression

A linear regression algorithm models the dependent variable as a linear weighted sum of the predictor variables [47] (pp. 43–55). The algorithm is optimized with a squared error scoring function.

2.1.2. Multivariate Adaptive Regression Splines

The multivariate adaptive regression splines (MARS) [50,51] model the dependent variable with a weighted sum of basis functions. The total number of basis functions (product degree) and associated parameters (knot locations) are automatically determined from the data. Herein, we implemented an additive model with hinge basis functions. The implementation was made with the default parameters.
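As a minimal illustration of what an additive model with hinge basis functions computes, the following Python sketch evaluates a weighted sum of hinge terms. It is not the R implementation used in this work; the function names, the coefficients and the knot locations are hypothetical and chosen only for the example, whereas in MARS they are determined from the data.

```python
def hinge(x, knot, sign=+1):
    """Hinge basis function: max(0, x - knot) if sign=+1, max(0, knot - x) if sign=-1."""
    return max(0.0, sign * (x - knot))


def additive_mars_predict(x, intercept, terms):
    """Evaluate an additive model: intercept plus a weighted sum of hinge functions.

    terms is a list of (weight, feature_index, knot, sign) tuples. Each term
    involves a single predictor, so no interaction terms are present.
    """
    return intercept + sum(
        weight * hinge(x[j], knot, sign) for weight, j, knot, sign in terms
    )


# Hypothetical fitted terms: one hinge on a satellite value (index 0)
# and one hinge on elevation (index 1).
terms = [(0.8, 0, 50.0, +1), (-0.02, 1, 1000.0, +1)]
prediction = additive_mars_predict([80.0, 1500.0], intercept=10.0, terms=terms)
```

Below a knot, the corresponding term contributes nothing, which is what makes the fitted function piecewise linear in each predictor.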

2.1.3. Multivariate Adaptive Polynomial Splines

Multivariate adaptive polynomial splines (poly-MARS) [52,53] use piecewise linear splines to model the dependent variable in an adaptive regression procedure. Their main differences compared to MARS are that they require “linear terms of a predictor to be in the model before nonlinear terms using the same predictor can be added”, along with “a univariate basis function to be in the model before a tensor product basis function involving the univariate basis function can be in the model” [54]. In the present work, the poly-MARS model was implemented with the default parameters.

2.1.4. Random Forests

Random forests [55] are an ensemble of regression trees based on bagging (acronym for “bootstrap aggregation”). The benefits accompanying the application of this algorithm were summarized by Tyralis et al. [56], who also documented its recent popularity in hydrology with a systematic literature review. In random forests, a fixed number of predictor variables are randomly selected as candidates when determining the nodes of the regression tree. Herein, random forests were implemented with the default parameters. The number of trees was equal to 500.

2.1.5. Gradient Boosting Machines

Gradient boosting machines (gbm) are an ensemble learning algorithm. In brief, they iteratively train new base learners using the errors of previously trained base learners [57,58,59,60]. The final algorithm is essentially a sum of the trained base learners. Optimizations are performed using a gradient descent algorithm and by adapting the loss function. In the implementation of this work, the loss function was the squared error scoring function, and the base learners were regression trees. In addition, the number of trees was set to 500 to maintain consistency with the implementation of the random forest algorithm. The defaults were used for the remaining parameters.
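The iterative fitting on previous errors can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the gbm R implementation used in this work: the base learners are one-split regression trees (stumps) on a single predictor, and the learning rate, the number of trees and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on a 1-D predictor that minimizes the squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    """Squared-error boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For the squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    """The final model is the base value plus a scaled sum of the base learners."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = fit_boosting(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - predict_boosting(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

On the training data, the boosted model's squared error cannot exceed that of the constant baseline, because each step fits the residuals by least squares and is applied with a learning rate below one.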

2.1.6. Extreme Gradient Boosting

Extreme gradient boosting (XGBoost) [61] is another boosting algorithm. It is considerably faster and better in performance in comparison to traditional implementations of boosting algorithms. It is also further regularized compared to such implementations for controlling overfitting. In the implementation of this work, the maximum number of boosting iterations was set as equal to 500. The remaining parameters were kept as default. For instance, the maximum depth of each tree was kept as equal to 6.

2.1.7. Feed-Forward Neural Networks

Artificial neural networks (or simply, “neural networks”) extract linear combinations of the predictor variables as derived features and, subsequently, model the dependent variable as a nonlinear function of these features [47] (p. 389). Herein, we used feed-forward neural networks [62] (pp. 143–180). The number of units in the hidden layer and the maximum number of iterations were set as equal to 20 and 1000, respectively, while the remaining parameters were kept as default.

2.1.8. Feed-Forward Neural Networks with Bayesian Regularization

Feed-forward neural networks with Bayesian regularization [63] for avoiding overfitting were also employed in this work. In the respective implementation, the number of neurons was set as equal to 20 and the remaining parameters were kept as default. For instance, the maximum number of iterations was kept as equal to 1000.

2.2. Variable Importance Metric

We computed the random forests’ permutation importance of the predictor variables [55]. This metric measures the mean increase in the prediction mean squared error on the out-of-bag portion of the data after permuting each predictor variable in the regression trees of the trained model, and it provides relative rankings of the importance of the predictor variables. More generally, variable importance metrics can support explanations of the performance of machine learning algorithms [64,65], thereby expanding the overall scope of machine learning. This scope is often perceived as limited to the provision of accurate predictions. Random forests were fitted with 5000 trees for computing variable importance.
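The idea behind permutation importance can be sketched generically in Python. The sketch below is a simplification and not the random forest implementation used in this work: it takes any fitted prediction function and a held-out dataset instead of the out-of-bag samples of each tree, and the function names, the number of repeats and the seed are hypothetical.

```python
import random


def mean_squared_error(predict, rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling one predictor column at a time."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the target
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mean_squared_error(predict, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy check: the model uses only the first predictor, so the second one
# receives zero importance.
rows = [[float(i), float((i * 7) % 5)] for i in range(6)]
targets = [2.0 * r[0] for r in rows]
importances = permutation_importance(lambda r: 2.0 * r[0], rows, targets)
```

A predictor that the model ignores leaves the error unchanged when permuted, while permuting an influential predictor increases the error; ordering the predictors by this increase yields the relative rankings.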

3. Data and Application

3.1. Data

Our experiments relied entirely on open databases that offer earth-observed precipitation data at the monthly temporal resolution, gridded satellite precipitation data and elevation data for all the gauged locations and grid points shown in Figure 1.

3.1.1. Earth-Observed Precipitation Data

Total monthly precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm) [66] were used for the verification of the algorithms implemented for spatial interpolation. From the entire database, 1421 stations that are located in the contiguous United States were extracted, and data that span the time period 2001–2015 were selected. These data were sourced from the website of the National Oceanic and Atmospheric Administration (NOAA) (https://www.ncei.noaa.gov/pub/data/ghcn/v2; accessed on 24 September 2022).

3.1.2. Satellite Precipitation Data

For the application, we additionally used precipitation data from the current operational PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) system. The latter was developed by the Center for Hydrometeorology and Remote Sensing (CHRS) at the University of California, Irvine (UCI). The PERSIANN satellite data are created using artificial neural networks to establish a relationship between remotely sensed cloud-top temperature, measured via long-wave infrared (IR) sensors on geostationary orbiting satellites, and rainfall rates. Bias correction from passive microwave (PMW) records measured through low earth-orbiting (LEO) satellites [67,68,69] is also applied. These data were sourced in their daily format from the website of the CHRS (https://chrsdata.eng.uci.edu; accessed on 7 March 2022).

The final product spans a grid with a spatial resolution of 0.25° × 0.25°. We extracted a grid that spans the contiguous United States for the time period 2001–2015. We also transformed daily precipitation into total monthly precipitation to support the investigations in this work.
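The transformation from daily to total monthly precipitation amounts to summing the daily values within each calendar month. A minimal Python sketch for a single grid point is given below; the function name and the input layout are hypothetical, and the handling of missing daily values (which the text above does not describe) is left out.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_totals(daily):
    """Sum daily precipitation into (year, month) totals for one grid point.

    daily is an iterable of (date, precipitation) pairs.
    """
    totals = defaultdict(float)
    for day, value in daily:
        totals[(day.year, day.month)] += value
    return dict(totals)


# Toy series: 1 unit of precipitation per day for January and February 2001.
start = date(2001, 1, 1)
daily = [(start + timedelta(days=i), 1.0) for i in range(59)]
totals = monthly_totals(daily)
```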

3.1.3. Elevation Data

For all the gauged geographical locations and the grid points shown in Figure 1, elevation data were computed by using the get_elev_point function of the elevatr R package [70]. This function extracts point elevation data from the Amazon Web Services (AWS) Terrain Tiles (https://registry.opendata.aws/terrain-tiles; accessed on 25 September 2022). Elevation is a key variable in predicting hydrological processes [71].

3.2. Validation Setting and Predictor Variables

We define the earth-observed total monthly precipitation at the point of interest as the dependent variable. Notably, the ground-based stations are located irregularly in the region (see Figure 1); thus, the problem of defining predictor variables is not the usual one that is met in problems with tabular data. To form the regression settings, we found, separately for each station, the four closest grid points and computed the distances d_i, i = 1, 2, 3, 4 (in meters) from those points to the station. We also indexed the points S_i, i = 1, 2, 3, 4 according to their distance from the station, where d_1 < d_2 < d_3 < d_4 (see Figure 2).
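The selection of the four closest grid points can be sketched as follows in Python. The distance formula is an assumption: the text states only that distances are in meters, and the sketch uses the haversine great-circle distance with a mean Earth radius of 6,371,000 m. The toy grid, the station coordinates and the function names are likewise hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def four_closest(station, grid_points):
    """Return [(d_1, S_1), ..., (d_4, S_4)] sorted by increasing distance."""
    lat, lon = station
    ranked = sorted(
        (haversine_m(lat, lon, g_lat, g_lon), (g_lat, g_lon))
        for g_lat, g_lon in grid_points
    )
    return ranked[:4]


# Toy 0.25-degree grid around a hypothetical station.
grid = [(40.0 + 0.25 * i, -100.0 + 0.25 * j) for i in range(4) for j in range(4)]
station = (40.30, -99.60)
neighbours = four_closest(station, grid)
```

For a full grid over the contiguous United States, a brute-force search per station as above is simple but slow; an index-based lookup on the regular grid would be a natural refinement.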

Possible predictor variables for the technical problem of the present work are the total monthly precipitation values at the four closest grid points (which are referred to as “PERSIANN values 1−4”), the respective distances from the station (which are referred to as “distances 1−4”), the station’s elevation, and the station’s longitude and latitude. We defined and examined three different regression settings. Each of these corresponds to a different set of predictor variables (see Table 2).

The predictor sets 1 and 2 do not account directly for possible spatial dependences, as the station’s longitude and latitude are not part of them. Still, by using these predictor sets, spatial dependence is modeled indirectly, through covariance information (satellite precipitation at close points and station elevation). The predictor set 2 includes more information with respect to the predictor set 1 and, more precisely, the distances between the station location and the closest grid points. The predictor set 3 allows spatial dependence modeling, as it comprises the station’s longitude and latitude.

The dataset is composed of 91,623 samples. Each sample includes the total monthly precipitation observation at a specific earth-located station for a specified month and a specified year, as well as the respective values of the predictor variables, with the latter being dependent on the regression setting (see Table 2). The results of the performance comparison were obtained within a five-fold cross-validation setting.

Overall, the validation setting proposed in this work benefits from the following:

Stations with missing monthly precipitation values do not need to be excluded from the dataset, and missing values do not need to be filled. Instead, a varying number of stations are included in the procedure for each time point in the period investigated. In brief, we kept a dataset with the maximum possible size, and we did not add uncertainties to the procedure by filling in the missing values.
The cross-validation is totally random with respect to both space and time. This is a standard procedure in the validation of precipitation products that combine satellite and earth-observed data.
In the setting proposed, it is possible to create a corrected precipitation gridded dataset because, after fitting the regression algorithm, it is possible to directly interpolate in the space conditional upon the predictor variables that are known.
There is no need to first interpolate the station data to grid points and then verify the algorithms based on the earth-observed data previously interpolated. This procedure is common in the field, but it creates additional uncertainties.
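A fully random five-fold partition of the samples, as described above, can be sketched in Python as follows. The seed and the round-robin assignment of the shuffled indices to folds are assumptions of this sketch; the text specifies only that the partition is random with respect to both space and time.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists for a random k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random in space and time: samples are shuffled jointly
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, test


splits = list(five_fold_indices(91623))
```

Each sample (a station-month pair) appears in exactly one test fold, regardless of which station or month it belongs to.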

A few limitations of the validation setting proposed in this work also exist. Indeed, there might be some degree of bias due to the fact that this setting does not incorporate, in a direct way, information on spatial dependencies. Such incorporations would require a different partitioning of the dataset [72,73], as machine learning models that may explicitly model spatial dependencies (see, e.g., [74,75]) may not be applicable in settings with a varying number of spatial observations at different times.

To deliver exploratory insight into the technical problem investigated in this work, we additionally estimated Spearman’s correlation [76] for all the possible pairs of variables appearing in the regression settings. We also ranked the total of the predictor variables with respect to their importance in predicting the dependent variable. The latter was performed after estimating the importance according to Section 2.2.
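Spearman's correlation is the Pearson correlation computed on the ranks of the two variables. A minimal Python sketch is given below; assigning average ranks to tied values is a common convention adopted here as an assumption, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(a), average_ranks(b))
```

Because only ranks enter the computation, any monotonically increasing relationship yields a correlation of 1, whether or not it is linear.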

3.3. Performance Metrics and Assessment

To compare the algorithms outlined in Section 2.1 in performing the spatial interpolation, we used the squared error scoring function. This function is defined by

$$S(x, y) := (x - y)^2 \tag{1}$$

In the above equation, y is the realization (observation) of the spatial process, and x is the prediction. The squared error scoring function is consistent for the mean functional of the predictive distributions [77]. Predictions of models in hydrology should be provided in probabilistic terms (see, e.g., the relevant review by Papacharalampous and Tyralis [78]); still, a specific functional of the predictive distribution may be of interest. A model trained with the squared error scoring function predicts the mean functional of the predictive distribution [77].

The performance criterion for the machine learning algorithms takes the form of the median squared error (MedSE) by computing the median of the squared error function, separately for each set {machine learning algorithm, predictor set, test fold}, according to Equation (2). In this equation, the subscript to x and y, i.e., i ∈ {1, …, n}, indicates the sample.

$$\mathrm{MedSE} := \operatorname{median}_n \{S(x_i, y_i)\} \tag{2}$$
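Equations (1) and (2) translate directly into code. The following Python sketch computes the MedSE for one set {machine learning algorithm, predictor set, test fold}; the function names are hypothetical.

```python
from statistics import median


def squared_error(x, y):
    """Equation (1): squared error between prediction x and observation y."""
    return (x - y) ** 2


def med_se(predictions, observations):
    """Equation (2): median of the squared errors over the n test samples."""
    return median(squared_error(x, y) for x, y in zip(predictions, observations))


# Toy example: the squared errors are 0, 1, 4 and 81, so the median is 2.5.
example = med_se([1, 2, 3, 10], [1, 1, 1, 1])
```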

The five MedSE values computed for each set {machine learning algorithm, predictor set} (with each corresponding to a different test fold) were then used to compute five relative scores (which are otherwise referred to as “relative improvements” herein), separately for each predictor set, by using the set {linear regression, predictor set} as the reference case. These relative scores were then averaged separately for each set {machine learning algorithm, predictor set}, to provide mean relative scores (which are otherwise referred to as “mean relative improvements” herein). A skill score with linear regression as the reference technique for an arbitrary algorithm of interest k is defined by

$$S_{\mathrm{skill}} := \frac{\mathrm{MedSE}_{\{k,\ \text{predictor set}\}}}{\mathrm{MedSE}_{\{\text{linear regression},\ \text{predictor set}\}}} \tag{3}$$

The relative scores computed for the assessment are defined by

$$\mathrm{RS}_{\{\text{linear regression},\ \text{predictor set}\}} := 100\,(1 - S_{\mathrm{skill}}) \tag{4}$$
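The computation of the relative scores and of their mean over the five test folds can be sketched in Python as follows. The function names are hypothetical, and the MedSE values in the example are invented for illustration; they are not results of this work.

```python
def relative_score(med_se_algorithm, med_se_reference):
    """Equations (3) and (4): relative improvement (%) over the reference."""
    skill = med_se_algorithm / med_se_reference
    return 100.0 * (1.0 - skill)


def mean_relative_score(med_se_by_fold, reference_by_fold):
    """Average the per-fold relative scores for one {algorithm, predictor set}."""
    scores = [
        relative_score(a, r) for a, r in zip(med_se_by_fold, reference_by_fold)
    ]
    return sum(scores) / len(scores)


# Hypothetical per-fold MedSE values for an algorithm and for linear regression.
algorithm_med_se = [80.0, 90.0, 85.0, 75.0, 70.0]
reference_med_se = [100.0, 100.0, 100.0, 100.0, 100.0]
mean_improvement = mean_relative_score(algorithm_med_se, reference_med_se)
```

A positive relative score means that the algorithm has a smaller MedSE than linear regression on that fold, and a score of zero is obtained by the reference itself.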

To extend the comparison by also including the assessment between differences in performance across predictor sets, the procedures for computing the relative and mean relative scores were repeated by considering the set {linear regression, predictor set 1} as the reference case for all the sets {machine learning algorithm, predictor set}. In addition to the two types of relative improvements, we present information on the rankings of the machine learning algorithms. For obtaining the respective results, we first ranked the eight machine learning algorithms separately for each set {case, predictor set} (with each case belonging to one test fold only). Then, we grouped these rankings per set {predictor set, test fold} and computed their mean. Lastly, we averaged the five mean ranking values corresponding to each predictor set and provided the results of this procedure, which are referred to in the following sections as “mean rankings”. Moreover, we repeated the mean ranking computation after computing the rankings collectively for all the predictor sets.
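The mean ranking procedure described above (ranking per case, averaging per test fold, then averaging over folds) can be sketched in Python as follows. The data layout and the function names are hypothetical, and the treatment of ties (tied algorithms share the lowest rank) is an assumption, since the text does not specify it.

```python
def ranks_for_case(errors):
    """Rank algorithms for one test case; rank 1 is the smallest squared error.

    errors maps an algorithm name to its squared error for this case.
    Tied algorithms share the lowest rank (an assumption of this sketch).
    """
    return {
        name: 1 + sum(other < err for other in errors.values())
        for name, err in errors.items()
    }


def mean_rankings(folds):
    """folds is a list of test folds; each fold is a list of per-case error dicts."""
    names = list(folds[0][0])
    fold_means = []
    for cases in folds:
        ranked = [ranks_for_case(case) for case in cases]
        fold_means.append(
            {n: sum(r[n] for r in ranked) / len(ranked) for n in names}
        )
    # Average the per-fold mean rankings across the folds.
    return {n: sum(fm[n] for fm in fold_means) / len(fold_means) for n in names}


# Toy example with two algorithms and two folds.
folds = [
    [{"a": 1.0, "b": 2.0}, {"a": 3.0, "b": 1.0}],
    [{"a": 0.0, "b": 5.0}],
]
rankings = mean_rankings(folds)
```

Passing the errors of all {machine learning algorithm, predictor set} combinations in each case dictionary, instead of those of a single predictor set, would give the collective rankings mentioned at the end of the paragraph.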

Notably, we did not compare the algorithms using alternative scoring functions (e.g., the absolute error scoring function) because such functions may not be consistent for the mean functional (excluding functions of the Bregman family) [77]. It is also possible to use other skill scores (e.g., the Nash–Sutcliffe efficiency, which is widely used in hydrology). Here, we preferred to use the simple linear regression algorithm as a reference technique. We believe that this choice is credible because of the simplicity and ease in the use of the algorithm.

4. Results

4.1. Regression Setting Exploration

Figure 3 presents the Spearman correlation estimates for all the possible pairs of variables appearing in the regression settings examined in this work. The relationships between the predictand (i.e., the precipitation quantity observed at the earth-located stations) and the 11 predictor variables (see Section 3.2) can be assessed through the estimates displayed on the first column on the left side of the heatmap. Based on the Spearman correlation estimates, the strongest and, at the same time, equally strong among these relationships are those between the predictand and the four predictors, referring to the precipitation quantities drawn from the PERSIANN grid. A possible explanation of this equality could be found in the Spearman correlation estimates made for the six pairs of PERSIANN values, which are equal to either 0.98 or 0.99, indicating extremely strong relationships. This strength can, in turn, be attributed to strong spatial relationships on the PERSIANN grid (i.e., to the fact that the neighboring grid points have similar values) and, perhaps, also to the repetitions of values in the regression settings.

Other relationships that are notably strong and, thus, expected, at least at an initial stage, to be particularly beneficial for estimating precipitation in the spatial setting adopted herein are those indicated by the values 0.45 and −0.40 (which again appear in the same column of the same heatmap; see Figure 3). The former of these two values refers to the relationship between the precipitation quantity observed at an earth-located station and the longitude at the location of this station, while the latter of the same values refers to the relationship between the precipitation quantity observed at an earth-located station and the elevation at the location of this station. The remaining relationships between the predictand and predictor variables are found to be less strong; nonetheless, they could also be worth exploiting in the regression setting. Examples of the above-discussed relationships can be further examined in Figure 4.

Figure 5 presents the estimates of the importance of the 11 predictor variables; these estimates were provided by the random forest algorithm when considering all of these predictor variables in the regression setting. Figure 5 additionally provides the ordering of the same estimates, which is also the ordering of the 11 predictor variables according to their importance. The longitude at the location of the earth-located station is the most important predictor variable (probably because it is a spatial characteristic and the regression is made in a spatial setting), followed by the precipitation quantities drawn from the first, second and fourth closest points to the earth-located station at the PERSIANN grid. These latter three predictors are followed by the elevation at the location of the earth-located station. The next predictor in terms of importance is the precipitation quantity drawn from the third closest point to the earth-located station at the PERSIANN grid. The latitude at the location of the earth-located station follows, and the four variables referring to distances are the least important ones.

4.2. Comparison of the Algorithms

Figure 6 presents information that directly allows us to understand how the algorithms outlined in Section 2.1 performed with respect to each other in the various experiments, separately for each predictor set. Both the mean relative improvements (Figure 6a) and the mean rankings (Figure 6b) indicate that, overall, extreme gradient boosting (XGBoost) and random forests are the two best-performing algorithms. In terms of mean relative improvements, the former of these algorithms showed a much better performance than the latter when they were both run with the predictor sets 1 and 2, and a slightly better performance than the latter when they were both run with the predictor set 3. Feed-forward neural networks with Bayesian regularization follow, and, in terms of mean rankings, they were found to perform almost as well as random forests. Multivariate adaptive polynomial splines (poly-MARS) and gradient boosting machines (gbm) are the fourth- and fifth-best-performing algorithms, respectively. While the mean rankings corresponding to the latter two algorithms do not suggest large differences in their performance, the mean relative improvements favor poly-MARS to a notable extent. In terms of both mean relative improvements and mean rankings, feed-forward neural networks performed better than gbm and multivariate adaptive regression splines (MARS) when these three algorithms were run with the predictor set 1. The linear regression algorithm was the worst for all the predictor sets investigated in this work. For the predictor sets 2 and 3, feed-forward neural networks were the second-worst algorithm with a very close performance to that of linear regression, probably due to overfitting.

Figure 7 facilitates comparisons, both across algorithms and across predictor sets, of the frequency with which each algorithm appeared in the various positions from the first to the eighth (i.e., the last) in the experiments. For the predictor set 1 (see Figure 7a), the linear regression algorithm was most commonly found in the last position, while its second-most common position was the first, and the six remaining positions appeared in much smaller and largely comparable frequencies. For the same predictor set, the XGBoost algorithm followed a notably similar pattern, although for it the first position was found to be the most common and the last position was found to be the second most common. The remaining positions appeared with smaller frequencies. In addition, the remaining algorithms were found less frequently in the first and last positions than the linear regression and XGBoost algorithms, with random forests appearing more often in these same positions than the other five algorithms. The frequency with which random forests appeared in the first, second, seventh and eighth positions is almost the same and greater than the frequency with which they appeared in the middle four positions. On the other hand, poly-MARS, feed-forward neural networks and feed-forward neural networks with Bayesian regularization appeared more often in the four middle positions than they appeared in the first two and last two positions, and MARS appeared more often in the six middle positions than it appeared in the first and last positions.

For the predictor set 2 (see Figure 7b), there is differentiation in most of the above-discussed patterns. Notably, for this predictor set, the patterns observed for feed-forward neural networks and linear regression are quite similar. These algorithms appeared in one of the last two positions more often than any other algorithm. Moreover, the seventh position was more frequent for them, and their frequency of appearance in the first, third, fourth, fifth and sixth positions was almost the same and a bit smaller than their frequency of appearance in the second position. The same algorithms appeared in the last position as often as the XGBoost algorithm. The latter is the algorithm that appeared most often in the first position by far. Similarly to what was previously noted for the predictor set 1, this algorithm appeared more frequently in the first and last positions than in any other position for the predictor set 2, with the first position also being much more frequent than the last one. Random forests appeared more often in the first two positions than in any other position, and the remaining algorithms appeared more often in the third, fourth, fifth and sixth positions than in the remaining four positions.

For the predictor set 3 (see Figure 7c), the frequency with which each algorithm appeared in the various positions from the first to the last exhibits more similarities with what was found for the predictor set 2 than with what was found for the predictor set 1. Yet, there are a few notable differences with respect to the predictor set 2, which serves here as the reference case. In fact, although the XGBoost algorithm appeared more often, here as well, in the first and last positions, the frequency of its appearance in the remaining positions was notably larger than the respective frequency for the predictor set 2. In addition, the random forest algorithm appeared more often in the third, fourth, fifth and sixth positions than it did for the predictor set 2.
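
The position frequencies discussed above can be illustrated with a minimal sketch. The authors' own implementation is in R (see Appendix A); the Python code below is only an illustration, and the algorithm labels, the scores and the notion of an "experiment" as one dictionary of scores are hypothetical. In each experiment the algorithms are ordered by their squared-error score (lower is better), and the code counts how often each algorithm lands in each position. Ties are broken by insertion order, which is an assumption of this sketch.

```python
from collections import Counter


def rank_positions(scores):
    """Map each algorithm to its position (1 = lowest, i.e., best, score)."""
    ordered = sorted(scores, key=scores.get)
    return {name: pos for pos, name in enumerate(ordered, start=1)}


def position_frequencies(experiments):
    """Count how often each algorithm appears in each position.

    experiments: list of dicts mapping algorithm name -> squared-error score.
    """
    counts = {}
    for scores in experiments:
        for name, pos in rank_positions(scores).items():
            counts.setdefault(name, Counter())[pos] += 1
    return counts


# Hypothetical scores for three algorithms in four experiments
experiments = [
    {"lm": 9.0, "rf": 4.0, "xgb": 3.0},
    {"lm": 2.0, "rf": 5.0, "xgb": 6.0},
    {"lm": 8.0, "rf": 3.5, "xgb": 3.0},
    {"lm": 7.0, "rf": 2.0, "xgb": 9.0},
]
freq = position_frequencies(experiments)
for name, counter in freq.items():
    print(name, dict(sorted(counter.items())))
```

In this toy input, the hypothetical "xgb" lands only in the first and last positions, which mirrors the kind of bimodal pattern described for Figure 7.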

Moreover, Figure 8 and Figure 9 allow us to understand how much the additional predictors in the predictor sets 2 and 3 improved or deteriorated the performance of the eight algorithms with respect to using the predictor set 1. The computed improvements were found to be all positive and particularly large for the random forest and the two boosting algorithms, especially when moving to the predictor set 3. Also notably large and positive are the performance improvements offered by the additional predictors in the predictor set 3 with respect to the predictor set 1 for linear regression, MARS, poly-MARS and feed-forward neural networks with Bayesian regularization, while the same does not apply for the case of using the predictor set 2 instead of the predictor set 1 for the same algorithms. Figure 8 further reveals the best-performing combinations of algorithms and predictors. These are the {extreme gradient boosting, predictor set 3} and {random forests, predictor set 3}, with the former offering slightly better performance in terms of mean relative improvement (but not in terms of mean ranking).

Perhaps it is also relevant to highlight, at this point, that the combination {feed-forward neural networks with Bayesian regularization, predictor set 3} was in the fourth position in terms of mean relative improvement (surpassing all the remaining combinations aside from the two best-performing ones and the {extreme gradient boosting, predictor set 2}; see Figure 8a) and in the second position in terms of mean ranking (surpassing all the remaining combinations aside from the {random forests, predictor set 3}; see Figure 8b). At the same time, according to Figure 8a, the feed-forward neural networks without Bayesian regularization performed so poorly when applied with the predictor sets 2 and 3 (in which the number of the predictor variables increases by four and six, respectively, with respect to the predictor set 1) that they were only slightly better than the linear regression model applied with the same predictor sets. Lastly, according to the same figure, the combination {linear regression, predictor set 3} outperformed the combination {feed-forward neural networks, predictor set 2}.
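
That mean relative improvement and mean ranking can order two combinations differently is easy to see in a small numerical sketch. The code below is illustrative Python, not the authors' R code. It assumes that the relative improvement is the percentage reduction of a squared-error score with respect to a reference combination, which is a definition adopted here for illustration only, and all scores and labels are hypothetical.

```python
def relative_improvement(score, reference_score):
    """Percentage reduction of `score` relative to `reference_score` (lower scores are better)."""
    return 100.0 * (reference_score - score) / reference_score


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_ranking(scores):
    """Average position of each combination across experiments (1 = best)."""
    n_experiments = len(next(iter(scores.values())))
    totals = {combo: 0 for combo in scores}
    for i in range(n_experiments):
        ordered = sorted(scores, key=lambda combo: scores[combo][i])
        for pos, combo in enumerate(ordered, start=1):
            totals[combo] += pos
    return {combo: totals[combo] / n_experiments for combo in scores}


# Hypothetical scores of three {algorithm, predictor set} combinations in three experiments
scores = {
    ("lm", 1): [10.0, 10.0, 10.0],  # reference combination (assumption)
    ("xgb", 3): [2.0, 6.0, 6.0],
    ("rf", 3): [5.0, 5.0, 5.0],
}
reference = scores[("lm", 1)]

mean_ri = {
    combo: mean(relative_improvement(s, r) for s, r in zip(vals, reference))
    for combo, vals in scores.items()
}
mean_rank = mean_ranking(scores)
print(mean_ri)
print(mean_rank)
```

Here the hypothetical "xgb" combination has the larger mean relative improvement because of one very good experiment, while "rf" has the better mean ranking because it wins more often.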

5. Discussion

In summary, the large-scale comparison showed that the machine learning algorithms of this work can be ordered from the best to the worst, in regard to their accuracy in correcting satellite precipitation products at the monthly temporal scale, as follows: extreme gradient boosting (XGBoost), random forests, Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks and linear regression. The differences in performance were found to be smaller between some pairs of algorithms when the application is made with specific predictors (e.g., random forests and XGBoost when run with the predictor set 3) and larger (or medium) in other cases. In particular, the magnitude of the differences computed between each of the two best-performing algorithms and the remaining ones, for the case in which the most information-rich predictor set is exploited, suggests that taking the findings of this work into account can have a large positive impact on future applications. Notably, the fact that the random forest, XGBoost and gbm algorithms perform better or, in the worst case, similarly when predictors are added, could be attributed to their known theoretical properties. Summaries of these properties are provided in the reviews by Tyralis et al. [56] and Tyralis and Papacharalampous [60], where extensive lists of references to the related machine learning literature are also provided.

Aside from the selection of a machine learning algorithm and the selection of a set of predictor variables, which are well covered by this work for the monthly temporal scale, there are also other important themes, whose investigation could substantially improve performance in the problem of correcting satellite precipitation products at various temporal scales. Perhaps the most worthy of discussion here is the use of ensembles of machine learning algorithms in the context of ensemble learning. A few works are devoted to ensemble learning algorithms for spatial interpolation (e.g., [79,80]) and could provide a starting point, together with the present work, for building detailed big data comparisons of ensemble learning algorithms. Note here that the ensemble learning algorithms include simple combinations (see, e.g., those in Petropoulos and Svetunkov [81]; Papacharalampous and Tyralis [82]) and more advanced stacking and meta-learning approaches (see, e.g., those in Wolpert [83]; Tyralis et al. [84]; Montero-Manso et al. [85]; Talagala et al. [86]), and are increasingly adopted in many fields, including hydrology.
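
As a minimal illustration of the simplest kind of ensemble mentioned above, the following Python sketch averages the predictions of two base learners with equal weights and compares the mean squared errors. It is not taken from the cited works; the observations and predictions are hypothetical, and the gain shown here is a property of this toy input rather than a general guarantee.

```python
def mean_squared_error(predictions, observations):
    """Mean of the squared differences between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)


def simple_combination(*prediction_sets):
    """Equal-weight average of the predictions of several base learners."""
    return [sum(values) / len(values) for values in zip(*prediction_sets)]


# Hypothetical monthly precipitation observations and two sets of corrected predictions
observed = [10.0, 20.0, 30.0]
pred_a = [12.0, 18.0, 33.0]
pred_b = [8.0, 23.0, 28.0]

combined = simple_combination(pred_a, pred_b)
print(mean_squared_error(pred_a, observed))
print(mean_squared_error(pred_b, observed))
print(mean_squared_error(combined, observed))
```

Stacking and meta-learning approaches replace the equal weights with a combiner that is itself learned from data, which this sketch does not attempt.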

Other possible themes for future research, in the important direction of improving both our understanding of the practical problem of correcting satellite precipitation products and the various algorithmic solutions to this problem, include the investigation of spatial and temporal patterns (as the precipitation product correction errors might follow such patterns) and the explanation of the predictive performance of the various algorithms by combining time series feature estimation (see multiple examples of time series features in Fulcher et al. [87]; Kang et al. [88]) and explainable machine learning (see, e.g., the relevant reviews in Linardatos et al. [89]; Belle and Papantonis [90]). Examples of such investigations are available for a different modeling context in Papacharalampous et al. [91]. Lastly, the comparisons could be extended to include algorithms for predictive uncertainty quantification. A few works are devoted to such machine learning algorithms for spatial interpolation (e.g., [92]). Still, comparison frameworks and large-scale results for multiple algorithms are currently missing from the literature of satellite precipitation data correction.

6. Conclusions

Hydrological applications often rely on gridded precipitation datasets from satellites, as these datasets cover large regions with higher spatial density compared to the ones that comprise ground-based measurements. Still, the former datasets are less accurate than the latter, with the various machine learning algorithms constituting an established means for improving their accuracy in regression settings. In these settings, the ground-based measurements play the role of the dependent variable, and the satellite data play the role of the predictor variables, together with data for topography factors (e.g., elevation). The studies devoted to this important endeavor are numerous; still, most of them involve a limited number of machine learning algorithms and are also conducted for a small region and a limited time period. Thus, their results are mostly of local importance, and cannot support the derivation of more general guidance and best practices.

In this work, we moved beyond the above-outlined standard approach by comparing eight machine learning algorithms in correcting precipitation satellite data for the entire contiguous United States and over a 15-year period. More precisely, we exploited monthly precipitation satellite data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset and monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm), and based the comparison on the squared error scoring function. Overall, extreme gradient boosting (XGBoost) and random forests were found to be the most accurate algorithms, with the former being more accurate than the latter to a small extent for the majority of the scores computed. The remaining algorithms can be ordered from the best- to the worst-performing as follows: feed-forward neural networks with Bayesian regularization, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks and linear regression.

Aside from the above ordering, which constitutes, in our opinion, the most important finding of the present work, important findings on the selection of predictor variables in the field of satellite precipitation data correction were also obtained for the monthly time scale. Indeed, we found that the distances of the four closest grid points from a ground-based station, as well as this station’s longitude and latitude, can offer improvements in predictive performance when utilized as predictor variables for most of the machine learning algorithms assessed (including the best-performing ones), together with the monthly precipitation values at the four closest grid points and the station’s elevation. Also importantly, we proposed a new validation setting that could bring considerable benefits to future comparisons of machine and statistical learning algorithms in the field. These benefits are enumerated in Section 3.2. Even more generally, we proposed an original methodological framework and contributed to the transfer of theory and best practices from the field of statistics to the field of satellite precipitation data correction.
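
The predictor variables listed above can be assembled as in the following sketch, which is illustrative Python rather than the authors' R code. It assumes that predictor set 1 holds the precipitation values at the four closest grid points plus the station's elevation, that predictor set 2 adds the four distances, and that predictor set 3 further adds longitude and latitude; this is consistent with the increase by four and six predictors noted earlier. The use of a plain Euclidean distance in degrees, the function names, the grid and the station are all hypothetical.

```python
import math


def four_closest(station_lon, station_lat, grid):
    """Return (distance, value) pairs for the four grid points nearest to the station.

    grid: dict mapping (lon, lat) of a grid point -> monthly precipitation value.
    Distances are Euclidean in degrees (an assumption of this sketch).
    """
    pairs = [
        (math.hypot(lon - station_lon, lat - station_lat), value)
        for (lon, lat), value in grid.items()
    ]
    return sorted(pairs)[:4]


def predictor_sets(station, grid):
    """Build the three nested predictor vectors for one station and one month."""
    nearest = four_closest(station["lon"], station["lat"], grid)
    distances = [d for d, _ in nearest]
    values = [v for _, v in nearest]
    set1 = values + [station["elevation"]]           # 5 predictors
    set2 = set1 + distances                          # 9 predictors
    set3 = set2 + [station["lon"], station["lat"]]   # 11 predictors
    return set1, set2, set3


# Hypothetical grid points with monthly precipitation values (mm)
grid = {
    (-100.125, 40.125): 50.0,
    (-99.875, 40.125): 60.0,
    (-100.125, 39.875): 40.0,
    (-99.875, 39.875): 45.0,
    (-99.625, 40.125): 70.0,
}
station = {"lon": -100.02, "lat": 40.05, "elevation": 800.0}

set1, set2, set3 = predictor_sets(station, grid)
print(len(set1), len(set2), len(set3))
```

Each vector would form one row of the regression design matrix, with the station's observed monthly precipitation as the dependent variable.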

Author Contributions

G.P. and H.T. conceptualized and designed the work with input from A.D. and N.D.; G.P. and H.T. performed the analyses and visualizations and wrote the first draft, which was commented on and enriched with new text, interpretations and discussions by A.D. and N.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was conducted in the context of the research project BETTER RAIN (BEnefiTTing from machine lEarning algoRithms and concepts for correcting satellite RAINfall products). This research project was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “3rd Call for H.F.R.I. Research Projects to support Post-Doctoral Researchers” (Project Number: 7368).

Data Availability Statement

The data used in this paper are open (see Section 3.1).

Acknowledgments

The authors are sincerely grateful to the Journal for inviting the submission of this paper, and to the Editor and Reviewers for their constructive remarks. They would also like to acknowledge the contribution of the late Yorgos Photis in the proposal of the research project BETTER RAIN.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We used the R programming language [93] to implement the algorithms and to report and visualize the results.

For data processing and visualizations, we used the contributed R packages caret [94], data.table [95], elevatr [70], ggforce [96], ncdf4 [97], rgdal [98], sf [99,100], spdep [7,101,102] and tidyverse [103,104].

The algorithms were implemented using the contributed R packages brnn [105], earth [106], gbm [107], nnet [108,109], polspline [54], ranger [110,111] and xgboost [112].

The performance metrics were computed by implementing the contributed R package scoringfunctions [113,114].

Reports were produced by using the contributed R packages devtools [115], knitr [116,117,118] and rmarkdown [119,120,121].

References

Blöschl, G.; Bierkens, M.F.P.; Chambel, A.; Cudennec, C.; Destouni, G.; Fiori, A.; Kirchner, J.W.; McDonnell, J.J.; Savenije, H.H.G.; Sivapalan, M.; et al. Twenty-three unsolved problems in hydrology (UPH)–A community perspective. Hydrol. Sci. J. 2019, 64, 1141–1158. [Google Scholar] [CrossRef]
Sun, Q.; Miao, C.; Duan, Q.; Ashouri, H.; Sorooshian, S.; Hsu, K.-L. A review of global precipitation data sets: Data sources, estimation, and intercomparisons. Rev. Geophys. 2018, 56, 79–107. [Google Scholar] [CrossRef]
Mega, T.; Ushio, T.; Matsuda, T.; Kubota, T.; Kachi, M.; Oki, R. Gauge-adjusted global satellite mapping of precipitation. IEEE Trans. Geosci. Remote Sens. 2019, 57, 1928–1935. [Google Scholar] [CrossRef]
Salmani-Dehaghi, N.; Samani, N. Development of bias-correction PERSIANN-CDR models for the simulation and completion of precipitation time series. Atmos. Environ. 2021, 246, 117981. [Google Scholar] [CrossRef]
Li, W.; Jiang, Q.; He, X.; Sun, H.; Sun, W.; Scaioni, M.; Chen, S.; Li, X.; Gao, J.; Hong, Y.; et al. Effective multi-satellite precipitation fusion procedure conditioned by gauge background fields over the Chinese mainland. J. Hydrol. 2022, 610, 127783. [Google Scholar] [CrossRef]
Tang, T.; Chen, T.; Gui, G. A comparative evaluation of gauge-satellite-based merging products over multiregional complex terrain basin. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5275–5287. [Google Scholar] [CrossRef]
Bivand, R.S.; Pebesma, E.; Gómez-Rubio, V. Applied Spatial Data Analysis with R, 2nd ed.; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
Li, J.; Heap, A.D. Spatial interpolation methods applied in the environmental sciences: A review. Environ. Model. Softw. 2014, 53, 173–189. [Google Scholar] [CrossRef]
Heuvelink, G.B.M.; Webster, R. Spatial statistics and soil mapping: A blossoming partnership under pressure. Spat. Stat. 2022, 50, 100639. [Google Scholar] [CrossRef]
Kopczewska, K. Spatial machine learning: New opportunities for regional science. Ann. Reg. Sci. 2022, 68, 713–755. [Google Scholar] [CrossRef]
Hu, Q.; Li, Z.; Wang, L.; Huang, Y.; Wang, Y.; Li, L. Rainfall spatial estimations: A review from spatial interpolation to multi-source data merging. Water 2019, 11, 579. [Google Scholar] [CrossRef]
Abdollahipour, A.; Ahmadi, H.; Aminnejad, B. A review of downscaling methods of satellite-based precipitation estimates. Earth Sci. Inform. 2022, 15, 1–20. [Google Scholar] [CrossRef]
He, X.; Chaney, N.W.; Schleiss, M.; Sheffield, J. Spatial downscaling of precipitation using adaptable random forests. Water Resour. Res. 2016, 52, 8217–8237. [Google Scholar] [CrossRef]
Meyer, H.; Kühnlein, M.; Appelhans, T.; Nauss, T. Comparison of four machine learning algorithms for their applicability in satellite-based optical rainfall retrievals. Atmos. Res. 2016, 169, 424–433. [Google Scholar] [CrossRef]
Tao, Y.; Gao, X.; Hsu, K.; Sorooshian, S.; Ihler, A. A deep neural network modeling framework to reduce bias in satellite precipitation products. J. Hydrometeorol. 2016, 17, 931–945. [Google Scholar] [CrossRef]
Yang, Z.; Hsu, K.; Sorooshian, S.; Xu, X.; Braithwaite, D.; Verbist, K.M.J. Bias adjustment of satellite-based precipitation estimation using gauge observations: A case study in Chile. J. Geophys. Res. Atmos. 2016, 121, 3790–3806. [Google Scholar] [CrossRef]
Baez-Villanueva, O.M.; Zambrano-Bigiarini, M.; Beck, H.E.; McNamara, I.; Ribbe, L.; Nauditt, A.; Birkel, C.; Verbist, K.; Giraldo-Osorio, J.D.; Thinh, N.X.; et al. RF-MEP: A novel random forest method for merging gridded precipitation products and ground-based measurements. Remote Sens. Environ. 2020, 239, 111606. [Google Scholar] [CrossRef]
Chen, H.; Chandrasekar, V.; Cifelli, R.; Xie, P. A machine learning system for precipitation estimation using satellite and ground radar network observations. IEEE Trans. Geosci. Remote Sens. 2020, 58, 982–994. [Google Scholar] [CrossRef]
Chen, S.; Xiong, L.; Ma, Q.; Kim, J.-S.; Chen, J.; Xu, C.-Y. Improving daily spatial precipitation estimates by merging gauge observation with multiple satellite-based precipitation products based on the geographically weighted ridge regression method. J. Hydrol. 2020, 589, 125156. [Google Scholar] [CrossRef]
Rata, M.; Douaoui, A.; Larid, M.; Douaik, A. Comparison of geostatistical interpolation methods to map annual rainfall in the Chéliff watershed, Algeria. Theor. Appl. Climatol. 2020, 141, 1009–1024. [Google Scholar] [CrossRef]
Chen, C.; Hu, B.; Li, Y. Easy-to-use spatial random-forest-based downscaling-calibration method for producing precipitation data with high resolution and high accuracy. Hydrol. Earth Syst. Sci. 2021, 25, 5667–5682. [Google Scholar] [CrossRef]
Nguyen, G.V.; Le, X.-H.; Van, L.N.; Jung, S.; Yeon, M.; Lee, G. Application of random forest algorithm for merging multiple satellite precipitation products across South Korea. Remote Sens. 2021, 13, 4033. [Google Scholar] [CrossRef]
Shen, Z.; Yong, B. Downscaling the GPM-based satellite precipitation retrievals using gradient boosting decision tree approach over Mainland China. J. Hydrol. 2021, 602, 126803. [Google Scholar] [CrossRef]
Zhang, L.; Li, X.; Zheng, D.; Zhang, K.; Ma, Q.; Zhao, Y.; Ge, Y. Merging multiple satellite-based precipitation products and gauge observations using a novel double machine learning approach. J. Hydrol. 2021, 594, 125969. [Google Scholar] [CrossRef]
Chen, H.; Sun, L.; Cifelli, R.; Xie, P. Deep learning for bias correction of satellite retrievals of orographic precipitation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4104611. [Google Scholar] [CrossRef]
Fernandez-Palomino, C.A.; Hattermann, F.F.; Krysanova, V.; Lobanova, A.; Vega-Jácome, F.; Lavado, W.; Santini, W.; Aybar, C.; Bronstert, A. A novel high-resolution gridded precipitation dataset for Peruvian and Ecuadorian watersheds: Development and hydrological evaluation. J. Hydrometeorol. 2022, 23, 309–336. [Google Scholar] [CrossRef]
Lin, Q.; Peng, T.; Wu, Z.; Guo, J.; Chang, W.; Xu, Z. Performance evaluation, error decomposition and tree-based machine learning error correction of GPM IMERG and TRMM 3B42 products in the Three Gorges reservoir area. Atmos. Res. 2022, 268, 105988. [Google Scholar] [CrossRef]
Yang, X.; Yang, S.; Tan, M.L.; Pan, H.; Zhang, H.; Wang, G.; He, R.; Wang, Z. Correcting the bias of daily satellite precipitation estimates in tropical regions using deep neural network. J. Hydrol. 2022, 608, 127656. [Google Scholar] [CrossRef]
Zandi, O.; Zahraie, B.; Nasseri, M.; Behrangi, A. Stacking machine learning models versus a locally weighted linear model to generate high-resolution monthly precipitation over a topographically complex area. Atmos. Res. 2022, 272, 106159. [Google Scholar] [CrossRef]
Militino, A.F.; Ugarte, M.D.; Pérez-Goya, U. Machine learning procedures for daily interpolation of rainfall in Navarre (Spain). In Trends in Mathematical, Information and Data Sciences; Springer: New York, NY, USA, 2023; Volume 445, pp. 399–413. [Google Scholar] [CrossRef]
Li, J.; Heap, A.D.; Potter, A.; Daniell, J.J. Application of machine learning methods to spatial interpolation of environmental variables. Environ. Model. Softw. 2011, 26, 1647–1659. [Google Scholar] [CrossRef]
Baratto, P.F.B.; Cecílio, R.A.; de Sousa Teixeira, D.B.; Zanetti, S.S.; Xavier, A.C. Random forest for spatialization of daily evapotranspiration (ET₀) in watersheds in the Atlantic Forest. Environ. Monit. Assess. 2022, 194, 449. [Google Scholar] [CrossRef] [PubMed]
Sekulić, A.; Kilibarda, M.; Protić, D.; Tadić, M.P.; Bajat, B. Spatio-temporal regression kriging model of mean daily temperature for Croatia. Theor. Appl. Climatol. 2020, 140, 101–114. [Google Scholar] [CrossRef]
Sekulić, A.; Kilibarda, M.; Protić, D.; Bajat, B. A high-resolution daily gridded meteorological dataset for Serbia made by random forest spatial interpolation. Sci. Data 2021, 8, 123. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Tantanee, S. How to explain and predict the shape parameter of the generalized extreme value distribution of streamflow extremes using a big dataset. J. Hydrol. 2019, 574, 628–645. [Google Scholar] [CrossRef]
Papacharalampous, G.; Tyralis, H. Time series features for supporting hydrometeorological explorations and predictions in ungauged locations using large datasets. Water 2022, 14, 1657. [Google Scholar] [CrossRef]
Wadoux, A.M.J.-C.; Minasny, B.; McBratney, A.B. Machine learning for digital soil mapping: Applications, challenges and suggested solutions. Earth-Sci. Rev. 2020, 210, 103359. [Google Scholar] [CrossRef]
Chen, S.; Arrouays, D.; Leatitia Mulder, V.; Poggio, L.; Minasny, B.; Roudier, P.; Libohova, Z.; Lagacherie, P.; Shi, Z.; Hannam, J.; et al. Digital mapping of GlobalSoilMap soil properties at a broad scale: A review. Geoderma 2022, 409, 115567. [Google Scholar] [CrossRef]
Hengl, T.; Nussbaum, M.; Wright, M.N.; Heuvelink, G.B.M.; Gräler, B. Random Forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ 2018, 6, e5518. [Google Scholar] [CrossRef]
Saha, A.; Basu, S.; Datta, A. Random forests for spatially dependent data. J. Am. Stat. Assoc. 2021. [Google Scholar] [CrossRef]
Behrens, T.; Schmidt, K.; Viscarra Rossel, R.A.; Gries, P.; Scholten, T.; MacMillan, R.A. Spatial modelling with Euclidean distance fields and machine learning. Eur. J. Soil Sci. 2018, 69, 757–770. [Google Scholar] [CrossRef]
Sekulić, A.; Kilibarda, M.; Heuvelink, G.B.M.; Nikolić, M.; Bajat, B. Random forest spatial interpolation. Remote Sens. 2020, 12, 1687. [Google Scholar] [CrossRef]
Georganos, S.; Grippa, T.; Niang Gadiaga, A.; Linard, C.; Lennert, M.; Vanhuysse, S.; Mboga, N.; Wolff, E.; Kalogirou, S. Geographical random forests: A spatial extension of the random forest algorithm to address spatial heterogeneity in remote sensing and population modelling. Geocarto Int. 2021, 36, 121–136. [Google Scholar] [CrossRef]
Georganos, S.; Kalogirou, S. A forest of forests: A spatially weighted and computationally efficient formulation of geographical random forests. ISPRS Int. J. Geo-Inf. 2022, 11, 471. [Google Scholar] [CrossRef]
Papacharalampous, G.; Tyralis, H.; Langousis, A.; Jayawardena, A.W.; Sivakumar, B.; Mamassis, N.; Montanari, A.; Koutsoyiannis, D. Probabilistic hydrological post-processing at scale: Why and how to apply machine-learning quantile regression algorithms. Water 2019, 11, 2126. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Langousis, A. Super ensemble learning for daily streamflow forecasting: Large-scale demonstration and comparison with multiple machine learning algorithms. Neural Comput. Appl. 2021, 33, 3053–3068. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
Efron, B.; Hastie, T. Computer Age Statistical Inference; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar] [CrossRef]
Friedman, J.H. Multivariate adaptive regression splines. Ann. Stat. 1991, 19, 1–67. [Google Scholar] [CrossRef]
Friedman, J.H. Fast MARS. Technical Report 110. Available online: https://statistics.stanford.edu/sites/g/files/sbiybj6031/f/LCS%20110.pdf (accessed on 17 December 2022).
Kooperberg, C.; Bose, S.; Stone, C.J. Polychotomous regression. J. Am. Stat. Assoc. 1997, 92, 117–127. [Google Scholar] [CrossRef]
Stone, C.J.; Hansen, M.H.; Kooperberg, C.; Truong, Y.K. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat. 1997, 25, 1371–1470. [Google Scholar] [CrossRef]
Kooperberg, C. polspline: Polynomial Spline Routines. R Package Version 1.1.20. 2022. Available online: https://CRAN.R-project.org/package=polspline (accessed on 17 December 2022).
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Langousis, A. A brief review of random forests for water scientists and practitioners and their recent history in water resources. Water 2019, 11, 910. [Google Scholar] [CrossRef]
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Mayr, A.; Binder, H.; Gefeller, O.; Schmid, M. The evolution of boosting algorithms: From machine learning to statistical modelling. Methods Inf. Med. 2014, 53, 419–427. [Google Scholar] [CrossRef] [PubMed]
Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G. Boosting algorithms in energy research: A systematic review. Neural Comput. Appl. 2021, 33, 14101–14117. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
Ripley, B.D. Pattern Recognition and Neural Networks; Cambridge University Press: Cambridge, UK, 1996. [Google Scholar] [CrossRef]
MacKay, D.J.C. Bayesian interpolation. Neural Comput. 1992, 4, 415–447. [Google Scholar] [CrossRef]
Breiman, L. Statistical modeling: The two cultures. Stat. Sci. 2001, 16, 199–215. [Google Scholar] [CrossRef]
Shmueli, G. To explain or to predict? Stat. Sci. 2010, 25, 289–310. [Google Scholar] [CrossRef]
Peterson, T.C.; Vose, R.S. An overview of the Global Historical Climatology Network temperature database. Bull. Am. Meteorol. Soc. 1997, 78, 2837–2849. [Google Scholar] [CrossRef]
Hsu, K.-L.; Gao, X.; Sorooshian, S.; Gupta, H.V. Precipitation estimation from remotely sensed information using artificial neural networks. J. Appl. Meteorol. Climatol. 1997, 36, 1176–1190. [Google Scholar] [CrossRef]
Nguyen, P.; Ombadi, M.; Sorooshian, S.; Hsu, K.; AghaKouchak, A.; Braithwaite, D.; Ashouri, H.; Rose Thorstensen, A. The PERSIANN family of global satellite precipitation data: A review and evaluation of products. Hydrol. Earth Syst. Sci. 2018, 22, 5801–5816. [Google Scholar] [CrossRef]
Nguyen, P.; Shearer, E.J.; Tran, H.; Ombadi, M.; Hayatbini, N.; Palacios, T.; Huynh, P.; Braithwaite, D.; Updegraff, G.; Hsu, K.; et al. The CHRS data portal, an easily accessible public repository for PERSIANN global satellite precipitation data. Sci. Data 2019, 6, 180296. [Google Scholar] [CrossRef] [PubMed]
Hollister, J.W. elevatr: Access Elevation Data from Various APIs. R package version 0.4.2. 2022. Available online: https://CRAN.R-project.org/package=elevatr (accessed on 17 December 2022).
Xiong, L.; Li, S.; Tang, G.; Strobl, J. Geomorphometry and terrain analysis: Data, methods, platforms and applications. Earth-Sci. Rev. 2022, 233, 104191. [Google Scholar] [CrossRef]
Meyer, H.; Pebesma, E. Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods Ecol. Evol. 2021, 12, 1620–1633. [Google Scholar] [CrossRef]
Meyer, H.; Pebesma, E. Machine learning-based global maps of ecological variables and the challenge of assessing them. Nat. Commun. 2022, 13, 2208. [Google Scholar] [CrossRef] [PubMed]
Liu, X.; Kounadi, O.; Zurita-Milla, R. Incorporating spatial autocorrelation in machine learning models using spatial lag and eigenvector spatial filtering features. ISPRS Int. J. Geo-Inf. 2022, 11, 242. [Google Scholar] [CrossRef]
Talebi, H.; Peeters, L.J.M.; Otto, A.; Tolosana-Delgado, R. A truly spatial random forests algorithm for geoscience data analysis and modelling. Math. Geosci. 2022, 54, 1–22. [Google Scholar] [CrossRef]
Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 1904, 15, 72–101. [Google Scholar] [CrossRef]
Gneiting, T. Making and evaluating point forecasts. J. Am. Stat. Assoc. 2011, 106, 746–762. [Google Scholar] [CrossRef]
Papacharalampous, G.; Tyralis, H. A review of machine learning concepts and methods for addressing challenges in probabilistic hydrological post-processing and forecasting. Front. Water 2022, 4, 961954. [Google Scholar] [CrossRef]
Davies, M.M.; van der Laan, M.J. Optimal spatial prediction using ensemble machine learning. Int. J. Biostat. 2016, 12, 179–201. [Google Scholar] [CrossRef]
Egaña, A.; Navarro, F.; Maleki, M.; Grandón, F.; Carter, F.; Soto, F. Ensemble spatial interpolation: A new approach to natural or anthropogenic variable assessment. Nat. Resour. Res. 2021, 30, 3777–3793. [Google Scholar] [CrossRef]
Petropoulos, F.; Svetunkov, I. A simple combination of univariate models. Int. J. Forecast. 2020, 36, 110–115. [Google Scholar] [CrossRef]
Papacharalampous, G.; Tyralis, H. Hydrological time series forecasting using simple combinations: Big data testing and investigations on one-year ahead river flow predictability. J. Hydrol. 2020, 590, 125205. [Google Scholar] [CrossRef]
Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
Tyralis, H.; Papacharalampous, G.; Burnetas, A.; Langousis, A. Hydrological post-processing using stacked generalization of quantile regression algorithms: Large-scale application over CONUS. J. Hydrol. 2019, 577, 123957. [Google Scholar] [CrossRef]
Montero-Manso, P.; Athanasopoulos, G.; Hyndman, R.J.; Talagala, T.S. FFORMA: Feature-based forecast model averaging. Int. J. Forecast. 2020, 36, 86–92. [Google Scholar] [CrossRef]
Talagala, T.S.; Li, F.; Kang, Y. FFORMPP: Feature-based forecast model performance prediction. Int. J. Forecast. 2021, 38, 920–943. [Google Scholar] [CrossRef]
Fulcher, B.D.; Little, M.A.; Jones, N.S. Highly comparative time-series analysis: The empirical structure of time series and their methods. J. R. Soc. Interface 2013, 10, 20130048. [Google Scholar] [CrossRef]
Kang, Y.; Hyndman, R.J.; Smith-Miles, K. Visualising forecasting algorithm performance using time series instance spaces. Int. J. Forecast. 2017, 33, 345–358. [Google Scholar] [CrossRef]
Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A review of machine learning interpretability methods. Entropy 2021, 23, 18. [Google Scholar] [CrossRef]
Belle, V.; Papantonis, I. Principles and practice of explainable machine learning. Front. Big Data 2021, 4, 688969. [Google Scholar] [CrossRef]
Papacharalampous, G.; Tyralis, H.; Pechlivanidis, I.G.; Grimaldi, S.; Volpi, E. Massive feature extraction for explaining and foretelling hydroclimatic time series forecastability at the global scale. Geosci. Front. 2022, 13, 101349. [Google Scholar] [CrossRef]
Fouedjio, F.; Klump, J. Exploring prediction uncertainty of spatial data in geostatistical and machine learning approaches. Environ. Earth Sci. 2019, 78, 38. [Google Scholar] [CrossRef]
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org (accessed on 17 December 2022).
Kuhn, M. caret: Classification and Regression Training; R Package Version 6.0-93. 2022. Available online: https://CRAN.R-project.org/package=caret (accessed on 17 December 2022).
Dowle, M.; Srinivasan, A. data.table: Extension of ‘data.frame’. R Package Version 1.14.4. 2022. Available online: https://CRAN.R-project.org/package=data.table (accessed on 17 December 2022).
Pedersen, T.L. ggforce: Accelerating ‘ggplot2’. R Package Version 0.4.1. 2022. Available online: https://cran.r-project.org/package=ggforce (accessed on 17 December 2022).
Pierce, D. ncdf4: Interface to Unidata netCDF (Version 4 or Earlier) Format Data Files. R Package Version 1.19. 2021. Available online: https://CRAN.R-project.org/package=ncdf4 (accessed on 17 December 2022).
Bivand, R.S.; Keitt, T.; Rowlingson, B. rgdal: Bindings for the ‘Geospatial’ Data Abstraction Library. R Package Version 1.5-32. 2022. Available online: https://CRAN.R-project.org/package=rgdal (accessed on 17 December 2022).
Pebesma, E. Simple features for R: Standardized support for spatial vector data. R J. 2018, 10, 439–446.
Pebesma, E. sf: Simple Features for R. R Package Version 1.0-8. 2022. Available online: https://CRAN.R-project.org/package=sf (accessed on 17 December 2022).
Bivand, R.S. spdep: Spatial Dependence: Weighting Schemes, Statistics. R Package Version 1.2-7. 2022. Available online: https://CRAN.R-project.org/package=spdep (accessed on 17 December 2022).
Bivand, R.S.; Wong, D.W.S. Comparing implementations of global and local indicators of spatial association. TEST 2018, 27, 716–748.
Wickham, H.; Averick, M.; Bryan, J.; Chang, W.; McGowan, L.D.; François, R.; Grolemund, G.; Hayes, A.; Henry, L.; Hester, J.; et al. Welcome to the tidyverse. J. Open Source Softw. 2019, 4, 1686.
Wickham, H. tidyverse: Easily Install and Load the ‘Tidyverse’. R Package Version 1.3.2. 2022. Available online: https://CRAN.R-project.org/package=tidyverse (accessed on 17 December 2022).
Rodriguez, P.P.; Gianola, D. brnn: Bayesian Regularization for Feed-Forward Neural Networks. R Package Version 0.9.2. 2022. Available online: https://CRAN.R-project.org/package=brnn (accessed on 17 December 2022).
Milborrow, S. earth: Multivariate Adaptive Regression Splines. R Package Version 5.3.1. 2021. Available online: https://CRAN.R-project.org/package=earth (accessed on 17 December 2022).
Greenwell, B.; Boehmke, B.; Cunningham, J. gbm: Generalized Boosted Regression Models. R Package Version 2.1.8.1. 2022. Available online: https://CRAN.R-project.org/package=gbm (accessed on 17 December 2022).
Ripley, B.D. nnet: Feed-Forward Neural Networks and Multinomial Log-Linear Models. R Package Version 7.3-18. 2022. Available online: https://CRAN.R-project.org/package=nnet (accessed on 17 December 2022).
Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002; ISBN 0-387-95457-0.
Wright, M.N. ranger: A Fast Implementation of Random Forests. R Package Version 0.14.1. 2022. Available online: https://CRAN.R-project.org/package=ranger (accessed on 17 December 2022).
Wright, M.N.; Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 2017, 77, 1–17.
Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. xgboost: Extreme Gradient Boosting. R Package Version 1.6.0.1. 2022. Available online: https://CRAN.R-project.org/package=xgboost (accessed on 17 December 2022).
Tyralis, H.; Papacharalampous, G. A review of probabilistic forecasting and prediction with machine learning. arXiv 2022, arXiv:2209.08307. Available online: https://arxiv.org/abs/2209.08307 (accessed on 17 December 2022).
Tyralis, H.; Papacharalampous, G. scoringfunctions: A Collection of Scoring Functions for Assessing Point Forecasts. R Package Version 0.0.5. 2022. Available online: https://CRAN.R-project.org/package=scoringfunctions (accessed on 17 December 2022).
Wickham, H.; Hester, J.; Chang, W.; Bryan, J. devtools: Tools to Make Developing R Packages Easier. R Package Version 2.4.5. 2022. Available online: https://CRAN.R-project.org/package=devtools (accessed on 17 December 2022).
Xie, Y. knitr: A Comprehensive Tool for Reproducible Research in R. In Implementing Reproducible Computational Research; Stodden, V., Leisch, F., Peng, R.D., Eds.; Chapman and Hall/CRC: London, UK, 2014.
Xie, Y. Dynamic Documents with R and Knitr, 2nd ed.; Chapman and Hall/CRC: London, UK, 2015.
Xie, Y. knitr: A General-Purpose Package for Dynamic Report Generation in R. R Package Version 1.40. 2022. Available online: https://CRAN.R-project.org/package=knitr (accessed on 17 December 2022).
Allaire, J.J.; Xie, Y.; McPherson, J.; Luraschi, J.; Ushey, K.; Atkins, A.; Wickham, H.; Cheng, J.; Chang, W.; Iannone, R.; et al. rmarkdown: Dynamic Documents for R. R Package Version 2.17. 2022. Available online: https://CRAN.R-project.org/package=rmarkdown (accessed on 17 December 2022).
Xie, Y.; Allaire, J.J.; Grolemund, G. R Markdown: The Definitive Guide; Chapman and Hall/CRC: London, UK, 2018; ISBN 9781138359338. Available online: https://bookdown.org/yihui/rmarkdown (accessed on 17 December 2022).
Xie, Y.; Dervieux, C.; Riederer, E. R Markdown Cookbook; Chapman and Hall/CRC: London, UK, 2020; ISBN 9780367563837. Available online: https://bookdown.org/yihui/rmarkdown-cookbook (accessed on 17 December 2022).

Figure 1. Maps of the geographical locations of: (a) the earth-located stations offering data for the present work; and (b) the points composing the PERSIANN grid defined herein.

Figure 2. Setting of the regression problem. Note that the term “grid point” is used to describe the geographical locations with satellite data, while the term “station” is used to describe the geographical locations with ground-based measurements. Note also that, throughout the present work, the distances d_i, i = 1, 2, 3, 4 are also referred to as “distances 1−4”, respectively, and the total monthly precipitation values at the grid points 1−4 are referred to as “PERSIANN values 1−4”, respectively.
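
The setting in Figure 2 can be illustrated with a minimal sketch that, for one station, finds the four closest grid points and returns their distances and precipitation values. The study itself relied on R packages; the Python below is only illustrative. The haversine distance, the function names and all coordinates and precipitation values are assumptions made for this example and are not taken from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (an assumed distance measure for this sketch)."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def four_nearest(station, grid):
    """Return [(distance_km, value), ...] for the four grid points closest to the station.

    station: (lat, lon); grid: list of (lat, lon, monthly_precipitation).
    """
    ranked = sorted(
        (haversine_km(station[0], station[1], lat, lon), value)
        for lat, lon, value in grid
    )
    return ranked[:4]

# Hypothetical station and grid values, for illustration only.
station = (40.10, -100.10)
grid = [
    (40.00, -100.00, 50.0),
    (40.00, -100.25, 55.0),
    (40.25, -100.00, 45.0),
    (40.25, -100.25, 60.0),
    (40.50, -100.50, 70.0),
]
neighbours = four_nearest(station, grid)
distances = [d for d, _ in neighbours]        # "distances 1-4", closest first
persiann_values = [v for _, v in neighbours]  # "PERSIANN values 1-4", same order
```

Sorting by distance keeps each distance paired with the precipitation value of the same grid point, which matches the numbering convention in the caption.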

Figure 3. Heatmap of the Spearman correlation estimates for all the possible pairs of the variables appearing in the three regression settings.

Figure 4. Scatterplots between the predictand (i.e., the precipitation value observed at an earth-located station) and the following predictor variables: (a) elevation at the location of this station; (b) precipitation value at the closest point on the PERSIANN grid for this station; (c) distance of the fourth closest point on the PERSIANN grid for this station; and (d) longitude at the location of this station. The Spearman correlation estimates are repeated here from Figure 3 for convenience. The redder the color on the graphs, the denser the points.

Figure 5. Barplot of the permutation importance scores of the predictor variables. The latter were ordered from the most to the least important ones (from top to bottom) based on the same scores.

Figure 6. Heatmaps of: (a) the relative improvement (%) in terms of the median squared error metric, averaged across the five folds, as this improvement was provided by each machine and statistical learning algorithm with respect to the linear regression algorithm; and (b) the mean ranking of each machine and statistical learning algorithm, averaged across the five folds. The computations were made separately for each predictor set. The darker the color, the better the predictions on average.
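
The two summaries shown in Figure 6 can be sketched as follows. This is a hedged illustration, not the authors' code: the formula for the relative improvement (percentage reduction of the score with respect to linear regression), the function names and the per-fold scores are all assumptions, and only two folds are used to keep the example short.

```python
from statistics import mean, median

def median_squared_error(observed, predicted):
    """Median of the squared errors between observations and predictions."""
    return median((o - p) ** 2 for o, p in zip(observed, predicted))

def relative_improvement(score, reference_score):
    """Assumed definition: percentage reduction of the error relative to the reference."""
    return 100.0 * (reference_score - score) / reference_score

def rank_scores(scores):
    """Rank algorithms within one fold: 1 = smallest error (ties are not handled)."""
    ordered = sorted(scores, key=scores.get)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Hypothetical per-fold median squared errors, for illustration only.
fold_scores = [
    {"linear_regression": 400.0, "random_forests": 300.0, "xgboost": 280.0},
    {"linear_regression": 500.0, "random_forests": 350.0, "xgboost": 375.0},
]

algorithms = list(fold_scores[0])
# Relative improvement over linear regression, averaged across folds.
mean_improvement = {
    name: mean(
        relative_improvement(fold[name], fold["linear_regression"])
        for fold in fold_scores
    )
    for name in algorithms
}
# Ranking computed within each fold, then averaged across folds.
mean_rank = {
    name: mean(rank_scores(fold)[name] for fold in fold_scores)
    for name in algorithms
}
```

Under this definition linear regression has a relative improvement of zero by construction, and a lower mean rank indicates better predictions on average.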

Figure 7. Sinaplots of the rankings from 1 to 8 of the machine and statistical learning algorithms for the predictor sets (a–c) 1–3. These rankings were computed separately for each pair {case, predictor set}.

Figure 8. Heatmaps of: (a) the relative improvement (%) in terms of the median squared error metric, averaged across the five folds, as this improvement was provided by each machine and statistical learning algorithm with respect to the linear regression algorithm, with this latter algorithm being run with the predictor set 1; and (b) the mean ranking of each machine and statistical learning algorithm, averaged across the five folds. The computations were made collectively for all the predictor sets. The darker the color, the better the predictions on average.

Figure 9. Sinaplots of the rankings from 1 to 24 of the machine and statistical learning algorithms for the predictor sets (a–c) 1–3. These rankings were computed separately for each case and collectively for all the predictor sets.

Table 1. Summary of previous studies and the present study on merging gridded satellite precipitation products and ground-based measurements.

Study	Time Scale	Spatial Scale	Algorithms
He et al. [13]	Hourly	South-western, central, north-eastern and south-eastern United States	Random forests
Meyer et al. [14]	Daily	Germany	Random forests, artificial neural networks, support vector regression
Tao et al. [15]	Daily	Central United States	Deep learning
Yang et al. [16]	Daily	Chile	Quantile mapping
Baez-Villanueva et al. [17]	Daily	Chile	Random forests
Chen et al. [18]	Daily	Dallas–Fort Worth in the United States	Deep learning
Chen et al. [19]	Daily	Xijiang basin in China	Geographically weighted ridge regression
Rata et al. [20]	Annual	Chéliff watershed in Algeria	Kriging
Chen et al. [21]	Monthly	Sichuan Province in China	Artificial neural networks, geographically weighted regression, kriging, random forests
Nguyen et al. [22]	Daily	South Korea	Random forests
Shen and Yong [23]	Annual	China	Gradient boosting decision trees, random forests, support vector regression
Zhang et al. [24]	Daily	China	Artificial neural networks, extreme learning machines, random forests, support vector regression
Chen et al. [25]	Daily	Coastal mountain region in the western United States	Deep learning
Fernandez-Palomino et al. [26]	Daily	Ecuador and Peru	Random forests
Lin et al. [27]	Daily	Three Gorges Reservoir area in China	Adaptive boosting decision trees, decision trees, random forests
Yang et al. [28]	Daily	Kelantan river basin in Malaysia	Deep learning
Zandi et al. [29]	Monthly	Alborz and Zagros mountain ranges in Iran	Artificial neural networks, locally weighted linear regression, random forests, stacked generalization, support vector regression
Militino et al. [30]	Daily	Navarre in Spain	K-nearest neighbors, random forests, artificial neural networks
Present study	Monthly	Contiguous United States	Linear regression, multivariate adaptive regression splines, multivariate adaptive polynomial splines, random forests, gradient boosting machines, extreme gradient boosting, feed-forward neural networks, feed-forward neural networks with Bayesian regularization

Table 2. Inclusion of predictor variables in the predictor sets examined in this work.

Predictor Variable	Predictor Set 1	Predictor Set 2	Predictor Set 3
PERSIANN value 1	✔	✔	✔
PERSIANN value 2	✔	✔	✔
PERSIANN value 3	✔	✔	✔
PERSIANN value 4	✔	✔	✔
Distance 1	×	✔	✔
Distance 2	×	✔	✔
Distance 3	×	✔	✔
Distance 4	×	✔	✔
Station elevation	✔	✔	✔
Station longitude	×	×	✔
Station latitude	×	×	✔
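
The nesting of the predictor sets in Table 2 can be written down compactly. The following is a small sketch in which the column names, the `design_matrix` helper and the example record are hypothetical and chosen only for illustration.

```python
# Hypothetical column names mirroring the rows of Table 2.
persiann = [f"persiann_value_{i}" for i in range(1, 5)]
distance = [f"distance_{i}" for i in range(1, 5)]

predictor_sets = {
    1: persiann + ["station_elevation"],
    2: persiann + distance + ["station_elevation"],
    3: persiann + distance + ["station_elevation", "station_longitude", "station_latitude"],
}

def design_matrix(records, predictor_set):
    """Select the columns of one predictor set from a list of record dicts."""
    columns = predictor_sets[predictor_set]
    return [[record[name] for name in columns] for record in records]

# One made-up record, for illustration only.
record = {name: 0.0 for name in predictor_sets[3]}
record["station_elevation"] = 350.0
x1 = design_matrix([record], 1)  # one row with 5 predictors
x3 = design_matrix([record], 3)  # one row with 11 predictors
```

Each set extends the previous one, so predictor set 1 has 5 variables, predictor set 2 has 9 and predictor set 3 has 11.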

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

