1. Introduction
Canopy height is a very important input to modeling global, national, and local forest productivity dynamics including aboveground biomass (AGB), carbon, and volume as they are allometrically related [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14]. Multi-temporal canopy height measurements can be compared to identify areas where the forest canopy has been significantly altered or removed. The collection of sufficient canopy height information is, therefore, a need in combating deforestation and in strategizing forest conversion and reforestation initiatives. However, large-scale and wall-to-wall collection of canopy height information cannot be attained by traditional forest inventory techniques that often rely on field measurements and manual data collection because of several challenges and limitations [
15,
16,
17].
A good alternative to traditional inventory is remote sensing (RS) technology. However, as generally practiced, the traditional method is still integrated with RS technology. The RS technology provides broader spatial coverage, rapid data acquisition, and detailed information about forest vertical structure [
18,
19]. It offers multispectral and hyperspectral data, enabling detailed analysis of forest structure and precise measurements of canopy height that are crucial for understanding forest dynamics, carbon accounting, and planning sustainable forest management practices and monitoring forest changes [
20]. RS enables efficient multi-temporal monitoring that enables the detection and tracking of deforestation and forest degradation. Furthermore, multi-temporal monitoring also allows the assessment of the impact of deforestation and degradation on carbon emissions, which is vital for designing and evaluating climate change mitigation strategies [
21]. This technology provides a wealth of information about forests, including their spatial distribution, vegetation structure, and health status.
There are two types of RS technology depending on the source of illumination. The sun-dependent type is referred to as passive RS, and those operating using their source of illumination are active RS. The latest development in RS technology is that Sentinel satellites have been launched, providing passive data (Sentinel-2) and active data (Sentinel-1). In 2018, two spaceborne LiDAR sensors (Advanced Topographic Laser Altimeter System (ATLAS) mounted in ICESat-2 and GEDI attached to the International Space Station (ISS)) were launched. The ATLAS is a photon-counting LiDAR, while the GLAS and GEDI are full-waveform types.
Both passive and active RS have good qualities useful in forest dynamics monitoring; however, they have several limitations. Passive RS sensors such as Moderate Resolution Imaging Spectroradiometer (MODIS), Landsat, Sentinel-2, Planetscope, and Worldview provide images of the surface reflectance of Earth’s surface. Passive RS data are wall-to-wall, have global-wide coverage, and are less expensive compared to active RS sensors. However, they have limited sensitivity to the vertical forest canopy structure because the signal from the sensor may not be able to penetrate through the canopy to detect the vertical structure of the vegetation. Nevertheless, such an issue can be solved with the inclusion of vegetation indices in the canopy height modeling. The vegetation indices are based on the reflectance properties of vegetation and can be used to indirectly estimate vegetation density, biomass, and canopy height.
Saturation problems are also being dealt with when using optical data. Saturation can occur when the reflectance values of vegetation pixels reach the maximum value that can be recorded by the sensor [
22,
23] This can result in a loss of information and accuracy in the estimation of canopy height. Several approaches can be used to mitigate saturation problems in canopy height estimation, including the use of multiple spectral bands [
24] and the inclusion of ancillary data like topographic features. Optical imagery provides measurements in multiple spectral bands, which can be used to estimate vegetation properties such as leaf area index and biomass. By combining information from multiple bands, it may be possible to estimate canopy height more accurately even in areas with high reflectance values.
Active RS can penetrate cloud cover and vegetation [
25] and provide its illumination during data collection, thus continuously collecting data in good and bad weather conditions and day and night. The penetration ability of active sensors enables measurement of the vegetation’s vertical structure [
26], which is one of the disadvantages of passive sensors. It is the main reason why passive imageries are integrated with active data, especially when the objective is canopy height estimation. However, active RS also faces limitations. Spaceborne LiDARs such as ICESat-2 and GEDI have almost total Earth coverage but are discrete data. TLS, UAV-LiDAR, and ALS can provide continuous data but have limited coverage due to various factors such as cost, employed methodology of data collection, and the operational capabilities of the equipment.
One active RS method, particularly Sentinel-1, provides wall-to-wall and almost around-the-globe coverage data. The Sentinel-1 mission comprises a constellation of two polar-orbiting satellites, operating day and night, enabling them to acquire imagery regardless of the weather. Sentinel-1 carries a single C-band SAR instrument operating at a center frequency of 5.405 gigahertz (GHz). It provides dual-polarization capability, very short revisit times, and rapid product delivery and offers reliable repeated wide-area monitoring. The C-band Sentinel-1 signal is unlikely to penetrate dense canopies [
27,
28]. Therefore, forest backscatter can be assumed to be unaffected by variations in soil moisture and surface roughness, which can cause large and abrupt changes in the backscatter over other land covers, and forests should behave as fairly time-invariant targets [
29,
30,
31].
Aside from limited canopy penetrating potential, especially in dense vegetation, Sentinel-1 imageries also face challenges such as saturation and failure to detect vegetation understory. Data integration can reduce the impacts of such challenges. For instance, the effectiveness of synergizing Sentinel-1 and Sentinel-2 data for canopy height mapping has been explored in a few studies [
32,
33,
34]. However, research in tropical or subtropical ecosystems is still lacking [
35]. In addition, [
36] discovered that combining multispectral (Landsat) and SAR data yielded greater accuracy in predicting forest structure measurements compared to using a single sensor alone.
The use of ancillary data like topographic data, such as elevation, slope, and aspect, could also affect canopy height estimation. A series of studies have shown that the accuracy of DEM and tree parameters generally decreases as the slope gradient increases. For instance, Ref. [
37] reported that, when the slope increased from 15.6° to 37.6°, the vertical RMSE of tree height extraction increased from 0.576 m to 0.901 m. The studies of [
38] and [
39] showed that increasing slope gradient resulted in reduced tree height because LiDAR inversion and DEM errors led to forest volume errors. Ref. [
40] suggested that, as the slope gradient increased, models neglecting slopes would overestimate tree height and could be corrected by a slope coefficient that would allow a more accurate estimation of tree height.
Table 1 summarizes the advantages and disadvantages as well as the sensitivity of RS data in canopy height modeling.
One of the developing tools in forest resources inventory includes machine learning (ML) technologies. ML is a form of artificial intelligence in which a computer is algorithmically trained to perform a task such as regression or classification [
41,
42]. ML regression techniques play a significant role in canopy height modeling. They can enhance the accuracy and efficiency of the models by learning complex relationships between input data and canopy height. In addition, they can integrate various data sources, such as LiDAR point clouds, RS imagery, topographic data, and ground truth measurements. These models can handle multi-modal data and extract meaningful features from them. Once trained, ML models can be applied to new areas or time periods, provided that the input data are available. This transferability can save time and resources when modeling canopy height in different locations or monitoring changes over time. ML offers a wide range of regression algorithms like gradient boosting (GB), k-nearest neighbor (k-NN), and random forest (RF).
GB combines multiple weak predictive models to create a strong predictive model. It is a type of ensemble learning method that belongs to the boosting family of algorithms. GB iteratively builds a sequence of weak prediction models, typically decision trees, and combines their predictions to form a more accurate and robust final prediction. Each weak model is trained on the residuals or errors of the previous models, with the goal of improving the overall prediction in subsequent iterations. This process continues iteratively until a predefined stopping criterion is met, such as reaching a specified number of weak models or achieving a desired level of accuracy. To combine the predictions of the weak models, GB typically uses a weighted average approach, where each model’s prediction is weighted according to its performance or contribution to the overall error reduction. This weighting ensures that more emphasis is placed on the models that provide better predictions.
On the other hand, k-NN regression models represent a simple yet powerful ML algorithm. It is a non-parametric algorithm that makes predictions based on the similarity of data points in the feature space. The underlying idea behind k-NN is that similar data points tend to have similar labels or target values. k-NN regression exploits this assumption by averaging the target values of the nearest neighbors. k-NN does not have an explicit training phase. Instead, it stores the training data points and their corresponding labels or target values. During prediction, the algorithm searches for the k-nearest neighbors of a given test data point from the training data. To determine the similarity between data points, k-NN typically uses a distance metric. The k in k-NN represents the number of nearest neighbors considered for making predictions. It is a hyperparameter that needs to be predefined before applying the algorithm. The choice of k depends on the dataset and the problem at hand. Smaller values of k may lead to more flexible decision boundaries but increase the risk of overfitting, while larger values of k may provide smoother decision boundaries but risk oversimplifying the model. During prediction, k-NN finds the k-nearest neighbors of a test data point based on the chosen distance metric. It calculates the distance between the test point and all the training points and selects the k points with the smallest distances. For regression, it computes the average (or weighted average) of the target values of the k-nearest neighbors.
The RF developed by [
43] is a popular ML algorithm that belongs to the ensemble learning family. It combines multiple decision trees to create a powerful predictive model. Each decision tree is trained on a random subset of the training data, and the final prediction is determined by aggregating the predictions of individual trees. The RF algorithm is widely used in various domains due to its robustness, accuracy, and interpretability. RF builds a collection of decision trees and aggregates their predictions to provide an accurate and robust regression model.
Integrating RS and ML techniques for canopy height modeling is efficient and cost-effective. However, despite the advancement of the technology for the collection of canopy height information and the existence of global-wide canopy height data coverage, wall-to-wall canopy height information is still insufficient. Therefore, taking advantage of available data is necessary in order to come up with efficient, effective, accurate, and flexible canopy height modeling. Thus, this study was conducted to generate wall-to-wall canopy height information by integrating medium-resolution passive satellite images such as Sentinel-2, active RS data such as SAR and LiDAR, satellite-derived indices such as vegetation, water, soil, biophysical, and backscatter indices, and ancillary data such as elevation, aspect, and slope. Specifically, it aimed to assess the performance of the ML regression models and spaceborne LiDAR discrete data and assess the effect of integrating Sentinel-1 and -2 features in canopy height modeling.
3. Results
3.1. Data Distribution of the Retrieved Spaceborne-LiDAR-Based Canopy Height
Table 6 shows the descriptive statistics of the GEDI- and ICESat-2-based canopy height that were retrieved specifically to serve as training data for the different ML-based regression models. Within the study site, a total of 14,910 GEDI footprints were retrieved. The GEDI-based canopy height has maximum and minimum values of 89.77 m and 1.90 m, respectively, and a mean of 26.94 m.
Based on the frequency distribution (%), the frequency of the GEDI data ranges from 0.57% to 18.52%; 18.52% of the height ranges from 25 to 30 m; 17.92% ranges from 20 to 25 m; 13.78% ranges from 30 to 35 m; and 12.86% ranges from 15 to 20 m. The other height values have a frequency of <10%. Overall, 44.51% have a height higher than the average height, which is 26.94 m, and 55.49% have a height lower than the average height.
On the other hand, the ICESat-2-based canopy height has a maximum and minimum height of 149.34 m and 2.27 m, respectively, and an average height of 33.53 m. Based on the frequency distribution (%), the frequency of the ICESat-2-based canopy height ranges from 0.10% to 14.98%, and 14.98% of the footprints have a maximum height of 30–35 m; 14.38% range from 35 to 40 m; 13.62% range from 25 to 30; and 11.76% range from 40 to 45 m. All the remaining height interval values are <10%. Overall, 48.95% of the footprints have a height higher than the average height, which is 33.53 m, and 51.05% are below the average height.
As observed, the maximum canopy heights of GEDI and ICESat-2 are extremely high. However, the high canopy height values were still included after visual inspection using the generated high-resolution orthophotos because they were located in forest areas. The generated extremely high canopy height values of GEDI and ICESat-2 were probably because of the site location, canopy density, and undulating terrain. As recalled, Lishan is located in a high-elevation mountainous region of Taiwan composed of varied species indicating a heterogenous canopy structure and with a complex topography. According to [
66,
70], complex terrain conditions with dense forest canopy cover influence the quality of the digital terrain and CHM data. In addition, Refs. [
71,
72] mentioned in their work that spaceborne-LiDAR-based products have high uncertainty in mountainous regions.
3.2. Agreement of ALS-, GEDI-, and ICESat-2-Based Canopy Height
Following the locations of the collected GEDI footprints, the ALS-based canopy height was extracted for comparison. A total of 4400 GEDI footprints are located within the boundary of the ALS-based canopy height. When the GEDI- and ALS-based canopy heights were compared as shown in
Figure 5a, it is observed that the frequency of ALS data is distributed mainly from heights of 0–30 m (95.84%). The remaining frequency totaling 4.16% is observed at heights of >30 m. GEDI, on the other hand, is concentrated mainly at heights 15–40 m with a frequency of 75.29%. The remaining 24.71% of the data have heights of <15 m and >40 m. The average height of the ALS-based canopy height is 16.94 m, with an SD of 8.39 m and ranging from 0 to 40.69 m. The corresponding GEDI footprints have a mean height of 27.21 ± 11.17 m and range from 0 to 89.77 m.
The same process was performed for the ICESat-2 data, as shown in
Figure 5b. A total of 2410 ICESat-2 footprints overlapped with the ALS-based canopy height data. The result shows the same pattern as that of the comparison of GEDI- and ALS-based canopy height, of which a high frequency of 92.09% of ALS-based canopy height is observed at 30 m and below while the remaining 7.91% have heights ranging from 30 to 47.53 m. The ALS-based canopy height ranges from 0.02 to 47.53 m and has a mean of 19.10 ± 7.46 m. The ICESat-2 -based canopy height ranges from 0 to 149.83 m with an average of 34.14 ± 15.21 m.
Figure 6 shows the 1:1 relationship or overall agreement between the ALS-based canopy height and the spaceborne-LiDAR-based canopy height and the descriptive statistics of the spaceborne-LiDAR-based canopy height bias. A low level of overall agreement was observed for both the spaceborne LiDARs regarding ALS-based canopy height. In
Figure 6a, an R
2 of 0.28 was observed for the linear regression of ALS-based canopy height and GEDI-based canopy height (n = 4400), and 44.92% of the GEDI-based canopy height underestimated the ALS-based canopy height, while the remaining 55.08% overestimated the ALS-based canopy height. In
Figure 6b, an R
2 value of 0.03 was observed for the linear regression of ALS-based canopy height and ICESat-2-based canopy height (n = 2410), indicating a poor correlation. Moreover, 37.36% of the ICESat-2-based canopy height underestimated the ALS-based canopy height, while the remaining 62.64% overestimated the ALS-based canopy height.
The descriptive statistics of the canopy height bias of GEDI-based canopy height and ICESat-2-based canopy height that were integrated in
Figure 6 show that the minimum and maximum bias of the GEDI-based canopy height are 63.01 m and −52.16 m, respectively, while the average + SD canopy height bias is −1.17 + 10.92 m. On the other hand, the minimum and maximum canopy height bias of the ICESat-2-based canopy height are −42.10 m and 103.32 m, respectively, while the mean + SD canopy height bias is 5.37 + 15.08 m. The GEDI-based canopy height has a lower maximum and average canopy height bias than ICESat-2 but has a higher minimum canopy height bias than ICESat-2.
3.3. ML-Based Canopy Height Modeling
The three ML regression models, GB, k-NN, and RF, with GEDI and ICESat-2-based canopy height as training data were built for each of the three layerstacked datasets (Sentinel-1&2, Sentinel-2, and Sentinel-1).
Figure 7 shows the spatial distribution of the ML-based predicted canopy height, while
Table 7 shows the corresponding descriptive statistics. Overall, the spatial distribution shows that low canopy height values are observed in the middle part of the study site, which is known to be agriculture and water areas, and at some of the edges of the study site, which, if viewed on Google Earth, are grassland areas. Visually comparing the different maps in
Figure 7, the different wall-to-wall canopy height maps are similar. However, the descriptive statistics that are shown in
Table 7 tell otherwise. It could be observed that regression model bias occurred for GB and RF, as indicated by the negative values in Sentinel-1 data for both ICESat-2 and GEDI. This indicates that RF and GB were not able to capture the true relationship between the features of the Sentinel-1 data and canopy height. Negative values were merely observed in areas with low canopy height. As recalled, the minimum values of the GEDI- and ICESat-2-based canopy height are 1.90 m and 2.27 m, respectively. One of the possible factors of non-establishment of the true relationship of the features of Sentinel-1 and canopy height is the non-representation of canopy height values that are less than 1.90 m and 2.27 m. One possible reason is the data quality of the Sentinel-1 data. As can be observed, model bias was not observed with Sentinel-2 and Sentinel-1&2. This indicates that the features of Sentinel-1 are not sufficient to establish a relationship with canopy height, thus requiring other variables for canopy height modeling. The negative values were no longer observed when Sentinel-1 was integrated with Sentinel-2. In comparison with the k-NN model, it did not produce negative canopy height values. As recalled, the k-NN regression model considers its nearest neighbors; therefore, if there are no negative values provided in the training data, no negative values will be generated.
Neglecting the negative values, the GB-based canopy height has a minimum canopy height value ranging from 0.06 m to 0.66 m for GEDI and 2.86 m to 5.89 m for ICESat-2. The maximum canopy height ranges from 66.71 m to 70.15 m for GEDI and 86.25 m to 107.16 m for ICESat-2. On the other hand, the mean canopy height values range from 26.44 m to 26.53 m for GEDI and 37.54 m to 37.75 m for ICESat-2. The difference in the mean canopy height values of GEDI and ICESat-2 ranged from 11.10 m to 11.22 m.
For the k-NN regression model, all the predicted canopy heights of the different datasets have a minimum value of 0. This also shows that the predicted canopy height values of ICESat-2 are higher than GEDI, as shown by the maximum and mean canopy height values. The maximum canopy height values for all the satellite data are 99.64 m for GEDI and 146.84 m for ICESat-2. For mean canopy height, different values were observed in all the datasets. The mean canopy height of the GEDI-based model ranges from 26.50 m to 26.64 m, and ICESat-2 ranges from 37.51 m to 37.96 m. The difference in the mean canopy height values of GEDI and ICESat-2 ranges from 11.01 m to 11.32 m.
The precision of the accuracy of the RF-based canopy height modeling was analyzed through testing several configurations. The results revealed that the best configuration for canopy height modeling is the one that uses all the metrics: the Sentinel-2 dataset includes surface reflectance, spectral indices, biophysical features, and topographical features; the Sentinel-1 dataset includes backscatter coefficient, SAR indices, and topographical features; the Sentinel-1&2 dataset includes surface reflectance, spectral indices, biophysical features, backscatter coefficient, SAR indices, and topographical features. Therefore, all the metrics of each of the three layerstacked datasets were used for the wall-to-wall canopy height modeling of the study site.
Figure 8 shows the importance of each metric of each dataset based on the percentage increase in mean square error (%IncMSE). Variable importance is one of the unique features of RF models. Variable importance refers to the measure of how much each input variable contributes to the model’s predictive performance. It could be observed that the top three features with the most importance are LAI, MNDWI, and elevation for Sentinel-1&2 × ICESat-2; LAI, elevation, and coherence for Sentinel-1&2 × GEDI; LAI, MNDWI, and elevation for Sentinel-2 × ICESat-2; elevation, slope, and B4 for Sentinel-2 × GEDI; slope, elevation, and sigmaVV for Sentinel-1 × ICESat-2; and slope, elevation, and coherence for Sentinel-1 × GEDI. In general, the topographic features are all included in the top three features with the highest importance, indicating that topographic features are needed for canopy height modeling. It could be observed that the same variables have different importance depending on the dataset. This is because the variable importance of RF is affected by sample size, data distribution, and range. It could be recalled that the sample sizes of the collected ICESat-2- and GEDI-based canopy heights are not the same. Also, the data distributions and ranges of the two datasets are not the same.
From the spatial distribution (
Figure 7), it could be observed that the areas with low canopy height values are very distinct and are observed in the middle and at the lower and upper left edges of the study site, just like in GB and k-NN. Through visual comparison of the predicted canopy height, the canopy heights of Sentinel-2 and Sentinel-1&2 using ICESat-2 are different from the other predicted canopy heights. Water areas are indistinguishable from agriculture areas and grassland areas.
Based on the descriptive statistics (
Table 7), it could be observed that the minimum values of Sentinel-1 are negative values, indicating overfitting of the model for both GEDI and ICESat-2. Neglecting the negative values, the minimum canopy height of the RF-based predicted canopy height ranges from 2.22 m to 2.24 m for GEDI and 5.59 m to 6.09 m for ICESat-2. For the maximum canopy height, the GEDI-based canopy height ranges from 71.77 m to 73.28 m and the ICESat-2-based ranges from 103.69 m to 105.52 m. The mean canopy height of GEDI ranges from 26.71 to 26.86 m and the ICESat-2-based canopy height ranges from 38.40 m to 38.90 m. Obtaining the range of the difference in the means of GEDI- and ICESat-2-based data resulted in a difference ranging from 11.69 m to 12.04 m.
In general, it could be observed that the predicted canopy height values produced by the three models are extremely high. This could probably be an after-effect of using extremely high values during the training of the models. As recalled, the canopy height data that were used for training the models also have extremely high maximum values. This indicates that the training data should be accurate in order for the models to produce accurate results.
3.4. Correlation of the ML-Based Predicted Canopy Height with the ALS-Based Canopy Height
Finally, after generating the wall-to-wall ML-based canopy height values based on different datasets, they were compared with the ALS-based canopy height values. The result of bivariate Pearson correlation analysis showed a positive significant correlation that ranges from 0.38 to 0.64 (
Table 8). The highest correlation (0.64) is observed for both the Sentinel-1&2 and Sentinel-2 datasets, with GEDI as the training data and with GB as the regression model. It could be observed that the training data of the top five datasets with the highest correlation are GEDI, satellite images are Sentinel-1&2 and Sentinel-2, and regression models are GB and k-NN. On the other hand, the two datasets with the least correlation with ALS-based canopy height have a satellite image of Sentinel-1, training data ICESat-2, and the regression models are k-NN and RF, indicating that the training data have the most important role as compared with satellite image and regression models in canopy height modeling.
3.5. Performance of the ML-Based Predicted Canopy Height Based on RMSE (m) and PRMSE (%)
With accuracy measures such as RMSE(m) and PRMSE (%), the performance of the wall-to-wall ML-based canopy height was validated against the ALS-based canopy height. The ALS-based canopy height only covers a portion of the whole study site. Therefore, for the validation, only the area with ALS-based canopy height was compared with the wall-to-wall ML-based canopy height.
The factors affecting RMSE (m) and PRMSE (%), such as the characteristics of the satellite image, the composition of training data, and the type of regression model, were considered in the analysis. The main and interaction effects of the different factors were analyzed using the three-way full-factorial ANOVA model. The F-value,
p-value, and the partial eta squared (ηp2) were used to present the interpretation of the results of the ANOVA model. Although there is concern regarding prevalent misuse and misconceptions regarding using
p-values [
73], the values were consistent with the F-values and ηp2. As shown in
Table 9 and
Table 10, the ANOVA revealed a significant main effect for the regression model (RMSE: F-value = 2782.22,
p-value =< 0.001; PRMSE: F-value = 1701,
p-value =< 0.001), satellite data (RMSE: F-value = 193.09,
p-value =< 0.001; PRMSE: F-value = 118.07,
p-value < 0.001), and training data (RMSE: F-value = 1,011,356.76,
p-value < 0.001; PRMSE: F-value = 618,324.81,
p-value < 0.001). The result indicates that the RMSE and PRMSE of ML-based canopy height modeling are statistically different and therefore significantly depend on the regression model, satellite data, and training data. The ηp2 value, which provides an overview of the magnitude of the effect of the independent variables, indicates that the training data have the most substantial influence on the RMSE of canopy height modeling, with a value of 1, followed by the regression model and satellite data, with values of 0.97 and 0.70, respectively. The result of ηp2 for PRMSE is the same as for RMSE.
The ANOVA model also revealed a statistically significant difference between the RMSE and PRMSE of the interaction between the two factors at a 0.05 probability level. Based on the F-value of the interactions with significant value, the two-level interaction between the regression model × training data has the highest value (RMSE: F-value = 1173.73; PRMSE: F-value = 717.61), followed by satellite data × training data (RMSE: F-value = 983.93; PRMSE: F-value = 601.54) and regression model × satellite data (RMSE: F-value = 89.72; PRMSE: F-value = 54.86).
The interaction of the full factors (regression model × satellite data × training data) also showed highly significant influence on RMSE and PRMSE, with an F-value and p-value of 155.06 and <0.001, respectively, for RMSE and 94.8 and <0.001, respectively, for PRMSE.
Duncan’s post hoc test was used for multiple comparisons of the factor levels, and the result is shown in
Figure 9. The factor levels were grouped into several subsets based on the significant differences of the means.
Figure 9a is the result of the comparison of the RMSE and PRMSE of the regression models. It shows that the RMSE and PRMSE of the GB, k-NN, and RF are statistically significant different from each other at a 0.05 level of significance with a
p-value of <0.001.
The average RMSE/PRMSE of the levels under the regression models, GB, k-NN, and RF, are significantly different, with values of 11.08 m/43.75%, 11.36 m/44.86%, and 11.64 m/45.79%, respectively. The result indicates that the difference in the RMSE of the GB and k-NN is 0.28 m, while the k-NN and RF have an RMSE difference of 0.28 m. On the other hand, the GB and RF have a difference of 0.56 m. A slight difference in PRMSE was also observed between the three regression models. The difference in the PRMSE of GB and k-NN is 1.11%, while k-NN and RF and GB and RF have differences of 0.93% and 2.04%, respectively. The amount of improvement in the accuracy of the canopy height modeling using the three regression models indicates that the use of different regression models did not affect the accuracy of the canopy height modeling much. Nevertheless, among the regression models, GB provided the highest accuracy, followed by k-NN. The RF model has the least accuracy among the three.
Figure 9b is the result of the comparison of the RMSE and PRMSE of the satellite data. Accordingly, the RMSE and PRMSE of Sentinel-1&2 and Sentinel-2 are not statistically significantly different from each other but are significantly different from Sentinel-1 at a 0.05 level of significance with a
p-value of <0.001. The RMSE/PRMSE values of Sentinel-1&2, Sentinel-2, and Sentinel-1 are 11.31 m/44.69, 11.32 m/44.70, and 11.44 m/45.21, respectively.
Figure 9c is the result of the pairwise comparisons of the RMSE and PRMSE of the training data. According to the pairwise comparison, the RMSE and PRMSE of ICESat-2 and GEDI are statistically significantly different, with a
p-value of <0.001. The RMSE/PRMSE of GEDI are 8.24 m/34.54%, while ICESat-2 is 14.18 m/57.19%. The RMSE is decreased by 5.94 m when GEDI is used as training data instead of ICESat-2. On the other hand, the PRMSE is decreased by as much as 22.65%.
Finally, the RMSE and PRMSE of the predicted canopy height were determined based on the interactions of the satellite data, training data, and regression model (
Table 11). The levels of the three factors resulted in a total of 18 combinations. The 18 combinations were divided by Duncan’s test post hoc test into 16 homogeneous subsets for RMSE and 14 for PRMSE. The RMSE of the full interactions ranged from 8.00 m to 15.09 m, with an average of 11.36 m, and 50.0% of the full interactions have values that are below the average RMSE and 50.0% have values above the average RMSE value. PRMSE ranged from 31.59% to 59.62%, with an average of 44.86%, and 50% of the datasets have PRMSE values below the average and 50% have values that are above the average PRMSE. The dataset with the highest and lowest accuracy has a difference of 7.10 m based on RMSE and 28.03% based on PRMSE.
To summarize, the characteristics of satellite data, composition of training data, and type of regression model play an important role in the accuracy of canopy height modeling in a critical way based on the three-way full-factorial ANOVA design. On average, when considering a particular factor with the most appropriate method or parameters, the recommended datasets are Sentinel-1&2 for satellite data, GEDI for training data, and GB for the regression model because they have the highest accuracy, outperforming the other data combinations. In contrast, the most unfavorable would be using Sentinel-2 for satellite data, ICESat-2 for training data, and RF for the regression model.
3.6. Agreement of GEDI and ICESat-2 in Canopy Height Modeling
The agreement of the predicted canopy height products using ICESat-2 and GEDI as training data was determined by obtaining the difference between the two products.
Figure 10 shows the spatial and frequency distribution of the agreement. Negative values indicate that the GEDI-based canopy height product has a lower prediction value than the ICESat-2-based canopy height product, and 0 values indicate high agreement between the GEDI and ICESat-2 products. On the other hand, values farthest from 0 indicate the lowest agreement. The agreement of the two products using the GB regression model ranges from −70 to 26 m. The majority of the agreement, equivalent to 91.57%, is within −20 to 0 m, indicating high agreement. The agreement when using the k-NN regression models ranges from −118 to 55 m, of which the major frequency is observed from −20 to 0m. The total frequency is 95.44%, indicating that the agreement is high. On the other hand, when using the RF regression model, the agreement between the two products ranges from −79 to 33 m, a lower range compared to k-NN. The majority of the agreement is observed from −20 to 0 m, totaling 89.48%. Generally, low agreement is observed in lower-right and upper-left corners of study sites. Compared with the satellite image, areas with low agreement are observed in boundaries of low and high vegetation, areas with low vegetation, and also in shaded areas of the satellite image.
3.7. Comparison of the Spatial Distribution of the Canopy Height Prediction Bias of GEDI and ICESat-2
Using the Sentinel-1&2 data, the spatial distribution of the canopy height prediction bias produced by GEDI and ICESat-2 was visualized (
Figure 11). Through visual inspection, it can be observed that high bias values are located toward the center of the study site. Upon inspection with the satellite image, it is observed that the center of the site is a non-forested area and includes water, indicating low canopy height values.
Figure 12 shows the behavior of the canopy height prediction and prediction bias throughout the observed canopy height. In general, observed canopy heights with low values were over-predicted, as indicated by positive percentage prediction bias, while those observed canopy heights with higher values were under-predicted, as indicated by negative percentage prediction bias. The observed canopy heights with low values have high prediction bias. When GEDI-based canopy height is used as training data, the GB regression model has a prediction bias ranging from −45.45 m to 51.52 m with a mean + SD of −3.30 + 8.83 m; the k-NN regression model has a bias ranging from −59.04 m to 52.3 m with a mean + SD of −3.45 + 8.96 m; while the RF has a bias ranging from −48.03 m to 50.43 m and a mean + SD of −3.69 + 8.99. When using ICESat-2, the canopy height prediction bias of the GB regression model ranges from −65.36 m to 42.49 m with a mean + SD of −13.11 + 9.99; the k-NN has a bias ranging from −91.08 m to 48.03 m with a mean + SD of −14.07 + 9.88 m; while the RF has a bias ranging from −77.30 m to 40.47 m with a mean + SD of −14.64 + 10.33 m. Based on frequency, when using GEDI-based canopy height as training data, the majority of the prediction bias of the three regression models is within −20 m–10 m, around 90%. When using the ICESat-2-based canopy height as training data, the majority of the prediction bias of the regression models increased and is between −10 and −65.36, −91.08, and −77.30 m for GB, k-NN, and RF, respectively. The prediction bias of the three regression models is higher when ICESat-2 data are used as training data. Further,
Figure 12 shows the correlation of the predicted canopy height and observed canopy height and the correlation of the prediction bias with the observed canopy height. In general, the GEDI-based canopy height has higher correlation with the ALS data than the ICESat-based canopy height. In addition, it is observed that the observed canopy height with low values has the highest positive bias. This indicates that the regression models tend to over-predict low canopy height values and under-predict high canopy height values.
4. Discussion
4.1. Effects of Integrating SAR Data with Optical data in Canopy Height Modeling
Figure 13 depicts the improvement in the accuracy of Sentinel-1 when it was integrated into Sentinel-2 data during canopy height modeling using the GB regression model. Based on RMSE, the accuracy of the predicted canopy height was improved from 8.37 m to 8.00 m, and, based on PRMSE, the accuracy was improved to 31.59% from 33.06%. Accuracy percentage increases of 4.62% (RMSE) and 4.65% (PRMSE) were observed.
Figure 14 shows the spatial distribution and the descriptive statistics of the canopy height prediction bias. Based on the descriptive statistics, the mean canopy height bias values, which are 4.03 m and 3.92 m for Sentinel-1 and Sentinel-2, respectively, were decreased to 3.45 m when the features of the two datasets were integrated in canopy height modeling. This indicates that integrating Sentinel-1 data with Sentinel-2 data helps to improve canopy height accuracy.
Combining Sentinel-1 and Sentinel-2 data for canopy height modeling can offer several advantages due to the complementary nature of these sensors. Both Sentinel-1 and Sentinel-2 are part of the ESA’s Copernicus program, but they operate in different parts of the electromagnetic spectrum and provide distinct types of information.
Sentinel-1 is a SAR sensor that operates in the microwave range. It can penetrate through clouds and provide information about the structure of vegetation. Combining SAR data with optical data from Sentinel-2, which operates in the visible and NIR range, allows for more comprehensive characterization of vegetation structure. This is particularly useful in areas with frequent cloud cover as SAR can provide data even in cloudy conditions, thus improving sensitivity to vegetation structure. Sentinel-1 has a shorter revisit time compared to Sentinel-2, which means it can capture changes in vegetation structure more frequently. Combining the two datasets enables the creation of more temporally dynamic canopy height models, providing insights into seasonal variations and growth patterns, therefore enhancing temporal resolution.
SAR data from Sentinel-1 can be sensitive to the vertical structure of vegetation, allowing for more accurate estimation of canopy height. Integrating this information with the spectral data from Sentinel-2 can lead to a more robust and accurate model for canopy height. With this, it is understood that there will be improved accuracy in canopy height estimation.
In summary, combining Sentinel-1 and Sentinel-2 features in canopy height modeling can provide a more comprehensive and accurate representation of vegetation structure, especially in challenging environmental conditions. However, careful data integration and processing are crucial for realizing the full benefits of this multi-sensor approach.
4.2. Effects of Topography on Canopy Height Modeling
Using the canopy height data generated from the interaction of Sentinel-1&2, GB, and GEDI, the correlations of canopy height with topographic data were emphasized.
Figure 15 shows the scatterplot of the predicted canopy height and topographic data such as slope, elevation, and aspect. The R
2 of the of the canopy height and slope is 0.48. On the other hand, the relationship of elevation and aspect to canopy height is observed to have a very low R
2, indicating that no particular relationship is observed.
To summarize, the effect of slope on predicted canopy height shows that Sentinel-1, GB, and GEDI have the highest R2 compared to the other levels of the same factor.
Topographic features such as aspect, elevation, and slope can significantly influence canopy height prediction models. Integrating these features into the modeling process can enhance the accuracy and interpretability of the predictions. Aspect refers to the direction a slope faces (e.g., north, south, east, or west). It influences the amount of sunlight a location receives and, consequently, can affect vegetation growth. In canopy height prediction, an aspect can impact the microclimate and water availability, affecting vegetation growth patterns. In areas with different aspects, the amount and intensity of sunlight may vary, leading to variations in canopy structure. Including aspects in the model can capture these spatial variations and improve height predictions.
Elevation represents the height of a location above a reference point (usually sea level), and it influences temperature, precipitation, and atmospheric pressure. During canopy height modeling, vegetation tends to exhibit altitude-dependent variations in growth. Different elevations may have different ecological zones and vegetation types. Elevation can indirectly influence canopy height by affecting climate and soil conditions. Models that incorporate elevation information can better capture these variations in vegetation structure.
Slope represents the steepness or incline of the terrain. It affects water drainage, soil erosion, and sunlight exposure. As to the influence on canopy height prediction, slope can influence the distribution of water, nutrients, and sunlight, which in turn affects vegetation growth. Steeper slopes may have different vegetation types and heights compared to flatter areas. Including slope as a feature in canopy height models helps to account for these variations and improves the model’s predictive accuracy.
Combining these three topographic features provides a more comprehensive characterization of the terrain. The combined effects of aspect, elevation, and slope allow for a more nuanced understanding of the relationships between terrain characteristics and vegetation structure. For example, certain aspects combined with specific elevations might create microclimates that are favorable or unfavorable for certain types of vegetation. Models that consider these combined effects can provide more accurate predictions of canopy height. When incorporating topographic features into canopy height prediction models, it is essential to consider the scale of analysis, potential interactions between features, and the availability of accurate topographic data. Additionally, careful validation and testing of model performance are crucial to ensuring the reliability of predictions across different landscapes and conditions.
4.3. Limitations of Generating Wall-to-Wall Canopy Height Information Using Discrete RS Data, Medium-Resolution Passive and Active Satellite Images, and ML Regression Techniques
The integration of RS with ML techniques for canopy height modeling presents various challenges. While this approach offers valuable insights, it also involves complexities that need to be addressed. Some of the challenges associated with integrated RS- and ML-based canopy height modeling are data integration and heterogeneity, lack of ground truth data, complexity of vegetation structure, model overfitting, computational intensity, interpretability, transferability across regions, temporal dynamics, scale dependency, availability and accessibility of data, and handling noisy and incomplete data.
Data integration and heterogeneity: Integrating data from different remote sensing sources (e.g., LiDAR, SAR, or optical imagery) requires careful data fusion. These sources may have different resolutions, scales, and data formats, making it challenging to create a homogeneous dataset for modeling. For this study, the data that were used, like Sentinel-1, Sentinel-2, and topographic data, have a different spatial resolution, like 40 m, 10 m, and 20 m, respectively. During the analysis, the 40 m and 20 m data were rescaled to 10 m, which may have had a negative impact on the original values. For the LiDAR data, the collected ALS data had 1 m resolution but needed to be rescaled to 10 m, 40 m, 17 m, and 25 m to be consistent with the Sentinel-2, Sentinel-1, ICESat-2, and GEDI data, respectively. Again, this could have impaired the data quality.
Lack of ground truth data: Obtaining accurate and sufficient ground truth data for training ML models is often a challenge, especially at a large scale. Ground-based measurements for canopy height across diverse landscapes may be limited or expensive to collect. In this study site, wider coverage was used regarding canopy height prediction. However, in the validation, only the coverage of the ground truth regarding the ALS data was validated due to the availability of the ground truth. It is noted that, when the ALS data were collected, a significant budget was allotted. In modeling, it is better that wider coverage is validated for reliability purposes.
Model overfitting: ML models may overfit the training data, capturing noise and specific patterns that do not generalize well to new data. Balancing model complexity and generalization is crucial to avoid overfitting. Model overfitting in this study was observed in the RF and GB models.
Addressing these challenges involves a combination of advances in algorithm development, careful preprocessing of remote sensing data, collaboration across interdisciplinary teams, and efforts to improve data availability and quality. Additionally, ongoing research in the field is essential to develop innovative solutions and improve the overall robustness of integrated RS- and ML-based canopy height modeling, especially when discrete data are regressed to generate wall-to-wall canopy height.
This study has successfully generated wall-to-wall canopy height information using the discrete data from spaceborne LiDARs, medium-resolution passive and active satellite images, satellite-derived indices, and ancillary data. However, it is limited to general canopy height modeling and was not able to explore the challenges involving the diverse characteristics of vegetation, such as variations in tree species, forest types, and growth stages. Moreover, the generated wall-to-wall canopy height captured the physical pattern of the canopy structure of Lishan Forest when studied visually with real-time high-resolution photos but was not able to significantly address the high canopy height estimation bias. These limitations of the study can serve as baseline information for future research on generating high-accuracy canopy height modeling.
5. Conclusions
This study used discrete data that are provided by spaceborne LiDARs such as GEDI and ICESat-2 to generate wall-to-wall canopy height information using ML-based regression models and satellite images and ancillary data. Evaluation of the results demonstrated that the interaction of the GB regression model, GEDI, and Sentinel-1&2 in canopy height modeling had the best accuracy. Meanwhile, the interaction of RF, ICESat-2, and Sentinel-2 had the lowest accuracy. Moreover, the comparison of the performance of the GEDI and ICESat-2 data in canopy height modeling showed that GEDI data have better agreement with the ALS-based canopy height than ICESat-2. The predicted canopy height has more or less similar appearance to the satellite image when visualized but has low quality when being measured based on RMSE and PRMSE. It is noted that the best accuracy that was attained was 8.00 m and 31.59% based on RMSE and PRMSE, respectively.
This study also concludes that the main and interaction effects of the factors such as characteristics of satellite data, type of regression model, and composition of training data have a significant effect on the accuracy of the predicted canopy height. Furthermore, from the result of Duncan’s post hoc test for the levels of the factors, it can be concluded that regression models such as GB, k-NN, and RF have significant differences as well as training data such as GEDI and ICESat-2. In contrast, the levels of satellite data such as Sentinel-2 and Sentinel-1&2 have no significant difference with each other but are significantly different from Sentinel-1.
This study also concludes that integrating Sentinel-1 data with Sentinel-2 contributes to enhancing canopy height accuracy. The RMSE improved from 8.37 m to 8.00 m, and the PRMSE improved from 33.06% to 31.59% (GEDI-based canopy height). On the other hand, the mean canopy height bias of Sentinel-1, initially 4.03 m, decreased to 3.45 m when integrated with Sentinel-2.
Accurate prediction of canopy height is vital for effective forest management, offering valuable insights into forest structure. It informs decisions on timber harvesting, biodiversity conservation, and ecosystem health. Precise canopy height predictions also improve estimates of biomass and carbon sequestration, essential for understanding forests’ role in climate change mitigation. Knowledge of vegetation height and structure assists in assessing wildfire risk, allowing for the identification of high-fuel-load areas and prioritizing fire management efforts. In contrast, poor-quality canopy height prediction can lead to inaccurate estimates of biomass and carbon sequestration, misguide wildfire risk assessments, and impact decisions on green space allocation, tree planting, and landscape design. It may compromise biodiversity assessments, affect crop management, and influence hydrological models, leading to unreliable assessments of water-related processes. Additionally, it can hinder disaster management, misinform decisions about ecologically valuable areas, and affect the reliability of studies in ecology, climatology, and environmental science.