1. Introduction
Stem volume and biomass are among the most significant productivity and carbon sequestration factors since they are vital for sustainable forest management and climate change mitigation [
1]. According to [
2], plant biomass (above and below ground) is the main conduit for CO
2 removal from the atmosphere, storing more than 80% of the total aboveground carbon [
3,
4].
In recent decades, there has been an increasing demand for accurate and timely vegetation biomass estimation due to the rising consumption of biomass products, especially in managed forests. Above-ground biomass (AGB) is defined as the sum of the dry mass of every tree component standing above the soil level (e.g., stem, leaves, needles, branches, bark), typically expressed as mass at the individual tree level [
5,
6,
7]. Among the different AGB components, the stem biomass is considered to be the most crucial, being the dominant material for timber products and paper production [
8]. Traditionally, biomass estimation has been performed through destructive sampling, providing accurate and precise results, yet it is time-consuming, costly and laborious [
6,
9]. For decades, AGB estimates have mostly been based on allometric models [
10], the development of which requires destructive sampling of a significant amount of trees [
11] and measurement of their structural parameters, such as diameter at breast height (DBH) and/or tree height [
12,
13,
14]. Depending on the stem biomass allometric equation type, the total stem biomass (TSB) and the barkless stem biomass (BSB) are most frequently estimated, since the TSB is necessary for the AGB and carbon pool estimations, while the BSB is essential for logging productivity estimation. Nowadays, a large number of allometric equations has been established for the vast majority of species in different biomes [
15,
16,
17,
18,
19], reducing the need for destructive sampling [
20]. However, the specific biomass estimation method is rather limited in terms of spatial coverage and the need for equation reconstruction at regular temporal intervals.
Remote sensing technologies provide timely, low-cost, and reliable estimates on biomass and other forest attributes, especially over large and/or remote areas [
21]. Optical remote sensing data, such as aerial photography [
22,
23,
24,
25,
26], multispectral satellite imagery [
27,
28,
29,
30,
31,
32,
33], synthetic aperture radar (SAR) [
34,
35,
36,
37,
38,
39], light detection and ranging (respectively) [
40,
41,
42,
43,
44,
45,
46,
47] and multisource approaches [
48,
49,
50,
51,
52,
53], have been widely used for forest biomass and carbon quantification. Multispectral image analysis is the most common approach for large-scale biomass estimation [
21], although the main limitation is the signal saturation over forested areas with high AGB values [
54]. SAR systems have been successfully applied in biomass estimation, due to their canopy penetration capabilities which depend on the wavelength and the canopy characteristics [
55]. However, SAR data also present significant limitations, including saturation, high cost, and complex processing methods, especially over forested areas with complex topography [
56]. On the contrary, Light Detection and Ranging (LiDAR) data usually offer more accurate forest biomass estimates [
57], due to the ability of the emitted pulses to penetrate the forest canopy and, therefore, provide 3D information about all forest vegetation layers [
58].
Several studies have shown that LiDAR data provide accurate and efficient biomass estimates using either plot- or tree-level approaches. The first approach has been examined in different biomes [
9,
42,
44,
59,
60,
61,
62] and includes the implementation of height percentiles and other LiDAR-derived metrics in order to predict forest characteristics at the plot or stand level. However, modeling biomass and forest characteristics using the plot level approach requires a sufficient amount of field plot measurements in order to establish the relationship between the LiDAR-derived features and the reference data [
63].
On the contrary, the tree-level approach requires high pulse densities [
64] and is primarily based on single-tree segmentation, aiming to derive height and intensity metrics for each individual tree [
65]. The effectiveness of this approach is closely related to the accuracy of tree detection and tree crown delineation, which highly depends on the point cloud density and forest structure [
66]. Several algorithms have been proposed for accurate crown segmentation and evaluated in different forest ecosystems [
59,
65,
67,
68,
69], utilizing both point clouds and canopy height models (CHMs). According to [
70], tree crown delineation over broadleaved forests resulted in lower detection rates than the conifers, due to the complex shape and crown structure. Nevertheless, the specific approach provides reliable biomass estimates in mixed stands using species-specific models [
71] and valuable information for forest management purposes, while requiring a lower amount of field work compared to the plot-based method [
72]. Although the plot-level approach is the most widely used approach, tree-level AGB estimation is becoming increasingly popular due to the development of small-footprint and UAV-based LiDAR systems that are capable of providing high-density 3D point clouds.
Regardless of the estimation approach, various regression methods have been applied along with LiDAR data for forest biomass estimation, such as Support Vector Regression (SVR) [
73], Random Forest (RF) regression [
74], cubist regression trees, Linear Mixed-Effects (LME) [
44], Gaussian Process (GP) regression [
75], nonparametric regression [
76] and Artificial Neural Networks (ANNs) [
77]. In particular, [
44] compared the effectiveness of different machine learning (ML) algorithms for biomass estimation using both single-tree and plot-based approaches on LiDAR-derived point clouds, reporting that SVR produced the most accurate biomass models. In addition, the authors stated that manual crown delineation does not necessarily result in more accurate biomass estimates since the relationship between the accuracy of both crown delineation and biomass estimation is rather complex. In [
78], the authors performed a comparative analysis between the tree-level and area-based approach suggesting their combined implementation for reliable AGB estimation in a boreal forest using the nearest neighbor algorithm.
Although BSB is considered to be the most significant parameter for growth models, stem volume, biomass and carbon allocation among tree species [
79], most of the LiDAR-based studies focus on the TSB and AGB estimation. Recently, [
75] presented a comparison between the potential of multispectral single-photon and linear-mode LiDAR systems in plot-based TSB estimation, using GP regression. The results showcased that the linear mode LiDAR systems outperformed the single-photon systems in terms of Root Mean Square Error (RMSE). In [
74], the authors predicted TSB in a boreal forest, using measurements derived from 1476 trees, low-density LiDAR data and two regression algorithms (RF and linear). According to their findings, the estimation accuracy was similar using RF and linear regression in terms of RMSE and the coefficient of determination (R
2), although the RF provided more consistent results over the trials [
74]. The stem volume and TSB of individual pines were also examined using full-waveform [
1] and discrete-return LiDAR data [
80]. As reported by [
1], the full-waveform-derived metrics do not improve the volume and biomass estimates significantly compared to the discrete data, due to the limitations of single-tree delineation. In [
80], the authors employed the crown geometric volume (CGV) method on LiDAR point cloud to estimate the TSB for three different tree density classes. According to the results, the low tree density sites provided the most accurate stem volume estimates (R
2 = 0.68) compared to the others (i.e., R
2 = 0.62 for the medium and R
2 = 0.44 for the dense), since the high tree density leads to large tree segmentation errors.
Recent developments in LiDAR technology have led to commercial multispectral systems, which provide essential information on the vertical distribution of physiological processes [
81]. The intensity of multispectral LiDAR channels contains complementary information for improved land cover and species classification [
82,
83,
84,
85]. Until now, the intensity features have been most widely used for land cover and species classification but few studies have examined their contribution to forest inventory parameters and biomass estimation [
75,
86]. More specifically, [
86] compared the capability of multispectral and conventional LiDAR data for modelling and predicting forest characteristics (e.g., AGB, Ginis coefficient of DBHs, Shannon diversity index, and number of trees per hectare) at the plot level in a boreal forest. The results showcased that the multispectral data improved the AGB estimations (i.e., R
2 = 0.87 for the multispectral and R
2 = 0.72 for the single spectral). In 2020, [
87] compared the suitability of monospectral and multispectral LiDAR for canopy fuel parameter estimations in a boreal forest. According to their findings, the multispectral LiDAR provided improved canopy fuel estimates compared to the monospectral LiDAR. Additionally, multispectral LiDAR data are more capable for individual tree detection compared to the monospectral data due to the spectral differentiation of the vegetation objects [
88].
Although many different tree species have been examined in the literature, few studies have explored the potential of multispectral LiDAR remote sensing in biomass estimation of the different Abies species [
89,
90,
91], and there are still no studies related to the biomass estimation of the Abies borisii-regis. Furthermore, compared to the vast majority of studies on the biomass estimation in plantations and semi-natural regenerated forests, only a handful of studies have been conducted in uneven-aged structured forests [
12,
92,
93,
94,
95,
96]. In order to fill this literature gap, the present study aims to estimate TSB and BSB in an uneven-aged structured Abies borisii-regis forest employing multispectral LiDAR-derived predictive models. More specifically, two allometric equations were built to derive TSB and BSB estimates, respectively, which were used as additional reference data. Next, a set of regression algorithms (i.e., Generalized Linear Models-GLMs, GP, RF, SVR and Extreme Gradient Boosting—XGBoost) was examined using LiDAR-derived height and pulse intensity information in order to investigate their potential in reliable TSB and BSB estimation.
5. Discussion
The present study focuses on the potential of multispectral LiDAR-derived variables alongside with linear and ML regression algorithms for single-tree stem biomass estimation (i.e., TSB and BSB) in an uneven-aged structured forest. In the absence of species-specific allometric equations for biomass estimation, destructive sampling was performed for the construction of two allometric equations (i.e., for TSB and BSB estimation), so that the reference dataset could be enriched without performing additional destructive sampling. Thus, DBH was measured in the field and used as input data in the developed allometric equations for the generation of additional reference TSB and BSB estimates. All reference data were subsequently employed for the tree-level biomass estimation using five different regression algorithms (i.e., GLM, RF, SVR, GP, XGBoost). Both intensity and height metrics were extracted from the LiDAR point cloud and employed as potential independent variables in the predictive models. According to [
92], the intensity–height combined AGB models provided higher R
2 compared to the ones using only the height information, especially in a study area like the Pertouli University Forest, which is characterized by complex structure, high canopy cover and intense topography [
106]. Overall, the results demonstrate that BSB can be reliably estimated using the GLM and the ML algorithms (i.e., RF, SVR, GP and XGBoost) in an uneven-aged structured coniferous forest, while TSB estimations are more reliable using the RF, SVR and GP (with both kernels), with the RF providing the best results in both cases (i.e., BSB and TSB).
In the case of the GLM application, a maximum number of three independent variables (i.e., LiDAR-derived height and intensity metrics) was incorporated in the predictive models in order to avoid potential overfitting. In this study, an exhaustive screening variable selection method and the Small-Sample Corrected Akaike Information Criterion were employed for optimal linear model selection, taking into consideration the MAE, MSE, and rbias statistics. The estimation accuracy of the developed GLM for BSB estimation was similar to the ML models, resulting in the best prediction performance in terms of MAE, MSE, rbias, RMSE and R
2 (i.e., MAE = 121.71 kg, MSE = 35585.08 kg, rbias = 1.89%, RMSE = 167.09 kg, R
2 = 0.72). However, the GLM severely underperformed (MAE = 174.55 kg, MSE = 45549.33 kg and R
2 = 0.44) in the case of the TSB estimation compared to the rest of the examined algorithms (i.e., RF, GP and SVR). Even though the Green channel presents lower sensitivity in canopy returns registration, due to its large beam divergence [
142], the int_std was the most frequently selected variable in both BSB and TSB linear models (
Table 6 and
Table 9). It is noteworthy that the int_std of the Green channel is the variable selected in the best performing linear models, which indicates the significance of the LiDAR intensity information for stem biomass estimation. In general, height metrics were selected in every GLM but, contrary to the intensity metrics, the best performing BSB model consisted of the least frequent height-related variables (i.e., p10 and p70). These two variables are highly correlated, with a Pearson’s correlation coefficient greater than 0.5. Moreover, in the TSB GLM, both the p70 and the b50 were rated as the second and third most frequently selected variables in the best- and worst-performing regression models, respectively (
Table 10). A comparison of the BSB and the TSB GLM accuracy reveals that the former performs significantly better. The low predictive performance of the GLM in the TSB estimations can be attributed to the fact that bark volume and biomass variation do not only depend on the tree species but also on the growth rate, the environmental conditions and the genetic constitutions of trees [
137]. In addition, allometric equations can introduce uncertainty in the estimated biomass results mainly due to measurement errors, transect size, the fraction of the AGB and the characteristics of the study area [
143]. As a result, although the linear models performed well in the BSB predictions, they proved to be inadequate for TSB estimations (i.e., MAE = 173.92 kg, and R
2 = 0.32), and new bark biomass ones need to be developed with the use of a large sample size (e.g., sample size >100).
On the contrary, ML algorithms can reliably predict tree attributes in complex structured forests, which is mostly credited to their better generalization ability compared to the GLM [
144]. They also provide additional valuable information for the variables, such as the feature importance, which gives the possibility to exclude the insignificant variables with minor contribution to the predictions. However, one of the main limitations of ML algorithms is that they are considered as a black box to users, since it is not feasible to examine the internal iterations and perform detailed interpretations of all model components [
74,
145].
In the present work, all the ML models had similar performances in the BSB estimations, with the RF presenting the highest values of the statistical evaluation measures (i.e., MAE, RMSE, and R
2). As opposed to the GLM, all the ML models identified only height-related LiDAR-derived metrics as the most important features for the prediction of BSB, such as p99, p90, p95, and Hmax. With respect to TSB estimations, the RF algorithm resulted in an R
2 of 0.65 with among the lowest MAE and RMSE (i.e., MAE = 152.65 kg and RMSE = 175.76 kg), and identified the Hstd, the p80 and int_ske of both the Green and NIR channels as the most important variables of the regression model. On the contrary, the XGBoost algorithm was found to be prone to overfitting due to the small sample size (i.e., MAE = 236.67 kg, RMSE = 290.89 kg and R
2 = 0.31) [
146]. According to the literature, the RF algorithm has a significant advantage over the other ML algorithms in terms of its accuracy and its ability to perform with small sample sizes [
131]. The RF and GLM-related findings of this work are similar to other studies (e.g., [
44,
77,
146]), although none of them have been conducted in an uneven-aged structured forest.
Although individual tree approaches are becoming increasingly popular due to the rapid development of LiDAR sensors providing high-density point clouds, there is a restricted number of papers for individual tree stem biomass estimation. Consequently, we compared our results with similar reports that exploit either approach, for stem biomass or volume estimation. Despite the complex terrain, the multi-layered forest structure and the denser LiDAR point cloud of [
77], we achieved better results in terms of R
2 in all of the tested algorithms (i.e., RF, SVR, XGBoost and GLM). In 2020, [
147] tested the RF, SVR, and GLM in plantations, resulting in a higher performance compared to ours in terms of R
2, but much higher rbias values as well. In addition, [
60] used linear models with logarithmic transformations for stem volume estimation in a boreal forest, using primarily height metrics and reported superior results compared to the ones of the present study. This variation in the results can be attributed to the significant difference between the study sites, as the one in our work is a naturally regenerated uneven-aged forest with complex topography, contrary to semi-natural regenerated and hilly forest in Finland.
As for the tree segmentation process, its accuracy was of utmost importance in this study for the subsequent reliable stem biomass estimation since the LiDAR metrics (i.e., potential independent variables for the regression models) are calculated for each individually segmented tree. According to [
148], tree segmentation and matching in multi-layered forests is an arduous process resulting in the lowest matching accuracy among the different forest types. More specifically, trees in the lower layers of the forest are difficult to identify and the ability to segment those trees is contextual to the penetration capabilities and pulse characteristics of the LiDAR sensor. Moreover, although multispectral LiDAR sensors provide valuable information on the vertical distribution of physiological processes [
81], the higher the degree of canopy overlapping, the lower the capability of tree detection algorithms to accurately identify them [
149]. Consequently, only trees able to be detected in the LiDAR point cloud (e.g., dominant, co-dominant) were sampled, aiming to achieve the highest matching rate possible. The destructively sampled trees were all identified in the point cloud, while only 63% of the non-destructively sampled ones were correctly identified. Finally, tree segmentation errors caused by complex forest structure, intense topography, and detection algorithm inaccuracies, could be eliminated by employing a single-tree crown normalization process, based on the watershed segmentation algorithm, and segmenting the point cloud into single trees without prior treetop detection. As a result, the tree height was defined as the highest original point (in the height-unnormalized point cloud), overcoming deformations in the tree crown due to the normalization [
120].
Although this research reached its aim, there are some unavoidable limitations. First of all, the sample size is considered small for such a complex structured forest, which has potentially caused additional uncertainty in the predictions [
106]. Furthermore, due to legal restrictions, it was not possible to destructively sample a large number of small trees, which could provide better estimates. In addition, this study primarily focused on the trees that could be detected by the LiDAR sensor (e.g., dominant, co-dominant), excluding all the understory trees, which have a significant role in the total forest biomass and dynamics. As a result, the methodology and models can be transferred in similar biomes with the assumption of estimating the stem biomass mainly of the dominant and co-dominant trees. Nevertheless, the segmentation and identification of the understory trees in multi-layered forests with complex terrain is a laborious task due to the overlapping projections of the overstory and the understory tree, which should be studied further.
Overall, the results of our work demonstrated the ability of LiDAR data to estimate individual-tree stem biomass in a multilayered forest, which is critical for forest inventory management and could potentially replace the laborious field measurements required in traditional stem biomass estimation methods. This study also highlights the importance of the pulse intensity metrics in stem biomass estimation, apart from the standard height-derived ones. Besides, this study is the first contribution in the context of LiDAR remote sensing for biomass estimation in an uneven-aged structured Abies borissi-regis forest with complex topography, and one of the few studies that use primarily LiDAR data and destructive sampling for single-tree stem biomass estimation.