1. Introduction
Forest ecosystems, as one of the most important ecosystems on Earth, play a critical role in maintaining biodiversity, regulating climate, and conserving soil and water [1,2]. With growing demands for global environmental governance and ecosystem protection, the need for accurate monitoring and management of forest ecosystems has become increasingly urgent [3,4]. Effective forest monitoring requires advanced methodologies capable of handling large, complex datasets and providing accurate classifications.
The Loess Plateau, one of China’s key ecologically vulnerable zones, has historically faced severe environmental challenges, including soil erosion, vegetation degradation, and desertification, owing to long-term unsustainable land use, water scarcity, and harsh terrain and climatic conditions [5,6]. To address these problems, the Chinese government has undertaken several ecological restoration initiatives since the 1990s, most notably large-scale tree planting and afforestation efforts, which have substantially expanded the artificial forest cover on the Loess Plateau. Among these, coniferous and broadleaf forests, mostly artificial, have become the primary forest types in the region. They play a crucial role in wind prevention and sand fixation, improving soil quality, and conserving biodiversity [7,8,9]. In the face of escalating climate change and human activities, rapid and accurate identification and monitoring of these forest types is crucial for ecological restoration and sustainable management on the Loess Plateau.
In recent years, the rapid growth of remote sensing data resources and the swift development of deep learning technologies have unleashed unprecedented potential in the field of environmental monitoring [10,11,12]. In image classification, deep learning enhances the handling of large, complex datasets, feature extraction, and image recognition mainly through data fusion, multiscale feature learning, and transfer learning [13,14]. These methods effectively uncover and integrate the latent value of different remote sensing data sources. Among them, transfer learning stands out as a powerful framework that excels at handling large heterogeneous data and quickly adapting to new environments. It leverages models pre-trained on popular deep learning architectures to accelerate and improve learning for new tasks, even with limited labeled data [15,16,17].
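To make this strategy concrete, the following is a minimal, hypothetical sketch of fine-tuning a pre-trained ResNet on a small labeled dataset. It assumes PyTorch and torchvision (this study does not specify its implementation framework), and the number of target classes is an illustrative placeholder rather than a value from this work.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2  # illustrative assumption (e.g., coniferous vs. broadleaf)

# Load a ResNet-18 pre-trained on a large source dataset (ImageNet weights here).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pre-trained feature extractor so that the limited labels
# only update the newly added classification head.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a task-specific head.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One fine-tuning step on a batch of labeled target-domain images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the new head is trained in this sketch, the pre-trained representations are reused directly, which is what allows adaptation with relatively few labeled samples.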
The Residual Neural Network (ResNet) is one of the most widely used models among deep learning architectures and is particularly adept at image classification tasks [18,19,20,21]. It has been extensively applied in areas such as medical image diagnosis, identification of agricultural crops and pests, and soil property estimation [22,23,24].
In forest monitoring and classification, emerging UAV data (hyperspectral cameras, RGB imagery, oblique photography, and LiDAR) have significantly improved classification accuracy and efficiency owing to their high flexibility and resolution [25,26,27]. Many studies adopt transfer learning strategies, pre-training models on large and diverse datasets (such as ImageNet) and then fine-tuning them on specific small-sample datasets. However, general-purpose image recognition datasets such as ImageNet contain many object categories that are unrelated to specific applications such as forest classification. This mismatch in category distribution reduces the model’s ability to transfer from the source domain to the target domain, thereby affecting its performance. Current research has underutilized data directly related to the target task for pre-training, so investigating how models can better identify specific forest types is critical. Factors such as model depth, image size, and sample quantity also significantly affect performance [28,29,30,31]. In practical applications, these factors must be considered comprehensively to optimize the performance of the ResNet model. Furthermore, owing to the cost of UAV data acquisition, such studies are often limited to small plots. In contrast, medium- to low-resolution remote sensing data (Landsat, Sentinel, and GF-1) have improved the accuracy and efficiency of forest classification using deep learning techniques while enabling large-scale classification [32,33,34,35], but they still fall short of the high-precision classification achieved with UAVs. Therefore, effectively identifying and selecting valuable features from integrated multisource data using deep learning technologies has broad prospects for expanding the scope and efficiency of image classification. Some studies have made initial attempts: the authors of [36] achieved effective crop classification by integrating UAV and satellite imagery through data augmentation, transfer learning, and multimodal fusion techniques; the authors of [37] proposed an object-oriented convolutional neural network classification method combining Sentinel-2, RapidEye, and LiDAR data, which significantly improved classification accuracy in complex forest areas; and the authors of [38] developed a novel mangrove species classification framework using spectral, texture, and polarization information from three spatial image sources. However, there remains room for improvement in using transfer learning to exploit the potential of emerging high-resolution multispectral UAV data for landscape-scale forest classification, while achieving data complementarity and cross-domain enhancement to improve classification accuracy and expand the scope of application.
In response to these challenges, this study selected three counties in the Loess Plateau region with similar forest types (Yongshou, Baishui, and Zhengning) as research sites. The forests of Yongshou County were the primary research object, while Baishui and Zhengning Counties served as supplementary sites to validate the applicability and generalizability of the developed model. Adopting a transfer learning approach, the study uses a ResNet model in conjunction with multisource remote sensing data to establish an effective framework for identifying forest types, and explores the effects of different combination strategies (sample quantity, model depth, and image size) on model performance. The main objectives of this study are: (1) to develop an effective technical framework for rapidly distinguishing forest types by using deep learning technology combined with multisource remote sensing data; and (2) to reveal the impacts of image size, sample quantity, and model depth on the time efficiency and accuracy of the training model. Our research aims to provide a more accurate and efficient technical approach for remotely sensed forest classification, thereby offering stronger technical support for forest resource management and ecological monitoring.
4. Discussion
This study aimed to achieve two primary objectives, namely, to develop an effective technical framework for rapidly distinguishing forest types using deep learning technology combined with multisource remote sensing data, and to reveal the impacts of image size, sample quantity, and model depth on the time efficiency and accuracy of the training model. This research not only offers new perspectives and methods for forest landscape classification at the technical level but also provides effective technical support for forest management and environmental monitoring in the Loess Plateau.
This framework introduces a fine-tuning-based transfer learning strategy that effectively integrates cross-scale information sources (UAV multispectral data and Landsat remote sensing data), which significantly enhances the overall accuracy and efficiency of large-scale forest-type identification in the region. This improvement is likely due to the rigorous adherence to the fine-tuning-based transfer learning strategy, which effectively addresses the issue of insufficient training data for deep neural networks [43,44]. First, knowledge from one or more source tasks is acquired during the pre-training phase; this knowledge is then transferred to the target task during the fine-tuning phase. The rich knowledge acquired in the pre-training phase enables the model to handle the target task effectively with limited samples during the fine-tuning phase [45]. The technical framework proposed in this study consists of three components: the pre-trained model, the fine-tuned model, and the application to a wide range of data awaiting classification. The pre-trained and fine-tuned models, as the core parts of the framework, are especially beneficial for achieving advanced results in image classification when the target tasks in both phases are the same [16]. In recent years, some studies have successfully identified forest tree species and achieved good classification results by adopting advanced deep learning architectures [20,27,46]; other scholars have also recognized forest tree species by applying transfer learning strategies [47,48,49]. While these studies employ techniques such as fine-tuning-based transfer learning, they often overlook the quality and relevance of the source datasets used in the pre-training step. Existing research indicates that if the source dataset differs substantially from the target application scenario, the model’s effectiveness might be limited [16,22]. Our study does not rely solely on the original image datasets used by deep architectural models, but instead pre-trains the model on a large Landsat-labeled dataset. Only after validating the model’s performance does this study use small-scale UAV-labeled data for fine-tuning, taking full advantage of the hierarchical structure of the deep architectural model. Our results also confirm the maturity of this technical framework, which achieves or exceeds a 90% accuracy threshold in all areas tested. This result not only demonstrates the effectiveness of the methods employed, but also underscores their reliability across different terrains and ecological conditions.
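As a purely illustrative sketch of this two-stage strategy, the code below pre-trains a ResNet on a large Landsat-style labeled dataset and then fine-tunes the same weights on a small UAV-style labeled dataset at a lower learning rate. PyTorch, torchvision, the dummy data loaders, and all hyperparameters here are assumptions for illustration, not the configuration used in this study.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

def train(model, loader, epochs, lr):
    """Generic supervised training loop shared by both stages."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()
    return model

# Dummy stand-ins for the real labeled datasets (shapes illustrative only):
# a large "Landsat" set and a small "UAV" set, each with two classes.
landsat_loader = DataLoader(TensorDataset(torch.randn(256, 3, 64, 64),
                                          torch.randint(0, 2, (256,))), batch_size=32)
uav_loader = DataLoader(TensorDataset(torch.randn(32, 3, 64, 64),
                                      torch.randint(0, 2, (32,))), batch_size=8)

# Stage 1: pre-train from scratch on the large Landsat-labeled dataset.
model = models.resnet34(weights=None, num_classes=2)
model = train(model, landsat_loader, epochs=5, lr=1e-3)

# Stage 2: fine-tune the same weights on the small UAV-labeled dataset at a
# lower learning rate, so the pre-trained knowledge is largely preserved.
model = train(model, uav_loader, epochs=3, lr=1e-4)
```

The key design point is that both stages share the same target task (forest-type classification), so the representations learned from the abundant Landsat labels transfer directly to the scarce UAV labels.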
Selecting an image size appropriate to the application scenario and resource constraints, and adjusting the data preprocessing workflow and model structure accordingly, can more effectively balance size and performance [50,51]. The majority of deep convolutional neural networks, particularly those based on the ResNet model, are designed to handle images with widths and heights ranging from tens to hundreds of pixels [52]. This design aims to capture visual features sufficient for effective learning and prediction [53]. Many readily available deep learning models and pre-trained weights are based on standard image sizes (224 × 224 pixels). In this study, to accurately reflect the impact of image size variations, the configuration of the model’s input layer was adjusted while other conditions such as model architecture, training epochs, and learning rate were kept unchanged. The loss curves in Figure 6 and Figure 7 demonstrate that images with dimensions of 1 × 1 or 3 × 3 pixels carry feature information sufficient to meet the model’s effectiveness and performance requirements. This indicates that deep residual network models can achieve satisfactory performance in specific applications even with very small image sizes [54,55]. However, given the limited availability of samples from remote sensing data sources, whether a labeled dataset constructed with appropriate image sizes can enhance the model’s generalization capability and stability remains open to debate. Further research should aim to explore and optimize the integration of multisource remote sensing data using deep learning technologies to improve the spatial resolution of medium- to low-resolution (Landsat) imagery. Although data augmentation and preprocessing techniques can effectively increase the sample quantity and ensure model stability, they cannot completely eliminate the potential negative impact of augmented samples on the model’s generalization ability. Therefore, further improvements are needed to enhance the interpretability of deep residual architecture models. Moreover, attention should be paid to increasing the efficiency and reducing the cost of genuine sample data collection, and more diverse model optimization strategies should be explored to further enhance the generalizability and practicality of the model. Additionally, investigating the application of these technologies on a global scale can provide more comprehensive support for environmental monitoring and forest management. This study has effectively improved the identification accuracy and efficiency of coniferous and broadleaf forests on the Loess Plateau by combining deep learning with multisource remote sensing data, paving a new path for the application of remote sensing technology in forest management and environmental monitoring.
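For patches as small as those discussed above, one common adaptation, shown here as a hypothetical sketch rather than the exact configuration used in this study, is to replace the standard ResNet stem (which assumes 224 × 224 RGB input) with a smaller first convolution and no initial pooling, and to set the number of input channels to the number of spectral bands. The band count and class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def small_patch_resnet(num_bands=4, num_classes=2):
    """Hypothetical ResNet-18 variant for tiny multispectral patches (e.g., 3 x 3)."""
    model = models.resnet18(weights=None, num_classes=num_classes)
    # Replace the 7x7/stride-2 stem with a 3x3/stride-1 convolution so a 3x3
    # patch is not reduced to nothing before reaching the residual blocks.
    model.conv1 = nn.Conv2d(num_bands, 64, kernel_size=3, stride=1,
                            padding=1, bias=False)
    # Drop the initial max pooling for the same reason.
    model.maxpool = nn.Identity()
    return model

model = small_patch_resnet()
patch = torch.randn(8, 4, 3, 3)  # batch of 8 patches, 4 bands, 3 x 3 pixels
logits = model(patch)            # adaptive average pooling handles the tiny size
print(logits.shape)              # torch.Size([8, 2])
```

Because ResNet ends with adaptive average pooling, only the stem needs modification for very small inputs; the residual blocks and classification head remain unchanged.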
Effectively quantifying and identifying the factors that affect the performance of ResNet models is crucial for model interpretability [56,57]. In our study, we selected image size, sample quantity, and model depth as three potential influencing factors and evaluated their impacts on the model through carefully designed experiments. Unlike previous studies [56,58], this research focuses not only on the influence of individual factors but also systematically examines their combined effects. The experiments comprised fifteen different combination strategies (Table 6), whose effects on model performance were tested. To comprehensively assess the model, this study employed a robust evaluation of performance across three dimensions, namely stability, accuracy, and time efficiency, ultimately determining the model’s optimal combination strategy (G9). Our study highlights the substantial impacts of different combination strategies (covering model depth, sample quantity, and image size) on the performance of forest-type classification tasks. It also emphasizes the importance of considering these factors in the design and optimization of pre-trained models. Image size and sample quantity are key factors in enhancing model classification accuracy, while the choice of model depth should be determined flexibly based on the specific requirements of the task and the characteristics of the data, which is consistent with existing studies [59,60,61]. Moreover, these factors substantially affect the time efficiency of model training and inference, which is particularly important in resource-constrained application scenarios. In summary, for optimal performance and efficiency, model design should carefully balance these factors based on the specific task and available resources.
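A simple way to organize such a comparison is a grid over the three factors, timing each run and recording its accuracy. The sketch below outlines that procedure in Python; the factor levels and the train_and_evaluate helper are illustrative placeholders, not the actual settings or routines of this study.

```python
import time
import itertools

# Illustrative factor levels; the levels used in this study differ.
depths = [18, 34, 50]          # ResNet variants
sample_counts = [500, 2000]    # number of labeled training patches
image_sizes = [3, 9, 27]       # patch width/height in pixels

def train_and_evaluate(depth, n_samples, size):
    """Placeholder for the actual training/evaluation routine; returns a dummy
    accuracy here so the loop structure can be run stand-alone."""
    return 0.0

results = []
for depth, n_samples, size in itertools.product(depths, sample_counts, image_sizes):
    start = time.perf_counter()
    accuracy = train_and_evaluate(depth, n_samples, size)
    elapsed = time.perf_counter() - start
    results.append({"depth": depth, "samples": n_samples, "size": size,
                    "accuracy": accuracy, "train_time_s": elapsed})

# Rank strategies by accuracy, breaking ties by training time, so the
# accuracy/efficiency trade-off is visible in a single ordered list.
results.sort(key=lambda r: (-r["accuracy"], r["train_time_s"]))
```

Recording training time alongside accuracy for every combination makes the trade-off between classification performance and computational cost explicit when selecting a final strategy.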
Although this study provides an effective method for classifying forest types on the Loess Plateau, limitations remain in the diversity of the data sources. The limited coverage of UAV data and the spatial resolution constraints of Landsat data could affect the accuracy and representativeness of the results. Additionally, the choices of sample quantity and study area may limit the generalizability of the findings: if the sample size is insufficient or the selected study area does not represent the diversity of the entire Loess Plateau region, the model’s generalization ability may be compromised. Using the ResNet model and transfer learning techniques can improve classification accuracy, but it also increases the model’s complexity and computational cost, which in practical applications might limit its usability and practicality. Although the study evaluated different spectral band combinations, it may not have covered all possible combinations; for certain vegetation types, specific spectral information might be required to achieve optimal classification results. Moreover, the study may not have fully considered the impacts of seasonal and climatic changes on the spectral characteristics of vegetation. Vegetation spectral responses can vary under different seasonal and climatic conditions, thereby influencing classification results. Despite these limitations, the model can be adjusted and optimized through transfer learning, provided that corresponding remote sensing data and sufficient samples are available.
Our research is only a first step towards cross-scale forest monitoring. Future studies may build on this technical foundation using UAV hyperspectral and LiDAR data. By enriching spectral features and spatial information, such studies could further achieve large-scale tree species classification on the Loess Plateau, which is critically important for evaluating vegetation restoration outcomes.