1. Introduction
In the current field of environmental protection and ecological research, the accurate identification and classification of forest tree species has become an important and challenging task, with significant implications for ecological conservation, resource management, forestry planning, and biodiversity research. Traditional tree species identification methods rely mainly on manual observation and expert experience, which limits their efficiency and accuracy when dealing with large-scale forest areas and complex ecosystems. With the rapid development of remote sensing and deep learning technologies, the extraction of tree species features from remotely sensed imagery has become a hot topic in remote sensing research. Because remote sensing images offer a large field of view and strong global coverage, they enable more comprehensive and effective tree classification. Deep learning algorithms, represented by convolutional neural networks, show strong potential for remote sensing image classification. Nevertheless, current tree species classification methods still face several challenges due to the diversity and similarity of species.
With the advancement of remote sensing imaging technology and the proliferation of high-resolution satellites, the exploration of satellite data such as Landsat for land cover classification dates back to 1980, when Walsh [1] identified and mapped 12 land cover types, including seven types of coniferous forest. Subsequent research in tree species classification predominantly focused on pixel-level analysis. Damon et al. [2] demonstrated that forest phenology changes could be accurately captured by combining image differencing and cap-index mapping. In recent years, satellite and aerial remote sensing images of higher spatial, spectral, radiometric, and temporal resolution have become available, which also places higher demands on image interpretation. High-resolution remote sensing images provide rich information and data for tree species image processing tasks [3]. Michael et al. [4] fused Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) imagery with airborne LiDAR data to map tree species at the crown level in urban areas, achieving an Overall Accuracy (OA) of 83.4% using canonical discriminant analysis. Zhang et al. [5] employed unmanned aerial vehicles to obtain remote sensing images and combined deep learning with RGB optical imagery for urban tree species classification. Li et al. [6] demonstrated that LiDAR intensity images, combined with pan-sharpened WorldView-2 imagery, improved individual tree species (ITS) classification accuracy without the need for atmospheric correction. These advancements have presented new opportunities for forest tree species identification but have also introduced new challenges, including tree species diversity and similarity, large-scale data processing, and variations caused by lighting conditions and occlusion.
The most challenging aspect lies in the diversity and similarity of forest tree species. Forests encompass a wide array of tree species, each characterized by unique morphological, textural, and color features, yet certain species exhibit visually similar attributes that make them difficult to distinguish in remote sensing images, as depicted in Figure 1. The overall branch and leaf textures, as well as the leaf colors, of these species bear a striking resemblance to one another, making them hard to differentiate with the naked eye. Such intricacy underscores the need for more precise and robust feature representations and classification models in tree species classification tasks.
In recent years, a series of deep learning models have been widely applied to image classification, such as AlexNet [7], GoogLeNet [8], and VGGNet [9]. These models have been improved to varying extents. For example, ResNet [10], proposed by He et al., alleviates the vanishing-gradient problem during backpropagation by adding skip connections. Xie et al. introduced ResNeXt [11], which combines grouped convolutions with ResNet-style merge operations. Huang et al. presented DenseNet [12], which aims to include information from all preceding layers in the output of each block, achieved through multiple dense blocks and a classification layer similar to that of ResNet. Chen et al. proposed Dual Path Networks (DPN) [13], which effectively combine the advantages of ResNet and DenseNet by incorporating both residual and dense blocks in a parallel structure. MobileNetV2 [14] and MobileNetV3 [15] are improved versions of MobileNet, designed for efficient image recognition on mobile devices with limited computational resources. They employ depthwise separable convolutions, which decompose the standard convolution into a depthwise and a pointwise convolution, reducing the parameter count and computation cost; a minimal sketch of this decomposition is given after this paragraph. Zhang et al. introduced ShuffleNet [16], which reduces the parameter count and computational complexity through channel shuffling while enhancing the network's non-linearity; it performs impressively when deployed on resource-limited devices. Vision Transformer (ViT) [17] is an image classification model based on the Transformer architecture, proposed by the Google Brain team in 2020. Although the Transformer was originally designed for natural language processing tasks, ViT demonstrated excellent representation learning capabilities in the image domain: it splits an image into small patches and uses multi-head self-attention to capture relationships within the image and between patches. While ViT performs well on some image classification tasks, it can incur high computational costs for larger images. To address this, Liu et al. introduced the Swin Transformer [18] as an improvement over ViT. The Swin Transformer introduces Swin blocks, a multi-scale, multi-level attention mechanism: Swin blocks leverage local windows to capture information within the image and combine this information across levels, effectively handling large images and achieving impressive performance in image classification and object detection tasks.
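To make the depthwise separable convolution mentioned above concrete, the following PyTorch sketch decomposes a standard convolution into a depthwise and a pointwise stage; the layer arrangement and hyperparameters (3 × 3 kernel, BatchNorm, ReLU6) follow common MobileNet practice but are illustrative rather than taken from the cited papers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution applied per channel, followed by a 1x1
    pointwise convolution that mixes information across channels."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # groups=in_channels gives one independent filter per input channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # The 1x1 pointwise convolution recombines channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

For a 3 × 3 kernel, this decomposition needs roughly 9·C + C·C′ weights instead of 9·C·C′, which is the source of the parameter and computation savings.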
Hu et al. [19] introduced the Squeeze-and-Excitation (SE) mechanism into ResNet, proposing SENet. In the SE module, the output feature map is first compressed by global average pooling and then passed through two fully connected layers to produce a per-channel weight, which is multiplied with the original features of each channel, as in the sketch below.
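The SE reweighting described above can be summarized in a minimal PyTorch sketch; the reduction ratio of 16 follows the original SENet paper, while the remaining details are simplified for illustration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze spatial dimensions with global
    average pooling, excite with two FC layers, then rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight the original channels

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```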
Yao et al. [20] proposed a multimodal model that extends the parallel branches of location-shared ViTs with separable convolutional modules, providing an economical way to exploit spatial and modality-specific channel information; by fusing the token embeddings of each modality through cross-modal attention modules, it significantly improves the discrimination of the classification tokens and achieves higher performance. Hong et al. [21] proposed a mini-batch GCN (miniGCN) that combines CNN and GCN models, training large-scale GCNs in a mini-batch manner and extrapolating to out-of-sample data, which eliminates the need to retrain the network and improves classification accuracy. Li et al. [22] proposed a hyperspectral anomaly detection (HAD) baseline network (LRR-Net) that combines the low-rank representation (LRR) model with deep learning techniques: it solves the LRR model efficiently via an alternating direction method of multipliers (ADMM) optimizer, uses the solution as prior knowledge to guide the parameter optimization of the deep network, and turns the regularization parameters into trainable parameters, reducing the need for manual tuning.
Liu et al. [23] proposed an improved Res-UNet tree species classification model based on a point-based deep neural network. The network segments the Euclidean space of trees in forest-area LiDAR data into multiple overlapping layers, thereby obtaining partial 3D structures of trees, and extracts the overall tree features with convolutional operations that account for the characteristics of each layer. Nezami et al. [24] compared the performance of 3D-CNN models trained with hyperspectral (HS) channels, red-green-blue (RGB) channels, and canopy height models (CHM); the 3D convolution demonstrated excellent classification accuracy. Guo [25] proposed a morphological feature-based double clustering network (DNMF). Mathematical morphology methods are first used to extract morphological features from hyperspectral images, on the basis of which the original hyperspectral images are coarsened and refined. Morphological and spectral feature information is then fed into the DNMF simultaneously to obtain comprehensive evaluation indices and visual images. The advantage of the DNMF method lies in decoupling the spatial and spectral data before fusion.
He et al. [26] proposed the “Spatial Pyramid Pooling” (SPP) fusion strategy and the resulting network architecture, SPP-net, which generates data representations independent of image size and scale; the pyramid fusion is also robust to object deformations. Leveraging these advantages, the SPP network is generally an improvement upon CNN-based image classification algorithms. Wang et al. [27] proposed a weakly supervised fine-grained classification network based on a multi-scale pyramid, replacing the ordinary convolutional kernels in residual networks with pyramid convolutional kernels to enlarge the receptive field and obtain features at different scales; spatial and channel attention were then introduced to acquire more informative feature maps. While the aforementioned models achieve effective forest tree species classification, some of them focus solely on extracting either local or global features, neglecting joint feature extraction. Models with such a single focus may struggle to recognize highly similar forest tree species: concentrating only on local features may cause the model to overlook the overall layout of the forest, while focusing only on global features may neglect individual tree details such as texture and shape. In addition, improvements to these models often overlook the correlation between intermediate- and deep-level feature information.
The main contributions of this paper are as follows:
To further improve the accuracy of forest tree species classification in remote sensing images, this paper proposes a classification method based on the MCMFN network. The residual network ResNet-50 is used as the baseline to extract image features, and experiments are conducted on the Aerial dataset to evaluate the classification performance of the proposed MCMFN network. The results demonstrate that the method effectively enhances the accuracy of forest tree species classification in remote sensing images.
To effectively improve the capability of extracting shallow features and to obtain richer feature information, the 7 × 7 convolution in the ResNet-50 network is replaced with the SMCAC module. The SMCAC module first uses convolutions at different scales together with global average pooling to obtain different receptive fields; point-wise convolutions are then added to incorporate pixel positional information, and the ACmix attention mechanism is introduced to focus on the more informative features.
To exploit the correlation between intermediate- and deep-level feature information and improve classification accuracy, the MSFF module is incorporated after the last residual block of the ResNet-50 network. This module extracts feature information and fuses it with the features obtained through a fully connected layer from the last residual block, yielding the final feature representation.
4. Discussion
To assess the contributions of the proposed SMCAC and MSFF modules to the overall MCMFN model, this section provides a detailed analysis based on the experimental data.
4.1. Analysis of SMCAC Modules
The SMCAC module addresses shallow-layer feature extraction by combining convolutional kernels of different scales in a pyramid structure, followed by pointwise convolution and an attention mechanism. In the original ResNet-50, a single 7 × 7 convolutional layer performs shallow feature extraction, but this has limitations in capturing fine-grained local features. In forest tree species classification, for instance, some individual trees may be small with intricate patterns that are difficult to extract with conventional convolutional kernels. In contrast, the SMCAC module uses 1 × 1, 3 × 3, and 5 × 5 convolutional kernels to capture more precise and detailed local features of tree species, and global average pooling is then applied to complement these with global features, so that local and global information supplement each other.
To better emphasize the extracted features, the ACmix attention mechanism is employed to focus on the most relevant local features within the whole image. The SMCAC module thus enhances the extraction of both local and global features while considering the interconnections among forest tree species, leading to more robust and informative feature representations; a minimal structural sketch is given below. The performance of the SMCAC module in the relevant experiments is compared in Table 6.
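Since the full implementation of SMCAC is not reproduced here, the following hypothetical PyTorch sketch illustrates only the structure described above: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions plus a global average pooling branch, merged by a pointwise convolution. The ACmix attention is abstracted as a pluggable placeholder, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

class SMCACSketch(nn.Module):
    """Hypothetical sketch of the SMCAC idea: a pyramid of 1x1/3x3/5x5
    convolutions plus global average pooling, merged by a pointwise
    convolution; `attention` stands in for the ACmix mechanism."""
    def __init__(self, in_ch=3, branch_ch=16, out_ch=64, attention=None):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.gap = nn.AdaptiveAvgPool2d(1)           # global context branch
        self.gap_proj = nn.Conv2d(in_ch, branch_ch, 1)
        # Pointwise convolution mixes the concatenated branches per pixel.
        self.fuse = nn.Conv2d(4 * branch_ch, out_ch, 1)
        self.attention = attention or nn.Identity()  # placeholder for ACmix

    def forward(self, x):
        h, w = x.shape[-2:]
        # Broadcast the pooled global feature back to the spatial grid.
        g = self.gap_proj(self.gap(x)).expand(-1, -1, h, w)
        y = torch.cat([self.b1(x), self.b3(x), self.b5(x), g], dim=1)
        return self.attention(self.fuse(y))

x = torch.randn(1, 3, 224, 224)
print(SMCACSketch()(x).shape)  # torch.Size([1, 64, 224, 224])
```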
From the experimental data, the original ResNet-50 achieved an Overall Accuracy (OA) of 84.95% on the dataset. When the SMCAC module was added without removing the 7 × 7 convolutional layer, the OA dropped to 81.18%, a decrease of 3.77% compared to the original ResNet-50. However, when the 7 × 7 convolutional layer was removed, the OA improved to 89.33%, an overall increase of 4.38%. Finally, the SMCAC module without the ACmix attention mechanism (and with the 7 × 7 convolutional layer removed) achieved an OA of 88.32%, an improvement of 3.37%.
These results indicate that applying the 7 × 7 convolutional layer after the SMCAC module compromises the effectiveness of the features the module extracts. The 7 × 7 convolutional layer has a large receptive field, which may establish correlations between important and less relevant features, significantly reducing the effectiveness of feature extraction at intermediate and deep layers and, in turn, the overall accuracy. Furthermore, the pyramid structure within the SMCAC module efficiently extracts tree contour features at varying receptive fields, and after processing by the ACmix attention mechanism, the module can focus on the subtle yet significant features of tree species, further improving the OA. Although the module reduces the parameter count by 0.01 M, it also quadruples the computational complexity, leading to longer training times and potentially stricter hardware requirements. Considering these factors, sacrificing some training time for increased accuracy is a trade-off that can be acceptable in some contexts.
4.2. Analysis of MSFF Modules
The MSFF module fuses intermediate- and deep-layer feature information, where different layers carry different degrees of abstraction and semantic information. There are certain correlations between the features of different layers, and fusing them compensates for lost feature information. In the MSFF module, three intermediate and deep layers are selected for fusion, combining the most effective detailed and contextual features from the intermediate layers with the most effective abstract and contextual features from the deep layers, thereby enhancing the feature representation capability; a minimal fusion sketch is given below. The experimental results for the MSFF module are shown in Table 7.
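The fusion idea can be sketched as follows, assuming the last three stages of a ResNet-50 backbone as inputs; the projection width and the use of pooling for spatial alignment are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFFSketch(nn.Module):
    """Hypothetical sketch of the MSFF idea: project feature maps from
    three intermediate/deep stages to a common width, align their spatial
    sizes, and fuse them by element-wise addition."""
    def __init__(self, in_chs=(512, 1024, 2048), fused_ch=256):
        super().__init__()
        # 1x1 projections bring every stage to the same channel count.
        self.proj = nn.ModuleList(nn.Conv2d(c, fused_ch, 1) for c in in_chs)

    def forward(self, feats):
        target = feats[-1].shape[-2:]                # use the deepest map's size
        fused = 0
        for f, p in zip(feats, self.proj):
            fused = fused + F.adaptive_avg_pool2d(p(f), target)  # align and add
        return fused

# Feature maps shaped like the last three stages of a ResNet-50 backbone.
feats = [torch.randn(1, 512, 28, 28),
         torch.randn(1, 1024, 14, 14),
         torch.randn(1, 2048, 7, 7)]
print(MSFFSketch()(feats).shape)  # torch.Size([1, 256, 7, 7])
```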
From the experimental results, the original ResNet-50 achieved an overall accuracy (OA) of 84.95% on the dataset, while incorporating the MSFF module improved the OA to 86.91%, a significant increase of 1.96%. This indicates that in traditional CNN architectures, where feature extraction proceeds from shallow to deep layers, the unique feature information present in each layer is overlooked, even though the features extracted at different layers are interrelated. The MSFF module helps the model better understand the distinct hierarchical features in the input data and extract more comprehensive and robust feature representations.
4.3. General Discussion of the Modules
Through ablation experiments on the two modules, we generated the performance chart depicted in Figure 11, where the x-axis represents FLOPs (floating-point operations), the y-axis represents OA (Overall Accuracy), and the size of each circle reflects the model's parameter count.
In the model presented in this paper, the SMCAC module is employed primarily in the shallow layers to extract features related to the contours of forest tree species. The module consists of three convolutional layers with different kernel sizes (1, 3, and 5) along with a global average pooling operation, forming a pyramid structure, and these features are further refined by the ACmix attention mechanism. The kernel sizes 1, 3, and 5 were selected to capture features from different perspectives; given the trade-off between parameter count and complexity, these three convolutional layers prove to be an optimal choice. However, for images of other resolutions, it may be necessary to adjust the number of convolutional branches or to use dilated convolutions to adapt the feature extraction accordingly, as sketched below.
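As a hedged illustration of that adaptation, a 3 × 3 kernel with dilation 2 covers the same 5 × 5 receptive field as the largest SMCAC branch at a lower parameter cost (the branch widths here are illustrative):

```python
import torch.nn as nn

# A 3x3 kernel with dilation=2 spans a 5x5 receptive field while keeping
# the parameter count of a 3x3 kernel - a drop-in alternative to the 5x5
# branch when higher-resolution inputs call for larger receptive fields.
dilated_branch = nn.Conv2d(3, 16, kernel_size=3, padding=2, dilation=2)
```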
The MSFF module, on the other hand, is designed primarily for finer-grained feature extraction related to forest tree species. It fuses mid-level and deep-level feature maps, each of which offers unique feature information, and combining them through element-wise addition yields a richer set of feature representations. When applying the MSFF module to other models, one may reconsider the number of layers fused or assign a weight coefficient to each layer for proportional feature fusion, as in the sketch below; such adjustments can lead to different effects in different model architectures.
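A minimal weighted variant of that fusion step, with one learnable coefficient per layer (a hypothetical adjustment, not the paper's implementation), might look like this:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse n spatially aligned feature maps with learnable per-layer weights."""
    def __init__(self, n_layers=3):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_layers))  # one coefficient per layer

    def forward(self, feats):
        # Softmax keeps the fusion proportional: the weights sum to 1.
        w = torch.softmax(self.w, dim=0)
        return sum(wi * f for wi, f in zip(w, feats))

feats = [torch.randn(1, 256, 7, 7) for _ in range(3)]
print(WeightedFusion()(feats).shape)  # torch.Size([1, 256, 7, 7])
```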