1. Introduction
In recent years, the interpretation of remote sensing images (RSIs) has become a focal point in the realm of Earth observation, propelled by the ongoing and swift progress of remote sensing technology. Furthermore, acquiring very high-resolution RSIs (VHRRSIs) has become progressively more convenient and cost-effective, leading to unprecedented growth in analytical techniques for RSIs across various research directions, including land use classification [
1,
2], target detection [
3,
4], autonomous driving, and three-dimensional reconstruction [
5]. The semantic labeling of VHRRSIs involves semantically annotating all pixels in an image simultaneously. This process is a crucial step in image interpretation. The accuracy of this segmentation directly impacts subsequent processing tasks.
VHRRSIs typically have very fine spatial detail, contain hundreds of thousands of pixels, and depict objects at varying scales and orientations. These images often display rich details, complex textures, strong spatial correlations, and a broad range of categories. The complex imaging principles underlying RSIs render them highly ambiguous and uncertain, significantly increasing the difficulty of semantic labeling. As a result, semantic labeling has become one of the most pivotal yet challenging tasks in computer vision. In addition, the objects of interest in an RSI are often visually small (e.g., cars) and densely distributed. These small objects are more prone to occlusion and misclassification, resulting in a substantial reduction in average segmentation accuracy.
To tackle these challenges, our proposal introduces ScasMNet for the semantic labeling of VHR images, with the aim of improving segmentation precision for small objects in RSI while preserving accuracy for other categories. Our approach comprises the following key components:
(1) We utilize a dual-input fully convolutional network (FCN) [
6] equipped with an encoder–decoder architecture, given that FCN-based networks are particularly adept at handling semantic labeling tasks within a fully supervised learning framework. In the encoding phase, we integrate heterogeneous data by fusing features derived from both spectral channel inputs and Digital Surface Model (DSM) [
7] data, thereby augmenting the complementarity of the data features.
(2) Instead of employing traditional up-sampling following the downsampling operation of the maximum pooling layer, we adopt dilated convolution [
8] with a range of dilation rates. This approach extends the receptive field without a concomitant increase in the number of parameters or computational overhead. By implementing dilated convolutions with varying dilation rates, we facilitate the resampling of contextual information across multiple specialized layers, which allows for the extraction of a more comprehensive spectrum of distinctive features.
(3) Following the extraction of feature maps from the network, we apply dense conditional random fields (DenseCRFs) [
9] to refine object boundaries, thereby enhancing the segmentation accuracy.
The structure of the subsequent sections of this manuscript is as follows:
Section 2 provides a comprehensive review of pertinent literature on semantic labeling techniques for VHRRSI. In
Section 3, we delineate our proposed method in depth, which encompasses the dual-path fully convolutional network architecture designed for the integration of heterogeneous data from DSM and optical imagery, the introduction of dilated convolution for feature extraction at various scales, and the utilization of DenseCRF to refine class boundaries and improve segmentation outcomes.
Section 4 presents the experimental findings and a comparative analysis with other state-of-the-art deep learning methodologies. The manuscript concludes with a summary of the study and final remarks in
Section 5.
2. Related Work
Recently, deep learning has achieved remarkable success within the field of computer vision, leading to a multitude of milestone accomplishments. Semantic labeling, also referred to as semantic segmentation in the computer vision literature, represents a core challenge in image analysis and plays a crucial role in the broader domain of computer vision. As a form of pixel-level classification, semantic labeling aims to assign semantic annotations to each pixel in an image, thereby distinguishing various categories through the delineation of segmented regions, each represented by a unique color. Semantic labeling is a core undertaking in the realm of VHRRSIs processing, and it serves as a critical technology for remote sensing application systems.
A. Single-Modal Semantic Labeling
Convolutional neural networks (CNNs) [
10] constitute one of the most prevalent architectural frameworks for deep learning within the domain of semantic segmentation for remote sensing imagery. Scholars have successfully enhanced the accuracy of semantic segmentation in RSIs by refining the conventional CNN architecture, including strategies such as deepening the network and incorporating residual connections. FCN revolutionized the field by substituting the traditional CNN’s fully connected layers with convolutional layers, thereby facilitating end-to-end pixel-level classification. Koltun et al. [
11] enhanced FCN network performance by incorporating pyramid downsampling and deconvolution layers, albeit with less than optimal label accuracy. Inspired by FCNs, researchers have proposed a variety of improved FCN structures, such as U-Net [
12], DeepLab series [
13,
14], etc., to further enhance segmentation performance. A large number of encoder–decoder-based network structures are used for the RSI semantic labeling task; this type of network extracts image features with an encoder and then reconstructs the feature maps with a decoder to achieve high-precision semantic segmentation, e.g., SegNet [
15], and so on. The SegNet model constructed an encoder–decoder symmetric structure based on the FCN architecture to accomplish end-to-end pixel-level image segmentation, uniquely utilizing the decoder to upscale its lower-resolution input feature map. Chen et al. [
16] expanded filter support and minimized input feature map downsampling for dense labeling. These methods have proven effective in pixel-level classification of RSIs, showing superiority over traditional pixel-level classification approaches that depend on manual feature descriptors. To address the reduced feature resolution caused by convolution and pooling layers in some semantic segmentation methods, PSPNet introduces a spatial pyramid pooling module that incorporates more contextual and multiscale information to reduce the probability of mis-segmentation. RefineNet [
17] is a multipath reinforcement network that leverages all down-sampling process information and achieves high-resolution prediction through remote residual connections. Drawing inspiration from deep networks with stochastic depth, a Dropout-like approach has been proposed to enhance ResNet in DenseNet [
18], significantly boosting its generalization capability [
19]. The integration of Atrous Spatial Pyramid Pooling (ASPP) from the DeepLab series and dense connections from DenseNet in DenseASPP [
20] yields a larger receptive field and denser sampling points, achieving state-of-the-art labeling on Cityscapes. FastFCN [
21], proposed in 2019, improves semantic segmentation by incorporating the JPU (Joint Pyramid Upsampling) module into the semantic segmentation model. The UNet3+ [
22] model is specifically designed for segmenting and labeling buildings in RSI.
More recently, Vision Transformer (ViT) [
23] was proposed as an innovative way to apply the Transformer architecture to computer vision. Transformer-based segmentation models achieve pixel-level prediction by classifying each image patch, such as SETR (Segmentation Transformer) [
24,
25] and TransUNet. Combining the advantages of ViT with those of CNNs improves the segmentation accuracy, which has also inspired many following works [
25,
26,
27,
28]. However, ViT is computationally and memory intensive, which makes it unfriendly to deployment on mobile terminals, especially for high-resolution semantic labeling. A multistage attention ResU-Net [
29] is proposed for semantic segmentation of high-resolution RSIs. Swin-Unet [
30] is proposed for medical image segmentation as a U-Net-like pure transformer, and a Swin Transformer-embedded UNet has been used for RSI semantic labeling.
B. Multimodal Semantic Labeling
Multimodal remote sensing technology can fuse data from multiple sensors, such as optical, LiDAR, thermal infrared, etc., to provide richer and complementary feature information, thus improving the recognition accuracy of ground objects. In recent years, with the development of deep learning, multimodal RSI semantic segmentation has made significant progress. Optical images can provide features such as texture and color, while DSM data from LiDAR represent the height information of ground objects. For roads and buildings with more regular shapes, as well as trees and low vegetation with similar colors and geographic locations, the elevation information from DSM data makes it possible to better distinguish such classes from each other by their differences in height. As acquiring DSM data is no longer difficult, the fusion of optical imagery with DSM data has gained more attention. Both FuseNet [
31] and ResUNet-a [
32] adopt a dual-input deep learning network structure that fuses RGB data and DSM data, allowing the network to extract complementary features and improve segmentation accuracy. vFuseNet [
33] is a similar structure. GSCNN [
34] employs two parallel CNN structures for regular feature extraction and boundary-related information extraction: a regular stream resembling a traditional semantic segmentation model and a shape stream dedicated to acquiring boundary information. Finally, these two streams are fused to generate segmentation results. CMGFNet [
35] proposed a gated fusion module to combine two modalities for building extraction. CIMFNet [
36] designed the cross-layer gate fusion mechanism. ABHNet [
37] explored feature fusion based on attention mechanisms and residual connections. DKDFN [
38] is a domain knowledge-guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. Other similar multimodal networks [
39,
40,
41] for classification of RSIs have obtained better results.
However, because these methods ignore long-range spatial dependencies, they perform poorly in extracting global semantic information. The transformer architecture is capable of capturing global contextual information in images, which is essential for distinguishing targets with similar appearances but different categories (e.g., different types of vegetation). TransFuser [
42] incorporates the attention mechanism of the transformer in the feature extraction layers of different modalities to fuse global contextual information of 3D scenes and integrate it into end-to-end autonomous driving tasks. TransUNet is a network that combines the transformer and U-Net architectures: it utilizes the transformer to extract global contextual information while relying on the encoder–decoder architecture of U-Net to maintain spatial information for accurate segmentation. In 2024, TransUNet [
43] was used for medical image segmentation through the lens of transformer by rethinking the U-Net architecture design. STransFuse [
44] fuses a Swin Transformer and a convolutional neural network for RSI semantic labeling. Similar networks for semantic labeling of RSIs are CMFNet [
45], EDFT [
46], MFTransNet [
47], and FTransUNet [
48]; the last of these provides a robust and effective multimodal fusion backbone for semantic segmentation by integrating both CNN and ViT into one unified fusion framework.
C. Conditional Random Fields for Postprocessing
Energy-based random fields, such as Markov random fields (MRFs) [
49] and conditional random fields (CRFs) [
50], have proven invaluable for extracting background information from natural images and RSIs. In 2001, Lafferty [
51] proposed a CRF model for 1-D sequence data processing based on MRF theory, effectively overcoming several limitations of MRFs. Currently, researchers continue to explore the combination of deep learning models and CRFs. In recent studies, CNNs or FCNs integrated with CRFs have been employed to enhance RSI segmentation accuracy [
52], improve road detection [
53], building detection, and water body detection [
54] efficiency.
D. Semisupervised/Unsupervised Learning Methods
A high-resolution image has hundreds of thousands of pixels, and sometimes even more, which makes dense annotation expensive. Semisupervised and unsupervised learning methods both alleviate the dependence on labeled data during training. Semisupervised learning methods [
55,
56] are able to achieve better performance using consistency regularization and average updates of pseudolabels. Unsupervised learning methods [
57,
58] can accomplish recognition and segmentation tasks without a large amount of labeled data, using techniques such as data preprocessing, feature extraction, clustering, iterative optimization, and postprocessing. The application of unsupervised learning methods in the semantic labeling of RSIs is feasible, especially when labeled data are scarce. However, the disadvantage of unsupervised learning is its lower performance compared with supervised learning methods.
3. Proposed Method
In this section, we aim to offer a comprehensive introduction to ScasMNet. Firstly, we propose a novel dual-path data fusion network that seamlessly integrates optical images and DSM data within an end-to-end fully convolutional network. This approach enhances the effectiveness of semantic labeling for VHRRSIs by fusing heterogeneous features. Furthermore, we discuss the principle of dilated convolution and its benefits for semantic labeling; our method employs multiscale dilated convolutions within a fully convolutional deep network to facilitate semantic information fusion. Lastly, building upon the multiscale semantic information fusion achieved via dilated convolutions, we utilize a dense conditional random field (DenseCRF) [
59] model to establish point-to-potential energy relationships between all pixel pairs within the image, which results in improved refinement and segmentation outcomes.
3.1. Dual-Path Fully Convolution Network
DSM is a ground elevation model encompassing the heights of structures, such as buildings, bridges, and trees on the ground. DSM genuinely depicts the undulations of the ground and finds applications in a broad range of industries. As remote sensing technology advances, obtaining DSM has become more convenient, rapid, and precise. Leveraging the land surface undulation information provided by DSM can significantly enhance the effectiveness of RSI analysis.
Nevertheless, simply stacking the three-channel optical data and the DSM into a four-channel input and feeding it into the network is not an optimal approach. This is due to the distinct information contained in these two data sources. The optical data from IRRG images, obtained from the same sensor, typically encompass appearance information such as object color and texture. In contrast, DSM data, acquired using a different type of sensor, represent the height information of surface undulations. Consequently, our focus is on exploring how to process and fuse these heterogeneous source data to enhance model performance.
Drawing inspiration from the concept of multimodal fusion, we present a technique to enhance the semantic labeling accuracy of VHRRSIs. Our approach employs the FCN framework overall while constructing a multimodal network that incorporates dual-path data input to enhance feature diversity and expand the network’s capabilities. The primary objective of our method is to enhance the accuracy of semantic labeling for RSIs.
Figure 1 illustrates the complete structure of the proposed network.
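To make the dual-path design concrete, the following is a minimal Keras sketch of a two-branch encoder whose IRRG and nDSM features are fused by channel-wise concatenation. The branch depth, filter widths, and the build_branch helper are illustrative assumptions rather than the exact configuration shown in Figure 1 or Table 1.

```python
from tensorflow.keras import layers, Model

def build_branch(inputs, widths=(64, 128, 256)):
    """Hypothetical encoder branch: stacked 3x3 convolutions with max pooling."""
    x = inputs
    for w in widths:
        x = layers.Conv2D(w, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    return x

irrg_in = layers.Input(shape=(256, 256, 3), name="irrg")  # IRRG optical patch
ndsm_in = layers.Input(shape=(256, 256, 1), name="ndsm")  # normalized DSM patch

# Two structurally similar encoder paths, one per modality.
irrg_feat = build_branch(irrg_in)
ndsm_feat = build_branch(ndsm_in)

# Fuse the heterogeneous features by channel-wise concatenation.
fused = layers.Concatenate(axis=-1)([irrg_feat, ndsm_feat])

encoder = Model(inputs=[irrg_in, ndsm_in], outputs=fused, name="dual_path_encoder")
encoder.summary()
```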
3.2. Multiscale Feature Fusion
Dilated convolution, also referred to as atrous convolution, was designed to address image segmentation challenges. Traditional image segmentation algorithms often employ pooling layers and convolutional layers to broaden the receptive field, thereby reducing the size of the feature map. This is followed by upsampling to restore the original image size. However, the process of shrinking and expanding the feature map may result in accuracy losses. In contrast, dilated convolutional operations can increase the receptive field without altering the size of the feature map. This approach is utilized in this study to encompass a wider spectrum of information, thereby expanding the receptive field. Dilated convolution incorporates a hyperparameter known as the “dilation rate”, which determines the distance between values when the convolution kernel processes the data. In our research, we explored various dilation rates (rate = 6, rate = 12, rate = 18, rate = 24) to systematically aggregate multiscale contexts without sacrificing resolution. This approach allowed us to achieve high-precision dense predictions.
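As a concrete illustration, a 3 × 3 kernel with dilation rate r covers an effective window of (2r + 1) × (2r + 1) pixels while keeping only nine weights per channel pair, so rates of 6, 12, 18, and 24 correspond to 13 × 13, 25 × 25, 37 × 37, and 49 × 49 receptive windows. A minimal Keras sketch follows; the channel count is an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 256))  # example feature map

# A 3x3 convolution with dilation rate r spans an effective (2r + 1) x (2r + 1) window
# while keeping only 3x3 weights per channel pair; 'same' padding preserves the size.
for rate in (6, 12, 18, 24):
    y = layers.Conv2D(256, kernel_size=3, dilation_rate=rate, padding="same")(x)
    print(f"rate={rate:2d} -> output {y.shape}, effective kernel {2 * rate + 1}x{2 * rate + 1}")
```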
As depicted in
Figure 2, dilated convolution on the same feature map achieves a larger receptive field compared with basic convolution, resulting in denser data acquisition. A broader receptive field enhances the overall performance of small object recognition and segmentation tasks. Importantly, employing dilated convolution instead of downsampling or upsampling effectively preserves the spatial characteristics of the image without incurring information loss. When network layers demand a larger receptive field but increasing the number or size of convolution kernels is impractical due to limited computational resources, the use of dilated convolution proves to be advantageous.
In this paper, we propose a parallel dilated convolution module with varying dilation rates to expand the receptive field and enhance the network’s ability to extract features. This module, referred to as the multiscale convolutional block, is presented in
Figure 3. The primary operation involves performing feature fusion in the final stage of the encoder, which serves as the input feature for the decoder stage. For the input feature map, we conduct four parallel dilated convolution operations with distinct dilation rates. Subsequently, we fuse the corresponding feature maps of the same size from the encoder and perform convolution operations with a 1 × 1 × 6 kernel size. Ultimately, the four branches of the parallel dilated convolution generate feature maps of identical scales. After fusion, the remaining operations in the decoder stage are executed. A comprehensive overview is provided in
Figure 3.
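A minimal sketch of such a parallel dilated-convolution block is shown below. The filter counts, the use of concatenation for fusion, and the interpretation of the 1 × 1 × 6 operation as a 1 × 1 convolution with six output channels are illustrative assumptions rather than the exact design in Figure 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multiscale_block(x, skip, rates=(6, 12, 18, 24), filters=256, n_classes=6):
    """Parallel dilated convolutions fused with a same-sized encoder map, then a 1x1 conv."""
    branches = [
        layers.Conv2D(filters, 3, dilation_rate=r, padding="same", activation="relu")(x)
        for r in rates
    ]
    # All branches keep the input resolution, so they can be fused directly with the skip map.
    fused = layers.Concatenate(axis=-1)(branches + [skip])
    # 1x1 convolution projecting the fused features (here to six output channels).
    return layers.Conv2D(n_classes, 1, padding="same")(fused)

feat = tf.random.normal((1, 32, 32, 256))   # deepest encoder output
skip = tf.random.normal((1, 32, 32, 256))   # same-resolution encoder feature map
print(multiscale_block(feat, skip).shape)   # (1, 32, 32, 6)
```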
3.3. DenseCRF Model
Our model employs several upsampling operations utilizing deconvolution, which not only resizes the feature map back to its original image dimensions but also leads to feature loss. This, in turn, generates blurred classification target boundaries. To achieve more precise final classification results, we incorporate the DenseCRF model following the presegmentation outcomes. DenseCRF is an enhanced version of CRF that optimizes the deep learning-based classification results by considering the relationships among all pixels in the original image. This approach corrects missegmented regions and provides more detailed segmentation boundaries.
In the DenseCRF model, the Gaussian kernel function of the pixel pair is expressed by Equation (1),

$k(f_i, f_j) = w^{(1)} \exp\left( -\frac{\| p_i - p_j \|^2}{2\theta_\alpha^2} - \frac{\| I_i - I_j \|^2}{2\theta_\beta^2} \right) + w^{(2)} \exp\left( -\frac{\| p_i - p_j \|^2}{2\theta_\gamma^2} \right)$,   (1)

where the appearance term, weighted by $w^{(1)}$, takes into account the shape, texture, and color information of the pixel pairs in the image through the color vectors $I_i$ and $I_j$, and both terms consider the position information of the pixel pairs through the coordinates $p_i$ and $p_j$; $\theta_\alpha$, $\theta_\beta$, and $\theta_\gamma$ are the kernel bandwidths. As can be seen, the function considers both color and position information by incorporating $\| I_i - I_j \|$ and $\| p_i - p_j \|$. It encourages pixels with similar colors and close positions to be assigned the same label, while pixels with greater differences receive different labels. Consequently, the DenseCRF model can segment images along boundaries as accurately as possible, providing a comprehensive description of the relationships between pixels regarding color and position.
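For reference, boundary refinement with a fully connected CRF can be implemented with the widely used pydensecrf package. The sketch below is a generic example; the kernel parameters (sxy, srgb, compat) and the number of inference iterations are illustrative assumptions rather than the values used in this paper.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def densecrf_refine(rgb_image, softmax_probs, n_classes=6, iters=5):
    """Refine network softmax output with a fully connected CRF.

    rgb_image: (H, W, 3) uint8 array; softmax_probs: (n_classes, H, W) class probabilities.
    """
    h, w = rgb_image.shape[:2]
    d = dcrf.DenseCRF2D(w, h, n_classes)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))
    # Smoothness kernel: positions only (encourages locally consistent labels).
    d.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel: positions and colors (keeps labels aligned with image edges).
    d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=np.ascontiguousarray(rgb_image), compat=10)
    q = d.inference(iters)
    return np.argmax(np.array(q), axis=0).reshape(h, w).astype(np.int32)
```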
In our network architecture depicted in
Figure 4, a dual-path fully convolutional network model is employed to fuse IRRG data and DSM data and appropriately cascade them to achieve feature complementarity. The encoder section comprises two input data paths, one for optical channel data and the other for DSM data. Given that the topologies of the two encoder branches are similar, cropping the input images of both branches to the same size and normalizing the DSM data to nDSM enables them to share the same range of values as the optical path images. In the final part of the encoder, a dilated convolution operation employing diverse dilation rates is utilized to acquire feature maps at various scales, with the maps being cascaded at the same scale. The input of the dual-stream data fusion module consists of optical IRRG data and DSM data, referred to as a four-channel image. The feature map thus originates from the two aforementioned branches. The feature map is represented as $F = [F_a, F_b]$, where $a$ and $b$ denote the features learned from the IRRG path and DSM path, respectively. The output of the fusion module can be expressed as Equation (2):

$F_{fuse} = w_a \cdot F_a + w_b \cdot F_b$,   (2)

where $w_a$ and $w_b$ denote the fusion weights from the respective streams. Feature fusion is executed throughout the training process, allowing the learned features and fusion weights to be adjusted and optimized together. Consequently, the learnable fusion weights ensure a more suitable fusion strategy by controlling the contribution of the two data sources to the segmentation target, based on the difference in the characteristics extracted by the two branches of the data stream. In this paper, we utilize the network structure presented in
Table 1 as the foundation for the network fusion. Notably, the dual-stream fusion is achieved by applying dilated convolutions to the second half of the network on top of the dual-input design, thereby enabling parallelism at this stage. The two-stream network incorporates shallow-layer feature information, which is richer than a single-path result. Redundant information is discarded through pooling operations, followed by dilated convolutions after fusion. The employment of dilated convolutions enlarges the receptive field and enhances high-level semantic segmentation outcomes. Furthermore, DenseCRF is incorporated to smooth segmentation edges and enhance segmentation accuracy.
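To illustrate Equation (2), the following is a minimal Keras sketch of a learnable weighted fusion layer. Whether the weights w_a and w_b are scalars or per-channel vectors is not specified in the text, so trainable scalars are an illustrative assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

class WeightedFusion(layers.Layer):
    """Learnable weighted sum of two same-shaped feature maps (sketch of Equation (2))."""

    def build(self, input_shape):
        # One trainable scalar per stream, initialized to equal contributions.
        self.w_a = self.add_weight(name="w_a", shape=(), initializer="ones", trainable=True)
        self.w_b = self.add_weight(name="w_b", shape=(), initializer="ones", trainable=True)

    def call(self, inputs):
        f_a, f_b = inputs  # features from the IRRG path and the DSM path
        return self.w_a * f_a + self.w_b * f_b

f_irrg = tf.random.normal((1, 32, 32, 256))
f_dsm = tf.random.normal((1, 32, 32, 256))
print(WeightedFusion()([f_irrg, f_dsm]).shape)  # (1, 32, 32, 256)
```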
4. Experimental Results
In this section, we implement the proposed method and assess its performance on the two datasets provided by the ISPRS competition, aiming to verify its feasibility and effectiveness. We further compare it with several established deep learning models, including FastFCN, PSPNet, DeepLabv3+, MFTransUNet, and CMFNet. The section is structured as follows: it commences with a succinct overview of the utilized datasets and assessment metrics and subsequently proceeds to a comprehensive presentation of the overall outcomes. Finally, we present the ablation studies and a model complexity analysis.
4.1. Data Description
We conducted experimental assessments using datasets obtained from Vaihingen and Potsdam with a resolution finer than a decimeter. These datasets are state-of-the-art airborne image datasets provided by the ISPRS 2-D semantic labeling challenge, covering true orthophoto tiles of extremely high resolution and their corresponding DSMs generated through dense image-matching techniques. The Vaihingen dataset comprises 33 RSIs of varying dimensions, with an average size of 2494 × 2064. Each image is composed of three bands: Near Infrared (NIR), Red (R), and Green (G), offering a spatial resolution of 9 cm. Sixteen of these images have complete annotations for six primary land cover/land use classes (impervious surfaces, building, low vegetation, tree, car, and clutter/background). Additionally, the associated DSMs generated through dense image matching techniques are included, and they have undergone normalization to nDSMs, as discussed in [
60]. The Potsdam dataset comprises 38 patches of size (6000 × 6000), with a ground sampling distance of 5 cm for both TOP and DSMs. The manually labeled categories and numbering of the Potsdam dataset are consistent with those of the Vaihingen dataset.
In this dataset, “building” and “impervious surface” are relatively easy to identify and segment accurately due to their regular shapes. However, distinguishing between “trees” and “low vegetation” can be challenging, as they share similar colors and are often geographically connected. “Car” segments exhibit the lowest accuracy due to their smaller target size and vulnerability to occlusion. These issues are indeed prevalent in other RSIs as well.
4.2. Evaluation Metrics
To facilitate the assessment of the model’s impact on image segmentation accuracy, it is crucial to establish a unified standard for evaluating the model’s accuracy. Within image semantic segmentation, frequently used performance assessment metrics encompass accuracy, recall, precision, F1, and MeanIoU (mIoU). The calculation formulas for these metrics are provided below.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{Precision} = \frac{TP}{TP + FP}$, $\mathrm{Recall} = \frac{TP}{TP + FN}$, $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, $\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c}$,
where TP (True Positive) signifies that both the actual and predicted labels are positive, indicating a correct (true) prediction. FN (False Negative) represents a negative prediction although the actual label is positive. FP (False Positive) indicates a positive prediction although the actual label is negative, while TN (True Negative) denotes that both the actual and predicted labels are negative. In our experiments, we employ three specific quantitative evaluation metrics: overall accuracy (OA), F1-score, and per-class average pixel-level accuracy, in compliance with dataset guidelines. OA functions as a comprehensive measure of segmentation accuracy, providing an overview of the proportion of accurately classified pixels. Nevertheless, a drawback of OA is its tendency to prioritize classes with a significant number of samples, potentially letting larger classes overshadow the contributions of smaller ones. Conversely, the F1-score is specific to each class and remains unaffected by class size. It represents the harmonic mean of precision and recall.
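For reference, these metrics can be computed from a confusion matrix as in the following sketch; this is a generic implementation, not the exact evaluation script used in the experiments.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, n_classes=6):
    """Compute OA, per-class F1, and mIoU from integer label maps."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true.ravel(), y_pred.ravel()):
        cm[t, p] += 1                      # rows: ground truth, columns: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    oa = tp.sum() / cm.sum()
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1))
    return oa, f1, miou

oa, f1, miou = segmentation_metrics(np.random.randint(0, 6, (256, 256)),
                                    np.random.randint(0, 6, (256, 256)))
```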
4.3. Implementation Details
VHRRSIs typically comprise hundreds of thousands of pixels or more. Feeding such massive images into a deep learning network in a single pass is challenging. Furthermore, the labeled data within a benchmark are often limited, and not all of the datasets in ISPRS are annotated. To address these issues, we partitioned the original image data into a sequence of uniform-sized overlapping patches using a sliding window method. This technique not only enlarged the training set but also enabled the deep learning network to undergo batch training, thus reducing computational demands. Following the aforementioned processing steps, we obtained numerous fixed-size training patches and their corresponding labeled data. Prior to training, we subjected them to random transformations to augment the dataset and increase the randomness of the data. These transformations included rotating the training images by 90°, 180°, and 270°, randomly scaling them, adding noise, and horizontally/vertically flipping them. In our experimental set-up, the initial training images, which include IRRG and nDSM data, are standardized to have a mean of zero and a variance of one. Both the raw images and the labeled ones used for training are divided into a series of patches measuring 256 × 256, with corresponding numbers marked. For the Vaihingen dataset, 10 original images are employed for training, resulting in 40,000 patches in the training set. Similarly, for the Potsdam dataset, 10 original images are used for training, yielding 10,000 patches in the training set. The respective numbers of images for the training and test data used in this study are presented in
Table 2, with a ratio of 3 training patches to 1 validation patch.
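A minimal sketch of the sliding-window cropping and random augmentation described above is given below; the stride (and thus the patch overlap) is an illustrative assumption, since the exact overlap is not stated.

```python
import numpy as np

def extract_patches(image, label, patch=256, stride=128):
    """Cut an image/label pair into overlapping fixed-size patches with a sliding window."""
    pairs = []
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            pairs.append((image[y:y + patch, x:x + patch],
                          label[y:y + patch, x:x + patch]))
    return pairs

def augment(img, lab, rng=np.random.default_rng()):
    """Random 90-degree rotations and flips, applied to image and label alike."""
    k = int(rng.integers(0, 4))            # rotate by 0, 90, 180, or 270 degrees
    img, lab = np.rot90(img, k), np.rot90(lab, k)
    if rng.random() < 0.5:
        img, lab = np.fliplr(img), np.fliplr(lab)
    if rng.random() < 0.5:
        img, lab = np.flipud(img), np.flipud(lab)
    return img, lab
```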
The experimental environment uses Python 3.5, with the network built on the TensorFlow and Keras framework. The hardware platform is an NVIDIA GPU (NVIDIA-SMI 440.44) with 11.91 GiB of memory, employing CUDA 10.2 for accelerated computation. Throughout the experiments, a learning rate of lr = 1 × was found to yield the best results for the fusion network. To prevent premature convergence caused by the network’s deep structure, Batch Normalization (BN) and Rectified Linear Unit (ReLU) layers were added after the convolutional layers; this also mitigated issues such as slowed training speed due to the heavy workload. Simultaneously, baseline semantic labeling networks were compared and tested on the corresponding datasets, including FCN-16s, SegNet, U-Net, ICNet, DeepLabV3+, PSPNet, and others, all based on the TensorFlow and Keras framework. All optimizers used in this study are Adam optimizers, so that the comparison experiments follow a controlled-variable setup.
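For reference, the conv–BN–ReLU unit and the Adam-based training setup described above can be sketched as follows; the loss function is not specified in the text and the exact learning rate is not recoverable, so the defaults below are assumptions.

```python
from tensorflow.keras import layers, models, optimizers

def conv_bn_relu(x, filters, kernel=3):
    """Convolution followed by Batch Normalization and ReLU, the basic unit described above."""
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = layers.Input(shape=(256, 256, 3))
x = conv_bn_relu(inputs, 64)
outputs = layers.Conv2D(6, 1, activation="softmax")(x)   # six land-cover classes
toy_model = models.Model(inputs, outputs)

# Hypothetical training configuration: Adam with default settings and an assumed
# categorical cross-entropy loss for pixel-wise classification.
toy_model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
```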
4.4. Experimental Results and Analysis
In this section, we showcase our experimental outcomes obtained using our proposed method on two datasets sourced from the ISPRS competition. These include numerical and visual results, along with a comparison of our approach with other classical models from previous literature. For the alignment of data in all tables, impervious surfaces are abbreviated with Imp. surf, and low vegetation is represented by Low veg.
4.4.1. Experimental Results on the Vaihingen Dataset
Table 3 presents the results obtained on the Vaihingen dataset, with bold numbers indicating superior performance. The second to sixth rows display the test results of five comparison methods using classical semantic segmentation models, while the last row showcases the outcomes achieved by our proposed method. Excluding the “clutter” category, our approach achieved the best performance across the evaluation metrics, demonstrating excellent performance in the five remaining categories. Notably, the accuracy in classifying the “building” category reached 94.82%, the highest among all classes. This indicates that a majority of the pixels belonging to the “building” class were accurately classified, which is attributed to the relatively simple texture information of this category, making it easier for deep learning models to extract relevant feature details and produce accurate segmentation. In comparison, the “car” class achieved the lowest segmentation accuracy, indicating that densely distributed, small-sized objects are less likely to be precisely segmented. This highlights one of the research challenges in the VHRRSI semantic labeling task.
As shown in
Table 3, it can be seen that the segmentation performance of the multimodal fusion methods is better than that of the single-modal segmentation methods, and our proposed ScasMNet achieved the best results. MFTransUNet is able to recover local information thanks to its powerful encoding capability, making its segmentation performance second only to ours. The computational cost required for each method is discussed in
Section 4.7. The segmentation accuracy is around 90% for both regularly shaped buildings and impervious surfaces. For trees and low vegetation with similar colors and locations in particular, the segmentation performance of the multimodal approaches is substantially improved. For densely distributed small-scale targets, such as cars, the segmentation accuracy of our method reached 86.73%, which is five percentage points higher than that of MFTransUNet.
Since DSM data are single-channel grayscale images, they are not conducive to visual inspection. Therefore, the original DSM images are not shown in the subsequent visualization results.
Experimental results for three scenarios are presented in
Figure 5. Most of the pixels in the first row belong to either the “building” class (labeled in blue) or the “impervious surface” class (labeled in black). It can be seen that our method segments almost all pixels correctly, with few pixels wrongly segmented. Of course, these two classes are also the easiest to segment. However, there are still obvious missegmentations when PSPNet and CMFNet are used. The regions marked by circles are the most obvious examples, although the errors are not limited to those areas.
Most pixels in the second row belong to either the “tree” class (labeled in green) or the “low vegetation” class (labeled in yellow), which are also difficult to segment due to their similar colors and locations. However, our method performed best in segmenting all boundaries. The DSM data are fed into the network to extract complementary features that further distinguish these two classes from each other in terms of height.
The last row contains small-scale target objects (cars), for which segmentation is the most difficult. For this reason, we use multiscale dilated convolution (rate = 1, 6, 12, 18) in our method, with the aim of extracting features of large-scale targets (buildings and impervious surfaces) while not ignoring the presence of small-scale targets.
In conclusion, we can see that the qualitative and quantitative results are consistent.
4.4.2. Experimental Results on the Potsdam Dataset
For the Potsdam dataset, we replicated the comparison experiments by training the same network model using identical training data.
Table 4 displays the performance of six different models, with the last row showcasing the results of our proposed ScasMNet model. Our model evidently shows superior performance compared with the previously mentioned models. The segmentation accuracy of the “Low veg.” class has been improved by at least 3.1%, and the “Tree” class has seen an improvement of at least 1.3%. This demonstrates that DSM provides superior recognition ability for objects with similar spectral information but varying height profiles. Most notably, the segmentation accuracy of the “car” class has been enhanced by nearly 10% compared with the single-modal FastFCN. We can thus infer that employing multiple atrous convolutions with distinct atrous rates contributes to extracting richer features of small target objects. Simultaneously, the complementary DSM has played an indispensable role. Overall, our proposed ScasMNet model yields better performance for semantic segmentation of RSIs compared with the other models considered in our experiments.
The qualitative results for three different scenarios are presented in
Figure 6. Most of the pixels in the first row of patches belong to either impervious surfaces (labeled in black) or buildings (labeled in blue). It can be seen that our proposed method correctly segmented the edges and contours of these two classes with almost no misclassification. As for the other five methods, all of them show some misclassification. Most of the pixels in the second row of patches belong to the “tree” class (labeled in green) or the “low vegetation” class (labeled in yellow). Our proposed ScasMNet performs the best, producing results almost identical to the ground truth. For example, FastFCN and MFTransUNet do not segment the trees in the upper right corner correctly, while the misclassification of CMFNet is also obvious. The third row of patches contains cars, which are densely distributed and small. Our method also correctly segmented these targets, as can be seen from the places marked by circles in each patch. In summary, the qualitative visualization and quantitative results are consistent.
4.5. Effect of the Size of a Patch
To achieve optimal experimental results, we randomly and repetitively divide the training samples into patches of 128 × 128, 256 × 256, and 512 × 512 and send them to the training network for comparative experimental analysis. As evident from the data in
Figure 7 and
Figure 8, when the training samples are cropped to 256 × 256, the highest values of OA, F1, and MIoU are obtained. In contrast, cropping to 128 × 128 or 512 × 512 causes the evaluation metrics to decrease rather than improve. Consequently, in all experiments conducted throughout this paper, we randomly and repetitively crop the original images to a size of 256 × 256.
4.6. Ablation Study for ScasMNet
To demonstrate the effectiveness of the components of ScasMNet, we conduct ablation experiments on the aforementioned two datasets, with the results presented in
Table 5 and
Table 6. These findings demonstrate that when training samples are cropped to 256 × 256 and RGB and nDSM are employed as inputs to ScasMNet utilizing a dual-path data strategy, the segmentation accuracy achieves optimal performance.
As depicted in
Table 5, when employing the identical network architecture with RGB single-path data input and training sample sizes of 128 × 128 or 256 × 256, the values of OA, F1, and MIoU witness minor fluctuations, yet these changes are not particularly significant. For the Potsdam dataset,
Table 6 presents analysis results that are virtually identical to those in
Table 5. By utilizing a dual-path input and 256 × 256 training samples, the segmentation accuracy achieves the highest level throughout the entire experiment, essentially registering a five percentage point increase.
To validate the effectiveness of the DenseCRF module, i.e., the postprocessing operation, we also conducted ablation experiments. In our proposed ScasMNet model, the ablation experiments were performed with 256 × 256 patches as input, with the DenseCRF module removed and added, respectively. The results on the Vaihingen dataset and the Potsdam dataset are shown in
Table 7 and
Table 8, respectively. The symbol “+” represents the addition of a DenseCRF module, and the symbol “-” represents the removal of the corresponding module.
From the results shown in
Table 7 and
Table 8, we can conclude that adding the DenseCRF postprocessing operation effectively improves the segmentation accuracy by at least 1%, as can be seen from the values of F1, OA, and MIoU. To avoid repetition, the detailed textual analysis is not repeated here.
4.7. Model Complexity Analysis
We evaluate the computational complexity of the proposed ScasMNet using the floating point operation count (FLOPs) and the number of model parameters. FLOPs measure the computational cost of a forward pass, whereas the number of parameters reflects the model size. Ideally, an efficient model should have small values for both.
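As a rough reference, the FLOPs of a single convolution layer can be estimated analytically with the common approximation 2 · H_out · W_out · C_in · C_out · k², which ignores bias terms and boundary effects. The sketch below is a generic illustration, not the profiling procedure used for Table 9.

```python
def conv2d_flops(h_out, w_out, c_in, c_out, k):
    """Approximate FLOPs of one 2-D convolution layer, counting a multiply and an add as two FLOPs."""
    return 2 * h_out * w_out * c_in * c_out * k * k

# Example: a 3x3 convolution from 256 to 256 channels on a 64 x 64 output feature map.
print(f"{conv2d_flops(64, 64, 256, 256, 3) / 1e9:.2f} GFLOPs")  # about 4.83 GFLOPs
```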
Table 9 shows the complexity analysis results of all compared methods considered in this paper.
Table 9 indicates that the proposed ScasMNet exhibits lower FLOPs, fewer parameters, and smaller memory occupancy than CMFNet and TransUNet, while also demonstrating better performance than the other methods. Single-modal methods have lower FLOPs and fewer parameters than multimodal methods because the former have only one modal input and thus lower computational complexity.
5. Conclusions
In this paper, a novel Self-Cascaded Multi-Modal and Multi-Scale Fully Convolutional Neural Network (ScasMNet) was proposed for the semantic segmentation of VHRRSIs. Our framework offers three significant advantages. First, the dual-path input framework is employed to facilitate information complementarity between the two data modalities, resulting in richer extracted features. Second, our approach enhances the complementarity of features at different scales between layers through a multiscale feature fusion mechanism, allowing the network to accurately and efficiently extract rich and useful features. Lastly, DenseCRF is applied to the presegmentation results, taking into full consideration the spatial consistency relationships between pixels and thereby improving segmentation accuracy.
Experimental findings indicate that the ScasMNet model design enhances the segmentation accuracy of trees and low vegetation with similar color and geographic location. This is attributed to the fact that elevation information from DSM complementarily augments spatial information from RGB, as evident from ablation experiment results. Furthermore, the incorporation of both the multiscale module and DenseCRF leads to improved segmentation accuracy for other categories, with optimal performance observed for small-sized cars.
However, there are several extensions of this study that can be further explored. In particular, distinguishing trees from low vegetation remains challenging. Therefore, it is of interest to develop new strategies for ground targets with similar colors and irregular boundaries, without degrading the segmentation accuracy of small-scale target objects. In addition, given the high resolution of remote sensing images, it is of great relevance to explore image-based elevation estimation for downstream remote sensing tasks, such as crop identification, planting decision support, plant disease identification, and growth monitoring. Finally, research on incorporating large-scale models, such as the Segment Anything Model (SAM), into the semantic segmentation framework for remote sensing is needed.