1. Introduction
With the advancement of remote sensing technology and the rise in commercial satellite platforms, the increasing volume of remote sensing images (RSIs) provides a foundation for data-intensive techniques such as deep learning to accomplish the task of RSI interpretation [
1,
2]. Semantic segmentation is a critical RSI interpretation task, which aims to assign unique semantic labels to each pixel in RSIs. It can provide precise semantic annotation and positional information for desired objects, and is thus widely used in urban planning [
3], vegetation cover [
4], disaster assessment [
5], and other fields. However, the haphazard distribution of land cover, drastic grayscale variation, and serious category stacking in RSIs lead to the difficulty of distinguishing relative foreground–background categories and exacerbate the uncertainty of category boundaries [
6,
7]. How to efficiently extract and utilize the complex information in RSIs to accomplish precise segmentation is a challenging issue.
In recent decades, semantic segmentation methods based on convolutional neural networks (CNNs) have made great advances, with deep network structures and complex parameter settings enabling the capture of underlying features and high-level abstractions [
8,
9,
10,
11]. A series of studies on efficient encoder–decoder structures [
12], dilated convolution [
13], residual structures [
14], and diverse attention mechanisms [
15] further improved the performance of CNN-based semantic segmentation methods. Despite their strong performance, these methods are weak at capturing long-range dependencies because of the fixed geometry of the convolution operation, whose receptive field only covers features within the convolution kernel [
16]. In contrast, Vision Transformer (ViT) [
17] and its variants [
18,
19,
20] rely on multi-head self-attention and well-designed structures to transform the semantic segmentation task into a sequence-to-sequence problem, which shows significant advantages in capturing long-range dependencies and global information. More recently, research on hybrid networks that integrate CNNs and Transformers has achieved a balance between local and global information, further improving the segmentation performance [
21,
22].
However, most existing methods focus on extracting and optimizing information in the spatial domain while ignoring information in the frequency domain. As a result, misclassification still occurs frequently, especially in regions obscured by shadows, to which frequency information has proved to be more sensitive [
23]. Therefore, it is necessary to integrate frequency information into the segmentation network. In addition, the fine segmentation of category boundaries remains a challenge. Numerous studies treat feature extraction and boundary guidance as two separate tasks, which keeps feature extraction and boundary information relatively independent, weakens the correlations within the network, and leads to segmentation results that fall short of expectations [
24,
25,
26].
Overall, the challenges of the semantic segmentation task for RSIs can be summarized as follows: (i) Grayscale variation: As shown in
Figure 1a, shadows interfere with the stability of intra-class and inter-class variance, which makes category differentiation difficult. (ii) Category stacking: As shown in
Figure 1b, trees are stacked on top of low vegetation to form a relative foreground–background relationship. These categories with similar spatial representations are stacked together, interfering with the distinction between categories and exacerbating the uncertainty of category boundaries.
To cope with these challenges, a novel dual-domain fusion network is proposed in this paper. Multi-level frequency features are captured to enhance the distinction of features within regions of grayscale variation. Fuzzy modeling is performed in the spatial domain to adaptively constrain the boundary features, ultimately achieving explicit semantic alignment and feature fusion between the dual domains. Specifically, to address the interference caused by grayscale variations, we introduce a multi-level wavelet frequency decomposition module (MWFD). Multi-level high-frequency and low-frequency feature representations are extracted by successive Haar wavelet transforms to enhance the sensitivity to feature representations in regions with unstable grayscale, thus strengthening the differentiation between categories and improving robustness to category-scale variations. To deal with uncertain category boundaries, we design a type-2 fuzzy spatial constraint module (T2FSC). Based on a Gaussian regression model, interval type-2 fuzzy membership functions are constructed to extend fixed constraints to an elastic space, thus effectively constraining the boundary features in the spatial domain and alleviating category boundary uncertainty. Finally, a dual-domain feature fusion (DFF) module is developed to seek consistency between dual-domain features, enhance appropriate correlations, and suppress inappropriate ones. DFF reduces the semantic gap between features in the frequency and spatial domains, effectively aggregates the dual-domain features, and produces more accurate and robust feature representations, which further improves the accuracy of semantic segmentation.
The main contributions of this paper are as follows:
- (1)
A novel dual-domain fusion network is proposed to achieve precise segmentation of RSIs. Crucial semantic features in the spatial domain are retained, multi-level frequency features are introduced, and effective integration of the dual-domain features is realized.
- (2)
In the frequency domain, the proposed MWFD aggregates multi-level frequency features after decomposing the spatial features into high-frequency and low-frequency features, which enhances the discriminative ability for unstable grayscale regions. In the spatial domain, the proposed T2FSC constructs adaptive upper and lower membership functions to achieve higher-order fuzzy modeling with flexible representations that effectively constrain the boundary features.
- (3)
Extensive experiments and visualizations on the Vaihingen, Potsdam, and GID datasets demonstrate the superiority of the proposed method, which achieves an excellent balance between segmentation accuracy and computational overhead.
2. Related Works
In this section, we briefly review the landmark methods and recent advances in the field of RSI semantic segmentation, as well as the applications of fuzzy learning and frequency domain processing in this field. Finally, the contributions and limitations of these works are summarized.
2.1. Semantic Segmentation of RSIs
The advancement of CNN-based semantic segmentation of RSIs has accelerated significantly since the landmark FCN [
27] achieved end-to-end pixel-level prediction. A series of encoder–decoder structures [
12,
28] have been proposed to mitigate the disadvantages of feature loss during downsampling and low resolution of segmentation results. Subsequently, the convolution operation was also greatly improved [
29]. The DeepLab series [
13,
30] proposed ASPP, which is compatible with multi-scale receptive fields without additional parameters and implements a more robust encoder–decoder structure. Breakthroughs in attentional mechanisms have further improved the ability to capture semantic dependencies [
31,
32]. A2FPN [
33] proposes an attention aggregation module based on the feature pyramid network, which enhances multi-scale feature learning through attention-guided feature aggregation. Qi et al. [
34] introduced a deformable ConvNet in the residual module and combined the convolutional block attention module at skip connections to improve the accuracy of RSI farmland segmentation. Subsequently, more specific tasks and strategies derived from RSI semantic segmentation emerged, including small-object segmentation [
35], few-shot RSIs segmentation [
36], semi-supervised RSIs segmentation [
37], and prototype-based segmentation [
38]. In addition, there are methods that can be applied to a variety of tasks. Mask R-CNN [
39] performs both object detection and pixel-level segmentation, generating high-quality object masks, which improves the accuracy and efficiency of segmentation. YOLOv8, a recent model in the Ultralytics YOLO family, supports a wide range of computer vision tasks and produces fine object masks in the segmentation task alongside efficient real-time object detection and instance segmentation. All the above methods provide effective solutions and research directions for RSI semantic segmentation tasks. Nevertheless, due to the structural limitations of convolutional operations, these methods are still essentially limited to modeling local information, and their ability to model global dependencies needs further improvement.
More recently, inspired by the tremendous success of Transformers in the field of natural language processing, ViT and its variants have been proposed. ViT defines a new paradigm for semantic segmentation by converting the task into a sequence-to-sequence problem and explicitly modeling global contextual dependencies through its architecture and multi-head attention [
40]. Subsequent research works have proposed hybrid networks [
41,
42,
43,
44] combining CNNs and Transformers, an architecture that balances the extraction of local features and long-range dependencies. For example, CMTF [
45] combines the Transformer and dilated convolution to fully integrate features at different scales, enhancing the ability to handle object-scale variations. Li et al. [
46] proposed a dual-branch structure encoder with a local–global interaction module combined with a geometric prior-generation module to comprehensively capture the context and achieve lossless decoding and deterministic inference, which further improves the accuracy of semantic segmentation. Hybrid networks that integrate CNNs and Transformers have become the current mainstream, and extensive research and experiments have demonstrated the superiority of this architecture.
The method proposed in this paper is essentially a hybrid network based on CNN and Transformer, specifically optimized for the distinctive challenges in the RSI segmentation task, with a balance between efficiency and effectiveness.
2.2. Frequency Domain Processing in Remote Sensing
Frequency information has received increasing attention in image processing [
47], and it offers significant advantages for processing RSIs [
48]. Textures and patterns that are difficult to observe in the spatial domain can be effectively separated by analyzing frequency components, enabling accurate feature extraction. These features serve as effective complements and are essential for learning the complex and disordered land covers in RSIs [
49,
50].
As a simple and effective frequency signal processing method, wavelet transform is widely used in the fields of data compression, signal processing, and image processing [
51,
52]. It inherits and develops the principle of short-time Fourier transform, overcomes the limitation of fixed window size, and has excellent time–frequency localization; thus, it is more sensitive to local features, boundaries, and textures in images. The most commonly used is the Haar wavelet transform, famous for its simplicity and efficient binary operations, avoiding the complexity of multiplication operations. In recent years, many researchers have begun to explore methods to integrate the wavelet transform with CNNs [
53,
54]. One strategy is to utilize wavelet transforms instead of pooling layers to mitigate the loss of information from downsampling. These studies bring additional parameters while capturing frequency information, and the improvement in the results is not obvious [
55]. Another strategy is to decompose the input features into feature components by wavelet transform and then incorporate these sub-features directly into the CNNs as additional components [
56,
57]. However, this crude concatenation may lead to insufficient information aggregation, thus affecting the final performance. Therefore, how to integrate dual-domain information effectively and efficiently remains a key issue.
2.3. Fuzzy Learning in Remote Sensing
Influenced by the imaging conditions of remote sensing and the complexity of land cover, RSIs can be affected by phenomena such as noise interference and mixed pixels, which notably increase the uncertainty of RSIs [
58,
59]. Fuzzy learning proposes the concept of membership function, which extends the traditional rough binary judgment to the spatial range to effectively deal with this uncertainty problem [
60]. In the field of remote sensing, fuzzy learning has made initial progress [
61,
62]. FNNN [
63] designed a fuzzy neighborhood module to establish fuzzy relationships between the center pixel and its neighboring pixels, effectively smoothing noise and addressing inter-pixel uncertainty. MBFNet [
64] proposed a fuzzy-driven strategy that models pixel-level uncertainty with a two-dimensional Gaussian fuzzy function, which helps to overcome the influence of difficult pixels. PFMNet [
65] proposed a fuzzy-aware module that perceives boundary pixels via the difference between neighboring pixels and the center pixel, further improving segmentation accuracy. Despite the achievements of these fuzzy modeling approaches in handling pixel classification uncertainty, a fixed, single membership representation cannot fully account for intra-class variation and may negatively affect semantic modeling. To alleviate this limitation and describe uncertainty more accurately, researchers have proposed type-2 fuzzy modeling [
66], especially based on interval type-2 fuzzy modeling [
67,
68,
69]. Wang et al. [
70] constructed an interval type-2 fuzzy membership function based on a Gaussian regression model to effectively mitigate the noise interference in land cover classification.
Existing fuzzy models demonstrate the advantages of fuzzy learning in dealing with uncertainty, but further consideration of the adaptive optimization of feature space correlation and membership functions, as well as effective integration with deep learning methods, is needed.
3. Methodology
In this section, we provide an overview of the overall architecture of the proposed dual-domain fusion network. We then detail the key components of our method and the loss function employed.
3.1. Overview
The architecture of the proposed dual-domain fusion network is shown in
Figure 2. Given the original image, it is first fed into the backbone for spatial feature extraction. Then, the multi-stage feature maps $F_i$ produced by the backbone are fused to generate the feature map $F$, expressed as follows:

$F = \mathrm{Concat}\big(\mathrm{Up}(\mathrm{Conv}(F_i))\big),$

where $\mathrm{Conv}(\cdot)$ denotes the convolution operation that unifies the number of channels of the feature maps $F_i$, $\mathrm{Up}(\cdot)$ denotes the bilinear interpolation upsampling operation that keeps the width and height of the feature maps consistent, and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation that integrates the feature map $F$. This fusion effectively combines shallow features and high-level semantic features to provide stable fundamental features. Subsequently, $F$ is utilized as the input to MWFD and T2FSC, generating three feature maps in the frequency and spatial domains, respectively: the high-frequency feature $F_{H}$, the low-frequency feature $F_{L}$, and the fuzzy spatial constraint feature $F_{T}$.
Then, DFF matches the dual-domain features for semantic alignment and performs Transformer-based feature extraction and fusion. Finally, the fused features are fed into the prediction head. The network adopts a concise architecture that effectively utilizes and aggregates frequency and spatial features to improve segmentation performance and robustness.
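To make this fusion step concrete, the following is a minimal PyTorch sketch of the multi-scale feature fusion described above. The module name, the four ResNet50 stage channel counts, the 1 × 1 kernel, and the output channel width are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusion(nn.Module):
    """Sketch of the backbone feature fusion F = Concat(Up(Conv(F_i)))."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=64):
        super().__init__()
        # One convolution per backbone stage to unify the channel dimension.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats: feature maps from shallow to deep, e.g. the four ResNet50 stages.
        target_size = feats[0].shape[-2:]  # unify spatial size to the shallowest map
        unified = [
            F.interpolate(conv(f), size=target_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.reduce, feats)
        ]
        return torch.cat(unified, dim=1)   # fused feature map F


if __name__ == "__main__":
    feats = [torch.randn(1, c, 128 // 2 ** i, 128 // 2 ** i)
             for i, c in enumerate((256, 512, 1024, 2048))]
    print(MultiScaleFusion()(feats).shape)  # torch.Size([1, 256, 128, 128])
```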
3.2. Multi-Level Wavelet Frequency Decomposition
The wavelet frequency decomposition (WFD) can effectively capture and separate the image signals in different frequency bands with low computational overhead by connecting high-pass and low-pass filters in series. As shown in
Figure 3, each WFD decomposes the feature into three high-frequency sub-features and one low-frequency sub-feature with halved resolution; thus, the MWFD naturally expresses frequency information at various scales. The MWFD contains three levels of WFD to effectively extract and integrate multi-level frequency features. Attending to high- and low-frequency texture features significantly enhances the discrimination of categories within grayscale-varied regions and improves adaptability to scale variations.
Specifically, at level $k$, the Haar wavelet decomposes the input into four sub-components: a low-frequency component $F_{LL}^{k}$ and three high-frequency components $F_{LH}^{k}$, $F_{HL}^{k}$, and $F_{HH}^{k}$, obtained by applying the Haar low-pass and high-pass filters along the horizontal and vertical directions. The low-frequency component $F_{LL}^{k}$
is utilized as the input to the next level of WFD. After three levels of WFD, 12 multi-level sub-components containing abundant spatial and frequency features are obtained. To fully utilize these components, the three high-frequency components at each scale are first merged, and nonlinear operations are then applied to reduce the channel dimension and perform feature mapping, yielding high-frequency features with less redundant information and noise:

$F_{H}^{k} = \mathrm{BN}\big(\mathrm{Conv}(\mathrm{Concat}(F_{LH}^{k}, F_{HL}^{k}, F_{HH}^{k}))\big),$

where $F_{H}^{k}$ and $F_{L}^{k}$ denote the high-frequency and low-frequency features generated at the $k$-th level ($k$ = 1, 2, 3), respectively, with the low-frequency component $F_{LL}^{k}$ serving as $F_{L}^{k}$ at that level; $\mathrm{BN}(\cdot)$ denotes batch normalization, and $\mathrm{Conv}(\cdot)$ denotes the convolutional operation. In this way, the 12 sub-components are merged into 3 high-frequency components $F_{H}^{k}$ and 3 low-frequency components $F_{L}^{k}$. Finally, the high-frequency feature $F_{H}$ and the low-frequency feature $F_{L}$ are obtained by unifying the resolutions of the multi-level components:

$F_{H} = \mathrm{Concat}\big(\mathrm{Up}(F_{H}^{1}), \mathrm{Up}(F_{H}^{2}), \mathrm{Up}(F_{H}^{3})\big), \quad F_{L} = \mathrm{Concat}\big(\mathrm{Up}(F_{L}^{1}), \mathrm{Up}(F_{L}^{2}), \mathrm{Up}(F_{L}^{3})\big),$

where $\mathrm{Up}(\cdot)$ denotes the bilinear interpolation upsampling operation used to unify the width and height of the multi-level components. MWFD thus decomposes the spatial features into a high-frequency feature $F_{H}$ carrying local details and a low-frequency feature $F_{L}$ carrying global information, which provides stable frequency features for the network and effectively mitigates the disadvantages caused by grayscale-variation regions in RSIs.
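As a concrete reference, the following PyTorch sketch implements one plausible form of the multi-level Haar decomposition described above. The filter normalization, the omission of the per-level convolution and batch normalization, and the upsampling target are assumptions for illustration and may differ from the paper's implementation.

```python
import torch
import torch.nn.functional as F


def haar_wfd(x):
    """One level of Haar WFD using orthonormal 2x2 filters.

    Returns the low-frequency component LL and the high-frequency
    components LH, HL, HH, each with halved spatial resolution.
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2
    lh = (-a - b + c + d) / 2
    hl = (-a + b - c + d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh


def multi_level_wfd(feat, levels=3):
    """Sketch of MWFD: successive WFDs on the low-frequency branch, per-level
    merging of the three high-frequency components (the Conv/BN mapping is
    omitted here), and resolution unification by bilinear upsampling."""
    size = feat.shape[-2:]
    highs, lows = [], []
    x = feat
    for _ in range(levels):
        ll, lh, hl, hh = haar_wfd(x)
        high_k = torch.cat([lh, hl, hh], dim=1)  # merge high-frequency parts
        highs.append(F.interpolate(high_k, size=size, mode="bilinear", align_corners=False))
        lows.append(F.interpolate(ll, size=size, mode="bilinear", align_corners=False))
        x = ll                                   # feed LL into the next level
    return torch.cat(highs, dim=1), torch.cat(lows, dim=1)  # F_H, F_L
```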
3.3. Type-2 Fuzzy Spatial Constraint
To address the negative impact of category boundary uncertainty on the segmentation task, T2FSC is proposed to adaptively constrain boundary features.
Previous studies have demonstrated that fuzzy learning can smooth boundary features and make the network easier to discriminate [
61]. Current studies are generally based on type-1 fuzzy models, such as the improved Gaussian fuzzy module, which uniquely determines the degree of membership based on the positions of the neighboring pixels:

$\mu_{i,j} = \exp\!\left(-\dfrac{(x_{j} - a\,m)^{2}}{2\sigma^{2}}\right), \quad j \in \#K_{i},$

where $\mu$ denotes the principal membership, $i$ denotes the pixel index, $a$ is the model coefficient of the current feature space, and $x_{j}$ represents the pixel value. $m$ and $\sigma$ denote the mean and standard deviation of all pixel values, respectively. $K_{i}$ is the set of 8 neighborhood pixels in the $3\times3$ kernel centered on the $i$-th pixel, $\#$ denotes the operator for obtaining the pixels in $K_{i}$, and $j$ denotes the pixel index in the kernel. The membership function is uniquely determined by fixing the weights through the positional relationship between the neighborhood pixels and the center pixel in the kernel. This strategy smooths the boundary features, but the pixel distribution in RSIs does not exactly obey a single Gaussian distribution.
Based on this, we introduce the upper and lower membership functions $\bar{\mu}$ and $\underline{\mu}$, as shown in Figure 4. The membership space is extended so that flexible feature constraints can be realized instead of being constrained by a unique fixed membership function. The upper membership function $\bar{\mu}$ can be expressed as

$\bar{\mu}_{i,j} = \begin{cases} \exp\!\left(-\dfrac{(x_{j} - a m_{1})^{2}}{2\sigma^{2}}\right), & x_{j} < a m_{1} \\ 1, & a m_{1} \le x_{j} \le a m_{2} \\ \exp\!\left(-\dfrac{(x_{j} - a m_{2})^{2}}{2\sigma^{2}}\right), & x_{j} > a m_{2} \end{cases} \quad (7)$

In Equation (7), the mean value $m$ is changed to the interval $[m_{1}, m_{2}]$, where $m_{1}$ and $m_{2}$ denote the mean values of the left and right boundaries of the interval, respectively. Similarly, the lower membership function $\underline{\mu}$ can be described as

$\underline{\mu}_{i,j} = \begin{cases} \exp\!\left(-\dfrac{(x_{j} - a m_{2})^{2}}{2\sigma^{2}}\right), & x_{j} \le a\,\dfrac{m_{1}+m_{2}}{2} \\ \exp\!\left(-\dfrac{(x_{j} - a m_{1})^{2}}{2\sigma^{2}}\right), & x_{j} > a\,\dfrac{m_{1}+m_{2}}{2} \end{cases} \quad (8)$

In Equations (7) and (8), $m_{1}$ and $m_{2}$ are calculated according to Equations (9) and (10):

$m_{1} = m - \delta, \quad (9)$

$m_{2} = m + \delta, \quad (10)$

where $\delta$ is the interval adjustment factor for the mean deviation. Assuming that each membership function has the same probability of appearing in the interval, the probability that a Gaussian membership function falls within $[m - 3\sigma, m + 3\sigma]$ is 99.7%; $\delta$ is therefore bounded accordingly to limit the uncertainty variation of the fuzzy model. The lower and upper membership functions change dynamically and further produce dynamic intervals to realize the feature constraints in the spatial domain. The computed weight matrix is dot-multiplied with the kernel matrix and summed element by element to determine the feature representation of the center pixel $P$:

$P_{i} = \sum_{j \in \#K_{i}} w_{i,j}\, x_{j},$

where $w_{i,j}$ denotes the weight derived from $\bar{\mu}_{i,j}$ and $\underline{\mu}_{i,j}$.
T2FSC traverses the entire feature map, automatically matching $\bar{\mu}$ and $\underline{\mu}$ for different features. The constraints on boundary and noise pixels are more pronounced, which allows the network to distinguish them better.
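To illustrate the idea, the following NumPy sketch computes upper and lower Gaussian memberships over an uncertain mean interval [m − δ, m + δ] and uses them to weight a 3 × 3 neighborhood. The membership forms follow standard interval type-2 Gaussian fuzzy sets with an uncertain mean, and the averaging used to combine the two memberships is an assumption, not the paper's exact T2FSC formulation.

```python
import numpy as np


def interval_type2_weights(patch, m, sigma, delta, a=1.0):
    """Upper/lower Gaussian memberships with an uncertain mean in [m - delta, m + delta].

    patch : 3x3 neighborhood of pixel values (center included).
    Returns the constrained value of the center pixel as a membership-weighted sum.
    """
    m1, m2 = a * (m - delta), a * (m + delta)

    def gauss(x, mu):
        return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

    # Upper membership: 1 inside the mean interval, Gaussian tails outside.
    upper = np.where(patch < m1, gauss(patch, m1),
                     np.where(patch > m2, gauss(patch, m2), 1.0))
    # Lower membership: the smaller of the two boundary Gaussians.
    lower = np.minimum(gauss(patch, m1), gauss(patch, m2))

    weights = 0.5 * (upper + lower)        # simple combination (assumption)
    weights = weights / weights.sum()      # normalize over the kernel
    return float((weights * patch).sum())  # constrained center-pixel value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.normal(0.5, 0.1, size=(3, 3))
    print(interval_type2_weights(patch, m=0.5, sigma=0.1, delta=0.05))
```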
3.4. Dual-Domain Feature Fusion
There are semantic gaps between features capturing different aspects and attributes in the frequency and spatial domains; thus, the primary task is to perform semantic alignment to ensure consistency and complementarity of feature representations. DFF is designed to achieve the semantic alignment of dual-domain features and to fuse them efficiently.
In DFF, the features $\{F, F_{H}, F_{L}, F_{T}\}$ are divided into two groups, $\{F, F_{H}\}$ and $\{F_{L}, F_{T}\}$, for semantic alignment and feature extraction. The high-frequency feature $F_{H}$ has excellent local frequency characteristics that can pinpoint transient changes. This is similar to the local feature $F$ extracted from the backbone, and $F$ contains high-resolution low-level semantic features; the two complement each other in the frequency and spatial domains. We therefore obtain the first group feature by aggregating $F$ with $F_{H}$, where the maximum pooling operation $\mathrm{MaxPool}(\cdot)$ is applied to $F$ to match their resolutions. The low-frequency feature $F_{L}$ retains the primary energy and global information of the feature map, provides an overall view, and contains less noise. However, $F_{L}$ relatively lacks detailed information. To compensate for this, we conduct semantic alignment and feature integration of $F_{L}$ together with $F_{T}$, which possesses comprehensive boundary details.
Inspired by the structure of a multi-head self-attention (MHSA) mechanism, DFF is implemented based on the Transformer encoder–decoder structure, in which the key component MHSA can be represented as
$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V},$

where $W^{Q}$, $W^{K}$, and $W^{V}$ denote the learnable weight matrices that map the input feature map $X$ into the $d$-dimensional space. Then, the correlation between $Q$ and $K$ is passed through the softmax activation function and used as the weights of $V$ to generate the attention map:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{\top}}{\sqrt{d}}\right)V. \quad (13)$

Finally, the MHSA concatenates the outputs from each head:

$\mathrm{MHSA} = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})\,W^{O},$

where $W^{O}$ is the output weight matrix of the MHSA. For the first group feature, we apply the MHSA twice and transform the output of the MHSA with the FFN through a series of linear transformations and nonlinear activations. For the feature maps $F_{L}$ and $F_{T}$, semantic alignment between the spatial and frequency domains is accomplished through multi-head cross-attention (MHCA). Following the generation of the attention map in Equation (13), the $Q$ values generated by $F_{L}$ and $F_{T}$ are used interactively, each serving as the query for the other branch, to generate the attention maps separately:

$\mathrm{softmax}\!\left(\dfrac{Q_{T}K_{L}^{\top}}{\sqrt{d}}\right)V_{L}, \quad \mathrm{softmax}\!\left(\dfrac{Q_{L}K_{T}^{\top}}{\sqrt{d}}\right)V_{T},$

where the subscripts $L$ and $T$ indicate that the corresponding matrices are generated from $F_{L}$ and $F_{T}$, respectively. The subsequent process is consistent with that of the MHSA branch. DFF fully considers the similarity and complementarity between the dual-domain feature maps and effectively achieves semantic alignment and feature aggregation of the dual-domain features.
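The cross-attention exchange between the two branches can be sketched in PyTorch as follows; the head count, embedding dimension, and the way the aligned outputs are merged back into DFF are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DualBranchCrossAttention(nn.Module):
    """Sketch of the MHCA step: each branch queries the other branch's keys/values."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn_l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_low, f_fuzzy):
        # f_low, f_fuzzy: (batch, tokens, dim) sequences flattened from F_L and F_T.
        # Q from the fuzzy branch attends to K, V from the low-frequency branch, and vice versa.
        aligned_t, _ = self.attn_l(query=f_fuzzy, key=f_low, value=f_low)
        aligned_l, _ = self.attn_t(query=f_low, key=f_fuzzy, value=f_fuzzy)
        return aligned_l, aligned_t


if __name__ == "__main__":
    f_low = torch.randn(2, 64, 256)
    f_fuzzy = torch.randn(2, 64, 256)
    out_l, out_t = DualBranchCrossAttention()(f_low, f_fuzzy)
    print(out_l.shape, out_t.shape)  # torch.Size([2, 64, 256]) twice
```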
3.5. Loss Function
In this study, we employ the cross-entropy loss function to supervise training, which performs well in multi-category segmentation tasks and effectively measures the discrepancy between predictions and labels:

$\mathcal{L}_{ce} = -\dfrac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{n,k}\,\log\big(p_{n,k}\big),$

where $N$ and $K$ denote the number of pixels and categories, respectively. $y_{n,k}$ is a sign function: if the real class of pixel $n$ is $k$, $y_{n,k} = 1$; otherwise, $y_{n,k} = 0$. $p_{n,k}$ denotes the predicted probability that pixel $n$ belongs to category $k$.
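For reference, a minimal PyTorch computation of this pixel-wise loss is shown below; the category count and tensor shapes are placeholders.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()             # averages the per-pixel loss over all pixels
logits = torch.randn(2, 6, 512, 512, requires_grad=True)  # (batch, K categories, H, W)
labels = torch.randint(0, 6, (2, 512, 512))                # ground-truth class index per pixel
loss = criterion(logits, labels)
loss.backward()
```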
4. Experiment
In this section, the datasets and evaluation metrics are first introduced, followed by rigorous ablation studies of our method, and finally, a comprehensive comparison with state-of-the-art methods is conducted.
4.1. Datasets
4.1.1. Vaihingen Dataset
The Vaihingen dataset is a well-recognized and widely used high-resolution RSI semantic segmentation dataset. It contains 33 orthorectified images of Vaihingen acquired by a UAV platform, each with a ground sampling distance of 9 cm and an average size of 2494 × 2064 pixels. The dataset covers six categories: impervious surface (imp. surf.), building, low vegetation (low veg.), tree, car, and background. For this dataset, we used 23 images for training and 10 images for testing. We cropped the images into blocks of size 512 × 512 pixels for training.
4.1.2. Potsdam Dataset
The Potsdam dataset contains 38 orthophotos with a ground sampling distance of 5 cm, and the size of each image is 6000 × 6000 pixels, acquired by a UAV platform. The categories in this dataset are the same as in the Vaihingen dataset. In this dataset, 25 images are used for training and 13 images are used for testing. We cropped the images into blocks of size 512 × 512 pixels for training. In quantitative evaluation, we excluded the “background” category.
4.1.3. GID Dataset
The GID dataset contains 150 GaoFen-2 satellite images with a size of 6800 × 7200 pixels. It covers six land cover categories: built-up, farmland, forest, meadow, waters, and other. This dataset has the advantages of large coverage, wide distribution, and high spatial resolution. From this dataset, we selected representative scenes: 28 images for training and 11 images for testing. We cropped the images into blocks of size 512 × 512 pixels for training. In the quantitative evaluation, we excluded the “other” category.
4.2. Implementation Details
All the experiments were implemented on a single RTX 3090 GPU. We used 512 × 512 images as input, with a fixed batch size of 8 and 200 training epochs. No data augmentation techniques were applied during training. The initial learning rate was set to 0.0001 and dynamically adjusted at each iteration. The Adam optimization algorithm was used with a momentum of 0.9. To ensure a fair comparison, the backbone of all methods was uniformly ResNet50, and no pre-trained models were used.
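A minimal training configuration matching these settings might look as follows; the stand-in network, the dummy data, and the polynomial learning-rate decay are illustrative assumptions, since the paper only states that the learning rate is recomputed at each iteration.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the dual-domain fusion model.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 6, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()
epochs, batch_size = 200, 8


def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # One common per-iteration decay rule (assumption; the paper's exact rule is unspecified).
    return base_lr * (1 - cur_iter / max_iter) ** power


images = torch.randn(batch_size, 3, 512, 512)            # dummy batch of 512 x 512 crops
labels = torch.randint(0, 6, (batch_size, 512, 512))      # dummy per-pixel labels
max_iter = epochs                                          # one dummy batch per epoch here
for it in range(2):                                        # a couple of dummy iterations
    for g in optimizer.param_groups:
        g["lr"] = poly_lr(1e-4, it, max_iter)
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```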
For a comprehensive comparative analysis, we selected 13 methods from classical to state-of-the-art: FCN [
27], UNet [
12], PSPNet [
29], A2FPN [
33], Deeplabv3+ [
30], MANet [
28], ConvFormer [
42], BANet [
43], UNetFormer [
44], CMTF [
45], WFENet [
54], FNNet [
63], and PFMNet [
65]. In particular, FCN [
27], UNet [
12], PSPNet [
29], A2FPN [
33], Deeplabv3+ [
30], and MANet [
28] are CNN-based methods; ConvFormer [
42], BANet [
43], UNetFormer [
44], and CMTF [
45] are Transformer-based methods; WFENet [
54] is based on wavelet transform; and FNNet [
63] and PFMNet [
65] are fuzzy-based methods.
4.3. Evaluation Metrics
To evaluate the segmentation results of the methods, we employed five representative evaluation metrics for the segmentation task: IoU (intersection over union) per category, OA (overall accuracy), mF1 score, mIoU (mean intersection over union), and inference speed in FPS (frames per second). In addition, the number of parameters and FLOPs (floating-point operations) were employed to evaluate the computational complexity and overhead of the methods.
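For clarity, the following sketch computes the per-category IoU, mIoU, mF1, and OA from a confusion matrix; the variable names and the small example matrix are illustrative.

```python
import numpy as np


def segmentation_metrics(conf):
    """conf: K x K confusion matrix, conf[i, j] = pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp

    iou = tp / (tp + fp + fn + 1e-10)          # per-category IoU
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-10)   # per-category F1
    oa = tp.sum() / (conf.sum() + 1e-10)       # overall accuracy
    return {"IoU": iou, "mIoU": iou.mean(), "mF1": f1.mean(), "OA": oa}


if __name__ == "__main__":
    conf = np.array([[50, 2, 1], [3, 40, 2], [0, 1, 30]])
    print(segmentation_metrics(conf))
```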
4.4. Ablation Study
In order to validate the effectiveness of the modules in our method, we conducted ablation studies on the Vaihingen and Potsdam datasets to discuss and analyze the proposed modules in our method.
Section 4.4.1 adds modules one by one, starting from the baseline model.
Section 4.4.2,
Section 4.4.3 and
Section 4.4.4 analyze each module in the proposed method in detail.
4.4.1. Components of the Method
Table 1 presents the experimental results of adding individual modules to the baseline model (ViT). Since the DFF module is used to semantically align the outputs of the other two modules and perform further feature extraction, we did not add a separate DFF-only configuration. From
Table 1, it can be observed that adding MWFD and T2FSC to the baseline model in the Vaihingen dataset improves OA, F1, and mIoU by 0.68% (0.95%), 1.48% (1.39%), and 2.34% (1.89%), respectively. In the Potsdam dataset, OA, F1, and mIoU performance improved by 1.47% (1.26%), 1.57% (1.26%), and 1.95% (1.39%), respectively. Subsequently, we will validate the effectiveness of the three modules (MWFD, T2FSC, and DFF) by performing detailed ablation analysis through quantitative and qualitative experiments.
4.4.2. Discussion of MWFD
The quantitative results without the MWFD and its components are presented in
Table 2, and the components removed in the experiments are indicated by “w/o”. MWFD-H denotes the high-frequency branch of MWFD, and MWFD-L denotes the low-frequency branch of MWFD. In particular, the removal of the MWFD resulted in the greatest decrease in effect, especially in mIoU, which decreased by 2.43% (2.32%). The removal of the high- and low-frequency branches separately also negatively affected the results. This demonstrates that MWFD can enhance the effect significantly and that both high-frequency and low-frequency branches are essential.
The visualization results without the MWFD and its components are shown in
Figure 5. We have enlarged these hard regions where shadows are severe and the categories converge. It can be clearly seen that the results of the method with MWFD have a significant advantage, with good discrimination of continuous large objects and high attention to continuous large shaded regions, which is lacking in spatial features. The results after removing the MWFD are the worst visually, with confusing segmentation results for shaded regions and serious confusion between low vegetation and building categories. After removing the high-frequency branch, the segmentation of category boundaries becomes worse. After removing the low-frequency branch, the completeness of the segmentation results within the category decreased, and the effect was especially obvious for the building category. This demonstrates that MWFD can effectively learn the feature representation of the grayscale variation region and enhance the discriminative ability for this region.
The trends of the evaluation metrics for different levels of WFD in MWFD are illustrated in
Figure 6. From the top left to the bottom right, the graphs represent OA, mF1, mIoU, and FLOPs, respectively. The metrics improve significantly with one to three levels of WFD, but begin to decrease slightly as the number of levels continues to increase. Since each WFD halves the scale of the feature map, too many levels make the feature map too small and the segmentation performance drops. Meanwhile, the FLOPs gradually increase with additional WFD levels. Therefore, three levels of WFD is the best solution for balancing performance and computational overhead.
4.4.3. Discussion of T2FSC
The quantitative results without T2FSC on the Vaihingen and Potsdam datasets are presented in
Table 3. After removing T2FSC, the OA, mF1, and mIoU of our method decreased by 1.27% (1.41%), 1.02% (1.30%), and 2.13% (2.53%), respectively. The visualization results after removing T2FSC are shown in
Figure 7. It can be clearly observed in the enlarged areas that the phenomenon of category stacking seriously affects the segmentation results after removing T2FSC, and the category boundaries become rough. This indicates that T2FSC achieves more accurate constraints on spatial features, especially boundary features, so that the network can segment the category boundaries more accurately. In the second visualization scenario, after removing T2FSC, the inside of the impervious surface is incorrectly segmented into building and car. This indicates that T2FSC mitigates the segmentation instability caused by intra-class variations.
4.4.4. Discussion of DFF
The quantitative results without DFF on the Vaihingen and Potsdam datasets are presented in
Table 4. The OA, mF1, and mIoU of our method decreased by 1.22% (1.18%), 0.57% (1.28%), and 1.88% (2.15%), respectively. The visualization results after removing DFF are shown in
Figure 7. It can be clearly seen that the category integrity of the segmentation results decreases significantly after removing DFF, and mis-segmentation occurs both at the boundaries and within classes. This is attributed to the semantic gaps between spatial and frequency features: direct fusion produces unstable feature representations and thus confusing segmentation results. DFF alleviates these semantic differences and effectively aggregates the dual-domain features.
4.5. Comparisons with the State-of-the-Art
A comprehensive quantitative and qualitative comparison of our proposed method with state-of-the-art methods was performed on the Vaihingen, Potsdam, and GID datasets.
4.5.1. Results on the Vaihingen Dataset
The quantitative comparison results of our method with other methods on the Vaihingen dataset are shown in
Table 5, which comprehensively evaluates their performance strengths and weaknesses by evaluation metrics. Our method outperforms the other methods in most of the evaluation metrics. In particular, the best results were obtained in the comprehensive evaluation metrics of OA (87.97), mF1 (85.11), and mIoU (74.56). This demonstrates that our method can effectively extract and integrate the complex and variable feature representations in Vaihingen to achieve more accurate segmentation results. In addition, a large lead is achieved in the low vegetation and tree categories, which indicates that our method can effectively solve the challenge of relative foreground–background segmentation in RSIs.
In contrast, methods such as PFMNet, CMTF, and MANet also performed well. In particular, the results of MANet in the impervious surface and low vegetation categories are very close to those of our method. CNN-based models such as FCN, UNet, PSPNet, and A2FPN also obtained good results, but generally not as good as methods that include a Transformer, which highlights the superiority of the Transformer. Transformer-based methods such as ConvFormer, BANet, and UNetFormer performed satisfactorily in most categories; however, for the impervious surface and low vegetation categories, our method is superior. DeepLabv3+ achieves the highest FPS, while our method also achieves a high FPS and an excellent balance between speed and performance.
The visualization results in
Figure 8 provide a further complement to the quantitative results of the Vaihingen dataset. These prediction results visually compare the advantages and disadvantages between the methods and provide a qualitative assessment of the performance of each method. The results of the large scene visualization in Vaihingen clearly show that the results of our method are much closer to the ground truth, especially for the relative foreground–background and multi-category intersections. For the building and low vegetation categories, our method achieves a smooth and complete segmentation effect, which corresponds to the high mF1 and mIoU in the quantitative analysis. The red boxes in
Figure 8 mark the areas that are prone to mis-segmentation, where the boundaries of the categories are easily confused. It can be clearly seen that our method segmented clear boundaries, especially for the building, tree, and car categories, which are almost consistent with the ground truth. This demonstrates the superiority of our method in boundary segmentation.
In contrast, methods such as FCN, PSPNet, and A2FPN produced more misclassifications, with broken and distorted results and insufficient completeness. This was particularly evident for the building category, which was frequently misclassified as low vegetation or impervious surface. Methods such as BANet, CMTF, and PFMNet achieve satisfactory segmentation results, and their segmentation completeness is greatly improved compared to CNN-based methods. However, their results are still not fine enough at category boundaries and within classes, which can be attributed to the challenges posed by the instability of intra-class and inter-class variance.
Figure 9a shows the comprehensive evaluation of mIoU and computational overhead for the various methods on the Vaihingen dataset; the diameter of each method's circle is proportional to its FLOPs. The mIoU of our method is considerably ahead of the CNN-based methods. Compared to other state-of-the-art methods, our method attains the highest performance with comparable parameters and FLOPs. Overall, our method achieves the best mIoU with a good balance between performance and computational overhead.
The loss curves of various methods during training on the Vaihingen dataset are shown in
Figure 10a. In particular, our method has the fastest convergence rate at the beginning of training and shows a strong learning ability. During the 100–200 epochs of training, our method also maintains a smoother trend of decreasing loss and eventually achieves the best convergence results. MANet and PFMNet stand out among the many comparative methods and also show excellent training performance. However, the loss curve of MANet consistently has a large oscillation amplitude, which indicates that the model has a tendency to be unstable. Compared with MANet, PFMNet is more stable, but the final convergence is not as good as our method.
Comprehensive and diverse experimental results in Vaihingen demonstrate the superiority of our method in semantic segmentation tasks. Our method outperforms other methods in evaluation metrics, with a good balance between performance and efficiency, and the loss curves further validate the robustness of our method. This analysis not only validates the superiority of our method, but also provides a new perspective for the comparison of semantic segmentation methods for RSIs.
4.5.2. Results on the Potsdam Dataset
The quantitative results of the Potsdam dataset are shown in
Table 6, and our method outperforms the other methods in most of the evaluation metrics. In particular, for the impervious surface (62.57) and low vegetation (71.00) categories, the superiority of our method is obvious. This proves that our method has a stronger learning ability for such categories with heterogeneous distributions and unstable representations, which can be attributed to our effective aggregation of frequency information. For the composite metrics of OA, mF1, and mIoU, our method yields the best results while achieving a high FPS, which proves that our method has the most comprehensive performance.
In contrast, the performance of FCN and UNet, which are based on conventional convolution, is poor. PSPNet and DeepLabv3+ improve the segmentation results by introducing pyramid pooling and dilated convolutions, respectively. PFMNet achieves the second-best overall results, with the best result in the car category. CMTF and MANet also achieve good results, but with lower metrics in individual categories.
The visualization results provided in
Figure 11 correspond to the quantitative results in
Table 6. Similar to the visualization results on the Vaihingen dataset, the predictions of our methods on Potsdam are closer to the ground truth, which is consistent with the fact that we achieved the highest OA, mF1, and mIoU in quantitative analysis.
Specifically, the red boxes in
Figure 11 indicate difficult regions. In the first scenario, the two labeled regions are very dense and cluttered with buildings, which makes them difficult to segment clearly. As expected, many methods suffer from some degree of misclassification in these two regions. In contrast, our method segmented the buildings with clear boundaries that were almost identical to the ground truth. The difficult regions in the other two scenarios contain larger complete buildings with large intra-class variation, which greatly increases the difficulty of complete segmentation. Most comparison methods misclassify in these regions, whereas our method overcomes the intra-class variations and segments the buildings completely.
The comprehensive evaluation of the mIoU and computational overhead of the various methods in the Potsdam dataset is shown in
Figure 9b. It can be clearly seen that our methods achieve the highest mIoU with an excellent balance between performance and computational overhead.
Figure 10b represents the loss curves of each method when trained on the Potsdam dataset. Our method also maintains the fastest convergence rate, and the loss decreases smoothly during training, ultimately achieving the best convergence results.
4.5.3. Results on the GID Dataset
The quantitative results of the GID dataset are shown in
Table 7. Our method outperforms other methods on most evaluation metrics and achieves the best results in OA, mF1, and mIoU. In particular, our mIoU result outperforms the second-best method by 2.4%. CMTF and PFMNet also achieve excellent results, with the best metrics in the meadow and forest categories, respectively.
The results of the visualization of the GID are presented in
Figure 12. The visualization results of our method show the best completeness and the clearest representation of details. Specifically, in the first scenario, our method significantly outperforms the other methods in the built-up category. In the hard regions of the second scenario, our method achieves accurate segmentation of the cluttered distribution of the farmland category, whereas UNetFormer, FNNet, and other comparison methods provide confusing segmentation results for this region. Similarly, in the third scenario, our method segments the forest category completely, showing a clear advantage over the other methods.
Figure 9c shows that our method achieves an excellent balance between effectiveness and computational overhead.
Figure 10c shows the loss curves of the various methods when trained on the GID dataset. Our method still maintains the fastest convergence rate and ultimately achieves the best convergence. In the loss curves in
Figure 10, our method shows the best convergence speed and convergence results on all three datasets, and the loss decreases steadily during training, further validating the robustness of our method.
Extensive experimental results on the Vaihingen, Potsdam, and GID datasets demonstrate the superior adaptability and robustness of our method for the segmentation of complex scenes and uncertain boundaries. In the diverse land cover categories used in this study, our method exhibits extremely high accuracy and excellent detail retention, and maintains an excellent balance between performance and computational overhead.
5. Discussion
In this study, three high-resolution RSI datasets from different data platforms are utilized: the Vaihingen dataset, the Potsdam dataset, and the GID dataset. Qualitative and quantitative experiments demonstrate that our method achieves excellent accuracy and robustness on these datasets, outperforming state-of-the-art methods. Ablation experiments systematically validate the effectiveness of each module, and the soundness of our design is further verified by incrementally adding or removing components of the method.
Despite the excellent performance of our method, some limitations remain. The T2FSC module is introduced to mitigate the boundary uncertainty caused by category stacking, but it also introduces additional computational overhead; simplifying and reducing the dimensionality of the fuzzy parameters is therefore important for model efficiency and practical applications. In addition, our method focuses only on optical RSIs and does not consider other types of RSIs, which limits its applicability to a certain extent and is a common problem of some existing methods. Subsequent research should therefore focus on public datasets with multiple modalities as well as real-world data, and explore implementation strategies for various RSI semantic segmentation methods, such as semi-supervised learning, prototype learning, and transfer learning.
In summary, this study has both theoretical and practical significance and provides valuable insights and inspiration for the advancement in the field of semantic segmentation. In future work, we will explore multiple implementation strategies and multi-source RSI semantic segmentation methods, and work on developing methods with lower computational overhead and better generalization capabilities.
6. Conclusions
This paper proposes a novel dual-domain fusion network, a specialized method that addresses the unique and challenging characteristics encountered in RSI segmentation. In particular, the proposed MWFD effectively extracts and aggregates multi-level frequency features, which alleviates the segmentation difficulties caused by grayscale variations. T2FSC imposes feature constraints through adaptive upper and lower membership functions to provide more stable feature representations, enabling the network to achieve higher segmentation accuracy when dealing with category boundaries. In addition, the proposed DFF achieves effective aggregation of the dual-domain features, which further enhances the segmentation capability of the network. Extensive experiments on the Vaihingen, Potsdam, and GID datasets verify that the proposed method significantly outperforms existing state-of-the-art methods in terms of both segmentation accuracy and robustness. Detailed comparative analyses reveal that our method excels in handling complex scenarios, demonstrating its potential for practical applications. In future research, we intend to explore more lightweight type-2 fuzzy models and adapt them to semantic segmentation tasks for multiple types of images to extend their applications.
Author Contributions
Conceptualization, G.W.; Methodology, G.W., J.X. and Q.C.; Software, G.W.; Validation, G.W.; Formal analysis, G.W. and H.X.; Investigation, G.W., J.X. and Q.C.; Resources, G.W., J.X. and W.Y.; Data curation, W.Y. and H.X.; Writing—original draft, G.W.; Writing—review and editing, G.W., J.X. and Q.C.; Visualization, G.W., J.X. and Q.C.; Supervision, J.X., W.Y. and M.N.; Project administration, G.W., J.X. and M.N.; Funding acquisition, J.X., H.X. and M.N. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62072391 and Grant 62066013, and the Graduate Innovation Foundation of Yantai University (GIFYTU) under Grant GGIFYTU2442.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Chen, Z.; Deng, L.; Luo, Y.; Li, D.; Junior, J.M.; Gonçalves, W.N.; Nurunnabi, A.A.M.; Li, J.; Wang, C.; Li, D. Road extraction in remote sensing data: A survey. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102833. [Google Scholar] [CrossRef]
- Lv, Z.; Huang, H.; Li, X.; Zhao, M.; Benediktsson, J.A.; Sun, W.; Falco, N. Land cover change detection with heterogeneous remote sensing images: Review, progress, and perspective. Proc. IEEE 2022, 110, 1976–1991. [Google Scholar] [CrossRef]
- Zheng, Q.; Seto, K.; Zhou, Y.; You, S.; Weng, Q. Nighttime light remote sensing for urban applications: Progress, challenges, and prospects. ISPRS-J. Photogramm. Remote Sens. 2023, 202, 125–141. [Google Scholar] [CrossRef]
- Li, L.; Mu, X.; Jiang, H.; Chianucci, F.; Hu, R.; Song, W.; Qi, J.; Liu, S.; Zhou, J.; Chen, L.; et al. Review of ground and aerial methods for vegetation cover fraction (fcover) and related quantities estimation: Definitions, advances, challenges, and future perspectives. ISPRS-J. Photogramm. Remote Sens. 2023, 199, 133–156. [Google Scholar] [CrossRef]
- Jiang, H.; Peng, M.; Zhong, Y.; Xie, H.; Hao, Z.; Lin, J.; Ma, X.; Hu, X. A Survey on Deep Learning-Based Change Detection from High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 1552. [Google Scholar] [CrossRef]
- Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Farseg++: Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13715–13729. [Google Scholar] [CrossRef]
- Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
- Liu, H.; Wang, C.; Zhao, J.; Chen, S.; Kong, H. Adaptive fourier convolution network for road segmentation in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5617214. [Google Scholar] [CrossRef]
- Xiao, R.; Zhong, C.; Zeng, W.; Cheng, M.; Wang, C. Novel convolutions for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5907313. [Google Scholar] [CrossRef]
- Hang, R.; Yang, P.; Zhou, F.; Liu, Q. Multiscale progressive segmentation network for high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5412012. [Google Scholar] [CrossRef]
- Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on convolutional neural networks (cnn) in vegetation remote sensing. ISPRS-J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Yang, X.; Li, S.; Chen, Z.S.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery. ISPRS-J. Photogramm. Remote Sens. 2021, 177, 238–262. [Google Scholar] [CrossRef]
- Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
- Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
- Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
- Yao, M.; Zhang, Y.; Liu, G.; Pang, D. Ssnet: A novel transformer and cnn hybrid network for remote sensing semantic segmentation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 3023–3037. [Google Scholar] [CrossRef]
- Xiang, X.; Gong, W.; Li, S.; Chen, J.; Ren, T. Tcnet: Multiscale fusion of transformer and cnn for semantic segmentation of remote sensing images. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 3123–3136. [Google Scholar] [CrossRef]
- Xu, C.; Jia, W.; Wang, R.; Luo, X.; He, X. Morphtext: Deep morphology regularized accurate arbitrary-shape scene text detection. IEEE Trans. Multimedia. 2022, 25, 4199–4212. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, H.; Hu, Y.; Hu, X.; Chen, L.; Hu, S. Geometric boundary guided feature fusion and spatial-semantic context aggregation for semantic segmentation of remote sensing images. IEEE Trans. Image Process. 2023, 32, 6373–6385. [Google Scholar] [CrossRef]
- Xu, Y.; Zhu, Z.; Guo, M.; Huang, Y. Multiscale edge-guided network for accurate cultivated land parcel boundary extraction from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4501020. [Google Scholar] [CrossRef]
- He, C.; Li, S.; Xiong, D.; Fang, P.; Liao, M. Remote Sensing Image Semantic Segmentation Based on Edge Information Guidance. Remote Sens. 2020, 12, 1501. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- He, P.; Jiao, L.; Shang, R.; Wang, S.; Liu, X.; Quan, D.; Yang, K.; Zhao, D. MANet: Multi-Scale Aware-Relation Network for Semantic Segmentation in Aerial Scenes. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624615. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851. [Google Scholar]
- Liu, J.; Hua, W.; Zhang, W.; Liu, F.; Xiao, L. Stair fusion network with context-refined attention for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4701517. [Google Scholar] [CrossRef]
- Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A synergistical attention model for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916. [Google Scholar] [CrossRef]
- Hu, M.; Li, Y.; Fang, L.; Wang, S. A2-FPN: Attention aggregation based feature pyramid network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 20–25 June 2021; pp. 15343–15352. [Google Scholar]
- Qi, L.; Zuo, D.; Wang, Y.; Tao, Y.; Tang, R.; Shi, J.; Gong, J.; Li, B. Convolutional Neural Network-Based Method for Agriculture Plot Segmentation in Remote Sensing Images. Remote Sens. 2024, 16, 346. [Google Scholar] [CrossRef]
- Chong, Q.; Ni, M.; Huang, J.; Liang, Z.; Wang, J.; Li, Z.; Xu, J. Pos-DANet: A dual-branch awareness network for small object segmentation within high-resolution remote sensing images. Eng. Appl. Artif. Intell. 2024, 133, 107960. [Google Scholar] [CrossRef]
- Li, S.; Liu, F.; Jiao, L.; Liu, X.; Chen, P.; Li, L. Mask-Guided Correlation Learning for Few-Shot Segmentation in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5636714. [Google Scholar] [CrossRef]
- Huang, W.; Shi, Y.; Xiong, Z.; Zhu, X. Decouple and weight semi-supervised semantic segmentation of remote sensing images. ISPRS-J. Photogramm. Remote Sens. 2024, 212, 13–26. [Google Scholar] [CrossRef]
- Sun, W.; Zhang, J.; Lei, Y.; Hong, D. RSProtoSeg: High Spatial Resolution Remote Sensing Images Segmentation Based on Non-Learnable Prototypes. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5626610. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–27 October 2017; pp. 2980–2988. [Google Scholar]
- Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
- Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and cnn hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820. [Google Scholar] [CrossRef]
- Lin, X.; Yan, Z.; Deng, X.; Zheng, C.; Yu, L. ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2023, Vancouver, BC, Canada, 8–12 October 2023; pp. 642–651. [Google Scholar]
- Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS-J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
- Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
- Li, X.; Xu, F.; Liu, F.; Tong, Y.; Lyu, X.; Zhou, J. Semantic Segmentation of Remote Sensing Images by Interactive Representation Refinement and Geometric Prior-Guided Inference. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5400318. [Google Scholar] [CrossRef]
- Xiang, S.; Liang, Q. Remote sensing image compression based on high-frequency and low-frequency components. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5604715. [Google Scholar] [CrossRef]
- Cui, Y.; Ren, W.; Cao, X.; Knoll, A. Image restoration via frequency selection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1093–1108. [Google Scholar] [CrossRef]
- Yang, Y.; Jiao, L.; Liu, F.; Liu, X.; Li, L.; Chen, P.; Yang, S. An explainable spatial–frequency multiscale transformer for remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5907515. [Google Scholar] [CrossRef]
- Fan, J.; Li, J.; Liu, Y.; Zhang, F. Frequency-aware robust multidimensional information fusion framework for remote sensing image segmentation. Eng. Appl. Artif. Intell. 2024, 129, 107638. [Google Scholar] [CrossRef]
- Ma, H.; Liu, D.; Yan, N.; Li, H.; Wu, F. End-to-end optimized versatile image compression with wavelet-like transform. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1247–1263. [Google Scholar] [CrossRef]
- Huang, Y.; Huang, J.; Liu, J.; Yan, M.; Dong, Y.; Lv, J.; Chen, C.; Chen, S. WaveDM: Wavelet-based diffusion models for image restoration. IEEE Trans. Multimed. 2024, 26, 7058–7073. [Google Scholar] [CrossRef]
- Xu, J.; Zhao, J.; Fu, Y. An efficient hyperspectral image classification method using deep fusion of 3-D discrete wavelet transform and CNN. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5505905. [Google Scholar] [CrossRef]
- Li, Y.; Liu, Z.; Yang, J.; Zhang, H. Wavelet Transform Feature Enhancement for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5644. [Google Scholar] [CrossRef]
- De Souza Brito, A.; Vieira, M.B.; De Andrade, M.L.S.C.; Feitosa, R.Q.; Giraldi, G.A. Combining max-pooling and wavelet pooling strategies for semantic image segmentation. Expert Syst. Appl. 2021, 183, 115403. [Google Scholar] [CrossRef]
- Zong, H.; Zhang, E.; Li, X.; Zhang, H. Multiscale self-supervised SAR image change detection based on wavelet transform. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4006205. [Google Scholar] [CrossRef]
- Zhou, Y.; Huang, J.; Wang, C.; Song, L.; Yang, G. XNet: Wavelet-based low and high frequency fusion networks for fully- and semi-supervised semantic segmentation of biomedical images. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 21085–21096. [Google Scholar]
- Kotaridis, I.; Lazaridou, M. Remote sensing image segmentation advances: A meta-analysis. ISPRS-J. Photogramm. Remote Sens. 2021, 171, 309–322. [Google Scholar] [CrossRef]
- Cheng, G.; Xie, X.; Han, J.; Guo, L.; Xia, G.-S. Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
- Bustince, H.; Barrenechea, E.; Pagola, M.; Fernandez, J.; Xu, Z.; Bedregal, B.; Montero, J.; Hagras, H.; Herrera, F.; De Baets, B. A historical account of types of fuzzy sets and their relationships. IEEE Trans. Fuzzy Syst. 2016, 24, 179–194. [Google Scholar] [CrossRef]
- Xu, J.; Li, K.; Li, Z.; Chong, Q.; Xing, H.; Xing, Q.; Ni, M. Fuzzy graph convolutional network for hyperspectral image classification. Eng. Appl. Artif. Intell. 2024, 127, 107280. [Google Scholar] [CrossRef]
- Ye, F.; Luo, W.; Dong, M.; Li, D.; Min, W. Content-based remote sensing image retrieval based on fuzzy rules and a fuzzy distance. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8002505. [Google Scholar] [CrossRef]
- Qu, T.; Xu, J.; Chong, Q.; Liu, Z.; Yan, W.; Wang, X.; Song, Y.; et al. Fuzzy neighbourhood neural network for high-resolution remote sensing image segmentation. Eur. J. Remote Sens. 2023, 56, 1. [Google Scholar] [CrossRef]
- Chong, Q.; Xu, J.; Ding, Y.; Dai, Z. A multiscale bidirectional fuzzy-driven learning network for remote sensing image segmentation. Int. J. Remote Sens. 2023, 44, 6860–6881. [Google Scholar] [CrossRef]
- Wei, G.; Xu, J.; Chong, Q.; Huang, J.; Xing, H. Prior-Guided Fuzzy-Aware Multibranch Network for Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 9, 4. [Google Scholar] [CrossRef]
- Mittal, K.; Jain, A.; Vaisla, K.S.; Castillo, O.; Kacprzyk, J. A comprehensive review on type 2 fuzzy logic applications: Past, present and future. Eng. Appl. Artif. Intell. 2020, 95, 103916. [Google Scholar] [CrossRef]
- Ge, A.; Chang, Z.; Feng, J.-E. Interval type-2 fuzzy relation matrix model via semitensor product. IEEE Trans. Fuzzy Syst. 2023, 31, 3984–3994. [Google Scholar] [CrossRef]
- Dombi, J.; Hussain, A. Data-driven interval type-2 fuzzy inference system based on the interval type-2 distending function. IEEE Trans. Fuzzy Syst. 2023, 31, 2345–2359. [Google Scholar] [CrossRef]
- Beke, A.; Kumbasar, T. More than accuracy: A composite learning framework for interval type-2 fuzzy logic systems. IEEE Trans. Fuzzy Syst. 2023, 31, 734–744. [Google Scholar] [CrossRef]
- Wang, C.; Wang, X.; Wu, D.; Kuang, M.; Li, Z. Meticulous Land Cover Classification of High-Resolution Images Based on Interval Type-2 Fuzzy Neural Network with Gaussian Regression Model. Remote Sens. 2022, 14, 3704. [Google Scholar] [CrossRef]
Figure 1.
Visualization of the two challenges facing the RSI semantic segmentation task, with typical regions circled in dashed lines. (a) Shadows cause variations in grayscale, which makes it difficult to distinguish between categories. (b) Stacking of categories with similar spatial representations increases boundary uncertainty.
Figure 2.
Overall architecture of the proposed method.
Figure 3.
Illustration of continuous WFD. After 1–3 levels of WFD, 12 multi-level high-frequency and low-frequency sub-features are obtained.
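For intuition, the sketch below shows one way such a 1–3 level decomposition can be reproduced with PyWavelets: each level yields one low-frequency (LL) and three high-frequency (LH, HL, HH) sub-bands, giving 12 sub-features over three levels. The Haar wavelet and the helper name `multilevel_wfd` are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of a 3-level 2D wavelet frequency decomposition (WFD).
# Each level produces one low-frequency (LL) and three high-frequency
# (LH, HL, HH) sub-bands, so 3 levels yield 12 sub-features in total.
import numpy as np
import pywt  # PyWavelets


def multilevel_wfd(feature_map: np.ndarray, levels: int = 3, wavelet: str = "haar"):
    """Decompose a single-channel 2D feature map into per-level sub-bands."""
    sub_features = []
    current = feature_map
    for level in range(1, levels + 1):
        ll, (lh, hl, hh) = pywt.dwt2(current, wavelet)
        sub_features.append({"level": level, "LL": ll, "LH": lh, "HL": hl, "HH": hh})
        current = ll  # the next level decomposes the low-frequency band
    return sub_features


# Example: a 256x256 map gives 128/64/32-sized sub-bands at levels 1/2/3.
bands = multilevel_wfd(np.random.rand(256, 256))
print(sum(len(d) - 1 for d in bands))  # 12 sub-features (4 per level x 3 levels)
```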
Figure 4.
Illustration of how T2FSC constrains the feature map. T2FSC iterates over the entire feature map, and N denotes the number of iterations.
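As a rough illustration only, the sketch below applies an interval type-2 Gaussian membership with an uncertain width to a feature map and repeats the constraint N times. The membership design, the 0.8/1.2 width factors, and the average-based type reduction are assumptions; they do not reproduce the paper's exact T2FSC formulation.

```python
# Hypothetical sketch of an interval type-2 fuzzy constraint on a feature map.
import numpy as np


def it2_gaussian_membership(x, mean, sigma_low, sigma_high):
    """Lower/upper memberships of a Gaussian primary MF with uncertain width."""
    narrow = np.exp(-0.5 * ((x - mean) / sigma_low) ** 2)
    wide = np.exp(-0.5 * ((x - mean) / sigma_high) ** 2)
    return np.minimum(narrow, wide), np.maximum(narrow, wide)


def t2fsc_constrain(feature_map, n_iters=2):
    """Iteratively re-weight a 2D feature map by its type-reduced membership."""
    x = feature_map.astype(float).copy()
    for _ in range(n_iters):  # N iterations over the whole feature map
        mean, std = x.mean(), x.std() + 1e-6
        low_mu, up_mu = it2_gaussian_membership(x, mean, 0.8 * std, 1.2 * std)
        x = x * 0.5 * (low_mu + up_mu)  # simple average-based type reduction
    return x


constrained = t2fsc_constrain(np.random.rand(64, 64), n_iters=2)
```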
Figure 5.
Visualization of ablation studies for MWFD and its components on the Vaihingen dataset. Red boxes indicate where differences are evident.
Figure 6.
The impact of different levels of WFD in MWFD.
Figure 7.
Visualization of ablation studies for T2FSC and DFF on the Vaihingen dataset. Red boxes indicate where differences are evident.
Figure 8.
Segmentation visualization results with different methods on the Vaihingen dataset. The red boxes indicate where our method clearly outperforms the others.
Figure 9.
Comparison between the mIoU and computational overhead of different methods on the (a) Vaihingen, (b) Potsdam, and (c) GID datasets. The vertical axis represents mIoU and the horizontal axis represents the number of parameters (Params). The diameter of each circle is positively correlated with the FLOPs of the method.
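As a companion to Figure 9, the snippet below shows how the Params axis can be obtained for a PyTorch model; FLOPs are usually measured with a separate profiling tool (e.g., fvcore or thop) and are omitted here. The helper name is illustrative, not taken from the paper.

```python
# Sketch: count trainable parameters (in millions) for the Params axis of Figure 9.
import torch.nn as nn


def count_params_millions(model: nn.Module) -> float:
    """Total number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


# Toy example: a single 3x3 conv layer has 1792 parameters (~0.0018 M).
print(count_params_millions(nn.Conv2d(3, 64, kernel_size=3)))
```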
Figure 10.
Loss curves of different methods on (a) Vaihingen, (b) Potsdam, and (c) GID datasets. The vertical axis represents the loss and the horizontal axis represents the epoch.
Figure 11.
Segmentation visualization results with different methods on the Vaihingen dataset. The red boxes indicate where our method clearly outperforms the others.
Figure 12.
Segmentation visualization results with different methods on the GID dataset. The red boxes indicate where our method clearly outperforms the others.
Table 1.
Results of adding individual modules to the baseline model. Bold indicates the best result in the current column.
| Dataset | MWFD | T2FSC | DFF | OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| Vaihingen | - | - | - | 85.42 | 82.35 | 69.08 |
| Vaihingen | ✓ | - | - | 86.10 | 83.83 | 71.42 |
| Vaihingen | - | ✓ | - | 86.37 | 83.74 | 70.97 |
| Vaihingen | ✓ | ✓ | ✓ | 87.97 | 85.11 | 74.56 |
| Potsdam | - | - | - | 83.38 | 80.67 | 68.83 |
| Potsdam | ✓ | - | - | 84.85 | 82.24 | 70.78 |
| Potsdam | - | ✓ | - | 84.64 | 81.93 | 70.22 |
| Potsdam | ✓ | ✓ | ✓ | 86.81 | 84.18 | 73.60 |
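For readers reproducing the numbers in Tables 1–7, the sketch below shows the standard way OA, mF1, and mIoU are computed from a per-pixel confusion matrix; the function name and the toy matrix are illustrative and not taken from the paper.

```python
# Sketch: OA, mean F1, and mean IoU from a per-pixel confusion matrix.
import numpy as np


def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    oa = tp.sum() / conf.sum()                     # overall accuracy
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)  # per-class F1
    iou = tp / np.maximum(tp + fp + fn, 1)         # per-class IoU
    return oa, f1.mean(), iou.mean()               # OA, mF1, mIoU


# Toy 3-class example.
oa, mf1, miou = segmentation_metrics(np.array([[50, 2, 3], [4, 40, 1], [2, 3, 45]]))
```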
Table 2.
Results after removing the MWFD and its components. Bold indicates the best result in the current column.
| Methods | Vaihingen OA (%) | Vaihingen mF1 (%) | Vaihingen mIoU (%) | Potsdam OA (%) | Potsdam mF1 (%) | Potsdam mIoU (%) |
|---|---|---|---|---|---|---|
| Ours w/o MWFD | 86.12 | 83.87 | 72.13 | 85.39 | 82.13 | 71.28 |
| Ours w/o MWFD-H | 87.24 | 83.64 | 72.92 | 86.37 | 82.81 | 72.35 |
| Ours w/o MWFD-L | 87.25 | 84.08 | 73.58 | 86.32 | 83.10 | 72.51 |
| Ours | 87.97 | 85.11 | 74.56 | 86.81 | 84.18 | 73.60 |
Table 3.
Results after removing the T2FSC. Bold indicates the best result in the current column.
| Methods | Vaihingen OA (%) | Vaihingen mF1 (%) | Vaihingen mIoU (%) | Potsdam OA (%) | Potsdam mF1 (%) | Potsdam mIoU (%) |
|---|---|---|---|---|---|---|
| Ours w/o T2FSC | 86.70 | 84.09 | 72.43 | 85.40 | 82.88 | 71.07 |
| Ours | 87.97 | 85.11 | 74.56 | 86.81 | 84.18 | 73.60 |
Table 4.
Results after removing the DFF.
| Methods | Vaihingen OA (%) | Vaihingen mF1 (%) | Vaihingen mIoU (%) | Potsdam OA (%) | Potsdam mF1 (%) | Potsdam mIoU (%) |
|---|---|---|---|---|---|---|
| Ours w/o DFF | 86.75 | 84.54 | 72.68 | 85.63 | 82.88 | 71.45 |
| Ours | 87.97 | 85.11 | 74.56 | 86.81 | 84.18 | 73.60 |
Table 5.
Results on the Vaihingen dataset. Bold indicates the best result in the current column.
| Methods | Imp. Surf. IoU (%) | Building IoU (%) | Low Veg. IoU (%) | Tree IoU (%) | Car IoU (%) | OA (%) | mF1 (%) | mIoU (%) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| FCN [27] | 80.11 | 82.81 | 67.94 | 64.93 | 40.78 | 83.87 | 74.44 | 61.57 | 190 |
| UNet [12] | 79.66 | 82.02 | 66.67 | 65.35 | 47.92 | 83.53 | 75.68 | 63.65 | 134 |
| PSPNet [29] | 77.70 | 79.76 | 64.89 | 62.72 | 42.04 | 83.10 | 74.40 | 60.87 | 150 |
| A2FPN [33] | 78.10 | 80.25 | 67.05 | 64.31 | 43.36 | 83.88 | 77.14 | 63.92 | 182 |
| Deeplabv3+ [30] | 81.24 | 87.01 | 68.92 | 65.92 | 55.02 | 86.17 | 81.32 | 68.30 | 196 |
| MANet [28] | 83.08 | 88.11 | 70.54 | 67.43 | 66.33 | 87.14 | 82.86 | 71.49 | 179 |
| ConvFormer [42] | 79.62 | 87.42 | 61.09 | 60.60 | 61.59 | 85.64 | 79.24 | 71.11 | 137 |
| BANet [43] | 80.09 | 86.49 | 62.75 | 59.07 | 57.77 | 81.22 | 81.02 | 66.57 | 140 |
| UNetFormer [44] | 76.25 | 86.74 | 64.91 | 63.50 | 57.58 | 85.44 | 79.16 | 71.16 | 164 |
| CMTF [45] | 80.23 | 86.45 | 67.12 | 65.10 | 62.33 | 85.78 | 81.68 | 72.23 | 168 |
| WFENet [54] | 76.68 | 79.63 | 66.16 | 61.09 | 52.01 | 83.99 | 78.35 | 65.27 | 129 |
| FNNet [63] | 78.10 | 81.09 | 66.43 | 64.07 | 54.59 | 84.88 | 77.13 | 65.91 | 137 |
| PFMNet [65] | 81.77 | 86.60 | 69.41 | 66.98 | 69.24 | 86.41 | 82.79 | 72.70 | 158 |
| Ours | 84.37 | 89.19 | 71.58 | 68.64 | 70.98 | 87.97 | 85.11 | 74.56 | 152 |
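The FPS column is typically obtained by timing forward passes on fixed-size inputs. The following PyTorch sketch shows one plausible measurement protocol; the input size, warm-up count, and number of runs are assumptions, not the paper's settings.

```python
# Sketch: measure inference throughput (frames per second) for a PyTorch model.
import time
import torch


@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 512, 512), runs=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                  # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()         # make timings include all GPU work
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.time() - start)  # frames per second
```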
Table 6.
Results on the Potsdam dataset. Bold indicates the best result in the current column.
| Methods | Imp. Surf. IoU (%) | Building IoU (%) | Low Veg. IoU (%) | Tree IoU (%) | Car IoU (%) | OA (%) | mF1 (%) | mIoU (%) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| FCN [27] | 48.02 | 59.28 | 51.04 | 54.80 | 72.52 | 82.44 | 75.72 | 61.26 | 131 |
| UNet [12] | 51.98 | 68.39 | 60.48 | 57.03 | 73.84 | 82.91 | 72.12 | 62.77 | 127 |
| PSPNet [29] | 50.71 | 73.46 | 62.55 | 59.57 | 82.76 | 83.66 | 80.37 | 67.65 | 112 |
| A2FPN [33] | 52.74 | 71.53 | 55.14 | 55.57 | 72.67 | 77.85 | 73.58 | 62.54 | 150 |
| Deeplabv3+ [30] | 57.93 | 75.67 | 69.16 | 62.54 | 85.62 | 85.24 | 82.37 | 71.05 | 157 |
| MANet [28] | 60.13 | 77.79 | 68.16 | 63.85 | 85.78 | 85.85 | 83.08 | 72.06 | 132 |
| ConvFormer [42] | 52.45 | 66.00 | 54.42 | 53.41 | 71.33 | 81.06 | 79.32 | 70.86 | 129 |
| BANet [43] | 49.86 | 60.92 | 49.26 | 50.26 | 69.78 | 76.64 | 72.86 | 64.89 | 134 |
| UNetFormer [44] | 53.10 | 67.59 | 47.72 | 45.92 | 71.93 | 80.56 | 77.82 | 69.65 | 153 |
| CMTF [45] | 58.03 | 71.89 | 57.13 | 53.21 | 80.10 | 83.38 | 80.76 | 71.53 | 134 |
| WFENet [54] | 53.02 | 69.47 | 65.46 | 58.41 | 75.21 | 83.57 | 73.69 | 65.50 | 122 |
| FNNet [63] | 50.09 | 68.51 | 55.86 | 56.82 | 76.75 | 79.90 | 75.60 | 64.76 | 125 |
| PFMNet [65] | 61.55 | 79.29 | 69.80 | 64.38 | 86.05 | 85.58 | 83.22 | 72.10 | 138 |
| Ours | 62.57 | 79.64 | 71.00 | 65.35 | 85.68 | 86.81 | 84.18 | 73.60 | 135 |
Table 7.
Results on the GID dataset. Bold indicates the best result in the current column.
| Methods | Imp. Surf. IoU (%) | Building IoU (%) | Low Veg. IoU (%) | Tree IoU (%) | Car IoU (%) | OA (%) | mF1 (%) | mIoU (%) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| FCN [27] | 72.44 | 79.51 | 76.16 | 57.49 | 85.14 | 81.37 | 72.72 | 71.39 | 140 |
| UNet [12] | 59.76 | 79.38 | 55.87 | 58.22 | 88.28 | 80.32 | 70.33 | 69.09 | 124 |
| PSPNet [29] | 69.06 | 82.20 | 80.07 | 54.11 | 90.35 | 82.65 | 72.76 | 74.67 | 113 |
| A2FPN [33] | 69.65 | 79.71 | 80.42 | 65.69 | 87.93 | 84.50 | 74.78 | 76.36 | 149 |
| Deeplabv3+ [30] | 75.38 | 85.11 | 79.09 | 56.31 | 90.99 | 82.81 | 74.13 | 75.52 | 155 |
| MANet [28] | 77.22 | 86.84 | 80.47 | 54.69 | 91.02 | 83.57 | 74.62 | 76.60 | 131 |
| ConvFormer [42] | 70.30 | 78.56 | 81.21 | 63.12 | 83.32 | 83.56 | 74.49 | 75.77 | 129 |
| BANet [43] | 66.30 | 78.35 | 76.25 | 61.25 | 84.67 | 82.23 | 73.39 | 73.64 | 142 |
| UNetFormer [44] | 67.67 | 73.76 | 85.76 | 61.55 | 87.39 | 82.45 | 73.05 | 74.11 | 154 |
| CMTF [45] | 65.25 | 79.37 | 81.12 | 67.97 | 88.74 | 84.15 | 74.29 | 75.47 | 141 |
| WFENet [54] | 67.27 | 66.44 | 59.25 | 53.78 | 67.81 | 82.63 | 73.04 | 74.23 | 118 |
| FNNet [63] | 64.90 | 79.63 | 77.90 | 56.65 | 89.62 | 81.75 | 72.66 | 72.44 | 126 |
| PFMNet [65] | 81.09 | 87.15 | 82.46 | 57.13 | 93.27 | 84.67 | 74.97 | 78.61 | 134 |
| Ours | 83.42 | 89.02 | 81.51 | 67.87 | 94.73 | 86.94 | 76.91 | 81.01 | 130 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).