Article

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

School of Computer Science and Technology, Xinjiang University, Ürümqi 830046, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3184; https://doi.org/10.3390/rs16173184
Submission received: 24 June 2024 / Revised: 16 August 2024 / Accepted: 27 August 2024 / Published: 28 August 2024
(This article belongs to the Special Issue Image Enhancement and Fusion Techniques in Remote Sensing)

Abstract

Semantic segmentation is currently a hot topic in remote sensing image processing, with extensive applications in land planning and surveying. Many current studies combine Convolutional Neural Networks (CNNs), which extract local information, with Transformers, which capture global information, to obtain richer features. However, the fused features are often not sufficiently rich and lack detailed refinement. To address this issue, we propose a novel method called the Multi-View Feature Fusion and Rich Information Refinement Network (MFRNet). Our model is equipped with the Multi-View Feature Fusion Block (MAFF) to merge various types of information, including local, non-local, channel, and positional information. Within MAFF, we introduce two novel components: the Sliding Heterogeneous Multi-Head Attention (SHMA), which extracts local, non-local, and positional information using a sliding window, and the Multi-Scale Hierarchical Compressed Channel Attention (MSCA), which leverages bar-shaped pooling kernels and stepwise compression to obtain reliable channel information. Additionally, we introduce the Efficient Feature Refinement Module (EFRM), which improves segmentation accuracy by enabling interaction between the results of a Long-Range Information Perception Branch and a Local Semantic Information Perception Branch. We evaluate our model on the ISPRS Vaihingen and Potsdam datasets; extensive comparisons with state-of-the-art models verify that MFRNet outperforms them.

Graphical Abstract

1. Introduction

The applications of remote sensing semantic segmentation are extensive, encompassing both urban and rural areas, and playing a vital role in various scenarios such as road construction [1], residential area planning [2], land use surveys [3], building extractions [4], agricultural land mapping [5], and crop planning [6]. The fundamental task of semantic segmentation is to assign a category label to every pixel in the entire remote sensing image. However, remote sensing images often exhibit significant intra-class variation and inter-class similarity, posing a substantial challenge for accurate segmentation, as shown in Figure 1. Consequently, achieving high-resolution remote sensing image segmentation remains a demanding endeavor. In recent years, the advancement of deep learning and Convolutional Neural Networks (CNNs) has sparked a surge of research on CNN-based semantic segmentation. This has resulted in a notable improvement in the field of remote sensing semantic segmentation. Fully Convolutional Networks (FCNs) [7] have played a crucial role in driving researchers to explore the application of semantic segmentation in deep learning. FCNs employ a fully convolutional network structure to extract local information effectively. Additionally, the introduction of UNet [8] has further enhanced feature integration through the use of skip connections.
However, the limitations of convolutional operations make modeling non-local information challenging [9], particularly overlooking the role of channel information at the semantic level. As a result, researchers have turned their attention to channel attention mechanisms [10,11]. The introduction of channel attention enabled semantic segmentation networks to effectively utilize features in the channel dimension. Squeeze-and-Excitation Networks (SENet) [12] introduced a channel attention mechanism based on squeeze–excitation. The squeeze module primarily utilizes global average pooling (GAP) to compress channels, followed by operations such as fully connected layers in the excitation module to learn the importance of each channel. EncNet [13] utilizes the context encoding module (CEM) [13] and combines it with semantic encoding loss (SE-loss) [13] to establish the relationship between context and classification categories. This approach effectively utilizes global information for semantic segmentation. However, these approaches do not fully address the information loss introduced by GAP, and their compression methods do not consider the impact of local information on channel-dimensional features.
To address the information loss caused by existing channel attention mechanisms that rely on GAP, as well as their limited focus on local information, we introduce the MSCA. MSCA utilizes rectangular pooling kernels to extract local information. In the first compression step, the feature map is compressed into a single column and a single row, and a matrix multiplication of the two resulting features captures the correlations between them. A second compression step then reduces the result to a shape comparable to that produced by GAP, but one that retains more information, fulfilling our intended purpose.
The Transformer performs self-attention on sequences, allowing it to model global information. It excels at capturing the semantic structure in images, making it well suited for applications in the field of semantic segmentation. In computer vision, the Vision Transformer (ViT) [14,15,16,17] was introduced and applied. ViT divides 2D images into smaller patches [18], incorporates positional encodings, and utilizes these operations to extract global information and compute global correlations. These methods enabled ViT to attain cutting-edge performance during that period. However, despite overcoming the limitations of Convolutional Neural Networks, ViT lacked local information. To address this limitation, hybrid models combining CNNs and Transformers have been developed. GLOTS [19] and CMTFNet [20] are two models that use different architectures for their encoder and decoder. GLOTS makes use of a Transformer for the encoding and a CNN for the decoding. It introduces a feature separation–aggregation module (FSAM) [19] to aggregate features and a global–local attention module (GLAM) [19] that consists of a global attention block (GAB) [19], local attention block (LAB) [19], and shifted attention block (SAB) [19]. These modules allow the model to capture and fuse both global and local features. On the other hand, CMTFNet takes the opposite approach by using a CNN as the encoder and a Transformer as the decoder. It incorporates a multiscale multi-head self-attention (M2SA) [20] mechanism to extract multi-scale global information and channel information. Additionally, it employs an efficient feed-forward network (E-FFN) [20] to fuse information obtained using convolutional kernels of different sizes. However, we believe that simply fusing local and global information is not enough to effectively represent and utilize the extracted information. Further refinement is deemed necessary.
For the task of semantic segmentation of remote sensing images, we believe that the fusion of different types of features is necessary to address the highly complex nature of these images. Relying on a single type of feature can result in insufficient richness in the features learned by the model, making it difficult to understand the intra-class and inter-class relationships within remote sensing images. This often leads to poor segmentation performance. Additionally, when the model learns a sufficiently large number of features, feature redundancy can cause the model to overly focus on specific contours of the classes. Therefore, a feature refinement module is needed to help the model organize the abundant features, making them more accurate. The fusion module and refinement module work together synergistically.
LSRFormer [21] proposes the use of multi-scale features and global features applied to semantic segmentation of remote sensing images. By enhancing the interaction of different types of features, it achieves a more comprehensive feature set, thus improving segmentation accuracy. However, the current model lacks rich spatial features and contains fewer reliable channel features. Based on the effectiveness of this method of extracting multiple features in the field of semantic segmentation, we designed MAFF. Efficient Multi-Scale Attention (EMA) [22] uses cross-space learning methods and multi-scale parallel sub-networks to establish short-term and long-term dependencies, thereby integrating features. However, both pooling and convolution focus on local information and neglect the impact of global information on model refinement. Based on this idea, we designed EFRM.
To address the lack of feature refinement in current networks that combine CNNs and Transformers, we propose the EFRM. EFRM extracts long-range information through a Long-Range Information Perception Branch composed of self-attention, and local semantic information through a convolutional Local Semantic Information Perception Branch. By computing branch weight scores, multiplying the weights with the features, and merging the features, EFRM achieves feature refinement. When processing remote sensing datasets, placing EFRM at the end of the model enables effective feature refinement, addressing both intra-class variation and inter-class similarity.
In addition, we have designed the MAFF. MAFF deeply fuses multi-view features, aiding the model in learning a diverse range of features and making it more suitable for the complex environments found in remote sensing images. MAFF is a Transformer-type module that employs a Convolutional Multilayer Perceptron (Conv-MLP) for feature fusion and incorporates the MSCA for extracting reliable channel information and the SHMA for extracting local, non-local, and positional information. SHMA employs a self-attention mechanism with a sliding window, enabling the extraction of long-range information through self-attention while obtaining rich positional information by restoring the size of the sliding window. Therefore, MAFF can extract and fuse features from multiple perspectives, assisting the model in learning more information to address the challenges posed by the complexity of remote sensing images, particularly in tackling segmentation difficulties.
The contributions of our paper can be summarized as follows:
  • We propose a network designed for semantic segmentation of remote sensing images, named the Multi-View Feature Fusion and Rich Information Refinement Network (MFRNet). MFRNet utilizes MAFF to extract features from multiple views and uses a Conv-MLP for comprehensive feature fusion. Feature refinement is achieved through EFRM. Our designed decoder can be applied to backbone networks of different depths and types. Additionally, it can address the shortcomings of CNN and Transformer in extracting comprehensive information.
  • To extract and fuse features from multiple views, we introduce MAFF, a Transformer-type module incorporating the MSCA for reliable channel information and the SHMA for extracting local, non-local, and positional information. Finally, we employ a Conv-MLP to scale the features in the channel dimension, achieving deep feature fusion.
  • We creatively introduce EFRM to facilitate the interaction between long-range information and local semantic information. By obtaining refinement weights, rich features can be refined, addressing the overly uniform attention across categories caused by feature redundancy. This helps the model distinguish similar categories in optical remote sensing datasets.
The main research content of this article is elaborated in the subsequent sections. In Section 2, we discuss the current research status of semantic segmentation and its applications to high-resolution remote sensing images. Section 3 provides a detailed explanation of our network model and its components, along with the rationale behind each module. In Section 4, we describe our experimental setup, introduce the datasets used, conduct ablation experiments to demonstrate the positive impact of each component on the model, and compare our model with current state-of-the-art methods to prove its competitiveness. Finally, Section 5 summarizes our contributions and provides future prospects.

2. Related Work

2.1. Semantic Segmentation Based on CNN

2.1.1. CNN Semantic Segmentation Network

Over the past few years, deep learning methods have gained prominence in remote sensing semantic segmentation applications, thanks to the advancement of computer vision. CNNs [23,24,25] have shown an impressive ability to accurately perceive local information and have become the most widely used deep learning networks. A group of researchers introduced the Fully Convolutional Network (FCN), which consisted of a fully convolutional part and a deconvolutional part. At that time, FCN achieved the best semantic segmentation results. However, the encoder they designed led to less fine-grained results and a lack of sensitivity to details, as it did not consider the spatial relationships between pixels. Following that, another team of researchers proposed the UNet architecture, originally developed for medical segmentation. UNet comprises an encoder and a decoder and incorporates features from different scales through skip connections. As the network deepens, the receptive field expands, and the features contain rich semantic information. UNet has proven to be highly effective for high-resolution remote sensing image semantic segmentation. However, the UNet architecture suffers from information loss when modeling global information. To address this issue, a team of researchers introduced PSPNet [26]. PSPNet primarily utilizes the Pyramid Pooling Module to integrate context from different regions, effectively capturing global context. This pixel-level structure yielded excellent results at the time. Despite the notable accomplishments of these networks, they exhibit poor performance when applied to high-resolution images in remote sensing semantic segmentation. This is attributed to the limited global perception capabilities of convolutional networks.

2.1.2. Attention Mechanism in CNN

To address the limited perception problem in Convolutional Neural Networks, researchers have focused on attention mechanisms. Squeeze-and-Excitation Networks (SENet) introduced a channel attention mechanism. This mechanism utilizes global average pooling to compress features into a C × 1 × 1 tensor and activates the channels, highlighting the importance of channel dimensions. SPNet [27] introduced the Strip Pooling Module (SPM) [27] which uses stripe-shaped kernels to integrate both global and local features into the feature map. This integration leads to more accurate segmentation results. Convolutional Block Attention Module (CBAM) [28] has also received significant attention from researchers. CBAM combines channel attention and spatial attention, allowing the fusion of channel and spatial features, thereby advancing the development of attention mechanisms. The Multiattention Network (MANet) [29] utilizes a multi-scale strategy to hierarchically aggregate relevant contextual features. It introduces a Kernel Attention Mechanism to address the significant computational demand of attention modules. The Attentive Bilateral Contextual Network (ABCNet) [30] includes an Attention Enhancement Module (AEM) [30] to explore long-range contextual information, and a Feature Aggregation Module (FAM) [30] for precise fusion of attention mechanisms. However, attention modules dominated by CNNs tend to overly rely on convolution, which limits the performance of global features in remote sensing image segmentation.
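For reference, the squeeze–excitation channel attention described above can be sketched as follows; the reduction ratio of 16 is the commonly used default rather than a value taken from the cited works, and the class name is illustrative.

```python
# A minimal SENet-style channel attention sketch: squeeze with GAP, excite with
# two fully connected layers, then reweight the input channels.
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: compress to C x 1 x 1
        self.fc = nn.Sequential(                       # excitation: learn per-channel importance
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).reshape(b, c)).reshape(b, c, 1, 1)
        return x * w                                   # reweight channels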

2.2. Transformer-Based Semantic Segmentation

2.2.1. General Semantic Segmentation

The development of Transformer in computer vision has led to impressive applications in image segmentation. TransUNet [31] combines CNNs and Transformers in a cascading manner. Skip connections are utilized to incorporate outputs from different stages, allowing for long-range contextual modeling and improved segmentation accuracy. SegFormer [32] integrates Transformers with a lightweight Multi-Layer Perceptron (MLP) [32] decoder. It introduces a novel hierarchical structure in the Transformer encoder, which produces multi-scale features without the need for position encodings. This approach avoids interpolation issues when testing at different resolutions than training and helps maintain performance. The proposed Conv-MLP decoder effectively combines local and global attention, resulting in strong representational capabilities. Vision Transformer with Bi-Level Routing Attention (BiFormer) [33] addresses concerns regarding memory usage and computational cost in traditional Transformer architectures. It introduces dynamic sparse attention through bi-level routing, filtering out irrelevant key-value pairs in coarse regions. This approach reduces memory usage and saves computational resources, while still achieving competitive performance within a certain range. However, when working with remote sensing images, researchers must find an equilibrium between the model’s effectiveness and computational costs.

2.2.2. Remote Sensing Semantic Segmentation

UNetFormer [34] incorporates the Global–local Transformer block (GLTB) [34], a dual-branch attention mechanism that effectively captures both global and local information, leading to remarkable segmentation performance. Densely Connected Swin (DC-Swin) [35] employs a Transformer as a feature extractor to extract multi-scale global contextual features. These features are then fused using the Densely Connected Feature Aggregation Module (DCFM) [35] to obtain multi-scale semantic features that are enhanced by semantic relationships. Mixed-Mask Transformer (MMT) [36] introduces mask classification into remote sensing images and develops a Transformer network in a foreground–background balanced manner. It introduces a Multiscale Learning Strategy (MSL) [36] with cyclic shifting to effectively learn large-scale features in a Transformer decoder, while maintaining acceptable computational complexity. This results in finer segmentation outcomes. Currently, there is a growing interest among researchers in exploring the combination of semantic segmentation in remote sensing imagery and Transformers.

3. Methodology

In this section, we present a comprehensive overview of the architecture and component modules of MFRNet. We start by introducing the overall structure of our network. Subsequently, we present the MAFF along with the EFRM. Moreover, within the MAFF, we provide a detailed explanation of the SHMA and the MSCA.

3.1. Overall Architecture of the Model

To extract comprehensive and rich information, and to refine it to improve segmentation accuracy, we have developed MFRNet. The architecture of our network comprises an encoder and a decoder, as depicted in Figure 2. The encoder is primarily based on ResNet18 and Swin-S, which serve as the feature extractors. The decoder employs our proposed MAFF. Subsequently, the resulting features are passed through the EFRM for integration and refinement. Finally, the processed features are used for semantic segmentation by the Semantic Segmentation Head (SSH), as illustrated in Figure 3b.
In our study, we begin with a three-channel RGB image consisting of red, green, and blue channels. We use ResNet18 and Swin-S as feature extraction networks, which provide four layers of features labeled as $R_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, with ResNet: $C_i \in \{64, 128, 256, 512\}$, Swin: $C_i \in \{96, 192, 384, 768\}$, $H_i = W_i \in \{256, 128, 64, 32\}$, and $i \in \{1, 2, 3, 4\}$. The output of each decoder layer is denoted as $X_i \in \mathbb{R}^{C \times H_i \times W_i}$, where $C = 256$. Due to the intricate nature of information in optical remote sensing images and the crucial importance of feature details for each pixel in segmentation tasks, we adopt a UNet-like architecture. This architecture utilizes skip connections to integrate the encoder's feature extraction results with the decoder's processing. Specifically, we upsample the features of $X_i$ and restore them through convolution to obtain $X_i^{up}$. We align the channel dimension of $R_{i+1}$ with $X_i$ through convolution, resulting in $R_{i+1}'$. After concatenating $X_i^{up}$ and $R_{i+1}'$, we employ convolution to fuse the information and reduce the channel dimension to $C$. The obtained output serves as the input for the next decoder iteration. When $X_3$ is obtained, it contains rich semantic information but lacks shape information due to the network's depth. As shallow layers in ResNet are known to contain abundant shape information, we directly feed the processed result of the skip connection between $X_3$ and $R_4$ into the Efficient Feature Refinement Module (EFRM). Once we obtain the final feature information, it is passed through the Semantic Segmentation Head (SSH), producing an output $OUT \in \mathbb{R}^{classes \times H \times W}$. This output represents the segmentation results of our network. In the following sections, we provide detailed explanations of each component within the network.
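The skip-connection fusion described above can be summarized in a short sketch. The module name (SkipFusion) and the layer choices (bilinear upsampling, 1 × 1 alignment, 3 × 3 fusion convolution) are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the UNet-like skip connection: upsample X_i and restore it by
# convolution (X_i^up), align R_{i+1} to C channels (R'_{i+1}), concatenate, and
# fuse back down to C channels for the next decoder iteration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    def __init__(self, enc_channels, dec_channels=256):
        super().__init__()
        self.restore = nn.Conv2d(dec_channels, dec_channels, 3, padding=1)  # restore after upsampling
        self.align = nn.Conv2d(enc_channels, dec_channels, 1)               # R_{i+1} -> R'_{i+1}
        self.fuse = nn.Sequential(                                           # fuse and reduce to C
            nn.Conv2d(2 * dec_channels, dec_channels, 3, padding=1),
            nn.BatchNorm2d(dec_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x_i, r_next):
        x_up = self.restore(F.interpolate(x_i, size=r_next.shape[-2:],
                                          mode="bilinear", align_corners=False))
        return self.fuse(torch.cat([x_up, self.align(r_next)], dim=1))
```

Each call consumes the previous decoder output and the corresponding encoder feature and returns the input for the next decoder stage.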

3.2. Encoder

In the encoder section, to verify the efficacy of the decoder we designed across CNN and Transformer-like models, we employed two backbone networks: ResNet18 and Swin Transformer (SwinT) [37]. The model utilizing ResNet18 as its backbone is named MFRNet-R, while the model utilizing SwinT as its backbone is named MFRNet-S.

3.3. Multi-View Feature Fusion Block

The proposal of MAFF aims to address the issue of incomplete information learning in current models, which often include only global and local information, frequently neglecting the roles of channel information and element position information. While some models may incorporate channel information, the use of global average pooling often results in information loss, making channel features unreliable. Additionally, the insufficient fusion of rich features contributes to suboptimal segmentation results. Furthermore, remote sensing images, in comparison to other types of images, present increased complexity. We believe that when applying semantic segmentation to remote sensing images, a more comprehensive set of information is required to assist the model in learning features effectively. The MAFF is structured as shown in Figure 3a. This module is composed of the SHMA, the MSCA, and a Conv-MLP. SHMA and MSCA are modules designed for multi-view feature extraction in our approach. MAFF is a decoder of the Transformer type, primarily based on this architecture’s ability to introduce residuals to prevent information loss. Finally, Conv-MLP is used to scale features along the channels to achieve deep feature fusion. In Conv-MLP, we use convolutional layers instead of fully connected layers to reduce the number of parameters and bring richer local information. This ensures that the model learns features comprehensively and facilitates a more thorough fusion, providing reliable features for subsequent refinement processes.
In order to extract and integrate information from various perspectives, such as local, non-local, channel, and positional information, for the feature extraction part, we creatively introduced two modules, SHMA and MSCA. SHMA is capable of obtaining features using a sliding window, followed by shape restoration to obtain features containing local and positional information. Additionally, due to SHMA’s use of self-attention mechanisms, it effectively extracts long-range information and integrates positional information. MSCA utilizes pooling kernels of multiple scales to obtain features with local information, and it employs a stepwise approach to obtain channel weights, emphasizing channel information to enhance the reliability of channel features.
The process of multi-view feature extraction and feature fusion is outlined as follows:
$$M' = H_{BN}(M_I)$$
$$M'' = M' + H_{SC}\big(H_{SHMA}(M') + H_{MSCA}(M')\big)$$
where $M_I$ represents the input to MAFF, $M'$ represents the output of the Batch Norm layer, $M''$ represents the output of the first residual, and $H_{BN}$, $H_{SHMA}$, $H_{MSCA}$, and $H_{SC}$ represent the Batch Norm layer, the SHMA layer, the MSCA layer, and the separable convolution layer, respectively.
After combining the features from both modules, the integration of SHMA and MSCA features is achieved through a separable convolutional layer with a kernel size equal to the sliding window size, and a Conv-MLP is used to scale the features in the channel dimension, achieving comprehensive integration. Through the entire processing of MAFF, we obtain features that are rich in information and fully integrated. The specific process formula is as follows:
$$M_O = M'' + H_{BN}\big(H_{Conv\text{-}MLP}(M'')\big)$$
where $H_{Conv\text{-}MLP}$ represents the Conv-MLP layer, and $M_O$ represents the overall output of MAFF. Next, we will specifically introduce SHMA and MSCA, which play a key role in MAFF.
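Putting the two equations together, MAFF can be sketched as a Transformer-style block. The SHMA and MSCA sub-modules are passed in (they are sketched in the following subsections), and the Conv-MLP layout (1 × 1 expansion, 3 × 3 depthwise convolution, GELU, 1 × 1 reduction) is an assumption made for illustration.

```python
# Minimal sketch of the MAFF block following M' = H_BN(M_I),
# M'' = M' + H_SC(H_SHMA(M') + H_MSCA(M')), and M_O = M'' + H_BN(H_Conv-MLP(M'')).
import torch.nn as nn

class MAFF(nn.Module):
    def __init__(self, channels, shma: nn.Module, msca: nn.Module, window_size=8):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(channels)                 # H_BN
        self.shma, self.msca = shma, msca                     # H_SHMA, H_MSCA
        self.sep_conv = nn.Sequential(                        # H_SC: separable conv, kernel = window size
            nn.Conv2d(channels, channels, window_size, padding="same", groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.conv_mlp = nn.Sequential(                        # H_Conv-MLP (layout assumed)
            nn.Conv2d(channels, 4 * channels, 1),
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1, groups=4 * channels),
            nn.GELU(),
            nn.Conv2d(4 * channels, channels, 1),
        )
        self.norm2 = nn.BatchNorm2d(channels)

    def forward(self, m_in):
        m1 = self.norm1(m_in)                                    # M'
        m2 = m1 + self.sep_conv(self.shma(m1) + self.msca(m1))   # M''
        return m2 + self.norm2(self.conv_mlp(m2))                # M_O
```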
We validate the feasibility of our method through a heatmap, as shown in Figure 4. In semantic segmentation, classifying all elements is crucial. In the heatmap, it is evident that before applying MAFF, the model focuses only on a specific region. However, after MAFF processing, the model emphasizes a broader area. This demonstrates that our approach has a positive impact on extracting features from multiple perspectives and deep fusion of features. The comprehensive information brought by MAFF is particularly beneficial for semantic segmentation.

3.3.1. Sliding Heterogeneous Multi-Head Attention

In optical remote sensing datasets, there is a significant amount of information pertaining to objects of the same category but located at different positions. To improve the model’s capability in recognizing the same element at different positions, we propose performing feature expansion during the processing stage to achieve data augmentation. This approach also helps introduce long-range information and local details to enrich the features. To accomplish this, we have designed SHMA, as illustrated in Figure 5.
The overall design concept of SHMA involves leveraging a sliding window to enrich positional information and employing modules similar to self-attention for element-wise correlation calculations. In the sliding window (SW) section, we use a fixed-size window to obtain a feature map whose element values are unchanged. Because the windows overlap, restoring the feature size extends the feature map and thereby achieves data augmentation. During the restoration process, the same element appears at different positions; only the part of the first window that does not overlap with the other windows contains elements that appear at a single position. For clarity, Figure 5 provides two examples, marking elements that cannot gain extra positions as category B and those that can as category A. The proportion of category B elements in the entire feature map is negligible.
Feature enhancement and extraction is implemented as follows:
$$Q, K, V = H_{conv}(S_I)$$
$$Q' = H_{RS}(Q), \quad K' = H_{TRP}\big(H_{RS}(H_{SW}(K))\big), \quad V' = H_{RS}\big(H_{SW}(V)\big)$$
where $S_I$ denotes the input to SHMA, $H_{conv}$ represents the convolution operation, $H_{RS}$ indicates the reshape operation, $H_{TRP}$ represents the transpose operation, and $H_{SW}$ signifies the sliding window module. Furthermore, $Q'$ and $K'$ correspond to the inputs of the first correlation calculation, while $V'$ corresponds to the input of the second correlation calculation.
In the stage where self-attention calculates correlations through matrix multiplication, we perform a simple spatial transformation operation on Q. The purpose is to ensure the use of relatively raw information to evaluate positional and long-range information, making the obtained information reliable. We apply a sliding window operation to K and V to obtain more accurate information after two rounds of correlation calculations. Additionally, by leveraging the properties of matrix multiplication, we directly obtain the original size instead of employing other spatial dimension reduction methods that may impact the resulting features. Furthermore, we employ a multi-head mechanism to simultaneously focus on different locations, thereby enhancing the model’s expressive capability. Through the processing of features using SHMA, we achieve data augmentation during the module processing stage. The resulting features possess enriched positional information, long-range information, and local details.
The correlation calculation is implemented as follows:
$$D = Q' \times K'$$
$$S_O = V' \times H_{STM}(D)$$
where $H_{STM}$ represents the softmax activation function, and $D$ corresponds to the input of the second correlation calculation. Ultimately, $S_O$ represents the output of the entire SHMA process.
In SHMA, our window size is set to 8. We describe the window size ablation experiments in detail in Section 4.
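A minimal sketch of SHMA under this description is given below: queries keep the full resolution, while keys and values pass through an overlapping sliding window (implemented here with torch.nn.functional.unfold and a stride of half the window). The head count, the stride, and the standard attention scaling factor are assumptions not stated in the text.

```python
# Minimal sketch of SHMA: Q, K, V = H_conv(S_I); K and V go through the sliding
# window (unfold) and are reshaped; D = Q' x K'; S_O = V' x softmax(D).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SHMA(nn.Module):
    def __init__(self, channels, window=8, heads=8):
        super().__init__()
        self.window, self.heads = window, heads
        self.qkv = nn.Conv2d(channels, 3 * channels, 1)            # H_conv
        self.scale = (channels // heads) ** -0.5                   # standard scaling (assumed)

    def forward(self, x):
        b, c, h, w = x.shape
        d = c // self.heads
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Q': reshape only, keeping relatively raw information at full resolution
        q = q.reshape(b, self.heads, d, h * w).transpose(-2, -1)       # (b, heads, N, d)
        # K', V': overlapping sliding windows enlarge the token set
        k = F.unfold(k, self.window, stride=self.window // 2)          # (b, c*win*win, L)
        v = F.unfold(v, self.window, stride=self.window // 2)
        k = k.reshape(b, self.heads, d, -1)                            # (b, heads, d, M)
        v = v.reshape(b, self.heads, d, -1).transpose(-2, -1)          # (b, heads, M, d)
        attn = torch.softmax(q @ k * self.scale, dim=-1)               # softmax(D)
        return (attn @ v).transpose(-2, -1).reshape(b, c, h, w)        # S_O at input size
```

Because the matrix multiplication contracts over the windowed dimension, the output returns to the input resolution without an explicit un-windowing step, which matches the description above.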

3.3.2. Multi-Scale Hierarchical Compressed Channel Attention

To address the issue of information loss in Global Average Pooling (GAP) within channel attention, and to introduce both local and non-local information for more accurate channel weight computation, we designed the Multi-Scale Hierarchical Compressed Channel Attention (MSCA), as illustrated in Figure 6. Convolutions at different scales are first used to obtain features with local information. In the initial compression process, we employ two types of rectangular average pooling kernels to compress the feature maps into a single row and a single column. The whole-row and whole-column pooling kernels help capture long-range feature similarity, aiding more accurate channel weight calculation. In the final compression process, the compressed features undergo matrix multiplication to establish correlations between elements and are further compressed to a size of $C \times 1 \times 1$ to obtain the channel weights. Compared with previous enhancements of channel attention, this stepwise compression in the spatial dimensions provides richer feature information for obtaining the channel weights.
The initial compression process of MSCA is as follows:
$$I_{PI} = H_{C3}\big(H_{C1}(M_{CI})\big)$$
$$W_n = H_{XP}(I_{PI}), \quad H_n = H_{YP}(I_{PI})$$
where $M_{CI}$ represents the input to MSCA, $H_{C1}$ and $H_{C3}$ represent the $1 \times 1$ and $3 \times 3$ convolutions, $I_{PI}$ represents the input to the initial compression process, $H_{XP}$ and $H_{YP}$ represent the average pooling layers applied to entire rows and columns, and $W_n$ and $H_n$ represent the results of the initial compression process.
In MSCA, we utilize four different sizes of pooling kernels for diverse-scale feature extraction, concatenating the results along the channel dimension. The pooling kernels have sizes $H \times N_i$ with $N_i \in \{\frac{H}{2}, \frac{H}{4}, \frac{H}{8}, 1\}$, $i \in \{1, 2, 3, 4\}$, and $M_j \times W$ with $M_j \in \{\frac{W}{2}, \frac{W}{4}, \frac{W}{8}, 1\}$, $j \in \{1, 2, 3, 4\}$, specially designed for the non-uniform distribution of the same category in remote sensing datasets. Through these various-sized pooling kernels, different local information can be perceived, enriching the feature information. After channel dimension reduction, a sigmoid activation function is applied to scale the compressed features, which are then used to weight the input to MSCA, resulting in features that focus more on the channels. Because of the pooling layers, some information loss is inevitable; to mitigate this drawback, we introduce residual connections.
The final compression and post-compression handling processes of MSCA are as follows:
$$F_{PO} = H_{Cat}\big(\{W_n \times H_n\}_{n=1}^{4}\big)$$
$$M_{CO} = \sigma\big(H_{SC}(F_{PO})\big) \times I_{PI} + I_{PI}$$
where $H_{Cat}$ represents the concatenation operation, $F_{PO}$ represents the output of the final compression process, $\sigma$ represents the sigmoid activation function, and $M_{CO}$ represents the output of MSCA.
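The MSCA computation can be sketched as follows. The strip-pooling output sizes, the use of adaptive pooling, and the reduction to $C \times 1 \times 1$ before the 1 × 1 channel-reduction convolution are assumptions made for illustration; Figure 6 defines the exact layout.

```python
# Minimal sketch of MSCA: I_PI = H_C3(H_C1(input)); strip pooling gives W_n, H_n;
# F_PO = Cat({W_n x H_n}); M_CO = sigmoid(H_SC(F_PO)) * I_PI + I_PI.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCA(nn.Module):
    def __init__(self, channels, scales=(1, 2, 4, 8)):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, 1),             # H_C1
                                 nn.Conv2d(channels, channels, 3, padding=1))  # H_C3
        self.scales = scales
        self.reduce = nn.Conv2d(len(scales) * channels, channels, 1)           # channel reduction

    def forward(self, x):
        i_pi = self.pre(x)                                          # I_PI
        feats = []
        for s in self.scales:
            col = F.adaptive_avg_pool2d(i_pi, (s, 1))               # column strip (W_n)
            row = F.adaptive_avg_pool2d(i_pi, (1, s))               # row strip (H_n)
            feats.append(F.adaptive_avg_pool2d(col @ row, 1))       # correlate, then C x 1 x 1
        f_po = torch.cat(feats, dim=1)                              # H_Cat over the four scales
        weights = torch.sigmoid(self.reduce(f_po))                  # channel weights (B, C, 1, 1)
        return weights * i_pi + i_pi                                # M_CO with residual
```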
The feature extraction modules MSCA and SHMA in MAFF provide the network with abundant local, non-local, channel, and positional information. This addresses the challenge of inadequate and insufficiently rich information extraction in current models, given the complexity and richness of information categories in remote sensing images. By adopting a Transformer-type structure, we overcome the issue of insufficient feature fusion. Our MAFF demonstrates outstanding performance, as detailed in Table 1 in Section 4.

3.4. Efficient Feature Refinement Module

Optical remote sensing images inherently possess complex shapes, necessitating the thorough refinement of rich information to assist the model in accurately distinguishing between different categories. In the skip connections before the EFRM, the simple fusion of the rich shape information from the first-level output of the encoder with the decoder features containing abundant semantic information leads to a lack of interaction between the two types of information, which adversely affects segmentation accuracy. To address this limitation, we designed the EFRM to facilitate information interaction and promote feature refinement. Specific details are illustrated in Figure 7.
In the EFRM, we devised two branches: the Long-Range Information Perception Branch and the Local Semantic Information Perception Branch. This decision was based on the understanding that shape perception involves recognizing and understanding an object's outline; long-range information contributes to capturing the overall structure and spatial relationships of objects, while local information is crucial for semantic understanding. The Long-Range Information Perception Branch consists of a conventional self-attention module, while the Local Semantic Information Perception Branch is formed by convolution operations. Furthermore, we partition the input of the EFRM into 8 sub-features grouped along the channel dimension (the channel count is divisible by 8). The self-attention mechanism performs attention calculations within each channel group, aiding the model in better understanding the relationships between local and global features, thereby enhancing the receptive field. Channel grouping also assists the model in learning local features more effectively, as each channel group focuses solely on a subset of the input features, facilitating the extraction of local information.
The inputs for the two branches are implemented as follows:
$$q, k, v, l = H_{G}(E_{FI})$$
where $E_{FI}$ represents the input to EFRM, $H_{G}$ denotes feature grouping, and $q$, $k$, $v$, and $l$ represent the outputs of the grouped features.
Given that the EFRM is positioned at the end of the model, where the input already contains rich feature information, and its primary purpose is to deepen information interaction through the two branches to enhance segmentation accuracy, we opted for a relatively simple perception module. Specifically, we obtain the results of the Long-Range Information Perception Branch and the Local Semantic Information Perception Branch, applying global average pooling and softmax activation to obtain the weight scores for the two branches, defined as the Long-Range Information Score $f_1$ and the Semantic Information Score $f_2$.
$$\hat{q}, \hat{k}, \hat{v} = H_{conv}(q), H_{conv}(k), H_{conv}(v)$$
$$F_1 = H_{conv}\big(\hat{v} \times H_{STM}(\hat{q} \times \hat{k}^{T})\big)$$
$$f_1 = H_{STM}\big(H_{AP}(F_1)\big)$$
$$F_2 = H_{conv}(l)$$
$$f_2 = H_{STM}\big(H_{AP}(F_2)\big)$$
where $\hat{q}$, $\hat{k}$, and $\hat{v}$ represent the results after convolution for $q$, $k$, and $v$; $F_1$ represents the output of the Long-Range Information Perception Branch; $F_2$ represents the output of the Local Semantic Information Perception Branch; and $H_{AP}$ represents the global average pooling layer.
By using matrix multiplication, we cross-multiply the perception results with the weight scores, achieving the interaction between the two types of information. After the features are fused through this interaction, we employ a sigmoid function to obtain the interaction information score $f_3$. This score serves as the final refinement score, used to weight the features from the module input and obtain the refined feature map, and a residual connection is introduced to prevent information loss. Ultimately, refined features are obtained, enhancing the module's accuracy. The specific implementation process is as follows:
$$E_{FO} = E_{FI} + \sigma\big(f_1 \times F_2 + f_2 \times F_1\big) \times E_{FI}$$
where $E_{FO}$ represents the output of EFRM. EFRM facilitates the interaction of long-range and local semantic information on datasets for remote sensing semantic segmentation. Through the above formula, the overall design concept of our EFRM can be more clearly understood. By using a scoring mechanism to weight different branches, we achieve the goal of feature refinement.
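A minimal sketch of EFRM under these equations follows. The channel grouping into 8 groups, the per-group spatial self-attention used for the Long-Range Information Perception Branch, and the convolution sizes are illustrative assumptions.

```python
# Minimal sketch of EFRM: group channels (H_G), run a long-range (self-attention)
# branch and a local (convolution) branch, score both with GAP + softmax, then
# cross-weight, apply sigmoid (f_3), and add the residual: E_FO = E_FI + f_3 * E_FI.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EFRM(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        gc = channels // groups
        self.to_qkvl = nn.Conv2d(gc, 4 * gc, 1)             # produces q, k, v, l per group
        self.proj_attn = nn.Conv2d(gc, gc, 1)               # conv after the long-range branch
        self.proj_local = nn.Conv2d(gc, gc, 3, padding=1)   # local semantic branch, H_conv(l)

    def forward(self, x):
        b, c, h, w = x.shape
        g, gc = self.groups, c // self.groups
        xg = x.reshape(b * g, gc, h, w)                               # H_G: channel grouping
        q, k, v, l = self.to_qkvl(xg).chunk(4, dim=1)
        # Long-Range Information Perception Branch: per-group spatial self-attention
        qf = q.flatten(2).transpose(1, 2)                             # (bg, n, gc)
        kf = k.flatten(2)                                             # (bg, gc, n)
        vf = v.flatten(2).transpose(1, 2)                             # (bg, n, gc)
        attn = torch.softmax(qf @ kf, dim=-1)                         # softmax(q x k^T)
        f1_map = self.proj_attn((attn @ vf).transpose(1, 2).reshape(b * g, gc, h, w))
        # Local Semantic Information Perception Branch
        f2_map = self.proj_local(l)
        # branch weight scores via global average pooling + softmax
        s1 = torch.softmax(F.adaptive_avg_pool2d(f1_map, 1), dim=1)   # f_1
        s2 = torch.softmax(F.adaptive_avg_pool2d(f2_map, 1), dim=1)   # f_2
        # cross-weighting, sigmoid refinement score f_3, and residual connection
        refine = torch.sigmoid(s1 * f2_map + s2 * f1_map).reshape(b, c, h, w)
        return x + refine * x                                         # E_FO
```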
EFRM efficiently refines the features at the model's tail. These features, processed by EFRM, are obtained through a simple combination of shape information from the encoder's shallow layers and semantic information from the decoder. EFRM addresses challenges related to misclassifications in remote sensing images arising from the similarity between different categories and the differences within the same category. Considering that stacking multiple layers may lead to excessive processing, resulting in the loss of more shape information and a failure to achieve fine feature refinement, we opted for single-layer processing. Remarkably, employing just one layer of EFRM yielded impressive results. The detailed experimental outcomes are outlined in Table 2 of Section 4.
We validate the refinement of rich features by EFRM through the heatmap, as shown in Figure 4. Prior to EFRM processing, the model exhibited a stronger focus on categories like vehicles with fixed shapes. Due to feature redundancy and the high inter-class similarity in remote sensing images, other categories had a more balanced level of attention. After undergoing EFRM processing, the model actively focuses on key information of other categories as well, demonstrating that our designed EFRM can refine rich features and positively impact semantic segmentation tasks.

4. Experimental Comparison and Analysis

To assess the effectiveness of our suggested model and its components, we provide a detailed description of our experimental process in this section. This includes an introduction to the datasets, ablation experiments to check the individual effects of our two components on the model, and comparative studies with current state-of-the-art networks to assess the competitiveness of our model. Finally, we comprehensively analyze the outcomes of the experiments. The ablation and comparative experiments of the main modules were conducted on both MFRNet-R and MFRNet-S to verify that our designed decoder can be applied to backbone networks of different depths and types. Additionally, these experiments verified that the decoder can address the shortcomings of CNN and Transformer in extracting comprehensive information.

4.1. Datasets

The experiments in this study utilized the International Society of Photogrammetry and Remote Sensing (ISPRS) Vaihingen and ISPRS Potsdam datasets. All evaluations conducted in this paper are based on these two datasets. For quantitative analysis, all comparative and ablation experiments were assessed on the dataset that underwent identical preprocessing.

4.1.1. Vaihingen

The Vaihingen dataset consists of 33 highly detailed TOP images captured at a ground sampling distance (GSD) of 9 cm. These images have varying spatial resolutions, with an average of 2494 × 2064 pixels. The dataset also includes RGB images with three spectral bands, as well as digital surface models (DSMs) and normalized DSMs (NDSMs). It is divided into 6 classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. For our experiments, we selected the images with IDs 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, and 37 for testing. Image ID 31 was selected for validation, and the remaining 16 images were used for training. Additionally, we resized the RGB images to patches of 1024 × 1024 pixels, using both padding and cropping as necessary. Cropping used a stride of 512, and the images were horizontally and vertically flipped before cropping. The total number of processed images is 888 for the training set and 97 for the test set.

4.1.2. Potsdam

The Potsdam dataset consists of 38 highly detailed TOP images captured at a ground sampling distance (GSD) of 5 cm. These images have varying spatial resolutions, with an average resolution of 6000 × 6000 pixels. The dataset includes RGB images with three spectral bands, as well as digital surface models (DSMs) and normalized DSMs (NDSMs). It is categorized into 6 classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. For our experiments, we selected the images 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, and 7_13 for testing. Image ID 2_10 was used for validation, while the remaining 22 images (excluding image 7_10 due to erroneous annotations) were used for training. In addition, we resized the RGB images to patches of 1024 × 1024 pixels, using padding and cropping to meet the requirements of our experiments. Cropping used a stride of 512, and the images were horizontally and vertically flipped before cropping. The total number of processed images is 2592 for the training set and 504 for the test set.
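As a concrete illustration of the preprocessing described for both datasets, the sketch below tiles a large image into 1024 × 1024 patches with a stride of 512, padding the borders when needed. The helper name (tile) and the reflect padding mode are illustrative assumptions.

```python
# Minimal sketch of tiling a large TOP image into overlapping 1024 x 1024 patches
# with stride 512; flips are assumed to have been applied beforehand.
import numpy as np

def tile(image, patch=1024, stride=512):
    """image: (H, W, C) array; returns a list of (patch, patch, C) crops."""
    h, w = image.shape[:2]
    # pad so that (H - patch) and (W - patch) become multiples of the stride
    pad_h = (-(h - patch) % stride) if h > patch else patch - h
    pad_w = (-(w - patch) % stride) if w > patch else patch - w
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")
    crops = []
    for y in range(0, image.shape[0] - patch + 1, stride):
        for x in range(0, image.shape[1] - patch + 1, stride):
            crops.append(image[y:y + patch, x:x + patch])
    return crops
```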

4.2. Experimental Design

4.2.1. Implementation Details

We implemented the experimental models with the PyTorch framework and trained them on an NVIDIA RTX 3090 Ti GPU. To accelerate model convergence, we used the AdamW optimizer with a learning rate of $6 \times 10^{-4}$ and a weight decay of $2.5 \times 10^{-4}$, and a cosine annealing strategy was adopted to adjust the learning rate. During training, data augmentation techniques such as random vertical flips, random horizontal flips, and random rotations were applied to preprocess the dataset, and each augmentation method was applied only once. The experiments were configured with 150 epochs and a batch size of 8. For the testing phase, multi-scale augmentation was used with scaling factors of [0.75, 1.0, 1.25, 1.5], and horizontal flipping was also employed.
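A minimal sketch of this training configuration is shown below; model, train_loader, and criterion are placeholders for the network, the data loader with batch size 8, and the loss defined in the next subsection.

```python
# Minimal sketch of the optimizer/scheduler setup described above:
# AdamW (lr 6e-4, weight decay 2.5e-4) with cosine annealing over 150 epochs.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=2.5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(150):                      # 150 epochs, batch size 8
    for images, labels in train_loader:       # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # cosine-annealed learning rate
```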

4.2.2. Loss Function

During the training period, we employed a mixture of two loss functions to train the entire network: Cross-Entropy Loss L c e , which measures the difference between probability distributions, and Dice Loss L d i c e , which better preserves object boundary information. The overall loss function is represented by the following equation:
$$L_{ce} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} x_{k}^{n}\log \bar{x}_{k}^{n}$$
$$L_{dice} = 1 - \frac{2}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{x_{k}^{n}\,\bar{x}_{k}^{n}}{x_{k}^{n} + \bar{x}_{k}^{n}}$$
$$L_{our} = L_{ce} + L_{dice}$$
In this context, we employ the loss function denoted by $L_{our}$; $N$ represents the number of samples and $K$ represents the number of categories. $x^{n}$ and $\bar{x}^{n}$ denote the one-hot encoding of the true semantic labels and the corresponding softmax output of the network, where $n \in \{1, \dots, N\}$. $\bar{x}_{k}^{n}$ is the confidence of sample $n$ belonging to category $k$.
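The combined loss can be implemented compactly as sketched below. The per-pixel soft-Dice aggregation, the averaging over samples and classes, and the smoothing constant are standard choices assumed here rather than the authors' exact implementation.

```python
# Minimal sketch of L_our = L_ce + L_dice for semantic segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CEDiceLoss(nn.Module):
    def __init__(self, smooth=1e-6):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits, target):
        # logits: (N, K, H, W); target: (N, H, W) integer class labels in [0, K)
        ce = F.cross_entropy(logits, target)                           # L_ce
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))                      # per sample, per class
        denom = (probs + one_hot).sum(dim=(2, 3))
        dice = 1.0 - (2.0 * inter / (denom + self.smooth)).mean()      # soft Dice (L_dice)
        return ce + dice                                               # L_our
```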

4.2.3. Validation Metrics

To effectively evaluate the superiority of our model, we have selected three accuracy metrics: Overall Accuracy (OA), F1 score (F1), and Mean Intersection over Union (mIoU). These metrics enable us to evaluate the performance of our model in terms of correctly segmented representations, accuracy and recall in segmentation tasks, and overall segmentation result accuracy. The calculation methods for these metrics are as shown below:
$$OA = \frac{\sum_{cs} TP_{cs}}{\sum_{cs}\left(TP_{cs} + FP_{cs} + TN_{cs} + FN_{cs}\right)}$$
$$F1 = \frac{2 \times Precision_{cs} \times Recall_{cs}}{Precision_{cs} + Recall_{cs}}$$
$$mIoU = \frac{1}{CS}\sum_{cs=1}^{CS}\frac{TP_{cs}}{TP_{cs} + FP_{cs} + FN_{cs}}$$
where $TP_{cs}$, $FP_{cs}$, $TN_{cs}$, and $FN_{cs}$ represent the true positives, false positives, true negatives, and false negatives for class $cs$, and $CS$ is the number of classes.
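These metrics can be computed from a confusion matrix accumulated over the test set, as in the sketch below; the function name and the epsilon guard against empty classes are implementation assumptions.

```python
# Minimal sketch of OA, MeanF1, and mIoU computed from a class confusion matrix.
import numpy as np

def evaluate(conf):
    """conf: (CS, CS) confusion matrix, rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    oa = tp.sum() / conf.sum()                                   # Overall Accuracy
    precision = tp / np.maximum(tp + fp, 1e-10)
    recall = tp / np.maximum(tp + fn, 1e-10)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-10)
    iou = tp / np.maximum(tp + fp + fn, 1e-10)
    return oa, f1.mean(), iou.mean()                             # OA, MeanF1, mIoU
```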

4.3. Ablation Experiment

Experimental results not specifically stated in the Experimental Analysis section are from MFRNet-R.

4.3.1. Ablation Experiment Analysis of the Main Components

To gauge the efficiency of every proposed component, we carried out a sequence of ablation experiments on the Vaihingen dataset. To guarantee a fair comparison and impartially evaluate the performance of each component, we exclusively incorporated individual components into the baseline model at their corresponding positions. Our baseline models use ResNet18 and SwinT as the encoder, and we used the publicly available Swin-S model pre-trained on the ImageNet-22K dataset. The results of our ablation experiments are presented in Table 1. When the EFRM operates independently, we observe improvements in various metrics. When using MFRNet-R, F1 improves by 1.86%, mIoU improves by 2.99%, and OA improves by 1.50%. Similarly, when using MFRNet-S, F1 improves by 4.82%, mIoU improves by 6.68%, and OA improves by 0.66%. These findings indicate that incorporating the feature refinement module at the end of the model significantly enhances its performance. Furthermore, when the MAFF operates independently, we observe even more substantial improvements. When using MFRNet-R, F1 improves by 2.62%, mIoU improves by 4.16%, and OA improves by 1.94%. Similarly, when using MFRNet-S, F1 improves by 6.04%, mIoU improves by 7.59%, and OA improves by 1.67%. The results clearly showcase the efficacy of the MAFF approach in enhancing model performance. By fusing local, global, and channel-based features into the model, we achieve significant improvements across various evaluation metrics.
Additionally, our designed modules achieve good results with only a slight increase in the number of parameters, as shown in Table 1. When using MFRNet-R, MAFF increases the parameter count by just 5.45 M over the baseline (ResNet18), resulting in a 2.62% improvement in the F1 score; notably, EFRM adds only 0.2 M parameters, yielding a 1.89% increase in the F1 score. When using MFRNet-S, MAFF increases the parameter count by just 8.55 M over the baseline (Swin-S), resulting in a 6.04% improvement in the F1 score; notably, EFRM adds only 3.62 M parameters, yielding a 4.82% increase in the F1 score. These data demonstrate that our modules achieve better segmentation accuracy with minimal additional computational burden. The differences in module parameter count and complexity are mainly due to differences in the number of channels of the baselines.
The ablation experiment results are presented in Figure 8. The segmentation results when the components are used individually clearly show that compared to the baseline, the outlines of the buildings are more complete. Furthermore, the model correctly identifies the vehicle categories that were misclassified by the baseline. Moreover, the segmentation results are more accurate when the two components work together.
We convert the neural network's feature maps into heatmaps by calculating the channel mean, performing interpolation, normalizing, and applying a color map. The heatmaps for each module are presented in Figure 4. Since semantic segmentation requires classifying every pixel, with no background left unlabeled, we provide global heatmaps to demonstrate the effectiveness of our modules. From the regions with higher attention in the heatmaps, it can be concluded that our modules achieve better global attention, proving that they have a positive impact on the semantic segmentation task. The heatmaps also clearly demonstrate that without the EFRM at the model's tail, the model overly focuses on vehicles with specific shapes. After incorporating the EFRM at the tail, the model enhances its attention to other contour types. This indicates that placing the EFRM at the end of the model can resolve issues of intra-class variation and inter-class similarity.

4.3.2. Subcomponent Ablation Experiments

We conducted ablation experiments on the two components of MAFF using the Vaihingen dataset, and the results are shown in Table 2. When each component is used individually, the performance is consistently better than the baseline, with improvements of 2.6% and 2.5% in mIoU, respectively. Importantly, their combined effect performs even better, showing a 4.16% improvement in mIoU. SHMA focuses on extracting long-range and positional information, while MSCA emphasizes local and channel information. This shows that fusing more diverse kinds of features has a positive impact on the results and verifies the feasibility of using multi-view features for semantic segmentation of remote sensing images.
In addition, we also conducted ablation experiments on the combination of individual components in MAFF with EFRM. When SHMA is combined with EFRM, compared to using SHMA alone, there is an improvement of 1.02% in F1 and 1.96% in mIoU. Similarly, when MSCA is combined with EFRM, compared to using MSCA alone, there is an improvement of 1.37% in F1 and 2.56% in mIoU. This is because the refinement of features for the entire model is positively influenced by adding EFRM at the end of the model. The experimental results indicate that our designed subcomponent modules have a positive impact on the entire model when split and combined.

4.3.3. Ablation Experiment Analysis of SHMA Window Size

To explore the impact of different window sizes on SHMA, we conducted corresponding ablation experiments on two datasets, as shown in Figure 9. The results indicate that the optimal performance is achieved when the window size is set to 8 × 8. This is because smaller windows are insensitive to non-local information, while larger windows are less sensitive to local details. Therefore, a moderate window size performs the best, balancing the advantages of both small and large windows in information extraction and avoiding additional information loss.

4.3.4. Internal Ablation Experiments on MSCA

To validate the reliability of our designed MSCA compared to using Global Average Pooling (GAP), we conducted relevant ablation experiments in the Vaihingen dataset, and the results are shown in Figure 10. MSCA outperformed GAP with an increase of 1.43% in mIoU, 0.86% in F1, and 0.87% in OA. By comparing these metrics, it is confirmed that the channel information obtained by MSCA is more reliable.

4.4. Comparative Experiments with Advanced Networks

Experimental results not specifically stated in the Experimental Analysis section are from MFRNet-R.
To quantitatively analyze the advantages and disadvantages of our approach in contrast to widely used methods, we conducted comparative experiments on the Vaihingen and Potsdam datasets. The results of these experiments are exhibited in Table 3 and Table 4. On the Vaihingen dataset, MFRNet-R outperformed other models with a 0.96% higher MeanF1, a 1.58% higher mIoU, and a 0.39% higher OA compared to the second-best model, and MFRNet-S outperformed other models with a 1.65% higher MeanF1, a 2.73% higher mIoU, and a 1.06% higher OA compared to the second-best model. These results demonstrate the effectiveness of our designed decoder, which can compensate for the shortcomings of feature extraction in CNN and Transformer models. Furthermore, these results prove that our decoder possesses strong generalization capabilities, capable of handling backbone networks of varying depths.
The segmentation results of our comparative experiments are illustrated in Figure 11 and Figure 12. In the Vaihingen dataset, our model can correctly identify vehicles with significantly different shapes from others and distinguish challenging architectural structures, demonstrating the competitiveness of our model in addressing both intra-class variations and high inter-class similarity in remote sensing images. In the Potsdam dataset, our model successfully segments occluded objects, showcasing its capability to handle complex environments in remote sensing images and outperforming other models.
Through the comparison of data and segmentation results, although many models can extract information from different categories, it is evident that they are not comprehensive enough for remote sensing semantic segmentation tasks. Our model brings more comprehensive information to assist in segmentation tasks, and experiments have proven the superiority of our strategy. Additionally, most other models do not have a feature refinement process, they simply apply complex features directly, which exacerbates the difficulty of distinguishing between categories with small inter-class differences in remote sensing images. Therefore, the introduction of the feature refinement module is also crucial for our model’s superiority over others.
From the perspective of model parameters and complexity, our model achieves superior performance at an acceptable cost. Compared to other models with similar metrics, such as SACANet [41] and DC-Swin [35], we outperform them in terms of parameter count and complexity. Taking these considerations into account, our model outperforms the others.
Finally, we present the training and validation accuracy plot of MFRNet on the Vaihingen dataset, as shown in Figure 13. Our model remains stable during the mid and late stages, indicating good generalization ability. The F1 scores of the validation set and the training set are close during these stages, suggesting that the model does not exhibit overfitting and demonstrates effective training performance. Ultimately, we attribute the superiority of our model over other network models to the following factors:
  • Our model utilizes a U-shaped structure to obtain features at multiple scales while preserving abundant shape information from the shallow layers of the encoder. The designed MAFF module can extract and fuse diverse feature information. Compared to other models, our model’s features are more comprehensive.
  • Compared with other models, we include EFRM at the tail to refine the rich features that have just fused shape information, by enabling interaction between long-range information and local semantic information. This prevents feature redundancy from leading to category discrimination errors.
  • Through comparative and ablation experiments, we can fully confirm that our designed decoder compensates for the shortcomings of feature extraction in CNN and Transformer models. Additionally, it achieves better segmentation results in deeper networks by utilizing feature refinement. This confirms our belief that multi-view feature extraction and feature refinement are necessary in the field of remote sensing image semantic segmentation.

5. Conclusions

In this article, we propose a network model named MFRNet for semantic segmentation tasks in remote sensing image scenes. We introduce the MAFF for extracting and fusing multi-view features and the EFRM for effectively refining rich features. In MAFF, we designed SHMA and MSCA to extract comprehensive and rich features. MAFF employs a Transformer-type structure for fusion. For the encoder, we selected ResNet18 and Swin Transformer to verify that our designed decoder can be applied to backbone networks of different depths and types. Exceptional performance is demonstrated through experiments conducted on the Vaihingen and Potsdam datasets. Ablation experiments showcase the robust capabilities of individual modules, and compared with the current state-of-the-art networks, our model surpasses others on various metrics, revealing the exceptional performance of our model. In the future, we hope to encourage more researchers to explore the potential of deep learning in high-resolution remote sensing image semantic segmentation.

Author Contributions

Conceptualization, S.C. and J.L.; methodology, S.C. and J.L.; software, J.L.; validation, J.L., S.C., and A.D.; resources, S.C.; data curation, A.D.; writing—original draft preparation, J.L.; visualization, J.L.; supervision, S.C.; project administration, A.D.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific and Technological Innovation 2030 Major Project under Grant 2022ZD0115800, the Basic Research Funds for Colleges and Universities in Xinjiang Uygur Autonomous Region under Grant XJEDU2023P008, the Key Laboratory Open Projects in Xinjiang Uygur Autonomous Region under Grant 2023D04028, and the Graduate Research and Innovation Project of Xinjiang Uygur Autonomous Region under Grant XJ2024G086.

Data Availability Statement

The datasets used in this study are publicly available; data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  2. Zhang, Z.; Liu, F.; Liu, C.; Tian, Q.; Qu, H. ACTNet: A dual-attention adapter with a CNN-transformer network for the semantic segmentation of remote sensing imagery. Remote Sens. 2023, 15, 2363. [Google Scholar] [CrossRef]
  3. Wang, S.; Huang, X.; Han, W.; Li, J.; Zhang, X.; Wang, L. Lithological mapping of geological remote sensing via adversarial semi-supervised segmentation network. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103536. [Google Scholar] [CrossRef]
  4. Yuan, M.; Ren, D.; Feng, Q.; Wang, Z.; Dong, Y.; Lu, F.; Wu, X. MCAFNet: A multiscale channel attention fusion network for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 361. [Google Scholar] [CrossRef]
  5. Chen, J.; Sahli, H.; Chen, J.; Wang, C.; He, D.; Yue, A. A hybrid land-use mapping approach based on multi-scale spatial context. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 771–774. [Google Scholar] [CrossRef]
  6. Xiong, X.; Wang, X.; Zhang, J.; Huang, B.; Du, R. TCUNet: A Lightweight Dual-Branch Parallel Network for Sea–Land Segmentation in Remote Sensing Images. Remote Sens. 2023, 15, 4413. [Google Scholar] [CrossRef]
  7. Sherrah, J. Fully Convolutional Networks for Dense Semantic Labelling of High-Resolution Aerial Imagery. arXiv 2016, arXiv:1606.02585. [Google Scholar]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  9. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–25 June 2021; pp. 13708–13717. [Google Scholar] [CrossRef]
  10. Su, Y.; Wu, Y.; Wang, M.; Wang, F.; Cheng, J. Semantic Segmentation of High Resolution Remote Sensing Image Based on Batch-Attention Mechanism. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3856–3859. [Google Scholar] [CrossRef]
  11. Long, W.; Zhang, Y.; Cui, Z.; Xu, Y.; Zhang, X. Threshold Attention Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600312. [Google Scholar] [CrossRef]
  12. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  13. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context Encoding for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar] [CrossRef]
  14. Cui, W.; Feng, Z.; Chen, J.; Xu, X.; Tian, Y.; Zhao, H.; Wang, C. Long-Tailed Effect Study in Remote Sensing Semantic Segmentation Based on Graph Kernel Principles. Remote Sens. 2024, 16, 1398. [Google Scholar] [CrossRef]
  15. Zhang, X.; Cheng, S.; Wang, L.; Li, H. Asymmetric Cross-Attention Hierarchical Network Based on CNN and Transformer for Bitemporal Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3245674. [Google Scholar] [CrossRef]
  16. Yang, Y.; Dong, J.; Wang, Y.; Yu, B.; Yang, Z. DMAU-Net: An Attention-Based Multiscale Max-Pooling Dense Network for the Semantic Segmentation in VHR Remote-Sensing Images. Remote Sens. 2023, 15, 1328. [Google Scholar] [CrossRef]
  17. Wang, J.; Li, F.; An, Y.; Zhang, X.; Sun, H. Towards Robust LiDAR-Camera Fusion in BEV Space via Mutual Deformable Attention and Temporal Aggregation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 5753–5764. [Google Scholar] [CrossRef]
  18. Wang, Q.; Jin, X.; Jiang, Q.; Wu, L.; Zhang, Y.; Zhou, W. DBCT-Net: A dual branch hybrid CNN-transformer network for remote sensing image fusion. Expert Syst. Appl. 2023, 233, 120829. [Google Scholar] [CrossRef]
  19. Liu, Y.; Zhang, Y.; Wang, Y.; Mei, S. Rethinking Transformers for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3302024. [Google Scholar] [CrossRef]
  20. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3314641. [Google Scholar] [CrossRef]
  21. Zhang, R.; Zhang, Q.; Zhang, G. LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3366709. [Google Scholar] [CrossRef]
  22. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  23. Li, R.; Liu, W.; Yang, L.; Sun, S.; Hu, W.; Zhang, F.; Li, W. DeepUNet: A Deep Fully Convolutional Network for Pixel-Level Sea-Land Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3954–3962. [Google Scholar] [CrossRef]
  24. Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 680–688. [Google Scholar] [CrossRef]
  25. Liu, Q.; Xiao, L.; Yang, J.; Wei, Z. CNN-Enhanced Graph Convolutional Network With Pixel- and Superpixel-Level Feature Fusion for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8657–8671. [Google Scholar] [CrossRef]
  26. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  27. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip Pooling: Rethinking Spatial Pooling for Scene Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4002–4011. [Google Scholar] [CrossRef]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3093977. [Google Scholar] [CrossRef]
  30. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  31. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  32. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Álvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar]
  33. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar] [CrossRef]
  34. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2021, 190, 196–214. [Google Scholar] [CrossRef]
  35. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3143368. [Google Scholar] [CrossRef]
  36. Xu, Z.; Geng, J.; Jiang, W. MMT: Mixed-Mask Transformer for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3289408. [Google Scholar] [CrossRef]
  37. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  38. Li, Y.; Chen, X.; Zhu, Z.; Xie, L.; Huang, G.; Du, D.; Wang, X. Attention-Guided Unified Network for Panoptic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7019–7028. [Google Scholar] [CrossRef]
  39. Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. LeViT: A Vision Transformer in ConvNet’s Clothing for Faster Inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12239–12249. [Google Scholar] [CrossRef]
  40. Li, R.; Wang, L.; Zhang, C.; Duan, C.; Zheng, S. A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Int. J. Remote Sens. 2022, 43, 1131–1155. [Google Scholar] [CrossRef]
  41. Ma, X.; Che, R.; Hong, T.; Ma, M.; Zhao, Z.; Feng, T.; Zhang, W. SACANet: Scene-aware class attention network for semantic segmentation of remote sensing images. In Proceedings of the IEEE International Conference Multimedia Expo. (ICME), Brisbane, Australia, 10–14 July 2023; pp. 828–833. [Google Scholar] [CrossRef]
  42. Ma, X.; Ma, M.; Hu, C.; Song, Z.; Zhao, Z.; Feng, T.; Zhang, W. Log-Can: Local-Global Class-Aware Network For Semantic Segmentation of Remote Sensing Images. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. Summary of the characteristics of remote sensing images. Different types of cars have different shapes, while RVs and containers have similar shapes.
Figure 2. The overall structure of MFRNet, containing a CNN encoder and our designed decoder MAFF and feature refinement module EFRM.
Figure 3. Specific design of MAFF and SSH. (a) Multi-View Feature Fusion Block (MAFF); (b) Semantic Segmentation Header (SSH).
Figure 4. Heatmap demonstration of Multi-View Feature Fusion Module (MAFF) and Efficient Feature Refinement Module (EFRM).
Figure 5. The specific implementation of the Sliding Heterogeneous Multi-Head Attention (SHMA).
Figure 6. The specific design method for Multi-Scale Hierarchical Compressed Channel Attention (MSCA).
Figure 7. Detailed structure of the Efficient Feature Refinement Module (EFRM).
Figure 8. Visual results of the ablation experiment; GT denotes ground truth.
Figure 9. Ablation results for SHMA window sizes on the Vaihingen and Potsdam datasets. The horizontal axis shows the window size; the vertical axis shows the metric values in percentages.
Figure 10. The ablation experiment results of MSCA compared to the use of Global Average Pooling (GAP).
Figure 11. Segmentation results on the Vaihingen dataset compared with other network models. The blue box in the RGB image corresponds to the red box in the segmentation results; GT denotes ground truth.
Figure 12. Segmentation results on the Potsdam dataset compared with other network models. The blue box in the RGB image corresponds to the red box in the segmentation results; GT denotes ground truth.
Figure 13. Validation and training accuracy graphs.
Table 1. The results of the component ablation experiments on the Vaihingen dataset, where bold represents the best values. The results are presented in percentages (%).
| Models | Component | Params (M) | FLOPs (G) | F1 | mIoU | OA |
| --- | --- | --- | --- | --- | --- | --- |
| MFRNet-R | Baseline (ResNet18) | 13.02 | 131.22 | 87.84 | 78.69 | 89.09 |
| | Baseline + EFRM | 13.22 | 138.32 | 89.73 | 81.68 | 90.59 |
| | Baseline + MAFF | 18.25 | 160.38 | 90.46 | 82.85 | 91.03 |
| | MFRNet | 18.47 | 167.48 | 91.45 | 84.51 | 91.97 |
| MFRNet-S | Baseline (Swin-S) | 45.91 | 68.95 | 85.06 | 75.18 | 89.75 |
| | Baseline + EFRM | 49.53 | 70.61 | 89.88 | 81.86 | 90.41 |
| | Baseline + MAFF | 54.46 | 76.27 | 91.10 | 82.77 | 91.42 |
| | MFRNet | 54.46 | 77.94 | 92.14 | 85.66 | 92.64 |
Table 2. The results of subcomponent ablation experiments on the Vaihingen dataset. The results are presented in percentages (%).
| Component | SHMA | MSCA | EFRM | F1 | mIoU | OA |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | 87.84 | 78.69 | 89.09 |
| Baseline + MAFF (SHMA) | ✓ | | | 89.79 | 81.29 | 90.31 |
| Baseline + MAFF (MSCA) | | ✓ | | 89.49 | 81.28 | 90.21 |
| Baseline + EFRM | | | ✓ | 89.73 | 81.68 | 90.59 |
| Baseline + MAFF (SHMA + MSCA) | ✓ | ✓ | | 90.46 | 82.85 | 91.03 |
| Baseline + MAFF (SHMA) + EFRM | ✓ | | ✓ | 90.81 | 83.25 | 91.43 |
| Baseline + MAFF (MSCA) + EFRM | | ✓ | ✓ | 90.86 | 83.84 | 91.47 |
| Baseline + MAFF (SHMA + MSCA) + EFRM | ✓ | ✓ | ✓ | 91.45 | 84.51 | 91.97 |
Table 3. The performance of state-of-the-art network models on the Vaihingen dataset was evaluated and compared. The metric for each category is the F1 score. The results are presented in percentages (%), with the best data in each column highlighted in bold and the second-highest data underlined.
| Method | Backbone | Params (M) | FLOPs (G) | Imp. Surf. | Building | Low Veg. | Tree | Car | F1 | mIoU | OA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet [8] | ResNet18 | 22.61 | 71.26 | 91.11 | 95.25 | 81.11 | 88.21 | 83.48 | 87.84 | 78.69 | 89.09 |
| PSPNet [26] | ResNet18 | 12.66 | 40.10 | 91.91 | 95.25 | 80.92 | 86.70 | 79.47 | 86.85 | 77.28 | 89.21 |
| DANet [38] | ResNet18 | 12.09 | 36.92 | 90.17 | 94.74 | 81.12 | 86.90 | 66.77 | 83.94 | 73.46 | 88.59 |
| BANet [39] | ResT-Lite | 12.14 | 49.09 | 93.05 | 96.41 | 82.49 | 88.99 | 91.03 | 90.49 | 82.93 | 90.94 |
| ABCNet [30] | ResNet18 | 14.06 | 18.72 | 88.20 | 91.17 | 78.19 | 86.05 | 68.80 | 82.48 | 70.96 | 86.62 |
| UNetFormer [34] | ResNet18 | 11.14 | 10.94 | 93.13 | 96.42 | 83.86 | 89.60 | 88.97 | 90.40 | 82.57 | 91.26 |
| DC-Swin [35] | Swin-S | 63.80 | 258.13 | 93.36 | 96.54 | 84.74 | 89.92 | 86.17 | 90.15 | 82.36 | 91.58 |
| MANet [29] | ResNet18 | 11.43 | 82.96 | 89.89 | 93.23 | 79.45 | 86.04 | 72.11 | 84.14 | 73.35 | 87.61 |
| A2-FPN [40] | ResNet18 | 22.77 | 158.84 | 92.69 | 96.15 | 83.64 | 89.38 | 88.81 | 90.13 | 82.31 | 90.89 |
| SACANet [41] | HRNet-v2 | 28.81 | 210.65 | 92.04 | 95.84 | 84.89 | 91.00 | 86.32 | 90.02 | 82.09 | 91.00 |
| LOG-CAN [42] | ResNet50 | 29.48 | 184.70 | 91.13 | 94.78 | 84.57 | 89.45 | 81.14 | 88.21 | 79.25 | 90.07 |
| CMTFNet [20] | ResNet50 | 28.68 | 122.07 | 92.13 | 95.22 | 83.16 | 89.23 | 85.59 | 89.07 | 80.57 | 90.37 |
| MFRNet-R (ours) | ResNet18 | 18.47 | 167.48 | 93.72 | 97.11 | 84.78 | 90.19 | 91.46 | 91.45 | 84.51 | 91.97 |
| MFRNet-S (ours) | Swin-S | 54.46 | 77.94 | 94.43 | 97.31 | 85.94 | 90.90 | 92.15 | 92.14 | 85.66 | 92.64 |
Table 4. The comparison results with state-of-the-art network models on the Potsdam dataset are presented in percentages (%). The metric for each category is the F1 score. The best data in each column are highlighted in bold and the second-highest data are underlined.
| Method | Backbone | Params (M) | FLOPs (G) | Imp. Surf. | Building | Low Veg. | Tree | Car | F1 | mIoU | OA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| UNet [8] | ResNet18 | 22.61 | 71.26 | 90.42 | 94.41 | 83.51 | 85.69 | 94.35 | 89.68 | 81.58 | 87.69 |
| PSPNet [26] | ResNet18 | 12.66 | 40.10 | 90.64 | 94.64 | 84.38 | 85.85 | 94.08 | 89.92 | 81.95 | 88.25 |
| DANet [38] | ResNet18 | 12.09 | 36.92 | 91.06 | 95.22 | 86.09 | 87.97 | 86.00 | 89.27 | 80.80 | 89.34 |
| BANet [39] | ResT-Lite | 12.14 | 49.09 | 92.68 | 96.24 | 87.06 | 88.79 | 95.68 | 92.09 | 85.56 | 90.67 |
| ABCNet [30] | ResNet18 | 14.06 | 18.72 | 90.36 | 93.34 | 84.05 | 84.98 | 93.73 | 89.29 | 80.90 | 87.64 |
| UNetFormer [34] | ResNet18 | 11.14 | 10.94 | 92.07 | 95.86 | 86.74 | 88.05 | 95.21 | 91.59 | 84.69 | 90.03 |
| DC-Swin [35] | Swin-S | 63.80 | 258.13 | 93.03 | 96.42 | 87.78 | 88.67 | 95.85 | 92.35 | 85.99 | 90.95 |
| MANet [29] | ResNet18 | 11.43 | 82.96 | 86.73 | 89.37 | 80.60 | 81.01 | 91.22 | 85.91 | 75.36 | 83.75 |
| A2-FPN [40] | ResNet18 | 22.77 | 158.84 | 92.59 | 96.08 | 87.06 | 88.54 | 95.86 | 92.03 | 85.45 | 90.54 |
| SACANet [41] | HRNet-v2 | 28.81 | 210.65 | 92.92 | 96.03 | 88.13 | 88.97 | 96.65 | 92.54 | 86.31 | 90.98 |
| LOG-CAN [42] | ResNet50 | 29.48 | 184.70 | 91.91 | 96.60 | 86.07 | 88.17 | 95.00 | 91.55 | 84.66 | 90.17 |
| CMTFNet [20] | ResNet50 | 28.68 | 122.07 | 92.69 | 95.87 | 87.54 | 88.14 | 95.50 | 91.95 | 85.30 | 90.57 |
| MFRNet-R (ours) | ResNet18 | 18.47 | 167.48 | 93.12 | 96.67 | 87.68 | 89.05 | 96.44 | 92.59 | 86.43 | 91.12 |
| MFRNet-S (ours) | Swin-S | 54.46 | 77.94 | 94.73 | 97.30 | 88.71 | 90.03 | 96.95 | 93.55 | 88.08 | 92.25 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
