1. Introduction
Semantic segmentation entails partitioning an image into regions belonging to the same class, assigning a label to each pixel [1,2,3,4,5,6]. Unlike image classification, which categorizes an entire image holistically, semantic segmentation provides granular class-level delineations. Stemming from image segmentation techniques, it focuses on embedding semantic context, thereby rendering segmented regions meaningful. This capability is essential for numerous computer vision tasks and finds application in diverse fields, including medical and natural image parsing [7,8,9,10], autonomous driving [11,12,13,14], road network extraction [15,16,17,18], and land-use classification [19,20,21,22,23].
Early approaches to semantic segmentation relied on manually designed low-level features, such as color, shape, and texture [24,25,26]. These features were processed using clustering or classification algorithms in high-dimensional spaces to perform segmentation. While these methods laid the foundation for early advancements, they exhibited clear limitations, including low flexibility and inadequate abstraction to handle complex segmentation tasks.
To overcome these constraints, convolutional neural networks (CNNs) revolutionized segmentation tasks by enabling hierarchical feature learning [27,28,29]. Among them, fully convolutional networks (FCNs) [30,31,32,33,34,35] established a new standard for semantic segmentation with their end-to-end trainable frameworks. However, due to their simplistic decoder structures, FCNs often produce coarse predictions, which led to the development of more advanced encoder–decoder architectures, such as UNet [31], SegNet [32], and UNet++ [33]. These models leverage hierarchical features to capture both local and global contexts, substantially improving segmentation precision, especially in medical imaging and natural scene analysis. For instance, UNet [31] adopts a symmetric encoder–decoder structure with skip connections to fuse low-level spatial features with high-level semantic information, becoming a cornerstone model in segmentation tasks. SegNet [32] enhances computational efficiency by reusing max-pooling indices, making it particularly appealing for real-time applications. UNet++ [33] introduces nested and dense skip connections to refine feature fusion, further boosting segmentation accuracy. Despite these advancements, the inherent challenges of remote sensing images (RSIs) remain unresolved due to their unique spectral and spatial complexities.
RSIs are distinct from natural images due to their intricate spatial structures and diverse spectral information. They often exhibit high intraclass variability and low interclass separability, as in urban areas with overlapping spectral characteristics, which significantly complicates semantic segmentation. A key limitation of existing methods is their inability to consistently handle small objects and preserve fine boundaries in high-resolution RSIs. For example, in complex urban environments, densely packed cars and intricate building edges are often misclassified or blurred by conventional models. This occurs because most CNN-based methods rely heavily on localized convolutional operations, which struggle to capture the high-frequency details necessary for accurate boundary delineation. Similarly, attention mechanisms and transformer-based models, while improving context modeling, still struggle to balance local detail preservation with long-range dependency modeling. This inadequacy in handling small, detailed objects and accurately segmenting complex boundaries motivates an approach that effectively integrates spectral and spatial information.
Attention mechanisms (AMs) [36,37] have emerged as a transformative solution for capturing spatial and contextual dependencies in RSIs. By dynamically focusing on salient regions and features within an image, AMs enhance interpretability and segmentation accuracy, particularly in heterogeneous and complex environments. For instance, MACU-Net [38] integrates multiscale features and hierarchical connections, outperforming UNet in satellite image segmentation tasks. Similarly, A2FPN [39] incorporates an attention aggregation module to simultaneously enhance spatial and contextual understanding, demonstrating its effectiveness across diverse datasets. SAPNet [40], a more recent contribution, combines spatial and channel attention, achieving finer segmentation granularity. These advances highlight the potential of attention-driven frameworks to tackle the multiscale challenges of RSIs.
Transformers have recently garnered significant attention in computer vision due to their ability to capture both local and global dependencies [41,42,43]. SETR [44] reformulates semantic segmentation as a sequence-to-sequence prediction problem by encoding images as patch sequences, achieving strong results on natural image benchmarks. Similarly, Segmenter [45], based on a vision transformer, employs a pre-trained image classification model and a simple linear decoder to produce segmentation masks. In the context of RSIs, Blaga and Nedevschi [46] proposed a transformer-based U-Net with Guided Focal-Axial Attention, combining global and localized attention to improve the segmentation of high-resolution RSIs. Wang et al. [47] introduced UNetFormer, a hybrid transformer model that integrates local spatial details and global dependencies for efficient urban scene segmentation. Lin et al. [48] presented the Swin Transformer Segmentor, which integrates Swin Transformers with CNNs to refine boundary details in RSIs and improve segmentation accuracy. Li et al. [49] proposed GPINet, which fuses CNN and transformer features with geometric priors to enhance segmentation performance. Similarly, He et al. [50] developed ST-UNet by combining Swin Transformers and CNNs to leverage global context and spatial detail for better RSI segmentation. Lastly, Long et al. [51] introduced CLCFormer, a hybrid network that combines fine-grained spatial features and long-range global contexts using CNNs and transformers, achieving state-of-the-art results on high-resolution RSI datasets. However, two main issues remain to be addressed:
Existing methods often rely heavily on spatial features, while spectral richness in RSIs remains underexplored. This limitation reduces their capacity to distinguish subtle interclass differences, particularly in complex scenarios involving overlapping spectral features.
Capturing both local spatial details and long-range global dependencies is crucial for high-resolution RSIs. However, many existing models struggle to effectively balance these two aspects, limiting their segmentation accuracy in heterogeneous landscapes.
In recent years, frequency domain learning has gained traction for its ability to complement spatial domain approaches in image processing tasks [52,53]. Research has revealed that neural networks exhibit a spectral bias, naturally favoring low-frequency representations while struggling to capture the high-frequency details critical for tasks such as edge preservation and fine-grained segmentation [54]. Inspired by these advancements, this paper proposes a novel frequency attention-enhanced network (FAENet) for semantic segmentation of high-resolution RSIs. The key contributions are summarized as follows:
We propose a frequency attention model (FreqA) that explicitly incorporates spectral and spatial contexts. Using discrete wavelet transformation (DWT), FreqA decomposes feature maps into frequency components. Inner-component channel attention (ICCA) and cross-component channel attention (CCCA) are designed to selectively emphasize informative spectral bands. These enhanced features are processed by a self-attention (SA) module, enabling joint modeling of spectral and spatial dependencies.
We design FAENet, an encoder–decoder architecture equipped with FreqA modules. This design enables hierarchical learning and multiscale feature refinement, allowing the model to handle the complexity and variability in RSIs effectively. FAENet balances local spatial detail capture with long-range dependency modeling.
Extensive experiments on the ISPRS Potsdam [55] and LoveDA [56] benchmarks show that FAENet achieves state-of-the-art segmentation accuracy, with improvements across key metrics such as AF, OA, and mIoU. Ablation studies further confirm the critical roles of ICCA and CCCA in spectral–spatial modeling, validating the robustness and generalizability of the proposed approach for complex remote sensing tasks. Moreover, efficiency comparisons confirm the superiority of the proposed FAENet.
This paper is organized as follows:
Section 2 provides an overview of related works in the semantic segmentation of RSIs and the advanced methods based on frequency analysis.
Section 3 introduces FAENet.
Section 4 presents the experimental results and analysis.
Section 5 draws the conclusion and points out future directions.
2. Related Works
2.1. CNN-Based Methods for Semantic Segmentation of Remote Sensing Images
CNNs have become a cornerstone for the semantic segmentation of RSIs, demonstrating their ability to extract hierarchical features effectively. The high spatial resolution and complex spectral variations that characterize RSIs have driven the adaptation of CNN architectures to address the unique challenges of this domain. A key advancement is the ResUNet-a model [57], which incorporates residual connections and atrous convolutions to enhance multiscale feature learning while maintaining efficient feature propagation. This architecture has shown significant improvements in handling RSIs with intricate spatial structures and complex boundaries. Similarly, D-LinkNet [58] extends traditional CNN-based architectures by introducing dense connections within an encoder–decoder framework, achieving enhanced feature reuse and better segmentation accuracy for road extraction tasks in RSIs. PSPNet [59] has been adapted for RSIs by leveraging pyramid pooling to capture global context, addressing the challenge of large-scale scene variability. For vegetation segmentation, methods like DeepLabV3+ [60] have been customized to incorporate multiscale feature extraction and boundary refinement, enabling accurate mapping of vegetation classes in high-resolution aerial imagery.
Despite these advancements, CNN-based methods exhibit notable limitations when applied to RSIs. Specifically, they struggle to capture long-range dependencies due to their reliance on localized convolutional operations. Additionally, their primary focus on spatial features leads to an underutilization of the rich spectral information inherent in RSIs, which is critical for differentiating spectrally similar classes. These shortcomings highlight the necessity for advanced architectures that can integrate spectral and spatial contexts effectively, motivating the development of approaches like FAENet proposed in this work.
2.2. Attention-Based Methods for Semantic Segmentation of Remote Sensing Images
Attention mechanisms (AMs) have proven to be a powerful augmentation to CNNs, addressing their limitations in capturing long-range dependencies and improving contextual representation for semantic segmentation tasks. By dynamically emphasizing salient regions and focusing on relevant features, attention-based methods have significantly enhanced segmentation accuracy in RSIs, which are characterized by complex spatial and spectral patterns. CAS-Net [61] enhances small object segmentation by integrating coordinate attention (CA) and SPD-Conv layers to better capture orientation-sensitive and positional information. MTCNet [62] combines CBAMs with multiscale transformers for improved spatial–contextual modeling. Similarly, the AD-HRNet model [63] leverages high-resolution attention modules to refine feature representations at multiple scales. LANet [64] introduced a patch-wise attention module to preserve local details during multi-level feature fusion, resulting in improved segmentation accuracy. Similarly, Li et al. [65] proposed a hybrid attention mechanism, applying spatial attention in shallow layers to capture local features and channel attention in deeper layers to enhance hierarchical feature learning for satellite image segmentation. Following this trend, hybrid designs such as SCAttNet [66], HMANet [67], and HCANet [37] effectively combined spatial and channel attention mechanisms to enrich feature representations before final inference. MDANet [68] introduces a deformable attention module (DAM) to enhance locality awareness and structural adaptability for high-resolution remote sensing (HRRS) images, achieving significant performance gains with a multiscale strategy. In contrast, CCAFFMNet [69] focuses on dual-spectral (RGB-thermal) segmentation, utilizing channel-coordinate attention feature-fusion modules (CCAFFMs) to refine infrared and RGB feature integration. More recent advancements have focused on improving long-range dependency modeling and global contextual representation. For instance, WiCoNet [70] employed a dual-branch structure with two CNNs to independently model local and global features, effectively capturing long-range dependencies for improved segmentation performance. Li et al. [40] introduced a synergistic attention model (SAM) to simultaneously capture spatial and channel dependencies, mitigating the attention bias often present in traditional methods. By integrating SAM, the SAPNet framework achieved state-of-the-art performance, enhancing feature representations with comprehensive contextual information.
Despite these advancements, attention-based methods still face notable limitations. First, they often struggle to effectively decouple spatial and spectral features, leading to suboptimal fusion and reduced segmentation accuracy in spectrally complex RSIs. Second, these methods face challenges in balancing the preservation of local details with the integration of long-range dependencies, both of which are crucial for accurate segmentation in high-resolution remote sensing tasks. These limitations highlight the need for advanced architectures, such as FAENet, that address these issues by integrating frequency attention mechanisms and synergizing spatial and spectral feature learning.
2.3. Transformer-Based Methods for Semantic Segmentation of Remote Sensing Images
Transformer-based architectures have emerged as a transformative approach in semantic segmentation, particularly for RSIs, due to their ability to model long-range dependencies and capture global context effectively. These capabilities make transformers well-suited to address the challenges posed by high-resolution RSIs, such as complex spatial structures and spectral variability. In the context of RSIs, transformer-based models have shown great potential for improving segmentation performance. Blaga and Nedevschi [46] proposed a transformer-based U-Net model with Guided Focal-Axial Attention, which combines the global attention mechanism of transformers with localized attention to enhance feature representation in high-resolution RSIs. Wang et al. [47] introduced UNetFormer, a hybrid architecture that incorporates transformer layers into a U-Net structure, enabling the model to effectively capture both local spatial details and long-range global dependencies, making it particularly suited for urban scene segmentation tasks. Lin et al. [48] presented the Swin Transformer Segmentor, which integrates Swin Transformers with CNN-based architectures to refine boundary information and achieve significant improvements in segmentation accuracy.
More advanced transformer-based frameworks have further refined these ideas. GLOTS [71] introduces a unified transformer encoder–decoder structure, leveraging a masked image modeling pre-training strategy and a global–local attention mechanism to capture multiscale contexts effectively. Another study [72] employs a Swin Transformer backbone with a densely connected feature aggregation module (DCFAM) to restore resolution, demonstrating the strength of transformers in producing fine-grained segmentation maps. EMRT [73] combines CNNs and deformable self-attention mechanisms within a transformer-based architecture, enabling efficient multiscale representation learning by fusing local and global features. These works highlight the growing focus on overcoming the limitations of standalone CNNs or transformers through hybrid models and advanced attention mechanisms. Li et al. [49] proposed GPINet, a hybrid network that combines CNN and transformer features with geometric priors to improve the semantic segmentation of high-resolution RSIs. Similarly, He et al. [50] developed ST-UNet, which integrates a Swin Transformer with a CNN-based U-Net to leverage both global context and detailed spatial features. Hybrid models such as CLCFormer [51] balance fine-grained spatial details with global contextual information through a cross-learning framework, achieving state-of-the-art performance on several very high-resolution (VHR) remote sensing datasets. Meanwhile, LETFormer [74] addresses structural limitations in tokenization by integrating intra-window self-attention with cross-window context interactions, enhancing global feature modeling and spatial representation. Both CLCFormer [51] and LETFormer [74] exemplify the advancements in transformer-based methods for RSI segmentation, tackling key challenges such as balancing local spatial details and global contextual understanding.
However, these methods still face limitations in decoupling spatial and spectral features effectively and integrating multiscale contextual information. The proposed FAENet aims to address these issues by incorporating frequency attention mechanisms, enabling a more robust and balanced approach to spectral and spatial feature learning, which is critical for accurate semantic segmentation of high-resolution RSIs.
2.4. Learning in Frequency Domain
In recent years, frequency domain learning has gained significant attention for its ability to complement spatial domain approaches in image processing tasks [52,53]. Frequency-based methods enable the separation of high-frequency components, such as textures and edges, from low-frequency components, such as smooth regions, facilitating nuanced feature extraction. Xu et al. [54] conducted a theoretical analysis of neural networks' spectral bias using Fourier transformations, revealing a natural inclination toward low-frequency representations and difficulty in capturing high-frequency details. These findings catalyzed further research into frequency domain techniques aimed at improving high-frequency representation. For instance, Azad et al. [75] redesigned the self-attention mechanism to operate in the frequency domain, enhancing contextual cues and uncovering finer details for improved feature representation. Similarly, Zhang et al. [76] explored the transformation of spatial-domain CNNs into frequency-domain equivalents to harness their unique properties, enabling better utilization of spectral information. Additionally, the development of frequency channel attention (FCA)-based networks enabled explicit spectral feature processing without complex frequency transformations, showing promising results in various applications. Zhang et al. [77] further advanced this concept with FsaNet, a framework leveraging frequency self-attention to improve edge preservation and computational efficiency. FsaNet demonstrated state-of-the-art performance on several benchmarks, achieving significant improvements in segmentation accuracy and efficiency.
In the domain of remote sensing, frequency domain methods have shown substantial potential for addressing the challenges of high-resolution remote sensing imagery (HRRSI). Techniques such as discrete cosine transformation (DCT) and Fourier analysis enable the effective separation of spectral components, aiding feature extraction for tasks like semantic segmentation. Su et al. [78] proposed the Complete Frequency Channel Attention Network (CFCANet), which integrates DCT frequency components into feature maps by assigning the most significant eigenvalues to each channel. This approach significantly enhances noise resistance, particularly in noisy remote sensing imagery. For semantic segmentation, Li et al. [79] introduced the Spectrum-Space Collaborative Network (SSCNet), which employs a joint spectral–spatial attention (JSSA) module to simultaneously model spectral (SpeA) and spatial (SpaA) dependencies, leading to improved segmentation quality in HRRSIs. Hybrid architectures that integrate spatial and frequency domain features have also emerged as a promising direction for remote sensing tasks. Hong et al. [80] presented the Spatial-Frequency Information Integration Network (SFINet), which leverages invertible neural operators in the spatial domain and deep Fourier transformation in the frequency domain. SFINet excels in multimodal image fusion tasks, such as pan-sharpening and depth super-resolution, by effectively integrating local and global information to capture high-frequency details. Similarly, the Fourier Frequency Domain Convolutional Neural Network (FFDCNet) [81] introduces a dynamic frequency filtering mechanism that decomposes feature maps into low- and high-frequency components, improving classification accuracy and segmentation robustness. FFDCNet has shown particular effectiveness in addressing fragmented boundaries and incomplete extractions in crop segmentation tasks, making it highly relevant for remote sensing applications.
Despite these advancements, existing frequency domain methods still face notable limitations. Many approaches struggle to jointly integrate spectral and spatial features effectively, resulting in suboptimal segmentation performance, particularly in spectrally complex and high-resolution RSIs. Furthermore, current methods often lack a unified framework to balance local detail preservation and global contextual understanding across spatial and frequency domains. These limitations motivate the development of FAENet, which explicitly combines spectral and spatial learning by leveraging frequency attention mechanisms, providing a more robust and efficient solution for semantic segmentation of HRRSIs.
3. Method
3.1. Overview of FAENet
FAENet, as illustrated in Figure 1, is a frequency attention-enhanced network designed for the semantic segmentation of RSIs. It adopts a symmetric encoder–decoder architecture wherein the encoder progressively down-samples the input to extract multiscale features, while the decoder up-samples the encoded features to produce a pixel-wise segmentation mask. The core innovation of FAENet lies in its integration of the proposed Frequency Attention Model (FreqA) within the encoder–decoder framework, enabling joint spectral and spatial feature learning. Each stage of the encoder and decoder consists of multiple Conv Blocks interleaved with FreqA modules, ensuring both local detail preservation and global context modeling.
The encoder is based on a modified ResNet-50 backbone, where each Conv Block corresponds to a residual block from ResNet-50. A Conv Block consists of three convolutional layers (the $1 \times 1$, $3 \times 3$, $1 \times 1$ bottleneck design of ResNet-50), each followed by batch normalization and ReLU activation. Skip connections are employed within each Conv Block to enable efficient gradient propagation and prevent vanishing gradients during training.
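For concreteness, a minimal PyTorch sketch of such a Conv Block is given below, assuming the standard ResNet-50 bottleneck layout; the module name and channel widths are illustrative assumptions, not the authors' exact implementation:

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Bottleneck-style Conv Block: three convolutions with batch norm and
    ReLU, plus an internal skip connection (a sketch of Section 3.1)."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the skip connection matches the output width
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))
```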
Let the input image be denoted as $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively. The encoder progressively down-samples the input by a factor of 2 using pooling layers, resulting in feature maps $F_i$ at different scales, where $i$ denotes the level in the encoder hierarchy. Here, $i \in \{1, 2, \ldots, N\}$, and $N$ is the total number of Conv Blocks in the encoder. After each Conv Block, a FreqA module is applied to enhance the extracted features by incorporating spectral and spatial contexts; its details are elaborated in Section 3.2.
The decoder follows a symmetric structure, where each level up-samples the feature map from the previous decoder level using bilinear interpolation. Let $D_i$ denote the up-sampled feature map at level $i$ in the decoder, which is computed as follows:

$$D_i = \mathrm{Upsample}_{\times 2}(D_{i+1})$$

To improve feature refinement, the up-sampled feature map $D_i$ is concatenated with the corresponding feature map from the encoder at the same level:

$$\tilde{D}_i = \mathrm{Concat}(D_i, F_i)$$

The concatenated feature $\tilde{D}_i$ is then processed by a Conv Block to refine the combined representation before being passed to the next level in the decoder. The final output is obtained by applying a $1 \times 1$ convolution followed by a softmax activation to produce a pixel-wise class probability map $P \in \mathbb{R}^{H \times W \times K}$, where $K$ is the number of classes:

$$P = \mathrm{Softmax}(\mathrm{Conv}_{1 \times 1}(\tilde{D}_1))$$
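The sketch below illustrates one decoder level under these equations, reusing the `ConvBlock` sketch above; the `DecoderLevel` name and channel arithmetic are assumptions for illustration (the Potsdam benchmark has $K = 6$ classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderLevel(nn.Module):
    """One decoder level: bilinear up-sampling, concatenation with the
    encoder feature at the same scale, then Conv Block refinement."""
    def __init__(self, dec_ch: int, enc_ch: int, out_ch: int):
        super().__init__()
        self.refine = ConvBlock(dec_ch + enc_ch, out_ch // 4, out_ch)

    def forward(self, d, f_enc):
        # D_i = Upsample(D_{i+1}) via bilinear interpolation
        d = F.interpolate(d, size=f_enc.shape[-2:], mode="bilinear",
                          align_corners=False)
        d = torch.cat([d, f_enc], dim=1)  # concatenate with encoder F_i
        return self.refine(d)

def prediction_head(in_ch: int, num_classes: int) -> nn.Module:
    """Final head: 1x1 convolution followed by softmax over K classes."""
    return nn.Sequential(nn.Conv2d(in_ch, num_classes, kernel_size=1),
                         nn.Softmax(dim=1))
```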
In summary, FAENet employs a carefully designed encoder–decoder architecture with integrated frequency attention to address the complex spectral–spatial characteristics of HRRSIs. By combining hierarchical feature extraction, multiscale feature fusion, and frequency attention mechanisms, FAENet is well-suited for high-resolution semantic segmentation tasks.
3.2. Frequency Attention Model
The Frequency Attention Model (FreqA) is designed to enhance feature representations by jointly modeling spectral and spatial contexts, which are crucial for accurate semantic segmentation in RSIs. As shown in Figure 2, FreqA first transforms the input feature maps into the frequency domain using Discrete Wavelet Transformation (DWT) and applies attention mechanisms to emphasize the most informative spectral components. Finally, the refined features are transformed back into the spatial domain using Inverse Discrete Wavelet Transformation (iDWT).
Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels, FreqA begins by applying DWT to decompose $X$ into four frequency components: $X_{LL}$, the low-frequency component, representing the coarse approximation of the input feature map; $X_{LH}$, the high-frequency component in the horizontal direction; $X_{HL}$, the high-frequency component in the vertical direction; and $X_{HH}$, the high-frequency component in the diagonal direction. The result is a set of four sub-band feature maps, each of size $\frac{H}{2} \times \frac{W}{2} \times C$:

$$\{X_{LL}, X_{LH}, X_{HL}, X_{HH}\} = \mathrm{DWT}(X),$$

where DWT facilitates the separation of low- and high-frequency components, enabling nuanced feature extraction for tasks such as edge preservation and texture representation.
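As an illustration, a single-level 2-D DWT/iDWT pair can be written in a few lines of PyTorch. The paper does not state which wavelet basis is used, so the Haar wavelet below is an assumption:

```python
import torch

def haar_dwt(x: torch.Tensor):
    """Single-level 2-D Haar DWT of x with shape (B, C, H, W), H and W even.
    Returns four half-resolution sub-bands (LL, LH, HL, HH)."""
    a = x[..., 0::2, 0::2]  # even rows, even cols
    b = x[..., 0::2, 1::2]  # even rows, odd cols
    c = x[..., 1::2, 0::2]  # odd rows, even cols
    d = x[..., 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2  # coarse approximation
    lh = (a + b - c - d) / 2  # detail along one axis
    hl = (a - b + c - d) / 2  # detail along the other axis
    hh = (a - b - c + d) / 2  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt(ll, lh, hl, hh):
    """Exact inverse of haar_dwt, reassembling the full-resolution map."""
    a = (ll + lh + hl + hh) / 2
    b = (ll + lh - hl - hh) / 2
    c = (ll - lh + hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    x = ll.new_zeros(*ll.shape[:-2], 2 * ll.shape[-2], 2 * ll.shape[-1])
    x[..., 0::2, 0::2], x[..., 0::2, 1::2] = a, b
    x[..., 1::2, 0::2], x[..., 1::2, 1::2] = c, d
    return x
```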
Once the feature maps are transformed into the frequency domain, FreqA applies the Frequency Channel Attention (FCA) mechanism to selectively emphasize important spectral information. FCA consists of two key sub-modules: (1) Inner-Component Channel Attention (ICCA): Enhances the discriminative power of each frequency component by modeling channel-wise dependencies within the same component. (2) Cross-Component Channel Attention (CCCA): Captures correlations across different frequency components, improving spectral coherence across the frequency domain.
For each frequency component $X_s$, $s \in \{LL, LH, HL, HH\}$, the FCA module produces a refined feature map $\hat{X}_s$ by sequentially applying ICCA and CCCA:

$$\hat{X}_s = \mathrm{CCCA}(\mathrm{ICCA}(X_s)).$$

The outputs from all frequency components are then concatenated along the channel dimension to form an aggregated feature map:

$$X_{\mathrm{agg}} = \mathrm{Concat}(\hat{X}_{LL}, \hat{X}_{LH}, \hat{X}_{HL}, \hat{X}_{HH}).$$
To further capture long-range dependencies and enhance global contextual representation, FreqA applies a self-attention (SA) module to the aggregated feature map $X_{\mathrm{agg}}$. The self-attention mechanism computes query ($Q$), key ($K$), and value ($V$) representations of the feature map:

$$Q = X_{\mathrm{agg}} W_Q, \quad K = X_{\mathrm{agg}} W_K, \quad V = X_{\mathrm{agg}} W_V,$$

where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices. The attention map is computed using the scaled dot-product operation:

$$\mathrm{Attention}(Q, K) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right),$$

where $d_k$ is the dimensionality of the key. The output of the self-attention module, denoted as $X_{\mathrm{SA}}$, is computed by applying the attention map to the value representation:

$$X_{\mathrm{SA}} = \mathrm{Attention}(Q, K)\, V.$$

Finally, the refined feature map $X_{\mathrm{SA}}$ is transformed back into the spatial domain using Inverse Discrete Wavelet Transformation (iDWT), producing the output feature map $Y$:

$$Y = \mathrm{iDWT}(X_{\mathrm{SA}}).$$
This process enables FAENet to jointly model spectral and spatial information, addressing the key challenges in high-resolution remote sensing segmentation tasks.
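A compact sketch of this frequency self-attention step is given below (single head, spatial positions as tokens; the head count, token layout, and module name are assumptions). The output is split back into four sub-bands before the iDWT:

```python
import torch
import torch.nn as nn

class FreqSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the aggregated map X_agg."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)  # W_Q
        self.k = nn.Linear(dim, dim, bias=False)  # W_K
        self.v = nn.Linear(dim, dim, bias=False)  # W_V

    def forward(self, x):                    # x: (B, 4C, H/2, W/2)
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)     # (B, H*W, 4C) spatial tokens
        q, k, v = self.q(t), self.k(t), self.v(t)
        # softmax(Q K^T / sqrt(d_k)) V, with d_k equal to the embed dim
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

# Usage within FreqA, with ll/lh/hl/hh from haar_dwt above and
# fca(.) standing for the ICCA+CCCA pipeline of Section 3.3:
#   agg = torch.cat([fca(ll), fca(lh), fca(hl), fca(hh)], dim=1)
#   agg = FreqSelfAttention(agg.shape[1])(agg)
#   y = haar_idwt(*torch.chunk(agg, 4, dim=1))
```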
3.3. Frequency Channel Attention
The Frequency Channel Attention (FCA) module is a key component of the proposed Frequency Attention Model (FreqA), designed to enhance feature representations by capturing both intra-component and cross-component spectral dependencies. As shown in Figure 3, ICCA operates on individual frequency components to refine channel-wise features, while CCCA models interactions across different frequency components to improve spectral coherence. The final output of FCA is a concatenated feature map that integrates refined information from all frequency components.
ICCA aims to enhance the discriminative power of each frequency component by modeling channel-wise dependencies within the same component. Given a frequency component $X_s \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively, ICCA computes a refined feature map by applying a channel attention mechanism.
First, a channel descriptor $z \in \mathbb{R}^{C}$ is obtained by applying global average pooling (GAP) across the spatial dimensions:

$$z_c = \frac{1}{H W} \sum_{h=1}^{H} \sum_{w=1}^{W} X_s(h, w, c),$$

where $X_s(h, w, c)$ denotes the value at position $(h, w)$ in channel $c$ of the frequency component $X_s$.

Next, a fully connected (FC) layer followed by a ReLU activation is applied to $z$ to obtain a transformed channel descriptor $z'$:

$$z' = \mathrm{ReLU}(W_1 z + b_1),$$

where $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $b_1$ are learnable parameters, and $r$ is the reduction ratio.

To obtain the final attention weights, another FC layer with a sigmoid activation is applied:

$$a = \sigma(W_2 z' + b_2),$$

where $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ and $b_2$ are learnable parameters. The attention weights $a$ are then used to reweight the original feature map $X_s$ channel-wise:

$$\tilde{X}_s = a \odot X_s,$$

where $\tilde{X}_s$ is the refined output of ICCA for the frequency component $X_s$.
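This computation maps directly onto a squeeze-and-excitation-style block; below is a minimal PyTorch sketch, where the default reduction ratio r = 16 is an assumption:

```python
import torch.nn as nn

class ICCA(nn.Module):
    """Inner-Component Channel Attention: channel reweighting within one
    frequency component (a sketch of the equations above)."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # z' = ReLU(W1 z + b1)
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # a = sigma(W2 z' + b2)
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        z = x.mean(dim=(2, 3))                   # GAP channel descriptor
        a = self.fc(z)                           # channel attention weights
        return x * a.unsqueeze(-1).unsqueeze(-1) # channel-wise reweighting
```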
After applying ICCA to each frequency component independently, CCCA is employed to capture cross-frequency dependencies by modeling interactions across different frequency components. The goal of CCCA is to improve spectral coherence by leveraging correlations between the low-frequency and high-frequency components.
Let $\tilde{X}_{LL}, \tilde{X}_{LH}, \tilde{X}_{HL}, \tilde{X}_{HH}$ denote the refined frequency components after ICCA. For a given channel $c$, CCCA computes the correlation between the low-frequency component $\tilde{X}_{LL}^{(c)}$ and the high-frequency components $\tilde{X}_{LH}^{(c)}, \tilde{X}_{HL}^{(c)}, \tilde{X}_{HH}^{(c)}$:

$$\hat{X}_{LL}^{(c)} = \tilde{X}_{LL}^{(c)} + \sum_{s \in \{LH, HL, HH\}} \alpha_s \, \rho\!\left(\tilde{X}_{LL}^{(c)}, \tilde{X}_{s}^{(c)}\right) \tilde{X}_{s}^{(c)},$$

where $\alpha_s$ are learnable weights, and $\rho(\cdot, \cdot)$ denotes a correlation function that employs cosine similarity computed across corresponding channels in different frequency components. This operation ensures that each channel in the low-frequency component is enriched with cross-component information from the high-frequency components.
The same process is applied to all channels in all frequency components, resulting in refined feature maps $\hat{X}_{LL}, \hat{X}_{LH}, \hat{X}_{HL}, \hat{X}_{HH}$.
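One plausible reading of this step is sketched below for the low-frequency branch; the additive enrichment form and the `CCCA` module name follow the reconstruction above and are assumptions rather than the authors' exact code. The same module would be applied with each component in turn playing the role of the anchor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCCA(nn.Module):
    """Cross-Component Channel Attention (sketch): enrich each channel of
    one component with the corresponding channels of the other sub-bands,
    weighted by learnable scalars and channel-wise cosine similarity."""
    def __init__(self, n_other: int = 3):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_other))  # one weight per sub-band

    def forward(self, anchor, others):    # anchor: (B, C, H, W); others: list
        out = anchor
        for a, h in zip(self.alpha, others):
            # cosine similarity per channel over flattened spatial positions
            rho = F.cosine_similarity(anchor.flatten(2), h.flatten(2), dim=2)
            out = out + a * rho.unsqueeze(-1).unsqueeze(-1) * h
        return out
```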
Finally, the refined frequency components after applying CCCA are concatenated along the channel dimension to form the output feature map of FCA:

$$X_{\mathrm{FCA}} = \mathrm{Concat}(\hat{X}_{LL}, \hat{X}_{LH}, \hat{X}_{HL}, \hat{X}_{HH}),$$

where $X_{\mathrm{FCA}}$ is the final output of the FCA module, containing enriched spectral and spatial information from all frequency components.
To sum up, FCA ensures that both intra-frequency and inter-frequency dependencies are effectively captured, enabling FAENet to jointly model spectral and spatial contexts for enhanced segmentation performance in HRRSIs.
4. Experiments
4.1. Settings
In our experiments, we implemented FAENet and all comparative models under the same settings, using PyTorch on a Linux OS with an NVIDIA A40 GPU. Data augmentations, such as random flipping and cropping, were applied to all datasets and networks. The initial learning rate and maximum number of epochs were fixed at 0.02 and 500, respectively. We adopted stochastic gradient descent (SGD) as the optimizer, with a momentum of 0.9. The learning rate was adjusted using a polynomial decay strategy, defined as follows:

$$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{p},$$

where $\eta_t$ is the learning rate at iteration $t$, $\eta_0$ is the initial learning rate, $T$ is the total number of iterations, and $p$ is a decay exponent set to 0.9 in our experiments. This strategy ensures that the learning rate decreases smoothly as training progresses, helping the model achieve better convergence. The model parameters with the lowest validation loss were saved for final evaluation.
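In PyTorch, this schedule can be reproduced with a `LambdaLR` wrapper; the stand-in model and total iteration count below are placeholders:

```python
import torch

model = torch.nn.Conv2d(3, 6, kernel_size=1)   # stand-in for FAENet
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)
T = 10_000                                     # total iterations (placeholder)
# Multiplicative factor (1 - t/T)^0.9 applied to the base lr of 0.02
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: (1.0 - t / T) ** 0.9)

for t in range(T):
    # ... forward pass, loss computation, loss.backward() ...
    optimizer.step()
    scheduler.step()
```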
We employ four metrics to evaluate the performance of the predicted results on the test set: class-wise $F_1$-score ($F_1$), average $F_1$-score across all classes (AF), overall accuracy (OA), and mean intersection over union (mIoU). The equations for $F_1$, OA, and IoU are provided in Equations (19)-(21), respectively:

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (19)$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (20)$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (21)$$

Here, precision and recall are defined as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ represent the number of true positive, true negative, false positive, and false negative samples, respectively.
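All four metrics can be computed from a single confusion matrix, as in the following NumPy sketch (the helper names are hypothetical; a small epsilon guards against classes absent from a tile):

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, k: int) -> np.ndarray:
    """k x k confusion matrix; rows are ground truth, columns predictions."""
    return np.bincount(k * gt.ravel() + pred.ravel(),
                       minlength=k * k).reshape(k, k)

def evaluate(cm: np.ndarray, eps: float = 1e-12):
    """Per-class F1, OA, and mIoU from a confusion matrix (Eqs. (19)-(21))."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                   # false positives per class
    fn = cm.sum(axis=1) - tp                   # false negatives per class
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    oa = tp.sum() / cm.sum()                   # overall accuracy
    miou = (tp / (tp + fp + fn + eps)).mean()  # mean IoU
    return f1, oa, miou
```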
As for comparative methods, we selected several well-established baselines and state-of-the-art (SOTA) methods tailored for RSI segmentation. FCN-8s [30] and DANet [36] are pioneering fully convolutional and attention-based models that have achieved notable success in general computer vision tasks. For RSIs, ResUNet-a [57] was designed specifically for RSI segmentation by incorporating atrous convolution, multiscale feature fusion, and residual connections, effectively addressing the challenges posed by complex spatial structures in RSIs.
More recent advancements in attention-based methods for RSI segmentation include MACU-Net [38], HCANet [37], SCAttNet [66], and A2FPN [39], which leverage various attention mechanisms to improve feature representation and segmentation accuracy. These models, published after 2021, have demonstrated state-of-the-art performance across multiple RSI datasets by dynamically capturing long-range dependencies and enhancing spatial context understanding.
Furthermore, we included the latest transformer-based architectures tailored for RSI segmentation: ICTNet [82], CLCFormer [51], and LETFormer [74]. These models exploit the global attention capabilities of transformers, providing a more holistic understanding of complex scenes in high-resolution RSIs. Specifically, CLCFormer employs a cross-learning mechanism to balance fine-grained spatial details with global context, while LETFormer introduces a novel intra-window self-attention mechanism for improved structural modeling in RSIs. These methods serve as strong baselines for benchmarking the performance of our proposed FAENet.
4.2. Datasets
The ISPRS Potsdam dataset [55] comprises 38 high-resolution orthophotos with a ground sampling distance (GSD) of 5 cm, each measuring $6000 \times 6000$ pixels. The dataset includes four spectral bands: near-infrared (NIR), red (R), green (G), and blue (B), along with corresponding digital surface model (DSM) and normalized digital surface model (NDSM) data. These additional data sources provide valuable elevation information, aiding in distinguishing objects with similar spectral characteristics. The dataset covers diverse urban scenes, such as buildings, roads, trees, and cars, making it a benchmark for evaluating semantic segmentation performance in high-resolution urban imagery. For this study, we focused on the RGB image data for training and testing, as they are widely used in standard semantic segmentation tasks.
The LoveDA dataset [56] presents a unique challenge in semantic segmentation by incorporating large-scale satellite images with a spatial resolution of 0.3 m. The dataset spans over 536 square kilometers and includes both rural and urban regions from three cities: Nanjing, Changzhou, and Wuhan. Each image has a spatial size of $1024 \times 1024$ pixels and exhibits substantial variability in object scale, size, and surface type. LoveDA is designed to evaluate the robustness of segmentation models in handling imbalanced class distributions and challenging environmental conditions, such as varying lighting and atmospheric effects. For our experiments, we utilized 2522 images for training, 834 for validation, and 835 for testing.
For both datasets, the images were cropped into fixed-size subpatches to ensure uniform input dimensions for the model. These subpatches were randomly divided into training, validation, and testing sets in a 1:1:1 ratio, providing a balanced and comprehensive evaluation framework.
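A simple tiling routine along these lines might look as follows; the patch size is left as an explicit parameter, since the exact value used is not restated here, and the non-overlapping, border-discarding policy is a simplifying assumption:

```python
import numpy as np

def crop_to_patches(image: np.ndarray, patch: int) -> list:
    """Crop an (H, W, C) image into non-overlapping patch x patch subpatches,
    discarding any partial tiles at the right/bottom borders."""
    h, w = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]
```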
4.3. Results on the ISPRS Potsdam Dataset
4.3.1. Numerical Evaluations
The numerical results in Table 1 demonstrate the superior performance of FAENet compared to existing methods across key evaluation metrics, highlighting its effectiveness in semantic segmentation of the ISPRS Potsdam dataset. FAENet achieves the highest OA of 92.31%, surpassing both LETFormer (91.17%) and CLCFormer (89.97%), which are state-of-the-art transformer-based methods. This improvement underscores FAENet's ability to generalize effectively across diverse scenes, benefiting from the proposed frequency attention mechanism that enhances both spectral and spatial feature extraction.
In terms of mIoU, FAENet achieves a score of 83.58%, outperforming LETFormer (82.67%) by nearly 1% and CLCFormer (81.68%) by almost 2%. The consistent improvement in mIoU indicates that FAENet excels in capturing inter-class separability and handling challenging scenarios with overlapping class boundaries. Although the margin over LETFormer is under 1%, statistical significance tests (t-tests) conducted over five repeated runs confirm that the observed improvements are statistically significant ($p < 0.05$).
When examining class-wise F1-scores, FAENet demonstrates outstanding performance in the “Low vegetation” and “Car” categories, achieving scores of 88.21 and 94.75, respectively. These results are particularly noteworthy because “Car” is a small and intricate class, often challenging for segmentation models due to its limited spatial representation. Similarly, FAENet’s ability to achieve the highest F1-score in “Low vegetation” reflects its effectiveness in managing fine-grained spectral details, which are critical for distinguishing between similar classes in HRRSIs. Compared to LETFormer, FAENet improves the F1-score for “Low vegetation” by nearly 1% (88.21 vs. 87.25) and for “Car” by 0.42% (94.75 vs. 94.33), indicating consistent performance across both large and small objects.
FAENet also achieves an AF of 92.71, reflecting its balanced segmentation performance across all classes. This improvement over LETFormer (92.47) and CLCFormer (91.61) demonstrates the efficacy of the frequency channel attention and self-attention modules in harmonizing spectral and spatial features. These mechanisms allow FAENet to mitigate class imbalances and achieve more precise segmentation results, especially in complex urban environments.
The overall trends in the results demonstrate that FAENet consistently outperforms existing state-of-the-art methods across various evaluation metrics. Specifically, FAENet achieves superior class-wise segmentation precision, particularly in challenging categories such as “Low vegetation” and “Car”, which require precise boundary delineation and fine-grained feature discrimination. The improvements in these categories indicate that the proposed frequency attention mechanism effectively captures and integrates both spectral and spatial information, leading to better overall segmentation accuracy. Moreover, FAENet’s performance in terms of OA, mIoU, and AF highlights its robustness in handling diverse and complex urban scenes in RSIs. The combination of frequency-based feature decomposition and attention mechanisms allows FAENet to generalize well across varying scene types and object scales, making it a strong candidate for real-world RSI segmentation applications.
4.3.2. Statistical Significance Analysis
To ensure the robustness of our results and address potential variability caused by random initialization, we conducted repeated experiments with fixed random seeds. Specifically, we ran the experiments five times and computed the mean and standard deviation of all overall evaluation metrics. A two-tailed paired t-test was performed to assess the statistical significance of the improvements achieved by FAENet compared to LETFormer, the most competitive baseline. Table 2 summarizes the mean, standard deviation, and p-values obtained from the t-test. The p-values for all metrics are below 0.05, indicating that the improvements achieved by FAENet are statistically significant.
4.3.3. Visual Comparisons
The visual comparisons presented in Figure 4 showcase the segmentation outputs of various state-of-the-art methods on the ISPRS Potsdam dataset. FAENet consistently produces clearer segmentation maps, particularly in regions with intricate boundaries and small objects. In complex scenes with transitions between “Building” and “Impervious surfaces”, FAENet demonstrates superior boundary delineation compared to earlier methods such as FCN-8s (c) and DANet (d), which tend to blur boundaries and introduce artifacts. The proposed frequency attention mechanism enables FAENet to preserve fine-grained details, resulting in sharper edges and fewer misclassified pixels.
FAENet also excels in segmenting small and detailed objects like “Car”, where methods such as ResUNet-a (e) and MACU-Net (f) struggle with fragmentation. As shown in the marked regions of the figure, FAENet achieves more cohesive and accurate car segments, underscoring its capability to enhance high-frequency feature representation through the frequency attention mechanism.
When comparing FAENet with advanced transformer-based models like CLCFormer (k) and LETFormer (l), FAENet demonstrates better spatial consistency and reduced boundary misalignment. Although LETFormer produces reasonably accurate results, minor inconsistencies are observed in densely vegetated areas (“Low vegetation” and “Tree”), where FAENet delivers smoother transitions and better-defined class boundaries. The highlighted regions in Figure 4 indicate FAENet's ability to maintain structural integrity and spatial coherence in complex scenes.
Overall, the marked improvements in the figure emphasize FAENet’s robustness in capturing both local and global contexts, leading to segmentation outputs that closely resemble ground truth labels. These results validate FAENet as a highly effective solution for RSI segmentation.
4.4. Results on the LoveDA Dataset
4.4.1. Numerical Evaluations
As shown in Table 3, FAENet achieves superior performance across key metrics on the LoveDA dataset, which poses unique challenges due to its mixed urban and rural landscapes, varying spatial resolutions, and imbalanced class distributions. The dataset's diversity in land cover scenarios requires models to generalize well across both densely built-up urban areas and sparsely populated rural regions.
FAENet attains the highest OA of 72.93%, outperforming LETFormer (72.12%) and CLCFormer (70.45%), demonstrating its improved generalization ability across complex landscape types. Additionally, FAENet achieves the best mIoU of 66.91%, surpassing LETFormer (66.01%) by 0.9%, highlighting its capacity to handle diverse land cover categories effectively.
In terms of class-wise F1-scores, FAENet excels in critical categories such as “Building” (81.74), “Road” (82.82), and “Water” (92.50), where accurate boundary delineation and fine-grained segmentation are crucial. Compared to LETFormer, FAENet records a 0.79% improvement in the “Road” category (82.82 vs. 82.03) and a notable 1.26% gain in the “Water” category (92.50 vs. 91.24). These improvements underscore FAENet’s ability to capture fine-grained details and segment linear and irregular features effectively.
Although LETFormer slightly outperforms FAENet in the “Barren” and “Forest” categories, with F1-scores of 56.75 and 70.05, respectively, FAENet remains competitive, achieving scores of 53.53 and 68.33 in these categories. Despite this trade-off, FAENet achieves the highest overall AF of 76.49, surpassing LETFormer (76.13) and other state-of-the-art models, indicating consistent performance across all classes.
The results validate FAENet’s frequency attention mechanism, which enhances the integration of spectral and spatial features, enabling robust performance across diverse environments. Compared to the ISPRS Potsdam dataset, where the focus is on urban-specific segmentation, the LoveDA dataset presents additional challenges due to the significant class imbalance and landscape variability. FAENet’s consistent improvements across both datasets highlight its versatility and robustness in handling different types of RSIs.
4.4.2. Visual Comparisons
The visual results in Figure 5 showcase segmentation outputs for the LoveDA dataset, enabling a comprehensive comparison of FAENet (m) with state-of-the-art models such as FCN-8s (c), DANet (d), ResUNet-a (e), and transformer-based architectures like CLCFormer (k) and LETFormer (l). Each subfigure highlights the ability of these models to delineate and classify diverse land cover types, including “Building”, “Road”, “Water”, “Barren”, “Forest”, and “Agriculture”.
FAENet demonstrates superior segmentation quality, particularly in capturing fine-grained details and complex boundaries. For example, in areas dominated by “Building” and “Road”, FAENet generates outputs that closely match the ground truth (b), with sharper edges and fewer misclassifications compared to other models like FCN-8s (c) and DANet (d). These improvements underline the strength of FAENet’s frequency attention mechanism in refining both spectral and spatial representations.
Compared to ResUNet-a (e) and MACU-Net (f), FAENet significantly enhances the segmentation of challenging classes like “Water” and “Agriculture”. In “Water” regions, FAENet accurately captures smooth boundaries and mitigates over-segmentation issues prevalent in other models. Similarly, in agricultural areas, where fine texture details are crucial, FAENet outperforms other models by producing more uniform and accurate classifications.
Transformer-based models such as LETFormer (l) and CLCFormer (k) provide strong baseline results, particularly in handling large-scale features like “Forest.” However, FAENet surpasses these methods in maintaining spatial consistency and reducing noise in densely packed regions. For example, in mixed-class areas with overlapping “Barren” and “Forest” regions, FAENet demonstrates better discrimination and smoother transitions between classes.
Overall, the visual comparisons clearly illustrate FAENet’s ability to produce segmentation maps with superior boundary alignment, reduced artifacts, and enhanced class differentiation. These results validate the efficacy of FAENet’s spectral–spatial feature integration in handling the diverse and complex landscapes of the LoveDA dataset, further emphasizing its robustness and generalization capabilities.
4.5. Efficiency Analysis
The results in Table 4 demonstrate that FAENet achieves a strong balance between computational efficiency and segmentation accuracy, outperforming both CNN-based and transformer-based state-of-the-art methods in terms of inference speed and computational cost. FAENet's inference time of 42.2 ms and 60.7 GFLOPs place it among the most efficient models evaluated.
FAENet exhibits a faster inference time compared to transformer-based architectures such as LETFormer (48.3 ms) and CLCFormer (49.6 ms). This reduction of approximately 6.1 ms and 7.4 ms, respectively, highlights FAENet’s streamlined architecture, which effectively integrates spectral and spatial attention mechanisms without imposing excessive computational demands. This efficiency makes FAENet suitable for real-time or large-scale remote sensing applications.
In terms of FLOPs, FAENet achieves significant reductions compared to transformer-based models such as CLCFormer (75.6 G) and HCANet (72.3 G). This reduction, nearly 20% relative to CLCFormer, demonstrates FAENet's ability to deliver high segmentation accuracy while minimizing resource usage. The integration of frequency attention mechanisms enables FAENet to focus computational effort on the most relevant spectral and spatial features, leading to improved performance at a lower computational cost.
Compared to CNN-based models, FAENet incurs slightly higher computational costs than MACU-Net (55.1 G) and ResUNet-a (58.2 G) while delivering substantially better segmentation results. The 42.2 ms inference time remains competitive, demonstrating that the additional complexity introduced by spectral–spatial attention does not significantly impact processing speed. This balance underscores FAENet's capability to combine the strengths of CNN and transformer designs.
FAENet’s frequency domain approach enhances computational efficiency by leveraging DWT to decompose features, allowing for the targeted refinement of spectral–spatial representations. This design reduces redundancy and focuses processing power where it is most impactful, resulting in both faster inference and superior accuracy compared to conventional methods.
In summary, FAENet’s efficiency analysis reveals that it is both computationally economical and highly effective, setting a new standard for balancing speed, complexity, and accuracy in semantic segmentation of remote sensing images. Its scalability and efficiency make it an excellent choice for diverse applications, including real-time processing and large-scale geographic analysis.
4.6. Effects of ICCA and CCCA
Table 5 presents the results of an ablation study assessing the contributions of the ICCA and CCCA modules to the performance of FAENet on the ISPRS Potsdam and LoveDA datasets. The study includes four configurations: FAENet without ICCA and CCCA, FAENet with only ICCA, FAENet with only CCCA, and FAENet with both modules combined. This comprehensive analysis reveals that the combined incorporation of ICCA and CCCA achieves the highest scores across all metrics, emphasizing their complementary roles in refining spectral–spatial representations.
The baseline configuration (FAENet without ICCA and CCCA) achieves AF/OA/mIoU scores of 82.01/81.34/73.15 on the Potsdam dataset and 66.22/63.45/57.93 on the LoveDA dataset. While these results are competitive, they are significantly lower than those achieved by FAENet with either ICCA or CCCA individually, and even more so when both modules are combined.
When only ICCA is incorporated, FAENet achieves AF/OA/mIoU scores of 85.95/84.59/76.97 on the Potsdam dataset and 69.78/67.11/61.07 on LoveDA, highlighting ICCA’s ability to capture spectral nuances within individual frequency components. Incorporating only CCCA results in AF/OA/mIoU scores of 86.67/85.40/77.72 on Potsdam and 70.46/67.76/61.66 on LoveDA, demonstrating that CCCA effectively enhances cross-frequency interactions, further improving segmentation performance.
The combined use of ICCA and CCCA achieves the highest scores, illustrating their synergistic effect in enhancing feature representation. ICCA focuses on refining channel-specific spectral information, while CCCA facilitates cross-component interaction, enabling comprehensive spectral–spatial context modeling. Together, these modules improve class-wise segmentation precision, ensure better boundary preservation, and enhance feature discrimination, particularly in complex remote sensing scenarios.
These findings validate the design of the frequency attention mechanism and confirm that the integration of ICCA and CCCA is critical to achieving state-of-the-art performance in semantic segmentation tasks. The consistent improvements across the ISPRS Potsdam and LoveDA datasets further underscore the generalizability of the proposed approach.
5. Conclusions
This study introduced FAENet, a novel frequency attention-enhanced network designed for the semantic segmentation of HRRSIs. By leveraging the proposed FreqA module, FAENet effectively integrates spectral and spatial contexts, addressing the limitations of traditional CNN- and transformer-based approaches in capturing fine-grained spectral details. Experimental evaluations on the ISPRS Potsdam and LoveDA datasets demonstrate that FAENet outperforms state-of-the-art methods, achieving superior segmentation accuracy, particularly in complex and heterogeneous scenes. Ablation studies further validate the contributions of the ICCA and CCCA modules, underscoring their complementary roles in enhancing spectral–spatial feature representation.
An important aspect of FAENet is its potential transferability to other datasets or applications beyond the ISPRS Potsdam and LoveDA datasets. Given that FAENet effectively models both spectral and spatial information, it can be applied to other high-resolution remote sensing datasets with similar characteristics, such as urban mapping or land-use classification tasks. Moreover, the frequency attention mechanism is designed to handle diverse spectral variations, making it adaptable to datasets with varying spectral bands, including hyperspectral and multispectral imagery.
Additionally, FAENet’s encoder–decoder architecture, combined with frequency attention, positions it as a candidate for broader remote sensing applications, such as object detection, instance segmentation, and even change detection. Future work could explore fine-tuning FAENet on such tasks, potentially leading to enhanced generalization across different remote sensing domains.
In conclusion, FAENet represents a significant advancement in remote sensing semantic segmentation, with its innovative frequency domain approach setting a new benchmark for feature refinement. Future research could extend this framework to incorporate additional modalities, such as hyperspectral and LiDAR data, which provide richer spectral and elevation information, respectively. By leveraging the fine spectral granularity of hyperspectral imagery and the precise elevation details from LiDAR data, FAENet has the potential to further enhance segmentation performance in applications requiring high spatial–spectral discrimination or detailed topographic analysis. Furthermore, future work could explore its application in other remote sensing tasks like object detection and change detection. The promising results of FAENet pave the way for more robust, generalizable, and efficient methods in remote sensing image analysis.