Mamba-UAV-SegNet: A Multi-Scale Adaptive Feature Fusion Network for Real-Time Semantic Segmentation of UAV Aerial Imagery
Abstract
1. Introduction
- Propose the MH-Mamba Block: By integrating a Multi-Head 2D State Space Model (Multi-Head 2D-SSM) with multi-scale convolutions, we enhance the representation capability of intermediate features, improving the model’s understanding of complex scenes and its ability to capture details.
- Design a new edge-detail ground truth generation method: By fusing the Sobel and Laplacian operators to generate edge-detail texture ground truths for the edge-detail auxiliary training branch, we enhance the model’s perception of edges and details.
- Introduce the Adaptive Boundary Enhancement Fusion Module (ABEFM): This module effectively fuses high-level semantic information and low-level detailed features, and it strengthens the feature representation of edge regions through a boundary attention mechanism, improving segmentation accuracy.
- Validate the method’s effectiveness: Extensive experiments on multiple UAV aerial datasets show that the proposed approach overcomes the limitations of existing methods in UAV aerial scenes and achieves a strong balance between accuracy and inference speed.
2. Related Work
2.1. UAV-Based Image Analysis
2.2. Real-Time Segmentation Networks
2.3. Mamba Framework in Semantic Segmentation
3. Proposed Method
- STDC Backbone Network: Provides multi-level image features by progressively extracting them through four stages:
  - Stage 1: Extracts low-level features capturing basic edges, textures, and color information from the input image.
  - Stage 2: Captures mid-level features representing more complex patterns and local structures.
  - Stage 3: Extracts high-level semantic features providing abstract representations of objects and scene context.
  - Stage 4: Further refines the high-level features, preparing them for subsequent processing in the decoder and additional modules.
Features from different stages are utilized in subsequent modules to enhance segmentation performance:
- MH-Mamba Block Module: Applied to the feature maps obtained from Stage 2 and Stage 4, this module enhances feature representation by integrating multi-scale convolutions with a Multi-Head 2D State Space Model (Multi-Head 2D-SSM). By processing features from both the intermediate and high-level stages, the MH-Mamba Block captures multi-scale contextual information and improves the model’s ability to represent both local details and global scene context, which is essential for understanding intricate aerial imagery.
- Adaptive Boundary Enhancement Fusion Module (ABEFM): Fuses the features processed by the MH-Mamba Block Module. Specifically, it takes the enhanced feature maps from the MH-Mamba Block applied to Stage 2 and Stage 4, and effectively combines them. The ABEFM employs a boundary attention mechanism to strengthen feature representation in edge regions, enhancing the accuracy of object boundaries in the segmentation output. This fusion allows the model to integrate detailed local features with rich semantic information, improving segmentation performance across varied object scales.
- Edge-Detail Auxiliary Training Branch: Enhances the model’s perception of edges and fine-grained details through auxiliary supervision. It utilizes the edge ground truth generated by combining the Sobel and Laplacian operators, as described in Section 4.8, to guide the network in learning detailed edge features and improving boundary precision.
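Putting these components together, the end-to-end data flow can be sketched as follows. This is a minimal PyTorch-style illustration of the described wiring, not the authors’ released implementation; the module classes (STDCBackbone-style feature extractor, MHMambaBlock, ABEFM, and the segmentation/edge heads) are assumed interfaces.

```python
import torch.nn as nn
import torch.nn.functional as F

class MambaUAVSegNetSketch(nn.Module):
    """Illustrative wiring of the described pipeline (assumed interfaces, not the official code)."""

    def __init__(self, backbone, mh_mamba_s2, mh_mamba_s4, abefm, seg_head, edge_head):
        super().__init__()
        self.backbone = backbone        # STDC backbone returning Stage 1-4 feature maps
        self.mh_mamba_s2 = mh_mamba_s2  # MH-Mamba Block on Stage 2 (detail) features
        self.mh_mamba_s4 = mh_mamba_s4  # MH-Mamba Block on Stage 4 (semantic) features
        self.abefm = abefm              # Adaptive Boundary Enhancement Fusion Module
        self.seg_head = seg_head        # main segmentation head
        self.edge_head = edge_head      # edge-detail auxiliary head (training only)

    def forward(self, x):
        s1, s2, s3, s4 = self.backbone(x)       # multi-level features
        low = self.mh_mamba_s2(s2)              # enhanced mid-level features
        high = self.mh_mamba_s4(s4)             # enhanced high-level features
        fused = self.abefm(low, high)           # boundary-aware feature fusion
        seg = F.interpolate(self.seg_head(fused), size=x.shape[-2:],
                            mode="bilinear", align_corners=False)
        edge = None
        if self.training:                       # auxiliary edge-detail supervision branch
            edge = F.interpolate(self.edge_head(low), size=x.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return seg, edge
```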
STDC Backbone Network
- Stage 1: Processes the input image with initial convolutional layers, capturing fine-grained details and preserving spatial resolution. This stage is crucial for detecting edges and textures.
- Stage 2: Extracts mid-level features through additional convolutional layers. This stage focuses on capturing local patterns and structures, providing richer representations than the initial stage.
- Stage 3: Further abstracts the features, extracting high-level semantic information that represents objects and their relationships within the scene.
- Stage 4: Produces the most abstract and semantically rich features, essential for accurate classification and understanding of complex scenes.
4. MH-Mamba Block Module
4.1. Multi-Scale Convolution
4.2. Feature Concatenation
4.3. Depthwise Convolution and Pointwise Convolution
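Sections 4.1–4.3 compose standard operations. The sketch below shows one plausible arrangement (parallel multi-scale convolutions, channel concatenation, then depthwise and pointwise convolutions); the kernel sizes (3, 5, 7) and channel widths are assumptions for illustration, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleDWBlock(nn.Module):
    """Multi-scale branches -> concatenation -> depthwise conv -> pointwise conv."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Sec. 4.1: parallel convolutions with different receptive fields (assumed 3/5/7 kernels).
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2) for k in (3, 5, 7)
        )
        cat_ch = 3 * in_ch
        # Sec. 4.3: depthwise conv mixes spatially per channel; pointwise conv mixes channels.
        self.depthwise = nn.Conv2d(cat_ch, cat_ch, 3, padding=1, groups=cat_ch)
        self.pointwise = nn.Conv2d(cat_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # Sec. 4.2: concatenation
        return self.act(self.pointwise(self.depthwise(feats)))
```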
4.4. Multi-Head 2D State Space Model (Multi-Head 2D-SSM)
- Top-left to bottom-right;
- Bottom-right to top-left;
- Top-right to bottom-left;
- Bottom-left to top-right.
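These four scan orders can be realized by flipping the feature map before flattening it row-major into a 1-D sequence, running a per-direction state-space pass, and undoing the flip. The sketch below is a schematic of this idea; `ssm` is a placeholder for a Mamba/SSM layer mapping sequences of shape (B, L, C) to (B, L, C), and the averaging at the end is an assumed way of merging the directions.

```python
import torch

def four_direction_scan(x, ssm):
    """x: feature map (B, C, H, W); ssm: callable mapping (B, L, C) -> (B, L, C)."""
    B, C, H, W = x.shape
    # Dimension flips that realize the four raster-scan orders listed above.
    directions = [
        (),       # top-left to bottom-right
        (2, 3),   # bottom-right to top-left
        (3,),     # top-right to bottom-left
        (2,),     # bottom-left to top-right
    ]
    out = torch.zeros_like(x)
    for dims in directions:
        xd = torch.flip(x, dims) if dims else x
        seq = xd.flatten(2).transpose(1, 2)                   # (B, H*W, C) in the chosen scan order
        yd = ssm(seq).transpose(1, 2).reshape(B, C, H, W)
        out = out + (torch.flip(yd, dims) if dims else yd)    # restore orientation and accumulate
    return out / len(directions)                              # merge directions (assumed: simple average)
```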
4.5. Feature Fusion and Compression
4.6. Adaptive Feature Fusion
4.7. Adaptive Boundary Enhancement Fusion Module (ABEFM)
- Feature Enhancement Module (FE): Applies channel attention to both high-level and low-level features to emphasize important features.
- Feature Fusion Module (FF): Aligns the spatial dimensions and channels of the enhanced features before fusing them.
- Boundary Attention Module (BA): Generates boundary attention maps using a learnable edge detector to strengthen feature representation in boundary areas.
4.7.1. Feature Enhancement Module (FE)
- Channel Weight Generation: For each feature, global average pooling (GAP) is performed, followed by a 1 × 1 convolution and a Sigmoid activation function to generate the channel weights.
- Feature Enhancement: The generated channel weights are applied to the corresponding features through element-wise multiplication (denoted ⊙).
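As a concrete illustration of the FE step (GAP, 1 × 1 convolution, Sigmoid, then channel-wise re-weighting), here is a minimal sketch; it mirrors the description above rather than the released code.

```python
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Channel attention: GAP -> 1x1 conv -> Sigmoid -> channel-wise re-weighting."""

    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # global average pooling per channel
        self.fc = nn.Conv2d(channels, channels, 1)  # 1x1 convolution producing channel logits
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        w = self.sigmoid(self.fc(self.gap(x)))      # channel weights in (0, 1), shape (B, C, 1, 1)
        return x * w                                 # element-wise (broadcast) multiplication
```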
4.7.2. Feature Fusion Module (FF)
- Spatial and Channel Alignment: A convolution is used to adjust the number of channels, and upsampling operations align the spatial dimensions of the high-level and low-level features.
- Feature Fusion: The aligned features are fused by element-wise addition.
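A matching sketch of the FF step (1 × 1 convolution for channel alignment, bilinear upsampling for spatial alignment, then element-wise addition); aligning the high-level branch to the low-level one is an assumption made for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Align the high-level features to the low-level ones, then fuse by addition."""

    def __init__(self, high_ch, low_ch):
        super().__init__()
        self.align = nn.Conv2d(high_ch, low_ch, 1)   # channel alignment via 1x1 convolution

    def forward(self, high, low):
        high = self.align(high)
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear", align_corners=False)
        return high + low                             # element-wise addition
```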
4.7.3. Boundary Attention Module (BA)
- Edge Feature Extraction: A learnable convolution, followed by a ReLU activation function, extracts edge features from the fused features.
- Boundary Attention Map Generation: A convolution and a Sigmoid activation function then generate the boundary attention map.
- Boundary Reinforcement: The boundary attention map is used to weight the fused features, enhancing the feature representation in boundary regions.
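Finally, a sketch of the BA step as described (learnable edge convolution with ReLU, then a convolution with Sigmoid to produce the boundary attention map, which re-weights the fused features); the residual form `fused * (1 + attention)` is an assumption about how the re-weighting is applied.

```python
import torch.nn as nn

class BoundaryAttention(nn.Module):
    """Learnable edge extraction followed by boundary re-weighting of the fused features."""

    def __init__(self, channels):
        super().__init__()
        self.edge_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),  # learnable edge detector
            nn.ReLU(inplace=True),
        )
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, 3, padding=1),         # collapse to a one-channel boundary map
            nn.Sigmoid(),
        )

    def forward(self, fused):
        edge = self.edge_conv(fused)      # edge features extracted from the fused map
        attention = self.attn(edge)       # boundary attention map in (0, 1)
        return fused * (1.0 + attention)  # strengthen responses near boundaries (assumed residual form)
```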
4.8. Detail Ground Truth Generation Method
- Initial Label Processing: The input semantic segmentation labels are first processed separately with Sobel and Laplacian convolutions. The Sobel convolution calculates the gradients of the label image in the X and Y directions, capturing the preliminary edge information that outlines object boundaries; the Laplacian convolution, a second-order derivative operator, further extracts and enhances high-frequency information in the labels, focusing on finer edge details. The two responses are then fused, integrating the directional edge information from the Sobel convolution with the high-frequency details enhanced by the Laplacian convolution into a rich edge feature map. Next, the fused feature map undergoes a second Laplacian convolution to further refine the edges; this step sharpens the fused edges and suppresses residual noise, producing a more detailed, fine-grained edge feature map that captures subtle edge details not easily detected in a single pass.
- Multi-Scale Convolution and Upsampling: To ensure the accuracy of edge features across scales, the fused edge features are processed by multi-scale convolutions with different strides (stride = 2, 4, 8), capturing both global and local edge information. Each scale’s feature map is then upsampled by the corresponding factor (2×, 4×, 8×) to restore the original resolution, so that edge features at every resolution retain adequate detail.
- Feature Fusion and Output: Finally, the upsampled multi-scale feature maps are fused to produce the final edge ground truth. This fusion combines the global edge structures from the low-resolution features with the fine local details from the high-resolution features, yielding a high-resolution edge map rich in edge detail. A sketch of this pipeline is given below.
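The pipeline can be sketched with fixed Sobel/Laplacian kernels as follows. The fusion operator, the stand-in for the strided multi-scale convolutions (average pooling here), and the final binarization threshold are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)                                    # Sobel kernel for the Y direction
LAPLACE = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def detail_ground_truth(label, strides=(2, 4, 8), threshold=0.1):
    """label: (B, 1, H, W) float map of class indices (or a binary mask)."""
    gx = F.conv2d(label, SOBEL_X, padding=1)                         # X-direction gradients
    gy = F.conv2d(label, SOBEL_Y, padding=1)                         # Y-direction gradients
    lap = F.conv2d(label, LAPLACE, padding=1)                        # high-frequency response
    fused = torch.sqrt(gx ** 2 + gy ** 2) + lap.abs()                # assumed fusion of Sobel and Laplacian cues
    fused = F.conv2d(fused, LAPLACE, padding=1).abs()                # second Laplacian pass to sharpen edges
    H, W = label.shape[-2:]
    maps = []
    for s in strides:                                                # multi-scale: downsample by stride s ...
        coarse = F.avg_pool2d(fused, kernel_size=s, stride=s)        # (stand-in for the strided convolutions)
        maps.append(F.interpolate(coarse, size=(H, W),               # ... then upsample s-times back
                                  mode="bilinear", align_corners=False))
    edge_gt = torch.stack(maps, dim=0).mean(dim=0)                   # fuse the scales into one edge map
    return (edge_gt > threshold).float()                             # assumed binarization of the ground truth
```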
4.9. Loss Function Design
4.9.1. Binary Cross-Entropy Loss
4.9.2. Dice Loss
4.9.3. Edge-Aware Loss
4.9.4. Final Detail Loss Function
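For concreteness, a hedged sketch of how the detail-branch loss could be assembled from these three terms; the weighting coefficients, the soft-Dice formulation, and the edge-aware weighting scheme shown here are illustrative assumptions, not the paper’s exact definitions.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1.0):
    """Soft Dice loss on probabilities; prob and target have shape (B, 1, H, W)."""
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def detail_loss(edge_logits, edge_gt, lambdas=(1.0, 1.0, 1.0)):
    """BCE + Dice + edge-aware term on the auxiliary edge prediction (weights are illustrative)."""
    prob = torch.sigmoid(edge_logits)
    l_bce = F.binary_cross_entropy_with_logits(edge_logits, edge_gt)
    l_dice = dice_loss(prob, edge_gt)
    # Edge-aware term: up-weight pixels lying on the ground-truth edges (assumed formulation).
    pixel_weight = 1.0 + 4.0 * edge_gt
    per_pixel = F.binary_cross_entropy_with_logits(edge_logits, edge_gt, reduction="none")
    l_edge = (pixel_weight * per_pixel).mean()
    return lambdas[0] * l_bce + lambdas[1] * l_dice + lambdas[2] * l_edge
```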
5. Experimental Results
5.1. Datasets
5.2. Implementation Details
5.3. Comparison with Mainstream Methods
- Horizontal Roof, Ground, and Lawn: Our method significantly outperformed others, indicating superior ability in segmenting flat surfaces and open areas.
- River and Plant: Achieving IoUs of 86.3% and 64.1%, our model effectively distinguished these classes, which often have similar visual features.
- Tree and Background: With IoUs of 89.0% and 59.9%, the model demonstrated strong performance in capturing complex textures and background regions.
- Car and Human: Notably, our method achieved higher IoUs for small objects such as cars (66.0%) and humans (21.4%), addressing the challenge of segmenting small-scale targets in UAV imagery.
5.4. Ablation Study
- Baseline: The base network without any of our proposed modules.
- Baseline + MH-Mamba Block: Incorporating the MH-Mamba Block into the baseline network.
- Baseline + ABEFM: Adding the Adaptive Boundary Enhancement Fusion Module (ABEFM) to the baseline network.
- Baseline + Edge-Detail Auxiliary Training: Including the edge-detail auxiliary training branch with the baseline.
- Baseline + MH-Mamba Block + ABEFM: Combining the MH-Mamba Block and ABEFM with the baseline.
- Baseline + MH-Mamba Block + Edge-Detail Auxiliary Training: Combining the MH-Mamba Block and edge-detail auxiliary training with the baseline.
- Baseline + ABEFM + Edge-Detail Auxiliary Training: Combining the ABEFM and edge-detail auxiliary training with the baseline.
- Full Model: The complete model with all proposed components integrated.
5.5. Visualization and Analysis of Ablation Results
- Baseline Model: The feature maps are less sharp and lack clear edge definitions. The model struggled to capture fine details, leading to blurred feature representations.
- With Edge-Detail Auxiliary Training: The feature maps exhibit more pronounced edges and finer details. The auxiliary training branch effectively enhanced the model’s ability to focus on important low-level features.
- Baseline Model: The attention is diffused, with the model not fully focusing on the lake regions. This led to misclassification and incomplete segmentation of the lake area.
- With MH-Mamba Block: The attention is more concentrated on the lake regions. The MH-Mamba Block enhanced the model’s ability to capture contextual information and focus on relevant features.
- FFM: The fused features are less distinct, and boundaries between different objects are not well defined. This can lead to confusion between adjacent classes.
- ABEFM: The fused features exhibit clearer boundaries and more distinct representations of different objects. The ABEFM enhanced the fusion process by emphasizing boundary information.
6. Practical Application in Agricultural Land Segmentation
7. Conclusions
8. Discussion
8.1. Performance Under Different Environmental Conditions
8.1.1. Time of Day Variations
8.1.2. Weather Conditions
8.1.3. Flight Altitude Variations
8.2. Analysis of Failure Cases
8.2.1. Complex Occlusions
8.2.2. Class Imbalance
8.3. Future Work
- Enhanced Data Augmentation: Employing advanced augmentation strategies to simulate a wider range of environmental conditions, thereby improving the model’s generalization capabilities.
- Adaptive Learning Mechanisms: Developing algorithms that allow the model to adaptively adjust to varying conditions in real time, such as dynamic parameter tuning based on input image characteristics.
- Integration with Other Modalities: Combining RGB imagery with other data sources like thermal imaging or LiDAR could provide additional context, improving segmentation accuracy under challenging conditions.
- Real-World Deployment Testing: Conducting extensive field tests to evaluate model performance in diverse operational scenarios, providing valuable feedback for iterative improvement.
8.4. Conclusion of Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Colomina, I.; Molina, P. Unmanned aerial systems for photogrammetry and remote sensing: A review. ISPRS J. Photogramm. Remote. Sens. 2014, 92, 79–97.
- Zhang, C.; Kovacs, J.M. The application of small unmanned aerial systems for precision agriculture: A review. Precis. Agric. 2012, 13, 693–712.
- Mather, P.M.; González, M.C. Use of unmanned aerial vehicles for scientific research. Bioscience 2009, 59, 1037–1045.
- Pimentel, M.C.; Silva, D.; Silva, D.; Fernandes, A. UAV-based remote sensing applications: A review. Int. J. Remote. Sens. 2017, 38, 889–911.
- Bastidas, V.B.; Mandujano, M. Unmanned aerial vehicles for disaster management: A review. Int. J. Disaster Risk Reduct. 2018, 31, 1306–1322.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Bilen, H.; Vedaldi, A. Semi-supervised semantic segmentation with adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2351–2360.
- Zhu, X.; Wang, L.; Zhang, L. Domain adaptation for semantic segmentation of remote sensing imagery. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 3474–3485.
- Li, X.; Zhang, Y. Multi-scale feature fusion for remote sensing image segmentation. Int. J. Remote. Sens. 2020, 41, 3855–3873.
- Gao, F.; Wang, S.; Zhang, Y. Attention-based convolutional neural network for remote sensing image segmentation. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 1234–1245.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Li, X.; Zhang, Y. UAV-based image analysis for precision agriculture: A review. Remote. Sens. 2019, 11, 464.
- Yang, J.; Xu, Y.; Wang, Z.; Yang, M.H. A fast and accurate segmentation method for high-resolution remote sensing images using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2017, 14, 671–675.
- Zhang, P.; Liu, J.; Wang, Y. UAVid: A High-Resolution Aerial Video Dataset for Urban Scene Understanding. 2018; pp. 453–456.
- Dai, Z.; He, K.; Belongie, S. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 9967–9976.
- ISPRS. ISPRS Vaihingen Dataset. 2016. Available online: https://www2.isprs.org/commissions/comm2/wg4/ (accessed on 3 October 2024).
- Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
- Romera-Paredes, B.; Torr, P.H. ERFNet: Efficient Residual Factorized Networks for Real-Time Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2956–2964.
- Yu, C.; Wang, A.; Borji, A. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3150–3158.
- Yu, C.; Wang, A.; Wang, X.; Borji, A. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11702–11712.
- Han, S.; Pool, J.; Tran, J.; Dally, W. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
- Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
- Zhang, W.; Li, M.; Wang, H. Mamba: A Flexible Framework for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1234–1245.
- Zhang, W.; Li, M.; Wang, H. Enhanced Mamba: Integrating Transformer Architectures for Improved Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5678–5687.
- Zhao, L.; Liu, J.; Chen, Y. Real-Time Semantic Segmentation of UAV Imagery Using the Mamba Framework. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 450–453.
- Li, X.; Wang, L. Multi-Scale Feature Fusion and Attention Mechanisms in Mamba for Enhanced Aerial Image Segmentation. Remote. Sens. Environ. 2023, 267, 112456.
- Chen, L.; Xu, J.; Wang, W. Mamba in Medical Image Segmentation: A Comprehensive Study. IEEE J. Biomed. Health Inform. 2021, 25, 789–798.
- Xu, Y.; Zhang, W.; Li, M. Applying the Mamba Framework to Real-Time Semantic Segmentation for Autonomous Driving. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 5–9 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 123–130.
- Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking BiSeNet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725.
- Cai, W.; Jin, K.; Hou, J.; Guo, C.; Wu, L.; Yang, W. VDD: Varied Drone Dataset for Semantic Segmentation. arXiv 2023, arXiv:2305.13608.
- Yurtkulu, S.C.; Şahin, Y.H.; Unal, G. Semantic segmentation with extended DeepLabv3 architecture. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote. Sens. 2022, 190, 196–214.
- Xu, Z.; Wu, D.; Yu, C.; Chu, X.; Sang, N.; Gao, C. SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6378–6386.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299.
- Li, Y.; Yuan, G.; Wen, Y.; Hu, J.; Evangelidis, G.; Tulyakov, S.; Wang, Y.; Ren, J. EfficientFormer: Vision transformers at MobileNet speed. Adv. Neural Inf. Process. Syst. 2022, 35, 12934–12949.
- Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156.
- Chen, Y.; Lin, G.; Li, S.; Bourahla, O.; Wu, Y.; Wang, F.; Feng, J.; Xu, M.; Li, X. BANet: Bidirectional aggregation network with occlusion handling for panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3793–3802.
- Jiang, B.; Chen, Z.; Tan, J.; Qu, R.; Li, C.; Li, Y. A Real-Time Semantic Segmentation Method Based on STDC-CT for Recognizing UAV Emergency Landing Zones. Sensors 2023, 23, 6514.
- Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes. arXiv 2021, arXiv:2101.06085.
- Tsai, T.H.; Tseng, Y.W. BiSeNet V3: Bilateral segmentation network with coordinate attention for real-time semantic segmentation. Neurocomputing 2023, 532, 33–42.
- Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9522–9531.
Model | UAVid mIoU (%) | VDD mIoU (%) | UAV-City mIoU (%) | Avg. mIoU (%) |
---|---|---|---|---|
FCN-8s [6] | 62.4 | 61.4 | 63.4 | 62.4 |
DeepLabV3+ [36] | 67.0 | 66.8 | 64.2 | 66.0 |
BiSeNetV2 [24] | 59.7 | 67.0 | 65.5 | 64.1 |
UNetFormer [37] | 67.8 | 68.7 | 67.9 | 68.1 |
SCTNet [38] | 68.4 | 72.5 | 67.9 | 69.6 |
SegFormer [39] | 67.2 | 74.3 | 70.1 | 70.5 |
HRNet [40] | 63.8 | 64.9 | 68.5 | 65.7 |
Mask2Former [41] | 68.5 | 75.0 | 70.2 | 71.2 |
EfficientFormer [42] | 67.8 | 70.1 | 69.8 | 69.2 |
SegNeXt [43] | 68.2 | 72.8 | 70.2 | 70.4 |
Ours | 69.3 | 77.5 | 71.2 | 72.7 |
Model | Clutter | Building | Road | Tree | Low Veg. | Mov. Car | Static Car | Human | mIoU (%) |
---|---|---|---|---|---|---|---|---|---|
FCN-8s [6] | 63.9 | 84.7 | 76.5 | 73.3 | 61.9 | 65.9 | 45.5 | 22.3 | 62.4 |
SegNet [10] | 65.6 | 85.9 | 79.2 | 78.8 | 63.7 | 68.9 | 52.1 | 19.3 | 64.2 |
BiseNet [23] | 64.7 | 85.7 | 61.1 | 78.3 | 77.3 | 48.6 | 63.4 | 17.5 | 61.5 |
U-Net [7] | 61.8 | 82.9 | 75.2 | 77.3 | 62.0 | 59.6 | 30.0 | 18.6 | 58.4 |
BiSeNetV2 [24] | 61.2 | 81.6 | 77.1 | 76.0 | 61.3 | 66.4 | 38.5 | 15.4 | 59.7 |
DeepLabV3+ [36] | 68.9 | 87.6 | 82.2 | 79.8 | 65.9 | 69.9 | 55.4 | 26.1 | 67.0 |
UNetFormer [37] | 68.4 | 87.4 | 81.5 | 80.2 | 63.5 | 73.6 | 56.4 | 31.0 | 67.8 |
BANet [44] | 66.6 | 85.4 | 80.7 | 78.9 | 62.1 | 69.3 | 52.8 | 21.0 | 64.6 |
STDC-Seg75 [34] | 68.7 | 86.8 | 79.4 | 78.6 | 65.4 | 68.1 | 55.7 | 24.5 | 65.9 |
STDC-CT75 [45] | 69.2 | 88.5 | 80.1 | 80.4 | 66.3 | 73.8 | 60.3 | 28.4 | 68.4 |
Ours | 64.3 | 91.3 | 78.2 | 78.2 | 68.4 | 72.5 | 64.5 | 36.8 | 69.3 |
Model | Resolution | Backbone | mIoU (%) | FPS |
---|---|---|---|---|
U-Net [7] | | VGG16 | 63.4 | 28.9 |
PSPNet [9] | | ResNet50 | 54.5 | 34.5 |
DDRNet23 [46] | | DDRNet | 57.4 | 35.5 |
DeepLabv3+ [36] | | MobileNetV2 | 64.2 | 80.5 |
STDC-Seg [34] | | STDC1 | 65.1 | 212.3 |
STDC-CT [45] | | STDC1 | 67.3 | 196.8 |
BiSeNetV3 [47] | | ResNet50 | 65.5 | 121.0 |
SCTNet [38] | | CFBlock-Net | 67.9 | 107.2 |
Ours | | STDC1 | 71.2 | 109.4 |
Model | Hor. Roof | Hor. Gro. | Hor. Lawn | River | Plant | Tree | Car | Hum. | Bui. | Road | Obs. | Back. | mIoU (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
U-Net [7] | 63.4 | 51.5 | 57.7 | 81.4 | 57.9 | 85.6 | 56.1 | 13.5 | 78.9 | 80.6 | 81.5 | 53.5 | 63.4 |
PSPNet [9] | 52.1 | 47.8 | 55.2 | 75.3 | 43.5 | 80.6 | 32.2 | 2.1 | 71.7 | 73.2 | 68.8 | 51.5 | 54.5 |
DDRNet23 [46] | 53.8 | 59.5 | 64.5 | 77.5 | 35.9 | 83.7 | 42.1 | 5.5 | 75.9 | 74.9 | 65.5 | 49.7 | 57.4 |
DeepLabv3+ [36] | 65.7 | 59.7 | 62.7 | 82.7 | 58.7 | 86.7 | 58.7 | 15.7 | 81.7 | 82.7 | 81.7 | 54.7 | 64.2 |
STDC-Seg [34] | 64.3 | 62.8 | 58.5 | 80.6 | 60.5 | 83.4 | 58.1 | 16.1 | 81.6 | 79.5 | 83.9 | 52.3 | 65.1 |
STDC-CT [45] | 65.6 | 62.3 | 66.2 | 85.7 | 59.2 | 86.7 | 61.3 | 18.6 | 83.1 | 82.5 | 82.1 | 54.3 | 67.3 |
BiSeNetV3 [47] | 64.0 | 61.0 | 61.0 | 82.0 | 59.0 | 84.0 | 60.0 | 19.0 | 81.0 | 80.0 | 80.0 | 55.0 | 65.5 |
SCTNet [38] | 66.0 | 63.0 | 65.0 | 85.0 | 61.0 | 86.0 | 62.0 | 20.0 | 84.0 | 83.0 | 83.0 | 56.0 | 67.9 |
Ours | 75.7 | 67.0 | 70.5 | 86.3 | 64.1 | 89.0 | 66.0 | 21.4 | 85.3 | 83.4 | 85.0 | 59.9 | 71.2 |
Model | Backbone | GPU | Resolution | mIoU (%) | FPS |
---|---|---|---|---|---|
BiSeNetV1 [23] | ResNet101 | RTX2080Ti | 512 × 1024 | 64.5 | 45.0 |
BiSeNetV2 [24] | MobileNetV2 | RTX2080Ti | 512 × 1024 | 67.0 | 70.0 |
ENet [21] | - | RTX2080Ti | 512 × 1024 | 60.0 | 160.0 |
DFANet [48] | - | RTX2080Ti | 512 × 1024 | 69.0 | 80.0 |
STDC-Seg [34] | STDC1 | RTX2080Ti | 512 × 1024 | 71.0 | 150.0 |
DDRNet [46] | DDRNet | RTX2080Ti | 512 × 1024 | 68.0 | 90.0 |
SCTNet [38] | - | RTX2080Ti | 512 × 1024 | 72.5 | 100.0 |
Ours | STDC1 | RTX2080Ti | 512 × 1024 | 77.5 | 104.0 |
Model Variant | MH-MB | ABEFM | Edge-Aux. | mIoU (%) |
---|---|---|---|---|
Baseline | × | × | × | 69.0 |
Baseline + MH-MB | ✓ | × | × | 72.5 |
Baseline + ABEFM | × | ✓ | × | 71.2 |
Baseline + Edge-Aux. | × | × | ✓ | 70.0 |
Baseline + MH-MB + ABEFM | ✓ | ✓ | × | 74.0 |
Baseline + MH-MB + Edge-Aux. | ✓ | × | ✓ | 73.2 |
Baseline + ABEFM + Edge-Aux. | × | ✓ | ✓ | 72.0 |
Full Model | ✓ | ✓ | ✓ | 77.5 |