MFMamba: A Mamba-Based Multi-Modal Fusion Network for Semantic Segmentation of Remote Sensing Images
Abstract
1. Introduction
- A novel multi-modal fusion network based on Mamba, named MFMamba, is proposed for semantic segmentation of remote sensing images. In this network, a Mamba-based auxiliary encoder effectively captures global information from the DSM while keeping the computational complexity of the network low.
- An innovative feature fusion block (FFB) is designed to effectively fuse the features extracted from an HRRSI and its corresponding DSM data: its multi-convolutional kernel attention (MCKA) unit further captures local details, while its efficient additive attention (EAA) unit effectively captures long-range dependencies (a hedged sketch of such a block follows this list).
- Extensive comparison experiments on the Vaihingen and Potsdam datasets demonstrate that the proposed MFMamba achieves superior semantic segmentation performance with low computational complexity compared with seven state-of-the-art methods.
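The paper's exact FFB design is given in Section 3.4; as a rough illustration of how an MCKA-style local unit and a SwiftFormer-style EAA unit could be combined, a minimal PyTorch sketch follows. All module names, kernel sizes, and wiring here are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a feature fusion block (FFB); internals are
# assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

class MCKAUnit(nn.Module):
    """Multi-convolutional kernel attention: depthwise convolutions with
    several kernel sizes, summed into a spatial attention map (local details)."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = sum(branch(x) for branch in self.branches)
        return x * torch.sigmoid(self.proj(attn))

class EAAUnit(nn.Module):
    """Efficient additive attention in the spirit of SwiftFormer: tokens are
    pooled into one global query, so cost is linear in sequence length."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, N, C)
        q, k = self.to_q(x), self.to_k(x)
        w = torch.softmax(q.sum(-1, keepdim=True) * self.scale, dim=1)
        global_q = (w * q).sum(dim=1, keepdim=True)        # (B, 1, C)
        return self.proj(global_q * k) + x                 # long-range mixing

class FFB(nn.Module):
    """Fuses a CNN (main encoder, HRRSI) feature map with a Mamba
    (auxiliary encoder, DSM) feature map of identical shape."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = MCKAUnit(channels)
        self.context = EAAUnit(channels)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_img, f_dsm):                       # both: (B, C, H, W)
        fused = f_img + f_dsm
        local = self.local(fused)
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)          # (B, H*W, C)
        ctx = self.context(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.out(torch.cat([local, ctx], dim=1))
```

Under these assumptions, `FFB(64)(img_feat, dsm_feat)` maps two `(B, 64, H, W)` tensors to one fused `(B, 64, H, W)` map, with the MCKA branch handling local detail and the EAA branch handling global context.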
2. Related Work
2.1. Single-Modal Semantic Segmentation
2.2. Multi-Modal Semantic Segmentation
3. Method
3.1. Framework of MFMamba
3.2. Mamba-Based Auxiliary Encoder
3.3. CNN-Based Main Encoder
3.4. Feature Fusion Block
3.5. Transformer-Based Decoder
3.6. Loss Function
4. Experiments and Results
4.1. Datasets
4.2. Evaluation Metrics and Experimental Setup
4.3. Experimental Results
4.3.1. Comparison Results on the Vaihingen Dataset
4.3.2. Comparison Results on the Potsdam Dataset
4.3.3. Analysis of Computational Complexity
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gao, Y.; Zhang, S.; Zuo, D.; Yan, W.; Pan, X. TMNet: A Two-Branch Multi-Scale Semantic Segmentation Network for Remote Sensing Images. Sensors 2023, 23, 5909. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain knowledge-guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. ISPRS J. Photogramm. Remote Sens. 2022, 186, 170–189. [Google Scholar] [CrossRef]
- Xing, J.; Sieber, R.; Caelli, T. A scale-invariant change detection method for land use/cover change research. ISPRS J. Photogramm. Remote Sens. 2018, 141, 252–264. [Google Scholar] [CrossRef]
- Samie, A.; Abbas, A.; Azeem, M.M.; Hamid, S.; Iqbal, M.A.; Hasan, S.S.; Deng, X. Examining the impacts of future land use/land cover changes on climate in Punjab province, Pakistan: Implications for environmental sustainability and economic growth. Environ. Sci. Pollut. Res. 2020, 27, 25415–25433. [Google Scholar] [CrossRef] [PubMed]
- Griffiths, D.; Boehm, J. Improving public data for building segmentation from Convolutional Neural Networks (CNNs) for fused airborne lidar and image data using active contours. ISPRS J. Photogramm. Remote Sens. 2019, 154, 70–83. [Google Scholar] [CrossRef]
- Salach, A.; Bakuła, K.; Pilarska, M.; Ostrowski, W.; Górski, K.; Kurczyński, Z. Accuracy assessment of point clouds from LiDAR and dense image matching acquired using the UAV platform for DTM creation. ISPRS Int. J. Geo-Inf. 2018, 7, 342. [Google Scholar] [CrossRef]
- Ma, X.P.; Zhang, X.K.; Pun, M.O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
- Gao, L.; Li, J.; Khodadadzadeh, M.; Plaza, A.; Zhang, B.; He, Z.; Yan, H. Subspace-based support vector machines for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2014, 12, 349–353. [Google Scholar]
- Gislason, P.O.; Benediktsson, J.A.; Sveinsson, J.R. Random forests for land cover classification. Pattern Recognit. Lett. 2006, 27, 294–300. [Google Scholar] [CrossRef]
- Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
- Ma, X.P.; Zhang, X.K.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 3414293. [Google Scholar] [CrossRef]
- Hosseinpour, H.; Samadzadegan, F.; Javan, F.D. CMGFNet: A deep cross-modal gated fusion network for building extraction from very high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2022, 184, 96–115. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Peng, Z.; Guo, Z.; Huang, W.; Wang, Y.; Xie, L.; Jiao, J.; Tian, Q.; Ye, Q. Conformer: Local features coupling global representations for recognition and detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9454–9468. [Google Scholar] [CrossRef] [PubMed]
- Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
- Wang, X.; Wang, S.; Ding, Y.; Li, Y.; Wu, W.; Rong, Y.; Kong, W.; Huang, J.; Li, S.; Yang, H. State space model for new-generation network alternative to transformers: A survey. arXiv 2024, arXiv:2404.09516. [Google Scholar]
- Ruan, J.; Xiang, S. VM-UNet: Vision Mamba UNet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
- Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
- Liang, D.; Zhou, X.; Wang, X.; Zhu, X.; Xu, W.; Zou, Z.; Ye, X.; Bai, X. PointMamba: A simple state space model for point cloud analysis. arXiv 2024, arXiv:2402.10739. [Google Scholar]
- Liu, C.; Chen, K.; Chen, B.; Zhang, H.; Zou, Z.; Shi, Z. RSCaMa: Remote sensing image change captioning with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6010405. [Google Scholar] [CrossRef]
- He, X.; Cao, K.; Yan, K.; Li, R.; Xie, C.; Zhang, J.; Zhou, M. Pan-Mamba: Effective pan-sharpening with State Space Model. arXiv 2024, arXiv:2402.12192. [Google Scholar] [CrossRef]
- Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. arXiv 2024, arXiv:2404.01705. [Google Scholar] [CrossRef]
- Ma, X.P.; Zhang, X.K.; Pun, M.O. A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474. [Google Scholar] [CrossRef]
- Zhang, Y.; Sidibé, D.; Morel, O.; Mériaudeau, F. Deep multimodal fusion for semantic image segmentation: A survey. Image Vis. Comput. 2021, 105, 104042. [Google Scholar] [CrossRef]
- Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
- Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
- Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part I; pp. 213–228. [Google Scholar]
- Zhang, P.; Du, P.; Lin, C.; Wang, X.; Li, E.; Xue, Z.; Bai, X. A hybrid attention-aware fusion network (HAFNet) for building extraction from high-resolution imagery and LiDAR data. Remote Sens. 2020, 12, 3764. [Google Scholar] [CrossRef]
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
- Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Prakash, A.; Chitta, K.; Geiger, A. Multi-modal fusion transformer for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7077–7087. [Google Scholar]
- He, S.; Yang, H.; Zhang, X.; Li, X. MFTransNet: A multi-modal fusion with CNN-transformer network for semantic segmentation of HSR remote sensing images. Mathematics 2023, 11, 722. [Google Scholar] [CrossRef]
- Wan, Z.; Wang, Y.; Yong, S.; Zhang, P.; Stepputtis, S.; Sycara, K.; Xie, Y. Sigma: Siamese Mamba network for multi-modal semantic segmentation. arXiv 2024, arXiv:2404.04256. [Google Scholar]
- Zhang, R.; Xu, L.; Yang, S.; Wang, L. MambaReID: Exploiting Vision Mamba for Multi-Modal Object Re-Identification. Sensors 2024, 24, 4639. [Google Scholar] [CrossRef]
- Wang, L.B.; Li, R.; Zhang, C.; Fang, S.H.; Duan, C.X.; Meng, X.L.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.-H.; Khan, F.S. SwiftFormer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17425–17436. [Google Scholar]
- Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27706–27716. [Google Scholar]
- Li, R.; Zheng, S.Y.; Zhang, C.; Duan, C.X.; Wang, L.B.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
- Li, R.; Zheng, S.Y.; Duan, C.X.; Su, J.L.; Zhang, C. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 3063381. [Google Scholar] [CrossRef]
- Wu, H.L.; Huang, P.; Zhang, M.; Tang, W.L.; Yu, X.Y. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3314641. [Google Scholar] [CrossRef]
| Batch Size | Learning Rate Scheduler | OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|
| 16 | [25, 35, 45] | 91.60 | 90.30 | 82.74 |
| 24 | [25, 35, 45] | 91.74 | 90.37 | 82.90 |
| 28 | [25, 35, 45] | 91.50 | 90.36 | 82.85 |
| 30 | [25, 35, 45] | 91.67 | 90.38 | 82.91 |
| 36 | [25, 35, 45] | 91.38 | 90.22 | 82.58 |
| 48 | [25, 35, 45] | 91.47 | 90.35 | 82.83 |
| 32 | [25, 35, 45] | 91.64 | 90.38 | 82.93 |
| 32 | [15, 25, 35] | 91.71 | 90.45 | 83.01 |
| 32 | [10, 20, 30] | 91.81 | 90.52 | 83.13 |
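The bracketed lists in the scheduler column read as epoch milestones at which the learning rate is stepped down. A minimal PyTorch sketch under that assumption follows; the optimizer, base learning rate, and decay factor are illustrative guesses, not values from the paper.

```python
# Sketch: treating "[10, 20, 30]" as MultiStepLR epoch milestones.
# AdamW, lr=1e-3, and gamma=0.1 are assumptions, not the paper's settings.
import torch

model = torch.nn.Conv2d(3, 6, 3)                      # stand-in for MFMamba
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 20, 30], gamma=0.1)    # decay LR at those epochs

for epoch in range(45):
    # ... one training epoch over the dataset ...
    scheduler.step()                                  # advance the LR schedule
```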
| Method | Backbone | Imp. F1/IoU (%) | Bui. F1/IoU (%) | Low. F1/IoU (%) | Tre. F1/IoU (%) | Car F1/IoU (%) | OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|
| CMFNet [31] | VGG-16 | 90.11/81.99 | 94.51/89.60 | 77.72/63.56 | 90.09/81.97 | 86.52/76.24 | 89.38 | 87.79 | 78.67 |
| ABCNet [47] | ResNet-18 | 92.08/85.32 | 95.96/92.24 | 79.87/66.49 | 90.38/82.45 | 85.61/74.84 | 90.79 | 88.78 | 80.27 |
| TransUNet [21] | R50-ViT-B | 92.21/85.54 | 96.10/92.48 | 80.79/67.77 | 90.87/83.27 | 89.60/81.16 | 91.21 | 89.91 | 82.04 |
| UNetFormer [43] | ResNet-18 | 92.23/85.58 | 96.34/92.93 | 80.74/67.70 | 91.04/83.55 | 90.37/82.43 | 91.29 | 90.14 | 82.44 |
| MAResU-Net [48] | ResNet-34 | 92.66/86.33 | 96.84/93.87 | 80.57/67.47 | 90.84/83.22 | 89.93/81.71 | 91.50 | 90.17 | 82.51 |
| CMTFNet [49] | ResNet-50 | 92.68/86.37 | 96.71/93.63 | 80.47/67.33 | 90.78/83.11 | 90.22/82.18 | 91.42 | 90.17 | 82.52 |
| RS3Mamba [12] | R18-Mamba-T | 92.69/86.38 | 96.67/93.55 | 80.54/67.42 | 90.59/82.79 | 90.49/82.64 | 91.30 | 90.20 | 82.56 |
| MFMamba (Ours) | R18-Mamba-T | 93.22/87.31 | 97.17/94.50 | 80.63/67.54 | 91.05/83.58 | 90.53/82.70 | 91.81 | 90.52 | 83.13 |
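The per-class columns report F1/IoU, while OA, mF1, and mIoU aggregate over pixels and classes. A small NumPy sketch of these standard definitions follows; whether the paper excludes any clutter/background class from the means is not visible in this excerpt.

```python
# Sketch of the standard OA / F1 / IoU definitions behind the table columns.
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                 # false positives per class
    fn = conf.sum(axis=1) - tp                 # false negatives per class
    oa = tp.sum() / conf.sum()                 # overall (pixel) accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)           # per-class F1
    iou = tp / (tp + fp + fn)                  # per-class IoU
    return oa, f1, iou, f1.mean(), iou.mean()

# Toy 3-class confusion matrix just to exercise the function.
conf = np.array([[90, 5, 5],
                 [4, 80, 6],
                 [2, 3, 85]])
oa, f1, iou, mf1, miou = segmentation_metrics(conf)
print(f"OA={oa:.3f}  mF1={mf1:.3f}  mIoU={miou:.3f}")
```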
| Method | Backbone | Imp. F1/IoU (%) | Bui. F1/IoU (%) | Low. F1/IoU (%) | Tre. F1/IoU (%) | Car F1/IoU (%) | OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|
| CMFNet [31] | VGG-16 | 93.09/87.07 | 96.90/93.99 | 85.88/75.26 | 86.52/76.25 | 96.17/92.62 | 90.72 | 91.71 | 85.04 |
| ABCNet [47] | ResNet-18 | 92.90/86.74 | 96.99/94.16 | 86.11/75.62 | 87.02/77.02 | 96.31/92.88 | 90.82 | 91.87 | 85.28 |
| TransUNet [21] | R50-ViT-B | 93.08/87.06 | 96.88/93.94 | 86.74/76.59 | 87.66/78.03 | 96.40/93.05 | 91.03 | 92.15 | 85.73 |
| UNetFormer [43] | ResNet-18 | 93.02/86.95 | 97.14/94.43 | 86.21/75.76 | 86.93/76.88 | 96.79/93.78 | 90.89 | 92.02 | 85.56 |
| MAResU-Net [48] | ResNet-34 | 93.15/87.17 | 97.21/94.57 | 86.73/76.57 | 87.14/77.21 | 96.67/93.56 | 91.05 | 92.18 | 85.82 |
| CMTFNet [49] | ResNet-50 | 93.08/87.06 | 97.30/94.73 | 86.32/75.94 | 87.13/77.20 | 96.89/93.97 | 90.97 | 92.15 | 85.78 |
| RS3Mamba [12] | R18-Mamba-T | 93.20/87.27 | 97.32/94.79 | 86.07/75.54 | 86.60/76.37 | 96.74/93.68 | 90.92 | 91.99 | 85.53 |
| MFMamba (Ours) | R18-Mamba-T | 93.31/87.46 | 97.81/95.75 | 86.76/76.62 | 87.19/77.28 | 96.63/93.48 | 91.38 | 92.34 | 86.12 |
| Method | FLOPs (G) | Parameters (M) | mIoU (%) |
|---|---|---|---|
| CMFNet [31] | 255.28 | 104.07 | 78.67 |
| ABCNet [47] | 12.58 | 13.67 | 80.27 |
| TransUNet [21] | 123.49 | 105.32 | 82.04 |
| UNetFormer [43] | 9.45 | 11.69 | 82.44 |
| MAResU-Net [48] | 23.08 | 26.28 | 82.51 |
| CMTFNet [49] | 28.68 | 30.07 | 82.52 |
| RS3Mamba [12] | 31.65 | 43.32 | 82.56 |
| MFMamba (Ours) | 30.59 | 62.43 | 83.13 |
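For reference, a parameter count like the middle column can be reproduced for any PyTorch model with the generic sketch below; FLOPs additionally need an input-dependent profiler (e.g., thop or fvcore), and the paper's exact profiling tool is not stated in this excerpt. The ResNet-18 stand-in is an assumption for illustration.

```python
# Sketch: reproducing a "Parameters (M)" figure for any PyTorch model.
import torch
from torchvision.models import resnet18

def params_in_millions(model: torch.nn.Module) -> float:
    # Count trainable parameters and convert to millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

model = resnet18(weights=None)                 # stand-in model
print(f"{params_in_millions(model):.2f} M")    # ~11.69 M for ResNet-18
```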
| Dataset | Bands | Imp. OA (%) | Bui. OA (%) | Low. OA (%) | Tre. OA (%) | Car OA (%) |
|---|---|---|---|---|---|---|
| Vaihingen | NIRRG | 91.58 | 97.01 | 80.08 | 91.24 | 87.86 |
| Vaihingen | NIRRG + DSM | 92.18 (+0.60) | 97.91 (+0.90) | 79.67 (−0.41) | 92.08 (+0.84) | 89.43 (+1.57) |
| Potsdam | RGB | 92.65 | 97.94 | 88.60 | 86.40 | 96.69 |
| Potsdam | RGB + DSM | 93.06 (+0.41) | 98.29 (+0.35) | 88.64 (+0.04) | 87.19 (+0.79) | 96.31 (−0.38) |
| MCKA | EAA | OA (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|
| √ | | 91.68 | 90.45 | 82.98 |
| | √ | 91.50 | 90.22 | 82.60 |
| √ | √ | 91.81 | 90.52 | 83.13 |