TMNet: A Two-Branch Multi-Scale Semantic Segmentation Network for Remote Sensing Images
Abstract
1. Introduction
- A two-branch multi-scale semantic segmentation network (TMNet) for remote sensing images is proposed, which uses the Swin Transformer's ability to encode global feature information to improve the model's segmentation performance.
- A multi-scale feature fusion module (MFM) is proposed to merge shallow spatial features of the image at different scales into the deep features, strengthening the fusion of multi-scale deep features so that finer details are recovered during classification (a minimal sketch of this fusion pattern follows this list).
- A feature enhancement module (FEM) and a channel enhancement module (CEM) are added to the main and auxiliary encoders, respectively, to strengthen feature extraction. The FEM enhances feature interaction by computing the relationship between its own features and a set of updatable feature-storage units; the CEM encodes the spatial information of the Swin Transformer features by establishing inter-channel correlations, further increasing the spatial correlation of global features.
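The paper's exact module definitions are not reproduced in this outline. As an illustration of the fusion pattern the MFM bullet describes — aligning shallow features from several scales with a deep feature map and fusing them by concatenation and convolution — here is a minimal PyTorch sketch; the class name, channel widths, and layer choices are our assumptions, not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Illustrative MFM-style fusion: shallow features from several scales
    are resized to the deep feature's resolution and fused with it."""

    def __init__(self, shallow_channels=(128, 256), deep_channels=512):
        super().__init__()
        # 1x1 convs align each shallow feature's channel count with the deep map
        self.align = nn.ModuleList(
            [nn.Conv2d(c, deep_channels, kernel_size=1) for c in shallow_channels]
        )
        n_inputs = len(shallow_channels) + 1  # shallow branches + the deep map
        # 3x3 conv fuses the concatenated stack back to deep_channels
        self.fuse = nn.Sequential(
            nn.Conv2d(n_inputs * deep_channels, deep_channels, 3, padding=1),
            nn.BatchNorm2d(deep_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow_feats, deep_feat):
        h, w = deep_feat.shape[-2:]
        # bring every shallow map to the deep map's spatial size
        resized = [
            F.interpolate(proj(f), size=(h, w), mode="bilinear", align_corners=False)
            for proj, f in zip(self.align, shallow_feats)
        ]
        return self.fuse(torch.cat(resized + [deep_feat], dim=1))
```

Feeding in shallow maps of shape (B, 128, H/2, W/2) and (B, 256, H/4, W/4) together with a deep map of shape (B, 512, H/8, W/8) yields a fused 512-channel map at the deep map's H/8 × W/8 resolution.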
2. Related Work
2.1. Classical Image Semantic Segmentation Network
2.2. Work on Transformer
3. Methods
3.1. Network Structure
3.2. Main Encoder
3.2.1. CNN Blocks
3.2.2. Multi-Scale Feature Fusion Module
3.2.3. Feature Enhancement Module
3.3. Auxiliary Encoder
3.3.1. Swin Transformer
3.3.2. Channel Enhancement Module
3.4. Decoder
3.5. Loss Function
4. Experiments and Results
4.1. Datasets
4.2. Training Setup and Evaluation Index
4.3. Comparison Results on WHDLD
4.4. Comparison Results on Potsdam Dataset
4.5. Ablation Study
4.5.1. Effect of SW and MFM
4.5.2. Effect of FEM and CEM
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Morin, E.; Herrault, P.-A.; Guinard, Y.; Grandjean, F.; Bech, N. The promising combination of a remote sensing approach and landscape connectivity modelling at a fine scale in urban planning. Ecol. Indic. 2022, 139, 108930.
- Qing, Y.; Ming, D.; Wen, Q.; Weng, Q.; Xu, L.; Chen, Y.; Zhang, Y.; Zeng, B. Operational earthquake-induced building damage assessment using CNN-based direct remote sensing change detection on superpixel level. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102899.
- Lu, W.; Zhao, L.; Xu, R. Remote sensing image processing technology based on mobile augmented reality technology in surveying and mapping engineering. Soft Comput. 2023, 27, 423–433.
- Wang, P.; Bayram, B.; Sertel, E. A comprehensive review on deep learning based remote sensing image super-resolution methods. Earth-Sci. Rev. 2022, 232, 104110.
- De Geus, D.; Dubbelman, G. Intra-Batch Supervision for Panoptic Segmentation on High-Resolution Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 3165–3173.
- Zhao, Z.; Long, S.; Pi, J.; Wang, J.; Zhou, L. Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 23705–23714.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; pp. 234–241.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Chen, X.; Li, Z.; Jiang, J.; Han, Z.; Deng, S.; Li, Z.; Fang, T.; Huo, H.; Li, Q.; Liu, M. Adaptive effective receptive field convolution for semantic segmentation of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3532–3546.
- Zhao, Y.; Zhang, X.; Feng, W.; Xu, J. Deep Learning Classification by ResNet-18 Based on the Real Spectral Dataset from Multispectral Remote Sensing Images. Remote Sens. 2022, 14, 4883.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
- Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435.
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3684–3692.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Dong, R.; Pan, X.; Li, F. DenseU-Net-based semantic segmentation of small objects in urban remote sensing images. IEEE Access 2019, 7, 65347–65356.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
- Hou, J.; Guo, Z.; Wu, Y.; Diao, W.; Xu, T. BSNet: Dynamic Hybrid Gradient Convolution Based Boundary-Sensitive Network for Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 3176028.
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
- Wu, L.; Fang, L.; Yue, J.; Zhang, B.; Ghamisi, P.; He, M. Deep Bilateral Filtering Network for Point-Supervised Semantic Segmentation in Remote Sensing Images. IEEE Trans. Image Process. 2022, 31, 7419–7434.
- Wang, J.; Feng, Z.; Jiang, Y.; Yang, S.; Meng, H. Orientation Attention Network for semantic segmentation of remote sensing images. Knowl.-Based Syst. 2023, 267, 110415.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890.
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272.
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064.
- Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935.
- Ren, H.; Dai, H.; Dai, Z.; Yang, M.; Leskovec, J.; Schuurmans, D.; Dai, B. Combiner: Full attention transformer with sparse computation cost. Adv. Neural Inf. Process. Syst. 2021, 34, 22470–22482.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211.
- Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for medical image segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15.
- Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention. arXiv 2023, arXiv:2303.06908.
- Stergiou, A.; Poppe, R.; Kalliatakis, G. Refining activation downsampling with SoftPool. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10357–10366.
- Ye, X.; Xiong, F.; Lu, J.; Zhao, H.; Zhou, J. M2-Net: A Multi-scale Multi-level Feature Enhanced Network for Object Detection in Optical Remote Sensing Images. In Proceedings of the 2020 Digital Image Computing: Techniques and Applications (DICTA), Melbourne, Australia, 29 November–2 December 2020; pp. 1–8.
- Vaswani, A.; Ramachandran, P.; Srinivas, A.; Parmar, N.; Hechtman, B.; Shlens, J. Scaling local self-attention for parameter efficient visual backbones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12894–12904.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134.
- Shao, Z.; Yang, K.; Zhou, W. Performance evaluation of single-label and multi-label remote sensing image retrieval using a dense labeling dataset. Remote Sens. 2018, 10, 964.
- Shao, Z.; Zhou, W.; Deng, X.; Zhang, M.; Cheng, Q. Multilabel remote sensing image retrieval based on fully convolutional network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 318–328.
- ISPRS Vaihingen 2D Semantic Labeling Dataset; ISPRS: Hannover, Germany, 2018.
- Li, H.; Xiong, P.; Fan, H.; Sun, J. DFANet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 9522–9531.
- Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Tian, Z.; He, T.; Shen, C.; Yan, Y. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 3126–3135.
- Sun, Y.; Bi, F.; Gao, Y.; Chen, L.; Feng, S. A multi-attention UNet for semantic segmentation in remote sensing images. Symmetry 2022, 14, 906.
- Li, Z.; Shen, H.; Cheng, Q.; Liu, Y.; You, S.; He, Z. Deep learning based cloud detection for medium and high resolution remote sensing images of different sensors. ISPRS J. Photogramm. Remote Sens. 2019, 150, 197–212.
- Wieland, M.; Li, Y.; Martinis, S. Multi-sensor cloud and cloud shadow segmentation with a convolutional neural network. Remote Sens. Environ. 2019, 230, 111203.
- Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.-H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12094–12103.
| Name | Input Size | Layer Structure | Output Size |
|---|---|---|---|
| CNN block-0 | 3 × H × W | Conv2d(3, 64, 3, 2, 1), ReLU<br>Conv2d(64, 96, 3, 1, 1), BN, ReLU<br>Conv2d(96, 128, 3, 1, 1), BN, ReLU | 128 × H/2 × W/2 |
| CNN block-1 | 128 × H/2 × W/2 | MaxPool2d(2)<br>Conv2d(128, 192, 3, 1, 1), BN, ReLU<br>Conv2d(192, 256, 3, 1, 1), BN, ReLU | 256 × H/4 × W/4 |
| CNN block-2 | 256 × H/4 × W/4 | MaxPool2d(2)<br>Conv2d(256, 384, 3, 1, 1), BN, ReLU<br>Conv2d(384, 512, 3, 1, 1), BN, ReLU | 512 × H/8 × W/8 |
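The table above translates directly into PyTorch. Here is a minimal sketch (the class name, helper function, and shape check are ours; note the table lists no BN after the very first convolution, and the sketch follows it as written):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # 3x3 convolution, stride 1, padding 1 (keeps spatial size), then BN + ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CNNBlocks(nn.Module):
    """Shallow CNN stem of the main encoder, transcribed from the layer table."""

    def __init__(self):
        super().__init__()
        # block-0: the stride-2 conv halves the resolution; plain Conv2d + ReLU
        self.block0 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            conv_bn_relu(64, 96),
            conv_bn_relu(96, 128),
        )
        # block-1 and block-2 each start with 2x2 max pooling
        self.block1 = nn.Sequential(nn.MaxPool2d(2), conv_bn_relu(128, 192), conv_bn_relu(192, 256))
        self.block2 = nn.Sequential(nn.MaxPool2d(2), conv_bn_relu(256, 384), conv_bn_relu(384, 512))

    def forward(self, x):
        f0 = self.block0(x)   # 128 x H/2 x W/2
        f1 = self.block1(f0)  # 256 x H/4 x W/4
        f2 = self.block2(f1)  # 512 x H/8 x W/8
        return f0, f1, f2

# quick shape check against the table
if __name__ == "__main__":
    feats = CNNBlocks()(torch.randn(1, 3, 256, 256))
    print([tuple(f.shape) for f in feats])
    # [(1, 128, 128, 128), (1, 256, 64, 64), (1, 512, 32, 32)]
```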
| Method | Parameters (M) | FLOPs (G) | MIOU | MF1 | MPA | IoU (Bare Soil) | IoU (Building) | IoU (Vegetation) | IoU (Water) |
|---|---|---|---|---|---|---|---|---|---|
| DFANet | 2.32 | 0.947 | 54.93 | 68.46 | 67.88 | 49.46 | 47.75 | 77.02 | 90.56 |
| DenseASPP | 46.04 | 48.178 | 55.94 | 69.23 | 67.60 | 49.04 | 47.14 | 78.63 | 92.41 |
| PSPNet | 72.31 | 70.058 | 56.41 | 69.88 | 68.44 | 48.18 | 53.43 | 76.28 | 91.74 |
| DUNet | 29.21 | 30.889 | 56.86 | 70.27 | 68.44 | 49.27 | 54.43 | 76.42 | 91.97 |
| MAUNet | 14.55 | 41.263 | 57.12 | 70.41 | 69.56 | 50.88 | 49.79 | 78.99 | 92.70 |
| MUNet | 7.77 | 13.756 | 58.50 | 71.65 | 70.20 | 51.40 | 54.45 | 79.38 | 93.17 |
| MSCFF | 51.95 | 140.680 | 59.87 | 72.88 | 70.82 | 53.60 | 56.86 | 79.41 | 94.05 |
| Deeplabv3plus | 59.34 | 22.243 | 60.37 | 73.32 | 71.61 | 56.54 | 53.04 | 80.80 | 93.96 |
| HRNet | 49.20 | 49.302 | 60.43 | 73.21 | 71.51 | 55.66 | 54.01 | 80.96 | 93.97 |
| SegFormer | 47.34 | 79.105 | 60.45 | 73.33 | 71.55 | 55.26 | 52.91 | 81.16 | 94.18 |
| HRVit | 28.62 | 27.415 | 60.71 | 73.57 | 71.82 | 55.17 | 53.42 | 81.12 | 93.89 |
| TMNet | 20.55 | 47.499 | 61.23 | 74.14 | 72.57 | 55.77 | 54.06 | 81.19 | 94.28 |
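For reference, MIOU, MF1, and MPA in this and the following tables are the per-class intersection-over-union, F1 score, and pixel accuracy averaged over classes. The paper's exact formulas are not reproduced here; this NumPy sketch shows the standard confusion-matrix definitions (function and variable names are ours):

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp   # predicted as the class but wrong
    fn = conf.sum(axis=1) - tp   # ground truth is the class but missed
    iou = tp / (tp + fp + fn)            # per-class IoU
    f1 = 2 * tp / (2 * tp + fp + fn)     # per-class F1 (Dice)
    pa = tp / (tp + fn)                  # per-class pixel accuracy (recall)
    return iou.mean(), f1.mean(), pa.mean(), iou

# toy 2-class example
conf = np.array([[90, 10],
                 [20, 80]])
miou, mf1, mpa, per_class_iou = segmentation_metrics(conf)
print(f"MIOU={miou:.4f}, MF1={mf1:.4f}, MPA={mpa:.4f}")
```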
| Method | MIOU | MF1 | MPA | IoU (Impervious Surface) | IoU (Building) | IoU (Low Vegetation) | IoU (Tree) | IoU (Car) | IoU (Clutter/Background) |
|---|---|---|---|---|---|---|---|---|---|
| DFANet | 61.05 | 73.70 | 72.30 | 70.92 | 77.79 | 60.18 | 63.49 | 73.16 | 20.75 |
| MAUNet | 62.31 | 75.01 | 73.15 | 70.55 | 77.75 | 61.30 | 65.53 | 74.07 | 24.66 |
| DenseASPP | 62.93 | 75.92 | 75.24 | 75.00 | 80.02 | 61.29 | 56.69 | 72.53 | 32.02 |
| PSPNet | 63.76 | 76.37 | 75.51 | 74.28 | 82.83 | 62.01 | 59.49 | 73.99 | 29.99 |
| DUNet | 64.70 | 77.00 | 75.65 | 76.70 | 83.17 | 63.09 | 58.47 | 76.29 | 30.45 |
| Deeplabv3plus | 65.32 | 77.49 | 77.46 | 75.73 | 83.90 | 63.06 | 63.46 | 75.51 | 30.25 |
| HRNet | 66.22 | 78.18 | 77.49 | 76.62 | 83.51 | 64.66 | 61.53 | 77.28 | 33.69 |
| MSCFF | 66.35 | 78.36 | 78.88 | 75.28 | 84.31 | 64.95 | 63.21 | 77.83 | 32.48 |
| MUNet | 66.60 | 78.42 | 77.05 | 76.26 | 83.10 | 64.13 | 62.94 | 77.44 | 32.31 |
| HRVit | 67.33 | 79.72 | 77.70 | 76.70 | 84.86 | 64.71 | 64.05 | 77.65 | 35.99 |
| SegFormer | 67.43 | 79.07 | 78.43 | 76.61 | 84.66 | 65.08 | 63.70 | 78.17 | 36.36 |
| TMNet | 68.15 | 79.91 | 78.77 | 78.63 | 84.20 | 66.65 | 63.74 | 78.57 | 37.14 |
| Method | MIOU | MF1 | MPA |
|---|---|---|---|
| Baseline | 58.20 | 71.43 | 69.95 |
| Baseline + MFM | 59.38 | 72.36 | 71.17 |
| Baseline + MFM + SW | 60.41 | 73.42 | 72.08 |
| Baseline + MFM + SW + FEM | 60.78 | 73.71 | 72.59 |
| Baseline + MFM + SW + CEM | 60.89 | 73.82 | 72.34 |
| Baseline + MFM + SW + FEM + CEM | 61.23 | 74.14 | 72.57 |
Share and Cite
Gao, Y.; Zhang, S.; Zuo, D.; Yan, W.; Pan, X. TMNet: A Two-Branch Multi-Scale Semantic Segmentation Network for Remote Sensing Images. Sensors 2023, 23, 5909. https://doi.org/10.3390/s23135909