CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation
Abstract
1. Introduction
- (1) We propose CTFuse, a hybrid CNN-Transformer architecture in which a CNN branch extracts detailed spatial information and a Transformer branch captures global context; the two are then fused through upsampling to achieve precise localization. In the CNN part we adopt the multi-scale convolutional attention (MSCA) module from SegNeXt [33], which relies heavily on depthwise separable convolutions [34] and therefore keeps the parameter count and computational cost low. Even with the Transformer branch, the final model has far fewer parameters than other CNN-Transformer hybrids such as TransUNet [35] and ST-UNet [36]. (A minimal sketch of an MSCA-style block is given after this list.)
- (2) To effectively encode the features extracted by the convolutional stage, we propose a spatial and channel attention module in convolution (SCA_C), a dual-branch structure that captures local and global feature information. SCA_C works together with MSCA to strengthen the interaction between spatial and channel information, fully fusing multi-scale hierarchical features with spatial-channel features and further improving segmentation performance. (A hypothetical dual-branch sketch follows this list.)
- (3) We design a spatial and channel attention module in the Transformer (SCA_T), which supplements the model's global and channel-wise modeling abilities and helps the self-attention module extract more detailed features. (A schematic of how such a branch could be attached to a Transformer block is also sketched below.)
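To make the multi-scale convolutional attention idea concrete, here is a minimal PyTorch sketch of an MSCA-style block built entirely from depthwise (grouped) convolutions. The kernel sizes follow the published SegNeXt design, and the class name is ours; the exact module used in CTFuse may differ.

```python
import torch
import torch.nn as nn

class MSCASketch(nn.Module):
    """Multi-scale convolutional attention in the spirit of SegNeXt's MSCA:
    depthwise (groups=dim) strip convolutions keep parameters and FLOPs low."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv0 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # Multi-scale depthwise strip convolutions (1xk followed by kx1).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in (7, 11, 21)
        ])
        self.proj = nn.Conv2d(dim, dim, 1)  # 1x1 conv mixes channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv0(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.proj(attn)
        return attn * x  # the attention map modulates the input features


if __name__ == "__main__":
    feats = torch.randn(1, 64, 128, 128)
    print(MSCASketch(64)(feats).shape)  # -> torch.Size([1, 64, 128, 128])
```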
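The contribution list does not spell out SCA_C, so the following is only a hypothetical sketch of how a dual-branch spatial + channel attention module can be composed: a squeeze-and-excitation style channel branch plus a CBAM-style spatial branch. The structure and names are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class DualBranchSCA(nn.Module):
    """Hypothetical dual-branch spatial + channel attention (not the paper's exact
    SCA_C): a squeeze-and-excitation style channel branch and a CBAM-style spatial
    branch whose outputs reweight the convolutional features before fusion."""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        # Channel branch: global context -> per-channel weights
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid())
        # Spatial branch: pooled channel statistics -> per-pixel weights
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_channel = x * self.channel(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        x_spatial = x * self.spatial(pooled)
        return x_channel + x_spatial  # fuse the two attention branches
```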
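Similarly, the sketch below shows one plausible way to attach a spatial/channel attention branch to a Transformer block operating on token sequences; `TokenSCABranch` and its internals are illustrative assumptions rather than the published SCA_T.

```python
import torch
import torch.nn as nn
from typing import Tuple

class TokenSCABranch(nn.Module):
    """Hypothetical auxiliary branch for a Transformer block (not the exact SCA_T):
    tokens are folded back into a feature map, reweighted by lightweight channel and
    spatial attention, then flattened again so the result can be added to the
    self-attention output."""
    def __init__(self, dim: int, reduction: int = 8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, hw: Tuple[int, int]) -> torch.Tensor:
        b, n, c = tokens.shape
        h, w = hw                                          # n must equal h * w
        fmap = tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, N, C) -> (B, C, H, W)
        ch_w = self.channel_mlp(tokens.mean(dim=1))        # per-channel weights (B, C)
        sp_w = self.spatial_conv(fmap)                     # per-pixel weights (B, 1, H, W)
        fmap = fmap * ch_w.view(b, c, 1, 1) * sp_w
        return fmap.flatten(2).transpose(1, 2)             # back to (B, N, C)

# Schematic use inside a Transformer block:
#   x = x + self_attention(norm(x)) + sca_t_branch(norm(x), (H, W))
```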
2. Related Work
2.1. Semantic Segmentation Method Based on CNN
2.2. Semantic Segmentation Method Based on Transformer
2.3. Self-Attention Mechanism
3. Methods
3.1. Overall Network Structure
3.2. SC_MSCAN Block and SC_Transformer Block
3.3. Spatial and Channel Attention Module in Convolution (SCA_C)
3.4. Spatial and Channel Attention Module in Transformer (SCA_T)
4. Experimental Results
4.1. Dataset
4.1.1. The Vaihingen Dataset
4.1.2. The Potsdam Dataset
4.2. Evaluation Metric
4.3. Training Settings
Algorithm 1: Training Process of CTFuse
Input: X (training images) and Y (corresponding labels)
Output: P (prediction mask)
// Step 1: Extract features by SC_MSCAN
F_1 = SC_MSCAN(X)
for i in {1, 2}:
  F_{i+1} = Downsample(F_i)
  F_{i+1} = SC_MSCAN(F_{i+1})
// Step 2: Extract features by SC_Transformer
F_3 = Reshape(F_3)
for i in {3, 4}:
  F_{i+1} = PatchMerging(F_i)
  F_{i+1} = SC_Transformer(F_{i+1})
// Step 3: Get the prediction mask
D_3 = Upsample(Reshape(F_5), Reshape(F_4))
for i in {0, 2}:
  D_i = Upsample(D_{i+1}, F_{i+1})
P = D_0
// Step 4: Calculate the loss between P and Y (the sum of the two loss terms)
// Step 5: Update the network parameters
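As a concrete reading of Algorithm 1, the following PyTorch sketch implements one training step under the assumption that the objective is the sum of a cross-entropy term and a V-Net style Dice term (Dice loss is cited in the references, but the exact loss combination and helper names below are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft multi-class Dice loss (V-Net style) over one-hot targets."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step following the outline of Algorithm 1."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)                    # Steps 1-3 run inside the CTFuse-like model
    loss = F.cross_entropy(logits, labels) + dice_loss(logits, labels)  # Step 4 (assumed CE + Dice)
    loss.backward()
    optimizer.step()                          # Step 5: update the network parameters
    return loss.item()
```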
4.4. Ablation Studies
4.4.1. Validity of CNN and Transformer Hybrid Structure
4.4.2. Validity of SCA_C
4.4.3. Validity of SCA_T
4.5. Comparing the Segmentation Results of Different Models
4.5.1. Experiments on the Vaihingen Dataset
4.5.2. Experiments on the Potsdam Dataset
4.5.3. Performance Analysis
4.6. Visualization Analysis
4.7. Confusion Matrix
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Sample Availability
References
- Zhang, J.; Feng, L.; Yao, F. Improved maize cultivated area estimation over a large scale combining modis–evi time series data and crop phenological information. ISPRS J. Photogramm. Remote Sens. 2014, 94, 102–113. [Google Scholar] [CrossRef]
- Zhang, C.; Harrison, P.A.; Pan, X.; Li, H.; Sargent, I.; Atkinson, P.M. Scale sequence joint deep learning (ss-jdl) for land use and land cover classification. Remote Sens. Environ. 2020, 237, 111593. [Google Scholar] [CrossRef]
- Sahar, L.; Muthukumar, S.; French, S.P. Using aerial imagery and gis in automated building footprint extraction and shape recognition for earthquake risk assessment of urban inventories. IEEE Trans. Geosci. Remote Sens. 2010, 48, 3511–3520. [Google Scholar] [CrossRef]
- Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. Joint deep learning for land cover and land use classification. Remote Sens. Environ. 2019, 221, 173–187. [Google Scholar] [CrossRef]
- Fu, Y.; Zhao, C.; Wang, J.; Jia, X.; Yang, G.; Song, X.; Feng, H. An improved combination of spectral and spatial features for vegetation classification in hyperspectral images. Remote Sens. 2017, 9, 261. [Google Scholar] [CrossRef]
- Aslam, B.; Maqsoom, A.; Khalil, U.; Ghorbanzadeh, O.; Blaschke, T.; Farooq, D.; Tufail, R.F.; Suhail, S.A.; Ghamisi, P. Evaluation of different landslide susceptibility models for a local scale in the chitral district, northern pakistan. Sensors 2022, 22, 3107. [Google Scholar] [CrossRef]
- Tatsumi, K.; Yamashiki, Y.; Torres, M.A.C.; Taipe, C.L.R. Crop classification of upland fields using random forest of time-series landsat 7 etm+ data. Comput. Electron. Agric. 2015, 115, 171–179. [Google Scholar] [CrossRef]
- Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222. [Google Scholar] [CrossRef]
- Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
- Cheng, Q.; Varshney, P.K.; Arora, M.K. Logistic regression for feature selection and soft classification of remote sensing data. IEEE Geosci. Remote Sens. Lett. 2006, 3, 491–494. [Google Scholar] [CrossRef]
- Lee, S. Application of logistic regression model and its validation for landslide susceptibility mapping using gis and remote sensing data. Int. J. Remote Sens. 2005, 26, 1477–1491. [Google Scholar] [CrossRef]
- Mas, J.F.; Flores, J.J. The application of artificial neural networks to the analysis of remotely sensed data. Int. J. Remote Sens. 2008, 29, 617–663. [Google Scholar] [CrossRef]
- Gopal, S.; Woodcock, C. Remote sensing of forest change using artificial neural networks. IEEE Trans. Geosci. Remote Sens. 1996, 34, 398–404. [Google Scholar] [CrossRef]
- Chebud, Y.; Naja, G.M.; Rivero, R.G.; Melesse, A.M. Water quality monitoring using remote sensing and an artificial neural network. Water Air Soil Pollut. 2012, 223, 4875–4887. [Google Scholar] [CrossRef]
- Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative cnns. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [Google Scholar] [CrossRef]
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
- Shen, C.; Nguyen, D.; Zhou, Z.; Jiang, S.B.; Dong, B.; Jia, X. An introduction to deep learning in medical physics: Advantages, potential, and challenges. Phys. Med. Biol. 2020, 65, 05TR01. [Google Scholar] [CrossRef]
- Hu, A.; Wu, L.; Chen, S.; Xu, Y.; Wang, H.; Xie, Z. Boundary shape-preserving model for building mapping from high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610217. [Google Scholar] [CrossRef]
- Hua, Y.; Mou, L.; Jin, P.; Zhu, X.X. Multiscene: A large-scale dataset and benchmark for multiscene recognition in single aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
- Sun, J.; Yang, S.; Gao, X.; Ou, D.; Tian, Z.; Wu, J.; Wang, M. Masa-segnet: A semantic segmentation network for polsar images. Remote Sens. 2023, 15, 3662. [Google Scholar] [CrossRef]
- Grinias, I.; Panagiotakis, C.; Tziritas, G. Mrf-based segmentation and unsupervised classification for building and road detection in peri-urban areas of high-resolution satellite images. ISPRS J. Photogramm. Remote. Sens. 2016, 122, 145–166. [Google Scholar] [CrossRef]
- Benedek, C.; Descombes, X.; Zerubia, J. Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 33–50. [Google Scholar] [CrossRef] [PubMed]
- Yi, Y.; Zhang, Z.; Zhang, W.; Zhang, C.; Li, W.; Zhao, T. Semantic segmentation of urban buildings from vhr remote sensing imagery using a deep convolutional neural network. Remote Sens. 2019, 11, 1774. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Qin, Y.; Kamnitsas, K.; Ancha, S.; Nanavati, J.; Cottrell, G.; Criminisi, A.; Nori, A. Autofocus layer for semantic segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, 16–20 September 2018; Proceedings, Part III 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 603–611. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Sinha, A.; Dolz, J. Multi-scale self-guided attention for medical image segmentation. IEEE J. Biomed. Health Inform. 2020, 25, 121–130. [Google Scholar] [CrossRef]
- Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with Transformers. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv 2022, arXiv:2209.08575. [Google Scholar]
- Ioannou, Y.; Robertson, D.; Cipolla, R.; Criminisi, A. Deep roots: Improving cnn efficiency with hierarchical filter groups. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1231–1240. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding unet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
- Song, P.; Li, J.; An, Z.; Fan, H.; Fan, L. Ctmfnet: Cnn and Transformer multi-scale fusion network of remote sensing urban scene imagery. IEEE Trans. Geosci. Remote Sens. 2022. [Google Scholar] [CrossRef]
- Zhang, Y.; Lu, H.; Ma, G.; Zhao, H.; Xie, D.; Geng, S.; Tian, W.; Sian, K.T.C.L.K. Mu-net: Embedding mixformer into unet to extract water bodies from remote sensing images. Remote Sens. 2023, 15, 3559. [Google Scholar] [CrossRef]
- Wang, D.; Chen, Y.; Naz, B.; Sun, L.; Li, B. Spatial-aware transformer (sat): Enhancing global modeling in transformer segmentation for remote sensing images. Remote Sens. 2023, 15, 3607. [Google Scholar] [CrossRef]
- Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention network for semantic segmentation of fine-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. Segformer: Simple and efficient design for semantic segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of Transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure Transformer for medical image segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part III. Springer: Berlin/Heidelberg, Germany, 2023; pp. 205–218. [Google Scholar]
- Yu, C.; Wang, F.; Shao, Z.; Sun, T.; Wu, L.; Xu, Y. Dsformer: A double sampling transformer for multivariate time series long-term prediction. arXiv 2023, arXiv:2308.03274. [Google Scholar]
- Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. Topformer: Token pyramid transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093. [Google Scholar]
- Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv 2014, arXiv:1412.7755. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
- ISPRS. Semantic Labeling Contest-Vaihingen (2018). Available online: https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-sem-label-vaihingen/ (accessed on 4 September 2021).
- Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. Stransfuse: Fusing swin Transformer and convolutional neural network for remote sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003. [Google Scholar] [CrossRef]
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. Unetformer: A unet-like Transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
- ISPRS. Semantic Labeling Contest-Potsdam (2018). Available online: http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html (accessed on 4 September 2021).
- Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Method | SCA_C | SCA_T | Building IoU (%) | Low Vegetation IoU (%) | Tree IoU (%) | Car IoU (%) | Impervious Surface IoU (%) | mIoU (%) | mF1 (%)
---|---|---|---|---|---|---|---|---|---
Baseline UNet | | | 80.13 | 58.06 | 65.40 | 48.57 | 75.07 | 65.45 | 78.53
CNN_Trans | | | 80.37 | 60.71 | 67.76 | 49.24 | 75.66 | 66.75 | 79.52
CNN_Trans+SCA_C | √ | | 80.19 | 61.30 | 66.69 | 54.72 | 75.88 | 67.76 | 80.41
CNN_Trans+SCA_T | | √ | 79.32 | 61.74 | 66.90 | 55.27 | 75.30 | 67.71 | 80.41
CNN_Trans+SCA_C+SCA_T | √ | √ | 81.29 | 60.97 | 68.04 | 56.34 | 76.02 | 68.53 | 80.97
Method | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Impervious Surface IoU | Impervious Surface F1 | mIoU (%) | mF1 (%)
---|---|---|---|---|---|---|---|---|---|---|---|---
DeepLabv3 [28] | 74.28 | 85.24 | 56.86 | 72.50 | 64.54 | 78.45 | 27.81 | 43.52 | 70.75 | 82.87 | 58.85 | 72.52
FCN [24] | 74.96 | 85.69 | 57.66 | 73.14 | 65.43 | 79.10 | 26.22 | 41.55 | 70.72 | 82.85 | 59.00 | 72.47
MANet [42] | 72.96 | 84.37 | 55.52 | 71.40 | 63.89 | 77.96 | 27.90 | 43.63 | 81.94 | 76.29 | 57.93 | 70.73
PSPNet [29] | 74.13 | 85.14 | 57.87 | 73.32 | 66.24 | 79.70 | 30.46 | 46.69 | 70.83 | 82.92 | 59.91 | 73.55
Res_UNet [40] | 75.80 | 86.24 | 57.83 | 73.28 | 64.90 | 78.72 | 46.87 | 63.82 | 72.60 | 84.12 | 63.60 | 77.24
UNet [26] | 80.13 | 88.97 | 58.06 | 73.47 | 65.40 | 79.08 | 48.57 | 65.38 | 75.07 | 85.76 | 65.45 | 78.53
ST_UNet [36] | 74.29 | 85.25 | 52.53 | 68.88 | 57.72 | 73.19 | 22.17 | 36.29 | 69.73 | 82.16 | 55.29 | 69.15
Swin_UNet [46] | 70.37 | 82.61 | 54.15 | 70.26 | 61.97 | 76.52 | 14.55 | 25.41 | 68.19 | 81.09 | 53.85 | 68.18
Trans_UNet [35] | 74.75 | 85.55 | 56.17 | 71.94 | 62.87 | 77.20 | 34.71 | 51.53 | 71.19 | 83.17 | 59.94 | 73.88
CTFuse | | | | | | | | | | | |
Method | Building IoU | Building F1 | Low Vegetation IoU | Low Vegetation F1 | Tree IoU | Tree F1 | Car IoU | Car F1 | Impervious Surface IoU | Impervious Surface F1 | mIoU (%) | mF1 (%)
---|---|---|---|---|---|---|---|---|---|---|---|---
DeepLabv3 [28] | 81.71 | 89.93 | 60.77 | 75.60 | 55.71 | 71.56 | 68.00 | 80.95 | 74.49 | 85.38 | 68.14 | 80.68
FCN [24] | 79.72 | 88.72 | 62.72 | 77.09 | 61.11 | 75.86 | 69.60 | 82.08 | 74.81 | 85.59 | 69.59 | 81.87
MANet [42] | 78.07 | 87.68 | 61.13 | 75.88 | 56.97 | 72.59 | 67.67 | 80.72 | 72.89 | 84.32 | 67.35 | 80.24
PSPNet [29] | 77.84 | 87.54 | 61.72 | 76.33 | 56.57 | 72.26 | 67.16 | 80.36 | 73.40 | 84.66 | 67.34 | 80.23
Res_UNet [40] | 78.11 | 87.71 | 63.09 | 77.37 | 60.06 | 75.05 | 71.18 | 83.16 | 72.83 | 84.28 | 69.05 | 81.51
UNet [26] | 81.77 | 89.97 | 64.34 | 78.30 | 61.70 | 76.32 | 72.39 | 83.98 | 75.22 | 85.86 | 71.08 | 82.89
ST_UNet [36] | 82.20 | 90.23 | 62.45 | 76.88 | 58.62 | 73.91 | 67.90 | 80.88 | 73.32 | 84.60 | 68.90 | 81.30
Swin_UNet [46] | 79.21 | 88.40 | 60.87 | 75.68 | 54.64 | 70.67 | 61.87 | 76.44 | 72.34 | 83.95 | 65.78 | 79.03
Trans_UNet [35] | 81.95 | 90.08 | 62.37 | 76.83 | 58.09 | 73.49 | 67.17 | 80.36 | 74.11 | 85.13 | 68.74 | 81.18
CTFuse | | | | | | | | | | | |
Method | Parameters (MB) | Speed (FPS) | Vaihingen mIoU (%) | Potsdam mIoU (%)
---|---|---|---|---
DeepLabv3 [28] | 39.64 | 40.16 | 58.85 | 68.14
FCN [24] | 32.95 | 47.95 | 59.00 | 69.59
MANet [42] | 35.86 | 43.92 | 57.93 | 67.35
PSPNet [29] | 65.58 | 5.50 | 59.91 | 67.34
Res_UNet [40] | 13.04 | 53.82 | 63.60 | 69.05
UNet [26] | 17.27 | 87.06 | 65.45 | 71.08
ST_UNet [36] | 168.79 | 13.61 | 55.29 | 68.90
Swin_UNet [46] | 27.18 | 60.19 | 53.85 | 65.78
Trans_UNet [35] | 105.32 | 34.42 | 59.94 | 68.74
CTFuse | 41.46 | 34.56 | 68.53 | 72.46