TGSNet: Multi-Field Feature Fusion for Glass Region Segmentation Using Transformers
Abstract
1. Introduction
- We first construct a CAM to extract backbone features through a self-attention approach. We then propose a ViT-based deep semantic segmentation architecture, called MFT, which associates multilevel receptive-field features while retaining the information captured at each level.
- We design a CCFF module that extracts boundary and segmentation features from multiscale, cross-modal features and fuses them with contextual field-of-view features (an illustrative sketch of this multi-field fusion pattern follows this list).
- We construct a model named TGSNet, which outperforms existing glass detection methods both quantitatively and visually.
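The CAM, MFT, and CCFF modules are defined in Section 3, and their code is not reproduced here. Purely as an illustrative sketch of the pattern these contributions share, the toy block below extracts features at several receptive-field levels in parallel (dilated convolutions) and re-weights them per pixel with a lightweight convolutional attention gate, so that no level's information is discarded. All class and parameter names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyMultiFieldFusion(nn.Module):
    """Illustrative only: fuses features from several receptive fields and
    re-weights them with a convolutional attention gate. Not the paper's
    actual CAM/MFT/CCFF code."""

    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One branch per receptive-field level; dilation widens the field
        # of view without adding parameters.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Convolutional attention over the concatenated levels: one weight
        # map per level, normalized across levels at every pixel.
        self.attend = nn.Sequential(
            nn.Conv2d(channels * len(dilations), len(dilations), kernel_size=1),
            nn.Softmax(dim=1),
        )
        self.project = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]           # per-level features
        weights = self.attend(torch.cat(feats, dim=1))  # (B, levels, H, W)
        fused = sum(f * weights[:, i:i + 1] for i, f in enumerate(feats))
        return self.project(fused) + x                  # residual connection

if __name__ == "__main__":
    x = torch.randn(1, 64, 96, 96)
    print(ToyMultiFieldFusion(64)(x).shape)  # torch.Size([1, 64, 96, 96])
```

The per-pixel softmax over levels is one simple way to "associate" multilevel fields while keeping each level's contribution explicit; the paper's modules are more elaborate than this sketch.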
2. Related Work
3. Proposed Method
3.1. CAM
3.2. MFT
3.3. CCFF
3.4. Loss Function
4. Experiment
4.1. Dataset and Settings
4.2. Evaluation Metrics
4.3. Comparison Methods
4.4. Comparison with Existing Methods
4.5. Ablation Experiments
4.5.1. Effectiveness of the CAM Module
4.5.2. Effectiveness of the MFT Module
4.5.3. Effectiveness of the CCFF Module
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Gao, R.; Li, M.; Yang, S.J. Reflective Noise Filtering of Large-Scale Point Cloud Using Transformer. Remote Sens. 2022, 14, 577. [Google Scholar] [CrossRef]
- Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5162–5170. [Google Scholar]
- Zheng, C.; Cham, T.J.; Cai, J. T2net: Synthetic-to-realistic translation for solving single-image depth estimation tasks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 767–783. [Google Scholar]
- Zhang, L.; Dai, J.; Lu, H. A bi-directional message passing model for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1741–1750. [Google Scholar]
- Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3907–3916. [Google Scholar]
- Liu, J.J.; Hou, Q.; Cheng, M.M. A simple pooling-based design for real-time salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3917–3926. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Dai, J.; Li, Y.; He, K. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 568–578. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7151–7160. [Google Scholar]
- Mei, H.; Yang, X.; Wang, Y. Don’t hit me! Glass detection in real-world scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3687–3696. [Google Scholar]
- Xie, E.; Wang, W.; Wang, W. Segmenting Transparent Objects in the Wild with Transformer. arXiv 2021, arXiv:2101.08461. [Google Scholar]
- Yang, X.; Mei, H.; Xu, K. Where is my mirror? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8809–8818. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Schroff, F. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Cao, Y.; Zhang, Z.; Xie, E. FakeMix augmentation improves transparent object detection. arXiv 2021, arXiv:2103.13279. [Google Scholar]
- Yu, L.; Mei, H.; Dong, W. Progressive Glass Segmentation. IEEE Trans. Image Process. 2022, 31, 2920–2933. [Google Scholar] [CrossRef] [PubMed]
- Huynh, C.; Tran, A.T.; Luu, K. Progressive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 16755–16764. [Google Scholar]
- Zhao, J.X.; Liu, J.J.; Fan, D.P. EGNet: Edge guidance network for salient object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8779–8788. [Google Scholar]
- Huo, D.; Wang, J.; Qian, Y. Glass Segmentation with RGB-Thermal Image Pairs. arXiv 2022, arXiv:2204.05453. [Google Scholar]
- Lin, J.; Yeung, Y.H.; Lau, R.W.H. Depth-aware glass surface detection with cross-modal context mining. arXiv 2022, arXiv:2206.11250. [Google Scholar]
- Mei, H.; Dong, B.; Dong, W. Glass Segmentation Using Intensity and Spectral Polarization Cues. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12622–12631. [Google Scholar]
- He, H.; Li, X.; Cheng, G. Enhanced boundary learning for glass-like object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15859–15868. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Papandreou, G.; Kokkinos, I. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Zhu, Y.; Papandreou, G. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
- Wang, W.; Xie, E.; Li, X. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Guo, M.H.; Lu, C.Z.; Hou, Q. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. arXiv 2022, arXiv:2209.08575. [Google Scholar]
- Zhang, J.; Yang, K.; Constantinescu, A. Trans4Trans: Efficient transformer for transparent object segmentation to help visually impaired people navigate in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1760–1770. [Google Scholar]
- Merget, D.; Rock, M.; Rigoll, G. Robust facial landmark detection via a fully-convolutional local-global context network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 781–790. [Google Scholar]
- Aboutalebi, H.; Pavlova, M.; Gunraj, H. MEDUSA: Multi-Scale Encoder-Decoder Self-Attention Deep Neural Network Architecture for Medical Image Analysis. Front. Med. 2021, 8, 2891. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Xiao, J.; Zhao, T.; Yao, Y. Context augmentation and feature refinement network for tiny object detection. In Proceedings of the ICLR 2022 Conference, Virtual, 25–29 April 2022. [Google Scholar]
- Li, C.; Li, J. Orthogonal Features Extraction Method and Its Application in Convolution Neural Network. J. Shanghai Jiaotong Univ. 2021, 55, 1320. [Google Scholar]
- Zhou, B.; Khosla, A.; Lapedriza, A. Object detectors emerge in deep scene cnns. arXiv 2014, arXiv:1412.6856. [Google Scholar]
- Peng, C.; Zhang, X.; Yu, G. Large kernel matters—Improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4353–4361. [Google Scholar]
- Ding, X.; Guo, Y.; Ding, G. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920. [Google Scholar]
- Liu, W.; Rabinovich, A.; Berg, A.C. Parsenet: Looking wider to see better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Margolin, R.; Zelnik-Manor, L.; Tal, A. How to evaluate foreground maps? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 248–255. [Google Scholar]
- Nguyen, V.; Yago Vicente, T.F.; Zhao, M.; Hoai, M.; Samaras, D. Shadow detection with conditional generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4510–4518. [Google Scholar]
- Fan, D.P.; Cheng, M.M.; Liu, Y. Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4548–4557. [Google Scholar]
- Zhao, H.; Qi, X.; Shen, X. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
- Li, X.; Zhao, H.; Han, L. Gated fully fusion for semantic segmentation. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11418–11425. [Google Scholar] [CrossRef]
- Huang, S.; Lu, Z.; Cheng, R. FaPN: Feature-aligned pyramid network for dense image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 864–873. [Google Scholar]
- Chen, S.; Tan, X.; Wang, B. Reverse attention for salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 234–250. [Google Scholar]
- Hou, Q.; Cheng, M.M.; Hu, X. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3203–3212. [Google Scholar]
- Wei, J.; Wang, S.; Huang, Q. F3Net: Fusion, feedback and focus for salient object detection. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12321–12328. [Google Scholar]
- Xie, E.; Wang, W.; Wang, W. Segmenting transparent objects in the wild. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 696–711. [Google Scholar]
- Lin, J.; He, Z.; Lau, R.W.H. Rich context aggregation with reflection prior for glass surface detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13415–13424. [Google Scholar]
- Zhou, H.; Xie, X.; Lai, J.H. Interactive two-stream decoder for accurate and fast saliency detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9141–9150. [Google Scholar]
- Flops Counter for Convolutional Networks in PyTorch Framework. Available online: https://github.com/sovrasov/flops-counter.pytorch (accessed on 2 February 2023).
Quantitative comparison on the GDD dataset [15] (↑: higher is better, ↓: lower is better):

Method | Venue | Backbone | IoU↑ | Fβ↑ | MAE↓ | BER↓
---|---|---|---|---|---|---
PSPNet [10] | CVPR’17 | ResNet-50 | 84.06 | 0.867 | 0.084 | 8.79
ICNet [50] | ECCV’18 | ResNet-50 | 69.59 | 0.747 | 0.164 | 16.10
DeepLabv3+ [31] | ECCV’18 | ResNet-50 | 69.95 | 0.767 | 0.147 | 15.49
BiSeNet [51] | ECCV’18 | ResNet-50 | 80.00 | 0.830 | 0.106 | 11.04
DANet [7] | CVPR’19 | ResNet-50 | 84.15 | 0.864 | 0.089 | 8.96
CCNet [52] | ICCV’19 | ResNet-50 | 84.29 | 0.867 | 0.085 | 8.63
GFFNet [53] | AAAI’20 | ResNet-50 | 82.41 | 0.855 | 0.090 | 9.11
FaPN [54] | ICCV’21 | ResNet-101 | 86.65 | 0.887 | 0.062 | 5.69
RAS [55] | ECCV’18 | ResNet-50 | 80.96 | 0.830 | 0.106 | 9.48
DSS [56] | TPAMI’19 | ResNet-50 | 80.24 | 0.799 | 0.123 | 9.73
EGNet [22] | ICCV’19 | ResNet-50 | 85.05 | 0.870 | 0.083 | 7.43
F3Net [57] | AAAI’20 | ResNet-50 | 84.79 | 0.870 | 0.082 | 7.38
ITSD [60] | CVPR’20 | ResNet-50 | 83.72 | 0.862 | 0.087 | 7.77
MirrorNet [17] | ICCV’19 | ResNeXt-101 | 85.07 | 0.866 | 0.083 | 7.67
TransLab [58] | ECCV’20 | ResNet-50 | 81.64 | 0.849 | 0.097 | 9.70
GSD [59] | CVPR’21 | ResNeXt-101 | 87.53 | 0.895 | 0.066 | 5.90
PGSNet [20] | TIP’22 | ResNeXt-101 | 87.81 | 0.901 | 0.062 | 5.56
TGSNet (ours) | - | ResNeXt-101 | 88.47 | 0.908 | 0.058 | 5.70
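The tables report IoU↑, Fβ↑, MAE↓, and BER↓ on GDD [15]; the Fβ column follows the weighted F-measure of Margolin et al. [47]. As a reference for how such numbers are typically computed, here is a minimal single-image sketch using the standard definitions; the plain Fβ with β² = 0.3 below is a simplified stand-in for the weighted variant, and the function name and threshold are assumptions for illustration.

```python
import numpy as np

def glass_metrics(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5):
    """Sketch of the table's metrics for one image.
    pred: predicted glass probability map in [0, 1]; gt: binary mask."""
    mae = np.abs(pred - gt).mean()            # mean absolute error (lower is better)

    p, g = pred >= thr, gt >= 0.5
    tp = np.logical_and(p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()

    iou = tp / max(tp + fp + fn, 1)           # intersection over union

    # Balanced error rate (%), as commonly used in glass/mirror detection.
    ber = 100 * (1 - 0.5 * (tp / max(tp + fn, 1) + tn / max(tn + fp, 1)))

    # Plain F-beta with beta^2 = 0.3; the paper's Fβ is the weighted
    # F-measure of Margolin et al. [47], which differs in the details.
    prec, rec = tp / max(tp + fp, 1), tp / max(tp + fn, 1)
    fbeta = (1.3 * prec * rec) / max(0.3 * prec + rec, 1e-8)

    return iou, fbeta, mae, ber
```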
Quantitative comparison with transparent-object and glass detection methods on the GDD dataset [15]:

Method | Venue | Backbone | IoU↑ | Fβ↑ | MAE↓ | BER↓
---|---|---|---|---|---|---
Trans2seg [16] | IJCAI’21 | ResNet-50 | 84.41 | 0.872 | 0.078 | 7.36
Trans4Trans [35] | ICCVW’21 | PVT-Medium | 84.94 | 0.878 | 0.076 | 6.86
GDNet [15] | CVPR’20 | ResNeXt-101 | 87.63 | 0.898 | 0.063 | 5.62
EBLNet [26] | ICCV’21 | ResNeXt-101 | 84.98 | 0.879 | 0.076 | 7.24
TGSNet (ours) | - | ResNeXt-101 | 88.47 | 0.908 | 0.058 | 5.70
Complexity and efficiency comparison; FLOPs are listed for four input resolutions (“/”: not reported):

Method | FLOPs (G) @352×352 | @384×384 | @416×416 | @512×512 | Params (M) | Speed (s/image) | Memory (MiB)
---|---|---|---|---|---|---|---
GDNet [15] | 194.48 | 231.45 | 271.63 | 411.46 | 201.72 | 0.18 | 1623
GSD [59] | 77.892 | 92.697 | 108.790 | / | / | / | /
EBLNet [26] | 255.34 | 303.87 | 356.63 | 540.2 | 111.45 | 0.23 | 6515
PGSNet [20] | 80.789 | 96.145 | 112.837 | / | / | / | /
TGSNet (ours) | 49.86 | 58.98 | 69.57 | 104.87 | 185.472 | 0.26 | 8921
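FLOPs and parameter counts such as those above can be obtained with the flops-counter.pytorch tool cited in the References (installed as the `ptflops` package). A minimal usage sketch follows; the ResNeXt backbone stands in as a placeholder, since TGSNet's implementation is not reproduced here.

```python
import torchvision.models as models
from ptflops import get_model_complexity_info  # pip install ptflops

# Placeholder network; substitute the actual TGSNet module here.
model = models.resnext101_32x8d()

for size in (352, 384, 416, 512):
    macs, params = get_model_complexity_info(
        model, (3, size, size),
        as_strings=True, print_per_layer_stat=False, verbose=False)
    # ptflops reports multiply-accumulate operations (GMac); tables of
    # this kind often quote these values directly as FLOPs (G).
    print(f"{size}x{size}: {macs}, {params}")
```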
Ablation on the CAM module (GDD [15]):

Variant | Backbone | IoU↑ | Fβ↑ | MAE↓ | BER↓
---|---|---|---|---|---
a. Conv + MFT + CCFF | ResNeXt-101 | 87.29 | 0.898 | 0.062 | 6.36
b. CAM + MFT + CCFF | ResNeXt-101 | 88.47 | 0.908 | 0.058 | 5.70
Ablation on the MFT module (GDD [15]):

Variant | Backbone | Params (M) | IoU↑ | Fβ↑ | MAE↓ | BER↓
---|---|---|---|---|---|---
a. CAM + CCFF | ResNeXt-101 | 63.260 | 87.58 | 0.884 | 0.074 | 6.36
b. CAM + self-attention + CCFF | ResNeXt-101 | 249.389 | 88.28 | 0.902 | 0.060 | 5.89
c. CAM + convolutional attention + CCFF | ResNeXt-101 | 185.472 | 88.47 | 0.908 | 0.058 | 5.70
Ablation on the CCFF module (GDD [15]):

Variant | Backbone | IoU↑ | Fβ↑ | MAE↓ | BER↓
---|---|---|---|---|---
a. CAM + MFT + C-ASPP | ResNeXt-101 | 87.31 | 0.896 | 0.063 | 6.58
b. CAM + MFT + C-ASPP + MHB | ResNeXt-101 | 88.05 | 0.903 | 0.060 | 6.00
c. CAM + MFT + C-ASPP + TCH + MHB | ResNeXt-101 | 88.47 | 0.908 | 0.058 | 5.70
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hu, X.; Gao, R.; Yang, S.; Cho, K. TGSNet: Multi-Field Feature Fusion for Glass Region Segmentation Using Transformers. Mathematics 2023, 11, 843. https://doi.org/10.3390/math11040843