Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation
Abstract
:1. Introduction
2. Related Work
3. Proposed Method
3.1. Overview
3.2. Cross-Modal Feature Rectification Module
3.3. Coordinate Attention Fusion Module
3.4. Loss Function
4. Experiment
4.1. Datasets and Evaluation Measures
4.2. Implementation Details
4.3. Quantitative Results on NYUv2 and SUNRGB-D
4.4. Qualitative Results on NYUv2
4.5. Ablation Study on NYUv2
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Liu, H.; Jin, Y.; Zhao, C. Real-time trust region ground plane segmentation for monocular mobile robots. In Proceedings of the 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, China, 5–8 December 2017; pp. 952–958. [Google Scholar]
- Li, Y.; Liu, H.; Tang, H. Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1456–1463. [Google Scholar]
- Hu, P.; Held, D.; Ramanan, D. Learning to optimally segment point clouds. IEEE Robot. Autom. Lett. 2020, 5, 875–882. [Google Scholar] [CrossRef]
- Nowicki, M.R. Spatiotemporal calibration of camera and 3D laser scanner. IEEE Robot. Autom. Lett. 2020, 5, 6451–6458. [Google Scholar] [CrossRef]
- Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted Res-UNet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education (ITME), Hangzhou, China, 19–21 October 2018; pp. 327–331. [Google Scholar]
- Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; De Lange, T.; Halvorsen, P.; Johansen, H.D. Resunet++: An advanced architecture for medical image segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–2255. [Google Scholar]
- Zhou, H.; Qi, L.; Huang, H.; Yang, X.; Wan, Z.; Wen, X. Canet: Co-attention network for RGB-D semantic segmentation. Pattern Recognit. 2022, 124, 108468. [Google Scholar]
- Zhou, W.; Yang, E.; Lei, J.; Wan, J.; Yu, L. PGDENet: Progressive guided fusion and depth enhancement network for RGB-D indoor scene parsing. IEEE Trans. Multimed. 2022, 25, 3483–3494. [Google Scholar] [CrossRef]
- Zhou, W.; Yang, E.; Lei, J.; Yu, L. FRNet: Feature reconstruction network for RGB-D indoor scene parsing. IEEE J. Sel. Top. Signal Process. 2022, 16, 677–687. [Google Scholar] [CrossRef]
- Su, Y.; Yuan, Y.; Jiang, Z. Deep feature selection-and-fusion for RGB-D semantic segmentation. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
- Wang, Y.; Sun, F.; Huang, W.; He, F.; Tao, D. Channel exchanging networks for multimodal and multitask dense image prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5481–5496. [Google Scholar] [CrossRef] [PubMed]
- Liu, H.; Zhang, J.; Yang, K.; Hu, X.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. arXiv 2022, arXiv:2203.04838. [Google Scholar]
- Alonso, I.; Riazuelo, L.; Murillo, A.C. Mininet: An efficient semantic segmentation convnet for real-time robotic applications. IEEE Trans. Robot. 2020, 36, 1340–1347. [Google Scholar] [CrossRef]
- Hao, X.; Hao, X.; Zhang, Y.; Li, Y.; Wu, C. Real-time semantic segmentation with weighted factorized-depthwise convolution. Image Vis. Comput. 2021, 114, 104269. [Google Scholar] [CrossRef]
- Hu, P.; Perazzi, F.; Heilbron, F.C.; Wang, O.; Lin, Z.; Saenko, K.; Sclaroff, S. Real-time semantic segmentation with fast attention. IEEE Robot. Autom. Lett. 2020, 6, 263–270. [Google Scholar] [CrossRef]
- Seichter, D.; Fischedick, S.B.; Köhler, M.; Groß, H.M. Efficient multi-task RGB-D scene analysis for indoor environments. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–10. [Google Scholar]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. ECCV 2012, 7576, 746–760. [Google Scholar]
- Song, S.; Lichtenberg, S.P.; Xiao, J. Sun RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar]
- Cao, J.; Leng, H.; Lischinski, D.; Cohen-Or, D.; Tu, C.; Li, Y. ShapeConv: Shape-aware convolutional layer for indoor RGB-D semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7088–7097. [Google Scholar]
- Chen, L.Z.; Lin, Z.; Wang, Z.; Yang, Y.L.; Cheng, M.M. Spatial information guided convolution for real-time RGBD semantic segmentation. IEEE Trans. Image Process. 2021, 30, 2313–2324. [Google Scholar] [CrossRef]
- Chen, X.; Lin, K.Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XI; Springer: Cham, Switzerland, 2020; pp. 561–577. [Google Scholar]
- Lu, M.; Chen, Z.; Wu, Q.; Wang, N.; Rong, X.; Yan, X. FRNet: Factorized and regular blocks network for semantic segmentation in road scene. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3522–3530. [Google Scholar] [CrossRef]
- Elhassan, M.; Huang, C.; Yang, C.; Munea, T. DSANet: Dilated spatial attention for real-time semantic segmentation in urban street scenes. Expert Syst. Appl. 2021, 183, 115090. [Google Scholar] [CrossRef]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Roberts, M.; Ramapuram, J.; Ranjan, A.; Kumar, A.; Bautista, M.A.; Paczan, N.; Webb, R.; Susskind, J.M. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10912–10922. [Google Scholar]
- Zhou, W.; Lin, X.; Lei, J.; Yu, L.; Hwang, J.N. MFFENet: Multiscale feature fusion and enhancement network for RGB–thermal urban road scene parsing. IEEE Trans. Multimed. 2021, 24, 2526–2538. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Seichter, D.; Köhler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.M. Efficient RGB-D semantic segmentation for indoor scene analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13525–13531. [Google Scholar]
- Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
Method | Backbone | NYUv2 | SUNRGB-D | FPS | ||
---|---|---|---|---|---|---|
PA (%) | mIoU (%) | PA (%) | mIoU (%) | |||
ShapeConv * [22] | ResNext101 32x8d | 76.4 | 51.3 | - | - | 25.97 |
ShapeConv * [22] | Res101 | - | - | 82.2 | 48.6 | 25.97 |
SGNet * [23] | Res101 | 76.8 | 51.1 | 82.0 | 48.6 | N/A |
SA-Gate [24] | Res50 | 77.9 | 52.4 | 82.5 | 49.4 | 24.71 |
CMX * [15] | SegFormer-B5 | 80.1 | 56.9 | 83.8 | 52.4 | 11.13 |
ESANet (pre. SceneNet) [31] | Res34NBt1D | - | 51.6 | - | 48.5 | 43.66 |
EMSANet (pre. Hypersim) [19] | Res34NBt1D | 78.1 | 53.3 | 81.9 | 48.5 | 46.23 |
Ours (Semantic only) | Res34NBt1D | 76.7 | 50.3 | 82.0 | 48.1 | 42.97 |
Ours | Res34NBt1D | 76.8 | 51.3 | 81.9 | 48.3 | 42.97 |
Ours (pre. Hypersim) | Res34NBt1D | 78.8 | 54.5 | 82.3 | 49.1 | 42.97 |
Backbone | CM-FRM | Fusion | NYUv2 | SUNRGB-D | FPS |
---|---|---|---|---|---|
mIoU (%) | mIoU (%) | ||||
Res34NBt1D | × | RGB-D Fusion | 49.20 | 46.81 | 46.18 |
Res34NBt1D | ✓ | RGB-D Fusion | 50.27 | 47.88 | 43.26 |
Res34NBt1D | × | CAFM | 49.41 | 47.11 | 45.28 |
Res34NBt1D | ✓ | CAFM | 50.38 | 48.10 | 42.97 |
Res18 | ✓ | CAFM | 46.04 | 45.17 | 63.37 |
Res34 | ✓ | CAFM | 47.06 | 46.36 | 54.57 |
Res50 | ✓ | CAFM | 49.86 | 47.05 | 28.03 |
Res18NBt1D | ✓ | CAFM | 46.74 | 46.09 | 55.21 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xu, X.; Liu, J.; Liu, H. Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation. Electronics 2023, 12, 3943. https://doi.org/10.3390/electronics12183943
Xu X, Liu J, Liu H. Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation. Electronics. 2023; 12(18):3943. https://doi.org/10.3390/electronics12183943
Chicago/Turabian StyleXu, Xinhua, Jinfu Liu, and Hong Liu. 2023. "Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation" Electronics 12, no. 18: 3943. https://doi.org/10.3390/electronics12183943
APA StyleXu, X., Liu, J., & Liu, H. (2023). Interactive Efficient Multi-Task Network for RGB-D Semantic Segmentation. Electronics, 12(18), 3943. https://doi.org/10.3390/electronics12183943