Depth Prior-Guided 3D Voxel Feature Fusion for 3D Semantic Estimation from Monocular Videos
Abstract
1. Introduction
- We propose a feature separation method for 3D semantic estimation and 3D mesh reconstruction, addressing the interference between these tasks in existing methods;
- We introduce the use of depth priors from a pretrained MVS network to guide the allocation of image features to 3D voxels, improving the accuracy of feature placement (see the sketch after this list);
- We utilize a pretrained ViT trained with self-supervised learning, which is adept at capturing global contextual information, enhancing the robustness and accuracy of the semantic features;
- Our method produces a 3D semantic volume that can be combined with a reconstructed 3D scene obtained from any existing 3D reconstruction method, enhancing flexibility and applicability;
- We present experimental results on real-world datasets that demonstrate significant improvements in 3D semantic estimation accuracy, validating the effectiveness of the proposed method.
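To make the second contribution concrete, the following is a minimal sketch of depth prior-guided voxel feature fusion. It is not the authors' released code: the function name `fuse_voxel_features`, the Gaussian agreement weighting, and the tolerance `sigma` are illustrative assumptions. The sketch projects voxel centers into each keyframe, samples 2D ViT features at the projected pixels, and down-weights samples whose projected depth disagrees with the MVS depth prior before averaging across views.

```python
# Minimal sketch (assumed formulation, not the paper's exact implementation).
import torch
import torch.nn.functional as F


def fuse_voxel_features(voxel_xyz, feats, depth_priors, K, cam_T_world, sigma=0.1):
    """
    voxel_xyz:    (N, 3) voxel centers in world coordinates.
    feats:        (V, C, H, W) 2D feature maps from a pretrained ViT (one per keyframe).
    depth_priors: (V, H, W) depth maps predicted by a pretrained MVS network.
    K:            (V, 3, 3) camera intrinsics.
    cam_T_world:  (V, 4, 4) world-to-camera extrinsics.
    sigma:        assumed tolerance (in meters) on depth-prior agreement.
    Returns (N, C) fused per-voxel features.
    """
    V, C, H, W = feats.shape
    N = voxel_xyz.shape[0]
    accum = torch.zeros(N, C)
    weight = torch.zeros(N, 1)

    homo = torch.cat([voxel_xyz, torch.ones(N, 1)], dim=1)           # (N, 4) homogeneous coords
    for v in range(V):
        cam = (cam_T_world[v] @ homo.T).T[:, :3]                     # voxels in camera frame
        z = cam[:, 2]                                                 # projected depth of each voxel
        uv = (K[v] @ cam.T).T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                   # pixel coordinates

        # Normalize to [-1, 1] for grid_sample; keep only voxels in front of the camera and in view.
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, N, 1, 2)
        valid = (z > 0) & (grid[0, :, 0, 0].abs() <= 1) & (grid[0, :, 0, 1].abs() <= 1)

        # Sample the ViT features and the depth prior at the projected pixels.
        f = F.grid_sample(feats[v:v + 1], grid, align_corners=True)[0, :, :, 0].T           # (N, C)
        d = F.grid_sample(depth_priors[v:v + 1, None], grid, align_corners=True)[0, 0, :, 0]  # (N,)

        # Gaussian weight: voxels far from the predicted depth surface receive little feature mass.
        w = torch.exp(-((z - d) ** 2) / (2 * sigma ** 2)) * valid.float()
        accum += f * w[:, None]
        weight += w[:, None]

    return accum / weight.clamp(min=1e-6)
```

The weighting scheme is one plausible way to realize "depth-guided allocation": it concentrates image features near the surface predicted by the MVS prior instead of smearing them along the whole viewing ray, which is the interference the feature separation contribution targets.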
2. Related Works
2.1. Three-Dimensional Semantic Segmentation Using Three-Dimensional Data Obtained from Three-Dimensional Sensors
2.2. Three-Dimensional Semantic Scene Reconstruction from Videos
2.2.1. Object-Level 3D Semantic Reconstruction Methods
2.2.2. Scene-Level 3D Semantic Reconstruction Methods
3. Three-Dimensional Semantic Volume Estimation via Depth Prior-Guided Three-Dimensional Voxel Feature Fusion
3.1. Overview
3.2. Keyframe Processing
3.3. Depth Prior-Guided 3D Voxel Feature Fusion
3.4. Implementation Details
4. Experimental Results and Analysis
4.1. Datasets and Metrics
4.2. Experimental Results on the ScanNet v2 Dataset
4.2.1. Quantitative Comparisons
4.2.2. Qualitative Comparisons
4.3. Generalization Experimental Results on the SceneNN Dataset
4.3.1. Quantitative Comparisons
4.3.2. Qualitative Comparisons
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References