Article

Depth Prior-Guided 3D Voxel Feature Fusion for 3D Semantic Estimation from Monocular Videos

1 Department of Multimedia Engineering, Dongguk University-Seoul, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
2 Division of AI Software Convergence, Dongguk University-Seoul, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 2114; https://doi.org/10.3390/math12132114
Submission received: 17 June 2024 / Revised: 1 July 2024 / Accepted: 4 July 2024 / Published: 5 July 2024
(This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision)

Abstract: Existing 3D semantic scene reconstruction methods utilize the same set of features extracted from deep learning networks for both 3D semantic estimation and geometry reconstruction, ignoring the differing requirements of semantic segmentation and geometry construction tasks. Additionally, current methods allocate 2D image features to all voxels along camera rays during the back-projection process, without accounting for empty or occluded voxels. To address these issues, we propose separating the features for 3D semantic estimation from those for 3D mesh reconstruction. We use a pretrained vision transformer network for image feature extraction and depth priors estimated by a pretrained multi-view stereo network to guide the allocation of image features within 3D voxels during the back-projection process. The back-projected image features are aggregated within each 3D voxel via averaging, creating coherent voxel features. The resulting 3D feature volume, composed of unified voxel feature vectors, is fed into a 3D CNN with a semantic classification head to produce a 3D semantic volume. This volume can be combined with existing 3D mesh reconstruction networks to produce a 3D semantic mesh. Experimental results on real-world datasets demonstrate that the proposed method significantly increases 3D semantic estimation accuracy.

1. Introduction

Three-dimensional (3D) semantic scene reconstruction is an important task in the field of computer vision [1,2], involving the reconstruction of 3D scenes from image or video data while simultaneously assigning semantic labels to various parts of the scene [3]. This dual-task capability enables machines to understand and interact with their environment more effectively, facilitating applications in robot navigation [4], autonomous driving [5], augmented reality [6,7], and more. By providing detailed and comprehensive environmental understanding, 3D semantic scene reconstruction enhances machine perception and decision-making processes.
Traditionally, 3D semantic scene reconstruction has relied heavily on 3D data from sensors such as LiDAR and depth cameras [8,9]. While effective, these sensors are often expensive, bulky, and less portable, limiting their widespread use. Consequently, there is a growing interest in leveraging 2D video data, which are more readily available, cost-effective, and easier to handle.
Current methods for 3D semantic scene reconstruction from 2D video data often use the same set of features for both semantic estimation and geometry reconstruction. This practice overlooks the distinct requirements of the two tasks: semantic estimation benefits from global contextual information, while geometry reconstruction relies on local feature matching; as a result, performance is suboptimal. An alternative is to apply 3D semantic segmentation algorithms to meshes reconstructed from video data. However, this approach is not yet adopted by existing algorithms and has significant limitations. Although 3D semantic segmentation algorithms [10,11,12] typically perform well on high-quality point clouds sampled from 3D sensor data, point clouds generated from 2D video data via mesh reconstruction are often less accurate and noisier. This lack of geometric detail makes the direct application of 3D semantic segmentation infeasible, as models trained on high-quality 3D data cannot generalize to these less accurate reconstructions, resulting in significant performance degradation. Additionally, existing 3D semantic scene reconstruction methods [13,14] typically back-project 2D image features into 3D voxel volumes without considering occlusions or empty spaces, resulting in inaccurate feature allocations.
To address these issues, we propose a novel method that separates the features used for 3D semantic estimation from those used for 3D mesh reconstruction. Our method leverages a pretrained vision transformer (ViT) network [15], specifically DINOv2 [16], which was trained using self-supervised techniques to learn robust visual features without manual annotation. ViTs, such as DINOv2, utilize self-attention mechanisms to capture long-range dependencies and integrate global contextual information, making them particularly effective for semantic tasks.
We use DINOv2 to extract rich, contextual image features from 2D video frames. These features are then guided by depth priors estimated by a pretrained multi-view stereo (MVS) network during the back-projection process to ensure accurate feature allocation within the 3D voxel volume. This depth-guided method helps prevent irrelevant features from being assigned to voxels, addressing one of the main shortcomings of existing methods.
The back-projected features are averaged within each voxel to create coherent voxel features. This 3D feature volume, composed of unified voxel feature vectors, is then fed into a 3D U-Net proposed by Finerecon [17] with a semantic classification head to produce a 3D semantic volume. This semantic volume can be seamlessly integrated with existing mesh reconstruction networks to generate a 3D semantic mesh, offering a modular method that improves semantic estimation accuracy without being affected by the precision of the reconstructed mesh.
The main contributions of our work are as follows:
  • We propose a feature separation method for 3D semantic estimation and 3D mesh reconstruction, addressing the interference between these tasks in existing methods;
  • We introduce the use of depth priors from a pretrained MVS network to guide the allocation of image features in 3D voxels, improving the accuracy of feature placement;
  • We utilize a pretrained ViT, which is adept at capturing global contextual information through self-supervised learning techniques, enhancing the robustness and accuracy of semantic features;
  • Our method produces a 3D semantic volume that can be combined with a 3D scene reconstructed by any existing 3D reconstruction method, enhancing flexibility and applicability;
  • We present experimental results on real-world datasets that demonstrate significant improvements in 3D semantic estimation accuracy, validating the effectiveness of the proposed method.
The remainder of this paper is organized as follows: In Section 2, we review related work in the field of 3D semantic scene reconstruction. Section 3 details the proposed method. Section 4 presents the experimental results and analysis, demonstrating the effectiveness of the proposed method. Section 5 discusses the proposed method based on experimental results. Finally, in Section 6, we conclude the paper and discuss potential future directions.

2. Related Works

2.1. Three-Dimensional Semantic Segmentation Using Three-Dimensional Data Obtained from Three-Dimensional Sensors

The field of 3D semantic segmentation has been extensively researched, particularly with the advent of advanced neural network architectures such as PointNet [18] and the sparse convolutional neural network (CNN) [19]. These techniques focus on processing 3D point clouds derived from range sensors, enriched with additional attributes such as RGB values and normal vectors for each point. These methods assign semantic labels to each point in the point cloud, requiring precise 3D geometric inputs. Methods [8,20,21,22] that accept RGB-D images as input have also been widely studied. These methods operate in a similar fashion: they first adopt an off-the-shelf 2D semantic segmentation network to obtain 2D semantic labels per pixel and then fuse the depths and labels into a voxel volume. Various strategies for fusing semantic labels into the voxel volume have been proposed. For instance, Pham et al. [8] developed a method that constructs 3D meshes with voxel hashing from depth images and subsequently enhances the initial semantic labeling through super-voxel clustering and a high-order conditional random field to improve labeling coherence. Additionally, Cavallari and Di Stefano [21] and Rosinol et al. [22] have explored techniques that use depth images from 3D sensors to back-project 2D semantic labels into 3D voxels using a voting mechanism; the label of each voxel is decided by the semantic label with the most votes. Reliable geometry representations (e.g., 3D point clouds or depth images from 3D sensors) are required to guarantee the performance of these algorithms. When they are applied to estimated geometric representations, there is a noticeable decline in segmentation performance due to the imprecise estimated geometry.

2.2. Three-Dimensional Semantic Scene Reconstruction from Videos

Methods that leverage RGB video inputs for 3D semantic reconstruction can be divided into object-level and scene-level approaches. Object-level reconstruction focuses on identifying and reconstructing individual objects within a scene with semantic context, whereas scene-level reconstruction aims to semantically interpret and reconstruct the entire scene comprehensively. This distinction underscores the complexity and potential of 3D semantic reconstruction from RGB video inputs.

2.2.1. Object-Level 3D Semantic Reconstruction Methods

Object-level semantic reconstruction methods typically integrate detection and subsequent reconstruction processes [4,6,23,24,25,26,27]. Notable methods include Vid2CAD [6] and Mask2CAD [23], which detect objects within a scene and align CAD models from pre-existing databases to these detected objects in 3D space. However, a significant limitation of these methods is the mismatch between retrieved CAD models and the actual objects present in the scene, which may not always align perfectly in terms of shape and features.
To mitigate reliance on CAD databases, adaptive deformation strategies [24] have been developed, allowing a template shape to better match the object in the scene. The choice of the initial template is crucial, as significant differences between the geometric structure of the template and the target object can lead to inaccuracies. For example, using a spherical template [25] to model objects with complex features can result in significant inaccuracies, highlighting the challenge of template selection for diverse object geometries.
Additionally, implicit representation-based methods for mesh reconstruction [4,26,27] have emerged. These methods learn to reconstruct the mesh conditioned on object categories, significantly impacted by the learned shape priors. A common limitation of these methods is their tendency to successfully reconstruct only objects within the categories they were trained on [4,28]. Objects belonging to previously unseen categories often remain unreconstructed, and structural elements of the scene, such as walls and floors, are typically not reconstructed. This highlights a significant challenge in object-level reconstruction: the need for methods that generalize beyond their training data to encompass a broader array of object types and scene components.

2.2.2. Scene-Level 3D Semantic Reconstruction Methods

The domain of 3D semantic reconstruction from posed videos, particularly within the deep learning community, is relatively unexplored. Early pioneering efforts, such as Atlas [13], integrated the estimation of 3D geometry and semantic information from voxel features derived from a 3D CNN network. The initial approach to feature aggregation involved averaging back-projected multi-view image features without determining their relative importance, resulting in a noisy representation due to the indiscriminate fusion of multi-view information. To address this, CDRNet [29] incorporated a gated recurrent unit (GRU) for a more nuanced integration of multi-view features, selectively emphasizing more relevant features to mitigate noise issues.
Despite these advancements, existing methods often back-project 2D image features into 3D voxel volumes without considering occlusions or empty spaces, leading to inaccurate feature allocations. Moreover, directly applying 3D semantic segmentation algorithms to meshes reconstructed from video data can be problematic due to differences in reconstruction accuracy. These issues necessitate a more refined method for feature allocation and semantic estimation.

3. Three-Dimensional Semantic Volume Estimation via Depth Prior-Guided Three-Dimensional Voxel Feature Fusion

3.1. Overview

Figure 1 provides an overview of the proposed method. This method outputs a 3D semantic volume, where each voxel is classified with a semantic label. The semantic volume can then be combined with existing mesh reconstruction networks to generate a reconstructed 3D semantic scene from a posed RGB video.
Semantic volume estimation is divided into three steps: keyframe processing, depth prior-guided 3D voxel feature fusion, and 3D semantic volume estimation. Given a sequence of keyframes $\{I_i\}_{i=1}^{N}$ from a video, a pretrained MVS network predicts depth priors $\{D_i\}_{i=1}^{N}$ for all keyframes. An image encoder is used to extract the image features $\{F_i\}_{i=1}^{N}$ for all keyframes. These image features are then unprojected to specific voxels in a global 3D voxel volume, guided by the estimated depth priors, to form a 3D voxel feature volume. A 3D semantic volume decoder, consisting of a 3D CNN with the same structure as the one used in Finerecon and a classification head, is then applied to the 3D voxel feature volume to obtain voxel-wise semantic estimation results.
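To make the data flow concrete, the following minimal sketch (PyTorch-style Python) wires the three steps together. The names `mvs_net`, `image_encoder`, `semantic_decoder`, and the helper `fuse_voxel_features` (sketched in Section 3.3) are hypothetical placeholders for the pretrained MVS network, the DINOv2-based encoder, and the 3D CNN decoder, not the actual implementation.

```python
def estimate_semantic_volume(keyframes, intrinsics, poses, voxel_grid,
                             mvs_net, image_encoder, semantic_decoder, delta=0.24):
    """Illustrative top-level flow of the proposed pipeline (module names are placeholders)."""
    # Step 1: keyframe processing -- depth priors D_i and image features F_i.
    depth_priors = mvs_net(keyframes, intrinsics, poses)        # one depth map per keyframe
    features = [image_encoder(frame) for frame in keyframes]    # one feature map per keyframe

    # Step 2: depth prior-guided 3D voxel feature fusion
    # (see the fuse_voxel_features sketch in Section 3.3).
    feature_volume = fuse_voxel_features(features, depth_priors,
                                         intrinsics, poses, voxel_grid, delta)

    # Step 3: 3D semantic volume estimation (3D CNN + classification head).
    return semantic_decoder(feature_volume)
```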

3.2. Keyframe Processing

The image encoder network is composed of a pretrained ViT-Small (ViT-S), DINOv2, and a CNN consisting of three Conv-BN-ReLU blocks, as shown in Figure 2. The DINOv2-extracted feature consists of 384 channels. Direct utilization of this feature in a 3D task is computationally expensive and can lead to excessive memory usage. The CNN module serves two main purposes: first, it reduces the channel size, thus alleviating the memory and computational burden; second, it refines the feature to better suit the 3D semantic task.
During training, DINOv2 is frozen, and only the CNN module is trained. This approach reduces the computational load, leading to faster convergence. The depth priors are obtained from an off-the-shelf MVS network [30].
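A possible PyTorch sketch of this encoder is given below. The torch.hub entry point for DINOv2 and the intermediate and output channel widths are assumptions; the paper only specifies a frozen ViT-S backbone producing 384-channel features and three trainable Conv-BN-ReLU blocks.

```python
import torch
import torch.nn as nn

class KeyframeEncoder(nn.Module):
    """Frozen DINOv2 ViT-S backbone + trainable CNN that reduces and refines the features."""
    def __init__(self, out_channels=64):            # output width is an assumption
        super().__init__()
        # DINOv2 ViT-S/14 via torch.hub (assumed entry point from the DINOv2 repository).
        self.backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        for p in self.backbone.parameters():
            p.requires_grad = False                  # backbone stays frozen during training

        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # Three Conv-BN-ReLU blocks reduce the 384-channel ViT feature.
        self.cnn = nn.Sequential(block(384, 192), block(192, 96), block(96, out_channels))

    def forward(self, image):                        # image: (B, 3, H, W), H and W divisible by 14
        B, _, H, W = image.shape
        tokens = self.backbone.forward_features(image)['x_norm_patchtokens']   # (B, N, 384)
        feat = tokens.transpose(1, 2).reshape(B, 384, H // 14, W // 14)        # tokens -> 2D map
        return self.cnn(feat)                        # (B, out_channels, H/14, W/14)
```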

3.3. Depth Prior-Guided 3D Voxel Feature Fusion

This section introduces the multi-view semantic feature fusion process depicted in Figure 3. In this process, the estimated depth images $\{D_i\}_{i=1}^{N}$ from the MVS module guide the mapping of image features $\{F_i\}_{i=1}^{N}$ to a 3D voxel volume.
Assuming $(x, y, z)$ is the coordinate of voxel $v_j$ in the world coordinate system, the mapped pixel coordinates $(\hat{x}, \hat{y})$ and the projected depth $\hat{z}$ on keyframe $i$ can be computed as follows:
$$\begin{bmatrix} \hat{x} \\ \hat{y} \\ \hat{z} \end{bmatrix} = \Pi\left( K \begin{bmatrix} R_i & t_i \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} \right),$$
where $\Pi$ represents the perspective mapping, $R_i$ and $t_i$ are the extrinsic parameters of the camera for keyframe $i$, and $K$ represents the intrinsic parameters of the camera.
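In code, this projection amounts to a world-to-camera transform, the intrinsic mapping, and a perspective division. The sketch below is illustrative, assuming row-vector tensors, a 3×3 intrinsic matrix $K$, and world-to-camera extrinsics $R_i$, $t_i$.

```python
import torch

def project_voxels(voxel_xyz, K, R, t):
    """Project voxel centers (world coordinates) into a keyframe.

    voxel_xyz: (M, 3) world coordinates; K: (3, 3); R: (3, 3); t: (3,).
    Returns pixel coordinates (x_hat, y_hat) and projected depth z_hat, each of shape (M,).
    """
    cam = voxel_xyz @ R.T + t                        # world -> camera frame
    uvw = cam @ K.T                                  # apply intrinsics
    z_hat = uvw[:, 2]
    x_hat = uvw[:, 0] / z_hat.clamp(min=1e-6)        # perspective division (the Pi mapping)
    y_hat = uvw[:, 1] / z_hat.clamp(min=1e-6)
    return x_hat, y_hat, z_hat
```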
The feature assignment of keyframe $i$ for voxel $v_j$ can be computed as follows:
$$V_i(:, x, y, z) = \begin{cases} F_i(:, \hat{x}, \hat{y}), & \text{if } \left|\hat{z} - D_i(\hat{x}, \hat{y})\right| \le \delta, \\ 0, & \text{otherwise}, \end{cases}$$
where $:$ denotes the slicing operation, $D_i(\hat{x}, \hat{y})$ is the depth value sampled from the estimated depth image of keyframe $i$, and $F_i(:, \hat{x}, \hat{y})$ is the image feature sampled from the image feature plane. The threshold $\delta$ bounds the absolute difference between the estimated depth value and the projected depth $\hat{z}$ of the voxel, and it determines whether $F_i(:, \hat{x}, \hat{y})$ is assigned to voxel $v_j$.
Voxels $v_{j-1}$, $v_j$, and $v_{j+1}$ are neighboring voxels along a camera ray, as shown in Figure 3. They project to the same position on the image plane of keyframe $i$, and the differences between their projected depths and the estimated depth at that position are smaller than $\delta$; thus, the same sampled image feature is assigned to all of them. This depth prior-guided approach helps avoid allocating image features to voxels that should be empty or occluded, by ensuring that only voxels whose projected depths are close to the estimated depth are considered for image feature assignment. It also mitigates the impact of depth estimation inaccuracies by preventing voxels that represent actual 3D geometry from being ignored.
Image features from other keyframes are also back-projected in the same manner, resulting in a multi-view feature volume $\{V_i\}_{i=1}^{N}$. By averaging these back-projected features at the voxel level, a fused semantic feature volume is constructed. This volume is then refined by a 3D CNN [17] with a semantic classification head for 3D semantic estimation.
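The depth prior-guided assignment and per-voxel averaging can be sketched as follows, reusing `project_voxels` from the previous sketch. Nearest-neighbor sampling and the dense accumulator layout are simplifying assumptions rather than the actual implementation.

```python
import torch

def fuse_voxel_features(features, depth_priors, Ks, Rs, ts, voxel_xyz, delta=0.24):
    """Average depth prior-guided back-projected features over N keyframes.

    features: list of (C, Hf, Wf) feature maps; depth_priors: list of (H, W) depth maps;
    voxel_xyz: (M, 3) voxel centers in world coordinates; delta: depth threshold in meters.
    """
    C = features[0].shape[0]
    M = voxel_xyz.shape[0]
    acc = torch.zeros(M, C)
    count = torch.zeros(M, 1)

    for F_i, D_i, K, R, t in zip(features, depth_priors, Ks, Rs, ts):
        x_hat, y_hat, z_hat = project_voxels(voxel_xyz, K, R, t)
        H, W = D_i.shape
        # nearest-neighbor sampling locations on the depth map and the (coarser) feature map
        ud = x_hat.round().long().clamp(0, W - 1)
        vd = y_hat.round().long().clamp(0, H - 1)
        uf = (x_hat * F_i.shape[2] / W).round().long().clamp(0, F_i.shape[2] - 1)
        vf = (y_hat * F_i.shape[1] / H).round().long().clamp(0, F_i.shape[1] - 1)

        # keep voxels that project inside the image, lie in front of the camera, and whose
        # projected depth z_hat is within delta of the estimated depth prior D_i
        in_img = (x_hat >= 0) & (x_hat < W) & (y_hat >= 0) & (y_hat < H)
        keep = in_img & (z_hat > 0) & ((z_hat - D_i[vd, ud]).abs() <= delta)
        acc[keep] += F_i[:, vf[keep], uf[keep]].T
        count[keep] += 1

    return acc / count.clamp(min=1)                  # per-voxel average; untouched voxels stay zero
```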

3.4. Implementation Details

Keyframe selection is based on relative camera pose, following the method described in [31]. The voxel size of the voxel volume is set to 4 cm, consistent with the settings used in most existing mesh reconstruction networks [13,14,32]. The threshold $\delta$ is set to 24 cm. The network is implemented in PyTorch 2.0 and both trained and tested on a Linux server equipped with six RTX A6000 graphics cards. During training, voxel volume chunks of scenes with fixed dimensions (3.84 m × 3.84 m × 2.24 m) are randomly cropped and fed into the network. The training process lasts approximately 22 h. For each cropped chunk, 20 keyframes are selected to generate the 3D voxel feature volume.
The loss used for training the network is defined as follows:
$$L_s = \frac{1}{|V|} \sum_{v \in V} CE\left(S^*(v), S(v)\right),$$
where $L_s$ is the 3D semantic loss, $V$ is the set of voxels, $S^*$ is the ground truth 3D semantic volume, $S$ is the estimated 3D semantic volume, and $CE$ denotes the cross-entropy loss.
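In PyTorch, this voxel-wise loss reduces to a single cross-entropy call over a dense 5D prediction tensor; the tensor layout and the use of an ignore index for unlabeled voxels are illustrative assumptions.

```python
import torch.nn.functional as F

def semantic_loss(pred_logits, gt_labels, ignore_index=-100):
    """pred_logits: (B, num_classes, X, Y, Z) estimated volume S, stored as class logits;
    gt_labels: (B, X, Y, Z) integer ground-truth volume S*.
    cross_entropy averages over the (non-ignored) voxels, i.e., the 1/|V| factor."""
    return F.cross_entropy(pred_logits, gt_labels, ignore_index=ignore_index)
```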

4. Experimental Results and Analysis

4.1. Datasets and Metrics

The experiments were conducted on two popular existing datasets for 3D semantic scene segmentation, namely the ScanNet v2 dataset [33] and the SceneNN dataset [34]. The ScanNet v2 dataset was divided into training, validation, and test sets according to the official splits, with the ground truth for the test set hidden. This dataset is a common benchmark in 3D semantic scene reconstruction studies, including works such as Atlas, Finerecon, and Panopticfusion [20]. It includes 1513 scans of 707 scenes, covering various types of real-world indoor spaces, such as offices, apartments, and bathrooms. Following existing works [13,29], the network is trained solely on the training set and evaluated offline on the validation set and online on the test set. Semantic labels from the reconstructed semantic mesh are transferred to the ground truth mesh using K-nearest neighbor (KNN) matching to obtain the semantic results. This approach ensures that the comparison is based on consistent geometric alignment between the reconstructed and ground truth meshes. The benchmark provides semantic estimation results for 20 indoor object categories using the mean intersection over union (mIoU) metric. To further verify the generalization capability, we also report evaluation results on the SceneNN dataset, which includes 100 diverse scenes captured using a different sensor from that used for ScanNet v2. Both datasets provide comprehensive coverage of real-world indoor scenarios.
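The KNN label transfer used for evaluation can be sketched with scikit-learn as follows; the choice of k = 1 and the array layout are assumptions, since the paper does not specify these details.

```python
from sklearn.neighbors import NearestNeighbors

def transfer_labels(recon_vertices, recon_labels, gt_vertices):
    """Give each ground-truth mesh vertex the label of its nearest vertex on the
    reconstructed semantic mesh, so both meshes share the same geometry for scoring.

    recon_vertices: (Nr, 3) NumPy array; recon_labels: (Nr,); gt_vertices: (Ng, 3)."""
    nn = NearestNeighbors(n_neighbors=1).fit(recon_vertices)
    _, idx = nn.kneighbors(gt_vertices)              # index of the nearest reconstructed vertex
    return recon_labels[idx[:, 0]]                   # (Ng,) transferred labels
```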

4.2. Experimental Results of ScanNet v2 Dataset

4.2.1. Quantitative Comparisons

Table 1 presents the mIoU results for 3D semantic estimation evaluated on the ScanNet v2 test set, as reported by the ScanNet v2 benchmark. Our method achieves a significantly higher mIoU score of 0.468, outperforming all of the compared methods. Among the compared methods, CDRNet achieves the second-highest mIoU score of 0.391. NeuralRecon [14] with added semantic heads and VoRTX [32] with added semantic heads, which were reimplemented by CDRNet, achieve mIoU scores of 0.279 and 0.132, respectively. Atlas achieves an mIoU score of 0.340. Although Atlas offers better performance than NeuralRecon and VoRTX, it still falls significantly short of our method.
Figure 4 illustrates the per-category evaluation of 3D semantic estimation on the ScanNet v2 validation set, comparing our method with Atlas. Our method consistently outperforms Atlas across all categories, demonstrating its superior generalization capability and robustness. We chose to compare only with Atlas due to the availability of detailed dataset processing methods and pretrained models in their code repository, which enabled a fair and reproducible comparison.

4.2.2. Qualitative Comparisons

Figure 5 presents qualitative comparisons of 3D semantic estimation results on the ScanNet v2 validation set, showcasing the ground truth (GT), Atlas, and our proposed method. For a more intuitive comparison, we directly compare the 3D semantic estimation results on the 3D ground truth mesh, which helps to mitigate the influence of mesh quality variations. Semantic labels from the reconstructed semantic mesh are transferred to the ground truth mesh using KNN to obtain the semantic results. The red boxes highlight areas where our method correctly labels objects that Atlas misclassifies. Our method demonstrates superior performance, accurately identifying and labeling various objects such as cabinets, beds, and shelves, resulting in a closer resemblance to the ground truth.

4.3. Generalization Experimental Results on SceneNN Dataset

4.3.1. Quantitative Comparisons

Table 2 presents the evaluation of 3D semantic estimation using the SceneNN dataset, demonstrating the generalization capability of various methods trained exclusively on the ScanNet v2 training set without further fine-tuning on the SceneNN dataset. Despite the lack of specific adaptation to SceneNN, our method achieves the highest mIoU score of 0.518, significantly outperforming the compared methods. Atlas attains an mIoU of 0.314, NeuralRecon with added semantic heads reaches 0.159, and CDRNet achieves an mIoU of 0.367. These results, reported by CDRNet, underscore our method's superior ability to generalize across different datasets. The substantial margin by which our approach surpasses Atlas and NeuralRecon highlights its robustness in producing accurate semantic estimations in diverse and unseen environments.

4.3.2. Qualitative Comparisons

Figure 6 provides visual comparisons between the performance of Atlas and that of our proposed method on the SceneNN dataset for generalization experiments. The red boxes highlight areas where our method correctly labels objects that Atlas misclassifies.
In the first row, our method accurately labels the chairs and tables, particularly in the edge areas between these objects, while Atlas fails to do so. The second row demonstrates that Atlas misclassifies a chair as a sofa and fails to recognize the table, whereas our method correctly identifies both objects. In the third row, Atlas does not output correct labels for the cabinet and counter, which are accurately labeled by our method. The last row shows that Atlas fails to correctly estimate the refrigerator and sink, and also over-segments the table, partially classifying it as a table and partially classifying it as a cabinet. In contrast, our method provides accurate semantic labels for all these objects, demonstrating superior generalization capability on the SceneNN dataset.

5. Discussion

This study presents a novel method for 3D semantic estimation from monocular videos, leveraging depth priors to guide voxel feature fusion. Our approach effectively addresses the allocation of image features within 3D voxel volumes and separates the features used for semantic estimation from those used for geometry reconstruction. The experimental results for the ScanNet v2 and SceneNN datasets highlight our method’s superior performance and generalization capability compared with existing methods like Atlas and CDRNet. The use of depth priors and a pretrained ViT for image feature extraction contributes significantly to the robustness of our approach, ensuring accurate semantic estimation even in challenging scenarios. Despite these promising results, the reliance on depth priors means that our method’s performance may be influenced by the quality of depth estimation. In future work, incorporating depth reliability information, which indicates the likelihood of depth existence in different voxels, could enhance the robustness of the network. Additionally, the inference speed of the proposed method is slow due to the computationally intensive processes involved in depth prior estimation and ViT image feature extraction. Future work could also explore the development of more lightweight depth estimation networks and image feature extraction methods to improve the speed of inference and move towards real-time performance. Additionally, investigating dynamic feature recognition, such as identifying and tracking moving objects (e.g., humans), could further enhance the applicability of our method in real-world scenarios.

6. Conclusions

In this paper, we introduced a depth prior-guided 3D voxel feature fusion method for 3D semantic estimation from monocular videos. Our method achieves significant improvements in semantic estimation accuracy and demonstrates strong generalization capabilities across different datasets. The integration of depth priors and a pretrained ViT enhances the robustness and accuracy of our approach, making it a valuable contribution to the field of 3D semantic scene reconstruction. Future work may focus on optimizing the feature fusion process to handle the variations in different depth priors.

Author Contributions

Conceptualization, M.W.; methodology, M.W.; software, M.W.; validation, M.W.; investigation, M.W.; writing—original draft preparation, M.W.; writing—review and editing, M.W.; visualization, M.W.; supervision, K.C.; project administration, K.C.; funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (2022R1A2C2006864) and by Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2024-RS-2023-00254592) grant funded by the Korea government (MSIT).

Data Availability Statement

The datasets used in this study are publicly available. The ScanNet v2 dataset can be accessed at http://www.scan-net.org/, accessed on 4 July 2024, and the SceneNN dataset is available at https://hkust-vgd.github.io/scenenn/, accessed on 4 July 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shi, W.; Xu, J.; Zhu, D.; Zhang, G.; Wang, X.; Li, J.; Zhang, X. RGB-D semantic segmentation and label-oriented voxelgrid fusion for accurate 3D semantic mapping. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 183–197. [Google Scholar] [CrossRef]
  2. Han, L.; Zheng, T.; Zhu, Y.; Xu, L.; Fang, L. Live semantic 3d perception for immersive augmented reality. IEEE Trans. Vis. Comput. Graph. 2020, 26, 2012–2022. [Google Scholar] [CrossRef]
  3. Kundu, A.; Genova, K.; Yin, X.; Fathi, A.; Pantofaru, C.; Guibas, L.J.; Tagliasacchi, A.; Dellaert, F.; Funkhouser, T. Panoptic neural fields: A semantic object-aware neural scene representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 12871–12881. [Google Scholar]
  4. Li, S.; Cheng, K.-T. Joint stereo 3D object detection and implicit surface reconstruction. Sci. Rep. 2024, 14, 13893. [Google Scholar] [CrossRef] [PubMed]
  5. Shao, H.; Wang, L.; Chen, R.; Li, H.; Liu, Y. Safety-enhanced autonomous driving using interpretable sensor fusion transformer. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; pp. 726–737. [Google Scholar]
  6. Maninis, K.-K.; Popov, S.; Nießner, M.; Ferrari, V. Vid2cad: Cad model alignment using multi-view constraints from videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1320–1327. [Google Scholar] [CrossRef] [PubMed]
  7. Wen, M.; Cho, K. Object-aware 3d scene reconstruction from single 2d images of indoor scenes. Mathematics 2023, 11, 403. [Google Scholar] [CrossRef]
  8. Pham, Q.-H.; Hua, B.-S.; Nguyen, T.; Yeung, S.-K. Real-time progressive 3D semantic segmentation for indoor scenes. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 7–11 January 2019; pp. 1089–1098. [Google Scholar]
  9. Huang, S.-S.; Chen, H.; Huang, J.; Fu, H.; Hu, S.-M. Real-time globally consistent 3D reconstruction with semantic priors. IEEE Trans. Vis. Comput. Graph. 2021, 29, 1977–1991. [Google Scholar] [CrossRef]
  10. Yang, Y.-Q.; Guo, Y.-X.; Xiong, J.-Y.; Liu, Y.; Pan, H.; Wang, P.-S.; Tong, X.; Guo, B. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv 2023, arXiv:2304.06906. [Google Scholar]
  11. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point transformer v2: Grouped vector attention and partition-based pooling. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 33330–33342. [Google Scholar]
  12. Kolodiazhnyi, M.; Vorontsova, A.; Konushin, A.; Rukhovich, D. OneFormer3D: One transformer for unified point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 20943–20953. [Google Scholar]
  13. Murez, Z.; Van As, T.; Bartolozzi, J.; Sinha, A.; Badrinarayanan, V.; Rabinovich, A. Atlas: End-to-end 3d scene reconstruction from posed images. In Proceedings of the European Conference and Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 414–431. [Google Scholar]
  14. Sun, J.; Xie, Y.; Chen, L.; Zhou, X.; Bao, H. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 15598–15607. [Google Scholar]
  15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–5 May 2021. [Google Scholar]
  16. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  17. Stier, N.; Ranjan, A.; Colburn, A.; Yan, Y.; Yang, L.; Ma, F.; Angles, B. Finerecon: Depth-aware feed-forward network for detailed 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 18423–18432. [Google Scholar]
  18. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  19. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
  20. Narita, G.; Seno, T.; Ishikawa, T.; Kaji, Y. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 2–8 November 2019; pp. 4205–4212. [Google Scholar]
  21. Cavallari, T.; Di Stefano, L. Semanticfusion: Joint labeling, tracking and mapping. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 648–664. [Google Scholar]
  22. Rosinol, A.; Abate, M.; Chang, Y.; Carlone, L. Kimera: An open-source library for real-time metric-semantic localization and mapping. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Online, 31 May–31 August 2020; pp. 1689–1696. [Google Scholar]
  23. Kuo, W.; Angelova, A.; Lin, T.-Y.; Dai, A. Mask2cad: 3d shape prediction by learning to segment and retrieve. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 260–277. [Google Scholar]
  24. Li, K.; DeTone, D.; Chen, Y.F.S.; Vo, M.; Reid, I.; Rezatofighi, H.; Sweeney, C.; Straub, J.; Newcombe, R. Odam: Object detection, association, and mapping using posed rgb video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5998–6008. [Google Scholar]
  25. Goel, S.; Kanazawa, A.; Malik, J. Shape and viewpoint without keypoints. In Proceedings of the European Conference and Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 88–104. [Google Scholar]
  26. Tyszkiewicz, M.J.; Maninis, K.-K.; Popov, S.; Ferrari, V. RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 211–228. [Google Scholar]
  27. Runz, M.; Li, K.; Tang, M.; Ma, L.; Kong, C.; Schmidt, T.; Reid, I.; Agapito, L.; Straub, J.; Lovegrove, S. Frodo: From detections to 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14720–14729. [Google Scholar]
  28. Li, K.; Rezatofighi, H.; Reid, I. Moltr: Multiple object localization, tracking and reconstruction from monocular rgb videos. IEEE Robot. Autom. Lett. 2021, 6, 3341–3348. [Google Scholar] [CrossRef]
  29. Hong, Z.; Yue, C.P. Real-Time 3D Visual Perception by Cross-Dimensional Refined Learning. IEEE Trans. Circuits Syst. Video Technol. 2024. [Google Scholar] [CrossRef]
  30. Sayed, M.; Gibson, J.; Watson, J.; Prisacariu, V.; Firman, M.; Godard, C. Simplerecon: 3d reconstruction without 3d convolutions. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–19. [Google Scholar]
  31. Duzceker, A.; Galliani, S.; Vogel, C.; Speciale, P.; Dusmanu, M.; Pollefeys, M. Deepvideomvs: Multi-view stereo on video with recurrent spatio-temporal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 15324–15333. [Google Scholar]
  32. Stier, N.; Rich, A.; Sen, P.; Höllerer, T. Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion. In Proceedings of the International Conference on 3D Vision, London, UK, 1–3 December 2021; pp. 320–330. [Google Scholar]
  33. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  34. Hua, B.-S.; Pham, Q.-H.; Nguyen, D.T.; Tran, M.-K.; Yu, L.-F.; Yeung, S.-K. Scenenn: A scene meshes dataset with annotations. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 92–101. [Google Scholar]
Figure 1. An overview of the proposed method for 3D semantic estimation used for reconstructing a 3D scene. The proposed method can be used for 3D semantic scene reconstruction from videos.
Figure 2. Keyframe image feature extraction leveraging the pretrained ViT-S, DINOv2 [16]. The ViT-S output is refined by a CNN module to adapt the features for the semantic estimation task.
Figure 3. Depth prior-guided voxel feature fusion. The image features are unprojected to specific voxels guided by depth priors. The semantic voxel features are fused by averaging these unprojected image features inside each voxel.
Figure 4. Per-category evaluation of 3D semantic estimation on the ScanNet v2 validation set. The red bars indicate the scores obtained by our method, while the blue bars represent the scores obtained by Atlas [13]. Higher scores indicate better performance.
Figure 5. Qualitative comparisons on ScanNet v2 validation set between Atlas [13] and our method.
Figure 6. Qualitative comparisons of performance on SceneNN dataset between Atlas [13] and our method.
Table 1. Evaluation of 3D semantic estimation on the ScanNet v2 test set.
Methods | mIoU
Atlas [13] | 0.340
NeuralRecon [14] + Semantic-heads * | 0.279
VoRTX [32] + Semantic-heads * | 0.132
CDRNet [29] * | 0.391
Ours | 0.648
* The results are reported by CDRNet.
Table 2. Evaluation of 3D semantic estimation using the SceneNN dataset.
Methods | mIoU
Atlas [13] * | 0.314
NeuralRecon [14] + semantic-heads * | 0.159
CDRNet [29] * | 0.367
Ours | 0.518
* Results reported by CDRNet.