Deep Monocular Depth Estimation Based on Content and Contextual Features
Abstract
1. Introduction
- This work proposes a deep autoencoder network that leverages the benefits of squeeze-and-excitation networks (SENets) presented in [5]. SENet blocks recalibrate the channels of convolutional neural network (CNN) feature maps, enhancing channel interdependencies and improving feature representation without significant computational overhead (a minimal sketch of such a block follows this list). The proposed network is designed to extract precise content and structural information from monocular images, leveraging the power of deep learning to accurately predict depth from RGB input.
- This work proposes to enhance the accuracy of depth prediction for monocular images by leveraging the well-known semantic segmentation model HRNet-V2, as presented in [6]. HRNet-V2 enriches the content features with contextual semantic information, enabling the model to better capture object boundaries and maintain high-level representations of small objects in images. By integrating the strengths of HRNet-V2 with a deep learning approach to monocular depth prediction, this study aims to advance the state of the art in this field.
- The proposed model is an integrated framework combining two autoencoders to accurately predict high-resolution depth maps from monocular images. By leveraging the strengths of both models, the integrated framework is designed to provide a more robust and accurate prediction of depth, even in challenging scenarios. The proposed framework aims to advance the field of monocular depth prediction by providing a unified approach that can capture the richness and complexity of the real world while maintaining computational efficiency.
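The content encoder relies on the channel-recalibration idea of SENets [5]. As a rough illustration only (not code from the paper), the PyTorch sketch below implements a generic squeeze-and-excitation block; the module name, reduction ratio, and tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: re-weights feature channels using
    a gating vector computed from globally pooled channel statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global spatial average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # bottleneck FC layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # restore channel dimension
            nn.Sigmoid(),                                 # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)        # (B, C): channel descriptors
        w = self.excite(w).view(b, c, 1, 1)   # (B, C, 1, 1): channel weights
        return x * w                          # recalibrated feature map


# Example: recalibrate a feature map produced by a CNN backbone
feats = torch.randn(2, 64, 56, 56)
print(SEBlock(64)(feats).shape)  # torch.Size([2, 64, 56, 56])
```

The appeal of this design, as noted above, is that the extra fully connected layers add very few parameters relative to the convolutional backbone.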
2. Related Work
3. Methodology
3.1. Problem Formulation
3.2. Network Architecture
3.2.1. Content Encoder
3.2.2. Semantic Encoder
3.2.3. Decoder
3.3. Loss Functions
4. Experiments and Results
4.1. Dataset
4.1.1. NYU Depth-v2 Dataset
4.1.2. SUN RGB-D Dataset
4.2. Parameter Settings
4.3. Evaluation Measures
4.4. Results and Discussion
4.4.1. Ablation Study
- Baseline: a single autoencoder network, as proposed in [12], trained with the point-wise and SSIM losses (see the combined-loss sketch after this list).
- Baseline with skip connections: the same autoencoder network with skip connections that feed the feature maps extracted by the encoder layers to the corresponding decoder layers.
- Proposed model: the baseline with skip connections, further enriched with the features extracted by the encoder of the semantic segmentation autoencoder.
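For reference, the point-wise and SSIM terms mentioned above can be combined into a single training loss, as in the minimal sketch below. The box-filter SSIM approximation, the choice of L1 as the point-wise term, the 0.85/0.15 weighting, and the assumption of depth maps normalized to [0, 1] are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ssim(pred: torch.Tensor, target: torch.Tensor, window: int = 7,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Simplified SSIM map computed with a uniform (box-filter) local window."""
    mu_p = F.avg_pool2d(pred, window, stride=1)
    mu_t = F.avg_pool2d(target, window, stride=1)
    sigma_p = F.avg_pool2d(pred * pred, window, stride=1) - mu_p ** 2
    sigma_t = F.avg_pool2d(target * target, window, stride=1) - mu_t ** 2
    sigma_pt = F.avg_pool2d(pred * target, window, stride=1) - mu_p * mu_t
    num = (2 * mu_p * mu_t + c1) * (2 * sigma_pt + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (sigma_p + sigma_t + c2)
    return num / den

def depth_loss(pred: torch.Tensor, target: torch.Tensor, w_ssim: float = 0.85) -> torch.Tensor:
    """Weighted sum of a point-wise (L1) depth term and an SSIM-based structural term."""
    l_point = torch.mean(torch.abs(pred - target))                    # point-wise error
    l_ssim = torch.clamp((1 - ssim(pred, target)) / 2, 0, 1).mean()   # structural error
    return w_ssim * l_ssim + (1 - w_ssim) * l_point

# Example on dummy depth maps (B, 1, H, W) with values normalized to [0, 1]
pred = torch.rand(2, 1, 64, 64)
gt = torch.rand(2, 1, 64, 64)
print(depth_loss(pred, gt).item())
```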
4.4.2. Analysing Performance
5. Conclusions and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abs Rel | Absolute Relative Error |
ADAM | Adaptive Moment Estimation Optimization Method |
BL | Baseline |
BLSC | Baseline with Skip Connections |
BN | Batch Normalization |
DE | Depth Estimation |
E1 | Content Encoder |
E2 | Semantic Encoder |
MDE | Monocular Depth Estimation |
MedErr | Median Error |
MLF | Multi Loss Function |
MRF | Markov Random Field |
MSE | Mean Squared Error |
Rel | Relative Error |
ReLU | Rectified Linear Unit |
RMSE | Root Mean Square Error |
SCLoss | Semantic Context Loss |
SCs | Skip Connections |
SENets | Squeeze-and-Excitation Networks |
SSIM | Structural Similarity |
References
- Simões, F.; Almeida, M.; Pinheiro, M.; Dos Anjos, R.; Dos Santos, A.; Roberto, R.; Teichrieb, V.; Suetsugo, C.; Pelinson, A. Challenges in 3d reconstruction from images for difficult large-scale objects: A study on the modeling of electrical substations. In Proceedings of the 2012 14th Symposium on Virtual and Augmented Reality, Rio de Janeiro, Brazil, 28–31 May 2012; pp. 74–83.
- Abdulwahab, S.; Rashwan, H.A.; García, M.Á.; Jabreel, M.; Chambon, S.; Puig, D. Adversarial Learning for Depth and Viewpoint Estimation From a Single Image. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2947–2958.
- Abdulwahab, S.; Rashwan, H.A.; Garcia, M.A.; Masoumian, A.; Puig, D. Monocular depth map estimation based on a multi-scale deep architecture and curvilinear saliency feature boosting. Neural Comput. Appl. 2022, 34, 16423–16440.
- Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vis. 2018, 127, 302–321.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374.
- Li, B.; Shen, C.; Dai, Y.; Van Den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1119–1127.
- Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2282.
- Long, X.; Lin, C.; Liu, L.; Li, W.; Theobalt, C.; Yang, R.; Wang, W. Adaptive surface normal constraint for depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 12849–12858.
- Kopf, J.; Rong, X.; Huang, J.B. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1611–1621.
- Alhashim, I.; Wonka, P. High quality monocular depth estimation via transfer learning. arXiv 2018, arXiv:1812.11941.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Bhat, S.F.; Alhashim, I.; Wonka, P. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018.
- Li, Z.; Wang, X.; Liu, X.; Jiang, J. BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation. arXiv 2022, arXiv:2204.00987.
- Kim, D.; Lee, S.; Lee, J.; Kim, J. Leveraging contextual information for monocular depth estimation. IEEE Access 2020, 8, 147808–147817.
- Gao, T.; Wei, W.; Cai, Z.; Fan, Z.; Xie, S.Q.; Wang, X.; Yu, Q. CI-Net: A joint depth estimation and semantic segmentation network using contextual information. Appl. Intell. 2022, 52, 18167–18186.
- Mousavian, A.; Pirsiavash, H.; Košecká, J. Joint semantic segmentation and depth estimation with deep convolutional networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 611–619.
- Valdez-Rodríguez, J.E.; Calvo, H.; Felipe-Riverón, E.; Moreno-Armendáriz, M.A. Improving Depth Estimation by Embedding Semantic Segmentation: A Hybrid CNN Model. Sensors 2022, 22, 1669.
- Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 582–600.
- Jiao, J.; Cao, Y.; Song, Y.; Lau, R. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 53–69.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Zhang, T.; Qi, G.J.; Xiao, B.; Wang, J. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4373–4382.
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
- Lehtinen, J.; Munkberg, J.; Hasselgren, J.; Laine, S.; Karras, T.; Aittala, M.; Aila, T. Noise2noise: Learning image restoration without clean data. arXiv 2018, arXiv:1803.04189.
- Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; p. 3.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 746–760.
- Song, S.; Lichtenberg, S.P.; Xiao, J. Sun rgb-d: A rgb-d scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 567–576.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G. PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. PyTorch 2017, 6, 67.
- Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5162–5170.
- Hao, Z.; Li, Y.; You, S.; Lu, F. Detail preserving depth estimation from a single image using attention guided networks. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 304–313.
- Ramamonjisoa, M.; Firman, M.; Watson, J.; Lepetit, V.; Turmukhambetov, D. Single Image Depth Estimation using Wavelet Decomposition. arXiv 2021, arXiv:2106.02022.
- Tang, M.; Chen, S.; Dong, R.; Kan, J. Encoder-Decoder Structure with the Feature Pyramid for Depth Estimation From a Single Image. IEEE Access 2021, 9, 22640–22650.
- Chen, X.; Chen, X.; Zha, Z.J. Structure-aware residual pyramid network for monocular depth estimation. arXiv 2019, arXiv:1907.06023.
- Yin, W.; Liu, Y.; Shen, C.; Yan, Y. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5684–5693.
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326.
Method | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Rel ↓ | RMS ↓ | log10 ↓
---|---|---|---|---|---|---
Baseline | 0.833 | 0.969 | 0.9928 | 0.14 | 0.532 | 0.056
Baseline with skip connections | 0.842 | 0.971 | 0.9931 | 0.148 | 0.525 | 0.054
Our model | 0.8523 | 0.974 | 0.9935 | 0.121 | 0.523 | 0.0527
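To make the table columns concrete, the sketch below computes the standard monocular depth metrics reported here: the threshold accuracies δ < 1.25^k (higher is better), the absolute relative error (Rel), the root mean square error (RMS), and the mean log10 error (lower is better). It is a generic reference implementation over valid pixels, not evaluation code from the paper.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth-estimation metrics over valid (positive-depth) pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "delta1": np.mean(ratio < 1.25),            # threshold accuracy, higher is better
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
        "rel":    np.mean(np.abs(pred - gt) / gt),  # absolute relative error
        "rms":    np.sqrt(np.mean((pred - gt) ** 2)),
        "log10":  np.mean(np.abs(np.log10(pred) - np.log10(gt))),
    }

# Example with synthetic ground truth and a prediction within ±10% of it
gt = np.random.uniform(0.5, 10.0, size=(480, 640))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(depth_metrics(pred, gt))
```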
Method | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Rel ↓ | RMS ↓ | log10 ↓
---|---|---|---|---|---|---
Baseline | 0.82 | 0.945 | 0.972 | 0.144 | 0.46 | 0.066
Baseline with skip connections | 0.826 | 0.948 | 0.973 | 0.141 | 0.46 | 0.064
Our model | 0.837 | 0.950 | 0.974 | 0.136 | 0.45 | 0.062
Method | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Rel ↓ | RMS ↓ | log10 ↓
---|---|---|---|---|---|---
Hao et al. [33] | 0.841 | 0.966 | 0.991 | 0.127 | 0.555 | 0.053
Ramamonjisoa et al. [34] | 0.8451 | 0.9681 | 0.9917 | 0.1258 | 0.551 | 0.054
Alhashim et al. [12] | 0.846 | 0.97 | 0.99 | 0.123 | 0.465 | 0.053
Tang et al. [35] | 0.826 | 0.963 | 0.992 | 0.132 | 0.579 | 0.056
Our model | 0.8523 | 0.974 | 0.9935 | 0.121 | 0.523 | 0.0527
Method | Encoder | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Rel ↓ | RMS ↓ | log10 ↓
---|---|---|---|---|---|---|---
Chen et al. [36] | SENet-154 | 0.757 | 0.943 | 0.984 | 0.166 | 0.494 | 0.071
Yin et al. [37] | ResNeXt-101 | 0.696 | 0.912 | 0.973 | 0.183 | 0.541 | 0.082
BTS [38] | DenseNet-161 | 0.740 | 0.933 | 0.980 | 0.172 | 0.515 | 0.075
AdaBins [14] | E-B5+Mini-ViT | 0.771 | 0.944 | 0.983 | 0.159 | 0.476 | 0.068
BinsFormer [15] | ResNet-18 | 0.738 | 0.935 | 0.982 | 0.175 | 0.504 | 0.074
BinsFormer [15] | Swin-Tiny | 0.760 | 0.945 | 0.985 | 0.162 | 0.478 | 0.069
BinsFormer [15] | Swin-Large | 0.805 | 0.963 | 0.990 | 0.143 | 0.421 | 0.061
Our model | SENet-154 | 0.837 | 0.950 | 0.974 | 0.136 | 0.45 | 0.062
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).