Unsupervised Learning of Depth from Monocular Videos Using 3D-2D Corresponding Constraints
Abstract
1. Introduction
- We propose an unsupervised depth-prediction method for autonomous moving platforms (AMPs) that learns from monocular video sequences and simultaneously estimates the depth structure of the scene and the camera ego-motion.
- Our method smooths the 3D correspondence vector field using the 2D image, making the spatial correspondence between pixel points consistent with image regions; this effectively improves the depth-prediction ability of the neural network (see the sketch after this list).
- The model is trained and evaluated on the KITTI dataset [16]. The evaluation shows that our unsupervised method outperforms existing methods of the same type and achieves better quality than other recent self-supervised and supervised methods.
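The corresponding-consistency term itself is derived in Section 3.3, which is not reproduced in this excerpt; the following is therefore only a minimal sketch of the idea under stated assumptions. It assumes TensorFlow (consistent with the paper's references), takes predicted depth and a 6-DOF ego-motion as inputs, and uses `backproject`, `correspondence_field`, and `consistency_loss` as hypothetical helper names: per-pixel 3D correspondence vectors are formed from depth and pose, and their spatial gradients are penalized with edge-aware weights from the 2D image, so the correspondence field stays smooth within image regions.

```python
import tensorflow as tf

def backproject(depth, K_inv):
    """Lift each pixel to a 3D point: X(p) = D(p) * K^{-1} [u, v, 1]^T.
    depth: [B, H, W, 1]; K_inv: [B, 3, 3]. Returns [B, H, W, 3]."""
    h, w = tf.shape(depth)[1], tf.shape(depth)[2]
    u, v = tf.meshgrid(tf.range(w), tf.range(h))          # [H, W] pixel grid
    pix = tf.cast(tf.stack([u, v, tf.ones_like(u)], -1), tf.float32)
    rays = tf.einsum('bij,hwj->bhwi', K_inv, pix)         # K^{-1} p per pixel
    return depth * rays                                   # scale rays by depth

def correspondence_field(depth, K_inv, R, t):
    """Per-pixel 3D correspondence vectors between adjacent frames, given
    predicted depth and 6-DOF ego-motion (R: [B, 3, 3], t: [B, 3])."""
    pts = backproject(depth, K_inv)                       # [B, H, W, 3]
    pts_moved = tf.einsum('bij,bhwj->bhwi', R, pts) + t[:, None, None, :]
    return pts_moved - pts                                # 3D displacement field

def consistency_loss(delta, image):
    """Edge-aware 2D smoothing of the 3D correspondence field: gradients of
    delta are penalized, down-weighted where the image itself has edges."""
    dx = tf.abs(delta[:, :, 1:, :] - delta[:, :, :-1, :])
    dy = tf.abs(delta[:, 1:, :, :] - delta[:, :-1, :, :])
    ix = tf.reduce_mean(tf.abs(image[:, :, 1:, :] - image[:, :, :-1, :]),
                        -1, keepdims=True)
    iy = tf.reduce_mean(tf.abs(image[:, 1:, :, :] - image[:, :-1, :, :]),
                        -1, keepdims=True)
    return tf.reduce_mean(dx * tf.exp(-ix)) + tf.reduce_mean(dy * tf.exp(-iy))
```

In training, a term of this form would be weighted and combined with the image reconstruction loss of Section 3.2.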
2. Related Work
3. The Proposed Approach
3.1. Differentiable Reprojection Error
3.2. Image Reconstruction Loss
3.3. Corresponding Consistency Loss
3.4. Learning Setup
4. Experiments and Discussion
4.1. Network Structure
4.2. Datasets Description
4.3. Experiment Settings
4.4. Comparisons with Other Methods
4.4.1. Evaluation of Depth Estimation
4.4.2. Evaluation of Ego-Motion
4.4.3. Depth Results on Apollo and Cityscapes
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AMP | Autonomous Moving Platforms
5G | Fifth-generation
SFM | Structure From Motion
SLAM | Simultaneous Localization And Mapping
CNN | Convolutional Neural Network
CRF | Conditional Random Fields
DCNN | Deep Convolutional Neural Networks
RNN | Recurrent Neural Network
Abs Rel | Absolute Relative error
Sq Rel | Square Relative error
RMSE | Root Mean Squared Error
RMSE log | Root Mean Squared logarithmic Error
ATE | Absolute Trajectory Error
6-DOF | Six degrees of freedom
2D | Two-dimensional
3D | Three-dimensional
References
- Wymeersch, H.; Seco-Granados, G.; Destino, G.; Dardari, D.; Tufvesson, F. 5G mmWave positioning for vehicular networks. IEEE Wirel. Commun. 2017, 24, 80–86. [Google Scholar] [CrossRef] [Green Version]
- Lu, Z.; Huang, Y.C.; Bangjun, C. A Study for Application in Vehicle Networking and Driverless Driving. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, Beijing, China, 6–8 December 2019; pp. 264–267. [Google Scholar]
- Zhao, Y.; Jin, F.; Wang, M.; Wang, S. Knowledge Graphs Meet Geometry for Semi-supervised Monocular Depth Estimation. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Hangzhou, China, 28–30 August 2020; pp. 40–52. [Google Scholar]
- Garg, R.; Kumar, B.G.V.; Carneiro, G.; Reid, I.D. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 740–756. [Google Scholar]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef] [Green Version]
- Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Liu, Z.; Xie, R.; Ran, L. Radar HRRP Target Recognition Based on Dynamic Learning with Limited Training Data. Remote Sens. 2021, 13, 750. [Google Scholar] [CrossRef]
- Kazimierski, W.; Zaniewicz, G. Determination of Process Noise for Underwater Target Tracking with Forward Looking Sonar. Remote Sens. 2021, 13, 1014. [Google Scholar] [CrossRef]
- Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV Maneuvering Target Tracking in Uncertain Environments Based on Deep Reinforcement Learning and Meta-Learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
- Guo, J.; Bai, C.; Guo, S. A Review of Monocular Depth Estimation Based on Deep Learning. Unmanned Syst. Technol. 2019, 3. Available online: https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&dbname=CJFDLAST2019&filename=UMST201902003&v=LxXxs2LYM%25mmd2FrpCJsoTtiaExYvBg0cRUvrHeXluBqPeql%25mmd2FO67HDuhfchKopV1yVha7 (accessed on 10 March 2021).
- Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2650–2658. [Google Scholar]
- Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2024–2039. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5667–5675. [Google Scholar]
- Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6602–6611. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 16. [Google Scholar] [CrossRef]
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
- Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
- Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A.L. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2800–2809. [Google Scholar]
- Jafari, O.H.; Groth, O.; Kirillov, A.; Yang, M.Y.; Rother, C. Analyzing modular CNN architectures for joint depth prediction and semantic segmentation. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4620–4627. [Google Scholar]
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011. [Google Scholar]
- Mancini, M.; Costante, G.; Valigi, P.; Ciarfuglia, T.A. Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4296–4303. [Google Scholar]
- Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5162–5170. [Google Scholar]
- Li, J.; Klein, R.; Yao, A. A two-streamed network for estimating fine-scaled depth maps from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3372–3380. [Google Scholar]
- Oliveira, G.L.; Radwan, N.; Burgard, W.; Brox, T. Topometric localization with deep learning. In Robotics Research; Springer: Berlin/Heidelberg, Germany, 2020; pp. 505–520. [Google Scholar]
- Clark, R.; Wang, S.; Wen, H.; Markham, A.; Trigoni, N. VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem. In Proceedings of the National Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3995–4001. [Google Scholar]
- Repala, V.K.; Dubey, S.R. Dual CNN models for unsupervised monocular depth estimation. In Proceedings of the International Conference on Pattern Recognition and Machine Intelligence, Tezpur, India, 17–20 December 2019; pp. 209–217. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
- Rezende, D.J.; Eslami, S.; Mohamed, S.; Battaglia, P.; Jaderberg, M.; Heess, N. Unsupervised learning of 3d structure from images. arXiv 2016, arXiv:1607.00662. [Google Scholar]
- Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network. arXiv 2015, arXiv:1511.06702. [Google Scholar]
- Vijayanarasimhan, S.; Ricco, S.; Schmid, C.; Sukthankar, R.; Fragkiadaki, K. SfM-Net: Learning of structure and motion from video. arXiv 2017, arXiv:1704.07804. [Google Scholar]
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992. [Google Scholar]
- Garg, R.; Wadhwa, N.; Ansari, S.; Barron, J.T. Learning single camera depth estimation using dual-pixels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7628–7637. [Google Scholar]
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning Depth from Monocular Videos Using Direct Methods. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Patait, A. An Introduction to the NVIDIA Optical Flow SDK. Available online: https://developer.nvidia.com/blog/an-introduction-to-the-nvidia-optical-flow-sdk/ (accessed on 13 February 2019).
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar]
- Wang, P.; Huang, X.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
Name | Stride | Output Channels | Input
---|---|---|---
conv1a | 2 | 32 | image
conv1b | 1 | 32 | conv1a
conv2a | 2 | 64 | conv1b
conv2b | 1 | 64 | conv2a
conv3a | 2 | 128 | conv2b
conv3b | 1 | 128 | conv3a
conv4a | 2 | 256 | conv3b
conv4b | 1 | 256 | conv4a
conv5a | 2 | 512 | conv4b
conv5b | 1 | 512 | conv5a
conv6a | 2 | 512 | conv5b
conv6b | 1 | 512 | conv6a
conv7a | 2 | 512 | conv6b
conv7b | 1 | 512 | conv7a
upcnv7 | 2 | 512 | conv7b
icnv7 | 1 | 512 | upcnv7 + conv6b
upcnv6 | 2 | 512 | icnv7
icnv6 | 1 | 512 | upcnv6 + conv5b
upcnv5 | 2 | 256 | icnv6
icnv5 | 1 | 256 | upcnv5 + conv4b
upcnv4 | 2 | 128 | icnv5
icnv4 | 1 | 128 | upcnv4 + conv3b
pred4 | 1 | 1 | icnv4
upcnv3 | 2 | 64 | icnv4
icnv3 | 1 | 64 | upcnv3 + conv2b + pred4up
pred3 | 1 | 1 | icnv3
upcnv2 | 2 | 32 | icnv3
icnv2 | 1 | 32 | upcnv2 + conv1b + pred3up
pred2 | 1 | 1 | icnv2
upcnv1 | 2 | 16 | icnv2
icnv1 | 1 | 16 | upcnv1 + pred2up
pred1 | 1 | 1 | icnv1
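The table above preserves strides, channel widths, and skip inputs, but the kernel-size and feature-size columns were lost in extraction. The Keras sketch below is therefore only illustrative: it assumes 3×3 kernels, reads the '+' entries in the Input column as feature concatenation, and uses a sigmoid output activation, none of which are confirmed details of the paper. It shows the encoder pairing (a stride-2 conv followed by a stride-1 conv) and the decoder pattern for the three finest scales only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, ch, stride, name):
    return layers.Conv2D(ch, 3, strides=stride, padding='same',
                         activation='relu', name=name)(x)

def upconv(x, ch, name):
    return layers.Conv2DTranspose(ch, 3, strides=2, padding='same',
                                  activation='relu', name=name)(x)

def depth_net_sketch(image):
    # Encoder: paired blocks, a stride-2 conv followed by a stride-1 conv.
    c1 = conv(conv(image, 32, 2, 'conv1a'), 32, 1, 'conv1b')
    c2 = conv(conv(c1, 64, 2, 'conv2a'), 64, 1, 'conv2b')
    c3 = conv(conv(c2, 128, 2, 'conv3a'), 128, 1, 'conv3b')
    # ... conv4a/b through conv7a/b continue the pattern up to 512 channels.

    # Decoder: upconvolve, fuse with the matching encoder feature (and, at
    # the finer scales, the upsampled coarser prediction), then predict.
    up3 = upconv(c3, 64, 'upcnv3')
    i3 = conv(layers.Concatenate()([up3, c2]), 64, 1, 'icnv3')
    pred3 = layers.Conv2D(1, 3, padding='same', activation='sigmoid',
                          name='pred3')(i3)

    up2 = upconv(i3, 32, 'upcnv2')
    pred3_up = layers.UpSampling2D(name='pred3up')(pred3)
    i2 = conv(layers.Concatenate()([up2, c1, pred3_up]), 32, 1, 'icnv2')
    pred2 = layers.Conv2D(1, 3, padding='same', activation='sigmoid',
                          name='pred2')(i2)
    return pred3, pred2   # the full network continues down to pred1
```

Multi-scale outputs of this kind (pred4 through pred1 in the table) are typically all supervised by the photometric loss at their respective resolutions.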
Name | Stride | Output Channels | Input
---|---|---|---
conv1 | 2 | 16 | image
conv2 | 2 | 32 | conv1
conv3 | 2 | 64 | conv2
conv4 | 2 | 128 | conv3
conv5 | 2 | 256 | conv4
Name | Stride | Output Channels | Input
---|---|---|---
conv6 | 2 | 256 | conv5
conv7 | 2 | 256 | conv6
conv8 | 1 | | conv7
Name | Stride | Output Channels | Input
---|---|---|---
upcnv5 | 2 | 256 | conv5
upcnv4 | 2 | 128 | upcnv5
mask4 | 1 | | upcnv4
upcnv3 | 2 | 64 | upcnv4
mask3 | 1 | | upcnv3
upcnv2 | 2 | 32 | upcnv3
mask2 | 1 | | upcnv2
upcnv1 | 2 | 16 | upcnv2
mask1 | 1 | 1 | upcnv1
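As with the depth network, the kernel sizes and some output widths of the pose/mask branch in the three tables above are not recoverable from this extraction. The sketch below assumes 3×3 kernels, a 1×1 conv8, 6×(number of source frames) pose channels (one 6-DOF vector per source frame), and single-channel masks; these are common design choices, offered here as assumptions rather than confirmed details of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def pose_mask_net(image_stack, num_source=2):
    """image_stack: target frame concatenated with source frames (assumed)."""
    # Shared encoder: conv1-conv5, each stride 2, channels per the table.
    x = image_stack
    for i, ch in enumerate([16, 32, 64, 128, 256], start=1):
        x = layers.Conv2D(ch, 3, strides=2, padding='same',
                          activation='relu', name=f'conv{i}')(x)

    # Pose head: conv6/conv7 (stride 2, 256 channels), then conv8 projects
    # to the pose parameters; spatial averaging yields one 6-DOF vector
    # (3 translation + 3 rotation components) per source frame.
    p = x
    for name in ('conv6', 'conv7'):
        p = layers.Conv2D(256, 3, strides=2, padding='same',
                          activation='relu', name=name)(p)
    p = layers.Conv2D(6 * num_source, 1, strides=1, name='conv8')(p)
    pose = tf.reshape(tf.reduce_mean(p, axis=[1, 2]), [-1, num_source, 6])

    # Mask decoder: upcnv5-upcnv1 climb back up from conv5, with a stride-1
    # mask prediction at each of the four finest scales.
    m = x
    masks = []
    for level, ch in zip([5, 4, 3, 2, 1], [256, 128, 64, 32, 16]):
        m = layers.Conv2DTranspose(ch, 3, strides=2, padding='same',
                                   activation='relu', name=f'upcnv{level}')(m)
        if level <= 4:
            masks.append(layers.Conv2D(1, 3, padding='same',
                                       name=f'mask{level}')(m))
    return pose, masks
```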
Method | Cap (m) | Dataset | Depth Sup. | Pose Sup. | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³
---|---|---|---|---|---|---|---|---|---|---|---
Eigen et al. [20] Coarse | 80 | K | √ | | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957
Eigen et al. [20] Fine | 80 | K | √ | | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958
Liu et al. [13] | 80 | K | √ | | 0.202 | 1.614 | 6.307 | 0.282 | 0.678 | 0.895 | 0.965
Zhou et al. [5] | 80 | K | | | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957
Zhou et al. [5] | 80 | K + C | | | 0.183 | 1.595 | 6.720 | 0.270 | 0.733 | 0.901 | 0.959
Ours | 80 | K | | | 0.176 | 1.497 | 6.898 | 0.274 | 0.739 | 0.898 | 0.956
Ours consis | 80 | K | | | 0.169 | 1.387 | 6.670 | 0.265 | 0.748 | 0.904 | 0.960
Garg et al. [4] | 50 | K | | √ | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962
Zhou et al. [5] | 50 | K + C | | | 0.173 | 1.151 | 4.990 | 0.250 | 0.751 | 0.915 | 0.969
Ours | 50 | K | | | 0.167 | 1.116 | 4.940 | 0.249 | 0.760 | 0.917 | 0.967
Ours consis | 50 | K | | | 0.162 | 1.039 | 4.851 | 0.244 | 0.767 | 0.920 | 0.969

K: KITTI; C: Cityscapes. For the error metrics (Abs Rel, Sq Rel, RMSE, RMSE log), lower is better; for the δ accuracy metrics, higher is better.
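For reference, the metrics in this table follow the standard evaluation protocol of Eigen et al. [20], where Cap is the maximum ground-truth depth in meters; for monocular methods, predictions are conventionally median-scaled to the ground truth first, since monocular depth is recovered only up to scale. A compact NumPy implementation of the four error metrics and the three δ accuracy thresholds:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular-depth metrics over valid ground-truth pixels
    (1-D arrays), after capping depths at 50 m or 80 m as in the table."""
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = float((thresh < 1.25).mean())                # δ < 1.25
    d2 = float((thresh < 1.25 ** 2).mean())           # δ < 1.25²
    d3 = float((thresh < 1.25 ** 3).mean())           # δ < 1.25³

    abs_rel = float(np.mean(np.abs(gt - pred) / gt))  # Abs Rel
    sq_rel = float(np.mean((gt - pred) ** 2 / gt))    # Sq Rel
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))  # RMSE
    rmse_log = float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)))  # RMSE log
    return abs_rel, sq_rel, rmse, rmse_log, d1, d2, d3
```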