Joint Unsupervised Learning of Depth, Pose, Ground Normal Vector and Ground Segmentation by a Monocular Camera Sensor
Abstract
1. Introduction
- An unsupervised learning framework is proposed that simultaneously estimates scene depth, ego-motion, the ground normal vector and ground segmentation.
- A joint learning process is proposed that uses heterogeneous loss functions to boost the flow of mutual information between the estimates of the different scene structures.
- Extensive comparison experiments and ablation studies on public datasets demonstrate the improvement of the proposed approach in estimating scene structures such as depth, ego-pose, ground segmentation and the ground normal vector.
2. Related Work
2.1. Depth and Ego-Motion Estimation
2.2. Unsupervised Semantic Segmentation
2.3. Ground Normal Vector Estimation
3. Proposed Method
3.1. Self-Supervised Ground Segmentation
3.2. SfM Framework
3.3. Joint Learning
3.3.1. Ground Self-Learning Loss
3.3.2. Plane Photometric Loss
3.3.3. Depth Abnormal Punishment Loss
3.4. Entire Learning and Inference Framework
4. Experiment
4.1. Dataset
4.2. Implementation Details
4.2.1. Training Configuration
4.2.2. Network Structure
4.2.3. Environment
4.3. Experimental Results
4.3.1. Depth Estimation Results
4.3.2. Ego-Pose Estimation Results
4.3.3. Ground Normal Vector Estimation Results
4.3.4. Unsupervised Ground Segmentation Results
4.4. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Appendix A. Homography Transformation
Appendix B. Calculation of the Horizon
Appendix C. Calculation of the Loss Based on SSIM
Appendix D. The Metrics Used for Depth Estimation
References
- Khan, S.M.; Shah, M. A multiview approach to tracking people in crowded scenes using a planar homography constraint. In European Conference on Computer Vision; Springer: Berlin, Germany, 2006; pp. 133–146.
- Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 424–432.
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8.
- Košecká, J.; Zhang, W. Extraction, matching, and pose recovery based on dominant rectangular structures. Comput. Vis. Image Underst. 2005, 100, 274–293.
- Forssén, P.E.; Lowe, D.G. Shape descriptors for maximally stable extremal regions. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8.
- Bian, J.; Lin, W.Y.; Matsushita, Y.; Yeung, S.K.; Nguyen, T.D.; Cheng, M.M. GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4181–4190.
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003.
- Bian, J.W.; Wu, Y.H.; Zhao, J.; Liu, Y.; Zhang, L.; Cheng, M.M.; Reid, I. An evaluation of feature matchers for fundamental matrix estimation. arXiv 2019, arXiv:1908.09474.
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858.
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5667–5675.
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992.
- Zou, Y.; Luo, Z.; Huang, J.B. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 36–53.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12240–12249.
- McDaniel, M.W.; Nishihata, T.; Brooks, C.A.; Iagnemma, K. Ground plane identification using LIDAR in forested environments. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–8 May 2010; pp. 3831–3836.
- Man, Y.; Weng, X.; Li, X.; Kitani, K. GroundNet: Monocular ground plane estimation with geometric consistency. In ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2018; pp. 2170–2178.
- Bansal, A.; Russell, B.; Gupta, A. Marr revisited: 2D-3D alignment via surface normal prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5965–5974.
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223.
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Rob. Res. 2013, 32, 1231–1237.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; pp. 2366–2374.
- Lin, G.; Liu, F.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1228–1242.
- Kuznietsov, Y.; Stuckler, J.; Leibe, B. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6647–6655.
- Yin, Z.; Darrell, T.; Yu, F. Hierarchical discrete distribution decomposition for match density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6044–6053.
- Tang, C.; Tan, P. BA-Net: Dense bundle adjustment network. arXiv 2018, arXiv:1806.04807.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279.
- Zhan, H.; Garg, R.; Saroj Weerasekera, C.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 340–349.
- Bian, J.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; pp. 35–45.
- Hu, W.; Miyato, T.; Tokui, S.; Matsumoto, E.; Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. arXiv 2017, arXiv:1702.08720.
- Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2018, arXiv:1808.06670.
- Yosinski, J.; Clune, J.; Nguyen, A.; Fuchs, T.J.; Lipson, H. Understanding neural networks through deep visualization. arXiv 2015, arXiv:1506.06579.
- Ji, X.; Henriques, J.F.; Vedaldi, A. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9865–9874.
- Hoiem, D.; Efros, A.A.; Hebert, M. Recovering surface layout from an image. Int. J. Comput. Vis. 2007, 75, 151–172.
- Ren, Z.; Jae Lee, Y. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 762–771.
- Chen, W.; Xiang, D.; Deng, J. Surface normals in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1557–1566.
- Zhang, Y.; Song, S.; Tan, P.; Xiao, J. PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 668–686.
- Stekovic, S.; Fraundorfer, F.; Lepetit, V. General 3D room layout from a single view by render-and-compare. arXiv 2020, arXiv:2001.02149.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Yang, Z.; Wang, P.; Xu, W.; Zhao, L.; Nevatia, R. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv 2017, arXiv:1711.03665.
- Wang, C.; Miguel Buenaposada, J.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2022–2030.
- Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse LiDAR data and single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2024–2039.
- Garg, R.; BG, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 740–756.
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Rob. 2015, 31, 1147–1163.
- Dragon, R.; Van Gool, L. Ground plane estimation using a hidden Markov model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 4026–4033.
Depth estimation results. Learning: S = supervised, SS = self-supervised, US = unsupervised. Datasets: K = KITTI, CS = Cityscapes; D = ground-truth depth, B = binocular stereo pairs, M = monocular video.

| Learning | Method | Datasets | AbsRel ↓ | SqRel ↓ | RMS ↓ | RMSlog ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| S | Eigen et al. [20] | K (D) | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |
| S | Liu et al. [41] | K (D) | 0.202 | 1.161 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |
| S | Kuznietsov et al. [22] | K (B + D) | 0.113 | 0.741 | 4.621 | 0.189 | 0.862 | 0.960 | 0.986 |
| SS | Garg et al. [42] | K (B) | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
| SS | Zhan et al. [26] | K (B) | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.928 | 0.969 |
| SS | Godard et al. [25] | K (B) | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 |
| SS | Godard et al. [25] | CS + K (B) | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973 |
| US | Zhou et al. [9] | K (M) | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
| US | Yang et al. [38] | K (M) | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| US | Mahjourian et al. [10] | K (M) | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| US | Wang et al. [39] | K (M) | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| US | GeoNet-VGG [11] | K (M) | 0.164 | 1.303 | 6.090 | 0.247 | 0.765 | 0.919 | 0.968 |
| US | GeoNet-ResNet [11] | K (M) | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
| US | DF-Net [12] | K (M) | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| US | CC [13] | K (M) | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
| US | SC-SfMLearner [27] | K (M) | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
| US | Ours | K (M) | 0.135 | 1.006 | 5.336 | 0.212 | 0.833 | 0.944 | 0.977 |
| US | Zhou et al. [9] | CS + K (M) | 0.198 | 1.836 | 6.565 | 0.275 | 0.718 | 0.901 | 0.960 |
| US | Yang et al. [38] | CS + K (M) | 0.165 | 1.360 | 6.641 | 0.248 | 0.750 | 0.914 | 0.969 |
| US | Mahjourian et al. [10] | CS + K (M) | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970 |
| US | Wang et al. [39] | CS + K (M) | 0.148 | 1.187 | 5.496 | 0.226 | 0.812 | 0.938 | 0.975 |
| US | GeoNet-ResNet [11] | CS + K (M) | 0.153 | 1.328 | 5.737 | 0.232 | 0.802 | 0.934 | 0.972 |
| US | DF-Net [12] | CS + K (M) | 0.146 | 1.182 | 5.215 | 0.213 | 0.818 | 0.943 | 0.978 |
| US | CC [13] | CS + K (M) | 0.139 | 1.032 | 5.199 | 0.213 | 0.827 | 0.943 | 0.977 |
| US | SC-SfMLearner [27] | CS + K (M) | 0.128 | 1.047 | 5.234 | 0.208 | 0.846 | 0.947 | 0.976 |
| US | Ours | CS + K (M) | 0.126 | 0.943 | 5.084 | 0.203 | 0.849 | 0.949 | 0.978 |
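The error and accuracy columns are the standard single-image depth metrics (see Appendix D). As a reference for how these columns are computed, here is a minimal NumPy sketch; the function name is illustrative, and it assumes predictions have already been median-scaled to the ground truth and masked to valid, range-clipped pixels:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics over valid pixels.

    gt, pred: 1-D arrays of positive depths, with pred already
    median-scaled to gt and both clipped to the evaluation range.
    """
    thresh = np.maximum(gt / pred, pred / gt)
    d1 = (thresh < 1.25).mean()         # delta < 1.25
    d2 = (thresh < 1.25 ** 2).mean()    # delta < 1.25^2
    d3 = (thresh < 1.25 ** 3).mean()    # delta < 1.25^3

    abs_rel = np.mean(np.abs(gt - pred) / gt)                      # AbsRel
    sq_rel = np.mean((gt - pred) ** 2 / gt)                        # SqRel
    rms = np.sqrt(np.mean((gt - pred) ** 2))                       # RMS
    rms_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))   # RMSlog
    return abs_rel, sq_rel, rms, rms_log, d1, d2, d3
```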
| Methods | Test-1 (Seq. 09) t_err (%) | Test-1 (Seq. 09) r_err (°/100 m) | Test-2 (Seq. 10) t_err (%) | Test-2 (Seq. 10) r_err (°/100 m) |
|---|---|---|---|---|
| ORB-SLAM [43] | 15.30 | 0.26 | 3.68 | 0.48 |
| Zhou et al. [9] | 17.84 | 6.78 | 37.91 | 17.78 |
| Zhan et al. [26] | 11.93 | 3.91 | 12.45 | 3.46 |
| SC-SfMLearner [27] | 11.2 | 3.35 | 10.1 | 4.96 |
| Ours | 9.36 | 2.61 | 10.25 | 3.84 |
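Per the KITTI odometry convention, t_err is the average translational drift (%) and r_err the average rotational drift (°/100 m), both computed over sub-trajectories of 100–800 m. The sketch below approximates that protocol; names are illustrative, it assumes 4 × 4 camera-to-world pose matrices with predictions already scale-aligned to the ground truth (necessary for monocular methods), and it simplifies the official devkit's frame sampling:

```python
import numpy as np

def path_lengths(poses):
    """Cumulative path length (m) along a trajectory of 4x4 poses."""
    d = np.zeros(len(poses))
    for i in range(1, len(poses)):
        d[i] = d[i - 1] + np.linalg.norm(poses[i][:3, 3] - poses[i - 1][:3, 3])
    return d

def kitti_odom_errors(gt, pred, lengths=(100, 200, 300, 400, 500, 600, 700, 800)):
    """Average translational (%) and rotational (deg/100 m) drift over
    fixed-length sub-trajectories, following the KITTI odometry protocol."""
    dist = path_lengths(gt)
    t_errs, r_errs = [], []
    for first in range(0, len(gt), 10):            # sample every 10th start frame
        for seg in lengths:
            last = int(np.searchsorted(dist, dist[first] + seg))
            if last >= len(gt):
                break                               # trajectory too short for longer segments
            # relative motion over the segment, in ground truth and prediction
            dg = np.linalg.inv(gt[first]) @ gt[last]
            dp = np.linalg.inv(pred[first]) @ pred[last]
            err = np.linalg.inv(dg) @ dp            # residual pose error
            t_errs.append(np.linalg.norm(err[:3, 3]) / seg * 100.0)    # percent
            cos = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
            r_errs.append(np.degrees(np.arccos(cos)) / seg * 100.0)    # deg / 100 m
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```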
| Methods | Error (deg) ↓ |
|---|---|
| GroundNet [15] (supervised) | 0.70 |
| HMM [44] (unsupervised) | 4.10 |
| Ours (K) (unsupervised) | 3.23 |
| Ours (CS + K) (unsupervised) | 3.02 |
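The reported error is the angle between the estimated and the ground-truth ground normal vector. A minimal sketch of that computation (function name illustrative):

```python
import numpy as np

def normal_error_deg(n_pred, n_gt):
    """Angle (degrees) between estimated and ground-truth ground normals."""
    n_pred = np.asarray(n_pred, dtype=float)
    n_gt = np.asarray(n_gt, dtype=float)
    cos = np.dot(n_pred, n_gt) / (np.linalg.norm(n_pred) * np.linalg.norm(n_gt))
    # abs() treats n and -n as the same plane; drop it if the
    # orientation convention (e.g., normal pointing up) is fixed.
    return float(np.degrees(np.arccos(np.clip(abs(cos), 0.0, 1.0))))
```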
| Methods | IoU ↑ |
|---|---|
| Supervised | 0.74 |
| Unsupervised | 0.83 |
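IoU (intersection-over-union) is measured between the predicted and the ground-truth binary ground masks. A minimal sketch, assuming boolean NumPy masks:

```python
import numpy as np

def ground_iou(pred_mask, gt_mask):
    """Intersection-over-union between two binary ground masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return float(inter) / float(union)
```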
Ablation results for depth estimation (GSFL = ground self-learning loss, PPL = plane photometric loss, APL = depth abnormal punishment loss; cf. Section 3.3).

| Methods | Datasets | AbsRel ↓ | SqRel ↓ | RMS ↓ | RMSlog ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|
| Basic | K | 0.137 | 1.091 | 5.441 | 0.217 | 0.830 | 0.942 | 0.975 |
| Basic + GSFL + PPL | K | 0.136 | 1.103 | 5.417 | 0.215 | 0.835 | 0.944 | 0.976 |
| Basic + GSFL + PPL + APL | K | 0.135 | 1.006 | 5.336 | 0.212 | 0.833 | 0.944 | 0.977 |
| Methods | Test-1 (Seq. 09) t_err (%) | Test-1 (Seq. 09) r_err (°/100 m) | Test-2 (Seq. 10) t_err (%) | Test-2 (Seq. 10) r_err (°/100 m) |
|---|---|---|---|---|
| Basic (K) | 11.24 | 3.34 | 10.07 | 4.91 |
| Basic + GSFL + PPL (K) | 9.34 | 2.63 | 9.51 | 3.97 |
| Basic + GSFL + PPL + APL (K) | 9.36 | 2.61 | 10.14 | 3.84 |
| Methods | IoU ↑ |
|---|---|
| Basic | 0.48 |
| Basic + GSFL | 0.83 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Xiong, L.; Wen, Y.; Huang, Y.; Zhao, J.; Tian, W. Joint Unsupervised Learning of Depth, Pose, Ground Normal Vector and Ground Segmentation by a Monocular Camera Sensor. Sensors 2020, 20, 3737. https://doi.org/10.3390/s20133737