An Unsupervised Monocular Visual Odometry Based on Multi-Scale Modeling
Abstract
1. Introduction
- We propose to use densely connected atrous convolutions to enlarge the receptive field in the VO task, so that the network can effectively capture multi-scale information (see the first sketch after this list).
- We propose to use the non-local self-attention mechanism to compute pixel-level pairwise relations and to model long-range dependencies, so that the network can make better use of the multi-scale information in the image (see the second sketch after this list).
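To make the first contribution concrete, below is a minimal PyTorch sketch of a densely connected atrous block in the spirit of DenseASPP (Yang et al.), not the authors' exact module: the branch width (64 channels) and the dilation rates (3, 6, 12, 18) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseAtrousBlock(nn.Module):
    """DenseASPP-style block: each atrous convolution consumes the input
    concatenated with all previous branch outputs, so dilation rates
    compose and the effective receptive field grows rapidly."""

    def __init__(self, in_ch: int, branch_ch: int = 64,
                 dilations=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ))
            ch += branch_ch  # the next branch sees all features so far

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        # output has in_ch + len(dilations) * branch_ch channels
        return torch.cat(feats, dim=1)
```

Because the dilated 3×3 kernels are stacked densely, the block covers many scales at once without pooling away spatial resolution, which is what lets an encoder capture multi-scale context for pose and depth.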
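For the second contribution, here is a minimal sketch of the embedded-Gaussian non-local block of Wang et al.; the channel-reduction factor of 2 and the residual connection follow the original non-local paper, while the block's exact placement in this VO network is not reproduced here.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: every spatial position attends
    to every other, yielding a pixel-level pairwise relation matrix and
    long-range dependencies that plain convolutions cannot reach."""

    def __init__(self, in_ch: int, reduction: int = 2):
        super().__init__()
        self.inter = in_ch // reduction
        self.theta = nn.Conv2d(in_ch, self.inter, 1)  # query embedding
        self.phi = nn.Conv2d(in_ch, self.inter, 1)    # key embedding
        self.g = nn.Conv2d(in_ch, self.inter, 1)      # value embedding
        self.out = nn.Conv2d(self.inter, in_ch, 1)    # restore channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # B x HW x C'
        k = self.phi(x).flatten(2)                    # B x C' x HW
        v = self.g(x).flatten(2).transpose(1, 2)      # B x HW x C'
        attn = torch.softmax(q @ k, dim=-1)           # B x HW x HW pairwise relation
        y = (attn @ v).transpose(1, 2).reshape(b, self.inter, h, w)
        return x + self.out(y)                        # residual keeps training stable
```

The B × HW × HW attention map is exactly the pixel-level pairwise relation referred to above; it is affordable at the low-resolution feature maps where such blocks are usually inserted, but its cost grows quadratically with the number of positions.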
2. Related Work
2.1. Supervised Methods
2.2. Unsupervised Methods
2.3. Atrous Convolution
2.4. Non-Local Self-Attention
3. Method
3.1. Overview
3.2. Densely Connected Atrous Convolution
3.3. Non-Local Self-Attention
3.4. Loss Function
4. Experiments
4.1. Implementation Details
4.2. Pose Estimation
4.3. Depth Estimation
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- DeSouza, G.N.; Kak, A.C. Vision for mobile robot navigation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 237–267. [Google Scholar] [CrossRef] [Green Version]
- Chen, C.; Seff, A.; Kornhauser, A.; Xiao, J. Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2722–2730. [Google Scholar]
- Azuma, R.T. A survey of augmented reality. Presence Teleoperators Virtual Environ. 1997, 6, 355–385. [Google Scholar] [CrossRef]
- Mur-Artal, R.; Montiel, J.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef] [Green Version]
- Engel, J.; Schops, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
- Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
- Wang, S.; Clark, R.; Wen, H.; Trigoni, N. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2043–2050. [Google Scholar]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1122–1131. [Google Scholar]
- Wang, R.; Pizer, S.M.; Frahm, J.M. Recurrent neural network for (un-) supervised learning of monocular video visual odometry and depth. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5555–5564. [Google Scholar]
- Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised Scale-consistent Depth Learning from Video. Int. J. Comput. Vis. 2021, 129, 1–17. [Google Scholar] [CrossRef]
- Shen, T.; Luo, Z.; Zhou, L.; Deng, H.; Zhang, R.; Fang, T.; Quan, L. Beyond photometric loss for self-supervised ego-motion estimation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6359–6365. [Google Scholar]
- Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 2938–2946. [Google Scholar]
- Pandey, T.; Pena, D.; Byrne, J.; Moloney, D. Leveraging deep learning for visual odometry using optical flow. Sensors 2021, 21, 1313. [Google Scholar] [CrossRef]
- Costante, G.; Mancini, M. Uncertainty Estimation for Data-Driven Visual Odometry. IEEE Trans. Robot. 2020, 36, 1738–1757. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Fischer, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 2758–2766. [Google Scholar]
- Almalioglu, Y.; Saputra, M.R.U.; de Gusmao, P.P.B.; Markham, A.; Trigoni, N. Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5474–5480. [Google Scholar]
- Li, S.; Xue, F.; Wang, X.; Yan, Z.; Zha, H. Sequential adversarial learning for self-supervised deep visual odometry. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2851–2860. [Google Scholar]
- Li, R.H.; Wang, S.; Long, Z.Q.; Gu, D.B. Undeepvo: Monocular visual odometry through unsupervised deep learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7286–7291. [Google Scholar]
- Yin, Z.C.; Shi, J.P. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992. [Google Scholar]
- Holschneider, M.; Kronland-Martinet, R.; Morlet, J.; Tchamitchian, P. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space; Tchamitchian, P., Ed.; Springer: Berlin/Heidelberg, Germany, 1989; pp. 289–297. [Google Scholar]
- Papandreou, G.; Kokkinos, I.; Savalle, P.A. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 390–399. [Google Scholar]
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Dai, J.; He, K.; Li, Y.; Ren, S.; Sun, J. Instance-sensitive fully convolutional networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 534–549. [Google Scholar]
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3684–3692. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. Stat 2017, 1050, 20. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4040–4048. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef] [Green Version]
- Li, S.; Wang, X.; Cao, Y.; Xue, F.; Yan, Z.; Zha, H. Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6339–6348. [Google Scholar]
- Zhan, H.; Garg, R.; Weerasekera, C.S.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 340–349. [Google Scholar]
- Li, Y.; Ushiku, Y.; Harada, T. Pose graph optimization for unsupervised monocular visual odometry. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 5439–5445. [Google Scholar]
- Li, X.; Hou, Y.; Wang, P.; Gao, Z.; Xu, M.; Li, W. Transformer guided geometry model for flow-based unsupervised visual odometry. Neural Comput. Appl. 2021, 33, 8031–8042. [Google Scholar] [CrossRef]
- Xue, F.; Wang, Q.; Wang, X.; Dong, W.; Wang, J.; Zha, H. Guided feature selection for deep visual odometry. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Springer: Cham, Switzerland, 2018; pp. 293–308. [Google Scholar]
- Kuo, X.Y.; Liu, C.; Lin, K.C.; Lee, C.Y. Dynamic Attention-based Visual Odometry. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24–30 October 2020; pp. 36–37. [Google Scholar]
- Gadipudi, N.; Elamvazuthi, I.; Lu, C.-K.; Paramasivam, S.; Su, S. WPO-Net: Windowed Pose Optimization Network for Monocular Visual Odometry Estimation. Sensors 2021, 21, 8155. [Google Scholar] [CrossRef] [PubMed]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar] [CrossRef]
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249. [Google Scholar]
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
- Pilzer, A.; Lathuiliere, S.; Sebe, N.; Ricci, E. Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Wong, A.; Soatto, S. Bilateral cyclic constraint and adaptive regularization for unsupervised monocular depth prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5644–5653. [Google Scholar]
Pose estimation results on KITTI odometry sequences 09 and 10. Following the standard KITTI protocol, t_err is the average translational drift (%) and r_err the average rotational drift (°/100 m). Type: Mono = monocular unsupervised, Ste = trained with stereo supervision, Su = supervised, Ge = geometry-based.

Method | Type | Seq. 09 t_err (%) | Seq. 09 r_err (°/100 m) | Seq. 10 t_err (%) | Seq. 10 r_err (°/100 m) |
---|---|---|---|---|---|
SfMLearner [8] | Mono | 19.15 | 6.82 | 40.40 | 17.69 |
GeoNet [19] | Mono | 28.72 | 9.8 | 23.90 | 9.0 |
Wang et al. [9] | Mono | 9.88 | 3.40 | 12.24 | 5.2 |
DeepMatchVO [11] | Mono | 9.91 | 3.8 | 12.18 | 3.9 |
SC-Depth [10] | Mono | 7.31 | 3.05 | 7.79 | 4.9 |
Depth-VO-Feat [41] | Ste | 11.89 | 3.6 | 12.82 | 3.41 |
Shunkai Li et al. [42] | Ste | 6.23 | 2.11 | 12.9 | 3.17 |
Xiangyu Li et al. [43] | Ste | 2.26 | 1.06 | 3.00 | 1.28 |
DeepVO [7] | Su | 5.96 | 6.12 | 9.77 | 10.20 |
WPO-Net [46] | Su | 8.19 | 3.02 | 8.95 | 3.12 |
Fei Xue et al. [44] | Su | 3.47 | 1.75 | 3.94 | 1.72 |
DAVO [45] | Su | 3.91 | 1.46 | 5.37 | 1.64 |
ORB-SLAM2-M (w/o LC) [4] | Ge | 9.67 | 0.3 | 4.04 | 0.3 |
ORB-SLAM2-M [4] | Ge | 3.22 | 0.4 | 4.25 | 0.3 |
Ours | Mono | 7.58 | 0.51 | 7.35 | 1.35 |
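For reference, the t_err/r_err columns above follow the KITTI odometry protocol: translational and rotational drift averaged over all subsequences of length 100–800 m. The following is a simplified numpy sketch of that computation, assuming lists of 4 × 4 camera-to-world pose matrices with predictions already scale-aligned to ground truth; the official KITTI devkit differs in details such as start-frame sampling.

```python
import numpy as np

def rotation_angle(R: np.ndarray) -> float:
    """Geodesic angle (radians) of a 3x3 rotation matrix."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))

def odometry_drift(gt, pred, lengths=(100, 200, 300, 400, 500, 600, 700, 800)):
    """Average translational (%) and rotational (deg/100 m) drift over
    subsequences of the given trajectory lengths."""
    # cumulative distance travelled along the ground-truth trajectory
    step = [np.linalg.norm(gt[i][:3, 3] - gt[i - 1][:3, 3]) for i in range(1, len(gt))]
    dist = np.concatenate([[0.0], np.cumsum(step)])
    t_errs, r_errs = [], []
    for first in range(0, len(gt), 10):           # sample start frames
        for length in lengths:
            last = int(np.searchsorted(dist, dist[first] + length))
            if last >= len(gt):
                break
            gt_rel = np.linalg.inv(gt[first]) @ gt[last]
            pr_rel = np.linalg.inv(pred[first]) @ pred[last]
            err = np.linalg.inv(gt_rel) @ pr_rel  # residual relative motion
            t_errs.append(np.linalg.norm(err[:3, 3]) / length * 100.0)
            r_errs.append(np.degrees(rotation_angle(err[:3, :3])) / length * 100.0)
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```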
Depth estimation results on KITTI. Error metrics (lower is better): AbsRel, SqRel, RMSE, RMSE log; accuracy metrics (higher is better): δ < 1.25, δ < 1.25², δ < 1.25³.

Method | Supervision | AbsRel | SqRel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
---|---|---|---|---|---|---|---|---|
Zhou et al. [8] | Mono | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |
GeoNet [19] | Mono | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |
CC [48] | Mono | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
SC-Depth [10] | Mono | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
Li et al. [18] | Ste | 0.183 | 1.73 | 6.57 | 0.268 | – | – | – |
Godard et al. [49] | Ste | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 |
Zhan et al. [41] | Ste | 0.144 | 1.391 | 5.869 | 0.241 | 0.803 | 0.928 | 0.969 |
Pilzer et al. [50] | Ste | 0.1424 | 1.2306 | 5.785 | 0.239 | 0.795 | 0.924 | 0.968 |
Wong et al. [51] | Ste | 0.135 | 1.157 | 5.556 | 0.234 | 0.820 | 0.932 | 0.968 |
Xiangyu Li et al. [43] | Ste | 0.135 | 1.234 | 5.624 | 0.233 | 0.823 | 0.932 | 0.968 |
Ours | Mono | 0.125 | 0.992 | 5.192 | 0.208 | 0.844 | 0.947 | 0.977 |
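The error and accuracy columns are the standard single-image depth metrics popularized by Eigen et al.; below is a minimal numpy sketch, assuming ground-truth and predicted depths have already been masked to valid pixels, capped to the usual range, and median-scaled for monocular methods.

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray) -> dict:
    """Standard monocular depth metrics on flat arrays of valid,
    scale-aligned depth values."""
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "AbsRel":     float(np.mean(np.abs(gt - pred) / gt)),
        "SqRel":      float(np.mean((gt - pred) ** 2 / gt)),
        "RMSE":       float(np.sqrt(np.mean((gt - pred) ** 2))),
        "RMSE log":   float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "d < 1.25":   float(np.mean(ratio < 1.25)),
        "d < 1.25^2": float(np.mean(ratio < 1.25 ** 2)),
        "d < 1.25^3": float(np.mean(ratio < 1.25 ** 3)),
    }
```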
Ablation study on KITTI odometry sequences 09 and 10 (t_err: %, r_err: °/100 m).

Method | Seq. 09 t_err (%) | Seq. 09 r_err (°/100 m) | Seq. 10 t_err (%) | Seq. 10 r_err (°/100 m) |
---|---|---|---|---|
baseline | 14.43 | 4.23 | 10.93 | 2.53 |
atrous | 9.14 | 1.15 | 10.17 | 1.54 |
attention | 7.67 | 0.69 | 8.16 | 1.55 |
baseline + atrous + self-attention | 7.58 | 0.51 | 7.35 | 1.35 |
Zhi, H.; Yin, C.; Li, H.; Pang, S. An Unsupervised Monocular Visual Odometry Based on Multi-Scale Modeling. Sensors 2022, 22, 5193. https://doi.org/10.3390/s22145193