StereoVO: Learning Stereo Visual Odometry Approach Based on Optical Flow and Depth Information
Abstract
1. Introduction
2. Related Work
2.1. Methods Based on Geometry
2.2. Methods Based on Learning
3. Approach
3.1. The Structure of StereoVO
3.2. SSIM Cost Volume
3.3. Loss Function
4. Experimental Results
4.1. Dataset and Metrics
4.2. Implementation Details
4.3. Comparison with Other Methods
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Barros, A.M.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A comprehensive survey of visual SLAM algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
- Aslan, M.F.; Durdu, A.; Yusefi, A.; Sabanci, K.; Sungur, C. A tutorial: Mobile robotics, SLAM, Bayesian filter, keyframe bundle adjustment and ROS applications. In Robot Operating System (ROS): The Complete Reference (Volume 6); Springer: Cham, Switzerland, 2021; pp. 227–269. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Song, S.; Chandraker, M.; Guest, C.C. High Accuracy Monocular SFM and Scale Correction for Autonomous Driving. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 730–743. [Google Scholar] [CrossRef] [PubMed]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Xue, F.; Wang, Q.; Wang, X.; Dong, W.; Zha, H. Guided Feature Selection for Deep Visual Odometry. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018. [Google Scholar]
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018. [Google Scholar]
- Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Costante, G.; Mancini, M.; Valigi, P.; Ciarfuglia, T.A. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. IEEE Robot. Autom. Lett. 2016, 1, 18–25. [Google Scholar] [CrossRef]
- Muller, P.; Savakis, A. Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, CA, USA, 27–29 March 2017. [Google Scholar]
- Saputra, M.; Gusmao, P.D.; Wang, S.; Markham, A.; Trigoni, N. Learning Monocular Visual Odometry through Geometry-Aware Curriculum Learning. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
- Ganti, P.; Waslander, S. Network Uncertainty Informed Semantic Feature Selection for Visual SLAM. In Proceedings of the Conference on Computer and Robot Vision, Kingston, ON, Canada, 28–31 May 2019. [Google Scholar]
- Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Almalioglu, Y.; Saputra, M.; Gusmão, P.; Markham, A.; Trigoni, N. GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019. [Google Scholar]
- Madhu Babu, V.; Majumder, A.; Das, K.; Kumar, S. UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018. [Google Scholar]
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Zhan, H.; Garg, R.; Weerasekera, C.S.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Mur-Artal, R.; Tardos, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
- Amiri, A.J.; Loo, S.Y.; Zhang, H. Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, Dali, China, 6–8 December 2019. [Google Scholar]
- Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1052–1067. [Google Scholar] [CrossRef] [PubMed]
- Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust visual inertial odometry using a direct EKF-based approach. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015. [Google Scholar]
- Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007. [Google Scholar]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
- Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May–7 June 2014. [Google Scholar]
- Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
- Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017. [Google Scholar]
- Yang, N.; Von Stumberg, L.; Wang, R.; Cremers, D. D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020. [Google Scholar]
- Jiang, Z.; Taira, H.; Miyashita, N.; Okutomi, M. Self-Supervised Ego-Motion Estimation Based on Multi-Layer Fusion of RGB and Inferred Depth. In Proceedings of the IEEE International Conference on Robotics and Automation, Philadelphia, PA, USA, 23–27 May 2022. [Google Scholar]
- Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In Proceedings of the European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004. [Google Scholar]
- Fischer, P.; Dosovitskiy, A.; Ilg, E.; Häusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Wang, W.; Hu, Y.; Scherer, S. TartanVO: A Generalizable Learning-based VO. In Proceedings of the Conference on Robot Learning, Virtual, 16–18 November 2020. [Google Scholar]
- Wang, W.; Zhu, D.; Wang, X.; Hu, Y.; Qiu, Y.; Wang, C.; Hu, Y.; Kapoor, A.; Scherer, S. TartanAir: A Dataset to Push the Limits of Visual SLAM. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, NV, USA, 24 October 2020–24 January 2021. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
- Aslan, M.F.; Durdu, A.; Sabanci, K. Visual-inertial image-odometry network (VIIONet): A Gaussian process regression-based deep architecture proposal for UAV pose estimation. Measurement 2022, 194, 111030. [Google Scholar] [CrossRef]
- Han, L.; Lin, Y.; Du, G.; Lian, S. DeepVIO: Self-supervised deep learning of monocular visual inertial odometry using 3D geometric constraints. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 4–8 November 2019. [Google Scholar]
- Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
- Umeyama, S. Least-squares Estimation of Transformation Parameters between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
| Block | Layer | Kernel (k) | Stride (s) | Padding (p) | Channels (in/out) |
|---|---|---|---|---|---|
| Conv Block1 | Conv1_1 | 7 | 2 | 3 | 3/16 |
| | Conv1_2 | 3 | 1 | 1 | 48/16 |
| | Conv Redir1 | 7 | 2 | 3 | 3/2 |
| Conv Block2 | Conv2_1 | 5 | 3 | 2 | 18/32 |
| | Conv2_2 | 3 | 1 | 1 | 96/32 |
| | Conv Redir2 | 5 | 2 | 2 | 18/2 |
| Conv Block3 | Conv3_1 | 5 | 2 | 2 | 34/64 |
| | Conv3_2 | 3 | 1 | 1 | 192/64 |
| | Conv Redir3 | 5 | 2 | 2 | 34/2 |
| Conv Block4 | Conv4_1 | 3 | 2 | 1 | 157/256 |
| | Conv4_2 | 3 | 1 | 1 | 471/256 |
| Conv Block5 | Conv5_1 | 3 | 2 | 1 | 256/512 |
| | Conv5_2 | 3 | 1 | 1 | 768/512 |
| | Conv Redir5 | 3 | 2 | 1 | 256/2 |
| Conv Block6 | Conv6_1 | 3 | 2 | 1 | 514/1024 |
| | Conv6_2 | 3 | 1 | 1 | 1542/1024 |
| | Conv Redir6 | 3 | 2 | 1 | 514/2 |
| FC | FC1 | - | - | - | 4104/1024 |
| | FC2 | - | - | - | 1024/6 |
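To make the table concrete, the following is a minimal PyTorch sketch of how its rows translate into layers. It is an illustration under stated assumptions, not the authors' released implementation: the LeakyReLU activation, the absence of normalization layers, and the variable names are assumptions, and the inter-block wiring that produces the enlarged input channel counts (e.g., the 48 input channels of Conv1_2, presumably a concatenation with cost-volume or redirect features) follows the paper's architecture figure, which is not reproduced here.

```python
import torch.nn as nn

def conv(in_ch, out_ch, k, s, p):
    # One table row -> Conv2d(kernel_size=k, stride=s, padding=p) + LeakyReLU (activation assumed).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=p),
        nn.LeakyReLU(0.1, inplace=True),
    )

# Conv Block1, transcribed row by row from the table:
conv1_1 = conv(3, 16, k=7, s=2, p=3)      # 7x7, stride 2, pad 3, 3 -> 16 channels
conv1_2 = conv(48, 16, k=3, s=1, p=1)     # 48 input channels imply a concatenation
                                          # (e.g., with cost-volume features) before this layer
conv_redir1 = conv(3, 2, k=7, s=2, p=3)   # low-dimensional "redirect" branch, 3 -> 2 channels

# Pose head from the FC rows: 4104 -> 1024 -> 6 (a 6-DoF relative pose vector).
pose_head = nn.Sequential(
    nn.Linear(4104, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 6),
)
```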
| Phase | Sequences | Frames per sequence | Total frames |
|---|---|---|---|
| Train | 00, 01, 02, 03, 04, 05, 06, 08 | 4541, 1101, 4661, 801, 271, 2761, 1101, 4071 | 19,308 |
| Test | 07, 09, 10 | 1101, 1591, 1201 | 3893 |
| Type | Method | Seq. 07 Trel (%) | Seq. 07 Rrel (deg/100 m) | Seq. 09 Trel (%) | Seq. 09 Rrel (deg/100 m) | Seq. 10 Trel (%) | Seq. 10 Rrel (deg/100 m) | Avg. Trel (%) | Avg. Rrel (deg/100 m) |
|---|---|---|---|---|---|---|---|---|---|
| MG | VISO2-M [4] | 23.61 | 19.11 | 4.04 | 1.43 | 25.20 | 3.80 | 17.62 | 8.11 |
| MG | ORB-SLAM2-M [18] | 10.96 | 0.37 | 15.30 | 0.26 | 3.71 | 0.30 | 9.99 | 0.31 |
| StG | ORB-SLAM2-S [18] | 0.59 | 0.20 | 1.80 | 0.18 | 1.67 | 0.37 | 1.35 | 0.25 |
| StG | ORB-SLAM3-S [23] | 0.62 | 0.21 | 1.16 | 0.19 | 1.02 | 0.34 | 1.24 | 0.25 |
| MU | GeoNet [8] | - | - | 30.57 | 9.89 | 45.14 | 10.37 | 37.86 | 8.17 |
| MU | D3VO [28] | - | - | 10.40 | 0.75 | 14.24 | 1.21 | 12.32 | 0.98 |
| MU | MLF-VO [29] | 6.11 | 2.59 | 3.64 | 1.33 | 6.39 | 1.14 | 5.38 | 1.68 |
| MS | DeepVO [27] | 3.91 | 4.60 | - | - | 8.11 | 8.83 | 6.01 | 6.72 |
| MS | PCNN [9] | - | - | 6.42 | 2.52 | 19.70 | 3.62 | 13.06 | 7.79 |
| MS | Flowdometry [10] | - | - | 12.64 | 8.04 | 11.65 | 7.28 | 12.15 | 7.66 |
| MS | TartanVO [32] | 8.51 | 4.55 | 7.24 | 2.99 | 9.20 | 2.80 | 8.32 | 3.45 |
| StS | StereoVO (ours) | 3.42 | 1.87 | 5.11 | 1.39 | 5.78 | 1.62 | 4.77 | 1.63 |

Trel: average relative translation error; Rrel: average relative rotation error. Method types: MG, monocular geometry-based; StG, stereo geometry-based; MU, monocular unsupervised learning; MS, monocular supervised learning; StS, stereo supervised learning.
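Trel and Rrel are the standard KITTI odometry drift metrics: average translation error as a percentage of distance traveled, and average rotation error in degrees per 100 m, computed over subsequences of the trajectory. The sketch below is a simplified illustration that evaluates a single 100 m subsequence length; the official KITTI benchmark averages over lengths of 100–800 m, and the function names here (trajectory_distances, relative_errors) are hypothetical.

```python
import numpy as np

def trajectory_distances(poses):
    # Cumulative path length along the ground-truth trajectory (poses: list of 4x4 matrices).
    dist = [0.0]
    for i in range(1, len(poses)):
        dist.append(dist[-1] + np.linalg.norm(poses[i][:3, 3] - poses[i - 1][:3, 3]))
    return np.asarray(dist)

def relative_errors(gt, est, length=100.0):
    # Translation drift (%) and rotation drift (deg/100 m) over ~`length`-meter subsequences.
    dist = trajectory_distances(gt)
    t_errs, r_errs = [], []
    for i in range(len(gt)):
        j = int(np.searchsorted(dist, dist[i] + length))  # first frame ~`length` m ahead of i
        if j >= len(gt):
            break
        d_gt = np.linalg.inv(gt[i]) @ gt[j]     # ground-truth relative motion
        d_est = np.linalg.inv(est[i]) @ est[j]  # estimated relative motion
        err = np.linalg.inv(d_gt) @ d_est       # residual transform between the two
        t_errs.append(np.linalg.norm(err[:3, 3]) / length * 100.0)  # percent of distance
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        r_errs.append(np.degrees(angle) / length * 100.0)           # degrees per 100 m
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```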
| Method | RMSE | Mean | Std. |
|---|---|---|---|
| MLF-VO | 0.105 | 0.078 | 0.068 |
| StereoVO | 0.089 | 0.070 | 0.055 |
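These statistics are trajectory error magnitudes computed after aligning the estimated trajectory to the ground truth, for which the paper cites the Umeyama least-squares method [64]. The sketch below is a minimal illustration of that alignment followed by the RMSE/mean/std computation; the helper names are hypothetical, and for stereo methods with metric scale the estimated scale factor should come out close to 1.

```python
import numpy as np

def umeyama_alignment(src, dst):
    # Least-squares similarity transform (R, t, s) mapping src onto dst (both N x 3), after [64].
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)   # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = ((src - mu_s) ** 2).sum(axis=1).mean()   # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return R, t, s

def ate_stats(gt, est):
    # RMSE / mean / std of absolute trajectory error after alignment (gt, est: N x 3 positions).
    R, t, s = umeyama_alignment(est, gt)
    aligned = s * (R @ est.T).T + t
    err = np.linalg.norm(gt - aligned, axis=1)
    return np.sqrt((err ** 2).mean()), err.mean(), err.std()
```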
Duan, C.; Junginger, S.; Thurow, K.; Liu, H. StereoVO: Learning Stereo Visual Odometry Approach Based on Optical Flow and Depth Information. Appl. Sci. 2023, 13, 5842. https://doi.org/10.3390/app13105842