Uncertainty-Aware Depth Network for Visual Inertial Odometry of Mobile Robots
Abstract
1. Introduction
- We design UD-Net, a straightforward DNN architecture with a shared encoder–decoder structure that estimates both the depth of each pixel in an RGB image and the uncertainty of that depth estimate. In contrast to existing research on uncertainty estimation, our formulation of depth-estimation uncertainty allows the network to directly learn the regions where errors are likely to occur during depth estimation. To train UD-Net, we introduce a depth loss based on the estimated uncertainty and an uncertainty loss based on the estimated depth (an illustrative sketch of such a loss pair is given after this list).
- We integrate UD-Net with the feature-based VIO algorithm [9] to propose a novel VIO algorithm that is robust to the unavoidable errors of the depth network.
- Using the public KITTI dataset, we demonstrate the improved depth estimation performance achieved by the proposed pipeline. We also acquired and processed an underground parking lot dataset to show that our approach not only improves depth estimation but also enhances VIO performance.
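As a rough illustration of how a depth loss weighted by predicted uncertainty and an uncertainty loss supervised by the depth error can interact, the sketch below follows the common heteroscedastic formulation in the spirit of Kendall and Gal [22]. The tensor names, clamping, and loss weighting are assumptions for illustration only; this is not the exact loss used to train UD-Net.

```python
import torch
import torch.nn.functional as F

def depth_and_uncertainty_loss(pred_depth, pred_uncert, gt_depth, valid_mask):
    """Illustrative loss pair: a depth term attenuated by the predicted
    uncertainty and an uncertainty term supervised by the (detached) depth
    error. A generic sketch, not the loss actually used for UD-Net."""
    # Depth loss: L1 error down-weighted where the network predicts high
    # uncertainty, with a log penalty that discourages predicting large
    # uncertainty everywhere.
    abs_err = torch.abs(pred_depth - gt_depth)[valid_mask]
    sigma = pred_uncert[valid_mask].clamp(min=1e-3)
    depth_loss = (abs_err / sigma + torch.log(sigma)).mean()

    # Uncertainty loss: regress the predicted uncertainty toward the observed
    # depth error, so the network learns where its own estimates tend to fail.
    uncert_loss = F.l1_loss(sigma, abs_err.detach())

    return depth_loss + 0.5 * uncert_loss  # 0.5 is an arbitrary example weight
```

In this kind of formulation, the two terms are complementary: the uncertainty gates how strongly each pixel's depth error contributes, while the observed depth error supervises the uncertainty head.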
2. Related Work
3. Methodology
3.1. Uncertainty-Aware Depth Network
3.2. Visual Inertial Odometry Based on UD-Net
4. Experimental Results
4.1. Experiment Setting and Evaluation Measures
4.2. Experimental Results on the KITTI Dataset
4.3. Experimental Results on the Underground Parking Lot Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Deng, W.; Huang, K.; Chen, X.; Zhou, Z.; Shi, C.; Guo, R.; Zhang, H. Semantic RGB-D SLAM for rescue robot navigation. IEEE Access 2020, 8, 221320–221329. [Google Scholar] [CrossRef]
- Hong, S.; Bangunharcana, A.; Park, J.M.; Choi, M.; Shin, H.S. Visual SLAM-based robotic mapping method for planetary construction. Sensors 2021, 21, 7715. [Google Scholar] [CrossRef] [PubMed]
- Guo, B.; Guo, N.; Cen, Z. Obstacle avoidance with dynamic avoidance risk region for mobile robots in dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 5850–5857. [Google Scholar] [CrossRef]
- Ab Wahab, M.N.; Nefti-Meziani, S.; Atyabi, A. A comparative review on mobile robot path planning: Classical or meta-heuristic methods? Annu. Rev. Control 2020, 50, 233–252. [Google Scholar] [CrossRef]
- Munoz-Salinas, R.; Medina-Carnicer, R. UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers. Pattern Recognit. 2020, 101, 107193. [Google Scholar] [CrossRef]
- Motroni, A.; Buffi, A.; Nepa, P. A survey on indoor vehicle localization through RFID technology. IEEE Access 2021, 9, 17921–17942. [Google Scholar] [CrossRef]
- Kiss-Illés, D.; Barrado, C.; Salamí, E. GPS-SLAM: An augmentation of the ORB-SLAM algorithm. Sensors 2019, 19, 4973. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Tyagi, A.; Liang, Y.; Wang, S.; Bai, D. DVIO: Depth-aided visual inertial odometry for RGBD sensors. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bari, Italy, 4–8 October 2021; pp. 193–201. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar] [CrossRef]
- Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2019, arXiv:1907.10326. [Google Scholar]
- Yuan, W.; Gu, X.; Dai, Z.; Zhu, S.; Tan, P. NeWCRFs: Neural Window Fully-connected CRFs for Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4009–4018. [Google Scholar]
- Cong, P.; Li, J.; Liu, J.; Xiao, Y.; Zhang, X. SEG-SLAM: Dynamic Indoor RGB-D Visual SLAM Integrating Geometric and YOLOv5-Based Semantic Information. Sensors 2024, 24, 2102. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Jin, Y.; Yu, L.; Chen, Z.; Fei, S. A mono SLAM method based on depth estimation by DenseNet-CNN. IEEE Sens. J. 2021, 22, 2447–2455. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Li, Z.; Yu, L.; Pan, Z. A monocular SLAM system based on ResNet depth estimation. IEEE Sens. J. 2023, 23, 15106–15114. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Panetta, K.; Bao, L.; Agaian, S. Sequence-to-sequence similarity-based filter for image denoising. IEEE Sens. J. 2016, 16, 4380–4388. [Google Scholar] [CrossRef]
- Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Hornauer, J.; Belagiannis, V. Gradient-based uncertainty for monocular depth estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 613–630. [Google Scholar]
- Poggi, M.; Aleotti, F.; Tosi, F.; Mattoccia, S. On the uncertainty of self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3227–3237. [Google Scholar]
- MacKay, D.J. A practical Bayesian framework for backpropagation networks. Neural Comput. 1992, 4, 448–472. [Google Scholar] [CrossRef]
- Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Huang, G.; Li, Y.; Pleiss, G.; Liu, Z.; Hopcroft, J.E.; Weinberger, K.Q. Snapshot ensembles: Train 1, get M for free. arXiv 2017, arXiv:1704.00109. [Google Scholar]
- Chen, L.; Tang, W.; Wan, T.R.; John, N.W. Self-supervised monocular image depth learning and confidence estimation. Neurocomputing 2020, 381, 272–281. [Google Scholar] [CrossRef]
- Nix, D.A.; Weigend, A.S. Estimating the mean and variance of the target probability distribution. In Proceedings of the IEEE International Conference on Neural Networks (ICNN), Orlando, FL, USA, 28 June–2 July 1994; Volume 1, pp. 55–60. [Google Scholar]
- Pilzer, A.; Lathuiliere, S.; Sebe, N.; Ricci, E. Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9768–9777. [Google Scholar]
- Neal, R.M. Bayesian Learning for Neural Networks; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 118. [Google Scholar]
- Eldesokey, A.; Felsberg, M.; Holmquist, K.; Persson, M. Uncertainty-aware CNNs for depth completion: Uncertainty from beginning to end. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12014–12023. [Google Scholar]
- Su, W.; Xu, Q.; Tao, W. Uncertainty guided multi-view stereo network for depth estimation. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7796–7808. [Google Scholar] [CrossRef]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Gao, X.S.; Hou, X.R.; Tang, J.; Cheng, H.F. Complete solution classification for the perspective-three-point problem. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 930–943. [Google Scholar]
- Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
- Penate-Sanchez, A.; Andrade-Cetto, J.; Moreno-Noguer, F. Exhaustive linearization for robust camera pose and focal length estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2387–2400. [Google Scholar] [CrossRef]
- Shi, J.; Tomasi, C. Good features to track. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar]
- Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, BC, Canada, 24–28 August 1981; Volume 2, pp. 674–679. [Google Scholar]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Civera, J.; Davison, A.J.; Montiel, J.M. Inverse depth parametrization for monocular SLAM. IEEE Trans. Robot. 2008, 24, 932–945. [Google Scholar] [CrossRef]
- Meagher, D. Geometric modeling using octree encoding. Comput. Graph. Image Process. 1982, 19, 129–147. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
- Son, E.; Choi, J.; Song, J.; Jin, Y.; Lee, S.J. Monocular Depth Estimation from a Fisheye Camera Based on Knowledge Distillation. Sensors 2023, 23, 9866. [Google Scholar] [CrossRef]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 746–760. [Google Scholar]
- Bai, C.; Xiao, T.; Chen, Y.; Wang, H.; Zhang, F.; Gao, X. Faster-LIO: Lightweight tightly coupled LiDAR-inertial odometry using parallel sparse incremental voxels. IEEE Robot. Autom. Lett. 2022, 7, 4861–4868. [Google Scholar] [CrossRef]
- Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 573–580. [Google Scholar]
- Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant CNNs. In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20. [Google Scholar]
- Tsai, D.; Worrall, S.; Shan, M.; Lohr, A.; Nebot, E. Optimising the selection of samples for robust lidar camera calibration. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2631–2638. [Google Scholar]
| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSEi ↓ | SIlog ↓ | log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Original KITTI [8] | | | | | | | | | |
| BTS [11] | 0.084 | 0.563 | 4.096 | 0.176 | 16.624 | 0.040 | 0.905 | 0.965 | 0.983 |
| NewCRFs [12] | 0.117 | 0.786 | 4.750 | 0.208 | 19.596 | 0.054 | 0.845 | 0.946 | 0.977 |
| AdaBins [13] | 0.102 | 0.636 | 4.102 | 0.186 | 17.718 | 0.046 | 0.879 | 0.958 | 0.982 |
| UD-Net † | 0.085 | 0.547 | 4.037 | 0.173 | 16.425 | 0.040 | 0.905 | 0.967 | 0.983 |
| UD-Net (ours) | 0.061 | 0.276 | 2.674 | 0.127 | 11.980 | 0.028 | 0.945 | 0.982 | 0.991 |
| Improved KITTI [51] | | | | | | | | | |
| BTS [11] | 0.060 | 0.255 | 2.821 | 0.097 | 8.967 | 0.027 | 0.954 | 0.992 | 0.998 |
| NewCRFs [12] | 0.090 | 0.462 | 3.744 | 0.140 | 12.782 | 0.040 | 0.901 | 0.979 | 0.995 |
| AdaBins [13] | 0.074 | 0.336 | 2.939 | 0.112 | 10.337 | 0.031 | 0.937 | 0.988 | 0.997 |
| UD-Net † | 0.061 | 0.250 | 2.784 | 0.097 | 8.960 | 0.027 | 0.954 | 0.993 | 0.998 |
| UD-Net (ours) | 0.046 | 0.126 | 1.816 | 0.072 | 6.506 | 0.020 | 0.976 | 0.997 | 0.999 |
| Method | AbsRel ↓ | SqRel ↓ | RMSE ↓ | RMSEi ↓ | SIlog ↓ | log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| NewCRFs [12] | 0.094 | 0.377 | 1.797 | 0.143 | 13.914 | 0.035 | 0.913 | 0.977 | 0.993 |
| AdaBins [13] | 0.082 | 0.338 | 1.827 | 0.137 | 13.243 | 0.032 | 0.928 | 0.978 | 0.992 |
| BTS [11] | 0.102 | 0.444 | 1.972 | 0.155 | 14.497 | 0.038 | 0.908 | 0.973 | 0.991 |
| UD-Net † | 0.101 | 0.436 | 1.944 | 0.153 | 14.273 | 0.038 | 0.910 | 0.973 | 0.991 |
| UD-Net (ours) | 0.086 | 0.359 | 1.765 | 0.135 | 12.485 | 0.032 | 0.937 | 0.981 | 0.992 |
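The error and accuracy metrics in the two tables above follow the standard monocular depth evaluation protocol. For reference, the sketch below computes them with one common set of definitions; the masking, the interpretation of RMSEi as log-scale RMSE, and the SIlog scaling are assumptions here and may differ in detail from the evaluation code used for the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid ground-truth pixels."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]

    abs_rel = np.mean(np.abs(pred - gt) / gt)                       # AbsRel
    sq_rel = np.mean((pred - gt) ** 2 / gt)                         # SqRel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                       # RMSE
    log_diff = np.log(pred) - np.log(gt)
    rmse_log = np.sqrt(np.mean(log_diff ** 2))                      # RMSEi (log-scale RMSE)
    silog = np.sqrt(np.mean(log_diff ** 2) - np.mean(log_diff) ** 2) * 100  # SIlog
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))          # log10

    # Accuracy metrics: fraction of pixels whose depth ratio is below a threshold.
    ratio = np.maximum(pred / gt, gt / pred)
    d1, d2, d3 = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]

    return abs_rel, sq_rel, rmse, rmse_log, silog, log10, d1, d2, d3
```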
| Method | Depth | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 | Average |
|---|---|---|---|---|---|---|---|---|
| Driving Distance [m] | | 225.35 | 225.35 | 122.26 | 44.32 | 44.27 | 26.02 | |
| RMSE of ATE [m] | | | | | | | | |
| VINS-Mono [40] | None | 7.6614 | 2.0864 | 2.7397 | 0.8252 | 2.9859 | 4.6489 | 4.1034 |
| VINS-RGBD [9] | Sensor | 5.8164 | 7.8056 | 0.9733 | 1.5514 | 0.7652 | 0.7833 | 4.8166 |
| Ours | Baseline | 2.0736 | 0.8750 | 0.6901 | 0.2468 | 0.3556 | 0.2698 | 1.1382 |
| Ours | UD-Net † | 2.0724 | 0.8809 | 0.7417 | 0.2121 | 0.2948 | 0.2277 | 1.1411 |
| Ours | UD-Net (ours) | 2.1260 | 0.7693 | 0.5627 | 0.2193 | 0.2507 | 0.2037 | 1.0870 |
| Translation error of RPE [m] | | | | | | | | |
| VINS-Mono [40] | None | 0.1518 | 0.0539 | 0.0563 | 0.0857 | 0.3136 | 0.2527 | 0.1127 |
| VINS-RGBD [9] | Sensor | 0.1217 | 0.5044 | 0.0753 | 0.2152 | 0.0632 | 0.1252 | 0.2413 |
| Ours | Baseline | 0.0367 | 0.0328 | 0.0302 | 0.0304 | 0.0368 | 0.0562 | 0.0346 |
| Ours | UD-Net † | 0.0368 | 0.0331 | 0.0305 | 0.0311 | 0.0365 | 0.0565 | 0.0348 |
| Ours | UD-Net (ours) | 0.0377 | 0.0330 | 0.0295 | 0.0308 | 0.0338 | 0.0547 | 0.0347 |
| Rotation error of RPE [deg] | | | | | | | | |
| VINS-Mono [40] | None | 0.1379 | 0.1442 | 0.1844 | 0.4872 | 0.5463 | 0.6791 | 0.2175 |
| VINS-RGBD [9] | Sensor | 0.3601 | 0.4537 | 0.3722 | 1.3643 | 1.2669 | 2.4969 | 0.5969 |
| Ours | Baseline | 0.1396 | 0.1576 | 0.1828 | 0.4958 | 0.5164 | 0.7093 | 0.2220 |
| Ours | UD-Net † | 0.1405 | 0.1649 | 0.1894 | 0.4926 | 0.5169 | 0.7133 | 0.2258 |
| Ours | UD-Net (ours) | 0.1394 | 0.1557 | 0.1870 | 0.4863 | 0.5102 | 0.7233 | 0.2216 |
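The ATE and RPE values above follow the usual trajectory-evaluation conventions popularized by the TUM RGB-D benchmark [50]. As a reference, the sketch below computes the RMSE of ATE and per-interval RPE for trajectories that are assumed to be already time-synchronized and aligned; the alignment step and the exact frame interval used in the paper are not shown here.

```python
import numpy as np

def ate_rmse(gt_pos, est_pos):
    """RMSE of Absolute Trajectory Error for aligned, synchronized trajectories.
    gt_pos, est_pos: (N, 3) arrays of ground-truth and estimated positions."""
    err = np.linalg.norm(gt_pos - est_pos, axis=1)
    return np.sqrt(np.mean(err ** 2))

def rpe(gt_T, est_T, delta=1):
    """Relative Pose Error over a fixed frame interval.
    gt_T, est_T: sequences of 4x4 homogeneous poses. Returns per-pair
    translation errors [m] and rotation errors [deg]."""
    trans_err, rot_err = [], []
    for i in range(len(gt_T) - delta):
        gt_rel = np.linalg.inv(gt_T[i]) @ gt_T[i + delta]
        est_rel = np.linalg.inv(est_T[i]) @ est_T[i + delta]
        err = np.linalg.inv(gt_rel) @ est_rel
        trans_err.append(np.linalg.norm(err[:3, 3]))
        # Rotation error from the trace of the residual rotation matrix.
        angle = np.arccos(np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0))
        rot_err.append(np.degrees(angle))
    return np.array(trans_err), np.array(rot_err)
```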
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).