Monocular Absolute Depth Estimation from Motion for Small Unmanned Aerial Vehicles by Geometry-Based Scale Recovery
Abstract
1. Introduction
- We propose a monocular absolute depth estimation method for UAV flight scenes. Our method extends unsupervised monocular depth estimation from relative depth estimation in semi-structured ground scenes to absolute depth estimation in unstructured aerial scenes, which have not been studied as extensively. The multi-view geometry used in traditional MDE methods is introduced to address the scale ambiguity faced by modern unsupervised MDE approaches.
- We propose a scale recovery algorithm based on multi-view geometry for disambiguation. A scale factor between the relative and absolute scales is estimated for each frame, which allows the relative depth map to be multiplied pixel by pixel to obtain the corresponding absolute depth map. To achieve this, we take the displacement between adjacent frames as an intermediate quantity: its absolute value is available from the UAV positions, and its relative value can be calculated from the relative depth values of matched points between the two frames together with the UAV navigation data. The quotient of the absolute and relative values of this displacement is taken as the scale factor (this relation is restated compactly after this list).
- To make our method applicable to flight scenes, the attitude angles of the UAV obtained from its navigation sensors are introduced into the scale recovery algorithm to perform a series of geometric transformations, improving the accuracy of scale factors and absolute depth estimation results. Experiments also verify the robustness of our method to navigation sensor noise and its adaptability to different parameter settings, depth ranges and variations, and UAV motions.
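As a compact restatement of the relation described above (illustrative notation, not necessarily the paper's own symbols): the per-frame scale factor is the ratio of the metric inter-frame displacement, obtained from the UAV positions, to the same displacement expressed at the network's relative scale, and it converts the relative depth map into an absolute one pixel by pixel.

```latex
s_k \;=\; \frac{\lVert \mathbf{t}_k^{\mathrm{abs}} \rVert}{\lVert \mathbf{t}_k^{\mathrm{rel}} \rVert},
\qquad
D_k^{\mathrm{abs}}(u,v) \;=\; s_k \, D_k^{\mathrm{rel}}(u,v)
```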
2. Related Work
2.1. CNN-Based MDE
2.2. Scale Problems in Unsupervised MDE
3. Method
3.1. Pipeline
- Stage 1: Relative depth estimation and feature matching. This stage predicts relative depth maps for the reference and current frames using unsupervised deep learning. By introducing a geometry consistency loss as part of the supervision for network training, the relative depth predictions are kept scale-consistent across the sequence. Meanwhile, feature extraction and matching are performed between the two frames.
- Stage 2: Absolute depth estimation by geometry-based scale recovery. This stage transforms the relative depth map of the current frame into an absolute depth map, a step also known as “scale recovery”. This is achieved by calculating a scale factor between the relative and absolute scales from the relative depth values of matched feature points in the two frames and the UAV navigation data at the two imaging moments. In this calculation, the real-world scale is introduced through the UAV position data, and a series of geometric transformations is performed according to the UAV attitude angles. A simplified sketch of this two-stage process is given below.
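The following is a minimal sketch of the scale recovery stage under the description above, written in NumPy-style Python; all function and variable names (backproject, relative_baseline, scale_factor, R_ref_to_cur, etc.) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def backproject(pix, depth, K):
    """Lift pixel coordinates (N, 2) with per-point depths (N,) into 3D camera coordinates."""
    pix_h = np.hstack([pix, np.ones((len(pix), 1))])   # homogeneous pixels, shape (N, 3)
    rays = (np.linalg.inv(K) @ pix_h.T).T              # normalized camera rays
    return rays * depth[:, None]                       # scale each ray by its (relative) depth

def relative_baseline(pix_ref, pix_cur, d_ref, d_cur, K, R_ref_to_cur):
    """Length of the inter-frame translation at the network's (relative) scale.

    pix_ref/pix_cur are matched feature locations (e.g., SIFT), d_ref/d_cur the relative
    depths sampled at those locations, and R_ref_to_cur a rotation built from the UAV
    attitude angles at the two imaging moments."""
    P_ref = backproject(pix_ref, d_ref, K)
    P_cur = backproject(pix_cur, d_cur, K)
    # With the rotation removed, each matched pair gives an estimate of the translation;
    # the median over all matches adds robustness to matching outliers.
    t_rel = np.median(P_cur - (R_ref_to_cur @ P_ref.T).T, axis=0)
    return np.linalg.norm(t_rel)

def scale_factor(baseline_rel, pos_ref, pos_cur):
    """Ratio of the metric baseline (from UAV positions) to the relative baseline."""
    return np.linalg.norm(pos_cur - pos_ref) / baseline_rel

# Usage sketch (inputs assumed to exist): relative depth maps come from the unsupervised
# network, matches from SIFT, and navigation data from the UAV's GPS/IMU.
#   b_rel = relative_baseline(pix_ref, pix_cur, d_ref, d_cur, K, R_ref_to_cur)
#   s = scale_factor(b_rel, pos_ref, pos_cur)
#   depth_abs_cur = s * depth_rel_cur   # pixel-wise scale recovery (Stage 2 output)
```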
3.2. Relative Depth Estimation and Feature Matching
3.2.1. Relative Depth Estimation with Scale Consistency
3.2.2. Feature Extraction and Matching
3.3. Absolute Depth Estimation by Geometry-Based Scale Recovery
3.3.1. Calculation of Relative Baseline Length
3.3.2. Calculation of Scale Factor
4. Experiment
4.1. Dataset
4.2. Implementation Details
4.3. Metrics
4.4. Quantitative Results
4.4.1. Performance
4.4.2. Comparison with Other Methods
4.4.3. Ablation Study
4.4.4. Results with Different Settings
- (a) Frame intervals
- (b) Depth ranges and variations
- (c) UAV motion
4.4.5. Time Results
4.5. Qualitative Results
5. Conclusions
- Our method only requires the UAV to be equipped with a camera and ordinary navigation sensors, without any additional conditions or prior information, making it suitable for most small flying platforms.
- Since the scale recovery algorithm serves as a post-processing stage after relative depth estimation, it can be grafted onto various unsupervised MDE methods, which makes it flexible for practical applications.
- The method is robust to navigation sensor noise and is also applicable to different depth ranges and variations, as well as to different UAV motions.
- Reducing the relative depth estimation error and feature matching error is the key to improving the accuracy of our absolute depth estimation method. The performance could also be enhanced with the help of other sensors or visual tasks.
- Integrating the feature matching and scale recovery algorithms into the CNN to obtain an end-to-end architecture for absolute depth estimation could speed up inference, which may enable our method to run on embedded devices for real-time applications.
- Improving the robustness of this method to different climate and weather conditions is also a worthwhile research direction.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| UAV | Unmanned aerial vehicle |
| MDE | Monocular depth estimation |
| CNN | Convolutional neural network |
| GT | Ground truth |
| DoF | Degrees of freedom |
| GPS | Global Positioning System |
| IMU | Inertial measurement unit |
| EKF | Extended Kalman filter |
| SLAM | Simultaneous localization and mapping |
| SSIM | Structural similarity |
| SIFT | Scale-invariant feature transform |
| FOV | Field of view |
| Abs Rel | Absolute relative error |
| MRE | Mean relative error |
| Sq Rel | Square relative error |
| RMSE | Root mean square error |
| RMSE log | Logarithmic root mean square error |
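For reference, the error and accuracy metrics listed above and used in the result tables are commonly defined as follows in the MDE literature, with d_i the predicted and d_i* the ground-truth depth over N evaluated pixels (a standard formulation; the paper's exact evaluation protocol, e.g., the depth cap, may differ):

```latex
\mathrm{Abs\;Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert d_i - d_i^{*}\rvert}{d_i^{*}}, \qquad
\mathrm{Sq\;Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{(d_i - d_i^{*})^{2}}{d_i^{*}},
\\[6pt]
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(d_i - d_i^{*})^{2}}, \qquad
\mathrm{RMSE\;log} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\log d_i - \log d_i^{*}\bigr)^{2}},
\\[6pt]
a_k = \frac{1}{N}\,\Bigl\lvert \bigl\{\, i : \max(d_i/d_i^{*},\; d_i^{*}/d_i) < 1.25^{k} \,\bigr\}\Bigr\rvert, \quad k\in\{1,2,3\}
```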
References
1. Khan, F.; Salahuddin, S.; Javidnia, H. Deep learning-based monocular depth estimation methods—A state-of-the-art review. Sensors 2020, 20, 2272.
2. Masoumian, A.; Rashwan, H.A.; Cristiano, J.; Asif, M.S.; Puig, D. Monocular depth estimation using deep learning: A review. Sensors 2022, 22, 5353.
3. Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards real-time monocular depth estimation for robotics: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16940–16961.
4. Xue, F.; Zhuo, G.; Huang, Z.; Fu, W.; Wu, Z.; Ang, M.H. Toward hierarchical self-supervised monocular absolute depth estimation for autonomous driving applications. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 2330–2337.
5. Florea, H.; Nedevschi, S. Survey on monocular depth estimation for unmanned aerial vehicles using deep learning. In Proceedings of the IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 22–24 September 2022; pp. 319–326.
6. Romero-Lugo, A.; Magadan-Salazar, A.; Fuentes-Pacheco, J.; Pinto-Elías, R. A comparison of deep neural networks for monocular depth map estimation in natural environments flying at low altitude. Sensors 2022, 22, 9830.
7. Wang, D.; Li, W.; Liu, X.; Li, N.; Zhang, C. UAV environmental perception and autonomous obstacle avoidance: A deep learning and depth camera combined solution. Comput. Electron. Agric. 2020, 175, 105523.
8. Zhang, Z.; Xiong, M.; Xiong, H. Monocular depth estimation for UAV obstacle avoidance. In Proceedings of the International Conference on Cloud Computing and Internet of Things (CCIOT), Changchun, China, 6–7 December 2019; pp. 43–47.
9. Ullman, S. The interpretation of structure from motion. Proc. R. Soc. Lond. B 1979, 203, 405–426.
10. Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China-Technol. Sci. 2020, 63, 1612–1627.
11. Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33.
12. He, L.; Dong, Q.; Hu, Z. The inherent ambiguity in scene depth learning from single images. Sci. Sin. Inform. 2016, 46, 811–818.
13. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 270–279.
14. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858.
15. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838.
16. Fonder, M.; Van Droogenbroeck, M. Mid-Air: A multi-modal dataset for extremely low altitude drone flights. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 553–562.
17. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. (NIPS) 2014, 27, 2366–2374.
18. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
19. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011.
20. Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 September 2016; pp. 740–756.
21. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. Adv. Neural Inf. Process. Syst. (NIPS) 2015, 28, 2017–2025.
22. He, L.; Yang, J.; Kong, B.; Wang, C. An automatic measurement method for absolute depth of objects in two monocular images based on SIFT feature. Appl. Sci. 2017, 7, 517.
23. Zhang, C.; Cao, Y.; Ding, M.; Li, X. Object depth measurement and filtering from monocular images for unmanned aerial vehicles. J. Aerosp. Inf. Syst. 2022, 19, 214–223.
24. Petrovai, A.; Nedevschi, S. Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1578–1588.
25. Zhang, S.; Zhang, J.; Tao, D. Towards scale-aware, robust, and generalizable unsupervised monocular depth estimation by integrating IMU motion dynamics. In Proceedings of the European Conference on Computer Vision (ECCV), Tel-Aviv, Israel, 23–27 October 2022; pp. 143–160.
26. Pinard, C.; Chevalley, L.; Manzanera, A.; Filliat, D. Learning structure-from-motion from motion. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 363–376.
27. Bian, J.W.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. Adv. Neural Inf. Process. Syst. (NeurIPS) 2019, 32, 35–45.
28. Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564.
29. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
31. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
32. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
33. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
34. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Proceedings of the International Conference on Field and Service Robotics (FSR), Zurich, Switzerland, 12–15 September 2018; pp. 621–635.
35. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
37. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. (NeurIPS) 2019, 32, 8026–8037.
| Method | Training Dataset | Testing Dataset | Depth Type | Evaluation Mode | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SC-Depth | K 1 | MA | Relative | MS 2 | 0.2303 | 3.0121 | 9.8578 | 0.3491 | 0.6292 | 0.8537 | 0.9360 |
| SC-Depth | K+MA | MA | Relative | MS | 0.1906 | 1.8670 | 7.3232 | 0.2587 | 0.7092 | 0.8996 | 0.9631 |
| SC-Depth | K+MA | MA | Relative | Direct | 0.9404 | 18.0870 | 22.7413 | 2.9993 | 0.0000 | 0.0000 | 0.0000 |
| Ours | K+MA | MA | Absolute | Direct | 0.2953 | 3.2969 | 9.5658 | 0.4261 | 0.4069 | 0.6941 | 0.8442 |
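As a clarifying aside on the "Evaluation Mode" column above: assuming "MS" denotes the usual per-image median scaling and "Direct" means no rescaling, the difference can be sketched as follows (function names illustrative, not the authors' code).

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative error between predicted and ground-truth depths."""
    return np.mean(np.abs(pred - gt) / gt)

def eval_median_scaled(pred, gt):
    """Rescale the prediction by the ratio of GT and prediction medians before evaluating,
    which hides any global scale error; only meaningful for scale-ambiguous relative depth."""
    return abs_rel(pred * np.median(gt) / np.median(pred), gt)

def eval_direct(pred, gt):
    """Evaluate the prediction as-is; requires the prediction to already be at metric scale."""
    return abs_rel(pred, gt)
```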
| Testing Dataset | Depth Type | Position Source | Attitude Source | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| MA | Absolute | GT | GT | 0.2953 | 3.2969 | 9.5658 | 0.4261 | 0.4069 | 0.6941 | 0.8442 |
| MA | Absolute | GPS | GT | 0.3052 | 3.3454 | 9.6172 | 0.4336 | 0.3874 | 0.6795 | 0.8408 |
| MA | Absolute | GT | IMU | 0.3000 | 3.2852 | 9.5836 | 0.4311 | 0.3844 | 0.6863 | 0.8451 |
| MA | Absolute | GPS | IMU | 0.3094 | 3.3360 | 9.6263 | 0.4387 | 0.3695 | 0.6670 | 0.8399 |
| Method | Depth Type | Source of Scale Information | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| SC-Depth [28] | Relative | / | 0.1140 | 0.8130 | 4.7060 | 0.1910 | 0.8730 | 0.9600 | 0.9820 |
| DNet [4] | Absolute | Camera height | 0.1130 | 0.8640 | 4.8120 | 0.1910 | 0.8770 | 0.9600 | 0.9810 |
| SD-SSMDE [24] | Absolute | Camera height | 0.1000 | 0.6610 | 4.2640 | 0.1720 | 0.8960 | 0.9670 | 0.9850 |
| DynaDepth [25] | Absolute | IMU | 0.1090 | 0.7870 | 4.7050 | 0.1950 | 0.8690 | 0.9580 | 0.9810 |
| Pinard et al. [26] | Absolute | Vehicle position | 0.3124 | 5.0302 | 8.4985 | 0.4095 | 0.5919 | 0.7961 | 0.8821 |
| Ours | Absolute | Vehicle position | 0.3060 | 2.6086 | 7.4261 | 0.6470 | 0.4891 | 0.7312 | 0.8830 |
| Testing Dataset | Scale Recovery Inputs | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|
| MA | Image + baseline | 0.3130 | 4.3951 | 10.6195 | 0.4927 | 0.4179 | 0.6707 | 0.7956 |
| MA | Image + baseline + attitudes | 0.2953 | 3.2969 | 9.5658 | 0.4261 | 0.4069 | 0.6941 | 0.8442 |
| K | Image + baseline | 0.4198 | 4.0908 | 9.3903 | 1.2373 | 0.2894 | 0.5347 | 0.6703 |
| K | Image + baseline + attitudes | 0.3060 | 2.6086 | 7.4261 | 0.6470 | 0.4891 | 0.7312 | 0.8830 |
| Sequence Number | Depth Range for Evaluation | Depth Variation Ranking 1 | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 4000 | 0–500 m | 2 | 0.2453 | 2.6401 | 14.9685 | 0.3501 | 0.4249 | 0.7927 | 0.9507 |
| 4001 | 0–500 m | 5 | 0.4159 | 15.4110 | 49.5378 | 0.6622 | 0.1692 | 0.3751 | 0.6125 |
| 4002 | 0–500 m | 4 | 0.3538 | 6.2921 | 22.8000 | 0.5591 | 0.4248 | 0.6788 | 0.8244 |
| 4003 | 0–500 m | 3 | 0.2948 | 3.4807 | 16.6903 | 0.4536 | 0.3633 | 0.6791 | 0.8316 |
| 4004 | 0–500 m | 1 | 0.1800 | 0.4894 | 2.2918 | 0.2406 | 0.6327 | 0.9304 | 0.9905 |
| Testing Sequence | Speed Variation | Speed Value | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Customized | Constant | 10 m/s | 0.2798 | 3.3815 | 9.4691 | 0.3897 | 0.5073 | 0.7718 | 0.8844 |
| Customized | Varying | 8–12 m/s | 0.2840 | 3.4040 | 9.5028 | 0.3878 | 0.4898 | 0.7721 | 0.8919 |
| Testing Dataset | Magnitude of Lateral Acceleration | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | a1 ↑ | a2 ↑ | a3 ↑ |
|---|---|---|---|---|---|---|---|---|
| MA | The higher half | 0.3052 | 3.8611 | 10.7055 | 0.4477 | 0.3911 | 0.6694 | 0.8293 |
| MA | The lower half | 0.2853 | 2.7314 | 8.4232 | 0.4045 | 0.4227 | 0.7189 | 0.8592 |
| Step | Time per Frame (ms) |
|---|---|
| Relative depth estimation | 28.0 |
| Feature matching | 112.3 |
| Scale recovery | 97.7 |
| Total | 238.0 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, C.; Weng, X.; Cao, Y.; Ding, M. Monocular Absolute Depth Estimation from Motion for Small Unmanned Aerial Vehicles by Geometry-Based Scale Recovery. Sensors 2024, 24, 4541. https://doi.org/10.3390/s24144541