Learning Effective Geometry Representation from Videos for Self-Supervised Monocular Depth Estimation
Abstract
1. Introduction
- We devise an Inter-task Attention Module (IAM) to exploit the inter-task geometric correlation between the depth and pose estimation networks. It learns attention maps from depth information that guide the pose network toward the key regions it should attend to (an illustrative sketch of both proposed modules follows this list). To the best of our knowledge, this is the first attempt to exploit the inter-task geometric correlation in this way for self-supervised monocular depth estimation.
- We introduce a Spatial-Temporal Memory Module (STMM) in the depth estimation network to leverage the spatial and temporal geometric context shared by consecutive frames, which effectively exploits historical information and improves the estimation results.
- We conduct comprehensive empirical studies on the KITTI dataset; the single-frame inference result of our method outperforms state-of-the-art methods by a relative gain of 6.6% on the primary evaluation metric.
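Neither module's internal architecture is detailed in this outline, so the following PyTorch sketch is only an illustrative reading of the two ideas, not the authors' implementation: a depth-guided spatial attention gate standing in for the IAM, and a ConvGRU-style recurrent cell standing in for the STMM. All layer widths, kernel sizes, and the fusion scheme are assumptions.

```python
# Illustrative sketch only: layer shapes, the gating scheme, and the ConvGRU
# formulation are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class InterTaskAttention(nn.Module):
    """Depth-guided spatial attention applied to pose-network features."""

    def __init__(self, depth_ch: int, pose_ch: int):
        super().__init__()
        # Collapse depth features into a single-channel spatial attention map.
        self.attn = nn.Sequential(
            nn.Conv2d(depth_ch, pose_ch // 2, 3, padding=1),
            nn.ELU(inplace=True),
            nn.Conv2d(pose_ch // 2, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, depth_feat, pose_feat):
        # depth_feat, pose_feat: (B, C, H, W) at the same spatial resolution.
        a = self.attn(depth_feat)          # (B, 1, H, W) attention map
        return pose_feat + pose_feat * a   # residual re-weighting of pose features


class SpatialTemporalMemory(nn.Module):
    """ConvGRU-style cell carrying geometric context across consecutive frames."""

    def __init__(self, ch: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 2 * ch, 3, padding=1)  # update / reset gates
        self.cand = nn.Conv2d(2 * ch, ch, 3, padding=1)       # candidate state

    def forward(self, x, h=None):
        if h is None:
            h = torch.zeros_like(x)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new     # memory reused when the next frame arrives
```

Under this reading, the attention-gated features would feed the ego-motion head of the pose network, while the memory state would be propagated along the monocular video during training; single-frame inference would simply start from an empty state.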
2. Related Work
2.1. Inter-Task Monocular Video Learning
2.2. Long-Range Representation Learning
3. Method
3.1. Problem Definition
3.2. Network Architecture
3.2.1. Inter-Task Attention Module
3.2.2. Spatial-Temporal Memory Module
4. Experiments
4.1. Depth Estimation Result
4.2. Evaluation of Generalization Ability
4.3. Ablation Study
Effect of Inter-Task Attention Module
4.4. Evaluation with Improved Ground Truth
4.5. Single-Scale Evaluation
4.6. Results with Post-Processing
4.7. Inference Speed
5. Visual Odometry
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838.
- Zhao, H.; Zhang, Q.; Zhao, S.; Chen, Z.; Zhang, J.; Tao, D. SimDistill: Simulated multi-modal distillation for BEV 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
- Shu, Y.; Hu, A.; Zheng, Y.; Gan, L.; Xiao, G.; Zhou, C.; Song, L. Evaluation of ship emission intensity and the inaccuracy of exhaust emission estimation model. Ocean Eng. 2023, 287, 115723.
- Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5667–5675.
- Zhan, H.; Garg, R.; Weerasekera, C.S.; Li, K.; Agarwal, H.; Reid, I. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
- Yang, Z.; Wang, P.; Xu, W.; Zhao, L.; Nevatia, R. Unsupervised learning of geometry with edge-aware depth-normal consistency. arXiv 2017, arXiv:1711.03665.
- Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12240–12249.
- Kim, H.R.; Angelaki, D.E.; DeAngelis, G.C. The neural basis of depth perception from motion parallax. Philos. Trans. R. Soc. B Biol. Sci. 2016, 371, 20150256.
- Colby, C. Perception of extrapersonal space: Psychological and neural aspects. Int. Encycl. Soc. Behav. Sci. 2001, 11205–11209.
- Gieseking, J.J.; Mangold, W.; Katz, C.; Low, S.; Saegert, S. The People, Place, and Space Reader; Routledge: London, UK, 2014.
- Gregory, R.L. Knowledge in perception and illusion. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 1997, 352, 1121–1127.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803.
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015.
- Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8001–8008.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Zou, Y.; Luo, Z.; Huang, J.-B. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 36–53.
- Zhao, H.; Zhang, J.; Zhang, S.; Tao, D. JPerceiver: Joint perception network for depth, pose and layout estimation in driving scenes. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022.
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 7286–7291.
- CS Kumar, A.; Bhandarkar, S.M.; Prasad, M. DepthNet: A recurrent neural network architecture for monocular depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 283–291.
- Wang, R.; Pizer, S.M.; Frahm, J.-M. Recurrent neural network for (un-)supervised learning of monocular video visual odometry and depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5555–5564.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
- Zhao, H.; Zhang, J.; Chen, Z.; Yuan, B.; Tao, D. On robust cross-view consistency in self-supervised monocular depth estimation. Mach. Intell. Res. 2024, 21, 495–513.
- Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023.
- Zhao, C.; Zhang, Y.; Poggi, M.; Tosi, F.; Guo, X.; Zhu, Z.; Huang, G.; Tang, Y.; Mattoccia, S. MonoViT: Self-supervised monocular depth estimation with a vision transformer. In Proceedings of the International Conference on 3D Vision, Prague, Czechia, 12–15 September 2022.
- Bae, J.; Moon, S.; Im, S. Deep digging into the generalization of self-supervised monocular depth estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
- Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
- Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv 2015, arXiv:1511.07289.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19.
- Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374.
- Garg, R.; BG, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 740–756.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
- Pillai, S.; Ambruş, R.; Gaidon, A. SuperDepth: Self-supervised, super-resolved monocular depth estimation. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 9250–9256.
- Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2022–2030.
- Bian, J.-W.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.-M.; Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019.
- Zhou, J.; Wang, Y.; Qin, K.; Zeng, W. Unsupervised high-resolution depth learning from videos with dual networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
- Spencer, J.; Bowden, R.; Hadfield, S. DeFeat-Net: General monocular depth via simultaneous unsupervised representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14402–14413.
- Zhao, W.; Liu, S.; Shu, Y.; Liu, Y.-J. Towards better generalization: Joint depth-pose learning without PoseNet. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012.
- Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 824–840.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2024–2039.
- Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant CNNs. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017.
- Poggi, M.; Tosi, F.; Mattoccia, S. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proceedings of the International Conference on 3D Vision, Verona, Italy, 5–8 September 2018.
- Zhao, J.; Yan, Z.; Zhou, Z.; Chen, X.; Wu, B.; Wang, S. A ship trajectory prediction method based on GAT and LSTM. Ocean Eng. 2023, 289, 116159.
- Grupp, M. evo: Python Package for the Evaluation of Odometry and SLAM. 2017. Available online: https://github.com/MichaelGrupp/evo (accessed on 5 May 2023).
Methods | Train | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---
[36] † | S | 0.152 | 1.226 | 5.849 | 0.246 | 0.784 | 0.921 | 0.967 |
MD (R50) † [37] | S | 0.133 | 1.142 | 5.533 | 0.230 | 0.830 | 0.936 | 0.970 |
SuperDepth [38] | S | 0.112 | 0.875 | 4.958 | 0.207 | 0.852 | 0.947 | 0.977 |
MD2 [2] | S | 0.107 | 0.849 | 4.764 | 0.201 | 0.874 | 0.953 | 0.977 |
[1] † | M | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
[5] | M | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
GeoNet † [8] | M | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
DDVO [39] | M | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
DF-Net [18] | M | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
Struct2depth [16] | M | 0.141 | 1.026 | 5.142 | 0.210 | 0.845 | 0.845 | 0.948 |
CC [9] | M | 0.140 | 1.070 | 5.326 | 0.217 | 0.826 | 0.941 | 0.975 |
SC-SFMLearner [40] | M | 0.137 | 1.089 | 5.439 | 0.217 | 0.830 | 0.942 | 0.975 |
HR [41] | M | 0.121 | 0.873 | 4.945 | 0.197 | 0.853 | 0.955 | 0.982 |
MD2(R18) [2] | M | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 |
DeFeat [42] | M | 0.126 | 0.925 | 5.035 | 0.200 | 0.862 | 0.954 | 0.980 |
[43] | M | 0.113 | 0.704 | 4.581 | 0.184 | 0.871 | 0.961 | 0.984 |
Ours (R18) | M | 0.106 | 0.761 | 4.545 | 0.182 | 0.890 | 0.965 | 0.983 |
Ours (R50) | M | 0.105 | 0.731 | 4.412 | 0.181 | 0.891 | 0.965 | 0.983 |
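For reference, the error and accuracy columns in these tables are the standard monocular depth metrics popularised by Eigen et al.; a minimal NumPy sketch of how they are conventionally computed is given below. The 80 m depth cap and per-image median scaling follow common KITTI practice and are assumptions about the exact evaluation script used here.

```python
# Standard monocular depth metrics (Abs Rel, Sq Rel, RMSE, RMSE log, delta
# thresholds), computed on valid ground-truth pixels of a single image.
import numpy as np


def compute_depth_metrics(gt: np.ndarray, pred: np.ndarray,
                          min_depth: float = 1e-3, max_depth: float = 80.0):
    """gt, pred: flattened depth arrays in metres for one image."""
    mask = (gt > min_depth) & (gt < max_depth)
    gt, pred = gt[mask], pred[mask]

    # Per-image median scaling, needed for scale-ambiguous monocular models.
    pred = pred * (np.median(gt) / np.median(pred))
    pred = np.clip(pred, min_depth, max_depth)

    thresh = np.maximum(gt / pred, pred / gt)
    a1, a2, a3 = [(thresh < 1.25 ** i).mean() for i in (1, 2, 3)]

    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```

In the Train column, S denotes stereo supervision and M monocular video; the median scaling above is only required for the scale-ambiguous monocular (M) models.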
Methods | Train | Abs Rel | Sq Rel | RMSE | log10 |
---|---|---|---|---|---|
[47] | D | 0.475 | 6.562 | 10.05 | 0.165 |
[37] | S | 0.544 | 10.94 | 11.760 | 0.193 |
[1] | M | 0.383 | 5.321 | 10.470 | 0.478 |
DDVO [39] | M | 0.387 | 4.720 | 8.090 | 0.204 |
MD2 [2] | M | 0.322 | 3.589 | 7.417 | 0.163 |
Ours | M | 0.316 | 3.200 | 7.095 | 0.158 |
Methods | Net | IAM | STMM | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓
---|---|---|---|---|---|---|---
Ours (full) | R50 | √ | √ | 0.105 | 0.731 | 4.412 | 0.181
Ours (full) | R18 | √ | √ | 0.106 | 0.761 | 4.545 | 0.182
Ours w/o IAM | R18 | | √ | 0.112 | 0.844 | 4.815 | 0.190
Ours w/o STMM | R18 | √ | | 0.111 | 0.829 | 4.799 | 0.190
Ours w/o STMM | R18 | √ | CBAM | 0.109 | 0.778 | 4.591 | 0.186
Ours w/o STMM | R18 | √ | SE | 0.111 | 0.799 | 4.704 | 0.187
Ours (plain) | R18 | | | 0.120 | 0.915 | 4.972 | 0.196
Ours (full) smaller | R18 | √ | √ | 0.110 | 0.809 | 4.616 | 0.185
Ours (full) w/o pre | R18 | √ | √ | 0.124 | 0.847 | 4.713 | 0.196
Methods | Dist. (m) | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓
---|---|---|---|---|---
HR [41] | ≤20 | 0.102 | 0.391 | 1.959 | 0.146 |
Ours | ≤20 | 0.076 | 0.189 | 1.410 | 0.123 |
HR [41] | >20 | 0.187 | 2.430 | 9.695 | 0.305 |
Ours | >20 | 0.152 | 2.556 | 9.226 | 0.211 |
Ours w/o STMM | >20 | 0.179 | 2.803 | 9.667 | 0.291 |
Ours w/o IAM | >20 | 0.158 | 2.643 | 9.346 | 0.234 |
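The distance-based breakdown above restricts evaluation to ground-truth points inside the given depth band. A small sketch reusing compute_depth_metrics from the earlier listing; whether the scale factor is computed before or after the split is an assumption here, and the 20 m threshold is taken to apply to ground-truth depth.

```python
# Evaluate near (<= 20 m) and far (> 20 m) ground-truth points separately.
def metrics_by_range(gt, pred, split_at=20.0):
    near = gt <= split_at
    return (compute_depth_metrics(gt[near], pred[near]),
            compute_depth_metrics(gt[~near], pred[~near]))
```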
Methods | Train | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---
MD R50 † [37] | S | 0.109 | 0.811 | 4.568 | 0.166 | 0.877 | 0.967 | 0.988 | |
3Net (VGG) [49] | S | 0.119 | 0.920 | 4.824 | 0.182 | 0.856 | 0.957 | 0.985 | |
3Net (R50) [49] | S | 0.102 | 0.675 | 4.293 | 0.159 | 0.881 | 0.969 | 0.991 | |
SuperDepth [38] | S | 0.090 | 0.542 | 3.967 | 0.144 | 0.901 | 0.976 | 0.993 | |
MD2 [2] | S | 0.085 | 0.537 | 3.868 | 0.139 | 0.912 | 0.979 | 0.993 | |
Zhou et al. † [1] | M | 0.176 | 1.532 | 6.129 | 0.244 | 0.758 | 0.921 | 0.971 | |
Mahjourian et al. [5] | M | 0.134 | 0.983 | 5.501 | 0.203 | 0.827 | 0.944 | 0.981 | |
GeoNet [8] | M | 0.132 | 0.994 | 5.240 | 0.193 | 0.833 | 0.953 | 0.985 | |
DDVO [39] | M | 0.126 | 0.866 | 4.932 | 0.185 | 0.851 | 0.958 | 0.986 | |
CC [9] | M | 0.123 | 0.881 | 4.834 | 0.181 | 0.860 | 0.959 | 0.985 | |
MD2 (R18) [2] | M | 0.090 | 0.545 | 3.942 | 0.137 | 0.914 | 0.983 | 0.995 | |
Ours (R18) | M | 0.083 | 0.447 | 3.667 | 0.126 | 0.924 | 0.986 | 0.997 |
Methods | σ | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---|---
Zhou et al. † [1] | 0.210 | 0.258 | 2.338 | 7.040 | 0.309 | 0.601 | 0.853 | 0.940 |
Mahjourian et al. [5] | 0.189 | 0.221 | 1.663 | 6.220 | 0.265 | 0.665 | 0.892 | 0.962 |
GeoNet [8] | 0.172 | 0.202 | 1.521 | 5.829 | 0.244 | 0.707 | 0.913 | 0.970 |
DDVO [39] | 0.108 | 0.147 | 1.014 | 5.183 | 0.204 | 0.808 | 0.946 | 0.983 |
CC [9] | 0.162 | 0.188 | 1.298 | 5.467 | 0.232 | 0.724 | 0.927 | 0.974 |
MD2 (R18) [2] | 0.093 | 0.109 | 0.623 | 4.136 | 0.154 | 0.873 | 0.977 | 0.994 |
Ours (R18) | 0.082 | 0.096 | 0.507 | 3.828 | 0.139 | 0.898 | 0.983 | 0.996 |
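Per-image median scaling can mask an unstable scale, so the single-scale protocol applies one global factor to the whole test set instead; the spread of the per-image ratios is what a scale-consistency column in such tables typically reports. A rough sketch of the two scaling modes follows; how the global factor is aggregated here is an assumption.

```python
import numpy as np

def scale_ratios(gts, preds):
    # One ground-truth / prediction ratio per test image.
    return np.array([np.median(g) / np.median(p) for g, p in zip(gts, preds)])

def rescale(preds, ratios, single_scale=False):
    if single_scale:
        s = np.median(ratios)                       # one factor for the whole test set
        return [p * s for p in preds]
    return [p * r for p, r in zip(preds, ratios)]   # per-image median scaling
```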
Methods | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑
---|---|---|---|---|---|---|---
MD2 [2] | 0.115 | 0.882 | 4.701 | 0.190 | 0.879 | 0.961 | 0.982 | |
MD2 (pp) [2] | 0.112 | 0.838 | 4.607 | 0.187 | 0.883 | 0.962 | 0.982 | |
Ours (R18) | 0.106 | 0.761 | 4.545 | 0.182 | 0.890 | 0.965 | 0.983 | |
Ours (R18 + pp) | 0.104 | 0.726 | 4.457 | 0.180 | 0.893 | 0.965 | 0.984 | |
Ours (R50) | 0.105 | 0.731 | 4.412 | 0.181 | 0.891 | 0.965 | 0.983 | |
Ours (R50 + pp) | 0.104 | 0.709 | 4.352 | 0.179 | 0.894 | 0.966 | 0.984 |
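The "(pp)" rows apply test-time post-processing in the spirit of Godard et al. [37]: the network also predicts on a horizontally flipped copy of the input, the second prediction is flipped back, and the two disparity maps are blended. A minimal sketch of the idea follows; the plain averaging below is a simplification, since the original recipe ramps between the two maps near the image borders.

```python
import torch

@torch.no_grad()
def flip_post_process(depth_net, image):
    # image: (B, 3, H, W). Predict on the input and on its mirror image,
    # un-flip the second prediction, and average the two disparity maps.
    d = depth_net(image)
    d_flipped = torch.flip(depth_net(torch.flip(image, dims=[3])), dims=[3])
    return 0.5 * (d + d_flipped)
```

As the table shows, the extra forward pass buys a small but consistent improvement at roughly double the inference cost.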
Device | Time (s) | Speed (f/s) |
---|---|---|
GPU | 60.9 | 11.5 |
CPU | 8721.1 | 0.08 |
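GPU throughput figures such as these are sensitive to asynchronous CUDA execution, so timings are normally taken after a warm-up phase and with explicit synchronisation before reading the clock. A generic measurement sketch follows; the warm-up length and batch handling are arbitrary choices, not the protocol used for the table.

```python
import time
import torch

@torch.no_grad()
def measure_fps(net, loader, device="cuda", warmup=10):
    net.eval().to(device)
    n_frames, start = 0, None
    for i, img in enumerate(loader):
        img = img.to(device, non_blocking=True)
        net(img)
        if device == "cuda":
            torch.cuda.synchronize()      # make sure the forward pass has finished
        if i == warmup:                   # start the clock only after warm-up
            start = time.perf_counter()
        elif i > warmup:
            n_frames += img.shape[0]
    return n_frames / (time.perf_counter() - start)
```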