Efficient Structure from Motion for Large-Size Videos from an Open Outdoor UAV Dataset
Abstract
1. Introduction
2. Related Work
2.1. Unstructured, Sparsely Sampled Collection
2.2. Coherent, Densely Sampled Collection
2.3. Keypoint Adjustment
3. Proposed Method
3.1. System Overview
3.2. Initial Pose Estimation
3.3. Two-Step Keypoint Adjustment
3.3.1. Coarse Keypoint Adjustment
3.3.2. Sub-Pixel Refinement
3.4. Global Pose Refinement
3.4.1. Rotation Averaging
3.4.2. Rotation Averaged Bundle Adjustment
- Reprojection term: this term measures the reprojection error over all tie points in bundle adjustment, as follows:

$$E_{\mathrm{proj}} = \sum_{i}\sum_{j} \rho\!\left( \left\| \pi\!\left( \mathbf{R}_i, \mathbf{t}_i, \mathbf{X}_j \right) - \mathbf{x}_{ij} \right\|^2 \right),$$

where $\pi(\cdot)$ projects tie point $\mathbf{X}_j$ into image $i$ with pose $(\mathbf{R}_i, \mathbf{t}_i)$, $\mathbf{x}_{ij}$ is the observed keypoint, and $\rho$ is a robust loss.
- Known rotation term: this term acts as a regularizer that reduces the accumulated drift by penalizing deviation from the rotations obtained by rotation averaging, which is given by:

$$E_{\mathrm{rot}} = \lambda \sum_{i} \left\| \log\!\left( \bar{\mathbf{R}}_i^{\top} \mathbf{R}_i \right) \right\|^2,$$

where $\bar{\mathbf{R}}_i$ is the averaged rotation of image $i$ and $\lambda$ balances the two terms; a small Python sketch combining both terms follows this list.
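To make the interplay of the two terms concrete, the sketch below assembles the residual vector of such a rotation-averaged objective. It is an illustration under our own assumptions, not the paper's implementation: the pinhole projection model, the axis-angle form of the rotation penalty, and all function and variable names (`ba_residuals`, `prior_rotations`, `lam`) are hypothetical.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def ba_residuals(poses, points, observations, prior_rotations, K, lam=1.0):
    """Stack reprojection and known-rotation residuals for one BA evaluation.

    poses:           list of (R, t) world-to-camera poses being optimized
    points:          (M, 3) array of tie-point positions
    observations:    iterable of (cam_idx, pt_idx, uv) keypoint measurements
    prior_rotations: per-camera rotations from rotation averaging
    K:               3x3 intrinsic matrix
    lam:             weight of the known-rotation regularizer
    """
    res = []
    # Reprojection term: pixel error of every tie-point observation.
    for cam_idx, pt_idx, uv in observations:
        R, t = poses[cam_idx]
        p_cam = R @ points[pt_idx] + t          # tie point in camera frame
        p_img = K @ p_cam                       # pinhole projection
        res.extend(p_img[:2] / p_img[2] - uv)   # projected minus observed
    # Known rotation term: angular deviation from the averaged rotations.
    for (R, _), R_bar in zip(poses, prior_rotations):
        rel = Rotation.from_matrix(R_bar.T @ R)  # relative rotation
        res.extend(lam * rel.as_rotvec())        # zero when R matches prior
    return np.asarray(res)
```

After flattening the poses and points into a single parameter vector, these residuals can be minimized with a standard solver such as scipy.optimize.least_squares.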
4. Experiment
4.1. Dataset and Metrics
4.1.1. Dataset
4.1.2. Metrics
- Check point error: the accuracy of triangulation is evaluated with surveyed points called check points (CPs) that were withheld from georeferencing. Given a check point with surveyed coordinate $(X_i, Y_i, Z_i)$ and triangulated coordinate $(\hat{X}_i, \hat{Y}_i, \hat{Z}_i)$, the root mean square errors (RMSEs) for plane ($\sigma_{xy}$), elevation ($\sigma_{z}$), and pixel ($\sigma_{px}$) in terms of m CPs are evaluated as follows:

$$\sigma_{xy} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left[ \left(\hat{X}_i - X_i\right)^2 + \left(\hat{Y}_i - Y_i\right)^2 \right]}, \qquad \sigma_{z} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(\hat{Z}_i - Z_i\right)^2},$$

with $\sigma_{px}$ defined analogously as the RMSE of the CPs' reprojection residuals in image space.
- Absolute trajectory error: ATE assesses the drift in the position and rotation of the estimated trajectory. The estimated trajectory is first aligned with the ground-truth trajectory using Umeyama’s method [54], yielding aligned poses $(\hat{\mathbf{R}}_i, \hat{\mathbf{t}}_i)$. The RMSEs for position ($\sigma_{t}$) and rotation ($\sigma_{R}$) over $n$ frames are evaluated as follows (see the sketch after this list):

$$\sigma_{t} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left\| \hat{\mathbf{t}}_i - \mathbf{t}_i \right\|^2}, \qquad \sigma_{R} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \angle\!\left( \hat{\mathbf{R}}_i \mathbf{R}_i^{\top} \right)^2}.$$
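For concreteness, the following numpy sketch computes both metrics, assuming the ground-truth and estimated quantities are given as N×3 arrays; `umeyama_align` follows the standard closed form of [54], and all names here are illustrative rather than taken from the paper's code.

```python
import numpy as np

def umeyama_align(src, dst):
    """Closed-form similarity (scale s, rotation R, translation t)
    mapping src points onto dst, per Umeyama [54]."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)              # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) * len(src) / (src_c ** 2).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def cp_rmse(gt, est):
    """Plane and elevation RMSEs over m check points ((m, 3) arrays)."""
    d = est - gt
    sigma_xy = np.sqrt((d[:, 0] ** 2 + d[:, 1] ** 2).mean())
    sigma_z = np.sqrt((d[:, 2] ** 2).mean())
    return sigma_xy, sigma_z

def ate_position_rmse(gt_pos, est_pos):
    """Positional ATE: similarity-align the estimate, then take the RMSE."""
    s, R, t = umeyama_align(est_pos, gt_pos)
    aligned = s * est_pos @ R.T + t
    return np.sqrt(((aligned - gt_pos) ** 2).sum(axis=1).mean())
```

The rotation RMSE is analogous, replacing the position residual with the angle of the relative rotation $\hat{\mathbf{R}}_i \mathbf{R}_i^{\top}$.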
4.2. Results
4.3. Ablation Experiment
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Snavely, N. Bundler: Structure from motion (SfM) for unordered image collections. J. Inst. Image Inf. Telev. Eng. 2008, 65, 479–482. [Google Scholar] [CrossRef]
- Wu, C. Towards linear-time incremental structure from motion. In Proceedings of the 2013 International Conference on 3D Vision-3DV, Seattle, WA, USA, 29 June–1 July 2013; pp. 127–134. [Google Scholar]
- Schonberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4104–4113. [Google Scholar]
- Sweeney, C.; Hollerer, T.; Turk, M. Theia: A fast and scalable structure-from-motion library. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 693–696. [Google Scholar]
- Ozyesil, O.; Singer, A. Robust camera location estimation by convex programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2674–2683. [Google Scholar]
- Cui, H.; Gao, X.; Shen, S.; Hu, Z. HSfM: Hybrid structure-from-motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1212–1221. [Google Scholar]
- Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
- Sumikura, S.; Shibuya, M.; Sakurada, K. OpenVSLAM: A versatile visual SLAM framework. ACM SIGMultimed. Rec. 2019, 11, 10. [Google Scholar] [CrossRef]
- Yu, L.; Yang, E.; Yang, B. AFE-ORB-SLAM: Robust monocular vSLAM based on adaptive FAST threshold and image enhancement for complex lighting environments. J. Intell. Robot. Syst. 2022, 105, 26. [Google Scholar] [CrossRef]
- Szántó, M.; Bogár, G.R.; Vajta, L. ATDN vSLAM: An all-through deep learning-based solution for visual simultaneous localization and mapping. Period. Polytech. Electr. Eng. Comput. Sci. 2022, 66, 236–247. [Google Scholar] [CrossRef]
- Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Schubert, D.; Goll, T.; Demmel, N.; Usenko, V.; Stuckler, J.; Cremers, D. The TUM VI benchmark for evaluating visual-inertial odometry. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1680–1687. [Google Scholar]
- Yin, J.; Li, A.; Li, T.; Yu, W.; Zou, D. M2DGR: A multi-sensor and multi-scenario SLAM dataset for ground robots. IEEE Robot. Autom. Lett. 2021, 7, 2266–2273. [Google Scholar] [CrossRef]
- Schops, T.; Schonberger, J.L.; Galliani, S.; Sattler, T.; Schindler, K.; Pollefeys, M.; Geiger, A. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3260–3269. [Google Scholar]
- Kurniawan, R.A.; Ramdan, F.; Furqon, M.T. Videogrammetry: A new approach of 3-dimensional reconstruction from video using SfM algorithm: Case study: Coal mining area. In Proceedings of the 2017 International Symposium on Geoinformatics (ISyG), Malang, Indonesia, 24–25 November 2017; pp. 13–17. [Google Scholar]
- Jeon, I.; Lee, I. 3D Reconstruction of unstable underwater environment with SfM using SLAM. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 43, 957–962. [Google Scholar] [CrossRef]
- Habib, Y.; Papadakis, P.; Fagette, A.; Le Barz, C.; Gonçalves, T.; Buche, C. From sparse SLAM to dense mapping for UAV autonomous navigation. Geospat. Inform. XIII SPIE 2023, 12525, 110–120. [Google Scholar] [CrossRef]
- Woodford, O.J.; Rosten, E. Large scale photometric bundle adjustment. arXiv 2020, arXiv:2008.11762. [Google Scholar]
- Lindenberger, P.; Sarlin, P.-E.; Larsson, V.; Pollefeys, M. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5987–5997. [Google Scholar]
- Chen, Y.; Zhao, J.; Kneip, L. Hybrid rotation averaging: A fast and robust rotation averaging approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10358–10367. [Google Scholar]
- Snavely, N.; Seitz, S.M.; Szeliski, R. Photo tourism: Exploring photo collections in 3D. Semin. Graph. Pap. Push. Boundaries 2023, 2, 515–526. [Google Scholar] [CrossRef]
- Heinly, J.; Schonberger, J.L.; Dunn, E.; Frahm, J.-M. Reconstructing the World in Six Days. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3287–3295. [Google Scholar]
- Jiang, S.; Li, Q.; Jiang, W.; Chen, W. Parallel structure from motion for UAV images via weighted connected dominating set. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5413013. [Google Scholar] [CrossRef]
- Chen, Y.; Shen, S.; Chen, Y.; Wang, G. Graph-based parallel large scale structure from motion. Pattern Recognit. 2020, 107, 107537. [Google Scholar] [CrossRef]
- Barath, D.; Mishkin, D.; Eichhardt, I.; Shipachev, I.; Matas, J. Efficient initial pose-graph generation for global SfM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14546–14555. [Google Scholar]
- Wang, X.; Xiao, T.; Kasten, Y. A hybrid global structure from motion method for synchronously estimating global rotations and global translations. ISPRS J. Photogramm. Remote Sens. 2021, 174, 35–55. [Google Scholar] [CrossRef]
- Zhu, S.; Shen, T.; Zhou, L.; Zhang, R.; Wang, J.; Fang, T.; Quan, L. Parallel structure from motion from local increment to global averaging. arXiv 2017, arXiv:1702.08601. [Google Scholar]
- Cui, Z.; Tan, P. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 864–872. [Google Scholar]
- Zhu, S.; Zhang, R.; Zhou, L.; Shen, T.; Fang, T.; Tan, P.; Quan, L. Very large-scale global SfM by distributed motion averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4568–4577. [Google Scholar]
- Moreira, G.; Marques, M.; Costeira, J.P. Rotation averaging in a split second: A primal-dual method and a closed-form for cycle graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5452–5460. [Google Scholar]
- Zhuang, B.; Cheong, L.-F.; Lee, G.H. Baseline desensitizing in translation averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4539–4547. [Google Scholar]
- Goldstein, T.; Hand, P.; Lee, C.; Voroninski, V.; Soatto, S. Shapefit and shapekick for robust, scalable structure from motion. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 10–16 October 2016; pp. 289–304. [Google Scholar]
- Wilson, K.; Snavely, N. Robust global translations with 1DSfM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 5–12 September 2014; pp. 61–75. [Google Scholar]
- Yang, L.; Ye, J.; Zhang, Y.; Wang, L.; Qiu, C. A semantic SLAM-based method for navigation and landing of UAVs in indoor environments. Knowl.-Based Syst. 2024, 293, 111693. [Google Scholar] [CrossRef]
- Shum, H.-Y.; Ke, Q.; Zhang, Z. Efficient bundle adjustment with virtual key frames: A hierarchical approach to multi-frame structure from motion. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; pp. 538–543. [Google Scholar]
- Resch, B.; Lensch, H.; Wang, O.; Pollefeys, M.; Sorkine-Hornung, A. Scalable structure from motion for densely sampled videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3936–3944. [Google Scholar]
- Jiang, N.; Cui, Z.; Tan, P. A global linear method for camera pose registration. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 481–488. [Google Scholar]
- Leotta, M.J.; Smith, E.; Dawkins, M.; Tunison, P. Open source structure-from-motion for aerial video. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [Google Scholar]
- Banterle, F.; Gong, R.; Corsini, M.; Ganovelli, F.; Gool, L.V.; Cignoni, P. A deep learning method for frame selection in videos for structure from motion pipelines. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 3667–3671. [Google Scholar]
- Wang, L.; Ge, L.; Luo, S.; Yan, Z.; Cui, Z.; Feng, J. TC-SfM: Robust track-community-based structure-from-motion. IEEE Trans. Image Process. 2024, 33, 1534–1548. [Google Scholar] [CrossRef] [PubMed]
- Gong, Y.; Zhou, P.; Liu, Y.; Dong, H.; Li, L.; Yao, J. A cluster-based disambiguation method using pose consistency verification for structure from motion. ISPRS J. Photogramm. Remote Sens. 2024, 209, 398–414. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
- Revaud, J.; Weinzaepfel, P.; De Souza, C.; Humenberger, M. R2D2: Repeatable and reliable detector and descriptor. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 12414–12424. [Google Scholar]
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-Net: A trainable CNN for joint detection and description of local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8084–8093. [Google Scholar]
- Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7199–7209. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-resolution correspondence networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17346–17357. [Google Scholar]
- Zhou, Q.; Sattler, T.; Leal-Taixe, L. Patch2Pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4669–4678. [Google Scholar]
- Dusmanu, M.; Schönberger, J.L.; Pollefeys, M. Multi-view optimization of local feature geometry. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 670–686. [Google Scholar]
- Germain, H.; Bourmaud, G.; Lepetit, V. S2DNet: Learning Image Features for Accurate Sparse-to-Dense Matching. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 626–643. [Google Scholar]
- Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Knapitsch, A.; Park, J.; Zhou, Q.-Y.; Koltun, V. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. 2017, 36, 78. [Google Scholar] [CrossRef]
Method | $\sigma_{xy}$ (w/ GCPs) | $\sigma_{z}$ (w/ GCPs) | $\sigma_{px}$ (w/ GCPs) | $\sigma_{t}$ (w/o GCPs) | $\sigma_{R}$ (w/o GCPs) | Time (s) |
---|---|---|---|---|---|---|
(a) Regular sequence | ||||||
COLMAP [3] | 6.8 | 7.4 | 1.78 | 1.99 | 0.34 | - |
COLMAP * [3] | 6.9 | 12 | 1.81 | 2.14 | 1.04 | 4650 |
Theia [4] | 8.8 | 8.9 | 1.93 | 1.32 | 0.46 | - |
Theia * [4] | 10 | 25 | 2.86 | 2.30 | 1.15 | 1413 |
OpenVSLAM [8] | 8.3 | 6.3 | 1.67 | 1.10 | 0.39 | 1812 |
Ours | 3.5 | 4.6 | 1.55 | 0.40 | 0.14 | 632 |
(b) Irregular sequence | ||||||
COLMAP [3] | 10 | 13 | 3.12 | 0.64 | 0.52 | - |
COLMAP * [3] | 11 | 11 | 3.95 | 1.87 | 1.83 | 550 |
Theia [4] | 6.6 | 6.2 | 1.97 | 1.33 | 1.08 | - |
Theia * [4] | 10 | 9 | 3.55 | 1.56 | 1.47 | 648 |
OpenVSLAM [8] | 5.8 | 8.3 | 2.59 | 0.74 | 0.53 | 430 |
Ours | 5.1 | 2.1 | 1.77 | 0.50 | 0.27 | 316 |
Method | M01 | M02 | M03 | M04 | M05 | V101 | V102 | V103 | V201 | V202 | V203 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|
COLMAP [3] | 0.037 | 0.033 | 0.052 | 0.073 | 0.053 | 0.089 | 0.063 | 0.088 | 0.064 | 0.056 | 0.058 | 0.061 |
Theia [4] | 0.040 | 0.033 | 0.072 | 0.269 | 0.078 | 0.091 | 0.067 | 0.156 | 0.072 | 0.088 | 1.980 | 0.267 |
OpenVSLAM [8] | 0.041 | 0.032 | 0.033 | 0.096 | 0.049 | 0.096 | 0.064 | 0.066 | 0.061 | 0.053 | 0.072 | 0.060 |
Ours | 0.040 | 0.032 | 0.032 | 0.093 | 0.048 | 0.094 | 0.063 | 0.065 | 0.059 | 0.053 | 0.071 | 0.059 |
Method | M01 | M02 | M03 | M04 | M05 | V101 | V102 | V103 | V201 | V202 | V203 |
---|---|---|---|---|---|---|---|---|---|---|---|
COLMAP [3] | 534 | 463 | 279 | 187 | 226 | 371 | 63 | 254 | 202 | 144 | 564 |
Theia [4] | 148 | 131 | 108 | 81 | 68 | 127 | 50 | 41 * | 85 | 85 | 25 * |
Ours | 132 | 105 | 65 | 71 | 58 | 101 | 63 | 55 | 78 | 63 | 65 |
Feature refinement results on the ETH3D outdoor scenes (accuracy and completeness at 1/2/5 cm thresholds):

Features Refinement | Accuracy @1 cm (%) | Accuracy @2 cm (%) | Accuracy @5 cm (%) | Completeness @1 cm (%) | Completeness @2 cm (%) | Completeness @5 cm (%) |
---|---|---|---|---|---|---|
SIFT [43] | 62.36 | 71.70 | 86.27 | 0.06 | 0.34 | 2.65 |
FKA [20] | 65.63 | 76.25 | 91.19 | 0.07 | 0.40 | 2.86 |
Ours | 66.48 | 78.75 | 92.12 | 0.07 | 0.40 | 2.90 |
SuperPoint [45] | 49.19 | 64.34 | 82.74 | 0.09 | 0.49 | 3.46 |
FKA [20] | 67.20 | 79.84 | 90.63 | 0.16 | 0.82 | 4.98 |
Ours | 68.17 | 80.13 | 90.87 | 0.17 | 0.83 | 4.96 |
D2-Net [47] | 34.66 | 51.38 | 72.12 | 0.02 | 0.13 | 1.77 |
FKA [20] | 64.68 | 79.17 | 90.88 | 0.08 | 0.59 | 5.37 |
Ours | 65.39 | 80.18 | 91.25 | 0.09 | 0.59 | 5.36 |
R2D2 [46] | 42.71 | 59.81 | 80.71 | 0.05 | 0.36 | 3.02 |
FKA [20] | 64.02 | 77.77 | 90.19 | 0.11 | 0.60 | 4.01 |
Ours | 64.19 | 77.99 | 90.24 | 0.11 | 0.61 | 4.04 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).