A Multimodal Image Registration Method for UAV Visual Navigation Based on Feature Fusion and Transformers
Abstract
1. Introduction
- A new feature fusion module is proposed to build a multi-scale fusion network. An improved ResNet50 backbone extracts features, a feature pyramid network produces multi-scale feature maps, and the improved fusion module combines them to capture image information at different scales. Mining deep semantic information while minimizing the loss of detail features improves registration accuracy (see the first sketch after this list).
- A new CNN-Transformer hybrid model is proposed for feature detection and matching. The Transformer's attention mechanism captures long-range dependencies within the images, and its encoder–decoder structure performs feature matching to produce initial correspondences (see the second sketch after this list).
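To make the first contribution concrete, below is a minimal PyTorch sketch of an FPN-style multi-scale extractor with a simple concatenate-and-project fusion step. It illustrates the general technique only, not the authors' exact module: the class name `MultiScaleFusion`, the channel width, and the fusion design are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class MultiScaleFusion(nn.Module):
    """Illustrative FPN-style extractor + fusion (names/channels are assumptions)."""
    def __init__(self, out_ch=256):
        super().__init__()
        r = resnet50(weights=None)
        # Backbone stages producing C2..C5 feature maps (strides 4, 8, 16, 32).
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2 = r.layer1, r.layer2
        self.layer3, self.layer4 = r.layer3, r.layer4
        # Lateral 1x1 convs project each stage to a common channel width.
        self.lat = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in (256, 512, 1024, 2048))
        # Fusion: concatenate the resized pyramid levels and mix with a 1x1 conv.
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)

    def forward(self, x):
        c2 = self.layer1(self.stem(x))
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        # Top-down pathway: upsample coarser maps and add laterals (standard FPN).
        p5 = self.lat[3](c5)
        p4 = self.lat[2](c4) + F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
        p3 = self.lat[1](c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        p2 = self.lat[0](c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        # Bring all levels to the finest resolution and fuse, so deep semantics
        # are combined with fine detail before matching.
        tgt = p2.shape[-2:]
        pyramid = [p2] + [F.interpolate(p, size=tgt, mode="bilinear", align_corners=False)
                          for p in (p3, p4, p5)]
        return self.fuse(torch.cat(pyramid, dim=1))
```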
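Similarly, here is a minimal sketch of the second contribution's matching idea, built on PyTorch's stock `nn.Transformer` with a dual-softmax correspondence head. The layer counts, temperature, and omission of positional encodings are simplifying assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TransformerMatcher(nn.Module):
    """Illustrative encoder-decoder matcher over flattened feature maps.
    Hyperparameters are assumptions; positional encodings omitted for brevity."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.temperature = 0.1

    def forward(self, feat_a, feat_b):
        # Flatten (B, C, H, W) maps into token sequences (B, H*W, C) so the
        # attention layers can model long-range dependencies across the image.
        tok_a = feat_a.flatten(2).transpose(1, 2)
        tok_b = feat_b.flatten(2).transpose(1, 2)
        # Encoder contextualizes image A tokens; decoder cross-attends image B
        # tokens to them, producing matching-aware descriptors.
        out_b = self.transformer(src=tok_a, tgt=tok_b)
        # Dual-softmax over the similarity matrix yields soft correspondences
        # that can be thresholded into initial matches.
        sim = torch.einsum("bic,bjc->bij", tok_a, out_b) / self.temperature
        conf = sim.softmax(dim=1) * sim.softmax(dim=2)
        return conf  # (B, tokens of A, tokens of B) correspondence confidences
```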
2. Related Work
3. Materials and Methods
3.1. Network Architecture
3.2. Feature Extraction Backbone Network
3.3. Feature Pyramid Network
3.4. Transformer Encoder–Decoder
3.5. Loss Function
3.6. GSM Mismatch Removal Algorithm
4. Results
4.1. Datasets and Experimental Setup
4.2. Evaluation Metrics
4.3. Comparison Methods
4.4. Qualitative Experiments
- Check the fusion map for significant spatial misalignment: visible offsets between corresponding objects indicate problems with the registration method or the spatial transformation model;
- Check whether object edges in the checkerboard map are aligned, continuous, and natural, in particular whether transitions across tile boundaries are smooth and free of noticeable breaks (a visualization sketch follows this list).
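As a reference for reproducing the checkerboard visualization, here is a small NumPy sketch that interleaves tiles from two registered images; the function name and default tile count are illustrative, and the paper's exact rendering may differ.

```python
import numpy as np

def checkerboard(img_a, img_b, tiles=8):
    """Build a checkerboard mosaic of two registered, same-size images;
    misaligned edges show up as breaks at tile boundaries."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape[:2]
    th, tw = max(h // tiles, 1), max(w // tiles, 1)
    # Alternate tile parity along each axis, then XOR into a 2-D mask.
    ys = (np.arange(h) // th) % 2
    xs = (np.arange(w) // tw) % 2
    mask = (ys[:, None] ^ xs[None, :]).astype(bool)
    out = img_a.copy()
    out[mask] = img_b[mask]  # fill alternating tiles from the second image
    return out
```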
4.5. Quantitative Experiments
4.6. Application of the Positioning Software Platform
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zitová, B.; Flusser, J. Image registration methods: A survey. Image Vis. Comput. 2003, 21, 977–1000.
- Le Moigne, J.; Netanyahu, N.S.; Eastman, R.D. Image Registration for Remote Sensing: Survey of Image Registration Methods; Cambridge University Press: Cambridge, UK, 2011.
- Sui, H.; Xu, C.; Liu, J.; Hua, F. Automatic Optical-to-SAR Image Registration by Iterative Line Extraction and Voronoi Integrated Spectral Point Matching. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6058–6072.
- Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. TransMorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615.
- Zhang, X.; Leng, C.; Hong, Y.; Pei, Z.; Cheng, I.; Basu, A. Multimodal Remote Sensing Image Registration Methods and Advancements: A Survey. Remote Sens. 2021, 13, 5128.
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–23 June 2018.
- Canny, J. A computational approach to edge detection. Read. Comput. Vis. 1987, 184–203.
- Lehureau, G.; Tupin, F.; Tison, C.; Oller, G.; Petit, D. Registration of metric resolution SAR and optical images in urban areas. In Proceedings of the 7th European Conference on Synthetic Aperture Radar (EUSAR), Friedrichshafen, Germany, 2–5 June 2008.
- Beymer, D. Feature correspondence by interleaving shape and texture computations. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 18–20 June 1996; pp. 921–928.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the 9th European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006.
- Ma, W.; Wen, Z.; Wu, Y.; Jiao, L.; Gong, M.; Zheng, Y.; Liu, L. Remote Sensing Image Registration with Modified SIFT and Enhanced Feature Matching. IEEE Geosci. Remote Sens. Lett. 2016, 14, 3–7.
- Wu, Y.; Ma, W.; Gong, M.; Su, L.; Jiao, L. A Novel Point-Matching Algorithm Based on Fast Sample Consensus for Image Registration. IEEE Geosci. Remote Sens. Lett. 2014, 12, 43–47.
- Ye, Y.; Shan, J.; Bruzzone, L.; Shen, L. Robust Registration of Multimodal Remote Sensing Images Based on Structural Similarity. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2941–2958.
- Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal Image Matching Based on Radiation-variation Insensitive Feature Transform. IEEE Trans. Image Process. 2019, 29, 3296–3310.
- Zhu, B.; Yang, C.; Dai, J.; Fan, J.; Qin, Y.; Ye, Y. R2FD2: Fast and Robust Matching of Multimodal Remote Sensing Image via Repeatable Feature Detector and Rotation-invariant Feature Descriptor. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
- Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned Invariant Feature Transform; Springer International Publishing: Cham, Switzerland, 2016.
- Ono, Y.; Trulls, E.; Fua, P.; Yi, K.M. LF-Net: Learning Local Features from Images; Curran Associates Inc.: Red Hook, NY, USA, 2018.
- Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Repeatable and Reliable Detector and Descriptor. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
- Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ASLFeat: Learning Local Features of Accurate Shape and Localization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Li, J.; Xu, W.; Shi, P.; Zhang, Y.; Hu, Q. LNIFT: Locally Normalized Image for Rotation Invariant Multimodal Feature Matching. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
- Deng, Y.; Ma, J. ReDFeat: Recoupling Detection and Description for Multimodal Feature Learning. IEEE Trans. Image Process. 2023, 32, 591–602.
- Gao, T.; Lan, C.; Huang, W.; Wang, L.; Wei, Z.; Yao, F. Multiscale Template Matching for Multimodal Remote Sensing Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 10132–10147.
- Wang, X.; Jabri, A.; Efros, A.A. Learning Correspondence from the Cycle-Consistency of Time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–17 June 2019.
- Ma, J.; Zhao, J.; Jiang, J.; Zhou, H.; Guo, X. Locality Preserving Matching. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
- Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
- Yao, Y.; Zhang, Y.; Wan, Y.; Liu, X.; Yan, X.; Li, J. Multi-Modal Remote Sensing Image Matching Considering Co-Occurrence Filter. IEEE Trans. Image Process. 2022, 31, 2584–2597.
- Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182.
- Fan, Z.; Liu, Y.; Liu, Y.; Zhang, L.; Zhang, J.; Sun, Y.; Ai, H. 3MRS: An Effective Coarse-to-Fine Matching Method for Multimodal Remote Sensing Imagery. Remote Sens. 2022, 14, 478.
Item | Specification
---|---
CPU | Intel Core i7-9700
GPU | NVIDIA GeForce RTX 2070 SUPER
Operating System | Windows 10
Language | Python 3.8
Framework | PyTorch 2.1.0
CUDA | 12.1
cuDNN | 8.8.1
Method | PCK-1px (%) | PCK-3px (%) | PCK-5px (%) | MRE (pixel) | Matching Time (s) |
---|---|---|---|---|---|
SIFT | 0.15 | 0.27 | 0.34 | / | 10.4 |
RIFT | 0.23 | 0.51 | 0.65 | / | 8.83 |
3MRS | 0.52 | 0.63 | 0.71 | 1.85 | 3.94 |
Ours | 0.61 | 0.74 | 0.86 | 1.19 | 2.82 |
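For readers reproducing these numbers: PCK-τ is conventionally the fraction of putative matches whose reprojection error under the ground-truth transform falls within τ pixels, and MRE is the mean of those errors. Below is a minimal sketch under that assumption; the paper's exact protocol (e.g., which matches enter the MRE) may differ.

```python
import numpy as np

def pck_and_mre(pts_src, pts_dst, H_gt, thresholds=(1, 3, 5)):
    """pts_src/pts_dst: (N, 2) matched points; H_gt: 3x3 ground-truth homography."""
    ones = np.ones((len(pts_src), 1))
    proj = np.hstack([pts_src, ones]) @ H_gt.T
    proj = proj[:, :2] / proj[:, 2:3]             # project source points into target frame
    err = np.linalg.norm(proj - pts_dst, axis=1)  # per-match reprojection error (pixels)
    pck = {t: float((err <= t).mean()) for t in thresholds}
    return pck, float(err.mean())                 # PCK per threshold, MRE
```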
Backbone | FFM | Attention Mechanism | PCK-1px (%) | PCK-3px (%) | PCK-5px (%) | Matching Time (s) |
---|---|---|---|---|---|---|
ResNet50 | × | × | 43.2 | 51.6 | 65.4 | 5.64 |
ResNet50 | × | √ | 60.2 | 68.2 | 82.1 | 3.27 |
ResNet50 | √ | × | 51.3 | 62.5 | 76.4 | 4.36 |
ResNet50 | √ | √ | 63.4 | 76.5 | 88.2 | 2.82 |
Method | PCK-1px (%) | PCK-3px (%) | PCK-5px (%) | MRE (pixel) | Matching Time (s)
---|---|---|---|---|---
SIFT+NN | 14.5 | 30.1 | 38.8 | 13.85 | 10.04
LNIFT | 17.6 | 33.4 | 42.6 | 9.74 | 9.76
D2-Net+NN | 23.2 | 35.9 | 53.6 | 7.33 | 8.86
SuperPoint+SuperGlue | 53.9 | 68.3 | 78.1 | 2.96 | 6.87
ReDFeat | 58.7 | 70.1 | 82.4 | 2.34 | 3.42
LoFTR | 65.1 | 71.6 | 84.6 | 1.54 | 2.89
Ours | 63.4 | 76.5 | 88.2 | 1.38 | 2.82