A Coarse-to-Fine Feature Match Network Using Transformers for Remote Sensing Image Registration
Abstract
1. Introduction
- Transformers with self- and cross-attention layers are used to transform feature vectors into context-dependent feature descriptors, which greatly improves their distinctiveness (a minimal attention sketch is given after this list).
- A coarse-to-fine feature matching scheme is designed, which reduces the computational cost of global correlation and strikes a balance between efficiency and accuracy.
- A novel local feature match network, MatcherTF, is proposed. It is highly flexible, since its input can be any number of coordinate points on an image: it can be used alone for remote sensing image registration or combined with existing feature detectors for high-quality registration tasks.
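The self- and cross-attention idea behind these context-dependent descriptors can be illustrated with a minimal PyTorch sketch. The block below is only a schematic assumption of how such layers might be stacked; the dimensions, number of layers, positional encodings, and attention variant of the actual MatcherTF network are not reproduced here.

```python
import torch
import torch.nn as nn

class SelfCrossBlock(nn.Module):
    """One self-attention + cross-attention step over two descriptor sets.

    A minimal sketch of context-dependent descriptors; layer sizes and the
    surrounding network are illustrative, not the paper's configuration.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        # Self-attention: each descriptor attends to others in the same image.
        feat_a = feat_a + self.self_attn(feat_a, feat_a, feat_a)[0]
        feat_b = feat_b + self.self_attn(feat_b, feat_b, feat_b)[0]
        # Cross-attention: descriptors attend to the other image's descriptors.
        out_a = feat_a + self.cross_attn(feat_a, feat_b, feat_b)[0]
        out_b = feat_b + self.cross_attn(feat_b, feat_a, feat_a)[0]
        return out_a, out_b

# Example: 1000 keypoint descriptors per image, 256-D each (hypothetical sizes).
desc_a = torch.randn(1, 1000, 256)
desc_b = torch.randn(1, 1000, 256)
out_a, out_b = SelfCrossBlock()(desc_a, desc_b)
```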
2. Related Work
2.1. Image Registration of Natural Images
2.2. Image Registration of Remote Sensing Images
2.3. Transformer in the Computer Vision Tasks
3. Methods
3.1. The Overall Process of the Proposed Method
3.2. Coarse Feature Matching Process
3.2.1. The Architecture of Coarse Feature Transformer Network
3.2.2. Coarse Feature Matching
3.3. Fine Feature Matching Process
3.3.1. The Architecture of Fine Feature Transformers Network
3.3.2. Fine Feature Matching
3.4. Loss Function
4. Experiments and Analysis
4.1. The Generation of Training Data
- (1) We deliberately chose a variety of Sentinel-2 images captured in different years and seasons and covering diverse scenes such as mountains, farmland, buildings, and roads. Once the selection was complete, a Python script automatically downloaded the chosen Sentinel-2 images from the Copernicus Open Access Hub (an illustrative download script is given after this list).
- (2) For each scene in the Sentinel-2 dataset, we used a Python script to download the corresponding Landsat-8 image based on the coordinates of the Sentinel-2 image’s center point. After obtaining the Sentinel-2/Landsat-8 pairs, we clipped out their overlapping areas and divided them evenly into image tiles according to their geographic coordinates using the Geospatial Data Abstraction Library (GDAL) [43] (a tiling sketch is given after this list). Each resulting pair of image patches is precisely aligned, because the Level-1C product of Sentinel-2 and the Level-2 product of Landsat-8 undergo rigorous geometric correction.
- (3) For each pair of image tiles, we applied random scaling, rotation, and shifting to one tile to obtain a transformed image. A smaller patch was then cropped from the transformed tile to serve as the sensed image; cropping after the transformation mimics realistic geometric distortions more faithfully because it avoids introducing black areas at the corners. Similarly, a patch centered on the other tile was cropped as the reference image. Finally, we calculated the homography matrix H between the reference and sensed images from the applied scale, rotation, and displacement, each of which was drawn from a fixed range (a sketch of this computation is given after this list).
4.2. The Implementation Details
4.3. Qualitative Experiments
4.4. Quantitative Experiments
5. Discussion
5.1. Sensitivity Analysis of Feature Points
5.2. Efficiency Analysis
5.3. Pros and Cons Analysis
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79.
- Paul, S.; Pati, U.C. A comprehensive review on remote sensing image registration. Int. J. Remote. Sens. 2021, 42, 5396–5432.
- Liang, J.; Liu, X.; Huang, K.; Li, X.; Wang, D.; Wang, X. Automatic registration of multisensor images using an integrated spatial and mutual information (SMI) metric. IEEE Trans. Geosci. Remote. Sens. 2013, 52, 603–615.
- Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 1997, 16, 187–198.
- Xiang, Y.; Tao, R.; Wan, L.; Wang, F.; You, H. OS-PC: Combining feature representation and 3-D phase correlation for subpixel optical and SAR image registration. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 6451–6466.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Ye, Y.; Shen, L. HOPC: A novel similarity metric based on geometric structural properties for multi-modal remote sensing image matching. ISPRS Ann. Photogramm. Remote. Sens. Spat. Inf. Sci. 2016, 3, 9.
- Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal image matching based on radiation-variation insensitive feature transform. IEEE Trans. Image Process. 2019, 29, 3296–3310.
- Xiang, Y.; Wang, F.; Wan, L.; You, H. SAR-PC: Edge detection in SAR images via an advanced phase congruency model. Remote Sens. 2017, 9, 209.
- Xiang, Y.; Tao, R.; Wang, F.; You, H.; Han, B. Automatic registration of optical and SAR images via improved phase congruency model. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 5847–5861.
- Zhu, B.; Yang, C.; Dai, J.; Fan, J.; Qin, Y.; Ye, Y. R2FD2: Fast and robust matching of multimodal remote sensing images via repeatable feature detector and rotation-invariant feature descriptor. IEEE Trans. Geosci. Remote. Sens. 2023.
- Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; Volume 15, pp. 10–5244.
- Cohen, T.; Welling, M. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2990–2999.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483.
- Ono, Y.; Trulls, E.; Fua, P.; Yi, K.M. LF-Net: Learning local features from images. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018.
- Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Reliable and repeatable detector and descriptor. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
- Choy, C.B.; Gwak, J.; Savarese, S.; Chandraker, M. Universal correspondence network. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016.
- Tian, Y.; Fan, B.; Wu, F. L2-Net: Deep learning of discriminative patch descriptor in Euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 661–669.
- Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 118–126.
- Mishchuk, A.; Mishkin, D.; Radenovic, F.; Matas, J. Working hard to know your neighbor’s margins: Local descriptor learning loss. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- He, K.; Lu, Y.; Sclaroff, S. Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 596–605.
- Schmidt, T.; Newcombe, R.; Fox, D. Self-supervised visual descriptor learning for dense correspondence. IEEE Robot. Autom. Lett. 2016, 2, 420–427.
- Melekhov, I.; Tiulpin, A.; Sattler, T.; Pollefeys, M.; Rahtu, E.; Kannala, J. DGC-Net: Dense geometric correspondence network. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1034–1042.
- Truong, P.; Danelljan, M.; Timofte, R. GLU-Net: Global-local universal network for dense flow and correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6258–6268.
- Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018.
- Truong, P.; Danelljan, M.; Gool, L.V.; Timofte, R. GOCor: Bringing globally optimized correspondence volumes into your neural network. Adv. Neural Inf. Process. Syst. 2020, 33, 14278–14290.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236.
- Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182.
- Yuan, X.; Yuan, X.; Chen, J.; Wang, X. Large aerial image tie point matching in real and difficult survey areas via deep learning method. Remote Sens. 2022, 14, 3907.
- Liu, Y.; Gong, X.; Chen, J.; Chen, S.; Yang, Y. Rotation-invariant siamese network for low-altitude remote-sensing image registration. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2020, 13, 5746–5758.
- Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote sensing image registration using convolutional neural network features. IEEE Geosci. Remote. Sens. Lett. 2018, 15, 232–236.
- Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585.
- Li, Q.; Chen, Y.; Zeng, Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sens. 2022, 14, 984.
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931.
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
- Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5156–5165.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2017; Volume 1.
- Wang, X.; Jabri, A.; Efros, A.A. Learning correspondence from the cycle-consistency of time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2566–2576.
- GDAL/OGR Contributors. GDAL/OGR Geospatial Data Abstraction Software Library; Open Source Geospatial Foundation: Chicago, IL, USA, 2022.
- Franks, S.; Storey, J.; Rengarajan, R. The new Landsat Collection-2 digital elevation model. Remote Sens. 2020, 12, 3909.
- Fraser, C.S.; Dial, G.; Grodecki, J. Sensor orientation via RPCs. ISPRS J. Photogramm. Remote. Sens. 2006, 60, 182–194.
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947.
- Riba, E.; Mishkin, D.; Ponsa, D.; Rublee, E.; Bradski, G. Kornia: An open source differentiable computer vision library for PyTorch. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 3674–3683.
MSR: mean success ratio; MRE: mean reprojection error (pixels); both reported at thresholds of 1, 3, 5, 7, and 9 px.

| Comparison Experiment | Method | MSR @1px | MSR @3px | MSR @5px | MSR @7px | MSR @9px | MRE @1px | MRE @3px | MRE @5px | MRE @7px | MRE @9px |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Comparison between SIFT and the Proposed Method | The Proposed Method | 0.39 | 0.53 | 0.55 | 0.57 | 0.58 | 0.57 | 1.01 | 1.27 | 1.53 | 1.84 |
| | SIFT | 0.23 | 0.24 | 0.24 | 0.24 | 0.24 | 0.44 | 0.67 | 0.88 | 1.10 | 1.33 |
| Comparison between SuperPoint and the Proposed Method | The Proposed Method | 0.44 | 0.64 | 0.67 | 0.69 | 0.70 | 0.62 | 1.17 | 1.54 | 1.89 | 2.25 |
| | SuperPoint | 0.08 | 0.11 | 0.12 | 0.12 | 0.13 | 0.56 | 0.96 | 1.17 | 1.36 | 1.58 |
| Comparison between LoFTR and the Proposed Method | The Proposed Method | 0.32 | 0.49 | 0.52 | 0.53 | 0.55 | 0.56 | 1.01 | 1.26 | 1.47 | 1.69 |
| | LoFTR | 0.27 | 0.38 | 0.38 | 0.38 | 0.38 | 0.57 | 1.00 | 1.13 | 1.21 | 1.28 |
MSR and MRE as defined in the previous table.

| Method | MSR @1px | MSR @3px | MSR @5px | MSR @7px | MSR @9px | MRE @1px | MRE @3px | MRE @5px | MRE @7px | MRE @9px |
|---|---|---|---|---|---|---|---|---|---|---|
| The Proposed Method | 0.23 | 0.35 | 0.37 | 0.39 | 0.40 | 0.57 | 1.03 | 1.30 | 1.54 | 1.79 |
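One plausible reading of the metrics reported in these tables is that, for each image pair, matched points are projected with the ground-truth homography, a match counts as successful if its reprojection error falls below the threshold, and the mean error is taken over those successful matches. The sketch below follows that reading; the paper's exact definitions (e.g., whether the ratio is computed per match or per image pair) may differ, and the function and variable names are illustrative.

```python
import numpy as np

def reprojection_errors(H, pts_ref, pts_sensed):
    """Per-point error between H-projected reference points and matched points."""
    pts_h = np.c_[pts_ref, np.ones(len(pts_ref))] @ H.T
    proj = pts_h[:, :2] / pts_h[:, 2:3]
    return np.linalg.norm(proj - pts_sensed, axis=1)

def success_ratio_and_error(H_gt, pts_ref, pts_sensed, threshold):
    """Fraction of matches below `threshold` pixels and their mean error."""
    err = reprojection_errors(H_gt, pts_ref, pts_sensed)
    inliers = err <= threshold
    ratio = inliers.mean()
    mean_err = err[inliers].mean() if inliers.any() else float("nan")
    return ratio, mean_err
```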
Test feature point categories: Random, SIFT, Mixed, SuperPoint; columns give results at thresholds of 1, 3, 5, 7, and 9 px.

| Method Trained by Different Feature Points | Random @1px | @3px | @5px | @7px | @9px | SIFT @1px | @3px | @5px | @7px | @9px | Mixed @1px | @3px | @5px | @7px | @9px | SuperPoint @1px | @3px | @5px | @7px | @9px |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| random-trained | 0.80 | 0.90 | 0.92 | 0.93 | 0.94 | 0.77 | 0.86 | 0.89 | 0.90 | 0.91 | 0.80 | 0.89 | 0.92 | 0.93 | 0.94 | 0.40 | 0.52 | 0.55 | 0.57 | 0.59 |
| sift-trained | 0.53 | 0.72 | 0.78 | 0.82 | 0.84 | 0.55 | 0.73 | 0.78 | 0.81 | 0.84 | 0.52 | 0.72 | 0.78 | 0.81 | 0.84 | 0.14 | 0.25 | 0.29 | 0.33 | 0.36 |
| mixed-trained | 0.74 | 0.86 | 0.88 | 0.90 | 0.91 | 0.74 | 0.85 | 0.87 | 0.89 | 0.90 | 0.74 | 0.85 | 0.88 | 0.90 | 0.91 | 0.39 | 0.51 | 0.55 | 0.57 | 0.59 |
| superpoint-trained | 0.74 | 0.87 | 0.89 | 0.90 | 0.91 | 0.75 | 0.88 | 0.90 | 0.91 | 0.92 | 0.74 | 0.87 | 0.89 | 0.91 | 0.91 | 0.74 | 0.84 | 0.86 | 0.87 | 0.88 |
Test feature point categories and thresholds as in the previous table.

| Method Trained by Different Feature Points | Random @1px | @3px | @5px | @7px | @9px | SIFT @1px | @3px | @5px | @7px | @9px | Mixed @1px | @3px | @5px | @7px | @9px | SuperPoint @1px | @3px | @5px | @7px | @9px |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| random-trained | 0.41 | 0.59 | 0.69 | 0.77 | 0.85 | 0.42 | 0.61 | 0.73 | 0.83 | 0.93 | 0.41 | 0.59 | 0.69 | 0.77 | 0.85 | 0.52 | 0.91 | 1.19 | 1.46 | 1.76 |
| sift-trained | 0.51 | 0.92 | 1.20 | 1.44 | 1.67 | 0.51 | 0.91 | 1.20 | 1.45 | 1.68 | 0.52 | 0.94 | 1.22 | 1.46 | 1.71 | 0.60 | 1.32 | 1.89 | 2.45 | 3.06 |
| mixed-trained | 0.43 | 0.65 | 0.78 | 0.89 | 1.00 | 0.43 | 0.65 | 0.79 | 0.91 | 1.02 | 0.43 | 0.65 | 0.78 | 0.89 | 0.99 | 0.52 | 0.93 | 1.21 | 1.48 | 1.79 |
| superpoint-trained | 0.45 | 0.66 | 0.78 | 0.88 | 0.97 | 0.45 | 0.66 | 0.77 | 0.87 | 0.96 | 0.45 | 0.66 | 0.78 | 0.88 | 0.98 | 0.41 | 0.60 | 0.72 | 0.84 | 0.94 |
| Module Name | Parameters (Floating Point Operations, FLOPs) |
|---|---|
| Feature Representation Network | M |
| Coarse Feature Transformer | M |
| Fine Feature Transformer | 328 K |
| Total | M |
| Methods | SIFT | SuperPoint | LoFTR | Proposed Method |
|---|---|---|---|---|
| Elapsed Time (s) | | | | |