Quasi-Dense Matching for Oblique Stereo Images through Semantic Segmentation and Local Feature Enhancement
Abstract
1. Introduction
2. Materials and Methods
2.1. Automatic Semantic Segmentation Strategy
2.1.1. Multiplanar Semantic Segmentation Model
2.1.2. Training Data
2.1.3. Image Segmentation and Adaptive Optimization
2.2. Quasi-Dense Matching Method
2.2.1. Automatic Identification of Corresponding Planes
2.2.2. LoFTR-Based Weak Texture Feature Enhancement Matching
- (1) Large viewpoint correction: First, using the IA and IB matching points obtained in the preceding plane recognition process, we estimate the projective transformation (homography) matrix H based on Equation (2) and the random sample consensus (RANSAC) algorithm, and then warp IB with H to obtain the viewpoint-corrected image IB′ (a hedged code sketch of this step is given after this list).
- (2) Feature extraction: For the image pair IA and IB′, features are first extracted with a VGG convolutional neural network [23], yielding coarse feature maps at 1/8 of the original resolution and fine feature maps at 1/2 of the original resolution for both images.
- (3) Generating coarse-level feature prediction results: The coarse feature maps are flattened into one-dimensional vectors, and positional encoding is added to each vector. The encoded vectors are then input into the LoFTR module, which comprises N (N = 4) interleaved self-attention and cross-attention layers. The self-attention mechanism captures the correlations between different positions within the image, learning the importance of local features and enhancing the discriminative ability of the convolutional features for different textures. The module outputs two texture-enhanced feature maps with higher discriminability, and the similarity between these two feature maps is then calculated to establish the coarse-level matches Mc.
- (4) Generating fine-level prediction results: For each coarse-level match in Mc, local corresponding windows of size w × w (w = 5) are cropped from the fine feature maps of IA and IB′. A smaller LoFTR module then transforms the cropped features within each window, yielding two transformed local feature maps centered at the coarse match locations. The center vector of the IA window is correlated with all vectors in the IB′ window, producing a heatmap that represents the matching probability of each pixel in the neighborhood of the coarse match; the refined location is obtained as the expectation of this probability distribution (see the second sketch after this list). Finally, all coarse-level matches are refined within the fine-level local windows, resulting in the fine-level matching predictions Mf for IA and IB′.
- (5) Outputting the final result: Finally, the coordinates of the fine-level matching points on IB′ are transformed back to the original coordinate system of the right image IB using Equation (3), representing the final result of weak-texture feature-enhanced matching.
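To make steps (1) and (5) concrete, the following is a minimal Python sketch of the homography-based viewpoint correction and the back-projection of refined matches, assuming OpenCV and NumPy. The function and variable names are illustrative rather than from the paper, the RANSAC reprojection threshold of 3.0 pixels is an assumed setting, and Equation (3) is taken here to be the inverse homography mapping.

```python
import cv2
import numpy as np

def correct_viewpoint(img_b, pts_a, pts_b):
    """Step (1): estimate the homography H from the plane-recognition matches with
    RANSAC (cf. Equation (2)) and warp the right image IB into the corrected image IB'.
    pts_a, pts_b: (N, 2) arrays of corresponding points in IA and IB."""
    H, inlier_mask = cv2.findHomography(pts_b, pts_a, cv2.RANSAC,
                                        ransacReprojThreshold=3.0)  # assumed threshold
    h, w = img_b.shape[:2]          # illustrative output size; IA's size could also be used
    img_b_prime = cv2.warpPerspective(img_b, H, (w, h))
    return H, img_b_prime

def back_project_matches(pts_b_prime, H):
    """Step (5): map fine-level match coordinates on IB' back to the original right
    image IB, assuming Equation (3) is the inverse homography mapping."""
    H_inv = np.linalg.inv(H)
    pts = pts_b_prime.reshape(-1, 1, 2).astype(np.float64)
    return cv2.perspectiveTransform(pts, H_inv).reshape(-1, 2)
```

Warping IB towards IA in this way reduces the perspective distortion between the two views of each plane, so the subsequent detector-free matching operates on approximately aligned planar scenes.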
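The coarse-to-fine matching of steps (3) and (4) can be illustrated with the simplified NumPy sketch below. It omits the LoFTR self-/cross-attention layers and starts from already-extracted feature vectors; the dual-softmax, mutual-nearest-neighbour selection and the expectation-based refinement follow the general LoFTR design [19], and all names, temperatures, and thresholds are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_matches(feat_a, feat_b, temperature=0.1, conf_thresh=0.2):
    """Step (3), simplified: score the flattened 1/8-resolution feature vectors of IA
    against those of IB' with a dual-softmax, then keep mutual nearest neighbours
    above a confidence threshold. feat_a: (Na, C), feat_b: (Nb, C), assumed L2-normalized."""
    sim = feat_a @ feat_b.T / temperature                # (Na, Nb) similarity matrix
    conf = softmax(sim, axis=1) * softmax(sim, axis=0)   # dual-softmax confidence
    ia = np.arange(conf.shape[0])
    jb = conf.argmax(axis=1)                             # best IB' index per IA vector
    mutual = conf.argmax(axis=0)[jb] == ia               # mutual nearest-neighbour check
    keep = mutual & (conf[ia, jb] > conf_thresh)
    return np.stack([ia[keep], jb[keep]], axis=1)        # coarse matches Mc (index pairs)

def refine_match(center_vec_a, window_feat_b, window_xy_b, temperature=0.1):
    """Step (4), simplified: correlate the centre vector of the IA fine window with every
    vector of the w x w window cropped from IB', form a probability heatmap, and return
    its expectation as the refined (sub-pixel) location.
    center_vec_a: (C,); window_feat_b: (w*w, C); window_xy_b: (w*w, 2) coords on IB'."""
    heat = softmax(window_feat_b @ center_vec_a / temperature, axis=0)
    return (heat[:, None] * window_xy_b).sum(axis=0)     # expected (x, y) on IB'
```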
3. Results
3.1. Experimental Environment
3.2. Evaluation Metrics
- (1) Number of correct matching points: Fifteen pairs of uniformly distributed corresponding points are manually selected from the stereo images, and the fundamental matrix F0 is estimated from them using the least-squares method and taken as the ground truth. Given F0, the error of any matching point pair is calculated using Equation (4). A threshold of 3.0 pixels is imposed on this error; if the error is below the threshold, the pair is regarded as correct and included in the count of correct matching points.
- (2) Match correct rate, α: This is defined as the ratio of the number of correct matching points to the total number of matching points k.
- (3) Matching root-mean-squared error (RMSE, in pixels): This is calculated using Equation (5) (a hedged code sketch of metrics (1)–(3) is given after this list).
- (4) Matching spatial distribution quality: References [26,27] generate a Delaunay triangulation from the matching points and evaluate the spatial distribution quality by considering the area and shape of each triangle as well as the global and local distribution of the matching points. This metric is calculated using Equation (6).
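As a concrete illustration of metrics (1)–(3), the sketch below assumes that Equation (4) is the symmetric point-to-epipolar-line distance under the ground-truth fundamental matrix F0 (which could, for example, be estimated from the 15 manually selected pairs with cv2.findFundamentalMat), and that the RMSE of Equation (5) is computed over the correct matches; the distribution-quality metric of Equation (6) is not reproduced here. All names are illustrative.

```python
import numpy as np

def epipolar_errors(F0, pts_l, pts_r):
    """Assumed form of Equation (4): symmetric point-to-epipolar-line distance under
    the ground-truth fundamental matrix F0. pts_l, pts_r: (N, 2) matched points."""
    xl = np.hstack([pts_l, np.ones((len(pts_l), 1))])   # homogeneous left points
    xr = np.hstack([pts_r, np.ones((len(pts_r), 1))])   # homogeneous right points
    lr = xl @ F0.T                                      # epipolar lines in the right image
    ll = xr @ F0                                        # epipolar lines in the left image
    num = np.abs(np.sum(xr * lr, axis=1))               # |x_r^T F0 x_l|
    d_r = num / np.hypot(lr[:, 0], lr[:, 1])            # distance of right point to its line
    d_l = num / np.hypot(ll[:, 0], ll[:, 1])            # distance of left point to its line
    return 0.5 * (d_l + d_r)

def evaluate_matches(F0, pts_l, pts_r, thresh=3.0):
    """Metrics (1)-(3): number of correct matches, match correct rate alpha, and RMSE
    (here computed over the correct matches only -- an assumption)."""
    err = epipolar_errors(F0, pts_l, pts_r)
    correct = err < thresh
    n_correct = int(correct.sum())
    alpha = n_correct / len(err)
    rmse = float(np.sqrt(np.mean(err[correct] ** 2))) if n_correct else float("nan")
    return n_correct, alpha, rmse
```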
3.3. Experimental Methods and Data
3.4. Experimental Results and Analysis
4. Discussion
- (1) The proposed method has significant advantages in terms of the number of correctly matched points. Table 1 presents the quantitative experimental results for six groups of large-viewpoint stereo images of architectural scenes and shows that the proposed method obtains the highest number of correctly matched points. As shown in Figures 9–14, our method achieves accurate and dense matching results in every group of images, especially in matching a large number of corresponding points on building roofs and facades, which provides sufficient tie points for image orientation and three-dimensional (3D) reconstruction. The reasons are twofold. First, the multiplane segmentation and corresponding-plane matching method proposed in this paper transforms the matching of complex 3D scenes into the matching of simple planar scenes. Second, the LoFTR texture-enhancement strategy introduced in this paper effectively alleviates the weak-texture problem on building roofs and facades, leading to accurate and dense matching results.
- (2) According to the above experimental results, DFM has an advantage in matching accuracy, but it handles large affine distortions poorly. Compared with DFM, SuperGlue copes better with large-viewpoint affine transformations and single-texture regions; however, it yields far fewer matching points than our method. The LoFTR algorithm, which builds on the attention-based matching idea of SuperGlue, uses Transformer positional encoding and self-/cross-attention to significantly enhance the texture features of building facades. GlueStick obtains no more matching points than SuperGlue (and in some cases fewer), but improves the spatial distribution quality and matching accuracy; under the influence of image distortion, however, it still struggles to obtain a sufficient number of matching points.
- (3) Our method also demonstrates advantages in terms of matching accuracy and precision. Table 1 shows that our method achieves high matching correctness rates for most of the test data (a, b, d–f) and sub-pixel matching precision for test data (b–f). The reasons are as follows. First, our method matches each planar scene individually and uses strict homography-based geometric transformations for distortion correction and constrained matching, effectively ensuring matching correctness and precision. Second, during quasi-dense matching, the proposed method first performs coarse-level match prediction and then refines the matches at the fine level, ensuring accurate positioning of the matching points.
- (4) The proposed method exhibits good spatial distribution quality for the matching points. Figures 9–14 show that the image-space area covered by the matching points of our method is significantly larger, and Table 1 demonstrates that our method outperforms the DFM and LoFTR algorithms in terms of the spatial distribution quality of the matching points.
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ge, Y.; Guo, B.; Zha, P.; Jiang, S.; Jiang, Z.; Li, D. 3D Reconstruction of Ancient Buildings Using UAV Images and Neural Radiation Field with Depth Supervision. Remote Sens. 2024, 16, 473. [Google Scholar] [CrossRef]
- Yao, G.B.; Yilmaz, A.; Meng, F.; Zhang, L. Review of wide-baseline stereo image matching based on deep learning. Remote Sens. 2021, 13, 3247. [Google Scholar] [CrossRef]
- Ji, S.; Luo, C.; Liu, J. A Review of Dense Stereo Image Matching Methods Based on Deep Learning. Geomat. Inf. Sci. Wuhan Univ. 2021, 46, 193–202. [Google Scholar] [CrossRef]
- Liu, J.; Ji, S.P. Deep learning based dense matching for aerial remote sensing images. Acta Geod. Cartogr. Sin. 2019, 48, 1141–1150. [Google Scholar] [CrossRef]
- Luo, S.D.; Chen, H.B. Stereo matching algorithm of adaptive window based on region growing. J. Cent. South Univ. Technol. 2005, 36, 1042–1047. [Google Scholar]
- Fritz, C.O.; Morris, P.E.; Richler, J.J. Effect size estimates: Current use, calculations, and interpretation. Exp. Psychol. Gen. 2012, 141, 2–18. [Google Scholar] [CrossRef]
- Yang, H.; Zhang, S.; Zhang, Q. Least Squares Matching Methods for Wide Base-line Stereo Images Based on SIFT Features. Acta Geod. Cartogr. Sin. 2010, 39, 187–194. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Yang, H.; Zhang, S.; Wang, L. Robust and precise registration of oblique images based on scale-invariant feature transformation algorithm. IEEE Geosci. Remote Sens. Lett. 2012, 9, 783–787. [Google Scholar] [CrossRef]
- Zhang, Q.; Wang, Y.; Wang, L. Registration of images with affine geometric distortion based on maximally stable extremal regions and phase congruency. Image Vis. Comput. 2015, 36, 23–39. [Google Scholar] [CrossRef]
- Xiao, X.W.; Guo, B.X.; Li, D.R.; Zhao, X.A. Quick and affine invariance matching method for oblique images. Acta Geod. Cartogr. Sin. 2015, 44, 414–442. [Google Scholar] [CrossRef]
- Morel, J.-M.; Yu, G. Asift: A new framework for fully affine invariant image comparison. SIAM J. Imaging Sci. 2009, 2, 438–469. [Google Scholar] [CrossRef]
- Tian, Y.R.; Fan, B.; Wu, F.C. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 661–669. [Google Scholar] [CrossRef]
- Mishchuk, A.; Mishkin, D.; Radenovic, F. Working hard to know your neighbor’s margins: Local descriptor learning loss. Adv. Neural Inf. Process. Syst. 2017, 1, 4826–4837. [Google Scholar] [CrossRef]
- Zhang, C.; Yao, G.; Man, X.; Huang, P.; Zhang, L.; Ai, H. Affine invariant feature matching of oblique images based on multi-branch network. Acta Geod. Cartogr. Sin. 2021, 50, 641–651. [Google Scholar] [CrossRef]
- Mishkin, D.; Radenovic, F.; Matas, J. Repeatability is not enough: Learning affine regions via discriminability. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 287–304. [Google Scholar] [CrossRef]
- Revaud, J.; Weinzaepfel, P.; De Souza, C.; Pion, N.; Csurka, G.; Cabon, Y.; Humenberger, M. R2D2: Repeatable and reliable detector and descriptor. arXiv 2019, arXiv:1906.06195. [Google Scholar]
- Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE 2020 Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar] [CrossRef]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. arXiv 2021, arXiv:2104.00680. [Google Scholar]
- Efe, U.; Ince, K.; Alatan, A. DFM: A Performance Baseline for Deep Feature Matching. arXiv 2021, arXiv:2106.07791. [Google Scholar]
- Pautrat, R.; Suárez, I.; Yu, Y.; Pollefeys, M.; Larsson, V. Gluestick: Robust image matching by sticking points and lines together. arXiv 2023, arXiv:2304.02008. [Google Scholar]
- Yao, G.B.; Yilmaz, A.; Zhang, L.; Meng, F.; Ai, H.B.; Jin, F.X. Matching large baseline oblique stereo images using an end-to-end convolutional neural network. Remote Sens. 2021, 13, 274. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015. [Google Scholar] [CrossRef]
- Wu, Z.; Han, X.; Lin, Y.L.; Uzunbas, M.G.; Goldstein, T.; Lim, S.N.; Davis, L.S. Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 518–534. [Google Scholar] [CrossRef]
- Zhu, Q.; Wu, B.; Xu, Z.X. Seed point selection method for triangle constrained image matching propagation. IEEE Geosci. Remote Sens. Lett. 2006, 3, 207–211. [Google Scholar] [CrossRef]
- Yao, G.; Zhang, J.; Gong, J.; Jin, F. Automatic Production of Deep Learning Benchmark Dataset for Affine-Invariant Feature Matching. ISPRS Int. J. Geo-Inf. 2023, 12, 33. [Google Scholar] [CrossRef]
Table 1. Quantitative matching results of each method on the six groups of test images (a)–(f).

| Test Data | Evaluation Metric | Ours | DFM | AffNet | SuperGlue | LoFTR | GlueStick |
|---|---|---|---|---|---|---|---|
| (a) | Correct matches (pairs) | 2751 | 1199 | 832 | 618 | 1695 | 336 |
| | Match correct rate α | 0.80 | 0.61 | 0.41 | 0.61 | 0.58 | 0.58 |
| | RMSE (pixel) | 1.20 | 0.36 | 0.35 | 1.83 | 0.65 | 0.65 |
| | Spatial distribution quality | 56.9 | 59.2 | 87.2 | 64.9 | 59.2 | 33.7 |
| (b) | Correct matches (pairs) | 898 | 31 | 82 | 537 | 520 | 259 |
| | Match correct rate α | 0.49 | 0.53 | 0.14 | 0.52 | 0.22 | 0.40 |
| | RMSE (pixel) | 0.35 | 0.18 | 0.37 | 0.39 | 0.38 | 0.81 |
| | Spatial distribution quality | 32.6 | 37.5 | 17.13 | 54.33 | 40.9 | 39.8 |
| (c) | Correct matches (pairs) | 2751 | 602 | 1100 | 618 | 1695 | 393 |
| | Match correct rate α | 0.68 | 0.34 | 0.33 | 0.61 | 0.58 | 0.47 |
| | RMSE (pixel) | 0.99 | 0.35 | 0.37 | 1.24 | 2.07 | 0.71 |
| | Spatial distribution quality | 45.5 | 27.7 | 45.9 | 23.3 | 46.5 | 32.6 |
| (d) | Correct matches (pairs) | 2254 | 237 | 296 | 330 | 1291 | 241 |
| | Match correct rate α | 0.82 | 0.40 | 0.20 | 0.50 | 0.60 | 0.52 |
| | RMSE (pixel) | 0.86 | 0.35 | 0.36 | 2.9 | 0.57 | 0.67 |
| | Spatial distribution quality | 41.1 | 32.0 | 28.2 | 26.8 | 43.3 | 24.4 |
| (e) | Correct matches (pairs) | 1530 | 56 | 125 | 196 | 1015 | 179 |
| | Match correct rate α | 0.64 | 0.43 | 0.22 | 0.24 | 0.46 | 0.48 |
| | RMSE (pixel) | 0.99 | 0.32 | 0.36 | 1.88 | 1.06 | 0.69 |
| | Spatial distribution quality | 40.1 | 24.1 | 18.8 | 23.5 | 43.3 | 23.0 |
| (f) | Correct matches (pairs) | 2059 | 915 | 974 | 273 | 1034 | 226 |
| | Match correct rate α | 0.69 | 0.46 | 0.42 | 0.39 | 0.47 | 0.44 |
| | RMSE (pixel) | 0.49 | 0.37 | 0.35 | 1.88 | 1.46 | 0.75 |
| | Spatial distribution quality | 27.5 | 83.3 | 28.2 | 32.5 | 43.3 | 34.6 |