Article

A Multimodal Image Registration Method for UAV Visual Navigation Based on Feature Fusion and Transformers

Ruofei He, Shuangxing Long, Wei Sun and Hongjuan Liu
1 365th Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
2 School of Aerospace Science and Technology, Xidian University, Xi’an 710071, China
3 Xi’an ASN Technology Group Co., Ltd., Xi’an 710065, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(11), 651; https://doi.org/10.3390/drones8110651
Submission received: 16 September 2024 / Revised: 25 October 2024 / Accepted: 1 November 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Intelligent Image Processing and Sensing for Drones 2nd Edition)

Abstract

Using images captured by a drone’s camera and matching them against known Google satellite maps to obtain the drone’s current location is an important approach to UAV navigation in GPS-denied environments. However, owing to inherent modality differences and significant geometric deformations, cross-modal image registration is challenging. This paper proposes a CNN-Transformer hybrid network model for feature detection and feature matching. ResNet50 is used as the backbone network for feature extraction. An improved feature fusion module fuses feature maps from different levels, and a Transformer encoder–decoder structure then performs feature matching to obtain preliminary correspondences. Finally, a geometric outlier removal method (GSM) eliminates mismatched points based on the geometric similarity of the inliers, yielding more robust correspondences. Qualitative and quantitative experiments were conducted on multimodal image datasets captured by UAVs; the correct matching rate was improved by 52%, 21%, and 15% over the SIFT, RIFT, and 3MRS algorithms, respectively, and the mean reprojection error was reduced by 36% compared with 3MRS. A total of 56 experiments were conducted in real-world scenarios, with a localization success rate of 91.1% and an RMSE of UAV positioning of 4.6 m.

1. Introduction

Traditional UAV (unmanned aerial vehicle) navigation systems rely on GPS (or, more generally, GNSS) for positioning from satellite signals [1]. When satellite signals are briefly lost, inertial navigation is used, together with continuous data links to ground control stations, to determine the spatial position and attitude of the drone during flight. However, these positioning approaches have clear limitations: inertial navigation accumulates positioning errors over time, and GPS navigation is susceptible to electromagnetic interference and is unavailable in denied environments. Under GNSS-denied conditions, drones typically rely on onboard sensors, such as inertial measurement units (IMUs) and visual sensors (e.g., cameras), to obtain information on the flight state and the surrounding environment [2]. In particular, a camera can capture images of the ground and enable positioning and navigation through multimodal image registration algorithms [3]. To address the situation in which a UAV cannot obtain accurate location information without GPS signals, this paper designs and implements an image registration-based software platform for autonomous UAV positioning.
Image registration-based positioning is a highly autonomous and accurate navigation technique and is an important means of navigation as well as a cutting-edge direction in the field of autonomous UAV navigation. By registering the images acquired by the drone against map images, accurate autonomous localization of the drone can be achieved. The positioning process mainly consists of two steps: coarse positioning and fine positioning [4]. First, the image acquired by the drone is coarsely located on the map to obtain the approximately corresponding image block of the aerial image, achieving spatial positioning. Then, the multimodal image registration method proposed in this article is used to match corresponding (homonymous) feature points, and based on the geographic information carried by the feature points in Google Maps, the UAV is finally able to localize itself autonomously [5]. Compared with intensity-based approaches, feature-based methods are more robust to intensity variations. Feature-based methods mainly include three steps: feature detection, feature description, and feature matching. Feature detection extracts salient features from the two images, such as point features [6], edge features [7], and region features [8]. Feature description encodes the extracted features for subsequent matching. Feature matching designs specific similarity measures for the descriptors, establishes a geometric transformation model, and aligns the two images.
Although many promising registration methods have been proposed, a universal method with good accuracy, robustness, and efficiency for multimodal registration is still lacking. Owing to the heterogeneous characteristics of multimodal images, there are typically “five discrepancies” (differences in imaging characteristics, scale, rotation, noise, and landscape) and “three distinctions” (different environments, weather conditions, and atmospheric conditions) that pose significant challenges to high-accuracy matching. Accurately defining, detecting, and describing corresponding features is therefore considerably difficult [9]. Furthermore, extracting corresponding features between multimodal images is challenging, and even when such features are extracted, they may be incomplete or hard to match owing to variations in perspective and environmental conditions. The core challenge of feature-based methods is the feature detection and matching of multimodal image pairs. Because of the significant nonlinear intensity differences between multimodal image pairs, many feature matchers that are widely used in general vision applications may fail, or they may produce a large number of outliers (mismatches) when relying only on local image information.
To address the multimodal registration problem, this paper proposes a CNN-Transformer [10] hybrid model for image registration. In the feature extraction stage, we improve the commonly used CNN architecture ResNet50 [11] to enhance the model’s ability to extract homologous features. Next, we introduce a feature fusion module and a Transformer encoder structure for feature fusion and matching, thereby obtaining initial correspondences. Finally, we utilize the GSM (Geometric Similarity Measure) algorithm to eliminate mismatched points from the initial matches, achieving a more robust correspondence and iteratively approximating the optimal spatial transformation model, thereby facilitating multimodal image registration. The accuracy of unmanned aerial vehicle autonomous positioning was tested, and 56 experiments were conducted in actual scenarios, with a positioning success rate of 91.1% and a root mean square error of 4.6 m. The experimental results show that the method proposed in this paper can be successfully applied to the autonomous localization of unmanned aerial vehicles in real-world scenarios.
The main contributions of this paper are as follows:
  • A new feature fusion module is proposed to construct a multi-scale fusion network. The improved ResNet50 network is used for feature extraction; a feature pyramid network generates multi-scale feature maps, and the improved feature fusion module then fuses them to obtain image features at different scales, improving registration accuracy by mining deep semantic information while minimizing the loss of detail features.
  • A new CNN-Transformer hybrid model is proposed for feature detection and matching. The attention mechanism of the Transformer is used to mine the long-range dependencies of images, and the encoder–decoder structure of the Transformer is used to complete feature matching and obtain initial correspondences.

2. Related Work

Over the past few decades, image matching technology has been extensively researched. SIFT [12] is a classic feature detection algorithm known for its scale invariance, rotation invariance, and suppression of illumination and noise effects. It has been widely used in image feature extraction and has performed excellently in image registration. Subsequently, various improved algorithms based on SIFT were proposed, introducing different techniques to enhance feature point extraction and description. The SURF algorithm [13], for instance, significantly improved execution efficiency using integral images on the Hessian matrix and reduced-dimensional feature descriptors, addressing the high computational complexity and time consumption of SIFT. The PSO-SIFT algorithm [14] combined the position, scale, and orientation of each key point, using new gradient definitions and feature matching methods to further improve the SIFT algorithm. The FSC-SIFT algorithm [15] used the Fast Sample Consensus (FSC) algorithm to find initial correct results and used the Iterative Selection of Correct Matches (ISCM) algorithm to increase correct matches. However, these techniques perform poorly on multimodal images.
For multimodal image registration, researchers proposed using phase consistency models based on local phase information to extract and describe robust feature points due to the large intensity and appearance differences between images. For example, HOPC [16] captured geometric structure or shape features by constructing dense descriptors of directional phase consistency histograms; RIFT [17] used phase consistency for feature point detection and description, reducing the impact of nonlinear radiometric distortion; and R2FD2 [18] combined log-Gabor multichannel autocorrelation feature detectors and log-Gabor maximum index maps for feature description, and it was robust to radiometric and rotational differences.
In recent years, with the rapid development of deep learning and its excellent performance in image classification, object recognition, semantic segmentation, etc., deep learning has been introduced into image registration tasks. The main idea is to automatically learn image feature information from large amounts of training data, using the powerful representation capabilities of convolutional neural networks to replace handcrafted detectors and descriptors based on direct perception or experience. Traditional handcrafted descriptors can only extract and represent relatively low-level features, while deep learning methods can typically extract higher-level, more abstract semantic features. Matching using higher-level semantic information has strong generalizability. Therefore, feature description, feature detection, and feature matching based on deep learning methods have been successively proposed. In the feature detection stage, ResNet and FPN (feature pyramid network) are commonly used CNN architectures for extracting local features. ResNet constructs deep networks by introducing residual connections, which allow the network to more easily learn identity mappings, thus facilitating the effective training of deep networks. ResNet50, with its 50 layers, performs exceptionally well in various image classification and feature extraction tasks, becoming an important benchmark model in the field of deep learning. Once the feature maps are obtained, feature response maps are calculated by combining different feature maps, and local maxima are identified as feature points. For instance, LIFT [19] and LF-Net [20] are neural networks that implement feature point extraction operations and output feature points with orientations. R2D2 [21] calculated repeatability and reliability scores from feature maps to select key points with high repeatability and reliability. ASLFeat [22], proposed by Luo et al., jointly learned local feature detectors and descriptors, enhancing the accuracy of point detection by mining local shape information of feature points. LNIFT [23] introduced a local normalized filter that transforms the original images into normalized images for feature detection and description, significantly reducing the severe nonlinear radiation distortion (NRD) between multimodal images. ReDFeat [24] recouples the independent constraints of multimodal feature learning’s detection and description through a mutual weighting strategy, where the detection probability of robust features is forced to peak and repeat while emphasizing features with high detection scores during optimization. The study [25] proposed a multi-scale template matching strategy to improve the matching performance of multimodal images under displacement and scale variations.

3. Materials and Methods

Feature point matching essentially involves finding the corresponding point in another image for a given query point from two images. Image registration corresponds to finding the dense mapping of all key points. Therefore, the problem of finding correspondences between two images can be formulated as a function as follows:
$x' = F_{\Phi}(x \mid I, I')$
where I and I’ are the reference and target images, respectively, FΦ is the neural network structure, Φ represents the neural network parameter set, x represents a query point in image I, and x’ represents the corresponding position of x in image I’, as shown in Figure 1.
The proposed multimodal image registration method mainly consists of four steps (a code-level outline is given after this paragraph). First, a multi-scale feature fusion network extracts features from the two images; this includes an improved ResNet50 for feature extraction, a feature pyramid that generates multi-scale feature maps, and an enhanced feature fusion module (FFM) for feature fusion. Second, a Transformer encoder–decoder performs feature matching to obtain the initial correspondences between the two images. Third, the geometric outlier removal method (GSM) eliminates mismatched points from the initial correspondences by analyzing the geometric relationships among the inliers and iteratively refining the optimal spatial transformation model to identify and remove outliers that deviate from the geometric consistency established by these inliers. Finally, the target image is transformed according to the spatial transformation model, and the final registration result is obtained through thin plate spline (TPS) interpolation. The overall framework of the registration process is shown in Figure 2.
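To make the data flow concrete, the following sketch strings the four stages together. It is a hypothetical outline, not the authors' code: the stage implementations (the CNN-Transformer matcher, the GSM filter, and the TPS fitting/warping routines) are passed in as callables and are detailed in the subsections that follow.

```python
import numpy as np

def register(img_ref: np.ndarray, img_tgt: np.ndarray,
             match_fn, gsm_fn, tps_fit_fn, tps_warp_fn) -> np.ndarray:
    """Register img_tgt to img_ref and return the warped target image."""
    # Steps 1-2: multi-scale feature extraction/fusion and Transformer
    # matching produce the initial correspondences (two (N, 2) point arrays).
    ref_pts, tgt_pts = match_fn(img_ref, img_tgt)

    # Step 3: GSM removes correspondences that break geometric consistency.
    keep = gsm_fn(ref_pts, tgt_pts)

    # Step 4: fit the spatial transformation on the inliers and resample the
    # target image with thin plate spline (TPS) interpolation.
    tps_model = tps_fit_fn(tgt_pts[keep], ref_pts[keep])
    return tps_warp_fn(img_tgt, tps_model)
```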

3.1. Network Architecture

This paper proposes a CNN-Transformer hybrid model for image registration, fusing deep and shallow features to mine richer semantic features from images. The proposed multi-scale feature fusion network architecture includes four components: a feature extraction backbone network, a feature pyramid network (FPN) module, a Transformer encoder, and a Transformer decoder. The overall framework of the network model is shown in Figure 3. The main function of the feature extraction backbone network is to extract multi-scale image features from the two images using weight-sharing convolutional neural networks for feature extraction and obtaining different scale feature maps through a series of convolutions, pooling, and other operations. The FPN module is a feature pyramid network that aligns different levels of features through a designed feature fusion module (FFM) to achieve multi-scale feature fusion, improving the feature expression capability of the network model. The main function of the Transformer encoder and decoder modules is to learn the global features of the two images, capturing contextual semantic information through the self-attention mechanism to improve matching performance. Finally, the output of the decoder is transformed into corresponding coordinates through a simple feedforward network, completing the feature point matching process.
First, the two images with a size of H × W are input into the network model, and different levels of downsampled feature maps are extracted from the two input images using a shared-weight backbone network B. The feature maps of the two images are concatenated according to corresponding levels, with the output of the last residual block of each stage corresponding to a feature level in the feature pyramid. Then, position encoding P and multi-scale encoding S are added, generating fused multi-scale feature maps c through the feature fusion module, inputting the obtained feature maps and query points x into the Transformer encoder TE, and interpreting the results using the corresponding Transformer decoder TD. Finally, the output of the Transformer decoder is processed by a fully connected layer E to obtain the estimated corresponding point x’. The overall process of the network model can be expressed mathematically as follows:
$x' = F_{\Phi}(x \mid I, I') = E\left(T_D\left(P(x), T_E(c)\right)\right)$
The fully connected layer E consists of three layers of multi-layer perceptrons (MLPs), each with 256 neurons, activated by the ReLU function. Additionally, after obtaining feature maps of different scales, position encoding is required, adding the coordinate function Ω to the feature map to generate a context feature map c as follows:
$c = \left[ B(I), B(I') \right] + P(\Omega)$
where [·] denotes concatenation along the spatial dimension. A linear position encoding is used, so for a given position x = [x, y], P(x) can be expressed as follows:
$P(x) = \left[ p_1(x), p_2(x), \ldots, p_{N/4}(x) \right]$
$p_k(x) = \left[ \sin(k \pi x^{\mathrm{T}}), \cos(k \pi x^{\mathrm{T}}) \right]$
where N denotes the number of channels in the feature map.
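The linear positional encoding defined above can be implemented compactly. The sketch below assumes query coordinates normalized to [0, 1] and a channel count N divisible by 4 (each frequency k contributes the sine and cosine of both coordinates, hence N/4 frequencies); it is an illustrative implementation, not the authors' code.

```python
import torch

def positional_encoding(xy: torch.Tensor, num_channels: int) -> torch.Tensor:
    """xy: (..., 2) coordinates in [0, 1] -> (..., num_channels) encoding."""
    assert num_channels % 4 == 0, "N must be divisible by 4"
    k = torch.arange(1, num_channels // 4 + 1, dtype=xy.dtype, device=xy.device)
    # angles: (..., N/4, 2) = k * pi * [x, y]
    angles = k[:, None] * torch.pi * xy[..., None, :]
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (..., N/4, 4)
    return enc.flatten(start_dim=-2)                                 # (..., N)

# Example: encode one query point into a 256-channel vector.
p = positional_encoding(torch.tensor([0.25, 0.75]), num_channels=256)
print(p.shape)  # torch.Size([256])
```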

3.2. Feature Extraction Backbone Network

An improved ResNet50 is used as the backbone network to construct the feature extraction network, obtaining different levels of feature maps. To capture deeper semantic information from the two input images, we added an additional convolution module at the end of the backbone network. This increases the depth of the feature extraction network, enabling the model to learn more complex and abstract features, thereby improving the performance of the backbone network in feature extraction and enhancing the subsequent image matching process. The specific structure of the multi-scale feature extraction backbone network is shown in Figure 4.
Specifically, the proposed network model uses an improved ResNet50 network as a bottom-up extraction architecture, calculating hierarchical feature representations composed of multiple scale feature maps using the output of the last residual block of each stage for subsequent feature fusion. Formally, the outputs of these residual blocks {S2, S3, S4, S5} correspond to the conv3, conv4, and conv5 stages and the output of the convolution module, with each output corresponding to a feature level in the feature pyramid. The outputs of the conv1 and conv2 stages are not included in the feature pyramid due to their large memory footprint.
The specific structure of the convolution module is shown in Figure 5. Similar to the residual module of ResNet50, the convolution module first uses 1024 1 × 1 convolution kernels to reduce the number of channels, adjusting the H/32 × W/32 × 2048 feature maps to H/32 × W/32 × 1024; it then uses 3 × 3 convolution kernels to compress the spatial size to H/64 × W/64 × 1024; and it finally uses 4096 1 × 1 convolution kernels to increase the number of channels, yielding H/64 × W/64 × 4096.
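The following PyTorch sketch mirrors this description of the added convolution module. The batch normalization and ReLU layers placed between the convolutions are assumptions borrowed from standard ResNet bottleneck blocks, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Extra stage appended after conv5: H/32 x W/32 x 2048 -> H/64 x W/64 x 4096."""
    def __init__(self, in_ch: int = 2048, mid_ch: int = 1024, out_ch: int = 4096):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),    # 2048 -> 1024 channels
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=2,
                      padding=1, bias=False),                       # halve the spatial size
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),   # 1024 -> 4096 channels
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# Example: for a 512 x 512 input, S5 is 16 x 16 (H/32); this module yields 8 x 8 (H/64).
out = ConvModule()(torch.randn(1, 2048, 16, 16))
print(out.shape)  # torch.Size([1, 4096, 8, 8])
```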

3.3. Feature Pyramid Network

The feature pyramid network used in this paper is based on a bottom-up and top-down path. The bottom-up path mainly performs the feedforward computation of the convolutional network, calculating hierarchical feature representations composed of multiple scale feature maps. The top-down path progressively fuses high-level and low-level feature maps through a feature fusion module, generating the final fused feature map. The feature fusion module learns a semantic offset field to obtain accurate semantic scope mapping, aligning high-level semantic information while retaining low-level spatial detail information, avoiding feature misalignment. Therefore, the fused features contain both rich semantics and spatial alignment with corresponding low-level features. The structure of the feature pyramid network model is shown in Figure 6.
After the input image is fed into the network model, four different levels of feature maps are obtained. The feature fusion module is then used to fuse feature maps of adjacent levels, iterating until the final feature map is generated. Specifically, D5 is obtained from S5 through convolution and pooling. Then, through a lateral connection, S4 and the output D5 of the previous layer are fed into the FFM for feature fusion to obtain the output feature D4. Proceeding in the same way, the final output feature map D2 is obtained from the last FFM.
To alleviate feature misalignment and enhance feature fusion, a novel feature fusion mechanism is introduced into the registration framework. This FFM is based on the FPN fusion method, obtaining more accurate semantic information by learning a semantic offset field while maintaining the same spatial size as the low-level feature map during feature fusion, facilitating correct correspondences between the two images and spatial alignment. Under the semantic offset field, feature fusion is enhanced by aligning high-level feature maps with rich spatial details. The mathematical definition of the feature fusion mechanism is as follows:
$F_l = f_l(S_i)$
$F_h = f_h(D_i)$
$F_o = F_l \oplus W(F_l, F_h \mid \theta)$
where ⊕ denotes element-wise addition and Si and Di represent low-level feature maps from the bottom-up path and high-level feature maps from the top-down path in the feature pyramid, respectively. Fl and Fh encode feature maps of different sizes, and Fo is the fused feature map. W denotes the feature alignment function with parameters θ, combining high-level and low-level feature maps to learn a semantic offset field for accurate semantic alignment. The specific structure of the feature fusion module is shown in Figure 7.
Formally, given a low-level feature map $S_i \in \mathbb{R}^{H \times W \times C}$ and a high-level feature map $D_i \in \mathbb{R}^{H' \times W' \times C'}$ (where $H' = H/2$ and $W' = W/2$), the final output is a fused feature map $F_o$. First, the low-level feature map $S_i$ is passed through a 3 × 3 convolution layer to obtain a feature map $S' \in \mathbb{R}^{H \times W \times C}$ containing richer feature representations. The high-level feature map $D_i$ is passed through a 1 × 1 convolution layer that adjusts its number of channels to that of $S'$, giving $D' \in \mathbb{R}^{H' \times W' \times C}$. Then, $D'$ is upsampled to the size of $S'$ through bilinear interpolation, giving $D'' \in \mathbb{R}^{H \times W \times C}$. Before the feature maps of the two levels are added element-wise, they are concatenated and fed into a 1 × 1 convolution layer; this enhances convolutional feature learning by establishing the interdependence of the high-level and low-level feature representations and increases the FFM's sensitivity to information that can be exploited by subsequent matching. The result is a semantic offset field $G \in \mathbb{R}^{H \times W \times 2}$, and grid sampling $U$ is used to resample $D''$ under the semantic offset field to obtain $\hat{D}$. The feature alignment function $W$ is defined as follows:
$\hat{D} = W(S', D'' \mid \theta) = U(D'' \mid G) = U\left(D'' \mid \mathrm{conv}\left(\mathrm{concat}(S', D'')\right)\right)$
Finally, element-wise addition is used to combine the obtained feature maps S’ and D ^ to obtain the fused feature map Fo, completing the feature fusion process. Using FFM, semantic alignment can be achieved by combining low-level feature maps with deformed high-level feature maps, alleviating feature misalignment. The fused feature map generated by combining high-level and low-level feature maps through FFM has more accurate semantic information in each pixel, retaining the spatial size of the low-level feature map while aligning the rich spatial details of the high-level feature map to enhance feature fusion. This feature alignment mechanism can improve the accuracy and robustness of image registration, making the registration results more precise and reliable.
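A minimal PyTorch sketch of the FFM is given below. It follows the description above (3 × 3 convolution on the low-level map, 1 × 1 convolution and bilinear upsampling on the high-level map, a 1 × 1 convolution on their concatenation to predict the two-channel offset field, grid sampling, and element-wise addition); how the predicted offsets are normalized into sampling-grid coordinates is an implementation assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFM(nn.Module):
    """Feature fusion module with a learned semantic offset field."""
    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        self.conv_low = nn.Conv2d(low_ch, out_ch, kernel_size=3, padding=1)  # S_i -> S'
        self.conv_high = nn.Conv2d(high_ch, out_ch, kernel_size=1)           # D_i -> D'
        self.offset = nn.Conv2d(2 * out_ch, 2, kernel_size=1)                # concat -> G

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        s = self.conv_low(low)                                     # (B, C, H, W)
        d = self.conv_high(high)
        d = F.interpolate(d, size=s.shape[-2:], mode="bilinear",
                          align_corners=False)                     # bilinear upsampling to H x W
        g = self.offset(torch.cat([s, d], dim=1))                  # semantic offset field (B, 2, H, W)
        # Build a normalized sampling grid and shift it by the predicted offsets.
        b, _, h, w = s.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=s.device),
                                torch.linspace(-1, 1, w, device=s.device),
                                indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        grid = base + g.permute(0, 2, 3, 1)                        # offsets treated as grid units (assumption)
        d_hat = F.grid_sample(d, grid, mode="bilinear",
                              align_corners=False)                 # warped high-level features
        return s + d_hat                                           # element-wise fusion: F_o = S' + D_hat

# Example: fuse S4 with the higher-level map D5 (half the spatial size).
ffm = FFM(low_ch=1024, high_ch=2048, out_ch=256)
fused = ffm(torch.randn(1, 1024, 64, 64), torch.randn(1, 2048, 32, 32))
print(fused.shape)  # torch.Size([1, 256, 64, 64])
```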

3.4. Transformer Encoder–Decoder

The encoder and decoder both use a 6-layer structure. Each encoder layer TE contains an 8-head self-attention layer and an MLP layer. The core component of the encoder is the multi-head self-attention layer. Each self-attention layer simultaneously computes dependencies between representations at various positions in the input sequence and integrates this information into each position’s representation, allowing the model to capture long-range dependencies within the input sequence. In each layer $l$, the self-attention input is computed from the input $X^{(l-1)} \in \mathbb{R}^{L \times C}$ as a (query, key, value) triplet as follows:
$Q = X^{(l-1)} W_q$
$K = X^{(l-1)} W_k$
$V = X^{(l-1)} W_v$
where $W_q, W_k, W_v \in \mathbb{R}^{C \times d}$ are learnable parameters and $d$ is the dimensionality of the triplet. The similarity between Q and K is then calculated and normalized using the softmax function to obtain the weight of each K for the current Q as follows:
$A(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\mathrm{T}}}{\sqrt{d}} \right) V$
The normalized weights are then used to compute a weighted sum of V, resulting in the final representation for the current Q. To increase the model’s representation capacity, multiple heads of attention are usually used, i.e., multiple sets of different Q, K, and V are computed simultaneously, concatenated, and projected to obtain the final value as follows:
$\mathrm{MSA}(X^{(l-1)}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}$
$\mathrm{head}_i = A\left(X^{(l-1)} W_i^{q}, X^{(l-1)} W_i^{k}, X^{(l-1)} W_i^{v}\right)$
where $W_i^{q}, W_i^{k}, W_i^{v} \in \mathbb{R}^{C \times d}$ and $h$ is the number of attention heads.
Finally, these output vectors are passed through a fully connected feedforward neural network. This network independently applies a nonlinear transformation to the representation at each position, increasing the model’s representation capacity. To prevent gradient vanishing and exploding problems, residual connections are used between each self-attention layer and feedforward neural network.
Each decoder layer contains an 8-head encoder–decoder attention layer but no self-attention layer, preventing communication between query points. Pre-normalized residual units and the same MLP layers as in TE are used. In MSA, Q, K, and V come from the same input sequence, whereas in the cross-attention MA, Q comes from the output T of the previous decoder layer and K and V come from the output X of the encoder’s last layer. Formally, in each layer $l$, MA is defined as follows:
$\mathrm{MA}(T^{i,(l-1)}, X^{i}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^{O}$
$\mathrm{head}_j = A\left(T^{i,(l-1)} W_j^{q}, X^{i} W_j^{k}, X^{i} W_j^{v}\right)$
where $W_j^{q}, W_j^{k}, W_j^{v} \in \mathbb{R}^{C \times d}$ and $h$ is the number of attention heads.
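For illustration, the encoder and decoder layers described above can be sketched with torch.nn.MultiheadAttention as follows. The 8 heads, the pre-normalized residual units, and the absence of self-attention in the decoder follow the text; the embedding and MLP dimensions are assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Pre-norm Transformer encoder layer: 8-head self-attention + MLP."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention: Q = K = V
        return x + self.mlp(self.norm2(x))                 # position-wise feedforward

class DecoderLayer(nn.Module):
    """Cross-attention only, so query points never attend to each other."""
    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.ReLU(),
                                 nn.Linear(mlp_dim, dim))

    def forward(self, queries: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        q, kv = self.norm_q(queries), self.norm_kv(memory)
        queries = queries + self.cross_attn(q, kv, kv, need_weights=False)[0]
        return queries + self.mlp(self.norm2(queries))

# Example: 3 encoded query points attend to a 4096-token context feature map.
enc, dec = EncoderLayer(), DecoderLayer()
memory = enc(torch.randn(1, 4096, 256))
out = dec(torch.randn(1, 3, 256), memory)
print(out.shape)  # torch.Size([1, 3, 256])
```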

3.5. Loss Function

The process of training the deep neural network involves incorporating supervision information into a loss function and minimizing that loss to guide network learning. Let $I \in \mathbb{R}^{H \times W}$ and $I' \in \mathbb{R}^{H \times W}$ be the two images and let $x \in [0, 1]^2$ be the normalized query point coordinates in image $I$; we seek the corresponding point $x' \in [0, 1]^2$ in image $I'$. The problem of finding correspondences is therefore transformed into finding the optimal parameter set $\Phi$ that minimizes the loss function. The widely used mean square error (MSE) is chosen as the similarity measure between the predicted and true values. Two types of supervision signals are used: a reprojection loss and a cycle consistency loss [26]. The loss function used to train the network model can be expressed as follows:
$L = \arg\min_{\Phi} \; \mathbb{E}_{(x, x', I, I') \in \mathcal{D}} \left[ L_1 + L_2 \right]$
$L_1 = \left\| x' - F_{\Phi}(x \mid I, I') \right\|_2^2$
$L_2 = \left\| x - F_{\Phi}\left(F_{\Phi}(x \mid I, I') \mid I', I\right) \right\|_2^2$
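A minimal sketch of this objective is shown below: the reprojection term and the cycle-consistency term are both computed as mean squared errors on normalized coordinates. The model(query, img_a, img_b) interface standing in for F_Φ is an assumption, not the authors' API.

```python
import torch.nn.functional as F

def registration_loss(model, x, x_gt, img_ref, img_tgt):
    """x, x_gt: (B, 2) normalized query points and their ground-truth matches."""
    x_pred = model(x, img_ref, img_tgt)          # forward mapping I -> I'
    l_reproj = F.mse_loss(x_pred, x_gt)          # reprojection loss L1
    x_cycle = model(x_pred, img_tgt, img_ref)    # map the prediction back to I
    l_cycle = F.mse_loss(x_cycle, x)             # cycle consistency loss L2
    return l_reproj + l_cycle
```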

3.6. GSM Mismatch Removal Algorithm

The GSM algorithm introduces a scaling factor based on LPM [27], providing robustness to scale transformation during outlier removal. The GSM uses an affine transformation model consisting of scale, rotation, vertical shift, and horizontal shift parameters. In the affine transformation model, inliers have similar scale information, while outliers have different scale information. This mathematical model measures the geometric relationship differences between the feature points of the two images. The specific process of the GSM is as follows.
Assume that a set of $N$ initial feature correspondences $S = \{(x_i, y_i)\}_{i=1}^{N}$ is extracted from the two given input images, where the initial matching point set of the reference image is $S^{\mathrm{ref}} = \{s_i^{\mathrm{ref}}\}_{i=1}^{N} = \{(x_i^{\mathrm{ref}}, y_i^{\mathrm{ref}})\}_{i=1}^{N}$, the corresponding matching point set of the sensed (target) image is $S^{\mathrm{sen}} = \{s_i^{\mathrm{sen}}\}_{i=1}^{N} = \{(x_i^{\mathrm{sen}}, y_i^{\mathrm{sen}})\}_{i=1}^{N}$, and $x_i$ and $y_i$ are two-dimensional column vectors representing the spatial positions of the feature points. The goal of outlier removal is to remove the outliers in $S$ so as to establish accurate correspondences. First, define the distance $d$ between two points as follows:
$d(s_i^{\mathrm{ref}}, s_j^{\mathrm{ref}}) = \left\| s_i^{\mathrm{ref}} - s_j^{\mathrm{ref}} \right\| = \sqrt{(x_i^{\mathrm{ref}} - x_j^{\mathrm{ref}})^2 + (y_i^{\mathrm{ref}} - y_j^{\mathrm{ref}})^2}$
Combining the spatial transformation model, we obtain the following:
$d(s_i^{\mathrm{ref}}, s_j^{\mathrm{ref}}) = s \cdot \left\| s_i^{\mathrm{sen}} - s_j^{\mathrm{sen}} \right\| = s \cdot d(s_i^{\mathrm{sen}}, s_j^{\mathrm{sen}})$
Thus, the ratio of $d(s_i^{\mathrm{ref}}, s_j^{\mathrm{ref}})$ to $d(s_i^{\mathrm{sen}}, s_j^{\mathrm{sen}})$ is the scaling factor $s$. The objective function for outlier removal can then be defined as follows:
$S^{*} = \arg\min_{S_{in}} C(S_{in}; S, \lambda)$
Since the scaling factor is fixed in a pair of images of the same scene or object, the cost function is defined as follows:
$C(S_{in}; S, \lambda) = \sum_{i \in S_{in}} \sum_{j \in S_{in}} \sum_{k \in S_{in}} \left| \frac{d(s_i^{\mathrm{ref}}, s_j^{\mathrm{ref}})}{d(s_i^{\mathrm{sen}}, s_j^{\mathrm{sen}})} - \frac{d(s_i^{\mathrm{ref}}, s_k^{\mathrm{ref}})}{d(s_i^{\mathrm{sen}}, s_k^{\mathrm{sen}})} \right| + \lambda \left( N - |S_{in}| \right)$
where $S_{in}$ is the set of inliers, $|\cdot|$ denotes the number of elements of a set, and $\lambda = 0.9$. Furthermore, a binary vector $p$ of size $N \times 1$ is introduced, where $p_i \in \{0, 1\}$ indicates the matching correctness of the $i$-th correspondence $(x_i, y_i)$: $p_i = 1$ marks an inlier and $p_i = 0$ an outlier. The above equation can then be rewritten as follows:
$C(S_{in}; S, \lambda) = \sum_{i=1}^{N} p_i (c_i - \lambda) + \lambda N$
$c_i = \sum_{j=1}^{N} \sum_{k=1}^{N} \left| \frac{d(s_i^{\mathrm{ref}}, s_j^{\mathrm{ref}})}{d(s_i^{\mathrm{sen}}, s_j^{\mathrm{sen}})} - \frac{d(s_i^{\mathrm{ref}}, s_k^{\mathrm{ref}})}{d(s_i^{\mathrm{sen}}, s_k^{\mathrm{sen}})} \right|$
To optimize the geometric transformation function, the cost function needs to be minimized. Any correspondence with a cost less than λ will reduce the cost function, while any correspondence with a cost greater than λ will increase the cost function. Therefore, the optimal solution of p that minimizes the cost function is determined by the simple criterion as follows:
$p_i = \begin{cases} 1, & c_i \le \lambda \\ 0, & c_i > \lambda \end{cases}, \quad i = 1, \ldots, N$
Thus, the optimal inlier set S* is as follows:
$S^{*} = \{\, i \mid p_i = 1,\; i = 1, \ldots, N \,\}$
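The inlier criterion can be sketched directly from these definitions. In the sketch below, the per-point cost is averaged over the (j, k) pairs so that it remains comparable to the fixed threshold λ = 0.9 regardless of N; this normalization, and skipping the degenerate i = j and i = k terms, are implementation assumptions rather than details given in the text.

```python
import numpy as np

def gsm_costs(ref_pts: np.ndarray, sen_pts: np.ndarray) -> np.ndarray:
    """ref_pts, sen_pts: (N, 2) putative correspondences. Returns the cost c_i of each match."""
    n = len(ref_pts)
    d_ref = np.linalg.norm(ref_pts[:, None, :] - ref_pts[None, :, :], axis=-1)  # (N, N) distances
    d_sen = np.linalg.norm(sen_pts[:, None, :] - sen_pts[None, :, :], axis=-1)
    costs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                                     # skip the degenerate 0/0 term
        ratio = d_ref[i, mask] / np.maximum(d_sen[i, mask], 1e-12)   # pairwise scale estimates
        costs[i] = np.abs(ratio[:, None] - ratio[None, :]).mean()    # consistency of the scale estimates
    return costs

def gsm_inliers(ref_pts: np.ndarray, sen_pts: np.ndarray, lam: float = 0.9) -> np.ndarray:
    costs = gsm_costs(ref_pts, sen_pts)
    return np.flatnonzero(costs <= lam)                              # p_i = 1 iff c_i <= lambda

# Toy check: a pure scale-2 transformation with one corrupted correspondence.
rng = np.random.default_rng(0)
ref = rng.uniform(0, 100, size=(50, 2))
sen = 2.0 * ref
sen[0] += 35.0                                                       # gross mismatch
c = gsm_costs(ref, sen)
print(c[0], np.median(c))  # the corrupted match typically gets a far larger cost than the rest
```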

4. Results

4.1. Datasets and Experimental Setup

The MegaDepth dataset [28] is used as the training dataset. It contains a large number of internet images covering various scenes, and each image is paired with real depth maps obtained from laser scanners or other depth sensors, making the depth information real and accurate; a model trained on MegaDepth therefore demonstrates strong generalization capability. The test set uses the multimodal remote sensing image datasets provided by CoFSM [29], which comprise six cross-modal categories: optical–optical (multi-view, cross-temporal), optical–infrared, optical–depth, optical–map, optical–SAR, and day–night image pairs. The HPatches dataset [30] is used in the ablation study. HPatches is a dataset for the evaluation of local descriptors, consisting of 116 scenes, 59 with viewpoint variations and 57 with illumination variations; each scene contains 5 image pairs, resulting in a total of 696 image pairs. Owing to its reproducibility, diversity, real-world data source, large scale, and multi-task nature, HPatches has been widely cited since its publication and has become an objective, classic benchmark for evaluating local descriptors and image matching algorithms.
The network model is implemented in Python using the PyTorch deep learning framework on the Windows 10 operating system. During training, the Adam optimizer is used to minimize the loss, with a learning rate of 1 × 10−4, a weight decay of 1 × 10−3, a batch size of 4, and 100 training epochs. The software and hardware environment configurations for the experiment are shown in Table 1; a sketch of this training setup is given below.
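The following sketch reflects the reported hyperparameters (Adam, learning rate 1 × 10−4, weight decay 1 × 10−3, batch size 4, 100 epochs); the model, dataset, per-sample layout, and loss callable (e.g., the registration_loss sketched in Section 3.5) are placeholders, not the authors' code.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, loss_fn, device: str = "cuda"):
    """Train the registration network with the reported hyperparameters."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=4, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
    for epoch in range(100):
        for img_ref, img_tgt, x, x_gt in loader:          # assumed sample layout
            img_ref, img_tgt = img_ref.to(device), img_tgt.to(device)
            x, x_gt = x.to(device), x_gt.to(device)
            loss = loss_fn(model, x, x_gt, img_ref, img_tgt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```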

4.2. Evaluation Metrics

Qualitative experiments visualize the registration results through feature point matching relationships, fusion maps, and checkerboard maps. Quantitative experiments use two evaluation metrics: the mean reprojection error (MRE) and the percentage of correct key points (PCK). For a pair of matched feature points $(x_R^{(i)}, y_R^{(i)})$ and $(x_S^{(i)}, y_S^{(i)})$, the MRE is calculated as follows:
$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left(x_{S'}^{(i)} - x_R^{(i)}\right)^2 + \left(y_{S'}^{(i)} - y_R^{(i)}\right)^2}$
where $(x_R^{(i)}, y_R^{(i)})$ are feature points from the reference image, $(x_S^{(i)}, y_S^{(i)})$ are feature points from the target image, and $N$ denotes the number of matched key points. The reprojected feature point of $(x_S^{(i)}, y_S^{(i)})$ is $(x_{S'}^{(i)}, y_{S'}^{(i)})$, and the reprojection error is the distance between this reprojected point and the corresponding point in the reference image.
The PCK measures the percentage of correct matches out of the total number of matches, indicating the correctness of feature point matches. The percentage of correct key points is given by the following:
$\mathrm{PCK} = \frac{\#\,\text{correct matched points}}{\#\,\text{total matched points}}$
The criterion for determining successful feature point matching is whether its reprojection error is below a specific threshold. When the reprojection error of a pair of feature points is less than the threshold, it is considered a successful match; otherwise, it is considered a mismatch. For the MRE metric, the smaller the value, the higher the accuracy. For the PCK metric, the larger the value, the better the matching performance.
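Both metrics reduce to a few lines of NumPy; the sketch below computes the MRE and the PCK at a given pixel threshold from reprojected points and their reference counterparts.

```python
import numpy as np

def mre(reproj_pts: np.ndarray, ref_pts: np.ndarray) -> float:
    """Mean Euclidean distance (in pixels) between reprojected and reference points."""
    return float(np.linalg.norm(reproj_pts - ref_pts, axis=1).mean())

def pck(reproj_pts: np.ndarray, ref_pts: np.ndarray, threshold: float = 5.0) -> float:
    """Fraction of matches whose reprojection error is below the threshold."""
    errors = np.linalg.norm(reproj_pts - ref_pts, axis=1)
    return float((errors < threshold).mean())

# Example with a 2-pixel threshold: one of the three matches exceeds it.
a = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 20.0]])
b = np.array([[10.5, 10.2], [53.0, 40.0], [80.1, 19.8]])
print(mre(a, b), pck(a, b, threshold=2.0))  # ~1.25 and ~0.67
```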

4.3. Comparison Methods

To validate the advancement of the proposed method, experiments compare it with three other methods: SIFT, RIFT, and 3MRS.
SIFT [12] was proposed by Lowe in 1999 and refined in 2004. By constructing a Gaussian pyramid, it extracts potential key points as local extrema in the difference-of-Gaussian space and filters them using the Hessian matrix of local intensity values, providing robustness to illumination, scale, and rotation changes.
RIFT [17] was proposed by Li et al. in 2019. It is an improved SIFT algorithm using phase consistency for feature point detection and maximum index maps for description, reducing the impact of nonlinear radiometric distortion, insensitive to intensity variations, and robust to rotation.
3MRS [31] was proposed by Fan et al. in 2022. It uses index maps constructed from the maximum values in all directions of convolution images obtained by a set of log-Gabor filters for feature description and adopts a 3D phase correlation matching strategy for template feature matching, achieving good registration results.

4.4. Qualitative Experiments

To verify the effect of various methods on multimodal image registration, fusion maps and checkerboard maps are used to observe and analyze the registration results, comparing the registration accuracy of different methods. Observations should focus on the following:
  • Check for significant spatial misalignment in the fusion map, indicating potential issues with the registration method or spatial transformation model if objects in the image show significant misalignment or offset;
  • Observe whether the edges of objects in the checkerboard map are aligned and appear continuous and natural, especially whether the transition in edge regions is smooth without noticeable breaks.
Figure 8, Figure 9, Figure 10 and Figure 11 show the registration results of these three methods and the proposed method on four groups of cross-modal images, demonstrating significant scale and rotation transformations and representing challenging image registration problems.
The areas marked in red boxes indicate registration errors with noticeable spatial misalignment or discontinuities. From the registration results, it can be seen that the fusion maps of the SIFT algorithm have relatively blurry object edges, and the checkerboard maps show many position offset and spatial misalignment issues. For example, in Figure 8, the circular flower bed shows serious misalignment, and in Figure 9, the buildings are not aligned. When dealing with image pairs with severe geometric distortions, such as in Figure 10 and Figure 11, the SIFT algorithm fails to find enough key point correspondences, directly leading to registration failure, and the output result is the original target image. The RIFT algorithm shows significant improvement over SIFT, achieving rough alignment for less challenging image pairs, such as in Figure 9. The RIFT algorithm can also effectively register images with scale changes, such as in Figure 11, but struggles with images with rotation changes. For example, in Figure 8, the road shows misalignment, and in Figure 10, the registration fails.
The 3MRS algorithm achieves relatively good results, with only a few objects not aligned, and the overall contours are roughly aligned. However, for images with significant rotation changes, such as in Figure 10, the 3MRS algorithm shows serious misalignment and position offset, almost resulting in registration failure. For images with scale changes, there are also some discontinuities, such as in Figure 11. For less challenging registration images without scale and rotation changes, such as in Figure 9, the performance is comparable to the proposed method. However, in Figure 8, there is still a noticeable misalignment in the road. The proposed method achieves good results in all four groups of image registrations, with no noticeable spatial misalignment or position offset, and the details are well aligned, with smooth transitions in edge regions and clear object contours in the fusion maps. Overall, the proposed method shows the best registration effect, providing a good alignment effect superior to the other three methods. The proposed method performs well in different scene image registrations, demonstrating robustness to significant nonlinear intensity differences and insensitivity to scale, rotation, and illumination changes. Moreover, it achieves alignment for different modal image pairs, especially in cases with challenging appearance differences, validating the generalizability of the proposed algorithm.

4.5. Quantitative Experiments

The correct matching percentage and mean reprojection error of each method are calculated on the test set, and the results are averaged. The registration results of different methods are shown in Table 2.
From the comparison of evaluation metrics in the table, it can be seen that the SIFT algorithm, due to the general quality of extracted key points, often fails to register images with significant appearance differences, resulting in the worst performance in both the matching success rate and error among these methods. The RIFT and 3MRS algorithms show significantly better performance than the SIFT algorithm, usually successfully registering most cases, but for images with significant rotation and scale changes, they fail to achieve satisfactory results, occasionally showing noticeable spatial misalignment and position offset. The proposed method has the best performance among these methods, with a registration success rate of 100%, and the correct matching rate within five pixels is nearly 50% higher than that of the SIFT algorithm and about 20% higher than the RIFT and 3MRS algorithms. The mean reprojection error is also the smallest among these methods. Since both the SIFT and RIFT algorithms have registration failures, the MRE is not calculated for them. The MRE of the proposed method is about 36% lower than that of the 3MRS method. Overall, the proposed algorithm successfully matches all selected image pairs, effectively registering cross-modal images in different scenes, showing good scale and rotation invariance, and having certain advancement and stability.
To further and more comprehensively evaluate the accuracy and robustness of the proposed method, an ablation study of its two main components and a comparison with representative matching methods were additionally carried out.
The innovations of the network architecture are twofold. First, a new feature fusion module is proposed to alleviate feature misalignment and enhance feature fusion, effectively aligning the high-level feature maps with the low-level feature maps and thus providing robust feature information for subsequent matching. Second, an attention mechanism is introduced to capture the long-range dependencies of the images; self-attention and cross-attention layers are used to match the two images, and the global receptive field provided by the Transformer can produce dense matches in low-texture areas. To verify the effectiveness of these two innovations, a quantitative ablation study was carried out on the HPatches dataset, with the results shown in Table 3.
The evaluation metric used in the ablation experiment is the PCK with a three-pixel threshold. Under the baseline framework, the PCK of the baseline method is 51.6%. The PCK increases by about 17 percentage points when the attention mechanism is introduced and by about 11 percentage points when only the feature fusion module is used, and the full method proposed in this paper is about 25 percentage points better than the baseline. This demonstrates that the proposed innovations improve registration accuracy to a certain extent and verifies the effectiveness of the method.
To further assess the registration performance of the proposed method, its results were compared with those of SIFT+NN, LNIFT, D2-Net+NN, SuperPoint+SuperGlue, ReDFeat, and LoFTR. The test was carried out on the HPatches dataset and evaluated using two metrics, the PCK and the MRE, with the PCK subdivided into three measures using one-, three-, and five-pixel thresholds. The results of the assessment are shown in Table 4.
As can be seen from the table, the proposed method achieves the best results on all metrics except PCK-1px, which is 1.7 percentage points lower than that of the LoFTR algorithm, while it outperforms the other methods at the three-pixel and five-pixel thresholds. SIFT is a traditional detection method, and D2-Net uses a convolutional neural network to perform feature detection and feature description jointly, which improves performance compared with the traditional method, although the gain is limited. Compared with these algorithms, the proposed method shows good performance and a certain degree of advancement.

4.6. Application of the Positioning Software Platform

This section introduces the interface design and functional application of the UAV autonomous positioning software platform shown in Figure 12 and demonstrates its functions. The platform can be used for autonomous positioning of drones: the captured drone images are registered with Google Maps, the spatial transformation model is then used to find the best-matching location and record the current location information, and finally the matched location points are marked on the map.

5. Conclusions

This paper conducts in-depth research on the task of multimodal image registration. Owing to the inherent modality differences and significant geometric deformations in multimodal images, cross-modal image registration is challenging. To address the diversity of modality information in massive image collections, this paper proposes a registration method based on a two-stage matching strategy. Feature detection and matching use a CNN-Transformer hybrid network model with a new feature fusion module for multi-scale feature fusion, and the GSM algorithm is adopted for outlier removal to obtain more accurate registration results. The experiments show that this method effectively improves the accuracy and robustness of multimodal image registration and outperforms the other registration algorithms evaluated.

Author Contributions

Conceptualization, R.H. and W.S.; methodology, S.L.; software, H.L.; validation, R.H. and W.S.; formal analysis, R.H. and W.S.; investigation, S.L.; resources, S.L., W.S. and R.H.; data curation, H.L. and R.H.; writing—original draft preparation, S.L.; writing—review and editing, R.H. and W.S.; visualization, S.L.; supervision, R.H. and W.S.; project administration, W.S.; funding acquisition, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was funded by the National Natural Science Foundation of China (62173330, 62371375); Shaanxi Key R&D Plan Key Industry Innovation Chain Project (2022ZDLGY03-01); China College Innovation Fund of Production, Education, and Research (2021ZYAO8004); Xi’an Science and Technology Plan Project (2022JH-RGZN-0039).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zitová, B.; Flusser, J. Image registration methods: A survey. Image Vis. Comput. 2003, 21, 977–1000. [Google Scholar] [CrossRef]
  2. Le Moigne, J.; Netanyahu, N.S.; Eastman, R.D. Image Registration for Remote Sensing: Survey of Image Registration Methods; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  3. Sui, H.; Xu, C.; Liu, J.; Hua, F. Automatic Optical-to-SAR Image Registration by Iterative Line Extraction and Voronoi Integrated Spectral Point Matching. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6058–6072. [Google Scholar] [CrossRef]
  4. Chen, J.; Frey, E.C.; He, Y.; Segars, W.P.; Li, Y.; Du, Y. TransMorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615. [Google Scholar] [CrossRef] [PubMed]
  5. Zhang, X.; Leng, C.; Hong, Y.; Pei, Z.; Cheng, I.; Basu, A. Multimodal Remote Sensing Image Registration Methods and Advancements: A Survey. Remote Sens. 2021, 13, 5128. [Google Scholar] [CrossRef]
  6. Detone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  7. Canny, J. A computational approach to edge detection. Read. Comput. Vis. 1987, 184–203. [Google Scholar] [CrossRef]
  8. Lehureau, G.; Tupin, F.; Tison, C.; Oller, G.; Petit, D. Registration of metric resolution SAR and optical images in urban areas. In Proceedings of the 2008 7th European Conference on Synthetic Aperture Radar (EUSAR), Friedrichshafen, Germany, 2–5 June 2008. [Google Scholar]
  9. Beymer, D. Feature correspondence by interleaving shape and texture computations. In Proceedings of the CVPR IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 18–20 June 1996; pp. 921–928. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  12. Lowe, D. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  13. Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded up robust features. In Proceedings of the 9th European Conference on Computer Vision—Volume Part I, Graz, Austria, 7–13 May 2006. [Google Scholar]
  14. Ma, W.; Wen, Z.; Wu, Y.; Jiao, L.; Gong, M.; Zheng, Y.; Liu, L. Remote Sensing Image Registration with Modified SIFT and Enhanced Feature Matching. IEEE Geosci. Remote Sens. Lett. 2016, 14, 3–7. [Google Scholar] [CrossRef]
  15. Wu, Y.; Ma, W.; Gong, M.; Su, L.; Jiao, L. A Novel Point-Matching Algorithm Based on Fast Sample Consensus for Image Registration. IEEE Geosci. Remote Sens. Lett. 2014, 12, 43–47. [Google Scholar] [CrossRef]
  16. Ye, Y.; Shan, J.; Bruzzone, L.; Shen, L. Robust Registration of Multimodal Remote Sensing Images Based on Structural Similarity. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2941–2958. [Google Scholar] [CrossRef]
  17. Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal Image Matching Based on Radiation-variation Insensitive Feature Transform. IEEE Trans. Image Process. 2019, 29, 3296–3310. [Google Scholar] [CrossRef] [PubMed]
  18. Zhu, B.; Yang, C.; Dai, J.; Fan, J.; Qin, Y.; Ye, Y. R2FD2: Fast and Robust Matching of Multimodal Remote Sensing Image via Repeatable Feature Detector and Rotation-invariant Feature Descriptor. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  19. Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. LIFT: Learned Invariant Feature Transform; Springer International Publishing: Cham, Switzerland, 2016. [Google Scholar]
  20. Ono, Y.; Trulls, E.; Fua, P.; Yi, K.M. LF-Net: Learning Local Features from Images; Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
  21. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2D2: Repeatable and Reliable Detector and Descriptor. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  22. Luo, Z.; Zhou, L.; Bai, X.; Chen, H.; Zhang, J.; Yao, Y.; Li, S.; Fang, T.; Quan, L. ASLFeat: Learning Local Features of Accurate Shape and Localization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  23. Li, J.; Xu, W.; Shi, P.; Zhang, Y.; Hu, Q. LNIFT: Locally Normalized Image for Rotation Invariant Multimodal Feature Matching. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  24. Deng, Y.; Ma, J. ReDFeat: Recoupling Detection and Description for Multimodal Feature Learning. IEEE Trans. Image Process. 2023, 32, 591–602. [Google Scholar] [CrossRef] [PubMed]
  25. Gao, T.; Lan, C.; Huang, W.; Wang, L.; Wei, Z.; Yao, F. Multiscale Template Matching for Multimodal Remote Sensing Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 10132–10147. [Google Scholar] [CrossRef]
  26. Wang, X.; Jabri, A.; Efros, A.A. Learning Correspondence from the Cycle-Consistency of Time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  27. Ma, J.; Zhao, J.; Jiang, J.; Zhou, H.; Guo, X. Locality Preserving Matching. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017. [Google Scholar]
  28. Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  29. Yao, Y.; Zhang, Y.; Wan, Y.; Liu, X.; Yan, X.; Li, J. Multi-Modal Remote Sensing Image Matching Considering Co-Occurrence Filter. IEEE Trans. Image Process. 2022, 31, 2584–2597. [Google Scholar] [CrossRef] [PubMed]
  30. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
  31. Fan, Z.; Liu, Y.; Liu, Y.; Zhang, L.; Zhang, J.; Sun, Y.; Ai, H. 3MRS: An Effective Coarse-to-Fine Matching Method for Multimodal Remote Sensing Imagery. Remote Sens. 2022, 14, 478. [Google Scholar] [CrossRef]
Figure 1. Feature point correspondences.
Figure 2. Multimodal image registration and UAV positioning overall process.
Figure 3. Overall framework of the network model.
Figure 4. Multi-scale feature extraction network architecture.
Figure 5. Convolution module structure.
Figure 6. Feature pyramid network architecture.
Figure 7. Feature fusion module structure.
Figure 8. Optical–map registration result comparison. (The areas marked in red boxes indicate registration errors with noticeable spatial misalignment or discontinuities).
Figure 9. Optical–depth registration result comparison. (The areas marked in red boxes indicate registration errors with noticeable spatial misalignment or discontinuities).
Figure 10. Image registration result comparison with rotation changes. (The areas marked in red boxes indicate registration errors with noticeable spatial misalignment or discontinuities).
Figure 11. Image registration result comparison with scale changes. (The areas marked in red boxes indicate registration errors with noticeable spatial misalignment or discontinuities).
Figure 12. Application of the UAV autonomous positioning software.
Table 1. Experimental environment configuration.

| Item | Version |
|---|---|
| CPU | i7-9700 |
| GPU | RTX 2070 SUPER |
| Operating System | Windows 10 |
| Language | Python 3.8 |
| Framework | PyTorch 2.1.0 |
| CUDA | CUDA 12.1 |
| cuDNN | cuDNN 8.8.1 |
Table 2. Evaluation metric comparison of different registration methods.

| Method | PCK-1px | PCK-3px | PCK-5px | MRE (pixel) | Matching Time (s) |
|---|---|---|---|---|---|
| SIFT | 0.15 | 0.27 | 0.34 | / | 10.4 |
| RIFT | 0.23 | 0.51 | 0.65 | / | 8.83 |
| 3MRS | 0.52 | 0.63 | 0.71 | 1.85 | 3.94 |
| Ours | 0.61 | 0.74 | 0.86 | 1.19 | 2.82 |
Table 3. Ablation test results.

| Backbone | FFM | Attention Mechanism | PCK-1px (%) | PCK-3px (%) | PCK-5px (%) | Matching Time (s) |
|---|---|---|---|---|---|---|
| ResNet50 | × | × | 43.2 | 51.6 | 65.4 | 5.64 |
| ResNet50 | × | ✓ | 60.2 | 68.2 | 82.1 | 3.27 |
| ResNet50 | ✓ | × | 51.3 | 62.5 | 76.4 | 4.36 |
| ResNet50 | ✓ | ✓ | 63.4 | 76.5 | 88.2 | 2.82 |
Table 4. Comparative experimental results.

| Method | PCK-1px (%) | PCK-3px (%) | PCK-5px (%) | MRE (pixel) | Matching Time (s) |
|---|---|---|---|---|---|
| SIFT+NN | 14.5 | 30.1 | 38.8 | 13.85 | 10.04 |
| LNIFT | 17.6 | 33.4 | 42.6 | 9.74 | 9.76 |
| D2-Net+NN | 23.2 | 35.9 | 53.6 | 7.33 | 8.86 |
| SuperPoint+SuperGlue | 53.9 | 68.3 | 78.1 | 2.96 | 6.87 |
| ReDFeat | 58.7 | 70.1 | 82.4 | 2.34 | 3.42 |
| LoFTR | 65.1 | 71.6 | 84.6 | 1.54 | 2.89 |
| Ours | 63.4 | 76.5 | 88.2 | 1.38 | 2.82 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

