2. Related Work
Concerning traditional feature detection algorithms, in 1999, D. G. Lowe proposed the Sift [9] algorithm based on local scale-invariant features, a classic method capable of stable detection under image rotation, blur, scale changes, and brightness variations. In 2006, H. Bay proposed the Surf [10] algorithm, essentially an accelerated version of Sift: while maintaining the original performance of the Sift algorithm, it addressed Sift's high computational complexity and long running time. However, neither Sift nor Surf is fast enough to conduct real-time feature-matching tasks on UAV camera images. In 2011, Rublee et al. proposed the Orb [11] algorithm as an effective alternative to Sift and Surf. Orb uses the Fast [12] algorithm as the basis for feature extraction and the BRIEF [13] algorithm as the basis for feature matching. The computation time of the Orb algorithm is 1% of that of Sift and 10% of that of Surf, but its feature extraction and matching perform poorly in low-texture scenes, and its accuracy is low.
In work aiming to improve the accuracy and speed of feature extraction and matching, the tilt and large viewing angle of some low-altitude UAV aerial images make them difficult to register accurately. To solve this problem, Wang et al. [14] used the improved ASIFT (affine scale-invariant feature transform) algorithm to collect the initial feature points, the Weighted Least Squares Matching (WLSM) method to refine the positioning of the feature points, and an adaptive normalized cross-correlation algorithm to estimate the local transformation model. As a result, UAV aerial images with large changes in perspective could be registered at the sub-pixel level. Liu et al. [15], stitching high-resolution UAV aerial farmland images, down-sampled the images before feature detection to reduce the number of feature points and realized feature matching with a descriptor based on gradient normalization. The Progressive Sample Consensus algorithm was used to eliminate false matching points, which improved the speed and accuracy of the algorithm. Wu et al. [16], stitching forest images taken by a UAV, found that high scene similarity leads to low feature-matching accuracy and a long stitching time. To solve this problem, they introduced the arccosine function ratio of the unit vector dot product to overcome the long matching time caused by the excessive number of matching points, and the Fast Sample Consensus (FSC) algorithm to eliminate false matching points, which improved the accuracy of the algorithm. However, the abovementioned methods may not obtain satisfactory feature-matching results for high-resolution, low-texture maps and UAV aerial images, and they are far from capable of performing real-time tasks. J. N. Goh et al. [17] introduced matrix multiplication into the Ransac [18] algorithm for the real-time stitching and mapping of UAV aerial images, which greatly reduced the processing time for calculating the homography matrix. Moreover, in the stitching process, the input images were divided into two halves to reduce the time for feature detection and extraction. P. Xiong et al. [19], conducting real-time UAV stitching, used a prediction region to match the features of the current image, keeping the time required per frame constant and reducing the stitching error. G. Zhang et al. [20] introduced a semantic segmentation [21,22,23] algorithm to filter foreground features, which improved the robustness of the algorithm and reduced its limitations, in order to solve the misalignment and tearing caused by significant changes in the dynamic foreground during the real-time stitching of UAV images. However, the segmentation algorithm may degrade the real-time performance.
In terms of the registration of UAV images and maps, Y. Yuan et al. [24], aiming to solve the problem that UAV aerial images and Google Maps cannot be accurately registered due to large differences in viewpoint direction, shooting time, and height, obtained Google Maps images from the approximate position of the UAV aerial images. Using the VGG16 [25] model to extract deep convolutional features, their algorithm achieved a more robust and effective registration. X. Zhuo et al. [26] stated that the greatest challenge in matching UAV aerial images with previously captured geo-referenced images is the significant differences in scale and rotation. They proposed dense feature detection and one-to-many matching strategies, combined with global geometric constraints for outlier detection, to identify thousands of valid matches in cases where Sift failed. This method realizes the geo-registration of UAV aerial images with decimeter-level accuracy. However, the algorithm was only studied in terms of its accuracy and was not optimized for real-time use. In order to avoid the error accumulation and distortion caused by using local methods to stitch continuous images captured by UAV airborne cameras, Y. Lin et al. [27] proposed using a high-resolution map as a reference image, registering frames on the map and performing stitching by the frame-to-frame registration method. A. Nassar et al. [28] realized UAV positioning by registering the forward- and downward-view images taken by the UAV with a satellite map. The algorithm used only the airborne camera and did not require GPS. A semantic shape-matching algorithm was introduced into the registration process to improve the accuracy, which proved that the use of visual information can provide a promising method of UAV navigation.
Nowadays, feature-matching algorithms are powerful and are often used for image stitching [29,30,31], positioning, mapping, registration, and other visual tasks. However, applying them to scenes with sparse textures and to tasks requiring high real-time performance and accuracy remains challenging.
The main work reported in this paper is as follows:
1. The SuperGlue matching algorithm is applied to the real-time registration of UAV aerial images and maps, and a hierarchical blocking strategy combined with prior UAV data is proposed to optimize the performance of the algorithm.
2. The inter-frame matching information is integrated into the matching process to improve the stability of the algorithm.
3. A method for updating the map features in real time is proposed to improve the robustness and applicability of the algorithm.
4. Experimental Results
The experiment was divided into four parts. The first compared the performance of the proposed method and the Orb method in two aspects: the feature-matching effect and the registration effect. The other three parts verified the effectiveness of the improvements proposed in this paper through a vertical comparison experiment.
The vertical comparison experiment can be divided into three aspects. Firstly, the feature-matching effect prior to and following map blocking and image rotation was compared. Secondly, the stability of the registration prior to and following the integration of inter-frame matching information was compared. Finally, the accuracy of registration prior to and following the real-time updating of the map feature points was compared, with the evaluation conducted from both subjective and objective perspectives.
The multirotor X-type tethered UAV (with a pan–tilt–zoom camera, as depicted in Figure 9) was used in the experiment. The resolution of all map blocks was 1920 × 1080, the resolution of the UAV aerial images was 1920 × 1080, and the confidence threshold of the SuperGlue algorithm was set to 0.2.
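For reference, the following is a minimal sketch of how this configuration might be invoked, assuming the interface of the publicly released SuperGlue pretrained-network codebase (the `Matching` wrapper, configuration keys, and output names are taken from that repository, not from this paper's implementation; the file names are placeholders):

```python
# Minimal sketch, assuming the magicleap/SuperGluePretrainedNetwork interface.
import cv2
import torch
from models.matching import Matching  # from the SuperGlue reference repository

device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'superpoint': {'max_keypoints': 1024},
    'superglue': {'weights': 'outdoor',     # aerial scenes resemble the outdoor model
                  'match_threshold': 0.2},  # confidence threshold used in our experiments
}
matching = Matching(config).eval().to(device)

def to_tensor(gray):
    # Grayscale uint8 image -> 1x1xHxW float tensor in [0, 1].
    return torch.from_numpy(gray / 255.).float()[None, None].to(device)

map_block = cv2.imread('map_block.png', cv2.IMREAD_GRAYSCALE)  # 1920x1080 map block
frame = cv2.imread('uav_frame.png', cv2.IMREAD_GRAYSCALE)      # 1920x1080 UAV frame

with torch.no_grad():
    pred = matching({'image0': to_tensor(map_block), 'image1': to_tensor(frame)})

kpts0 = pred['keypoints0'][0].cpu().numpy()
kpts1 = pred['keypoints1'][0].cpu().numpy()
matches = pred['matches0'][0].cpu().numpy()  # index into kpts1; -1 means unmatched
valid = matches > -1
mkpts0, mkpts1 = kpts0[valid], kpts1[matches[valid]]
```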
4.1. The Effect of the Proposed Method and the Orb Algorithm
This experiment compared the traditional Orb matching algorithm with the method proposed in this paper; the Orb algorithm used the brute-force (BF) [36] matcher to conduct the matching. Two groups of map blocks and UAV aerial images were selected to compare the matching effect and the accuracy of the matching-point pairs of the two methods (the Ransac [18] algorithm was used to calculate the matching accuracy). Then, the registration results of the two methods were compared, where the registration result refers to overlaying the registered UAV aerial image onto the map block.
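The Orb + BF baseline and the Ransac-based accuracy measure can be reproduced with standard OpenCV calls; a minimal sketch (the feature count and reprojection threshold are illustrative choices, not values from this paper):

```python
import cv2
import numpy as np

map_block = cv2.imread('map_block.png', cv2.IMREAD_GRAYSCALE)
frame = cv2.imread('uav_frame.png', cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
k0, d0 = orb.detectAndCompute(map_block, None)
k1, d1 = orb.detectAndCompute(frame, None)

# Brute-force (BF) matching with Hamming distance for binary descriptors.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(d0, d1)

pts0 = np.float32([k0[m.queryIdx].pt for m in matches])
pts1 = np.float32([k1[m.trainIdx].pt for m in matches])

# Ransac separates inliers from mismatches while estimating the homography;
# the inlier ratio serves as the matching accuracy.
H, mask = cv2.findHomography(pts1, pts0, cv2.RANSAC, 5.0)
accuracy = mask.sum() / len(matches)
print(f'{len(matches)} matches, accuracy {accuracy:.2%}')
```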
In Figure 10, we present the feature-matching effect of the Orb algorithm and our proposed method. To present clearer results, we uniformly selected 20 pairs of matches for each method and drew them. It can be observed that the Orb algorithm produces many incorrect matches (we marked five of them), whereas for the SuperGlue algorithm basically no incorrect matches are evident.
Table 1 compares the number and accuracy of the matching-point pairs of the two methods. In both groups of experiments, the Orb algorithm and our method attain a relatively high number of matching-point pairs; however, after the mismatched pairs are eliminated by the Ransac [18] algorithm, very few correct matching-point pairs of the Orb algorithm remain. The table also shows that the matching accuracy of the Orb algorithm is very low, indicating that most of its matching-point pairs are invalid.
In Figure 11, we present two groups of image registration results for the Orb algorithm and our method. It can be observed that our method accurately registers the UAV aerial images and maps, whereas the Orb algorithm cannot register them. The figure also shows that the Orb algorithm produced an abnormal result: its incorrect matches led to an incorrect homography transformation matrix.
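The overlay described above, in which the registered UAV aerial image is placed onto the map block, amounts to warping the frame with the estimated homography and blending; a minimal sketch (the blending weight is an illustrative choice):

```python
import cv2
import numpy as np

def overlay_registration(map_img, frame, H, alpha=0.5):
    """Warp the UAV frame into the map block's coordinates with H and blend."""
    h, w = map_img.shape[:2]
    warped = cv2.warpPerspective(frame, H, (w, h))
    # Blend only where the warped frame actually has content.
    mask = cv2.warpPerspective(np.ones(frame.shape[:2], np.uint8), H, (w, h)) > 0
    out = map_img.copy()
    out[mask] = cv2.addWeighted(map_img, 1 - alpha, warped, alpha, 0)[mask]
    return out
```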
4.2. Blocking and Rotation Experiments
This experiment can be divided into two aspects. The first verified that the map achieves a better feature-matching effect with the UAV aerial image after being divided into blocks. We selected a recorded UAV aerial video, a map block, and an unblocked map (covering a greater geographical range), and we matched the features of the video frames with the two maps, respectively. The feature-matching effect was evaluated by the number of matching points, and we also compared the running speeds.
The second aspect verified that the UAV aerial image achieves a stronger feature-matching effect when it is rotated to face the same direction as the map. We selected 10 frames of UAV aerial images whose orientation was not consistent with the map, and we rotated them by the heading angle of the UAV to obtain a set of images consistent with the map direction. Feature matching between these images and the map was performed, and the effect prior to and following rotation was evaluated by the number of matching points obtained.
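The rotation step amounts to rotating the frame about its center by the heading angle; a minimal sketch (the sign convention of the heading angle is an assumption and depends on the autopilot's coordinate frame):

```python
import cv2

def rotate_to_map(frame, heading_deg):
    """Rotate the UAV frame about its center by the heading angle.

    The sign of the angle depends on the UAV's heading convention
    (assumed clockwise from north here), so it may need to be negated.
    The affine matrix M is returned so that matched keypoints can be
    mapped back to the original frame coordinates afterwards.
    """
    h, w = frame.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), heading_deg, 1.0)
    return cv2.warpAffine(frame, M, (w, h)), M
```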
In the blocking experiment, Figure 12 presents the matching results of one UAV aerial video frame with the map block and the unblocked map. It can be observed that the blocked map yields more matching points with the UAV aerial image, while some incorrect matches (marked with black numbers) appear when the map is not blocked.
Table 2 shows the frame rate of the video frame registration under the two approaches. It can be observed that the feature-matching process attains a higher frame rate after the map is blocked, which improves the registration speed.
Table 3 shows the number of matching-point pairs for 10 randomly selected frames. The larger values are presented in bold; it can be observed that blocking increases the number of matching points.
In the rotation experiment, Figure 13 presents the matching effect of the map with the UAV aerial images prior to and following rotation. To better display the results, we removed the matches with a matching confidence lower than 0.3. It can be observed that there are more matching points following the image rotation, and the feature-matching performance is greatly improved.
Table 4 compares the number of matching points in 10 frames of images. The higher values are presented in bold; it can be observed that when the UAV aerial image and the map roughly face the same direction, more matching-point pairs are obtained.
4.3. Comparison Conducted Prior to and Following the Addition of Inter-Frame-Matching Information
This section of the experiment was divided into two parts: the first verifies that frame-to-frame matching works better than map-to-frame matching; the second verifies that the stability of video frame registration is greatly improved after integrating inter-frame matching.
As shown in Figure 14, the blue dots in the image represent matching points; richer matching points can be observed in the right-hand-side image. In Figure 15, 10 frames are extracted. Comparing the matching-point pairs obtained through the two methods shows that when the UAV aerial image is matched with the transformed previous frame, there are more matching points than when it is matched with the map.
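A minimal sketch of how such inter-frame matching can be chained with the map registration: the previous frame is warped into map coordinates by its own homography, and matching the current frame against this warped image directly yields the current frame's map homography (`match` is a placeholder for any matcher returning corresponding point arrays, e.g., SuperGlue):

```python
import cv2

def register_with_interframe(curr_frame, prev_frame, H_prev_to_map, match, map_size):
    """Estimate the current frame's map homography via the previous frame.

    H_prev_to_map maps the previous frame into map coordinates, so the warped
    previous frame already lives in the map's coordinate system; the homography
    estimated from current-frame-to-warped-frame matches is therefore the
    current frame's map homography. Matches against the map itself can be
    added to the same point set to bound the cumulative drift.
    """
    w, h = map_size
    warped_prev = cv2.warpPerspective(prev_frame, H_prev_to_map, (w, h))
    pts_curr, pts_warped = match(curr_frame, warped_prev)  # Nx2 float arrays
    H_curr_to_map, _ = cv2.findHomography(pts_curr, pts_warped, cv2.RANSAC, 5.0)
    return H_curr_to_map
```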
This experiment verified that the stability of the registration can be improved by integrating inter-frame matching. Since the motion between two adjacent frames is very small, their homography transformation matrices should be close to each other during the registration process. The stability can therefore be measured by the difference between the transformation matrices of two adjacent frames: the greater the average value of this difference, the more unstable the registration process. The difference between the homography transformation matrices of two adjacent frames can be obtained using Equation (11):
$$D = \sum_{i=1}^{3}\sum_{j=1}^{3}\left|h_{ij}^{c}-h_{ij}^{p}\right|, \qquad (11)$$

where $h_{ij}^{p}$ is the value at position $(i, j)$ of the transformation matrix of the previous frame, $h_{ij}^{c}$ is the value at position $(i, j)$ of the transformation matrix of the current frame, $i$ and $j$ represent the row and column of the transformation matrix, respectively, and $D$ represents the difference between the two matrices.
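A direct implementation of this measure might look as follows (the normalization of the homographies is an added assumption, since a homography is only defined up to scale):

```python
import numpy as np

def homography_difference(H_prev, H_curr):
    """D in Equation (11): the sum of the absolute element-wise differences
    between the 3x3 homography matrices of two adjacent frames.

    Normalizing so that h33 = 1 removes the arbitrary projective scale
    (an assumption; the paper does not state its normalization).
    """
    H_prev = np.asarray(H_prev, float) / H_prev[2, 2]
    H_curr = np.asarray(H_curr, float) / H_curr[2, 2]
    return float(np.abs(H_curr - H_prev).sum())
```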
We recorded a video taken by the tethered UAV and registered each frame with the map using two methods: one matching with the map only, and the other integrating inter-frame matching. A total of 100 frames from the video were selected to save the experimental results, and the $D$ values of the 100 frames under the two methods were compared.
Due to the limited space of the paper, Table 5 only shows the $D$ values of 15 sampling frames and the average value over the 100 frames. The lower values are presented in bold. It can be observed that the difference between the transformation matrices of two adjacent frames is very small after integrating inter-frame matching, while it is considerably larger when inter-frame matching is not utilized. This shows that incorporating inter-frame matching into video frame registration produces a stable registration result. In Figure 16, we visually present these results. In the figure, the yellow line represents the result without inter-frame matching, while the blue line represents the result with inter-frame matching. It can be observed that after the integration of inter-frame matching, the transformation matrices of adjacent frames differ only slightly, and the entire video registration process is more stable.
4.4. Comparison of the Registration Effects Prior to and Following the Real-Time Update of Map Features
This experiment was designed to verify that a greater registration accuracy can be achieved after updating map features. The experiments were conducted with and without updating the map features. Two scenes were selected for the experiment and the experimental data were collected using the tethered UAV (the video was collected with the camera tilted in order to increase the difficulty of registration). Two methods were used to register each frame of the video in real time.
The transformation matrix can be obtained by the feature-matching technique, and the UAV aerial image can be transformed into the coordinate system of the map through the transformation matrix. There is an overlapping area between the transformed UAV aerial image and the map, and the coincidence degree of the two images can be determined by the difference image of the overlapping area. (The difference image can be obtained by subtracting the gray image of the transformed UAV aerial image from the gray image of the map. That is, the gray values of two corresponding pixels are subtracted.) The pixel value of the difference image represents the difference between the two images at this pixel point. The smaller the average pixel value of the difference image, the greater the accuracy of the UAV aerial image and map registration. In other words, the more white parts there are in the difference image, the worse the accuracy of registration. The average pixel value in the effective area of the difference image can be calculated using the following equation:
$$P = \frac{1}{N}\sum_{(i,\,j)} \left| M(i,j) - F(i,j) \right|,$$

where $M$ represents the grayscale image of the map block, $F$ represents the grayscale image of the frame image following the homography transformation, $(i, j)$ satisfies $F(i,j) > 0$ (i.e., the pixel lies in the effective overlapping area), $N$ is the number of eligible pixels, and $P$ represents the average gray value of the effective region in the difference image.
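A minimal sketch of this computation, using the warped frame's non-zero pixels as the effective area (the condition assumed above):

```python
import cv2

def average_difference(map_gray, warped_gray):
    """P: the mean absolute gray-level difference over the effective area,
    taken here as the pixels where the warped frame has content
    (warped_gray > 0), matching the condition assumed in the equation."""
    valid = warped_gray > 0
    n = int(valid.sum())
    if n == 0:
        return float('inf')  # no overlap between the frame and the map
    diff = cv2.absdiff(map_gray, warped_gray)
    return float(diff[valid].sum()) / n
```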
In terms of the result evaluation criteria, we divided the results into subjective and objective evaluations; for the latter, we used the number of matching points and the $P$ value. The experiment was divided into two groups. Due to the limited space of the paper, 13 frames (the 15th, 30th, 45th, 60th, 75th, 90th, 105th, 120th, 135th, 150th, 165th, 180th, and 195th frames) of the video were selected, and the registration results of these frames under the two methods were evaluated and compared.
4.4.1. Experiment 1 (Group 1)
In order to better display the results, we selected 9 of the 13 sampling frames to present graphically. Figure 17 depicts the difference images of the registration results without updating the map features, and Figure 18 shows the difference images after updating the map features (note: the pixel values of the difference images remain high even after updating the map features because there are certain differences in color and brightness between the map and the UAV aerial images). It can be observed that the top-left-corner areas of the first, third, and eighth images without updating are whiter than those with updating, and the difference is more obvious in the second and seventh images, indicating worse registration accuracy.
Figure 19 and Figure 20 present the registration results prior to and following the updating of the map features. It can be observed that the registration results of the second, third, fourth, fifth, seventh, and eighth images exhibit obvious misalignment without updating the map features. In addition, there are basically no matching points when the frame images are matched with the map without updating the features (the blue points in the figure indicate the matching points). After updating the map features, however, the matching points increase significantly.
Table 6 presents the $P$ values and the number of matching points of the 13 sampling frames. It can be observed that after updating the map features, the matching points between the UAV aerial image and the map increase significantly, and the $P$ value is basically lower than that without updating.
Figure 21 presents the results of Table 6 graphically. It can be observed that the matching points increase dramatically after updating the map features. Although the change in the $P$ value is not obvious, it attains a smaller value for each frame, which also means that the registration accuracy is higher.
4.4.2. Experiment 2 (Group 2)
For the second group of experiments, we similarly selected 9 of the 13 frames for graphical display. Figure 22 shows the difference images without updating the map features, and Figure 23 shows the difference images after updating the map features. It can be observed that the lower-left-corner areas of the third, fourth, and seventh images without updating are whiter than those with updating, and there are obviously incorrect transformations in the second and eighth images.
Figure 24 and Figure 25 depict the registration results prior to and following the update of the map features. It can be observed that when the features are not updated, the second and eighth registration results exhibit considerable deformation. Although the contrast is not obvious in the first, third, fourth, fifth, sixth, seventh, and ninth images, it can also be observed that the edge of the overlapping area is misaligned.
Table 7 presents the $P$ values and the number of matching points of the 13 sampling frames. It can be observed that after updating the map features, the number of matching points basically increases; however, the increase is smaller than in experiment 1, which is explained by the richer texture features of this scene. On the other hand, the $P$ value is basically smaller than that without updating, and the result is also more stable.
Figure 26 shows the results of Table 7 graphically; the yellow line represents the results after updating the map features and the blue line represents the results without updating. It can be observed that a good number of matching-point pairs are obtained both prior to and following updating; however, after updating, the number of matching points further increases and tends to be stable, and the $P$ value is basically 2–3 gray levels smaller and relatively stable.
5. Discussion
With the rapid development of computer vision and UAV technologies, UAVs are often used in the field for tasks such as visual detection and tracking to analyze or monitor targets; however, such tasks only display the information of an image and convey a visual impression. If the correspondence between the real-time frames of the UAV and a geographic map can be determined, the camera image can be endowed with geographic information. Further applications become possible by transmitting the target's geographic information to other platforms, for example, combining it with a model map or 3D platform to achieve a virtual reality effect.
In earlier work, the projection transformation method was used to project the real-time frame onto the map, and the position of the camera image was calculated from the position information of the UAV and the angle information of the camera. However, this method requires the information provided by the UAV to be extremely accurate, and the rotations of the UAV and camera make the calculation process very complex, introducing numerous accumulated errors and a lack of flexibility. With the gradual development of feature-matching algorithms, both their accuracy and speed have improved, making it possible to accurately register UAV aerial images with the map. The UAV aerial images and the geographic map are registered by feature matching, so that the UAV aerial images also acquire geographic coordinates, and the real-time geolocation of targets is realized.
The traditional feature-matching algorithms include Sift [9], Surf [10], Brisk [37], etc. However, they are not real-time methods and can only process a single image; therefore, their application scope is narrow. A lot of research has therefore been conducted on speeding up these algorithms, such as meshing or eliminating invalid regions; however, they remain far from real time. The emergence of the Orb algorithm solved the real-time problem, and the Orb algorithm is widely used in various studies because of its superior performance. However, although the Orb algorithm performs well in general, it is difficult for it to achieve correct matching in scenes with sparse textures, where it can even generate a high error rate. In this study, the SuperPoint and SuperGlue algorithms, which exhibit real-time performance, were adopted. The SuperGlue algorithm has a better and more stable performance in sparse-texture scenes and is suitable for performing feature matching on maps with sparse textures (Figure 10 and Figure 11).
In addition, the map covers a wide range, while the UAV aerial image covers a narrow one. There is a wide gap in scale between the two images; therefore, it is difficult to perform feature matching between them. The easiest way to solve this problem is to cut the map; however, the UAV aerial image is constantly changing. Thus, how do we obtain the appropriate map block? The traditional method is to obtain the position directly below the UAV from its GPS information and to select the corresponding map block. On this basis, we used the pan–tilt–zoom camera and introduced the rotation information of the camera, so that our method could register UAV aerial images captured at a tilt angle. In addition, our method could flexibly rotate the camera image by the heading angle of the UAV, so that UAV aerial images with different orientations could also be registered (Figure 12 and Figure 13).
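A minimal sketch of this block-selection step, assuming the map is pre-cut into a regular grid of 1920 × 1080 blocks and that a geo-referencing function from latitude/longitude to global map pixel coordinates is available (both the grid layout and the `geo_to_pixel` function are assumptions about the setup, not a prescribed interface):

```python
def select_map_block(lat, lon, geo_to_pixel, block_w=1920, block_h=1080):
    """Pick the pre-cut map block under the UAV from its GPS position.

    geo_to_pixel maps (lat, lon) to global map pixel coordinates; the map
    is assumed to be tiled into a regular block_w x block_h grid.
    """
    x, y = geo_to_pixel(lat, lon)
    col, row = int(x // block_w), int(y // block_h)
    return row, col  # index of the block containing the UAV's nadir point
```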
The movement of the UAV and the rotation of the camera cause the scene to change; however, the map is immutable, which may make the performance of the feature-matching algorithm unstable and produce poor results in some complex scenes. Inspired by the idea of real-time mapping, we proposed a method to update the map features in real time, so that the map can change according to changes in the external environment. The experiments (Figures 17–26) showed that, in some scenes where feature matching was difficult, the proposed method effectively improved the accuracy of the feature matching and exhibited greater robustness and flexibility. In addition, the proposed method combined global and inter-frame matching to create a more stable registration process: inter-frame matching reduced the fluctuation of global matching, and global matching restricted the cumulative error produced by inter-frame matching, as shown in Figure 16.
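As an illustration of the map-feature update, the following sketch caches the map block's keypoints and descriptors and, after a frame is registered, replaces the cached features inside the frame's footprint with the frame's features projected into map coordinates (the replacement policy shown is illustrative; the paper's exact update rule may differ):

```python
import cv2
import numpy as np

def update_map_features(map_kpts, map_descs, frame_kpts, frame_descs,
                        H, frame_size, map_size):
    """Illustrative update: drop cached map features that fall inside the
    registered frame's footprint and replace them with the frame's features
    projected through the homography H.

    map_kpts: (N, 2) array, map_descs: (N, D) array, cached per map block.
    """
    fw, fh = frame_size
    mw, mh = map_size
    # Footprint of the frame on the map (its four warped corners).
    corners = np.float32([[0, 0], [fw, 0], [fw, fh], [0, fh]]).reshape(-1, 1, 2)
    footprint = cv2.perspectiveTransform(corners, H).astype(np.float32)
    # Keep only the old map features lying outside the footprint.
    keep = np.array([cv2.pointPolygonTest(footprint, (float(x), float(y)), False) < 0
                     for x, y in map_kpts])
    # Project the frame's keypoints into map coordinates; discard off-map ones.
    proj = cv2.perspectiveTransform(
        frame_kpts.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    on_map = (proj[:, 0] >= 0) & (proj[:, 0] < mw) & \
             (proj[:, 1] >= 0) & (proj[:, 1] < mh)
    new_kpts = np.vstack([map_kpts[keep], proj[on_map]])
    new_descs = np.vstack([map_descs[keep], frame_descs[on_map]])
    return new_kpts, new_descs
```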
Indeed, the proposed method also has some limitations. When the camera tilt angle is very large, it produces poor results, and the frames and prior UAV data must be collected synchronously with low delay. In future research, we hope to address these problems.
6. Conclusions
Due to the sparse texture and wide coverage of the map, as well as the large difference between the dynamic UAV aerial image and the static map, it is difficult to accurately register the UAV aerial image and the map using traditional feature-matching algorithms. To solve this problem, in this study, the SuperPoint and SuperGlue algorithms, which are based on deep learning, were used for feature matching. A hierarchical blocking strategy combined with prior UAV data was introduced to improve the matching performance, and the matching information obtained between frames was introduced to render the registration process smoother and more stable. The concept of updating the map features with UAV aerial image features in real time was proposed, rendering the method more adaptable to the changing environment and improving the registration accuracy as well as the robustness and applicability of the algorithm. As a result, the UAV aerial image can be accurately registered on the map in real time, adapting to the changes in the environment and the movements of the camera head. A large number of experiments showed that the proposed algorithm is feasible and practical and has specific application value in the fields of UAV aerial image registration and UAV aerial image target geo-positioning.