1. Introduction
Remote sensing images, being convenient to acquire and intuitive to interpret, have become one of the most effective means for observing, describing and analyzing the characteristics of the Earth's surface. At the same time, the quality of remote sensing images has been improving as sensor imaging technology matures, and image types are becoming increasingly diversified, developing in the direction of multi-modal, multi-spectral, multi-resolution and multi-temporal data [1]. Different types of images contain different information, and the joint use of multi-source remote sensing images to achieve information complementarity has become a hot research topic in recent years. The prerequisite for the joint use of heterogeneous images is their registration.
Over the years, scholars in different countries have published review papers summarizing the existing registration techniques. Zitová et al., in a 2003 review, outlined the basic steps of image registration and classified registration techniques in detail [2]. Image registration consists of five basic steps: feature detection, feature matching, mapping function design, image transformation and resampling [2]. Registration scenarios can be classified according to the method of image acquisition: registration of images from different viewing angles, different time phases, different modalities, etc. Registration techniques, in turn, are subdivided into grayscale-based, feature-based, and transform-domain-based methods, among which feature-based methods are the mainstream. Grayscale-based registration algorithms compute statistics of the grayscale values of multimodal images, measure the similarity of these grayscale features, and complete the registration accordingly. Such algorithms suffer from numerous limitations, especially for heterogeneous images with large differences in grayscale values. Feature-based heterogeneous image registration relies on the point features [3], line features [4] and region features [5] of the image, calculates the similarity of the features of the two images and completes the feature matching [6,7,8]. Li et al. summarized the difficulties and modalities of feature-based registration of visible and SAR remote sensing images [9]. Due to the imaging characteristics of synthetic aperture radar, SAR images often contain speckle noise, which hampers feature extraction during registration. Meanwhile, owing to different radiation characteristics, the same object exhibits different gray values in visible and SAR images. In addition, because of the side-looking imaging geometry of synthetic aperture radar, SAR images exhibit layover, foreshortening and other distortions, which further increase the difficulty of registration. The same survey [9] also summarized the classical operators and algorithms commonly used for optical and SAR image registration. Based on point features: Moravec [10], Harris [11], SUSAN [12], SIFT [13], SURF [14]. Based on line features: ROA [15], registration based on chain codes and line features, methods combining line features and histograms, and methods based on adaptive algorithms and line features. Based on region features: MRF [16], level-set-based methods [17], and multi-scale registration.
Moreover, many deep learning-based methods have been proposed, such as Mu-Net [18], PCNet [19], RFNet [20] and Fourier-Net [21]. To adapt to various types of multimodal images, Mu-Net uses structural similarity to design a loss function that allows it to achieve comprehensive and accurate registration [18]. With the help of phase congruency, PCNet enhances the similarity of multimodal images, thus improving registration performance [19]. RFNet combines image registration and image fusion, and improves the performance of fine registration via feedback from the fusion stage [20]. To conserve resources and improve speed, Fourier-Net replaces the expansive path of a U-Net-style network with a parameter-free, model-driven decoder [21].
In recent years, state-of-the-art heterogeneous image matching algorithms based on point features have included PSO-SIFT [22], OS-SIFT [23], RIFT [24] and LNIFT [25]. PSO-SIFT overcomes the intensity differences of heterogeneous remote sensing images by introducing a new gradient definition, and then refines the matching by combining the positional, scale, and dominant orientation differences between the feature point pairs obtained in the initial matching. Instead of SIFT's strategy of computing gradients by differencing neighboring pixels in the Gaussian scale space of the image, PSO-SIFT uses the Sobel operator to compute the gradients of keypoints in the Gaussian scale space, which in turn improves the computation of the dominant orientation of the feature points. Meanwhile, PSO-SIFT adopts a circular neighborhood with a radius of 12σ and 17 log-polar location bins aligned with the dominant orientation of the keypoint to construct a GLOH-like [26] feature descriptor, instead of the original 4×4 square-sector SIFT descriptor. In the matching process, PSO-SIFT first obtains initial matches with the nearest neighbor ratio algorithm, then optimizes the distance measure for feature point pairs by combining the positional, scale, and dominant orientation differences of the initially matched keypoints, and finally rejects wrong matches with the FSC algorithm [27]. The experimental results show that PSO-SIFT outperforms SURF [14], SAR-SIFT [28] and MS-SIFT [29] in multi-spectral and multi-sensor remote sensing image matching.
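The Sobel-based gradient definition adopted by PSO-SIFT can be illustrated as follows. This is a minimal numpy sketch of the idea only; the naive convolution and function names are ours, not the authors' implementation:

```python
import numpy as np

# 3x3 Sobel kernels for the horizontal and vertical derivatives
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def correlate_valid(img, kernel):
    """Naive 'valid'-mode 2D correlation, sufficient for a 3x3 kernel."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def sobel_gradient(img):
    """Gradient magnitude and orientation (radians) from Sobel derivatives,
    used in place of SIFT's pixel-difference gradients."""
    gx = correlate_valid(img, SOBEL_X)
    gy = correlate_valid(img, SOBEL_Y)
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```

In PSO-SIFT, the resulting magnitude and orientation maps, computed at each level of the Gaussian scale space, feed both the dominant-orientation histogram and the GLOH-like descriptor.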
The OS-SIFT algorithm divides optical and SAR image matching into three steps: first, feature points are detected in the Harris scale spaces of the optical and SAR images, respectively; second, the orientations of the feature points are determined and feature descriptors are constructed; third, the keypoints are matched. The idea of the Harris scale space is derived from the DOG scale space. In the DOG scale space, keypoints are extracted by finding local maxima, whereas in the Harris scale space, corner points are detected with the multiscale Harris function, which is obtained by replacing the first derivatives of the DOG scale space with multiscale gradient computations [23]. To reduce the impact of the significant differences between optical and SAR images on the repeatability of keypoint detection, OS-SIFT adopts different gradient computations for the two modalities when constructing the Harris scale space: the Sobel operator for optical images and the ROEWA [30] operator for SAR images. After the keypoints are detected in the Harris scale space, their positions are refined by the least squares method. Similarly to PSO-SIFT, OS-SIFT uses a circular neighborhood with a radius of 12σ (where σ is the parameter of the first scale in OS-SIFT) and 17 log-polar location bins aligned with the dominant orientation of the keypoint to construct GLOH-like feature descriptors, instead of the original 4×4 square-sector SIFT descriptor. Finally, OS-SIFT matches point pairs by the nearest neighbor ratio and eliminates false matches with FSC [27]. The OS-SIFT experimental results show that it outperforms the SIFT-M [31] and PSO-SIFT algorithms in terms of matching accuracy.
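To make the multiscale Harris detection concrete, the sketch below computes a Harris corner response from a pair of gradient images. In OS-SIFT the gradients at each scale would come from the Sobel operator (optical) or the ROEWA operator (SAR); the box windowing and the constant d = 0.04 here are our illustrative choices:

```python
import numpy as np

def window_sum(a, r=1):
    """Sum each value over a (2r+1)x(2r+1) neighborhood (zero padding)."""
    p = np.pad(a, r)
    out = np.zeros(a.shape, dtype=float)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = p[i:i + 2 * r + 1, j:j + 2 * r + 1].sum()
    return out

def harris_response(gx, gy, d=0.04):
    """Harris corner response det(M) - d * trace(M)^2, where M is the
    windowed structure tensor built from the gradient images."""
    ixx = window_sum(gx * gx)
    iyy = window_sum(gy * gy)
    ixy = window_sum(gx * gy)
    return ixx * iyy - ixy * ixy - d * (ixx + iyy) ** 2
```

The response is positive where gradients vary in both directions (corners) and negative along one-directional edges, which is what makes thresholding it across scales a usable keypoint detector.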
To address the sensitivity of intensity and gradient to nonlinear radiation differences during feature detection and descriptor construction, RIFT proposes a feature detection method based on phase congruency and constructs its descriptor from a maximum index map. The RIFT experimental results show that its matching performance is better than that of SAR-SIFT [28], LSS [32], PIIFD [33] and other algorithms.
LNIFT improves matching performance by reducing the nonlinear radiometric differences of heterogeneous images. LNIFT first applies a mean filter to the multimodal images to obtain normalized image pairs, reducing the modality differences between them. Feature points are then detected on the normalized images with an improved ORB [34] detector, and HOG [35] feature descriptors are constructed on the normalized images to enhance the rotational invariance of the matching. The experimental results show that LNIFT achieves better matching results than SIFT, PSO-SIFT, OS-SIFT and RIFT on multiple multimodal image datasets.
Heterogeneous image matching algorithms based on line features have, in recent years, mainly combined line features with control points derived from them. Meng et al. used Gaussian-Gamma-shaped (GGS) bi-windows [36] and the LSD detector [37] for line feature detection, and then extracted control points as the points to be matched, achieving registration between optical and SAR images [38]. Sui et al. used different line feature extraction methods for optical and SAR images [39]. For optical images, line features are extracted directly with the LSD detector, while SAR images are preprocessed first: the Lee filter [40] is applied to reduce the effect of speckle noise, the edges of the SAR image are then detected with GGS bi-windows, and the line features are finally obtained by the Hough transform [41]. To improve matching accuracy, the line features are extracted on low-resolution images. The transform relationship between the intersections of the line segments is then calculated, which guides the selection of conjugate line segments for fine matching. Finally, the images are matched according to spatial consistency.
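For reference, the Hough transform step applied to the SAR edge map can be sketched as a simple voting accumulator. This is an illustrative numpy version; the angular resolution and integer quantization of rho are our own choices:

```python
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Minimal Hough accumulator: every edge pixel votes for all
    (rho, theta) pairs consistent with it; peaks indicate lines."""
    ys, xs = np.nonzero(edge_mask)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = int(np.ceil(np.hypot(*edge_mask.shape)))
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return acc, thetas, diag
```

Because every pixel on a line votes for the same (rho, theta) cell, the transform is robust to the gaps and spurious edge pixels that speckle leaves behind even after filtering, which is why it suits SAR edge maps better than direct segment fitting.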
Although many matching algorithms for heterogeneous images have been proposed in the past few years, based on both point features and line features, these methods still have several limitations:
Feature detection is confined to either point features or line features alone, and thus cannot combine the advantages of both, which limits feature detection.
The keypoint extraction step in line feature-based matching algorithms is too complicated, which prevents the advantages of line features from being fully exploited.
The construction of the dominant orientation of a keypoint still relies heavily on the intensity and gradient of the local image patch around it. Because the intensity and gradient of heterogeneous images differ nonlinearly, this leads to uncontrollable differences between the dominant orientations of corresponding points in the reference and registration images, which reduces both the registration performance and the rotation invariance of the matching.
In this paper, we address the above limitations by proposing a rotation-invariant matching method for heterogeneous images based on the combination of line features and point features. The proposed method consists of the following two main ideas:
First, we use the LSD algorithm to extract the line features of the heterogeneous image, and then extract points on the straight line segments as keypoints. When extracting feature points on a line feature, we compare the gradient magnitudes of several candidate points along the direction perpendicular to the line segment at each position, and select the point with the largest gradient magnitude as the actual feature point.
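This keypoint selection step can be sketched as follows, assuming the gradient magnitude map has been precomputed. The sample count, search radius, and function names are our illustrative choices, not the exact implementation:

```python
import numpy as np

def keypoints_on_segment(grad_mag, p0, p1, n_samples=10, search_radius=2):
    """At each sample along segment p0 -> p1 (given as (row, col)), probe a
    few pixels along the perpendicular direction and keep the one with the
    largest gradient magnitude as the feature point."""
    p0 = np.asarray(p0, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    direction = (p1 - p0) / np.linalg.norm(p1 - p0)
    normal = np.array([-direction[1], direction[0]])  # unit perpendicular
    h, w = grad_mag.shape
    keypoints = []
    for t in np.linspace(0.0, 1.0, n_samples):
        base = p0 + t * (p1 - p0)
        best, best_mag = None, -1.0
        for s in range(-search_radius, search_radius + 1):
            r, c = np.round(base + s * normal).astype(int)
            if 0 <= r < h and 0 <= c < w and grad_mag[r, c] > best_mag:
                best, best_mag = (r, c), grad_mag[r, c]
        if best is not None:
            keypoints.append(best)
    return keypoints
```

Searching only along the perpendicular keeps the keypoints anchored to the physical edge that produced the line, even when LSD localizes the segment a pixel or two off the true boundary.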
Second, to improve rotation invariance, we no longer determine the dominant orientation of a feature point from the intensity or gradient of its local image patch when constructing the feature descriptor. Instead, we directly assign the tilt angle of the straight line segment from which a keypoint was extracted as its orientation. At the same time, we rotate the image according to the tilt angle and center of the given line segment, and construct HOG-like feature descriptors on the rotated image.
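Assigning the line tilt angle as the keypoint orientation, together with the corresponding rotation about the segment center, can be sketched as below (points are (x, y); an illustrative sketch, not the authors' code):

```python
import numpy as np

def segment_orientation(p0, p1):
    """Tilt angle (radians) of segment p0 -> p1; every keypoint extracted
    from this segment is assigned this angle as its dominant orientation."""
    return np.arctan2(p1[1] - p0[1], p1[0] - p0[0])

def rotate_about(points, center, angle):
    """Rotate (x, y) points about a center, as done to align the local
    patch with the line before building the HOG-like descriptor."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return (np.asarray(points, dtype=float) - center) @ rot.T + center
```

Because the angle comes from the segment geometry rather than from local intensity statistics, corresponding keypoints in the two modalities receive consistent orientations even under strong nonlinear radiometric differences.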
The rest of the paper is organized as follows: Section 2 provides a detailed description of all steps of the methodology. Section 3 describes the datasets used for the experiments. In Section 4, the matching performance and rotation invariance on each dataset are evaluated and discussed qualitatively and quantitatively in turn. In Section 5, future research directions are discussed. Finally, the conclusions are presented in Section 6.
4. Results
In this section, we comprehensively evaluate the performance of our method on all of the datasets listed above. Unlike traditional evaluations that use several or dozens of image pairs, we used 37,500 image pairs for comparison. The proposed method was compared with three baseline or state-of-the-art methods: PSO-SIFT, RIFT, and LNIFT. For fair comparison, we used the official implementations provided by the authors of each method. The keypoint extraction thresholds of RIFT, LNIFT, and our LPHOG method were all set to 5000 per image, and the contrast threshold of PSO-SIFT was set very small (0.0001) to extract as many feature points as possible. All experiments were conducted on a PC with an AMD Ryzen 7 5800H CPU at 3.2 GHz and 16 GB of RAM.
We qualitatively evaluated the matching performance with the correct-match visualizations and checkerboard mosaic maps of the sample images, and quantitatively evaluated the rotation invariance of the method with the number of correct matches (NCM) of the sample images at every rotation angle. Finally, we statistically assessed the matching performance with the average NCM, P1 (the percentage of image pairs with at least 10 correctly matched point pairs), P2 (the percentage of image pairs with at least 100 correctly matched point pairs), and the average RMSE; higher NCM, P1 and P2 are better, while lower RMSE is better. If the NCM of an image pair is not smaller than 10, the pair is regarded as correctly matched, since a very small NCM will make the robust estimation technique fail [25]. If an image pair is not correctly matched (i.e., NCM < 10), its RMSE is set to 20 pixels [25].
The definitions of P1 and P2 are:

P1 = (number of image pairs with NCM ≥ 10) / (total number of image pairs),
P2 = (number of image pairs with NCM ≥ 100) / (total number of image pairs).

The RMSE is computed as [46]

RMSE = sqrt( (1/N) · Σ_{i=1}^{N} ‖(x_i, y_i) − H(x′_i, y′_i)‖² ),

where N is the number of correctly matched keypoints after the fast sample consensus (FSC), (x_i, y_i) are the coordinates of the i-th matched keypoint in the reference image, and H(x′_i, y′_i) denotes the coordinates of its correspondence in the registration image transformed by the estimated transformation matrix H.
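Under these definitions, the per-dataset statistics can be computed as in the following sketch (the helper names are our own; the 10-match threshold and the 20-pixel failure penalty follow the protocol above):

```python
import numpy as np

def rmse(ref_pts, transformed_pts):
    """RMSE over the correctly matched keypoints of one image pair."""
    d = np.asarray(ref_pts, dtype=float) - np.asarray(transformed_pts, dtype=float)
    return float(np.sqrt(np.mean(np.sum(d * d, axis=1))))

def summarize(ncm_list, rmse_list, fail_rmse=20.0):
    """P1, P2, average NCM and average RMSE over a set of image pairs;
    a pair with NCM < 10 counts as failed and contributes fail_rmse."""
    ncm = np.asarray(ncm_list)
    p1 = float(np.mean(ncm >= 10))
    p2 = float(np.mean(ncm >= 100))
    rmses = [r if n >= 10 else fail_rmse for n, r in zip(ncm, rmse_list)]
    return p1, p2, float(ncm.mean()), float(np.mean(rmses))
```

The fixed penalty for failed pairs is what couples the average RMSE to P1: a method that fails on many pairs accumulates 20-pixel terms regardless of how accurate its successful matches are.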
4.1. Parameter Settings
Only one parameter (in Equation (15)) affects the performance of our method. We used one value of this parameter for Dataset 1 and Dataset 2, and a different value for Dataset 3.
4.2. Evaluation of Dataset 1
4.2.1. Qualitative Evaluation of Dataset 1
We selected three pairs of images from the optical–SAR dataset for qualitative comparison. Figure 22, Figure 23 and Figure 24 show the registration results for Dataset 1 before rotation and after each of the two test rotation angles, respectively. Figure 25 shows the checkerboard mosaic images of our LPHOG method in the same three situations. As shown in Figure 22, PSO-SIFT could only match the third pair of images, and could not correctly match the first and second pairs. After either test rotation, PSO-SIFT could not match any of the three pairs correctly, indicating that its overall matching performance was poor and it had almost no rotation invariance. Without rotation, RIFT could match all three pairs well, but its matching performance on the first and second pairs decreased significantly after the first test rotation, and improved somewhat after the second test rotation compared with the first, which means that the matching performance of RIFT was good but its rotation invariance was not robust. LNIFT could correctly match the first and second pairs without rotation but failed on the third pair, and could correctly match only the second pair after either test rotation, which indicates that the overall matching performance of LNIFT was stronger than that of PSO-SIFT but weaker than that of RIFT, and that its rotation invariance was not robust. On the whole, the number of matched point pairs obtained with our LPHOG method was significantly higher than that of the above three methods in all three situations, and the number of matched point pairs did not decrease significantly (and sometimes even increased) when the rotation angle changed, which indicates that the matching performance of our method was significantly better than that of PSO-SIFT, RIFT, and LNIFT, and that our method has good rotation invariance. Further, the checkerboard mosaic images in Figure 25 show that all of our image matching accuracies were high.
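The checkerboard mosaics used in this qualitative evaluation can be produced as in the sketch below (our own minimal implementation for single-channel registered images; the cell size is arbitrary):

```python
import numpy as np

def checkerboard_mosaic(img_a, img_b, cell=64):
    """Interleave two registered single-channel images in a checkerboard
    pattern; misalignment shows up as broken edges at the cell borders."""
    assert img_a.shape == img_b.shape
    h, w = img_a.shape
    yy, xx = np.mgrid[0:h, 0:w]
    take_b = ((yy // cell + xx // cell) % 2).astype(bool)
    return np.where(take_b, img_b, img_a)
```

If the registration is accurate, roads and field boundaries run continuously across the cell borders; any residual transform error appears as a visible jump at those borders.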
4.2.2. Quantitative Evaluation of Dataset 1
Figure 26 shows the variation in NCM with rotation angle for the first, second and third pairs of sample images in Dataset 1, respectively. We rotated the registration image of each sample image pair, with rotation angles from −180° to 180° at an interval of 15°; thus, a total of 25 registration images were obtained for each image pair (the rotation angles were −180°, −165°, …, 165°, 180°) [24]. As a whole, the NCM of our LPHOG method was at least three times that of the other three algorithms (PSO-SIFT, RIFT, LNIFT) at every rotation angle, and at most up to 20 times higher. For the first pair, the number of matched pairs for PSO-SIFT stayed close to 0. LNIFT achieved more than 50 correct matches at only one rotation angle, and its numbers of matched pairs at the other rotation angles were close to those of PSO-SIFT. The matched points for RIFT fluctuated, but the fluctuation was relatively small, staying around a small value of about 20 matched pairs. The number of correctly matched point pairs of our method stayed above 300 for most rotation angles, and only dropped below 250 at a few angles. For the second pair, our method maintained over 190 matched point pairs for the majority of angles, only dropping to near 160 at one rotation angle. PSO-SIFT, as for the first pair, maintained a number of matched pairs near 0. LNIFT worked better than RIFT for the majority of small rotation angles, but RIFT was stronger than LNIFT at large rotation angles. For the third pair, our method consistently maintained more than 250 correctly matched point pairs, while RIFT, the best-performing of the other three methods, consistently obtained fewer than 1/5 of ours.
Figure 27 shows the changes in P1, P2, average NCM, and average RMSE over the whole Dataset 1, respectively. The P1 of our method was always equal to 1, and our P2 basically stayed above 0.75, which indicates that our method could correctly match all of the images in Dataset 1 at every rotation angle, and that most of the images were matched well at all rotation angles. The P1 of RIFT fluctuated between 0.5 and 0.8, and its P2 stayed near 0 at most rotation angles, which indicates that RIFT could correctly match most of the images at most rotation angles, but that its overall matching performance was poor. At small rotation angles, the P1 of LNIFT stayed above 0.6, but it declined very obviously as the rotation angle grew (the change in P1 was approximately symmetric for negative and positive angles). The P2 of LNIFT varied with the rotation angle like a parabola and dropped to 0 at large rotation angles, which indicates that the rotation invariance of LNIFT was poor. The correlation between the average RMSE and P1 was quite obvious: the average RMSE of our method stayed near 1.12, while the RMSE of the other methods, penalized through their lower P1, remained at a higher level. The average number of matched point pairs for PSO-SIFT stayed near 5; that of RIFT floated around 15–20; and that of LNIFT showed a similar parabolic shape, reaching its lowest level at large rotation angles. Our LPHOG method stayed around 135, indicating that the overall matching performance of our method is very good and robustly rotation invariant.
The comparisons show that our method has wide applicability to the registration of many optical images and SAR images and adapts to many rotation angles. The matching performance of our method is very good, the number of matched pairs of points is always high, and the RMSE is always low. The rotation invariance of our method is highly robust.
4.3. Evaluation of Dataset 2
4.3.1. Qualitative Evaluation of Dataset 2
We selected three pairs of sample images from Dataset 2 to demonstrate the matching performance on optical and infrared images. Figure 28, Figure 29 and Figure 30 show the matching performance on the sample images without rotation and after each of the two test rotation angles, respectively. Figure 31 shows the checkerboard mosaic images of our LPHOG method in the same three situations. Without rotation, PSO-SIFT could only correctly match the second pair of images; LNIFT could only correctly match the first and second pairs; RIFT and our method could both correctly match all three pairs. After the first test rotation, PSO-SIFT failed on all three pairs; RIFT correctly matched only the third pair; LNIFT correctly matched only the second pair; and our method correctly matched all three pairs. After the second test rotation, PSO-SIFT and LNIFT failed on all three pairs; RIFT correctly matched the first and second pairs; and our method correctly matched all three pairs. As seen from the checkerboard mosaic images in Figure 31, our method matched all three pairs with high accuracy in all three situations. The comparisons show that our method outperformed the other three methods in terms of matching performance, and especially in terms of rotation invariance.
4.3.2. Quantitative Evaluation of Dataset 2
Figure 32 shows the variation in NCM with rotation angle for the first, second and third pairs of sample images, respectively. We rotated the registration image of each sample image pair, with rotation angles from −180° to 180° at an interval of 15°; thus, a total of 25 registration images were obtained for each image pair [24]. As a whole, our LPHOG method obtained more correctly matched point pairs than the other three methods at every rotation angle, maintaining more than 100 correctly matched point pairs for the first pair, more than 50 for the second pair, and more than 70 for the third pair at every rotation angle. The worst of the other methods, PSO-SIFT, obtained close to 0 correct matches at every rotation angle for the first and third pairs; for the second pair, it was close to 0 at all but two angles. For the first and third pairs, the number of matched points for RIFT fluctuated between 20 and 75, reaching up to 1/2 of ours at a few rotation angles. For the second pair, the NCM of RIFT was in the range of 5–25, reaching 1/10 of ours at a few rotation angles. For all of the sample images, the NCM of LNIFT varied basically symmetrically between negative and positive rotation angles. The number of correctly matched point pairs for LNIFT remained at a relatively high level at small rotation angles, but was still far below that of our method, and approached 0 at most large rotation angles.
Figure 33 shows the changes in P1, P2, average NCM, and average RMSE over the whole Dataset 2, respectively. As a whole, our P1, P2 and average NCM were larger, and our average RMSE smaller, than those of PSO-SIFT, RIFT and LNIFT at every rotation angle, which indicates that the matching performance, rotation invariance and matching accuracy of our algorithm were all better than those of these three algorithms. The P2 and the average number of matched points of our method jumped more significantly at −180°, −90°, 0°, 90° and 180°, which should be because line feature detection is not affected by image borders at these rotation angles (multiples of 90°). How to eliminate the influence of image borders on line feature detection at other rotation angles will be one of our key research topics in the future.
The comparisons show that our LPHOG method has wide applicability to the registration of many optical images and infrared images, and adapts to many rotation angles. The number of matched pairs of points is always high, and the RMSE is always low. The rotation invariance of our method is highly robust for optical and infrared image registration.
4.4. Evaluation of Dataset 3
4.4.1. Qualitative Evaluation of Dataset 3
We selected three pairs of sample images from Dataset 3 to compare the matching performance on optical-to-optical matching. Figure 34, Figure 35 and Figure 36 show the matching performance on the sample images without rotation and after each of the two test rotation angles, respectively. Figure 37 shows the checkerboard mosaic images of our method in the same three situations. Without rotation, PSO-SIFT and RIFT could correctly match all three pairs of images; LNIFT could only match the second pair; our LPHOG method could correctly match all three pairs, with a number of matched points much higher than that of the other three methods. As shown in Figure 35, after the first test rotation, PSO-SIFT and LNIFT failed on all three pairs, while both RIFT and our method matched all three pairs correctly. As shown in Figure 36, after the second test rotation, LNIFT failed on all three pairs; PSO-SIFT correctly matched only the third pair; RIFT matched all three pairs correctly, but with a lower number of matched pairs; and our method matched all three pairs correctly with high numbers of matched pairs. As shown in Figure 37, the matching accuracy of our method was very high. (Some regions are misaligned because the two images were acquired at different times, so the sizes and shapes of the corresponding objects do not exactly coincide.)
4.4.2. Quantitative Evaluation of Dataset 3
Figure 38 shows the variation in NCM with rotation angle for the first, second and third pairs of sample images, respectively. We rotated the registration image of each sample image pair, with rotation angles from −180° to 180° at an interval of 15°; thus, a total of 25 registration images were obtained for each image pair [24]. The number of correctly matched point pairs of our method was much higher than that of the other three methods for all three sample pairs. For the first pair, the NCM was greater than 500 for most rotation angles and not less than 450 for all rotation angles; for the second pair, it was higher than 200 for most rotation angles; and for the third pair, it was higher than 250 for most rotation angles, which indicates that our method has very good rotation invariance.
Figure 39 shows the changes in P1, P2, average NCM, and average RMSE over the whole Dataset 3, respectively. At most rotation angles, the P1 of PSO-SIFT fluctuated between 0.2 and 0.4, jumping to between 0.57 and 0.8 at −180°, −90°, 0°, 90° and 180°, which indicates that PSO-SIFT detected many points on the image borders but could not eliminate their influence, reducing its matching performance. The P1 and P2 of LNIFT were relatively high in a small range of angles around 0°, where its P1 exceeded 0.7 and its P2 exceeded 0.57; however, both decreased rapidly at larger rotation angles (the changes in P1 and P2 of LNIFT were approximately symmetrical for negative and positive angles), which indicates that LNIFT matches well at small angular differences but poorly at large ones. The P1 of our LPHOG method always equaled 1 at all rotation angles, and the P1 of RIFT was close to 1 at all rotation angles, indicating that the applicability of both our method and RIFT is good. However, the P2 of our method was significantly higher than that of RIFT and was always very close or equal to 1, indicating that our method not only has wide applicability but also very good rotation invariance. The average RMSE of our method stayed around 1.114, lower than that of RIFT and much lower than those of PSO-SIFT and LNIFT, indicating that the matching accuracy of our method is very high overall. As a whole, the average NCM of our method was significantly better than those of the other three methods, more than 2.5 times theirs at every rotation angle and higher than 400 at most rotation angles, which indicates that our algorithm is significantly better than the other three.
The comparisons show that our method has wide applicability to the registration of many optical images with optical images, and adapts to many rotation angles. The number of matched pairs of points is always high, and the RMSE is always low. The rotation invariance of our method is highly robust for optical and optical image registration.