1. Introduction
Although the latest imaging sensors are equipped with advanced positioning systems, the latitude and longitude information of geocoded remote sensing images still contains inevitable errors [1,2,3]. Remote sensing image registration is the procedure of spatially aligning different images of the same region, and it is indispensable for any multi-temporal or multi-sensor remote sensing application, such as change detection and image fusion. Optical and synthetic aperture radar (SAR) sensors are the two primary means of obtaining high-spatial-resolution imagery of the earth's surface from a long distance, such as from a satellite, and they reveal distinct and complementary ground characteristics. Therefore, their combined use has attracted considerable attention in the research community [4,5,6], for which optical-SAR image registration is still a nontrivial issue that needs to be better resolved.
Current studies on optical-SAR registration, whether handcrafted methods [7,8,9,10,11,12,13,14,15,16,17,18,19] or deep learning-based ones [20,21,22,23,24,25,26,27,28,29,30], mainly focus on the vast radiometric and geometric disparity between the two modalities, which makes it quite difficult to obtain a sufficient number of reliable control points (CPs) sparsely distributed across the input image pairs. After the putative CPs are obtained, outlier removal and image warping are mostly conducted under the assumption that the geometric relationship between the input optical-SAR image pair can be described by a linear model, such as an affine or projective transformation. This linear assumption only holds for image pairs of flatlands. However, only a small percentage of the global landmass can be considered strictly flat. When the imaging area contains noticeable topographic fluctuations, two images acquired from different viewpoints will present unavoidable local geometric distortions. This distortion can be more serious for high-resolution optical-SAR image pairs due to the range-imaging nature of SAR sensors, which produces foreshortening and layover effects [
31]. Also, the DEM (digital elevation model) data used for geometric calibration are usually of low spatial and elevation resolution. Several pixelwise dense registration approaches based on the optical flow technique have been proposed to deal with the local geometric distortion problem [32,33,34]. However, pixelwise registration has a high memory demand. It also fails when the ground relief changes or when the initial displacement is too large, say more than 50 pixels.
In this article, we investigate the registration problem of large optical-SAR image pairs with non-flat terrains and high spatial resolution, for which a unified linear geometric relationship no longer holds. This leads to great difficulty not only for sparse feature point matching but also for the outlier removal and image warping processes. These are the two obstacles that prevent current optical-SAR image registration frameworks from properly handling images with non-flat terrains.
As for the sparse feature point matching issue, when the images to be co-registered are of flatlands, a small number of sparse CPs is sufficient to acquire a good estimation of the affine or projective transformation. For non-flat terrains, however, a much denser set of sparse landmarks is required, so that a geometric model with a higher dimension, or one for each local area, can be estimated. Many effective algorithms that generate homogeneous features from heterogeneous optical and SAR image pairs have been proposed for higher matching accuracy. Generally, they can be classified into two types, as shown in Figure 1.
The first type is the SIFT-like detection-then-description approach, as shown in Figure 1a, which tries to identify repeatable feature points across the whole input image pair; putative correspondences are then obtained based on a feature descriptor similarity measurement. To cope with the vast modal differences between optical and SAR images, delicately designed feature point detectors and descriptors have been proposed. For example, the ALGH method [7] uses the Harris–Laplace Sobel operator for feature point detection on the optical image and the Harris–Laplace ROEWA operator on the SAR image; a GLOH-like descriptor is then constructed using the amplitudes of multi-scale, multi-orientation log-Gabor responses. The OS-SIFT method [8] adopts a similar strategy, using two distinct Harris scale spaces to obtain consistent gradients from optical and SAR images for repeatable feature point detection. The RIFT method [9] exploits the fact that the phase congruency (PC) maps of different image modalities share more structural information than the intensity images; therefore, both feature point detection and feature description are conducted on the PC maps. In LNIFT [10], a local normalization filter in the spatial domain is proposed to initially narrow the radiometric differences between the multi-modal images; an improved ORB keypoint detector and a HOG-like descriptor are then applied to the filtered images. Although these studies have made noticeable progress on the registration of optical and SAR images of flat terrain, they are likely to be inapplicable to non-flat terrains, because the detection-then-description paradigm usually cannot produce a sufficient number of repeatable sparse CPs from highly heterogeneous optical and SAR images, and because the outlier removal process relies heavily on the geometric constraints among the inlier CPs.
The second scheme to obtain sparse correspondences is the template matching technique, as shown in Figure 1b, which first applies blockwise Harris (or FAST, ORB, etc.) corner detection to the reference image to obtain an evenly distributed point set. The correspondences on the sensed image are then identified based on a template feature similarity measurement using a local searching strategy, assuming that the image pair has been coarsely registered using the geo-information. In research following this paradigm, the main effort is put into the template feature descriptor design or learning process, so as to more reliably measure the feature similarity between optical and SAR image templates. Representative handcrafted methods include MIND [11], HOPC [12], CFOG [13], SFOC [14], OS-PC [15,16], AWOG [17], and HOPES [18]. For example, the MIND method uses the self-similarity theory to extract image structures that are preserved across modalities. The HOPC and CFOG methods both use pixelwise HOG-like descriptors to collect similar features from multi-modal images. The AWOG method uses a feature orientation index table to build the pixelwise descriptor. SFOC combines first- and second-order gradient information using steerable filters to obtain more discriminative structure features. The HOPES method extracts the primary edge structure using Gabor filters and conducts an edge fusion algorithm to obtain shared features from optical and SAR images. In recent years, many deep learning-based methods have emerged [20,21,22,23,24,25,26,27,28,29,30], in which diverse Siamese or pseudo-Siamese convolutional neural network architectures are designed to learn shared features from optical and SAR images. In [27], the CNN feature extractor produces pixelwise deep features, mimicking the handcrafted methods. The authors of [28] argue that both shallow and deep features should be incorporated into the feature matching process, so as to obtain not only better feature discriminability but also finer feature localization precision; a self-adaptively weighted loss function is also introduced to obtain better training results. In [29], three different CNNs are designed and trained for feature point detection, feature matching, and outlier removal, respectively. In [30], a residual denoising network is incorporated into the pseudo-Siamese CNN to alleviate the influence of speckle noise in SAR images.
Since this template matching paradigm usually adopts the blockwise feature point detection strategy, putative sparse CPs can be obtained as densely as desired. However, the high heterogeneity of optical and SAR images inevitably leads to a large number of outliers among the putative matches. The outlier ratio varies drastically across landcover types, depending on the texture similarity and discernibility between the optical and SAR images to be co-registered. This leads to the second obstacle: outlier removal.
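To make the blockwise detection and local searching pipeline of the template matching paradigm concrete, the following is a minimal, illustrative sketch, not the method proposed in this article. Plain normalized cross-correlation stands in for the multi-modal descriptors discussed above, and all function names and parameters (`block`, `tpl`, `search`) are hypothetical.

```python
import numpy as np

def blockwise_points(img, block=64, margin=16):
    """Pick the strongest-gradient pixel per block (a simple stand-in for
    a blockwise Harris/FAST detector), so points spread evenly."""
    h, w = img.shape
    gy, gx = np.gradient(img.astype(float))
    strength = gx**2 + gy**2
    pts = []
    for r in range(margin, h - margin, block):
        for c in range(margin, w - margin, block):
            tile = strength[r:r + block, c:c + block]
            dr, dc = np.unravel_index(np.argmax(tile), tile.shape)
            pts.append((r + dr, c + dc))
    return pts

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    d = np.sqrt((a**2).sum() * (b**2).sum())
    return float((a * b).sum() / d) if d > 0 else 0.0

def local_search_match(ref, sen, pt, tpl=16, search=10):
    """Slide the reference template over a (2*search+1)^2 neighborhood on
    the sensed image; return the best-scoring offset and its score."""
    r, c = pt
    t = ref[r - tpl:r + tpl, c - tpl:c + tpl]
    best, best_off = -1.0, (0, 0)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            w = sen[r + dr - tpl:r + dr + tpl, c + dc - tpl:c + dc + tpl]
            if w.shape != t.shape:
                continue
            s = ncc(t, w)
            if s > best:
                best, best_off = s, (dr, dc)
    return best_off, best
```

The local searching strategy works because the coarse geo-registration bounds the residual displacement, so only a small neighborhood of each reference point needs to be scanned.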
The outlier removal issue has been extensively researched in the fields of photogrammetry and computer vision. It is a critical step in the pipeline of many applications, such as structure-from-motion (SfM) [35], simultaneous localization and mapping (SLAM) [36], multi-view stereo [37], visual odometry [38], and image registration [39]. Many different techniques have been proposed. The RANSAC technique [40] randomly and repeatedly selects a small initial point set and enlarges it by finding the points that are geometrically consistent with the set. To date, RANSAC has been the most robust and widely applied method in remote sensing image registration research and applications. Numerous modified approaches have been proposed to improve the time efficiency and accuracy of classical RANSAC, such as the maximum likelihood estimation sample consensus (MLESAC) method [41], the least median of squares (LMEDS) method [42], and the fast sample consensus (FSC) method [43]. Note that these RANSAC-like approaches can only identify outliers under the assumption that all the inliers obey a unified linear spatial relationship (an affine or projective transform), which can be depicted by a 3 × 3 matrix. To distinguish outliers when this linear geometric relationship does not hold, non-parametric and graph-based methods have been widely exploited.
The non-parametric methods define deformation functions in a high-dimensional form. For example, the coherent point drift (CPD) method [44] formulates the matching problem as the estimation of a mixture of densities using Gaussian mixture models. The vector field consensus (VFC) method [45] models the movement consensus in a vector field, operating in a vector-valued reproducing kernel Hilbert space. The graph-based methods rely on the assumption of geometric consensus among neighboring inliers. For example, the grid-based motion statistics (GMS) method [46] identifies the inliers by checking whether adjacent points are close to each other in both images. The locality-preserving matching (LPM)-based methods [47,48] assume that the inliers should preserve the spatial neighborhood relationship and structure. Although non-parametric and graph-based approaches can deal with non-linear situations and have been widely adopted in the computer vision field, they are rarely applied to the remote sensing image registration task. This is because they all require very dense sets of sparse correspondences and usually fail when the inliers are dispersedly distributed [49].
Note that the majority of the earth's surface is non-flat. Especially for high-resolution optical and SAR image pairs, even slight elevation variations produce noticeable local geometric distortion. However, the registration of large optical-SAR images of rough terrains has rarely been addressed in current studies, due to the low matching accuracy caused by the extreme radiometric and geometric disparity, as well as the difficulty of outlier removal caused by the spatially varying geometric relationships. This work tries to deal with these problems, with the main contributions as follows:
Under the template matching paradigm, we propose to use the phase congruency map of the textureless and noisy SAR image to obtain an evenly distributed point set, which guarantees that each feature point is surrounded by salient local structures that help to increase the matchability. The putative sparse correspondences on the optical image are acquired using learned deep features drawn from a very large local image patch (641 × 641 pixels), which significantly increases the matching accuracy. Meanwhile, an adaptive search range is used within the local searching pipeline. In this way, we obtain a collection of very densely distributed sparse CPs with a quite low outlier ratio.
An effective outlier removal and transformation estimation procedure is proposed for putative CPs that do not obey a unified geometric constraint. Firstly, taking advantage of the different mechanisms of sparse matching and optical flow-based dense matching, we propose a mutual verification-based outlier removal method, by which unreliable CPs are initially filtered out without any assumption about the geometric constraint. Secondly, we assume that, except for mountainous areas with extremely sharp elevation variations, most of the ground surface can be considered locally flat. A recursive RANSAC method is proposed to automatically cluster the CPs into different point sets, with each set located in a locally flat image area, which can then be co-registered using a linear geometric transformation. As for the mountainous areas, small subsets of inlier CPs are identified and used as anchor points, so as to preliminarily remove the large positioning error of the mountainous areas before the subsequent optical flow-based image warping.
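The mutual verification idea can be illustrated with a small sketch: a sparse CP is kept only when its displacement agrees with the dense optical-flow displacement sampled at the same reference location. The function name, threshold, and data layout below are hypothetical; the actual method operates on deep-feature matches and an optical-flow field computed from the input images.

```python
import numpy as np

def mutual_verify(cps, flow, tol=5.0):
    """Cross-check sparse matches against a dense displacement field.

    cps  : (N, 4) array of (x_ref, y_ref, x_sen, y_sen) putative matches
    flow : (H, W, 2) dense displacement field (dx, dy) on the reference grid
    tol  : max allowed disagreement in pixels between the two estimates
    Returns a boolean mask of the CPs that pass the verification.
    """
    keep = []
    for x_ref, y_ref, x_sen, y_sen in cps:
        dx_s, dy_s = x_sen - x_ref, y_sen - y_ref              # sparse estimate
        dx_f, dy_f = flow[int(round(y_ref)), int(round(x_ref))]  # dense estimate
        keep.append(np.hypot(dx_s - dx_f, dy_s - dy_f) < tol)
    return np.array(keep)
```

Because the two estimates come from independent mechanisms, their agreement is evidence of reliability without assuming any particular geometric model, which is exactly what the non-linear terrain setting requires.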
Extensive experiments are conducted to evaluate the effectiveness of the proposed sparse matching, outlier removal, and transformation estimation methods. The results show that the proposed sparse matching method produces a significant increase in matching accuracy, from about 30% to 100%. The subsequent mutual verification-based outlier removal strategy further filters out about 30% of the outliers. Also, compared with the other well-established methods, the proposed non-flat image warping process is able to produce both robust and accurate registration results for diverse landcover and landscape types.
The overall framework of the proposed optical-SAR image registration method is shown in Figure 2.
4. Discussion
For the registration of remote sensing images with complex terrain, the essential problem is to obtain a set of reliable sparse corresponding feature points that are distributed evenly and densely across the input image and present a low outlier ratio. Owing to the proposed SAR-PC-Moment-based feature point detector and especially the big template matching strategy, the outlier ratio of the putative sparse matches is reduced from more than 50% to less than 30%. The proposed mutual verification-based outlier removal method further filters out about 30% of the outliers, reducing the outlier ratio to less than 20% in general. This result is very meaningful for the subsequent geometric relationship estimation procedure. Based on the proposed recursive RANSAC method, the input large image is automatically segmented into locally flat and non-flat areas, and different registration strategies are applied to the two landscape types. In this way, a very robust registration result can be obtained, with higher matching precision for flat regions and a relatively larger matching error where local topographic fluctuations exist.
The proposed registration framework can be applied not only to optical-SAR images but also to any image types that exhibit spatially varying geometric relationships, such as high-resolution optical–optical, SAR–SAR, and optical–LIDAR image pairs. Therefore, it can act as a reliable technique for supporting information fusion applications using multi-temporal and multi-modal remote sensing images of various landscape types.
Still, there are shortcomings and unsolved problems. Firstly, the time consumption of sparse matching is quite high, since very large image templates, sized at 641 × 641 pixels, are used. This time consumption can be reduced by decreasing the density of the sparse feature points, but doing so would probably result in a higher registration error. Secondly, as shown in Table 8, the non-rigid CPD method is able to produce better registration accuracy for the first two datasets, which are composed of more flat terrains, as shown in Figure 5. This result indicates the potential to further improve registration accuracy by combining the concept of probabilistic non-rigid registration with the proposed method, as long as densely distributed sparse matches with a low outlier ratio are obtained beforehand. The third problem is that many image areas of mountainous terrain remain unregistered, due to the lack of reliable sparse matches caused by extreme local geometric distortions. In [58], a geometric registration approach using the DEM or DSM is proposed, which remaps the central perspective projection of the optical sensor into the side-looking geometry of the SAR sensor. However, this method requires high geo-location accuracy for both the optical and SAR imaging systems as well as precise DEM or DSM information, all of which are difficult to obtain. The registration of optical and SAR images of mountainous areas is still an open question.