1. Introduction
Remote sensing technology has the advantages of a large detection range, fast data acquisition, and few restricted conditions [
1,
2,
3]. Three-dimensional (3D) reconstruction based on remote sensing images has been widely applied in urban planning [
4], geological surveying [
5], unmanned driving [
6], and so on. Currently, the most commonly used method for the 3D reconstruction of satellite remote sensing images is to obtain the disparity map of the image pair using stereo matching, and the real geographic coordinates and elevation are then calculated using the rational function model [
7,
8,
9,
10,
11,
12,
13].
A stereo algorithm consists of four steps: cost computation matching, cost aggregation, disparity computation (optimization), and disparity refinement. Traditional stereo matching methods usually utilize the low-level features of image patches around the pixel to measure the dissimilarity. Local descriptors, such as absolute difference (AD), the sum of squared difference (SSD), census transform [
14], or their combination (AD-CENSUS) [
15], are often employed. For cost aggregation and disparity optimization, some global methods treat disparity selection as a multi-label learning problem and optimize a corresponding 2D graph partitioning problem by graph cut [
16] or belief propagation [
17,
18]. Semi-global methods approximately solve the NP-hard 2D graph partitioning by factorizing it into independent scanlines and leveraging dynamic programming to aggregate the matching cost [
19,
20,
21,
22]. Compared with ordinary optical images, objects in satellite remote sensing image sequences often have large deformations, and the stereo matching of this kind of image is so difficult that the matching rate generally drops. There is a greater need for the disparity refinement method to correct and fill the disparity.
The traditional refinement step includes a left-right check [
23,
24], hole filling, and a smoothing filter. Supikov proposed a method of decoupling the resolution of the solver grid from the resolution of the disparity map, and holes are filled by amplification of the low-resolution grid [
25]. Jiao proposed a secondary disparity refinement. Small hole regions are first filled by the most appropriate disparity in its neighborhood. Inconsistent regions are filled by detecting and checking whether the edges of the disparity map coincide with the boundaries of objects in the scene. These regions are filled by the modified OccWeight. OccWeight matrix is calculated for each pixel q in a cross window of the pixel p to be filled. The weight calculated by the color difference and the spatial distance indicates the similarity between p and q. The filling disparity is that of the pixel with largest weight [
26]. Huang proposed dual-path depth refinements using the cross-based support region by referring texture features to correct the inaccurate disparities [
27]. Their methods also use two-phase refinement, which is different from the other method [
26]. Yan et al. proposed a disparity refinement method based on super-pixel segmentation and RANSAC plane fitting. They first segmented the image into super-pixels, and then used the RANSAC to fit each super-pixel plane. For the fitted super-pixels, the disparity refinement with occlusion processing was implemented based on Markov random field [
28,
29].
Other disparity refinement methods are outlier detection and removal [
30,
31,
32,
33]. Qin improved the matching accuracy of deep discontinuous areas by detecting matching lines [
34].
The above method uses color, texture, and other features to calculate the similarity of the hole pixels and the neighbor pixels with disparity. The similarity is used as a weight to calculate the filling disparity. The disparity range of remote sensing images is large, especially when there are tall buildings in them, and sometimes shear deformation exists in the disparity maps. In such a case, pixels with the same color do not have the same disparity, and these methods may be not suitable. Even if there is shear deformation, the same planes in the left and right images still follow the affine transformation. Thus, the planes should be fitted first and then the missed disparities could be filled according to the plane affine equations, which will provide more promising results.
There are many planes in urban remote sensing images. Some studies using ‘PatchMatch’ have considered plane constraints [
35,
36,
37]. These methods estimated the plane using sparse matching in advance, and then the plane constraints were then incorporated into the cost function. Hou et al. extended ‘PatchMatch’ into an integrated depth map reconstruction method that combined camera parameters and used it in the multi-view depth map reconstruction of aerial remote sensing images. In their study the intrinsic parameter matrix, rotation and translation matrix of the camera were used in plane model. These parameter matrices are similar with ordinary images [
38]. However, for satellite remote sensing images the real 3D coordinates are calculated based on rational polynomial coefficient model which is completely different from ordinary images. Their method cannot be used in the 3D reconstruction of the satellite images.
Planes estimated based on sparse matching are not accurate enough. To solve this problem, based on the initial disparity estimation, we used a plane segmentation method to estimate accurate planes, and the disparity was then refined according to plane equations. Plane segmentation methods are divided into four categories, which are those based on region growth, model fitting, feature clustering, and the global energy function [
39]. Region growth-based methods are not robust to noise, varying point densities, or occlusion [
40,
41].
In the model fitting methods, although the reported random sample consensus (RANSAC) [
42] and Hough transform (HT) [
43] approaches work very well for 3D plane segmentation on point clouds with low levels of noise and clutter, these algorithms still have some disadvantages [
39]. Running RANSAC sequentially to detect multiple planes is widely known to be sub-optimal since the inaccuracies in detecting the first planes will heavily affect the subsequent planes [
44]. Despite the popularity and efficiency of feature clustering-based methods, they have difficulty in neighborhood definition and are sensitive to noise and outliers [
45]. The energy optimization-based methods are more robust to high levels of noise and clutter compared with the other methods [
46]. However, these methods are computationally expensive for plane segmentation [
44] and depend heavily on the adequacy of initial input.
Because there is a certain amount of noise in the disparity map, and some scattered small areas are separated by missed matching pixels, most of plane segmentation methods are not suitable for the application in this paper. We designed a fast initial segmentation algorithm based on mean-shift that provides accurate input for energy minimization, and it has certain robustness to noise and can approximate the surface through the planes. The initial segmentation based on mean-shift and energy minimization constitutes the plane segmentation algorithm in this paper. Based on this algorithm, the disparity is refined according to the plane segmentation and fitting results. After that, missed matching areas are filled and noise pixels are removed to improve the stereo matching effect. We combined the energy optimization with the super-pixel segmentation method. There were few planes involved, and the alpha-expansion algorithm can achieve fast energy minimization [
47].
For ordinary images, the stereo matching methods are evaluated using ground truth disparity maps. There are no such disparity maps for remote sensing images, so the evaluation must be performed by elevation maps. However, for satellite remote sensing images, the elevation calculation is complicated, and different elevation calculation methods may derive different results. For many images, corresponding elevation maps cannot be obtained. Images from different satellites vary drastically, and the evaluation results for the current satellite image may not be suitable to other satellite images. To solve this problem, we propose a new evaluation measure, called the edge matching rate (EMR) and the edge matching map (EMM). The performance of the matching methods can be evaluated to a certain extent in the absence of disparity and elevation ground truth.
The rest of this paper is organized as follows: In
Section 2, we describe our proposed method including the framework, mean-shift-based plane segmentation, and disparity refinement. In
Section 3, EMR and EMM are presented and experimental results on ordinary optical and remote sensing images prove that they are effective in evaluation and analysis of the matching results. The mean-shift-based plane segmentation is then evaluated on a standard dataset. Finally, results on disparity refinement are illustrated to demonstrate the performance of our method.
Section 4 provides our conclusions with some ideas for further work.