1. Introduction
In recent years, optical video satellites have developed rapidly, beginning with the first SkySat series of small video satellites launched by Skybox Imaging in the United States in 2013. The video camera carried on the satellite can continuously observe a moving process in the form of a video recording [1], with observation times of up to 120 s. Compared with traditional satellite images, video images can capture changes in a specific area over a short period of time and can be effectively applied to the real-time monitoring of natural disasters such as volcanic eruptions, earthquakes, floods, and fires. However, the application of satellite video is severely constrained by factors such as satellite attitude control errors, satellite platform jitter, and differences in imaging viewpoints between adjacent frames. As a result, correct mapping relationships between pixels in adjacent frames cannot be established [2]. Video stabilization aims to eliminate or reduce the relative deformation between adjacent frames, establish correct mapping relationships between homologous image elements, and generate stable and smooth videos.
There are relatively few studies on satellite video stabilization. Existing video stabilization methods (VSMs) mainly include techniques based on the classical motion model and techniques based on the rational polynomial coefficient (RPC) model. Both approaches first need to obtain the inter-frame motion vectors of the video, for which feature-based methods [3,4] are mainly used; for example, the SIFT [5,6,7] or SAR-SIFT [8] algorithms, or deep learning methods [9,10,11], are used to extract the homologous points between video frames. The approaches then differ in the transformation models they adopt. Among the RPC-based methods, Zhou Nan et al. [12] proposed a Digital Elevation Model (DEM)-assisted VSM for optical video satellites; at the same time, each frame of the video was geocoded [13]. Zhang et al. [14] studied the stabilization of satellite video under geometric model constraints, and Wang Xia et al. [15] proposed a VSM that considers image plane distortion. High-precision VSMs based on RPC and DEM are difficult to use widely because RPC information is often inaccurate or missing and the DEM introduces plane projection errors. Therefore, a high-precision VSM based on the classical motion model is designed in this paper to expand the range of satellite video formats to which video stabilization can be applied.
Among the methods based on the classical motion model, Feng Li [16] used the rigid transformation model as the inter-frame motion model to stabilize infrared satellite video, but this method can only be applied to video images that differ by rotation and translation. Kumar et al. [17] and Maolei Zhang et al. [18] used the affine transformation model for video stabilization. Murthy et al. [19] used the perspective transformation model as the inter-frame motion model to stabilize SkySat-1 satellite video, but the accuracy obtained was low. Hui Xing et al. [20] and Walha et al. [21] used the similarity transformation model for video stabilization. All of the above methods are advantageous only under particular data conditions, cannot be applied widely across the many types of satellite video data, and have other limitations.
Stabilization of Synthetic Aperture Radar (SAR) video has received even less study, and stabilization is only achieved to some extent during the SAR video generation stage. For example, Yan et al. [22] obtained stabilized SAR video by stabilizing the rotation and trajectory of the platform in real time, and Robert Linnehan et al. [23] generated stabilized SAR video by introducing the concept of map drift to compensate for platform motion.
This paper presents a general satellite VSM based on the traditional transformation model. It addresses the limitation that existing satellite VSMs are only applicable to specific data and improves the error elimination process. To enhance the stabilization accuracy of a satellite VSM based on the traditional motion model, an improved RANSAC-based error elimination algorithm using a Euclidean distance constraint, ED-RANSAC, is proposed. Furthermore, we propose evaluation indexes for assessing the stabilization accuracy of satellite video, since current satellite VSMs lack a systematic approach to evaluation.
The innovations of this paper include the following:
- (1)
An improved error rejection algorithm: the Euclidean distance-constrained RANSAC algorithm (ED-RANSAC) is proposed to achieve high-precision homologous feature extraction. Additionally, the limitation that existing satellite VSMs are only applicable to specific data is solved.
- (2)
The stabilization accuracy of optical video is improved to better than 0.15 pixels to achieve stable and smooth video, providing a reliable database for subsequent applications such as target detection based on video data. Additionally, SAR video stabilization accuracy of better than 0.3 pixels can also be achieved.
- (3)
The paper proposes evaluation indexes for assessing the stabilization accuracy of satellite video.
2. Methods
The main methods currently applied for satellite video stabilization are the adjacent frame method and the fixed frame method. The fixed frame method, also called the master frame method, aligns the auxiliary frames to a master frame, using the first or middle frame of the video as the master frame and the other frames as auxiliary frames. This method suits satellite video with comprehensive coverage and minor changes between frames, but the error tends to worsen gradually as the number of frames increases.
In this paper, the adjacent frame method is used as the video stabilization process. The adjacent frame method takes the preceding frame of the video as the main frame and the following frame as the auxiliary frame. The SIFT algorithm [5] is used to detect homologous points between the two frames, and the proposed ED-RANSAC algorithm is used to eliminate mismatches and improve the alignment accuracy of the homologous points. The homologous points are then used to calculate the model transformation parameters between the two frames. Finally, the auxiliary frame is corrected to the position of the main frame using the calculated transformation parameters, and the stabilized video frame sequence is obtained.
Figure 1 shows the experimental flow.
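As a rough illustration of this adjacent-frame pipeline, the following Python/OpenCV sketch detects SIFT points, screens matches, estimates a perspective transformation, and warps each auxiliary frame into the coordinate system of the first frame. The ratio-test screening, the use of cv2.findHomography with plain RANSAC as a stand-in for the ED-RANSAC of Section 2.2, and the accumulation of homographies back to the first frame are our assumptions, not details taken from the paper.

```python
# Minimal sketch of the adjacent-frame stabilization loop (assumptions noted above).
import cv2
import numpy as np

def match_sift(img_a, img_b):
    """Detect SIFT keypoints in both frames and return putative homologous points."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test as a preliminary screening of the matches.
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    return pts_a, pts_b

def stabilize(frames):
    """Warp each auxiliary frame onto its preceding (main) frame."""
    stabilized = [frames[0]]
    H_total = np.eye(3)                # accumulated motion back to frame 0
    h, w = frames[0].shape[:2]
    for prev, curr in zip(frames, frames[1:]):
        pts_main, pts_aux = match_sift(prev, curr)
        # Plain RANSAC here; the ED-RANSAC sketch of Section 2.2 is a drop-in.
        H, _ = cv2.findHomography(pts_aux, pts_main, cv2.RANSAC, 3.0)
        H_total = H_total @ H          # map frame i+1 coords to frame 0 coords
        stabilized.append(cv2.warpPerspective(curr, H_total, (w, h)))
    return stabilized
```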
2.1. Homologous Feature Detection Algorithm
The SIFT algorithm was adopted to achieve homologous feature extraction in this paper. The algorithm can be roughly divided into four steps: scale space creation, feature point localization, key point orientation assignment, and descriptor generation.
Creating a scale space: The SIFT algorithm convolves Gaussian kernel functions $G(x, y, \sigma)$ of different scales with the two-dimensional image $I(x, y)$ to create the scale space $L(x, y, \sigma)$. The convolution operation is represented as follows:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \quad G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2}+y^{2})/(2\sigma^{2})} \tag{1}$$

where $\sigma$ is the scale factor, which indicates the degree of blurring of the image. Subtracting two adjacent Gaussian images yields the difference of Gaussian (DOG) pyramid, expressed as:

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \tag{2}$$
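As a minimal sketch of Equations (1) and (2), one octave of the Gaussian scale space and its DOG images can be built as follows; the initial scale sigma0 = 1.6 and s = 3 intervals per octave follow common SIFT practice and are assumptions here.

```python
import cv2
import numpy as np

def dog_octave(img, sigma0=1.6, s=3):
    """Build one octave of L(x, y, sigma) and its DOG images, Eqs. (1)-(2)."""
    img = img.astype(np.float32)
    k = 2.0 ** (1.0 / s)                       # scale ratio between layers
    # L(x, y, sigma) = G(x, y, sigma) * I(x, y), Eq. (1)
    gaussians = [cv2.GaussianBlur(img, (0, 0), sigma0 * k ** i)
                 for i in range(s + 3)]
    # D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma), Eq. (2)
    return [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
```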
Feature point localization: Each pixel in the DOG is compared with its 26 adjacent points to determine whether it is an extreme point; then, the detected extreme points are fitted using Taylor expansion to find the correct position of the feature point on the image.
Key point orientation assignment: To achieve rotation invariance and reduce the impact of image rotation on the descriptors, Formulas (3) and (4) are used to calculate the gradient magnitudes and orientations of the feature points at their respective scales:

$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}} \tag{3}$$

$$\theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \tag{4}$$

where $m(x, y)$ and $\theta(x, y)$ denote the gradient magnitude and orientation of a feature point at its scale, respectively. Within the neighborhood of each feature point, every 10 degrees forms one direction bin, and a gradient histogram of 36 directions between 0 and 360 degrees is accumulated. The direction with the peak value in the histogram is taken as the main direction of the feature point.
Generating descriptors: The coordinate axes are rotated to the main direction of the feature point, with the feature point as the center. A 4 × 4 window is then set, and the range from 0 to 360 degrees is evenly divided into 8 directional intervals of 45 degrees each. For each unit within the window, the gradient histogram of the 8 directions is calculated. These histograms are Gaussian-weighted over the window and normalized to generate a 128-dimensional descriptor vector.
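In practice, all four steps are implemented by common libraries; a minimal OpenCV sketch (an illustration with a hypothetical frame path, not the code used in this paper) is:

```python
import cv2

img = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors has shape (num_keypoints, 128): one 128-dimensional vector per
# keypoint; keypoint.angle stores the assigned main direction in degrees.
```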
2.2. ED-RANSAC Algorithm
The Random Sample Consensus (RANSAC) algorithm, proposed by Fischler and Bolles in 1981 [24], is a stochastic parameter estimation algorithm that iteratively fits mathematical model parameters from a set of sample points.
The traditional RANSAC algorithm requires an error threshold and an upper limit on the number of iterations to be set in advance; if the maximum number of iterations is not set, the algorithm can fail to converge. The upper limit on the number of iterations is closely related to the probability of obtaining the best model: as the limit increases, so does that probability. However, a larger limit increases the computational cost, which reduces the execution speed of the algorithm. It has been argued that RANSAC is too time-consuming because one of its assumptions rarely holds in practice, namely, that the model parameters calculated from an uncontaminated minimal sample are accurate. An improved method, LO-RANSAC [25], was therefore proposed; it exploits the fact that a model hypothesis from the smallest uncontaminated sample is almost always sufficiently close to the optimal solution and applies a local optimization step to the chosen model, producing an algorithm whose behavior is nearly identical to the theoretical performance. LO-RANSAC increases the number of inliers detected and allows early termination of the RANSAC iterative process, thus speeding up the overall solution and ultimately yielding a higher-quality model. Its main drawback, however, is that it requires a pure sample as a basis, and finding a pure sample is usually uncertain. In addition to RANSAC and its variants, rejecting outliers with the Pauta criterion (3sigma) is also a valid method; it assumes that the samples obey a normal distribution, in which 99.7% of correct values lie within three standard deviations, making it suitable for data with a large sample size.
To improve the overall stabilization accuracy while controlling the time cost, this paper proposes an improved RANSAC algorithm. The algorithm randomly selects matching pairs as samples to calculate a transformation matrix, then computes the consistent set that satisfies the current matrix from the transformation matrix, the sample set, and the error metric function, and iteratively updates the optimal consistent set. The spatial distance between corresponding points is then calculated, and the Euclidean distance (ED) is introduced as a threshold to filter the optimal consistent set a second time. The matching pairs in the optimal set that satisfy the threshold condition are retained as the final set of homologous points used to calculate the transformation matrix.
Equation (5) represents the two-dimensional plane coordinates $(x', y')$ obtained by transforming the original coordinates, written as the homogeneous coordinate $(x, y, 1)$:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \sim \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{5}$$

Here, the matrix of parameters $a_{ij}$ is the transformation obtained by least-squares decomposition of four randomly selected points; the upper-left $2 \times 2$ block represents the linear image transformation, $(a_{13}, a_{23})$ is the translation in $x$ and $y$, respectively, $(a_{31}, a_{32})$ is used to generate the image perspective transformation, and $a_{33}$ is usually set to 1.

$$ED = \sqrt{(x_{r} - x')^{2} + (y_{r} - y')^{2}} \tag{6}$$

In Equation (6), $(x_{r}, y_{r})$ are the coordinates of the homologous points on the reference image, and $(x', y')$ are the coordinates of the homologous points on the image to be aligned after the transformation of Equation (5).
Figure 2 shows the algorithm flow.
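A minimal Python sketch of this flow, under our reading of the algorithm, is given below: a standard RANSAC homography yields the optimal consistent set, the inliers are transformed with Equation (5), the Euclidean distance of Equation (6) filters the set a second time, and the final matrix is re-estimated by least squares. The OpenCV calls and the default thresholds (ED = 0.2 pixels from Section 3.2, 3.0 pixels for the inner RANSAC) are illustrative assumptions.

```python
import cv2
import numpy as np

def ed_ransac(pts_ref, pts_mov, ed_thresh=0.2, ransac_thresh=3.0):
    """Two-stage filtering: RANSAC consensus set, then the ED check of Eq. (6)."""
    pts_ref = np.float32(pts_ref)
    pts_mov = np.float32(pts_mov)
    # Stage 1: classic RANSAC gives the optimal consistent set (inlier mask).
    H, mask = cv2.findHomography(pts_mov, pts_ref, cv2.RANSAC, ransac_thresh)
    inliers = mask.ravel().astype(bool)
    ref, mov = pts_ref[inliers], pts_mov[inliers]
    # Stage 2: transform the inliers with Eq. (5) and keep only pairs whose
    # Euclidean distance to the reference points, Eq. (6), is below the threshold.
    proj = cv2.perspectiveTransform(mov.reshape(-1, 1, 2), H).reshape(-1, 2)
    keep = np.linalg.norm(proj - ref, axis=1) < ed_thresh
    # Re-estimate the matrix from the twice-filtered homologous points.
    H_final, _ = cv2.findHomography(mov[keep], ref[keep], 0)  # least squares
    return H_final, ref[keep], mov[keep]
```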
2.3. Evaluation Indicators
In this paper, the Root Mean Square Error (RMSE) is used as the evaluation index of stabilization accuracy; its formula is as follows:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[(x_{i} - x'_{i})^{2} + (y_{i} - y'_{i})^{2}\right]} \tag{7}$$

In Equation (7), $(x_{i}, y_{i})$ are the coordinates of the homologous points detected on the primary image, $(x'_{i}, y'_{i})$ are the coordinates of the homologous points detected on the auxiliary image after the transformation of Equation (5), and $N$ is the number of homologous points.
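Equation (7) amounts to a few lines of NumPy; a sketch, assuming both point sets are given as N × 2 arrays in the same coordinate system:

```python
import numpy as np

def rmse(pts_primary, pts_aux_transformed):
    """Eq. (7): RMSE over N homologous point pairs, in pixels."""
    diff = np.asarray(pts_primary) - np.asarray(pts_aux_transformed)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```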
3. Experiment and Analysis
In this section, we conduct video stabilization experiments and evaluate the accuracy using different regions of optical satellite video and SAR video to verify the effectiveness and generalizability of the proposed method.
3.1. Experiment Data
This experiment was conducted using Jilin-01 optical video satellite data for verification. The Jilin-01 video satellite orbits at an altitude of 656 km with a ground resolution of 1.13 m. A single video capture can last up to 120 s, and the frame rate is 25 frames per second [26,27,28]. To demonstrate the wide applicability of the method proposed in this paper, satellite video data from three different land cover types were used for the experiments: sea area (Zhifu Bay in Yantai), desert (Jiayuguan in Gansu), and mountainous area (Leibo County in Sichuan). The details of the satellite video data used in this experiment are shown in Table 1.
Figure 3 shows the satellite video images for the three land cover types used in this experiment. It can be observed that the three land cover types vary greatly. For instance, the Yantai Zhifu Bay and Gansu Jiayuguan data contain large areas with inconspicuous internal texture, which could affect the detection of key points.
3.2. Threshold ED Determination
The error rejection process is iterative, and the obtained homologous points are used to compute the transformation parameters for correcting the frame images. The smaller the Euclidean distance between the coordinates of the homologous points on the main frame and the transformed coordinates of the homologous points on the auxiliary frame, the higher the correction accuracy between the two frames, and the higher the stabilization accuracy when extended to the entire video frame sequence.
Figure 4 illustrates the determination of this threshold using data from Leibo County in Sichuan Province. In this example, the ED threshold is set to 0.2, which maximizes the elimination of false match pairs while preserving sufficient correct match pairs to calculate the model transformation parameters. The relationship between the ED value and the RMSE is shown on the left vertical axis, and the relationship between the ED value and the Correct Matching Number (CMN) on the right. To demonstrate the balance between stabilization accuracy and the number of correct matching points, and to emphasize the rationality of the chosen threshold, the RMSE is inverted in the figure.
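The threshold study behind Figure 4 can be reproduced in outline as below; this is a sketch that reuses the ed_ransac() and rmse() helpers from the earlier sketches, and the candidate grid of ED values is our assumption.

```python
import cv2
import numpy as np

def sweep_ed(pts_ref, pts_mov, candidates=np.arange(0.05, 1.05, 0.05)):
    """Record RMSE and Correct Matching Number (CMN) for each ED candidate."""
    results = []
    for ed in candidates:
        H, ref_kept, mov_kept = ed_ransac(pts_ref, pts_mov, ed_thresh=ed)
        proj = cv2.perspectiveTransform(
            np.float32(mov_kept).reshape(-1, 1, 2), H).reshape(-1, 2)
        results.append((float(ed), rmse(ref_kept, proj), len(ref_kept)))
    return results  # list of (ED, RMSE, CMN) triples, cf. Figure 4
```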
3.3. Inter-Frame Motion Model
Scholars have used many transformation models, such as rigid, similarity, affine, and perspective transformations, as inter-frame motion models for satellite video stabilization. The rigid transformation only translates and rotates the image without changing its shape, so it is unsuitable for satellite video stabilization given the deformation between satellite video frames. The similarity transformation is an extension of the rigid transformation and reduces to it when the scaling factor is 1. The affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates that captures the mapping relationship between image coordinates before and after the transformation [29], and it is widely applied to image transformation. The perspective transformation exploits the condition that the perspective center, image point, and target point are collinear: according to the law of perspective rotation, it rotates the image-bearing surface (perspective plane) around the trace line (perspective axis) by a certain angle, destroying the original bundle of projection rays while keeping the projection geometry on the image-bearing surface unchanged [30]. It is more widely applicable than the affine transformation.
In order to find a suitable transformation model, the following experiments are designed in this paper.
- (1)
Two adjacent frames from each of the three land cover types described in Section 3.1 (sea, desert, and mountain) are selected for homologous point detection, yielding 19,599, 18,197, and 12,169 homologous point pairs, respectively.
- (2)
The homologous point pairs were input into the ED-RANSAC operator combined with each of three transformation models (affine, perspective, and similarity) for screening; a minimal OpenCV sketch of this step follows the list.
- (3)
The First Select (FS), Final Point (FP), Correct Matching Ratio (CMR), and RMSE are plotted as discriminators to determine which model is more suitable for satellite video stabilization.
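As referenced in step (2), a minimal sketch of the model comparison with OpenCV is given below: the three candidate models are fitted to the same screened point pairs so that their inlier counts and residuals can be compared. The specific OpenCV estimators are our illustrative stand-ins for combining ED-RANSAC with each model.

```python
import cv2
import numpy as np

def fit_models(pts_ref, pts_mov):
    """Fit similarity, affine, and perspective models to the same pairs."""
    pts_ref, pts_mov = np.float32(pts_ref), np.float32(pts_mov)
    # 4-DOF similarity (rotation, uniform scale, translation)
    sim, sim_in = cv2.estimateAffinePartial2D(pts_mov, pts_ref,
                                              method=cv2.RANSAC)
    # 6-DOF affine transformation
    aff, aff_in = cv2.estimateAffine2D(pts_mov, pts_ref, method=cv2.RANSAC)
    # 8-DOF perspective transformation (homography)
    per, per_in = cv2.findHomography(pts_mov, pts_ref, cv2.RANSAC, 3.0)
    return {"similarity": (sim, sim_in),
            "affine": (aff, aff_in),
            "perspective": (per, per_in)}
```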
From Figure 5 and Table 2, it can be seen that the similarity transformation model performs poorly on all four discriminators for the three land cover types, with fewer matching points, lower screening accuracy, and lower matching accuracy than the other two models. The affine and perspective transformation models perform well on all four discriminators, with only small differences in matching points, screening accuracy, and RMSE. However, the perspective transformation model performs more evenly than the affine transformation model across the three land cover types. The perspective transformation model is therefore more suitable as the transformation model for satellite video stabilization.
3.4. Experimental Precision Evaluation Methods
- (1)
Inter-frame video stabilization precision evaluation
The satellite video stabilization process, whether it uses the fixed frame method, the adjacent frame method, or main frames set at intervals, requires the evaluation of the matching accuracy between the main frame and the auxiliary frames. The matching accuracy of each frame is calculated and charted to assess the level of stabilization achieved. Then, using the average matching accuracy over all frames as a benchmark, the deviation of each frame's matching accuracy from this average is calculated to determine the fluctuation of the stabilization accuracy of the method.
- (2)
Overall video stabilization precision evaluation
The inter-frame stabilization accuracy does not represent the true accuracy of the stabilization method, and error propagation can sometimes occur. Therefore, it is necessary to verify the overall accuracy of the output image sequence. The first frame of the output sequence is used as the reference frame, and image matching is used to verify the accuracy against every 10th frame to check whether error accumulates. The average of these verification matching accuracies is used as the true stabilization accuracy of the satellite video stabilization method.
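A sketch of this overall check, reusing the match_sift() and rmse() helpers assumed in the earlier sketches: the first stabilized frame is matched against every 10th frame, and the mean residual RMSE of the matched points serves as the overall accuracy.

```python
import numpy as np

def overall_accuracy(stabilized_frames, step=10):
    """Match frame 0 against every `step`-th stabilized frame."""
    ref = stabilized_frames[0]
    scores = []
    for i in range(step, len(stabilized_frames), step):
        pts_ref, pts_i = match_sift(ref, stabilized_frames[i])
        scores.append(rmse(pts_ref, pts_i))  # residual offset of checkpoints
    return float(np.mean(scores))
```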
3.5. Experimental Results and Analysis
According to the video stabilization method introduced in Section 2, stabilization experiments were conducted on the satellite video data of the three land cover types described in Section 3.1 as follows:
- (1)
To verify the stability of the proposed method, the average frame-to-frame stabilization precision of each experimental dataset was used as the reference. The difference between the stabilization precision of each image frame and this average was calculated to study the fluctuation of the frame-to-frame stabilization precision.
- (2)
The RANSAC, LO-RANSAC, 3sigma, and ED-RANSAC algorithms were used to conduct video stabilization experiments on the three types of land cover data to compare the stabilization precision before and after the improvement of the RANSAC algorithm. A preliminary screening of homologous points is performed first, and the number and content of the input homologous points are the same for all four algorithms.
- (3)
The overall stabilization precision of the output image sequence was verified. The first frame of the output sequence was used as the reference frame, and image matching was used to verify the precision against every 10th frame to check for error accumulation. The matching precision between the first and last frames was used to verify the true stabilization precision of the satellite video stabilization method.
3.5.1. Inter-Frame Video Stabilization Precision Evaluation
As shown in Figure 6, the method in this paper performs well in video stabilization accuracy under the various land cover types. The stabilization accuracy for Yantai Zhifu Bay and Sichuan Leibo County fluctuates within ±0.01 pixels, that for Gansu Jiayuguan fluctuates within ±0.02 pixels, and the fluctuation for all three land cover types does not exceed ±0.02 pixels overall, which fully illustrates the stability of the method.
The proposed ED-RANSAC algorithm performs far better than the original RANSAC, LO-RANSAC, and 3sigma algorithms. From the line chart on the right of Figure 6, it can be seen that the method proposed in this paper performs best: the stabilization accuracy of the marine area (Zhifu Bay in Yantai), the desert area (Jiayuguan in Gansu), and the mountainous area (Leibo County in Sichuan) is improved to better than 0.15 pixels, meeting the requirements for smooth video applications. This shows that the proposed method can eliminate the influence of terrain factors. The quantitative analysis of the stabilization accuracy is shown in Table 3.
Table 3 summarizes the RMSE of all frames obtained from the video stabilization experiments of the four algorithms on the three datasets. The maximum, minimum, and median RMSE values of all frames are recorded for each algorithm and dataset to quantitatively analyze the improvement in stabilization accuracy achieved by the improved RANSAC algorithm relative to the other three. The 3sigma method gives the worst results, with a maximum RMSE of more than 1.0 pixel in the desert region. LO-RANSAC and the original RANSAC algorithm show significant fluctuations in stabilization accuracy across the three land cover types; on the Gansu Jiayuguan dataset, the difference between the maximum and minimum RMSE values reaches about 0.4 pixels, and the median RMSE is above 0.3 pixels. The improved ED-RANSAC algorithm greatly improves the stabilization accuracy, making it better than 0.15 pixels under all three land cover conditions, and greatly reduces its fluctuation, which overall is around 0.03 pixels. The high accuracy and stability of the method are thus well-proven.
3.5.2. Overall Video Stabilization Precision Evaluation
By correcting the satellite video frame images, the geometric correspondence between the video frames can be restored. The experimental data were stabilized using the proposed method to obtain a stabilized video sequence. The first and last frame images of the stabilized sequence are shown in Figure 7. As can be seen from Figure 7, the effective ground coverage of the two images differs significantly due to the influence of satellite platform jitter and differences in the satellite video imaging angle.
To validate the effectiveness of the proposed method, image matching was performed between the first stabilized frame and every 10th frame of the stabilized video sequence, and the RMSE between corresponding points was used as the metric of inter-frame matching accuracy. The results are shown in Table 4. It can be seen that the video stabilization accuracy obtained by this method is better than 0.15 pixels, which is consistent with the geometric accuracy between video frames in Table 3 and meets the application requirements of high-precision satellite video stabilization. However, Table 4 also shows that the number of checkpoints decreases as the gap between the compared frames increases. In particular, for the mountainous area data, the overall stabilization accuracy is not strictly meaningful because of the large frame-number gap in the data itself and the insufficient number of verification checkpoints; this problem needs to be addressed in future work.
To visualize the video stabilization effect achieved by the method in this paper, the first and last frames of the obtained image sequence (following Table 4, the first and tenth frames are selected for the mountainous area) are selected to show local image edge maps, and the local edge maps of the first and last frames of the original video are listed below for comparison. For enhanced display, color processing was applied to one of the images. It can be seen from Figure 8 that the video stabilization effect between the two frames is excellent; there is no misalignment in areas such as water boundaries, buildings, roads, and farmland.
3.6. Application in SAR Video
3.6.1. Experimental Data
A SAR video released by Sandia National Laboratories is used as the experimental data in this section. The video size is 657 × 720 pixels, with a total of 150 frames. The video was shot in the "circular trajectory" mode [31,32], and the large displacement and angle changes increase the difficulty of video stabilization. Figure 9 shows the SAR video images used as experimental data in this paper.
3.6.2. Experimental Results and Analysis
Since there are significant differences between optical images and SAR images due to their different imaging methods, the traditional SIFT algorithm cannot effectively detect homologous features on SAR images. This paper therefore adopts the SAR-SIFT algorithm in place of the SIFT algorithm for homologous point detection in this experiment. This section conducts video stabilization experiments on the SAR video data using the experimental process, methods, and evaluation indexes described in the previous sections. The stability verification results of the proposed method on the SAR video are shown in Figure 10, the comparison results of the ED-RANSAC, RANSAC, LO-RANSAC, and 3sigma algorithms are shown in Figure 11, and the quantitative analysis of the video stabilization accuracy is shown in Table 5.
From Figure 10, it can be seen that the method in this paper also shows good stability in the stabilization of SAR video, with the stabilization accuracy fluctuating by no more than ±0.05 pixels, which again demonstrates the universality of the method.
Figure 11 shows the comparison results between the algorithm in this paper and the other three algorithms.
Table 5 quantitatively compares the four algorithms by the maximum, minimum, and median RMSE values of all frames. The proposed method improves the stabilization accuracy of SAR video from about 0.6 pixels before the improvement to about 0.25 pixels, a significant improvement that meets the application requirements of high-precision satellite video stabilization.
The first frame of the output image sequence is used as the main frame and matched against every 10th frame in turn for overall accuracy verification; the verification results are shown in Table 6. From Table 6, the overall stabilization accuracy is better than 0.3 pixels, which indicates that the method in this paper also improves accuracy for SAR video.
To visually interpret the video stabilization effect obtained by the method in this paper, the first and last frames of the stabilized video image sequence are selected to display their local image edge maps, and the corresponding local edge maps of the first and last frames of the original video are listed below for comparison. To enhance the display effect, one of the images was color processed. From Figure 12, this method stabilizes SAR video images well; there is no misalignment of roads, flower beds, buildings, etc.
The above figures and tables indicate that the method of this article is not only applicable to optical satellite video data under different ground conditions but also performs well in SAR video stabilization. The universality, high precision, and stability of the proposed method are fully proved.