Article

Remote Sensing Target Tracking Method Based on Super-Resolution Reconstruction and Hybrid Networks

School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, 333 Longteng Road, Songjiang District, Shanghai 201620, China
*
Author to whom correspondence should be addressed.
J. Imaging 2025, 11(2), 29; https://doi.org/10.3390/jimaging11020029
Submission received: 19 November 2024 / Revised: 20 December 2024 / Accepted: 7 January 2025 / Published: 21 January 2025

Abstract

Remote sensing images are highly complex, easily distorted, and exhibit large scale variations, and the motion of remote sensing targets is usually nonlinear, so existing target tracking methods based on remote sensing data cannot track such targets accurately. Moreover, obtaining high-resolution images by algorithmic means rather than by hardware saves substantial cost. To address the large tracking errors of current tracking algorithms on remote sensing targets, this paper proposes a target tracking method combined with a super-resolution hybrid network. First, the method uses a super-resolution reconstruction network to raise the resolution of the remote sensing images. Then, after target detection, a hybrid neural network estimates the target motion. Finally, identity matching is completed with the Hungarian algorithm. The experimental results show that the tracking accuracy of this method is 67.8% and its identification F-measure (IDF1) is 0.656. These indicators surpass those of traditional target tracking algorithms and meet the requirements for accurate tracking of remote sensing targets.

1. Introduction

Super-resolution reconstruction and target tracking represent two pivotal research directions in the field of computer vision, and their combination has significant value in video surveillance, autonomous driving, and remote sensing. Target tracking aims to model the motion of targets across video sequences; precise target tracking can be applied in scenarios such as intelligent video analysis and personnel tracking [1]. With the wide application of deep learning technologies, the reconstruction performance from low-resolution to high-resolution images has improved significantly [2,3]. This matters for the target tracking task because it helps algorithms identify rich target features more precisely [4,5], further improving the accuracy and robustness of the algorithms.
Super-resolution reconstruction algorithms can be divided into two categories: traditional multi-image super-resolution algorithms and single-image super-resolution algorithms [6,7]. Many researchers have explored the field of super-resolution reconstruction in depth. On the one hand, research on deep-learning-based super-resolution reconstruction has actively explored neural network architectures suited to remote sensing images [8,9]. For example, some research teams have improved traditional convolutional neural networks (CNNs) and generative adversarial networks (GANs) in light of the characteristics of remote sensing images to enhance super-resolution reconstruction [10]. On the other hand, substantial progress has been made on the super-resolution reconstruction of multi-source remote sensing images, addressing problems such as registration and fusion in multi-source data fusion so as to make full use of the complementary information of multi-source remote sensing data and improve the quality and accuracy of the reconstructed images [11]. The tracking performance of super-resolution reconstruction technology under complex environmental conditions (such as clouds, fog, and atmospheric interference) and for difficult targets (such as small or camouflaged targets) continues to improve as algorithms are optimized for adaptability to these situations [12]. On the premise of preserving the accuracy of super-resolution reconstruction and target tracking, research is also under way on improving the running speed and efficiency of algorithms to meet real-time requirements in practical applications [13]. The continued development of deep learning in computer science will bring further inspiration to the field of super-resolution.
Traditional super-resolution reconstruction algorithms have certain limitations and fail to adequately address the challenges posed by environmental complexity and noise in remote sensing images, resulting in suboptimal reconstruction outcomes. Therefore, this paper proposes a super-resolution reconstruction algorithm specifically designed for remote sensing images to fully capture the detailed features of remote sensing images and improve the model’s accuracy. In addition, general target tracking algorithms often over-emphasize the appearance similarity between targets across frames while neglecting the nonlinear characteristics of target motion. To address this, this paper designs a hybrid neural network, CNN-GRU, that extracts both local features and global information, allowing the target tracking algorithm to more effectively consider the nonlinear characteristics of target motion, thus improving tracking accuracy. The target tracking algorithm proposed in this paper, specifically for remote sensing images, is suitable for detecting small and camouflaged targets, offering advantages such as high accuracy and strong applicability. Compared to traditional target tracking and super-resolution reconstruction algorithms, the methods proposed in this paper show significant improvements and offer substantial advantages.

2. Overall Framework of Remote Sensing Target Tracking System

The overall framework of the system is shown in Figure 1. First, the input remote sensing image sequences undergo super-resolution reconstruction using the improved network to enhance image quality. Next, the reconstructed images are fed into the target detection model, detection transformer (DETR) [14], based on the transformer framework, which outputs the detected remote sensing targets. Then, the detection boxes are passed into the motion estimation model for motion modeling and state estimation, followed by the calculation of the cost matrix. Finally, the Hungarian algorithm [15] is applied to perform inter-frame target data association and identity transfer, updating the tracking model and outputting the tracking result.
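To make the data flow concrete, the sketch below expresses the per-frame loop of Figure 1 in Python. Every component (sr_model, detector, predict_state, cost_matrix) is a hypothetical caller-supplied callable, since the paper's code is not published; only the Hungarian step uses a real library routine, scipy.optimize.linear_sum_assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def track_sequence(frames, sr_model, detector, predict_state, cost_matrix):
    """Per-frame tracking loop of Figure 1 (illustrative only).

    sr_model:      frame -> high-resolution frame
    detector:      frame -> list of detection boxes (DETR in the paper)
    predict_state: track history -> predicted state Q (CNN-GRU in the paper)
    cost_matrix:   (predictions, detections) -> (m, n) similarity array r
    """
    tracks = []                                    # one state history per identity
    for frame in frames:
        hr = sr_model(frame)                       # super-resolution reconstruction
        dets = detector(hr)                        # detected remote sensing targets
        matched_dets = set()
        if tracks and dets:
            preds = [predict_state(t) for t in tracks]
            r = np.asarray(cost_matrix(preds, dets))
            rows, cols = linear_sum_assignment(-r)  # Hungarian: maximize similarity
            for i, j in zip(rows, cols):
                tracks[i].append(dets[j])          # identity transfer + state update
                matched_dets.add(j)
        for j, d in enumerate(dets):
            if j not in matched_dets:
                tracks.append([d])                 # unmatched detection: new identity
    return tracks
```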

3. Main Algorithms

3.1. Super-Resolution Reconstruction Algorithm

General super-resolution reconstruction models are mainly used for processing natural landscapes and cannot fully take into account characteristics such as the complexity of scenes and noise in remote sensing images, which leads to poor reconstruction effects. Therefore, based on EDSR [16], this paper proposes a super-resolution reconstruction model specifically designed for remote sensing images. This network has redesigned the residual blocks, upsampling blocks, and loss functions, improving the model’s feature extraction ability and further enhancing the details and texture of the reconstructed images.

3.1.1. Residual Block

Deepening the residual blocks can effectively enhance the network’s ability to extract image features. For this reason, this paper proposes a residual block based on the convolutional layer, the GELU activation function [17], and the BN layer. The convolutional layer adopts small-scale convolutional kernels to better handle the local features of remote sensing images and increase the receptive field. Meanwhile, the GELU activation function is added after each convolutional layer. The GELU is a continuous and smooth function, and its derivative exists throughout the entire real number domain, which can avoid the vanishing gradient phenomenon during the training process and accelerate convergence. The formula is:
$$\mathrm{GELU}(x) = 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right)\right)$$
where x is the input feature.
The network structure of the residual module is shown in Figure 2, where the BN layer is used to accelerate model convergence.
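For illustration, a minimal PyTorch sketch of such a residual block is given below; the channel width and the two-convolution depth are assumptions made for the sketch, not values taken from the paper.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the redesigned residual block (Figure 2): small 3x3 kernels,
    GELU after each convolution, and BN to accelerate convergence."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)   # identity skip connection
```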

3.1.2. Upsampling Block

The upsampling block proposed in this paper is alternately composed of a deconvolution layer [18] and a sub-pixel convolution layer. The deconvolution layer achieves upsampling by mapping each pixel of the input tensor to multiple pixels of the output tensor, which can reduce phenomena such as jagged distortion in remote sensing images. The sub-pixel convolution layer improves the quality of the reconstructed image by reducing the number of channels and increasing the resolution.
Let the input size be $(B, C, H_i, W_i)$ (batch size, number of channels, height, and width, respectively) and the deconvolution output size be
$$M_o = \frac{M_i + 2p - k}{s} + 1$$
where $M$ is the height or width, $p$ is the padding, $k$ is the convolution kernel size, $s$ is the stride, and the subscripts $i$ and $o$ denote input and output. After the deconvolution operation, the size of the feature map becomes $(B, Cr^{2}, H_o, W_o)$, where $r$ is the magnification factor.
Sub-pixel convolution rearranges and reorganizes the deconvolved feature map according to fixed rules. In the first step, the input tensor is split along the channel dimension into $r^{2}$ sub-tensors, each of size $(B, C, H_o, W_o)$. In the second step, each sub-tensor is rearranged: a new tensor of size $r \times r$ is constructed, the height and width of each sub-tensor are scaled by $1/r$, and the scaled sub-tensor is filled into the constructed tensor pixel by pixel, row by row and column by column. In the third step, the $r^{2}$ new tensors are concatenated along the channel dimension to obtain an output tensor of size $(B, C, rH_o, rW_o)$.
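A minimal PyTorch sketch of the alternating upsampler follows, assuming each stage contributes a factor of 2; the kernel sizes and channel widths are illustrative. nn.PixelShuffle implements exactly the channel-to-space rearrangement described above.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Deconvolution (transposed convolution) stage followed by a sub-pixel
    (PixelShuffle) stage, each upscaling by 2 for a total factor of 4."""
    def __init__(self, channels: int = 64, r: int = 2):
        super().__init__()
        # Deconvolution: maps each input pixel to a k x k output patch.
        self.deconv = nn.ConvTranspose2d(channels, channels,
                                         kernel_size=4, stride=2, padding=1)
        # Sub-pixel: expand channels to C*r^2, then rearrange to rH x rW.
        self.subpixel = nn.Sequential(
            nn.Conv2d(channels, channels * r * r, kernel_size=3, padding=1),
            nn.PixelShuffle(r),
        )

    def forward(self, x):
        return self.subpixel(self.deconv(x))

x = torch.randn(1, 64, 32, 32)
print(UpsampleBlock()(x).shape)   # torch.Size([1, 64, 128, 128])
```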

3.1.3. Loss Function

The loss function combines the mean squared error (MSE) [19] and structural similarity index measure (SSIM) [20] loss terms:
$$loss = loss_{MSE} + loss_{SSIM}$$
MSE is an indicator for measuring image quality. By calculating the differences between two images at corresponding pixel points, it can effectively capture information such as noise and distortion in remote sensing images. The calculation formula is as follows:
$$loss_{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(X_{i,j} - Y_{i,j}\right)^{2}$$
In the formula, $m$ and $n$ are the numbers of pixels along the height and width of the input image, $(i, j)$ are the pixel coordinates, and $X$ and $Y$ denote the super-resolution reconstructed image and the corresponding label image, respectively.
The SSIM loss function can effectively handle high-level features such as the structure and texture of remote sensing images. It reflects the image reconstruction quality by examining luminance, contrast, and structure. The calculation formula is as follows:
$$L(X,Y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^{2} + \mu_y^{2} + C_1},\qquad C(X,Y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^{2} + \sigma_y^{2} + C_2},\qquad S(X,Y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$$
where $L$, $C$, and $S$ are the luminance, contrast, and structure terms, respectively; $\mu$ is the pixel mean, $\sigma$ is the standard deviation of the pixel values, $\sigma_{xy}$ is their covariance, and $C_1$, $C_2$, $C_3$ are constants. In summary, the SSIM loss function is calculated as follows:
$$loss_{SSIM} = \frac{\left(2\mu_x\mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^{2} + \mu_y^{2} + C_1\right)\left(\sigma_x^{2} + \sigma_y^{2} + C_2\right)}$$
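The sketch below combines the two loss terms as described. For brevity the SSIM statistics are computed globally over the whole image rather than in sliding windows, and the SSIM term is taken as 1 − SSIM so that greater similarity lowers the loss; both are simplifying assumptions of the sketch.

```python
import torch

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (whole-image) SSIM statistics; pixel values assumed in [0, 1].
    mu_x, mu_y = x.mean(), y.mean()
    var_x = x.var(unbiased=False)
    var_y = y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def sr_loss(sr, hr):
    # loss = loss_MSE + loss_SSIM, with the SSIM term expressed as 1 - SSIM
    # so that minimizing the loss maximizes structural similarity.
    return torch.mean((sr - hr) ** 2) + (1.0 - ssim_global(sr, hr))
```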

3.2. Motion Estimation

3.2.1. Estimation Model

After the remote sensing images have been detected, it is necessary to model the detected targets to reflect their inter-frame motion states so that the identities of the targets can be propagated to the next frame. Let the state of each remote sensing target be:
$$K = \left[u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}\right]^{T}$$
In the formula, $u$ and $v$ represent the horizontal and vertical pixel positions of the target center, $\gamma$ and $h$ represent the aspect ratio and height of the detection box, and the remaining variables are the first derivatives of the corresponding variables.

3.2.2. Hybrid Neural Network

Remote sensing targets change position frequently and vary greatly in size. Traditional tracking algorithms, which simply set the velocity to a constant, therefore cannot represent the nonlinearity of remote sensing target motion, leading to inaccurate target state estimates and compounding error accumulation. This paper therefore proposes a hybrid neural network, CNN-GRU, which treats the inter-frame motion of remote sensing targets as a time series prediction problem. The model is composed of a convolutional neural network (CNN) [21] and a gated recurrent unit (GRU) [22], which extract local information and the context of sequential data, respectively, and can thus effectively capture the nonlinear characteristics of inter-frame motion. Its structure is shown in Figure 3.
Let the input of CNN-GRU be the matrix $P$. Each row of $P$ corresponds to one state variable of a target identity, $\left[u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}\right]^{T}$, and each column corresponds to one of the preceding ten frames:
$$P = \begin{bmatrix} u_{t-10} & u_{t-9} & \cdots & u_{t-1} \\ v_{t-10} & v_{t-9} & \cdots & v_{t-1} \\ \vdots & \vdots & \ddots & \vdots \\ \dot{h}_{t-10} & \dot{h}_{t-9} & \cdots & \dot{h}_{t-1} \end{bmatrix}$$
where $t$ represents the current frame. The hybrid neural network then outputs the predicted value $Q$ of the target identity state in the current frame, i.e.,
$$Q = \left[u_t, v_t, \gamma_t, h_t, \dot{u}_t, \dot{v}_t, \dot{\gamma}_t, \dot{h}_t\right]^{T}$$
The CNN is used to extract the local features of remote sensing target trajectory data, compensating for the fact that a recurrent neural network attends only to global contextual information. The CNN has three CGM modules, each composed of a 1D convolutional layer, a GELU activation function, and a 1D max pooling layer. The 1D convolutional layer is specifically designed to handle time series data, and its convolutional kernels slide only in the horizontal (time) direction.
GRU is a special recurrent neural network structure that alleviates, to some extent, the vanishing and exploding gradient problems during training on time series. It also has a degree of memory and can capture long-term dependencies in sequential data. This paper adopts a three-layer GRU network to capture the temporal information in the high-level features produced by the convolutional layers. The GRU contains a reset gate and an update gate, and its structure is shown in Figure 4. In the figure, $h_{i-1}$ and $h_i$ are the hidden states passed between steps, $\tilde{h}_i$ is the candidate state after reset, $r_i$ and $z_i$ are the reset gate and update gate, respectively, and $x_i$ and $y_i$ are the input and output of node $i$.
The state update formulas for the gates of the gated recurrent unit are:
$$r_i = \sigma\left(W_r x_i + U_r h_{i-1} + b_r\right)$$
$$z_i = \sigma\left(W_z x_i + U_z h_{i-1} + b_z\right)$$
The state update formulas for the memory unit are:
$$\tilde{h}_i = \tanh\left(W_h x_i + U_h \left(r_i \odot h_{i-1}\right) + b_h\right)$$
$$h_i = z_i \odot h_{i-1} + \left(1 - z_i\right) \odot \tilde{h}_i$$
The output of the node is:
$$y_i = \sigma\left(W_y h_i\right)$$
In the formula, $W_r$, $W_z$, $W_h$, $W_y$ are the input weight matrices of the reset gate, update gate, candidate unit, and output gate, respectively; $U_r$, $U_z$, $U_h$ are the recurrent weight matrices of the corresponding units; and $b_r$, $b_z$, $b_h$, $b_y$ are the corresponding bias vectors. $\sigma$ is the sigmoid activation function [23], whose expression is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
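Putting the pieces together, a minimal PyTorch sketch of CNN-GRU is shown below. The three CGM modules and the three-layer GRU follow the description above, while the channel widths, kernel sizes, and hidden size are assumptions of the sketch.

```python
import torch
import torch.nn as nn

def cgm(c_in: int, c_out: int) -> nn.Sequential:
    """One CGM module: 1D convolution (slides along the time axis only),
    GELU activation, and 1D max pooling."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
        nn.GELU(),
        nn.MaxPool1d(kernel_size=2, ceil_mode=True),
    )

class CNNGRU(nn.Module):
    """Three CGM modules followed by a three-layer GRU and a linear head
    that predicts the 8-dimensional state Q of the current frame."""
    def __init__(self, state_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(cgm(state_dim, 32), cgm(32, 64), cgm(64, 64))
        self.gru = nn.GRU(input_size=64, hidden_size=hidden,
                          num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, state_dim)

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        # p has shape (batch, 8, 10), matching matrix P above.
        f = self.cnn(p)                        # (batch, 64, T') local features
        out, _ = self.gru(f.transpose(1, 2))   # recurrence over the time axis
        return self.head(out[:, -1])           # (batch, 8) predicted state Q

print(CNNGRU()(torch.randn(4, 8, 10)).shape)   # torch.Size([4, 8])
```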

3.3. Data Association

After motion estimation, data association must be carried out between consecutive frames to complete identity transfer and state updates. Detection on a remote sensing image generates detection boxes, each described by the four variables $u$, $v$, $\gamma$, and $h$. If frame $t-1$ contains $m$ target boxes and frame $t$ contains $n$ detection boxes to be matched, the number of candidate matches is $m \times n$. This paper treats the time interval between two frames as a constant $\Delta t$, and each measured value after matching is $D = \left[u_t, v_t, \gamma_t, h_t, \dot{u}_t, \dot{v}_t, \dot{\gamma}_t, \dot{h}_t\right]^{T}$.
In this paper, the boxes defined by the four position variables $(u, v, \gamma, h)$ of the predicted value $Q$ for frame $t$ and of the measured value $D$ are compared pairwise to measure the position similarity between the predicted and measured values. The formula is as follows:
$$r_{IoU} = \frac{\left|D_l \cap Q_l\right|}{\left|D_l \cup Q_l\right|}$$
where the subscript $l$ denotes the position components.
The four velocity variables $\dot{u}, \dot{v}, \dot{\gamma}, \dot{h}$ in the predicted value $Q$ for frame $t$ and in the measured value $D$ are compared via the Mahalanobis distance to reflect the velocity similarity between the predicted and measured values. The larger the Mahalanobis distance, the more dissimilar the two are. The formula is as follows:
$$r_M = \sqrt{\left(D - Q\right)^{T} S^{-1} \left(D - Q\right)}$$
where $S$ is the covariance matrix of the predicted value $Q$ and the measured value $D$.
The weighted combination of the two is used as the total similarity, calculated as follows:
$$r = \lambda r_{IoU} + \left(1 - \lambda\right) r_M$$
where $\lambda$ is the weight; the larger the value of $r$, the more similar the predicted value $Q$ is to the measured value $D$, and the range of $r$ is [0, 1].
The resulting cost matrix
$$\begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix}$$
is then input into the Hungarian algorithm to complete target identity matching and the state update.
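A sketch of this association step is given below, using scipy.optimize.linear_sum_assignment as the Hungarian solver. Boxes are taken in corner format for the IoU computation, and because $r_M$ is a distance that grows with dissimilarity, the sketch subtracts the weighted Mahalanobis term so that larger $r$ consistently means more similar; both choices are assumptions of the sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match(pred_boxes, det_boxes, pred_vel, det_vel, S_inv, lam=0.674):
    """Weighted IoU/Mahalanobis similarity matrix, then Hungarian matching."""
    m, n = len(pred_boxes), len(det_boxes)
    r = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            d = det_vel[j] - pred_vel[i]
            maha = np.sqrt(d @ S_inv @ d)          # velocity dissimilarity
            r[i, j] = lam * iou(pred_boxes[i], det_boxes[j]) - (1 - lam) * maha
    rows, cols = linear_sum_assignment(-r)         # maximize total similarity
    return list(zip(rows, cols))
```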

4. Experiments and Analysis

4.1. Selection and Preprocessing of the Dataset

This paper adopts the DIV2K dataset [24] for the super-resolution reconstruction model. The dataset contains 800 high-resolution images with corresponding low-resolution images for training and validation, plus another 100 images for testing; the low-resolution images are one-fourth the size of the high-resolution ones. Since remote sensing images are usually affected by various kinds of noise, this paper adds noise to the DIV2K dataset to better simulate real conditions. Specifically, salt-and-pepper noise is added to the training set by generating per-pixel random numbers with a noise ratio of 0.05, and Gaussian perturbations with a mean of 0 and a variance of 0.03 are applied to the pixel values of the training images. After augmentation, the training and validation sets are three times their original size, and the final training, validation, and test splits contain 2100, 300, and 100 images, respectively.
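A NumPy sketch of the two noise-injection operations under the stated parameters (salt-and-pepper ratio 0.05; Gaussian mean 0, variance 0.03) might look as follows; the function names are illustrative.

```python
import numpy as np

def add_salt_pepper(img, density=0.05, seed=None):
    """Flip a `density` fraction of pixels to 0 or 255 (salt-and-pepper)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    mask = rng.random(img.shape[:2]) < density   # pixels to corrupt
    salt = rng.random(img.shape[:2]) < 0.5       # half salt, half pepper
    out[mask & salt] = 255
    out[mask & ~salt] = 0
    return out

def add_gaussian(img, mean=0.0, var=0.03, seed=None):
    """Add zero-mean Gaussian noise to an image scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) / 255.0 + rng.normal(mean, np.sqrt(var), img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)
```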
The VISO dataset is adopted for the target detection and tracking models. It contains four target classes: ships, cars, trains, and airplanes. For the target detection training set, random horizontal and vertical flips are applied in a 3:7 proportion to strengthen the model's representation of spatial positions, and the RGB channels are converted to HSV with perturbation coefficients for hue, saturation, and brightness set to 0.02, 1.15, and 0.6, respectively. The tracking data are processed with a sliding window of size 10 and step size 1, as sketched below; the model's input feature dimension is therefore 8 with a sequence length of 10, and its output feature dimension is 8 with a sequence length of 1.
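A sketch of this sliding-window construction: for each track, every run of ten consecutive 8-dimensional states becomes one input sample (matrix P), and the following state is its target (Q).

```python
import numpy as np

def sliding_windows(states, window=10):
    """Build (input, target) pairs from a (T, 8) per-identity state sequence
    with window size 10 and step size 1."""
    xs, ys = [], []
    for t in range(window, len(states)):
        xs.append(states[t - window:t].T)   # (8, 10), like matrix P
        ys.append(states[t])                # (8,) target state Q
    return np.stack(xs), np.stack(ys)

traj = np.random.rand(50, 8)                # 50 frames of an example track
X, Y = sliding_windows(traj)
print(X.shape, Y.shape)                     # (40, 8, 10) (40, 8)
```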

4.2. Experimental Platform and Parameter Settings

This paper adopts a cloud server with the following configuration: Ubuntu 18.04 (64-bit); CUDA 11.0.2; an NVIDIA T4 GPU; and the AIACC training and inference acceleration frameworks. During model training, the batch size is set to 32, the optimizer is Adam, and the learning rate is 0.0001. The data association weight λ is determined to be 0.674.

4.3. Results and Analysis of Super-Resolution Reconstruction

This paper employs two of the most commonly used metrics in the field of super-resolution, namely peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [25], to evaluate the reconstruction effect of remote sensing images. PSNR is used to measure the differences between two images and assess the quality of super-resolution reconstructed images, as well as the performance of algorithms.
$$PSNR = 10 \log_{10}\left(\frac{MaxValue^{2}}{MSE}\right)$$
Here, MSE is the pixel-wise mean squared error between the two images, and MaxValue is the maximum value a pixel can take.
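As a small worked example, PSNR follows directly from this formula; the sketch below assumes 8-bit images, so MaxValue = 255.

```python
import numpy as np

def psnr(x, y, max_value=255.0):
    # Pixel-wise MSE, then PSNR in decibels per the formula above.
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)

a = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
noisy = np.clip(a + np.random.normal(0, 5, a.shape), 0, 255).astype(np.uint8)
print(f"{psnr(a, noisy):.2f} dB")   # roughly 34 dB for sigma = 5 noise
```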
The SSIM is an indicator used to measure the similarity between two images. It is based on three relatively independent subjective metrics, namely luminance, contrast, and structure. Compared with the traditional MSE, it is more in line with the perception of images by the human visual system.
$$SSIM(x,y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^{2} + \mu_y^{2} + c_1\right)\left(\sigma_x^{2} + \sigma_y^{2} + c_2\right)}$$
Here, $\mu_x$ is the mean of $x$, $\sigma_x^{2}$ is the variance of $x$, $x$ and $y$ are $N \times N$ images of the same size, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $c_1 = (k_1 L)^{2}$, $c_2 = (k_2 L)^{2}$, and $c_3 = c_2/2$ are constants, where $L$ is the range of the pixel values and $k_1$ and $k_2$ are constants much smaller than 1; by default, $k_1 = 0.01$ and $k_2 = 0.03$.
The ablation experiment is a method widely used in machine learning and deep learning to quantify how much each component of a complex model (network layers, features, modules, etc.) contributes to overall performance, characterizing the gain from each change to the network. The results are shown in Table 1. Improvement 1 redesigns the residual module, raising the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) by 0.38 dB and 0.086, respectively, which indicates that appropriately deepening the network helps to fully extract image features. Improvement 2 sets the upsampling block to alternate between the deconvolution layer and the sub-pixel convolution layer, which fills in pixels more accurately than an upsampling block built from deconvolution layers alone. Improvement 3 adds the SSIM loss function, which reflects model performance in terms of luminance, contrast, and structure. Overall, compared with the traditional EDSR model, the super-resolution reconstruction network proposed in this paper is markedly better suited to remote sensing images, with the PSNR and SSIM increased by 1.17 dB and 0.12, respectively.

4.4. Analysis of Target Tracking Results

The image sequences in the VISO dataset are input into the model for testing. Part of the test results are shown in Figures 5 and 6, which display the remote sensing target tracking results before and after super-resolution reconstruction, respectively. The tracked targets are marked by rectangular boxes together with their current identities.
To describe the effect of the tracking model more precisely, this paper adopts two metrics, namely multiple object tracking accuracy (MOTA) and IDF1, to analyze the tracking accuracy of the model. Among them, MOTA is a widely used evaluation metric for measuring the overall performance of multi-object tracking algorithms. It comprehensively considers three types of errors: false detections, missed detections, and identity switches. The calculation formula of MOTA is as follows:
$$MOTA = 1 - \frac{\sum_t \left(FN_t + FP_t + IDSW_t\right)}{\sum_t GT_t}$$
Among them, $FN_t$ is the number of missed detections in frame $t$, $FP_t$ is the number of false detections, $IDSW_t$ is the number of identity switches, and $GT_t$ is the actual number of targets.
IDF1 is an important indicator for evaluating the consistency of identity preservation in multi-object tracking. The calculation formula of IDF1 is as follows:
$$IDF1 = \frac{2 \times IDTP}{2 \times IDTP + IDFP + IDFN}$$
Among them, IDTP (true positives) represents the number of correctly matched identities, IDFP (false positives) represents the number of incorrectly matched identities, and IDFN (false negatives) represents the number of missed identities.
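Both metrics reduce to a few lines of arithmetic, as the sketch below shows; the example counts are made up for illustration and are not taken from the experiments.

```python
def mota(fn, fp, idsw, gt):
    # MOTA = 1 - (misses + false positives + ID switches) / total ground truth
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

def idf1(idtp, idfp, idfn):
    # IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Hypothetical three-frame example:
print(mota(fn=[1, 0, 2], fp=[0, 1, 0], idsw=[0, 0, 1], gt=[10, 10, 10]))  # 0.8333...
print(idf1(idtp=25, idfp=3, idfn=4))                                      # 0.8771...
```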
To reflect the multi-object tracking performance of the optimized model more intuitively, the improved CNN-GRU is compared against the traditional SORT and DeepSORT trackers. To keep the detector as the single controlled variable in these comparisons, DETR is used as the detector throughout; the performance comparison on the test set is shown in Table 2.
As can be seen from Table 2, the accuracy of the hybrid neural network has increased by 21.7% and 6.2%, respectively, compared with the SORT and DeepSORT trackers, which indicates that CNN-GRU can handle the nonlinear motion states of remote sensing targets better and has a significant improvement in multi-object tracking performance compared with traditional target tracking algorithms. Moreover, this paper also compares the impact of remote sensing images before and after super-resolution reconstruction on multi-object tracking (CNN-GRU). The results show that the accuracy and IDF1 value of the images with super-resolution reconstruction have increased by 4.3% and 0.031, respectively, compared with those without super-resolution reconstruction. This demonstrates that the remote sensing images after super-resolution reconstruction can not only improve the resolution and display the detailed features of the targets in the images better but also effectively reduce the noise in the original images and improve the performance of the tracker. The super-resolution reconstruction network combined with the CNN-GRU network has increased the accuracy (MOTA) by 26% and 10.5%, respectively, compared to SORT and DeepSORT.

5. Conclusions

In view of characteristics of remote sensing images such as scene complexity and noise, this paper redesigns the residual blocks, upsampling blocks, and loss function of an EDSR-based super-resolution reconstruction network. Compared with the traditional EDSR, the resulting model has stronger feature extraction ability and further enhances the details and texture of the reconstructed remote sensing images, increasing the PSNR and SSIM by 1.17 dB and 0.12, respectively. Traditional target tracking algorithms place too much weight on the appearance similarity of targets while ignoring the nonlinear characteristics of remote sensing target motion. The hybrid neural network CNN-GRU proposed in this paper treats the inter-frame motion of remote sensing targets as a time series prediction problem, effectively capturing the nonlinear characteristics of inter-frame motion, and can handle situations such as large variations in target scale and complex environments in remote sensing images. The experimental results show that the accuracy of the proposed hybrid neural network is 21.7% and 6.2% higher than that of the traditional SORT and DeepSORT trackers, respectively, and that adding the super-resolution reconstruction module further increases the accuracy and IDF1 value by 4.3% and 0.031, respectively. This indicates that the super-resolution reconstruction network designed specifically for remote sensing images greatly improves the accuracy of the target tracking algorithm. A remaining shortcoming is that the model does not extract features describing the relative positions of remote sensing targets; improving its ability to perceive changes in target position is left to future work.

Author Contributions

Conceptualization, H.W. and S.X.; methodology, H.W.; software, H.W.; validation, H.W., S.X. and Y.Y.; formal analysis, S.X.; investigation, Y.L.; resources, S.X.; data curation, S.X.; writing—original draft preparation, H.W.; writing—review and editing, H.W.; visualization, Y.L.; supervision, S.X.; project administration, Y.Y.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets used in this paper are publicly available open-source datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Meyer, P.G.; Cherstvy, A.G.; Seckler, H.; Hering, R.; Blaum, N.; Jeltsch, F.; Metzler, R. Directedeness, correlations, and daily cycles in springbok motion: From data via stochastic models to movement prediction. Phys. Rev. Res. 2023, 5, 043129. [Google Scholar] [CrossRef]
  2. Wang, X.; Lu, W. Super-Resolution Reconstruction of Remote Sensing Images Based on Symmetric Local Fusion Blocks. Int. J. Inf. Secur. Priv. (IJISP) 2023, 17, 1–14. [Google Scholar] [CrossRef]
  3. Sakthivel, S.S.K.; Moorthy, S.; Arthanari, S.; Jeong, J.H.; Joo, Y.H. Learning a Context-Aware Environmental Residual Correlation Filter via Deep Convolution Features for Visual Object Tracking. Mathematics 2024, 12, 2279. [Google Scholar] [CrossRef]
  4. Rong, Y.; Jia, M.; Zhan, Y.; Zhou, L. SR-RDFAN-LOG: Arbitrary-scale logging image super-resolution re-construction based on residual dense feature aggregation. Geoenergy Sci. Eng. 2024, 240, 213042. [Google Scholar] [CrossRef]
  5. Bian, J.; Liu, Y.; Chen, J. Lightweight Super-Resolution Reconstruction Vision Transformers of Remote Sensing Image Based on Structural Re-Parameterization. Appl. Sci. 2024, 14, 917. [Google Scholar] [CrossRef]
  6. Zhang, T.; Du, Y.; Lu, F. Super-Resolution Reconstruction of Remote Sensing Images Using Multiple-Point Statistics and Isometric Mapping. Remote Sens. 2017, 9, 724. [Google Scholar] [CrossRef]
  7. Zhu, H.; Gao, X.; Tang, X.; Xie, J.; Song, W.; Mo, F.; Jia, D. Super-Resolution Reconstruction and Its Application Based on Multilevel Main Structure and Detail Boosting. Remote Sens. 2018, 10, 2065. [Google Scholar] [CrossRef]
  8. Yue, X.; Liu, D.; Wang, L.; Benediktsson, J.A.; Meng, L.; Deng, L. IESRGAN: Enhanced U-Net Structured Generative Adversarial Network for Remote Sensing Image Super-Resolution Reconstruction. Remote Sens. 2023, 15, 3490. [Google Scholar] [CrossRef]
  9. Zhang, X. Superresolution Reconstruction of Remote Sensing Image Based on Middle-Level Supervised Convolutional Neural Network. J. Sens. 2022, 2022, 2603939. [Google Scholar] [CrossRef]
  10. Deepthi, K.; Shastry, A.K.; Naresh, E. A novel deep unsupervised approach for super-resolution of remote sensing hyperspectral image using gompertz-function convergence war accelerometric-optimization generative adversarial network (GF-CWAO-GAN). Sci. Rep. 2024, 14, 1–20. [Google Scholar] [CrossRef] [PubMed]
  11. Li, Y.; Xie, J.; Chi, K.; Zhang, Y.; Dong, Y. Feature Intensification Using Perception-Guided Regional Classification for Remote Sensing Image Super-Resolution. Remote Sens. 2024, 16, 4201. [Google Scholar] [CrossRef]
  12. Ren, W.; Wang, Z.; Xia, M.; Lin, H. MFINet: Multi-Scale Feature Interaction Network for Change Detection of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 1269. [Google Scholar] [CrossRef]
  13. Ji, X.; Tang, L.; Chen, L.; Hao, L.Y.; Guo, H. Toward efficient and lightweight sea–land segmentation for remote sensing images. Eng. Appl. Artif. Intell. 2024, 135, 108782. [Google Scholar] [CrossRef]
  14. Tang, W.; He, F.; Bashir, A.K.; Shao, X.; Cheng, Y.; Yu, K. A remote sensing image rotation object detection approach for real-time environmental monitoring. Sustain. Energy Technol. Assess. 2023, 57, 103270. [Google Scholar] [CrossRef]
  15. Ma, Y.; Zhang, H.; Du, C.; Wang, Z.; Tian, Y.; Yao, X.; Cheng, Z.; Fan, S.; Wu, J. Tracking Multiple Vehicles with a Flexible Life Cycle Strategy Based on Roadside LiDAR Sensors. J. Transp. Eng. Part A Syst. 2024, 150, 04024010. [Google Scholar] [CrossRef]
  16. Jenefa, A.; Bessy, M.K.; Edward Naveen, V.; Lincy, A. EDSR: Empowering super-resolution algorithms with high-quality DIV2K images. Intell. Decis. Technol. 2023, 17, 1249–1263. [Google Scholar]
  17. Guo, Y.; Wang, X.; Han, M.; Xin, J.; Hou, Y.; Gong, Z.; Wang, L.; Fan, D.; Feng, L.; Han, D. Detection and Localization of Albas Velvet Goats Based on YOLOv4. Animals 2023, 13, 3242. [Google Scholar] [CrossRef]
  18. Khan, S.D.; Alarabi, L.; Basalamah, S. An Encoder–Decoder Deep Learning Framework for Building Footprints Extraction from Aerial Imagery. Arab. J. Sci. Eng. 2023, 48, 1273–1284. [Google Scholar] [CrossRef]
  19. Pekmezci, M.; Genc, Y. Evaluation of SSIM loss function in RIR generator GANs. Digit. Signal Process. 2024, 154, 104685. [Google Scholar] [CrossRef]
  20. Ouyang, Z.; Zhao, H.; Zhang, Y.; Chen, L. STVDNet: Spatio-temporal interactive video de-raining network. Vis. Comput. 2024, 1–16. [Google Scholar] [CrossRef]
  21. Yao, S.; Wang, H.; Su, Y.; Li, Q.; Sun, T.; Liu, C.; Li, Y.; Cheng, D. A Renovated Framework of a Convolution Neural Network with Transformer for Detecting Surface Changes from High-Resolution Remote-Sensing Images. Remote Sens. 2024, 16, 1169. [Google Scholar] [CrossRef]
  22. Moctezuma, L.A.; Suzuki, Y.; Furuki, J.; Molinas, M.; Abe, T. GRU-powered sleep stage classification with permutation-based EEG channel selection. Sci. Rep. 2024, 14, 17952. [Google Scholar] [CrossRef]
  23. Mulindwa, D.B.; Du, S. An n-Sigmoid Activation Function to Improve the Squeeze-and-Excitation for 2D and 3D Deep Networks. Electronics 2023, 12, 911. [Google Scholar] [CrossRef]
  24. Tao, P.; Yang, D. RSC-WSRGAN super-resolution reconstruction based on improved generative adversarial network. Signal Image Video Process. 2024, 18, 7833–7845. [Google Scholar] [CrossRef]
  25. Wang, X.; Ao, Z.; Li, R.; Fu, Y.; Xue, Y.; Ge, Y. Super-Resolution Image Reconstruction Method between Sentinel-2 and Gaofen-2 Based on Cascaded Generative Adversarial Networks. Appl. Sci. 2024, 14, 5013. [Google Scholar] [CrossRef]
Figure 1. Overall system framework.
Figure 2. Residual module network structure.
Figure 3. Hybrid neural network structure.
Figure 4. Gated recurrent unit (GRU).
Figure 5. Target tracking results without super-resolution reconstruction.
Figure 6. Target tracking results after super-resolution reconstruction.
Table 1. Comparison of ablation results.

Model          Residual Block   Upsample Block   Loss Function   PSNR    SSIM
EDSR                                                             27.71   0.742
Improvement 1  ✓                                                 28.09   0.828
Improvement 2  ✓                ✓                                28.32   0.833
Improvement 3  ✓                ✓                ✓               28.88   0.862
Table 2. Tracking model performance comparison.

Model                                                MOTA    IDF1
SORT                                                 41.8%   0.389
DeepSORT                                             57.3%   0.564
CNN-GRU                                              63.5%   0.625
Super-resolution Reconstruction Network + CNN-GRU    67.8%   0.656
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
