1. Introduction
In recent years, UAVs have seen increasingly wide use in civil, military, and scientific applications by virtue of their small size, agile maneuvering, and ease of control; examples include power line inspection in harsh environments, atmospheric monitoring, search and rescue, disaster relief, enemy reconnaissance, target tracking, and battlefield intelligence gathering [1,2,3,4,5,6]. UAV visual target tracking is a fundamental and challenging task with high research value, and it remains a major research hotspot in computer vision. However, compared with general tracking scenarios, UAV visual target tracking faces more complicated challenges, which can be summarized as follows: (1) When a UAV performs tracking tasks, interference from wind and other factors causes motion blur, camera shake, and frequent viewpoint changes, which in turn lead to tracking drift and target loss. (2) Owing to the open aerial view and complex ground environment, there are many interfering objects; mutual interference between targets, and between targets and the background, reduces distinguishability, so the target model lacks distinctiveness and is difficult to establish accurately. (3) When the UAV flies at a certain altitude, the imaged swath becomes wider while the resolution and clarity decrease; the scale of the ground target to be tracked becomes small, and the target's features and textures become sparse. This makes feature extraction difficult and the feature representation insignificant, further increasing the difficulty of tracking [7]. Therefore, a robust UAV target tracking method is needed to cope with various complex tracking scenarios.
Deep learning has achieved strong results in visual tracking in recent years, but deep trackers remain difficult to deploy widely on UAVs in real flight scenarios because of limited on-board computational resources, small battery capacity, low power budgets, and payload limits [8]. Under the hard real-time constraints of vision-based UAV systems, an ideal tracker should also leave computational headroom for sensor fusion, advanced control, and other tasks [9]. Although discriminative correlation filter (DCF)-based approaches cannot compete with deep learning-based approaches in accuracy and precision, they can quickly learn a model from a single frame without additional offline training. Because the fast Fourier transform (FFT) transfers the evaluation of correlation to the frequency domain, DCF trackers achieve high computational efficiency and solid robustness on a single CPU, and they have been widely applied in industry.
DCF-based trackers apply cyclic shifts to a single image frame to generate a large number of training samples, but the limited search area makes boundary effects hard to avoid, degrading tracking performance [10]. The background-aware correlation filter (BACF) instead performs cyclic shifts over the whole image and applies a cropping matrix, obtaining more real negative samples from the background, which effectively mitigates the boundary effect. Thanks to the enlarged search area, it can track targets moving at relatively high velocity with respect to the camera or UAV. However, introducing more background information also introduces substantial background clutter as a byproduct, so objects similar to the target, or the background itself, are easily mistaken for the original target. Huang et al. introduced an aberrance suppression term on top of BACF, which suppresses such aberrances well. Other works learn context information during training to extend the receptive field of the DCF and improve its ability to discriminate the background [11,12]. Nevertheless, these methods remain limited against target drift caused by fast target motion, camera shake, and partial occlusion, and they lack a reasonable attention mechanism for aberrance and background information that would devote the limited computational resources to the parts of the scene most affected by interference.
The attention mechanism (AM) is a data-processing method in machine learning that is widely used across task types such as natural language processing (NLP), computer vision (CV), and speech recognition [13]. An attention mechanism can selectively focus on the relevant information according to the input context and task requirements, improving both the accuracy and efficiency of a model. Inspired by spatial attention, in this paper we propose a spatial-attention-based aberrance-repressed correlation filter (SAARCF) to cope more efficiently with target tracking in various complex scenes from the UAV perspective under limited computational resources.
First, we consider an aberrance model built from two consecutive response maps. This model assigns equal attention to aberrances near the target and to those occurring far away; however, aberrances far from the target are usually irrelevant, whereas those close to the target have a greater impact on filter training and are more likely to cause target drift. To address this, we design an attention module that optimizes the filter's perception of aberrances by assigning them different levels of attention according to their spatial position, thereby improving the filter's ability to discriminate the target. To further optimize the use of computational resources, we use the average peak-to-correlation energy (APCE) to decide whether to learn the global context in an aberrance frame: a low APCE indicates that the tracker can no longer cope with drastic scene changes by relying on its own robustness, and the global context must be learned to introduce more background information.
The main contributions of this work are listed as follows:
We propose a robust target tracking method (SAARCF) that alleviates boundary effects and fully utilizes the spatial information of aberrances to suppress sudden changes in the response map, allowing it to better cope with tracking problems in complex environments from the UAV perspective.
An intermittent global context learning mechanism is proposed to utilize background information more efficiently and to cope with drastic scene changes under limited computational resources.
SAARCF is tested on challenging UAV datasets, namely UAVDT and Visdrone2018, and the experimental results show that SAARCF achieves better tracking performance.
The rest of this paper is organized as follows:
Section 2 provides a brief review of the prior work related to this work.
Section 3 presents the baseline tracker on which our tracker is based and elaborates our proposed method.
Section 4 shows the experimental results of our proposed method on the UAVDT and Visdrone2018 UAV benchmarks.
Section 5 provides the conclusion.
3. Proposed Tracking Methodology
In this section, we first introduce the baseline tracker and then elaborate on our proposed tracker, which includes a spatial-attention-based aberrance suppression module and an intermittent context learning mechanism. An overview of the tracker is shown in Figure 1. In Table 1, we explain the variables used in the remainder of the paper.
As can be seen from Figure 1, our proposed tracker consists of two modules: a spatial attention aberrance suppression module and an intermittent context learning mechanism. The former effectively suppresses aberrances, while the latter first determines whether the current frame is an aberrance frame; if it is, the APCE criterion is employed to decide whether global context learning is needed.
3.1. Baseline Tracker
In this work, the background-aware correlation filter (BACF) is adopted as our baseline tracker. BACF uses cyclic matrix sampling; in contrast with other correlation filters, it does not perform cyclic shifts around the target but over the entire image, and then uses a cropping matrix to crop the samples to the desired size. As a result, the number of samples increases significantly, and the negative samples are real rather than artificially stitched together.
BACF aims to minimize the following objective:
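$$
E(\mathbf{h}) = \frac{1}{2}\sum_{j=1}^{N}\Big\| \mathbf{y}(j) - \sum_{d=1}^{D}\mathbf{h}_{d}^{\top} B\,\mathbf{x}_{d}[\Delta\tau_j] \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{h}_{d}\big\|_2^2,
$$

where $\mathbf{x}_d$ is the $d$-th channel of the vectorized training sample of length $N$, $[\Delta\tau_j]$ denotes the $j$-th step of the cyclic shift, $B$ is the binary cropping matrix, $\mathbf{y}$ is the desired Gaussian-shaped response, $\mathbf{h}_d$ is the $d$-th channel of the filter, and $\lambda$ is a regularization weight, as in the original BACF formulation of Kiani Galoogahi et al.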
3.2. Aberration Suppression Module
The baseline tracker used in this work effectively alleviates the boundary effects present in classic correlation filter algorithms. However, in introducing background information, BACF also introduces a large amount of clutter, which leads to the misidentification of objects similar to the target. Moreover, in drone scenarios the environment undergoes drastic changes, easily generating various aberrances that BACF struggles to handle: significant changes in the spatial shape of the response map between two consecutive frames can easily lead to target loss. The difference between consecutive response maps therefore reflects, to a certain extent, the severity of the aberrance.
In light of this, we first model the aberrance as

$$
A_k = \big\| R_{k-1}\left[\psi_{p,q}\right] - R_k \big\|_2^2,
$$

where $R_k$ is the response map of the $k$-th frame, $p$ and $q$ represent the position differences between the peak values of the response maps in the previous and current frames, and the shift operation $[\psi_{p,q}]$ is used to align the peak positions of the two response maps. When an aberrance occurs, the similarity between the response maps decreases, resulting in a higher value of $A_k$. In order to suppress the generation of aberrances during filter training, the training objective of the filter is optimized to minimize a loss function that augments the BACF objective with this aberrance penalty, as written below.
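Following the aberrance-repressed formulation of Huang et al., with $\gamma$ weighting the aberrance term, the loss is

$$
E(\mathbf{h}_k) = \frac{1}{2}\sum_{j=1}^{N}\Big\| \mathbf{y}(j) - \sum_{d=1}^{D}\mathbf{h}_{k,d}^{\top} B\,\mathbf{x}_{k,d}[\Delta\tau_j] \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{h}_{k,d}\big\|_2^2 + \frac{\gamma}{2}\Big\| R_{k-1}\left[\psi_{p,q}\right] - R_k \Big\|_2^2.
$$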
However, this aberrance penalty applies the same weight to the distortion at every point in space. Instead, we should focus on the aberrance occurring near the peak of the response map, that is, near the tracked target, because the aberrance in this area has the greatest impact on filter training. The farther from this focal area an aberrance occurs, the less relevant it becomes, and we should not allocate much attention to these edge areas; otherwise, the filter will over-attend to edge-area information, relatively weakening its ability to distinguish the target. In view of this, this paper designs a spatial attention module Q and introduces it into the training stage of the filter, yielding the new objective function below.
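Assuming the attention map enters as an element-wise weight ($\odot$) on the aberrance residual, the new objective takes the form

$$
E(\mathbf{h}_k) = \frac{1}{2}\sum_{j=1}^{N}\Big\| \mathbf{y}(j) - \sum_{d=1}^{D}\mathbf{h}_{k,d}^{\top} B\,\mathbf{x}_{k,d}[\Delta\tau_j] \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{h}_{k,d}\big\|_2^2 + \frac{\gamma}{2}\Big\| Q \odot \big( R_{k-1}\left[\psi_{p,q}\right] - R_k \big) \Big\|_2^2.
$$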
The attention module $Q$ in the $k$-th frame is obtained from the aligned response map $R_{k-1}[\psi_{p,q}]$, whose peak coordinates give the center of attention. In this paper, the attention allocated to each position in the spatial domain decays with its distance from this peak and is controlled by two hyperparameters: a base factor and a scale factor.
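The following sketch builds such an attention map, assuming (for illustration only) an exponential decay with squared distance from the peak; `base` and `scale` stand in for the base and scale factors, and the exact functional form used in the filter may differ.

```python
import numpy as np

def spatial_attention_map(shape, peak, base=1.0, scale=5.0):
    """Illustrative spatial attention map: the weight is largest at the
    response-map peak and decays with squared distance from it.
    shape: (rows, cols) of the response map; peak: (row, col) of its maximum.
    """
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    d2 = (rows - peak[0]) ** 2 + (cols - peak[1]) ** 2
    # Normalize the squared distance by the map area so `scale` is
    # insensitive to the response-map resolution.
    return base * np.exp(-scale * d2 / (shape[0] * shape[1]))

# Q weights the aberrance residual so that mismatches near the target
# peak dominate the penalty, while far-away mismatches are downplayed.
```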
Figure 2 shows the tracking results and response maps of the proposed tracker (SAARCF) and the baseline tracker (BACF). The suppression of aberrance by SAARCF can also be observed in the response maps and is significantly more pronounced than with BACF.
3.3. Intermittent Context Learning Mechanism
3.3.1. Aberrance Frame Recognition
In the baseline tracker, only limited context information is included, which makes the tracker prone to drift under rapid motion, occlusion, and background noise. To address this, we incorporate global context information into the filter training process. Frame-by-frame context learning, however, is highly redundant, as the capture rate of drone cameras is generally higher than the rate at which the context changes. For example, in a 30 frames per second (FPS) video, the interval between two consecutive frames is approximately 0.033 s, whereas the background in an aerial view typically remains unchanged for much longer than that [27]. Furthermore, owing to the ubiquitous appearance changes in aerial scenes, a single filter trained by unrestricted learning is easily corrupted.
In light of this, we also need to allocate attention reasonably across the image sequence, focusing on frames with dramatic scene changes (which we refer to as aberrance frames) so as to fully utilize the limited computing resources. Accordingly, we use the aberrance measure $A_k$ defined in Section 3.2 as the criterion for identifying aberrance frames, setting a threshold computed over the $T$ frames between the current frame $t$ and the previous aberrance frame. When $A_t$ exceeds this threshold, the aberrance is exceptionally noticeable, and frame $t$ is recognized as a new aberrance frame.
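A minimal sketch of this recognition step follows; the history-mean threshold and the factor `kappa` are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def is_aberrance_frame(resp_prev, resp_cur, history, kappa=1.5):
    """Flag the current frame as an aberrance frame when the aberrance
    measure A_t jumps well above its recent history.
    history: list of A values since the previous aberrance frame.
    """
    # Align the previous response map so its peak coincides with the
    # current peak (the shift operation [psi_{p,q}]).
    p_prev = np.unravel_index(resp_prev.argmax(), resp_prev.shape)
    p_cur = np.unravel_index(resp_cur.argmax(), resp_cur.shape)
    aligned = np.roll(resp_prev,
                      (p_cur[0] - p_prev[0], p_cur[1] - p_prev[1]),
                      axis=(0, 1))
    a_t = np.sum((aligned - resp_cur) ** 2)
    threshold = kappa * np.mean(history) if history else np.inf
    history.append(a_t)
    return a_t > threshold
```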
3.3.2. Multi-Peak Confidence Level
For the aberrance frames defined in Section 3.3.1, the tracker still has a certain level of robustness: in some cases, even without learning the global context, it can counteract the impact of the aberrance on its own. In this situation, learning the global context during filter training might mistakenly absorb background information, leading to error accumulation and wasted computational resources. Therefore, after identifying an aberrance frame, we first evaluate the confidence of the current tracking result. Only when the confidence is low, indicating that the tracker's inherent robustness can no longer handle the dramatic scene change, is the global context learned during the filter training stage.
In this paper, the maximum response value of the filter output and the average peak-to-correlation energy (APCE) [28] are used as metrics to assess the confidence of the tracking results.
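As defined in [28], APCE is computed as

$$
\mathrm{APCE} = \frac{\left| F_{\max} - F_{\min} \right|^{2}}{\operatorname{mean}\!\left( \sum_{w,h} \left( F_{w,h} - F_{\min} \right)^{2} \right)}.
$$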
Here, $F_{\max}$ denotes the maximum peak of the response map, $F_{\min}$ the minimum value, and $F_{w,h}$ the response value at position $(w,h)$. A small APCE suggests that the target is significantly affected by surrounding interference, so the tracking result has low confidence; conversely, a larger APCE implies higher confidence. In this study, we establish a threshold $L$, calculated as the mean APCE over all frames between the current frame and the previous aberrance frame, multiplied by a scaling factor. If the current APCE is less than $L$, we determine that the confidence is low and that the aberrance has a substantial impact on the tracker; the tracker then cannot effectively handle the dramatic scene change, and the global context must be learned.
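The decision rule can be sketched as follows; the helper names are ours, and `beta` stands in for the scaling factor whose value is set in Section 4.1.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a response map [28]."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

def needs_global_context(response, apce_history, beta=0.6):
    """Decide whether an aberrance frame warrants global context learning.

    apce_history: APCE values of all frames since the last aberrance frame.
    Returns True when the current APCE falls below the scaled history mean.
    """
    threshold = beta * np.mean(apce_history)  # the threshold L
    return apce(response) < threshold
```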
3.3.3. Context Patches Scoring Scheme
We calculate the scores of context patches using Euclidean distances. Context patches have the same dimensions as the target and are located around it. The score of patch $j$ is a decreasing function of the Euclidean distance $d_j$ between the target and context patch $j$; in the score, $z$ is a constant, and $w$ and $h$ are the width and height of the target box, respectively.
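As an illustration of this distance-based scoring, the sketch below assumes an exponential decay with the distance normalized by the target box diagonal; the exact functional form used in the paper's score equation may differ.

```python
import numpy as np

def patch_score(target_center, patch_center, w, h, z=1.0):
    """Illustrative context-patch score: closer patches score higher.

    target_center, patch_center: (x, y) centers of the target and the patch.
    w, h: width and height of the target box; z: decay constant.
    """
    d = np.hypot(target_center[0] - patch_center[0],
                 target_center[1] - patch_center[1])
    # Normalizing by the box diagonal makes the score scale-invariant.
    return np.exp(-z * d / np.hypot(w, h))
```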
At this point, for the aberrance frame, the new objective function is described as follows:
where the third term is the response of the context patches (whose desired responses are 0) and $P$ is the number of context patches. For non-aberrance frames, the objective function remains as Equation (3).
Next, we need to transform the objective function into the frequency domain. To facilitate computation, we first convert Equation (8) into matrix form as follows:
We now explain the parameters in the objective function for the $k$-th frame; for brevity, the subscript $k$ is omitted below. ⊗ is the Kronecker product, and $I$ is an identity matrix of the appropriate size. $F$ denotes the orthonormal DFT matrix mapping a signal from the spatial domain to the frequency domain, so that $\hat{x} = \sqrt{N}Fx$. In more detail, $X$ is the matrix form of the input sample $x$, and $X_p$ is the matrix form of the $p$-th context patch.
3.4. Transformation into Frequency Domain
The equation above represents the total loss function in matrix form, which is essentially a convolution operation. To ensure computational efficiency, it is transformed into the frequency domain as follows:

Here, the hat notation $\hat{\cdot}$ represents the discrete Fourier transform (DFT). To further streamline the computation, we introduce a new parameter representing the DFT of the previous frame's response after the peak-aligning shift. At the $k$-th frame, the response map of frame $k-1$ and the value derived from it can be regarded as constants, which further simplifies the calculation.
3.5. Optimization through ADMM
The alternating direction method of multipliers (ADMM) is fast at solving convex optimization problems and has good convergence properties. Equation (10) can be solved with ADMM to obtain the globally optimal solution. Therefore, we transform Equation (10) into its augmented Lagrangian multiplier (ALM) form as follows:
Here, $\mu$ is the penalty factor and $\hat{g}$ is an auxiliary variable that we introduce in the Fourier domain. The advantage of ADMM is that it transforms the overall problem into two simpler subproblems: solving the correlation filter for the $k$-th frame reduces to alternately solving for $h$ and $\hat{g}$. The specific solutions are as follows:
3.5.1. Solution to Subproblem $h$
The solution of subproblem $h$ can be easily obtained from the following equation:
Here, $g$ and $\zeta$ can be obtained by the inverse Fourier transform:
3.5.2. Solution to Subproblem $\hat{g}$
It can be seen that subproblem $\hat{g}$ includes a term that is time-consuming to compute directly. Fortunately, by exploiting the sparsity of the data matrix, each element of $\hat{g}$, i.e., $\hat{g}(n)$, where $n = 1, 2, \ldots, N$, depends solely on:
where the operator conj(·) denotes the complex conjugate operation.
From this, subproblem $\hat{g}$ is decomposed into $N$ smaller problems:
The solution for each smaller problem after decomposition can be easily obtained:
3.5.3. Update of Lagrangian Parameter
The Lagrangian parameters are updated by the standard ADMM scheme:

$$
\hat{\zeta}^{(i+1)} = \hat{\zeta}^{(i)} + \mu\left(\hat{g}^{(i+1)} - \hat{h}^{(i+1)}\right),
$$

where $i$ denotes the $i$-th iteration, and the penalty factor is updated as $\mu^{(i+1)} = \min\left(\mu_{\max},\ \rho\,\mu^{(i)}\right)$.
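The overall alternating scheme can be sketched as follows. This is a single-channel, 1-D illustration whose closed-form updates are placeholders capturing the structure of Sections 3.5.1 to 3.5.3; the actual tracker operates on multi-channel 2-D features with the cropping matrix and attention terms included.

```python
import numpy as np

def admm_train(xf, yf, lam=0.01, mu=1.0, mu_max=1e4, rho=10.0, iters=5):
    """Skeleton of the ADMM iterations used to train the filter.

    xf, yf: DFTs of a (single-channel) sample and the desired response.
    Returns the spatial-domain filter h after `iters` iterations.
    """
    N = xf.size
    h = np.zeros(N)                        # filter (spatial domain)
    g_hat = np.zeros(N, dtype=complex)     # auxiliary variable (frequency domain)
    zeta_hat = np.zeros(N, dtype=complex)  # Lagrangian multiplier
    for _ in range(iters):
        # Subproblem g: per-element closed-form update in the frequency domain.
        h_hat = np.fft.fft(h)
        g_hat = (np.conj(xf) * yf + mu * h_hat - zeta_hat) / (np.conj(xf) * xf + mu)
        # Subproblem h: solved in the spatial domain via the inverse DFT.
        g = np.real(np.fft.ifft(g_hat))
        zeta = np.real(np.fft.ifft(zeta_hat))
        h = (mu * g + zeta) * N / (lam + mu * N)
        # Lagrangian multiplier and penalty factor updates.
        zeta_hat = zeta_hat + mu * (g_hat - np.fft.fft(h))
        mu = min(mu_max, rho * mu)
    return h
```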
3.5.4. Update of Appearance Model
The appearance model $\hat{x}^{\mathrm{model}}$ is updated following the standard online scheme of BACF-style trackers:

$$
\hat{x}^{\mathrm{model}}_{k} = (1 - \eta)\,\hat{x}^{\mathrm{model}}_{k-1} + \eta\,\hat{x}_{k},
$$

where $\eta$ is the learning rate of the appearance model.
4. Experiment
In this section, to evaluate the performance of the proposed algorithm, we conduct quantitative and qualitative experiments on two challenging UAV target tracking datasets, UAVDT and Visdrone2018. The results are compared with 11 other state-of-the-art trackers that use handcrafted features: AutoTrack, ECO-HC, ARCF (HOG + CN), ARCF-H (HOG), SRDCF, BACF, Staple, DSST, SAMF, KCF, and CSK. To ensure a fair real-time evaluation, the code obtained from each author's webpage was run uniformly on our computer rather than taking the reported evaluation results directly. Readers interested in the specific parameters the authors used in their own testing are referred to the literature [11,21,23,24,29,30,31,32,33,34]. The tracker settings and parameters used in our testing are the same as those used by the authors in the original studies. The experiments are divided into an overall performance evaluation and a per-attribute performance evaluation. These datasets cover a variety of complex scenes and diverse attributes in the UAV tracking domain and are widely used to verify the effectiveness of tracking algorithms in UAV scenarios.
4.1. Experimental Platform and Parameters
For the aberrance suppression module described in Section 3.2, the base and scale factors were fixed throughout testing; for the intermittent context learning mechanism described in Section 3.3, the aberrance threshold and the APCE scaling factor were likewise fixed. The number of ADMM iterations is set to 5, and the learning rate is 0.0192. All 12 trackers are run on a computer equipped with an Intel(R) Core(TM) i5-10300H CPU @ 2.50 GHz and 16 GB of RAM, using MATLAB R2019a.
4.2. Evaluation Methodology
We use the one-pass evaluation (OPE) protocol, in which only the ground truth of the target's first frame is provided and the tracker is not re-initialized afterward. We adopt the widely used evaluation rules proposed by Wu et al. [35], which assess tracker performance in terms of the distance precision rate (DPR) and the overlap success rate (OSR). The center location error (CLE) is the Euclidean distance between the center of the predicted bounding box and that of the ground truth [36]. The DPR of a single sequence is the proportion of frames whose CLE is below a given threshold; a 20-pixel CLE threshold is typically used to rank trackers. Trackers can likewise be ranked by the intersection over union (IoU), whose numerator is the area of the intersection between the predicted box and the ground-truth box and whose denominator is the area of their union. The OSR of a single sequence is the proportion of frames whose IoU exceeds a given threshold.
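For concreteness, the two per-frame quantities underlying DPR and OSR can be computed as follows; this is a standard implementation sketch, and the (x, y, w, h) box format is an assumption.

```python
import numpy as np

def cle(pred_center, gt_center):
    """Center location error: Euclidean distance between box centers."""
    return np.hypot(pred_center[0] - gt_center[0],
                    pred_center[1] - gt_center[1])

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

# DPR at 20 px: fraction of frames with cle(...) < 20.
# OSR at threshold t: fraction of frames with iou(...) > t.
```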
4.3. Benchmarks for Experiments
The UAVDT benchmark consists of 100 video sequences recorded at 30 frames per second (fps) with a resolution of 1080 × 540 pixels [37]. UAVDT supports both single-object detection and single-object tracking; its single-object tracking part includes 50 video sequences annotated with 9 attributes: camera motion (CM), illumination variation (IV), long-term tracking (LT), large occlusion (LO), object motion (OM), object blur (OB), background clutter (BC), small object (SO), and scale variation (SV).
The Visdrone benchmark is captured by cameras mounted on various drones and covers a wide range of conditions, including location (14 different cities in China, thousands of kilometers apart), environment (urban and rural), objects (pedestrians, vehicles, bicycles, etc.), and density (sparse to crowded scenes). The data were collected with different drone platforms (i.e., different drone models) in different scenes and under different weather and lighting conditions. The single-object tracking part of Visdrone contains 50 video sequences annotated with the following attributes: aspect ratio change (ARC), background clutter (BC), camera motion (CM), fast motion (FM), full occlusion (FO), illumination variation (IV), low resolution (LR), out-of-view (OV), partial occlusion (PO), similar object (SO), scale variation (SV), and viewpoint change (VC).
4.4. Overall Performance Evaluation
The overall performance of the proposed SAARCF algorithm on the UAVDT benchmark is shown in Figure 3 (the evaluation values in the figure are percentages). The precision and success rate are 0.743 and 0.514, respectively, improvements of 12.1% and 12.7% over the baseline tracker. Moreover, SAARCF generally outperforms the other state-of-the-art trackers overall.
The overall performance of the SAARCF algorithm on the Visdrone benchmark is shown in Figure 4 (the evaluation values in the figure are percentages). The precision and success rate are 0.798 and 0.719, respectively, 2.6% and 2.4% higher than those of the baseline tracker, and SAARCF also outperforms the other state-of-the-art trackers overall. Although its precision is 0.01 lower than that of ECO-HC, its success rate is 0.19 higher, so its comprehensive performance still exceeds that of ECO-HC.
4.5. Attribute Evaluation
Figure 5 and Figure 6 compare the precision and success rates for 8 attributes on the UAVDT dataset. As the figures show, SAARCF improves on the baseline tracker by more than 10% in all 8 attributes. Among all trackers, SAARCF achieves the top precision scores of SV (0.647), CM (0.718), BC (0.668), OM (0.672), and LO (0.548), exceeding the second-best tracker, AutoTrack, by 1.7% (SV), 3.2% (CM), 3.2% (BC), 1.3% (OM), and 4.9% (LO). SAARCF also obtains the highest success rates on SV (0.457), CM (0.493), BC (0.441), and SO (0.558), with improvements over AutoTrack of 0.4% (SV), 2.2% (CM), 1.1% (BC), 1.3% (OM), and 1.6% (OB).
Figure 7 provides a more intuitive comparison of the performance of various trackers in the 8 attributes.
The per-attribute evaluation results on the Visdrone dataset are shown in Figure 8, which compares the precision and success rates of the trackers across the attributes. SAARCF outperforms the baseline tracker on every attribute. The precision rates are: VC (0.819), SV (0.766), SO (0.641), PO (0.737), OV (0.802), LR (0.706), IV (0.856), FO (0.701), FM (0.746), CM (0.799), BC (0.663), and ARC (0.799). The success rates are: VC (0.787), SV (0.674), SO (0.540), PO (0.698), OV (0.693), LR (0.461), IV (0.818), FO (0.664), FM (0.669), CM (0.722), BC (0.543), and ARC (0.661). SAARCF ranks first in both precision and success rate for the CM, FM, and SO attributes.
Figure 9, Figure 10, and Figure 11 show the precision and success rates of the various trackers for these three attributes, respectively.
Owing to the introduction of the spatial attention aberrance suppression module, SAARCF can effectively cope with distortions of the environment and the target, as evidenced by its highest scores on the CM and FM attributes in Figure 9 and Figure 10. Moreover, Figure 11 shows that SAARCF handles interference from similar targets better than the other trackers, which further confirms that learning the global context effectively enhances the filter's ability to distinguish the target.
4.6. Real-Time Evaluation
We conducted a real-time evaluation of SAARCF and the 11 other trackers on the UAVDT and Visdrone datasets, with the results shown in Table 2. As the table shows, our proposed SAARCF algorithm meets the requirements of real-time tracking. To verify the effectiveness of the proposed intermittent context learning mechanism (ILCM), we compared the FPS of three configurations: the baseline tracker with the aberrance suppression module (Baseline + SAAR), the baseline tracker with the aberrance suppression module and frame-by-frame global context learning (Baseline + SAAR + LCFF), and the baseline tracker with the aberrance suppression module and ILCM (Baseline + SAAR + ILCM). The results in Table 3 show that the proposed ILCM effectively saves computational resources and improves computation speed.
4.7. Qualitative Experimental Analysis
To more intuitively analyze the effectiveness of SAARCF, we selected five video sequences from the UAVDT dataset for evaluation and compared SAARCF with two other top trackers (AutoTrack and ECO-HC) and the baseline tracker (BACF). The tracking results are shown in Figure 12.
In Seq1, all four trackers track the target stably before any occlusion occurs. At frame 153, leaves partially occlude the target, but none of the four trackers drifts. At frame 214, however, the vehicle is mostly occluded, and only SAARCF keeps tracking the target. This indicates that, during occlusion, SAARCF learns the global context of the target and incorporates more background information, so that even when most of the target area is occluded and the available target information is limited, the target can still be recognized accurately. At frame 279, the target is completely occluded, and even SAARCF cannot track it.
In Seq2, although the illumination changes significantly, all four trackers can still accurately track the target. At frame 150, a car passes near the target, and the baseline tracker fails to distinguish the object similar to the target, tracking the wrong target instead. At frame 356, the camera suddenly rotates rapidly, causing ECO-HC to experience tracking drift, while AutoTrack and SAARCF can still accurately follow the target.
In Seq3, as the camera moves upward, the target becomes completely occluded between frames 30 and 36; when it reappears at frame 37, it is immediately recaptured by SAARCF. This shows that SAARCF has a certain level of robustness to complete occlusion.
In Seq4 and Seq5, there is some disturbance around the target, accompanied by sudden camera movements and partial occlusion, and only SAARCF completes the tracking task well. In particular, at frame 429 in Seq5, when the target suddenly jumps to shoot, only SAARCF successfully follows it. This indicates that SAARCF is more robust to sudden changes in the environment or target.
5. Conclusions
To address the drastic changes in scenes and targets from the UAV perspective, this paper proposes SAARCF; to fully utilize background information under limited computational resources, an intermittent context learning mechanism is also introduced. Extensive tests on the UAVDT and Visdrone datasets show that the proposed SAARCF outperforms state-of-the-art trackers in overall performance, with particularly strong results on attributes such as camera motion, fast target motion, and partial occlusion.
At the same time, this work has a shortcoming: it does not account for target scale estimation when the target size changes in complex environments, which is needed to adjust the bounding box and avoid learning excessive background information from an oversized box. The next step in this research will focus on implementing target scale adaptation.