1. Introduction
Visual object tracking (VOT) [1,2,3,4,5,6,7] is a fundamental task in remote sensing and computer vision that involves predicting the position and shape of a target in each subsequent frame based on its position and shape in an initial frame. This task has numerous applications, including aerial or satellite platform surveillance, human–computer interaction [8], mobile robots [9], and autonomous driving [10]. Although mainstream tracking algorithms can balance accuracy and speed [11,12,13,14,15,16,17], object tracking still faces significant challenges, such as fast-moving targets, background interference, occlusion, shape changes, and target loss. Designing a robust, accurate, and real-time tracker therefore remains a demanding task. VOT methods generally follow a multi-level pipeline [17,18,19] that comprises three parts. The first part is feature extraction, which extracts generic features from both the template and the search region. The second part is information interaction, which aggregates the feature information of the template and the search region. The final part is bounding box and location estimation, typically achieved through a task-specific head that allows for precise target positioning and bounding box estimation.
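For illustration, the three-part pipeline can be sketched in PyTorch-style pseudocode as follows; the module names (backbone, fusion, box_head) are generic placeholders for the three stages and do not correspond to the components of any specific tracker cited above:

```python
import torch
import torch.nn as nn

class GenericTracker(nn.Module):
    """Illustrative three-stage VOT pipeline: feature extraction -> interaction -> head."""
    def __init__(self, backbone: nn.Module, fusion: nn.Module, box_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # 1) generic feature extraction
        self.fusion = fusion       # 2) template/search information interaction
        self.box_head = box_head   # 3) bounding box and location estimation

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        z = self.backbone(template)   # template features
        x = self.backbone(search)     # search-region features
        fused = self.fusion(z, x)     # aggregate template information into search features
        return self.box_head(fused)   # predict the target state (e.g., a bounding box)
```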
The classic VOT algorithms, including correlation-based methods such as SiamFC (Siamese Fully Convolutional Network) [20], SiamRPN (Siamese Region Proposal Network) [21], C-RPN (Cascaded Region Proposal Network) [22], SiamFC++ [17], SiamBAN (Siamese Box Adaptive Network) [23], SiamCAR (Siamese Classification and Regression network) [24], and Ocean (Object-aware Anchor-free Network) [13], as well as online learning algorithms such as MOSSE (Minimum Output Sum of Squared Error filter) [25], KCF (Kernelized Correlation Filter) [26], CSR-DCF (Channel Spatial Reliability-Discriminative Correlation Filter) [27], ATOM (Accurate Tracking by Overlap Maximization) [28], DiMP (Discriminative Model Prediction) [29], FCOT (Fully Convolutional Online Tracking) [14], and ECO (Efficient Convolution Operators) [30], generally use CNNs as the basic components for feature extraction and information interaction. However, CNNs are not always effective in modeling long-range dependencies between image content and time-series features, which may result in target loss in some cases (e.g., when the target is briefly occluded). In sequence modeling, the transformer [31] overcomes this problem with its global and dynamic modeling capabilities. For tracking tasks [32,33], the transformer can extract discriminative spatial and temporal features, making it a fundamental component of modern high-precision trackers. Although the flexibility of attention in vision transformers (VIT) [34,35] enables higher performance, a main challenge of current VIT-based trackers [19] is their slow training convergence caused by the large amount of input image content, which places a heavier burden on computing resources [32].
The current algorithms used for object tracking on unmanned aerial vehicle (UAV) platforms face several challenges, as highlighted in previous studies [36,37]. These challenges can be summarized in two points. First, UAV videos are prone to motion blur, occlusion, and background clutter due to the rapid movement of the drone. The altitude variation of the drone's six degrees of freedom (6 DOF) motion further complicates the task by causing objects to appear at different scales. As a result, extracting stable features from UAV videos is considerably harder than from videos in other domains. Second, object tracking in UAV videos requires the rapid and accurate localization of moving objects, making real-time performance a critical requirement. Therefore, reducing the convergence difficulty of a modern tracker is necessary before it can be deployed on a UAV platform. To address the challenges faced by existing tracking methods, we design a new tracker called ParallelTracker, which offers several key advantages. First, we design the prior knowledge extractor module (PEM) to extract spatial information from the template image, addressing the lack of spatial prior information in the VIT architecture. This spatial knowledge assists the tracker in both training and inference while reducing the computational burden during training and maintaining strong performance. Second, we design the template features parallel enhancing module (TPM), a parallel attention module that utilizes prior knowledge from the template image to capture more target-oriented discriminative features and enhance feature extraction. Furthermore, we design the template prior knowledge merge module (TPKM), based on TPM, to further extract features and facilitate information communication between the template and the search region. TPKM uses position self-attention and channel self-attention to extract deep features from both the template and the search region in parallel, allowing information interaction in both the position and channel dimensions through parallel attention operations. This improves the acquisition of correlations between the template and the search region. Finally, we use a simple localization head to complete the tracking process. Additionally, we design a score prediction head based on parallel attention to implement the target template update strategy. Our ParallelTracker requires fewer training epochs than existing VIT-based trackers [18,19,38], effectively decreasing the difficulty of convergence and striking a balance between performance and speed.
The main contributions of this paper can be divided into two aspects. First, we propose a simple, concise, and effective VIT-based tracker called ParallelTracker that aims to enhance the accuracy of VIT-based tracking algorithms for UAV video by addressing the challenges of motion blur, camera motion, occlusion, and slow convergence. We design PEM, TPM, and TPKM to assist feature extraction and information interaction, leading to faster convergence and better performance. Second, we demonstrate that ParallelTracker achieves state-of-the-art performance on the UAV tracking benchmark UAV123, faster convergence, and real-time performance of 25 FPS on an NVIDIA GeForce RTX 2070 GPU.
2. Related Literature
Early Siamese tracking networks, such as SiamFC [20], SiamRPN [21], and SiamBAN [23], first extract features from the template and the search region using CNN branches with identical structures and parameters. The template and search-region features are then fused using correlation calculation to estimate the subsequent target state. These networks typically adopt CNNs, including ResNet [39] and others [40], as the backbone for extracting image features.
The transformer [31], originally proposed for machine translation tasks in NLP (natural language processing), is now widely adopted for various vision tasks. It introduced a multi-head attention mechanism to enable information interaction among the elements of a sequence, endowing it with unique memory and global computing capabilities. Compared to a CNN with a local receptive field, the core component of the transformer, self-attention, is capable of capturing global information and computing interdependencies among image features. The DETR model [41] introduced the transformer into object detection and surpassed many CNN-based methods. Many researchers [19,32,36] have also introduced DETR into the field of visual object tracking, achieving excellent results. OSTrack [18] proposed a fully end-to-end tracker that uses the transformer to perform feature extraction and information interaction simultaneously. TrDiMP [33] integrated transformer structures into tracking frameworks by separating the encoder and decoder of the transformer and exploiting temporal information between frames, which effectively enhanced the robustness of Siamese-like trackers. TrTr [42] utilized contextual information extracted from the learned template image in the encoder and transmitted it to the decoder for information exchange with the search region. TransT [32] proposed an attention mechanism for nonlinear feature fusion, which effectively integrated global information from both template images and search regions by capturing associated information from long-range features. STARK [19] concatenated template images and search regions to learn spatiotemporal features jointly across frames. ParallelTracker differs fundamentally from these existing trackers in several aspects. Specifically, instead of using self-attention as in Dual_VIT [43], our designed modules employ a parallel attention structure for feature extraction and information interaction. Moreover, we adopt multiple templates and search regions as inputs and utilize a corner-based localization head to generate bounding boxes.
3. Methodology
We propose an end-to-end tracking framework named ParallelTracker, which comprises the PEM module for extracting the spatial prior of the template, as well as the TPM and TPKM modules for enhancing the features and fusing the target information, respectively. In addition, ParallelTracker incorporates a corner-based fully convolutional localization head that estimates the bounding box of the tracked object. Moreover, we introduce a confidence-based score head, which offers a specially designed mechanism for updating the online template to address potential deformations of the tracked object. The overall architecture of ParallelTracker is illustrated in Figure 1.
3.1. PEM, TPM and TPKM
We design PEM, as shown in Figure 2a, to extract prior knowledge from object templates.
We propose PEM to extract local spatial information from the template image, boosting the capacity of the tracker while mitigating the computational burden during training. The utilization of local spatial information in the vision transformer enables the model to capture specific local features of the object. Additionally, this approach imbues the model with inductive biases similar to those of CNNs, such as locality and spatial invariance. Prior knowledge can be defined as the local spatial feature of the target, which informs the model's understanding of the target's spatial context. The specific operation of PEM is as follows. First, we obtain downsampled template tokens using a fully convolutional patch embedding layer, which is composed of convolutional layers that increase the channel dimension while reducing the spatial resolution. Specifically, given $T$ templates (i.e., the first template and online templates) of size $H_z \times W_z \times 3$, we first map them into patch embeddings using a convolutional layer with stride 4 and kernel size 7. Then, we flatten the patch embeddings, resulting in a token sequence of size $\frac{H_z}{4}\frac{W_z}{4} \times C$, where $C$ represents the number of channels and is set to 96, and $H_z$ and $W_z$ denote the height and width of the template, respectively, both set to 128. Next, these tokens are fed into an average pooling layer, which reduces their resolution while preserving vital information. The target query is initialized as a random vector and upsampled through bicubic interpolation to match the shape of the template tokens. To map the template tokens and the target query to the same linear space while retaining the spatial information of the template in the target query through a self-attention mechanism, we incorporate a linear mapping (LM) layer into our model. This allows us to extract local spatial information as spatial prior knowledge, which guides subsequent modules in feature extraction, feature enhancement, and information interaction. As a result, prior information, called prior tokens (PT), can be incorporated into the tracking network, which is a distinctive feature of our model and sets it apart from other approaches in the field. As shown in (1),

$$\mathrm{PT} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad (1)$$

where $d$ represents the dimension of the key, $q$ represents the target query, and $z$ represents the template tokens. The learnable target token $q$ is interpolated to match the spatial shape of the template tokens $z$, and then all tokens are linearly mapped to obtain $Q$, $K$, and $V$. We utilize a self-attention mechanism to extract spatial prior information from the template tokens and fuse it into the target tokens, resulting in the generation of tokens with template prior knowledge.
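The PEM computation described above can be summarized in the following minimal PyTorch-style sketch. It assumes single-head attention, a pooling factor of 2, and an initial 4 x 4 target query; these values, the layer names, and the exact projection layout are illustrative assumptions rather than the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEM(nn.Module):
    """Prior knowledge extractor: distills spatial priors from template tokens."""
    def __init__(self, dim: int = 96, pool: int = 2):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.pool = nn.AvgPool2d(pool)                  # reduce resolution, keep vital info
        self.q_proj = nn.Linear(dim, dim)               # linear mappings (LM) to a shared space
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(1, dim, 4, 4))  # randomly initialized target query

    def forward(self, template: torch.Tensor) -> torch.Tensor:
        # template: (B, 3, Hz, Wz) -> pooled template tokens: (B, N, C)
        feat = self.pool(self.patch_embed(template))
        B, C, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)
        # upsample the target query via bicubic interpolation to the template-token shape
        query = F.interpolate(self.query, size=(h, w), mode="bicubic", align_corners=False)
        query = query.expand(B, -1, -1, -1).flatten(2).transpose(1, 2)
        # self-attention fuses template spatial information into the target query
        q, k, v = self.q_proj(query), self.k_proj(tokens), self.v_proj(tokens)
        attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)
        prior_tokens = attn.softmax(dim=-1) @ v         # prior tokens (PT)
        return prior_tokens
```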
TPM is designed to enrich template features, as illustrated in Figure 2b. TPM consists of two parallel attention modules and two feed-forward layers (FFN), with layer normalization (LN) applied before the FFN and LM layers. The TPM module utilizes the prior tokens (PT) extracted by the PEM module to enhance the template features, thereby improving the ability of the extracted features to attend to both spatial and semantic information in the target template. The first attention module enhances the spatial representation in the template tokens, while the other attention module extracts key information from the template features and stores it in the enhanced prior tokens. The implementation of these two attentions is shown in Equation (2).
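Since Equation (2) is not reproduced in the text above, the following is a hedged sketch of our reading of the TPM structure: two cross-attention branches that run in parallel between template tokens and prior tokens, each followed by an FFN. The attention directions, head count, and FFN width are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class TPM(nn.Module):
    """Template features parallel enhancing module: two attention branches in parallel."""
    def __init__(self, dim: int = 96, heads: int = 4, ffn_mult: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)
        # branch 1: prior tokens enrich the spatial representation of the template tokens
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        # branch 2: key template information is distilled back into the prior tokens
        self.attn_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_t = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_mult * dim),
                                   nn.GELU(), nn.Linear(ffn_mult * dim, dim))
        self.ffn_p = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_mult * dim),
                                   nn.GELU(), nn.Linear(ffn_mult * dim, dim))

    def forward(self, template_tokens, prior_tokens):
        t, p = self.norm_t(template_tokens), self.norm_p(prior_tokens)
        # the two branches are independent, so they can be evaluated in parallel
        t_out, _ = self.attn_t(query=t, key=p, value=p)   # template attends to priors
        p_out, _ = self.attn_p(query=p, key=t, value=t)   # priors attend to template
        template_tokens = template_tokens + t_out
        prior_tokens = prior_tokens + p_out
        return (template_tokens + self.ffn_t(template_tokens),
                prior_tokens + self.ffn_p(prior_tokens))
```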
TPKM, as illustrated in Figure 2c, closely resembles the TPM module. TPKM consists of two multi-head attention layers and three feed-forward layers (FFN). Layer normalization is applied before the FFN and the attention layers. Parallel attention is utilized to engage with distinct dimensions of the sequence features, thereby intensifying the communication of fused feature information, as shown in (3), where $T$ and $S$ denote the feature sequences of the template and the search region, respectively. The fused features are represented by $x$, which is obtained by concatenating the online template/template tokens and the search tokens along the same dimension. To perform information interaction, we employ parallel position attention and channel attention.
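To make the "position plus channel" interaction concrete, a compact sketch is given below. It treats position attention as self-attention over the token dimension and channel attention as self-attention over the transposed, channel dimension; the single-head formulation and the additive merge of the two branches are assumptions for illustration:

```python
import torch
import torch.nn as nn

def scaled_attention(q, k, v):
    # generic scaled dot-product attention over the last two dimensions
    attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return attn.softmax(dim=-1) @ v

class ParallelPositionChannelAttention(nn.Module):
    """Core of a TPKM-style block: position and channel attention run in parallel."""
    def __init__(self, dim: int = 96):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, template_tokens, search_tokens):
        # fuse template and search tokens along the sequence (position) dimension
        x = torch.cat([template_tokens, search_tokens], dim=1)   # (B, Nt + Ns, C)
        x = self.norm(x)
        pos = scaled_attention(x, x, x)                          # token-to-token interaction
        xc = x.transpose(1, 2)                                   # (B, C, Nt + Ns)
        chn = scaled_attention(xc, xc, xc).transpose(1, 2)       # channel-to-channel interaction
        return self.proj(pos + chn)                              # merge the two parallel branches
```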
3.2. Detailed Structure of ParallelTracker
The proposed end-to-end ViT-based tracking network, ParallelTracker, embedded with PEM, TPM, and TPKM (as illustrated in Figure 2), consists of three parts: a backbone, a localization head, and a confidence prediction head. By integrating feature extraction and information interaction, our tracker achieves faster convergence speed. Moreover, it decouples the image content features from prior knowledge, without requiring any additional post-processing or integration modules.
Backbone. The backbone employs a step-by-step multi-block architecture strategy consisting of three distinct blocks. These blocks operate on feature maps that have been down-sampled to the same scale and have the same number of channels. Each block varies slightly in structure, incorporating overlapping patch embedding layers and a different number of PEMs, TPMs, or TPKMs.
In our approach, we input $T$ templates of size $128 \times 128 \times 3$ and a search region of size $320 \times 320 \times 3$ into the tracker. The $T$ templates comprise the template generated from the initial frame and the online templates selected by the confidence prediction head; in this study, $T$ is fixed. The search region is cropped according to the previous target state. To map the input images to tokens and increase the channel dimension while reducing the spatial resolution of the image features, we utilize the patch embedding layer, which comprises a convolutional layer, batch normalization, and ReLU. Subsequently, in the first block, the image features are converted through the embedding layer into template tokens and search tokens of shapes $\frac{H_z}{4} \times \frac{W_z}{4} \times C$ and $\frac{H_x}{4} \times \frac{W_x}{4} \times C$, respectively, where $C$ is 96, $H_z$ and $W_z$ are 128, and $H_x$ and $W_x$ are 320. The template tokens are then fed into the PEM to obtain the template prior knowledge.
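A small sketch of the patch embedding layer and the resulting token shapes for the stated input sizes is given below; the padding value is a plausible assumption chosen to produce the x4 spatial down-sampling, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Overlapping patch embedding: conv + batch norm + ReLU, spatial /4, channels -> 96."""
    def __init__(self, in_ch: int = 3, dim: int = 96):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.proj(x)                        # (B, 96, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # token sequence (B, H/4 * W/4, 96)

embed = PatchEmbed()
template_tokens = embed(torch.randn(1, 3, 128, 128))   # -> (1, 32*32, 96) = (1, 1024, 96)
search_tokens = embed(torch.randn(1, 3, 320, 320))     # -> (1, 80*80, 96) = (1, 6400, 96)
print(template_tokens.shape, search_tokens.shape)
```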
In the second block, the image features of the templates and the search region are further down-sampled by the embedding layer, and TPM is then employed to enhance the template token features. In the final block, the down-sampled features from the first two blocks are concatenated to generate the fused feature token sequence, and TPKM is employed to perform the information interaction. Before being fed into the localization head, the fused feature token sequence is first split into template tokens, online template tokens, and search tokens; only the search tokens are input to the localization head and the score head. The search tokens are then converted back into a spatial feature map of the search region. Here, $H_x$ and $W_x$ indicate the spatial dimensions of the search region, whereas $H_z$ and $W_z$ indicate the spatial dimensions of the template. Unlike other trackers [36,38], we do not employ a multi-scale feature aggregation strategy.
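The split of the fused token sequence before the heads can be sketched as follows; the x16 total down-sampling factor, the number of templates, and the token ordering (all template tokens first, then search tokens) are hypothetical assumptions for illustration:

```python
import torch

def split_and_reshape(fused, num_templates, hz, wz, hx, wx, dim):
    """Split fused tokens into template / online-template / search parts and
    reshape only the search tokens into a spatial map for the heads."""
    n_template = num_templates * hz * wz          # assumed ordering: template tokens first
    search_tokens = fused[:, n_template:, :]      # (B, hx*wx, C); template tokens are dropped here
    b = fused.shape[0]
    return search_tokens.transpose(1, 2).reshape(b, dim, hx, wx)

# e.g., with a hypothetical total stride of 16: template 128/16 = 8, search 320/16 = 20
fused = torch.randn(2, 3 * 8 * 8 + 20 * 20, 96)   # 3 templates (hypothetical T) + 1 search region
search_map = split_and_reshape(fused, num_templates=3, hz=8, wz=8, hx=20, wx=20, dim=96)
print(search_map.shape)   # torch.Size([2, 96, 20, 20])
```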
Head. The proposed tracking network employs a corner-based fully convolutional localization head for predicting the bounding box, which resembles the STARK method [19]. The localization head relies exclusively on a simple convolutional network to predict the upper-left and lower-right corners, after which it computes the expected probability distribution of the corners to predict the bounding box. Using the localization head to estimate the bounding box constitutes Stage 1 of our tracking system. The effectiveness of tracking is significantly impacted by low-quality online templates. In view of this, we draw inspiration from the scoring modules in contemporary trackers [19,32,36] and propose a parallel scoring prediction head (PSH) for selecting reliable online templates based on predicted confidence scores, as shown in Figure 3.
The PSH is composed of three attention blocks, including two position attentions and one channel attention, and a three-layer perceptron. Initially, a learnable score token is used as a query to attend to the search ROI tokens, allowing the score token to encode the extracted target information. The attended score token then compares the extracted target with the tracked target at all positions of the initial target, and the scores are generated using MLP layers and a sigmoid activation. Any online template with a predicted score below 0.5 is deemed negative. The utilization of PSH to derive scores constitutes Stage 2 of our tracking process.
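A hedged sketch of this scoring mechanism is shown below. For brevity, only the two position-attention blocks are modeled and the channel attention is omitted; the head count, MLP widths, and ROI-token extraction are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """PSH-style confidence head: a learnable score token attends to search ROI tokens,
    then a three-layer MLP with sigmoid yields a reliability score in [0, 1]."""
    def __init__(self, dim: int = 96, heads: int = 4):
        super().__init__()
        self.score_token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)  # position attention
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)  # position attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, roi_tokens):
        # roi_tokens: (B, N, C) tokens cropped around the predicted target
        q = self.score_token.expand(roi_tokens.shape[0], -1, -1)
        q, _ = self.attn1(q, roi_tokens, roi_tokens)   # encode target evidence into the token
        q, _ = self.attn2(q, roi_tokens, roi_tokens)
        return self.mlp(q.squeeze(1))                  # per-sample confidence score

# usage: a candidate online template is kept only when its score exceeds 0.5
# score = ScoreHead()(roi_tokens); update = score > 0.5
```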
3.3. Training and Inference
Training. The training procedure of ParallelTracker generally adheres to the standard training protocol employed by current trackers [19]. We begin by pre-training the backbone network of ParallelTracker using the Dual_VIT model. Subsequently, we fine-tune the tracking network using labeled tracking datasets. We employ a combination of L1 loss and GIoU [44] loss as the objective function:

$$L_{loc} = \lambda_{GIoU}\, L_{GIoU}(B, \hat{B}) + \lambda_{L_1}\, L_1(B, \hat{B}),$$

where $\lambda_{GIoU}$ and $\lambda_{L_1}$ are the weight values of the two losses, $B$ is the ground-truth bounding box of the target, and $\hat{B}$ is the predicted regression box.
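This combined objective can be sketched with torchvision's generalized IoU loss as follows; the weight values used here are placeholders, not the paper's settings:

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def localization_loss(pred_boxes, gt_boxes, lambda_giou=2.0, lambda_l1=5.0):
    """Weighted sum of GIoU and L1 losses; boxes are (x1, y1, x2, y2)."""
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    return lambda_giou * giou + lambda_l1 * l1
```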
Template online update. The template is an image patch of size $128 \times 128 \times 3$ cropped from the initial frame of a video. The online templates are image patches of the same size, but they are dynamically selected from the current frames by PSH. Online templates play a crucial role in extracting target-related information from the time series. Nevertheless, tracking may encounter issues such as motion blur or objects going beyond the field of view, which decrease the similarity between online templates and the target. To address this issue, we propose the parallel scoring prediction head (PSH), as illustrated in Figure 3, to select reliable online templates using confidence scores. We train PSH separately using the standard cross-entropy loss:

$$L_{score} = -\sum_{i}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right],$$

where $y_i$ is the ground-truth label, $p_i$ is the predicted confidence score, and $i$ indexes the samples.
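Training the score head with this binary cross-entropy and applying the 0.5 acceptance threshold at inference can be sketched as follows; the labeling rule (label 1 when the candidate online template actually contains the target) is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def score_loss(pred_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over predicted template-reliability scores in [0, 1]."""
    return F.binary_cross_entropy(pred_scores.squeeze(-1), labels.float())

def should_update_template(score: torch.Tensor, threshold: float = 0.5) -> bool:
    # accept a candidate online template only when its predicted score exceeds 0.5
    return bool(score.item() > threshold)
```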
5. Conclusions
We have developed a new end-to-end tracking algorithm, called ParallelTracker, which incorporates prior knowledge and parallel attention mechanisms to integrate image priors with the feature extraction and interaction processes. Specifically, the PEM module addresses the lack of spatial prior information in the VIT-based tracker and enhances efficiency; the TPM module leverages the prior information in the template image to extract object-oriented and discriminative features in searched frames; and the TPKM module facilitates information exchange between templates and search regions. In addition, PSH incorporates an object template update strategy to enhance the tracker's ability to accommodate changes in object shape and occlusions.
Experimental results demonstrate that ParallelTracker outperforms state-of-the-art algorithms on UAV videos. ParallelTracker also maintains accuracy comparable to the latest methods in close-range video tracking scenarios, demonstrating its strong generalization ability. Moreover, ParallelTracker significantly reduces the number of training epochs that popular methods require to converge, which notably lowers the convergence difficulty of the VIT-based tracker.